NVIDIA research team has recently released a new open-source AI framework called Polar. This framework aims to help existing agent frameworks (such as Codex, Claude Code, and Qwen Code) integrate a training method called Generalized Relative Policy Optimization (GRPO), without affecting their original tool calling, context organization, and patch submission methods. This innovation will greatly enhance the performance of code agents.

image.png

GRPO is an optimization technique for reinforcement learning that adjusts model policies through reward signals, helping models learn better behaviors in multi-step decision-making tasks. In this study, GRPO is mainly used for training code agents, aiming to continuously improve the model's performance in actual tool calling and patch submission processes.

Research shows that the reinforcement learning of agents is gradually shifting from single-step tasks to more complex long-process tasks, such as working with code repositories, browser operations, and interactions with operating systems. These tasks often rely on existing execution frameworks, involving multiple rounds of calls, tool usage, and context management. Therefore, directly rewriting these frameworks into traditional reinforcement learning environment interfaces is very difficult and may lead to the loss of critical training signals.

The NVIDIA Polar framework does not attempt to rewrite agent frameworks but instead places agents at the boundary of the model API, keeping the original operational logic unchanged. Polar acts as a model agent between the execution framework and the reasoning server, supporting various request styles, recording key data, and converting it into information usable for training.

From a system architecture perspective, Polar includes functions such as task submission, session scheduling, and state persistence. By optimizing the initialization, execution, and post-processing workflows, it significantly improves training efficiency. According to experimental results, agents trained using Polar and GRPO have shown significant performance improvements in the SWE-Bench Verified test, with Codex's pass@1 score increasing from 3.8% to 26.4%, a growth of 594.74%.

In addition, the framework also demonstrates excellent efficiency, reducing training time by approximately 5.39 times and significantly improving the average GPU utilization, providing stronger support for future agent training.

Key Points:   

🛠️ NVIDIA has released the open-source AI framework Polar, helping frameworks like Codex to adopt new training methods.  

📈 The performance of Codex has significantly improved in the latest tests, with a 594.74% increase in the pass@1 score.  

⚙️ Polar optimizes training efficiency, significantly reducing training time and resource consumption.