When `num_iterations=1` (and `steps_per_generation <= gradient_accumulation_steps`), the GRPO trainer just sets `old_per_token_logps` to `per_token_logps.detach()`, so the ratio is always 1. I understand this means we don't have to keep the weights from the previous step in memory, but other than that, why do we do this and why does it work?
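As a minimal, self-contained sketch of what that does (illustrative names such as `clipped_token_loss` and toy tensors, not the actual TRL code): when `old_per_token_logps` is just `per_token_logps.detach()`, the clipped surrogate evaluates to a ratio of exactly 1, yet the backward pass still yields the ordinary policy-gradient update, because the detached term behaves like a constant.

```python
import torch

# Minimal illustrative sketch, not the TRL implementation.
def clipped_token_loss(per_token_logps, old_per_token_logps, advantages, eps=0.2):
    # PPO/GRPO-style clipped surrogate, computed per token.
    ratio = torch.exp(per_token_logps - old_per_token_logps)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped)

# Toy stand-ins for the current policy's log-probs of the sampled tokens.
per_token_logps = torch.tensor([-1.2, -0.7, -2.3], requires_grad=True)
advantages = torch.tensor([0.5, 0.5, 0.5])

# Single optimization pass over the batch: the "old" policy is the current one,
# so its log-probs are the same tensor, detached from the graph.
old_per_token_logps = per_token_logps.detach()

loss = clipped_token_loss(per_token_logps, old_per_token_logps, advantages).mean()
loss.backward()

# The ratio is 1 everywhere and clipping never triggers, but the gradient is not
# zero: d/dx exp(x - stop_grad(x)) = 1 at this point, so the gradient matches that
# of -advantage * per_token_logps, i.e. the plain policy-gradient estimator.
print(per_token_logps.grad)  # tensor([-0.1667, -0.1667, -0.1667])
```

So detaching saves recomputing (or storing) log-probs that would be numerically identical anyway, without changing the gradient.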
Replies: 1 comment
Is this a case where π_old is a little ambiguous? As in, π_old in PPO is just the policy used to collect the data, not the policy weights from the previous optimizer step. So the clipping only keeps you from diverging too much from the policy that generated the data, and when you do a single optimization step on that data, the policy has not moved yet and you never need the policy ratio in the first place?
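To illustrate that reading of π_old, here is a small sketch under assumed toy shapes and names (a 5-token vocabulary, hand-picked advantages), not TRL code: the old log-probs are frozen once per generation batch, so the ratio is exactly 1 on the first optimization step, and it only drifts away from 1 (making the clipping do real work) on later steps over the same data.

```python
import torch

# Toy sketch: pi_old is the data-collecting policy, frozen once per generation batch.
torch.manual_seed(0)
logits = torch.randn(3, 5, requires_grad=True)   # 3 token positions, vocab of 5
actions = torch.tensor([0, 2, 1])                # tokens sampled under pi_old
advantages = torch.tensor([1.0, -0.5, 0.3])      # hand-picked stand-ins for GRPO advantages
opt = torch.optim.SGD([logits], lr=1.0)

def per_token_logps(logits, actions):
    # Log-probability of each sampled token under the current policy.
    return torch.log_softmax(logits, dim=-1).gather(1, actions[:, None]).squeeze(1)

# Computed once, with the same weights that generated the data.
old_logps = per_token_logps(logits, actions).detach()

for step in range(3):
    ratio = torch.exp(per_token_logps(logits, actions) - old_logps)
    surrogate = torch.min(ratio * advantages, torch.clamp(ratio, 0.8, 1.2) * advantages)
    loss = -surrogate.mean()
    print(step, ratio.detach())  # step 0: tensor([1., 1., 1.]); later steps drift away from 1
    opt.zero_grad()
    loss.backward()
    opt.step()
```

With a single step per batch you only ever see the step-0 case, which is exactly when the trainer can substitute `per_token_logps.detach()` for the stored old log-probs.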