When `num_iterations=1` (and `steps_per_generation <= gradient_accumulation_steps`), the GRPO trainer just sets `old_per_token_logps` to `per_token_logps.detach()`, so the ratio is always 1. I understand this means we don't have to keep the weights from the previous step in memory, but other than that, why do we do this and why does it work?
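As a minimal, self-contained sketch of what that does (illustrative names such as `clipped_token_loss` and toy tensors, not the actual TRL code): when `old_per_token_logps` is just `per_token_logps.detach()`, the clipped surrogate evaluates to a ratio of exactly 1, yet the backward pass still yields the ordinary policy-gradient update, because the detached term behaves like a constant.

```python
import torch

# Minimal illustrative sketch, not the TRL implementation.
def clipped_token_loss(per_token_logps, old_per_token_logps, advantages, eps=0.2):
    # PPO/GRPO-style clipped surrogate, computed per token.
    ratio = torch.exp(per_token_logps - old_per_token_logps)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped)

# Toy stand-ins for the current policy's log-probs of the sampled tokens.
per_token_logps = torch.tensor([-1.2, -0.7, -2.3], requires_grad=True)
advantages = torch.tensor([0.5, 0.5, 0.5])

# Single optimization pass over the batch: the "old" policy is the current one,
# so its log-probs are the same tensor, detached from the graph.
old_per_token_logps = per_token_logps.detach()

loss = clipped_token_loss(per_token_logps, old_per_token_logps, advantages).mean()
loss.backward()

# The ratio is 1 everywhere and clipping never triggers, but the gradient is not
# zero: d/dx exp(x - stop_grad(x)) = 1 at this point, so the gradient matches that
# of -advantage * per_token_logps, i.e. the plain policy-gradient estimator.
print(per_token_logps.grad)  # tensor([-0.1667, -0.1667, -0.1667])
```

So detaching saves recomputing (or storing) log-probs that would be numerically identical anyway, without changing the gradient.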
Replies: 1 comment
Is this a case where π_old is a little ambiguous? As in, π_old in PPO is just the policy used to collect the data, not the policy weights from the previous optimizer step. So the clipping only keeps you from diverging too much from the policy that generated the data, and when you do a single optimization step on that data, the policy has not moved yet and you never need the policy ratio in the first place?
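To illustrate that reading of π_old, here is a small sketch under assumed toy shapes and names (a 5-token vocabulary, hand-picked advantages), not TRL code: the old log-probs are frozen once per generation batch, so the ratio is exactly 1 on the first optimization step, and it only drifts away from 1 (making the clipping do real work) on later steps over the same data.

```python
import torch

# Toy sketch: pi_old is the data-collecting policy, frozen once per generation batch.
torch.manual_seed(0)
logits = torch.randn(3, 5, requires_grad=True)   # 3 token positions, vocab of 5
actions = torch.tensor([0, 2, 1])                # tokens sampled under pi_old
advantages = torch.tensor([1.0, -0.5, 0.3])      # hand-picked stand-ins for GRPO advantages
opt = torch.optim.SGD([logits], lr=1.0)

def per_token_logps(logits, actions):
    # Log-probability of each sampled token under the current policy.
    return torch.log_softmax(logits, dim=-1).gather(1, actions[:, None]).squeeze(1)

# Computed once, with the same weights that generated the data.
old_logps = per_token_logps(logits, actions).detach()

for step in range(3):
    ratio = torch.exp(per_token_logps(logits, actions) - old_logps)
    surrogate = torch.min(ratio * advantages, torch.clamp(ratio, 0.8, 1.2) * advantages)
    loss = -surrogate.mean()
    print(step, ratio.detach())  # step 0: tensor([1., 1., 1.]); later steps drift away from 1
    opt.zero_grad()
    loss.backward()
    opt.step()
```

With a single step per batch you only ever see the step-0 case, which is exactly when the trainer can substitute `per_token_logps.detach()` for the stored old log-probs.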