I’d like to request a feature to support splitting an n-turn trajectory into n training samples, as described in the Kevin paper, as a way to improve sample efficiency.
As described in the paper:
In each multi-turn training step:
- For each task, we sample m parallel trajectories with n refinement turns. To improve sample efficiency, each refinement turn (CoT + response) in a trajectory becomes a single training sample. The response of the model after the CoT consists of a kernel and a CoT summary.
- We construct the context of a sample by including the history of previous responses, which include generated kernels along with their summarized CoTs, and evaluation feedback.
- We evaluate the generated kernel and compute its score as shown in Section 3.2. The reward of each turn (CoT + response) is the discounted sum of current and subsequent scores, which we elaborate in Section 4.3.
- For each task, we normalize the rewards across the mn samples for advantage calculation. Then we compute the GRPO loss over the entire batch.
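The steps above can be sketched in code. This is a minimal illustration, not the paper's implementation: the `Turn` container, the discount factor value, and the function names are all assumptions made for the example; the discounted per-turn reward and the per-task reward normalization follow the description quoted above.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    context: str   # history of prior responses (summarized CoTs, kernels, feedback)
    response: str  # this turn's CoT + response (kernel and CoT summary)
    score: float   # evaluation score of the generated kernel (paper's Section 3.2)

def split_trajectory(turns, gamma=0.5):
    """Turn one n-turn trajectory into n training samples.

    The reward of turn t is the discounted sum of its own score and
    all subsequent scores (paper's Section 4.3). gamma here is an
    illustrative value, not one from the paper.
    """
    samples = []
    for t, turn in enumerate(turns):
        reward = sum(gamma ** (k - t) * turns[k].score
                     for k in range(t, len(turns)))
        samples.append({"context": turn.context,
                        "response": turn.response,
                        "reward": reward})
    return samples

def normalize_advantages(samples):
    """Normalize rewards across all m*n samples of one task (GRPO-style)."""
    rewards = [s["reward"] for s in samples]
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    for s in samples:
        s["advantage"] = (s["reward"] - mean) / std
    return samples
```

With m parallel trajectories, the samples from all of them would be pooled per task before `normalize_advantages`, and the GRPO loss computed over the whole batch.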
This feature could also be used to summarize previous steps of the trajectory, allowing better management of context size: the LLM would only need to see the env response of the previous step, with earlier turns represented by their summaries.
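A rough sketch of that context-management idea, under my own assumptions about the history format (the field names and the `keep_last_env_only` switch are hypothetical, not from the paper):

```python
def build_context(task_prompt, history, keep_last_env_only=True):
    """Assemble the prompt for the next refinement turn.

    history: list of dicts with 'cot_summary', 'kernel', 'env_feedback'.
    Older turns contribute only their summarized CoT and kernel; full
    environment feedback is kept just for the most recent turn, which
    bounds context growth across refinement turns.
    """
    parts = [task_prompt]
    for i, h in enumerate(history):
        parts.append(f"Turn {i + 1} summary: {h['cot_summary']}")
        parts.append(f"Turn {i + 1} kernel:\n{h['kernel']}")
        if not keep_last_env_only or i == len(history) - 1:
            parts.append(f"Turn {i + 1} feedback: {h['env_feedback']}")
    return "\n\n".join(parts)
```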
franciscogaluppo