Support getting each turn in a multi-turn task as a single training sample #170

@LuanBrt

I’d like to request a feature that splits an n-turn trajectory into n training samples, as described in the Kevin paper, to improve sample efficiency.

As described in the paper:
In each multi-turn training step:

  1. For each task, we sample m parallel trajectories with n refinement turns. To improve sample efficiency, each refinement turn (CoT + response) in a trajectory becomes a single training sample. The response of the model after the CoT consists of a kernel and a CoT summary.
  2. We construct the context of a sample by including the history of previous responses, which include generated kernels along with their summarized CoTs, and evaluation feedback.
  3. We evaluate the generated kernel and compute its score as shown in Section 3.2. The reward of each turn (CoT + response) is the discounted sum of current and subsequent scores, which we elaborate in Section 4.3.
  4. For each task, we normalize the rewards across the mn samples for advantage calculation. Then we compute the GRPO loss over the entire batch.
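
To make the request concrete, here is a minimal sketch of how the four steps above might look for one task. It is not an existing API of this repo: `Turn`, `Sample`, `discounted_returns`, and `split_trajectories` are hypothetical names, the `gamma` value is a placeholder rather than the paper's setting, the context format is invented for illustration, and the mean/std normalization is the usual GRPO-style choice, assumed here.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Turn:
    response: str   # kernel + CoT summary (hypothetical field layout)
    feedback: str   # evaluation feedback from the environment
    score: float    # per-turn score from the evaluator


@dataclass
class Sample:
    context: str
    response: str
    reward: float
    advantage: float = 0.0


def discounted_returns(scores: List[float], gamma: float) -> List[float]:
    """Reward of turn t = score_t + gamma * score_{t+1} + gamma^2 * score_{t+2} + ..."""
    returns: List[float] = []
    running = 0.0
    for score in reversed(scores):
        running = score + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def split_trajectories(task_prompt: str, trajectories: List[List[Turn]],
                       gamma: float = 0.5, eps: float = 1e-6) -> List[Sample]:
    """Turn m trajectories of n turns (one task) into m*n training samples."""
    samples: List[Sample] = []
    for trajectory in trajectories:
        returns = discounted_returns([t.score for t in trajectory], gamma)
        history: List[str] = []
        for turn, ret in zip(trajectory, returns):
            # Context = task prompt + previous responses and their feedback.
            context = "\n".join([task_prompt] + history)
            samples.append(Sample(context, turn.response, ret))
            history.append(f"{turn.response}\n[feedback] {turn.feedback}")
    # Normalize rewards across all m*n samples of this task (assumed GRPO-style).
    rewards = [s.reward for s in samples]
    mu, sigma = mean(rewards), pstdev(rewards)
    for s in samples:
        s.advantage = (s.reward - mu) / (sigma + eps)
    return samples
```

Each returned `Sample` could then be fed to the GRPO loss as an independent (context, response, advantage) triple, which is the behavior this issue asks for.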

This feature could also help manage context size: because each turn is its own sample, earlier steps of the trajectory could be summarized, so that the LLM sees only the env response of the previous step rather than the full history.
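
As a rough illustration of that context-management idea, the sketch below builds the prompt for the next turn from summaries of earlier responses plus only the most recent env response. `build_turn_context` and the `summarize` callback are hypothetical names, and the bracketed tags are an invented format, not something the library defines.

```python
from typing import Callable, List, Tuple


def build_turn_context(task_prompt: str,
                       history: List[Tuple[str, str]],
                       summarize: Callable[[str], str]) -> str:
    """Build the context for the next turn.

    history holds (response, env_response) pairs for earlier turns, oldest first.
    Earlier responses are kept only as summaries; only the latest env response
    is included verbatim.
    """
    parts = [task_prompt]
    for index, (response, _env_response) in enumerate(history, start=1):
        parts.append(f"[turn {index} summary] {summarize(response)}")
    if history:
        _, last_env_response = history[-1]
        parts.append(f"[turn {len(history)} env response] {last_env_response}")
    return "\n".join(parts)
```

With this shape, context length grows with the size of the summaries rather than with the full responses and feedback of every earlier turn.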
