I’d like to request a feature to support splitting an n-turn trajectory into n training samples, as described in the Kevin paper, as a way to improve sample efficiency.
As described in the paper:
In each multi-turn training step:
- For each task, we sample m parallel trajectories with n refinement turns. To improve sample efficiency, each refinement turn (CoT + response) in a trajectory becomes a single training sample. The response of the model after the CoT consists of a kernel and a CoT summary.
- We construct the context of a sample by including the history of previous responses, which include generated kernels along with their summarized CoTs, and evaluation feedback.
- We evaluate the generated kernel and compute its score as shown in Section 3.2. The reward of each turn (CoT + response) is the discounted sum of current and subsequent scores, which we elaborate in Section 4.3.
- For each task, we normalize the rewards across the mn samples for advantage calculation. Then we compute the GRPO loss over the entire batch.
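The steps above can be sketched in code. This is a minimal illustration, not the paper's implementation: the `Turn` container, the discount factor value, and the function names are all assumptions made for the example; the discounted per-turn reward and the per-task reward normalization follow the description quoted above.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    context: str   # history of prior responses (summarized CoTs, kernels, feedback)
    response: str  # this turn's CoT + response (kernel and CoT summary)
    score: float   # evaluation score of the generated kernel (paper's Section 3.2)

def split_trajectory(turns, gamma=0.5):
    """Turn one n-turn trajectory into n training samples.

    The reward of turn t is the discounted sum of its own score and
    all subsequent scores (paper's Section 4.3). gamma here is an
    illustrative value, not one from the paper.
    """
    samples = []
    for t, turn in enumerate(turns):
        reward = sum(gamma ** (k - t) * turns[k].score
                     for k in range(t, len(turns)))
        samples.append({"context": turn.context,
                        "response": turn.response,
                        "reward": reward})
    return samples

def normalize_advantages(samples):
    """Normalize rewards across all m*n samples of one task (GRPO-style)."""
    rewards = [s["reward"] for s in samples]
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    for s in samples:
        s["advantage"] = (s["reward"] - mean) / std
    return samples
```

With m parallel trajectories, the samples from all of them would be pooled per task before `normalize_advantages`, and the GRPO loss computed over the whole batch.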
This feature could also be used to summarize previous steps of the trajectory, allowing better management of context size: the LLM would only need to see the env response of the previous step, with earlier turns represented by their summaries.
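A rough sketch of that context-management idea, under my own assumptions about the history format (the field names and the `keep_last_env_only` switch are hypothetical, not from the paper):

```python
def build_context(task_prompt, history, keep_last_env_only=True):
    """Assemble the prompt for the next refinement turn.

    history: list of dicts with 'cot_summary', 'kernel', 'env_feedback'.
    Older turns contribute only their summarized CoT and kernel; full
    environment feedback is kept just for the most recent turn, which
    bounds context growth across refinement turns.
    """
    parts = [task_prompt]
    for i, h in enumerate(history):
        parts.append(f"Turn {i + 1} summary: {h['cot_summary']}")
        parts.append(f"Turn {i + 1} kernel:\n{h['kernel']}")
        if not keep_last_env_only or i == len(history) - 1:
            parts.append(f"Turn {i + 1} feedback: {h['env_feedback']}")
    return "\n\n".join(parts)
```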
franciscogaluppo