How can I train on a dynamically generated dataset (based on the model’s own outputs) using GRPO Trainer? #3542

xtitx0327 · 2025-06-05T08:31:29Z

I’m new to RL and deep learning, so my question might seem simple. I would greatly appreciate any advice!

I’m using an LLM to solve a multi-step sequence generation task. At each step:

The model receives (a) a system prompt and (b) the current sequence generated so far.
The model must choose one of three actions:

Append something new to the sequence
Remove/modify something already in the current sequence
Terminate the generation process
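
To make the step structure concrete, here is a minimal sketch of the state and the three actions in plain Python. The names `Action`, `apply_action`, and `build_prompt` are hypothetical and not part of any library, and treating "remove" as an edit with no replacement item is an assumption about how the task could be encoded.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Action:
    """One model decision (hypothetical encoding of the three actions)."""
    kind: str                    # "append", "edit", or "terminate"
    item: Optional[str] = None   # new item for "append"; replacement for "edit"
    index: Optional[int] = None  # position targeted by "edit"


def apply_action(sequence: List[str], action: Action) -> Tuple[List[str], bool]:
    """Return (new_sequence, done) without mutating the input sequence."""
    seq = list(sequence)
    if action.kind == "append":
        if action.item is None:
            raise ValueError("append needs an item")
        seq.append(action.item)
        return seq, False
    if action.kind == "edit":
        if action.index is None or not 0 <= action.index < len(seq):
            raise ValueError("edit needs a valid index")
        if action.item is None:
            del seq[action.index]          # edit without an item removes the element
        else:
            seq[action.index] = action.item  # edit with an item modifies the element
        return seq, False
    if action.kind == "terminate":
        return seq, True
    raise ValueError(f"unknown action kind: {action.kind!r}")


def build_prompt(system_prompt: str, sequence: List[str]) -> str:
    """Combine the system prompt with the sequence generated so far."""
    current = "\n".join(sequence) if sequence else "(empty)"
    return f"{system_prompt}\n\nCurrent sequence:\n{current}"
```

Each call to `apply_action` yields the next state, which `build_prompt` turns into the next training prompt. That dependence of each prompt on earlier outputs is what makes the data depend on the model itself.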

I have implemented multiple reward functions to evaluate each action (append/remove/modify/terminate). Thus, the RL loop is:

Model generates an action.
Reward functions evaluate that action and assign a reward.
Model updates itself based on these rewards.
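
The loop above can be sketched without any training library. GRPO samples a group of completions for the same prompt and scores each one relative to the rest of its group. The reward functions below (`format_reward`, `brevity_reward`), the action keywords, and the `(prompt, completion) -> float` signature are illustrative assumptions, not my actual rewards. The sketch uses the population standard deviation, and a real implementation may normalize differently.

```python
import statistics
from typing import Callable, List, Sequence

# Hypothetical signature: (prompt, completion) -> scalar reward.
RewardFn = Callable[[str, str], float]

VALID_ACTIONS = ("APPEND", "EDIT", "TERMINATE")  # assumed action keywords


def format_reward(prompt: str, completion: str) -> float:
    """Reward 1.0 if the completion starts with a known action keyword."""
    return 1.0 if completion.strip().upper().startswith(VALID_ACTIONS) else 0.0


def brevity_reward(prompt: str, completion: str) -> float:
    """Toy reward: a small per-character penalty so shorter actions score higher."""
    return -0.01 * len(completion)


def total_rewards(
    prompt: str, completions: Sequence[str], reward_fns: Sequence[RewardFn]
) -> List[float]:
    """Sum every reward function for each sampled completion."""
    return [sum(fn(prompt, c) for fn in reward_fns) for c in completions]


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-4) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    prompt = "system prompt + current sequence"
    group = ["APPEND foo", "TERMINATE", "I am not sure what to do here"]
    rewards = total_rewards(prompt, group, [format_reward, brevity_reward])
    for completion, adv in zip(group, group_relative_advantages(rewards)):
        print(f"{adv:+.3f}  {completion}")
```

If every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal.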

Conceptually, this fits the GRPO training loop. The problem is that my training data is not a fixed (“static”) dataset; instead, it is generated on the fly from the model’s own past outputs. According to #3213, the current GRPO Trainer does not support IterableDataset.

Question: What’s the recommended way to handle a dynamically generated dataset with GRPO Trainer? Is there a workaround, or do I need to implement a custom training loop? Thank you for any pointers!
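
For reference, one workaround I can imagine is to alternate between two phases: freeze the current states into a static snapshot, train on it, then roll the updated model forward to produce the next snapshot. Below is a minimal sketch of that outer loop. `run_rounds`, `rollout_fn`, and `train_fn` are hypothetical names, and `train_fn` stands in for whatever trains on a fixed dataset. I have not confirmed that this pattern fits GRPO Trainer.

```python
from typing import Callable, List


def run_rounds(
    initial_states: List[str],
    rollout_fn: Callable[[List[str]], List[str]],
    train_fn: Callable[[List[str]], None],
    num_rounds: int,
) -> List[List[str]]:
    """Alternate between training on a frozen snapshot and regenerating it.

    rollout_fn: maps the states of one round to the states of the next round,
                using the current model (hypothetical placeholder).
    train_fn:   trains on one fixed list of states (hypothetical placeholder).
    Returns the snapshot used in each round.
    """
    states = list(initial_states)
    history: List[List[str]] = []
    for _ in range(num_rounds):
        snapshot = list(states)        # static dataset for this round
        train_fn(snapshot)             # update the model on fixed data
        history.append(snapshot)
        states = rollout_fn(snapshot)  # next states come from the updated model
        if not states:                 # every episode has terminated
            break
    return history
```

The trade-off is that the data within a round is slightly stale, since it was produced before that round's updates. Shorter rounds reduce the staleness at the cost of more trainer restarts.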
