@wensun wensun commented Aug 27, 2025

(1) Added the sequence-level RL algorithm Squared Mirror Descent (SMD), as used by Kimi K2 and Kimi k1.5.

Two sample runs with different hyperparameters:

single-controller-hackathon-smd-B73Na5
single-controller-hackathon-smd-qBJppK

Both runs reach 84% on MATH-500 and 78% on MATH hard.

(2) Added decoupled GRPO (importance weighting using logp from the vLLM engine) and the importance-weighted SMD. MLflow runs
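For readers new to the objective, here is a minimal sketch of a sequence-level squared loss in the style described in the Kimi k1.5 report, with an optional per-sequence importance weight. It is an illustration under stated assumptions, not the code in this PR: the function names, the `tau` and `clip` parameters, the use of the mean reward as the baseline, and the truncated-ratio form of the weights are all hypothetical choices made for the example.

```python
import math


def smd_loss(rewards, logp_new, logp_old, tau=1.0, weights=None):
    """Sequence-level squared loss for one prompt with k sampled responses.

    rewards:  scalar reward per response
    logp_new: summed token log-probs under the policy being trained
    logp_old: summed token log-probs under the reference (previous) policy
    weights:  optional per-sequence importance weights (assumed constants)
    """
    k = len(rewards)
    # Assumption: the mean reward of the group serves as the baseline.
    baseline = sum(rewards) / k
    if weights is None:
        weights = [1.0] * k
    total = 0.0
    for r, lp_new, lp_old, w in zip(rewards, logp_new, logp_old, weights):
        # Residual between the centered reward and the scaled log-ratio.
        residual = (r - baseline) - tau * (lp_new - lp_old)
        total += w * residual ** 2
    return total / k


def importance_weights(logp_train, logp_sampler, clip=2.0):
    """Truncated per-sequence ratio exp(logp_train - logp_sampler).

    Assumption: logp_sampler comes from the inference engine that generated
    the responses, and logp_train from the trainer's copy of the same policy.
    """
    return [min(math.exp(a - b), clip) for a, b in zip(logp_train, logp_sampler)]


# Two responses to one prompt, policy unchanged from the reference.
loss = smd_loss([1.0, 0.0], [-1.0, -1.0], [-1.0, -1.0])
```

With no policy change the log-ratio term vanishes, so the loss reduces to the mean squared centered reward. In a real trainer the log-probs would be differentiable tensors and the weights would be detached from the graph.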

@bowenyang008 bowenyang008 left a comment

LGTM with some minor comments. It may be worth having another algo expert (@jdchang1 or Jailu) cross-check it.

@jdchang1 jdchang1 left a comment

LGTM as well! Thanks for the PR

@wensun wensun merged commit e9ba16f into single-controller-hackathon Sep 3, 2025
3 participants