Single controller hackathon smd #149

wensun · 2025-08-27T20:45:52Z

(1) Add the sequence-level RL algorithm Squared Mirror Descent (SMD), used by Kimi 2 and Kimi 1.5.

Two sample runs with different hyperparameters:

single-controller-hackathon-smd-B73Na5
single-controller-hackathon-smd-qBJppK

Both runs reach 84% on MATH-500 and 78% on MATH hard.
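For readers unfamiliar with the sequence-level update in (1), here is a minimal sketch of one squared-loss form of mirror descent, modeled on the surrogate described in the Kimi 1.5 technical report: each response's centered reward is regressed onto `tau` times the sequence-level log-ratio between the current policy and the sampling policy. The function name `smd_loss`, the argument names, and the default `tau` are hypothetical and are not taken from this PR; the code in `compose_rl` may differ in the baseline, normalization, or masking.

```python
def smd_loss(rewards, logp_new, logp_old, tau=0.1):
    """Squared-loss surrogate for one prompt's group of sampled responses.

    rewards:  scalar reward per sampled response
    logp_new: sequence-level log-prob (sum over tokens) under the current policy
    logp_old: sequence-level log-prob under the policy that sampled the responses
    tau:      regularization strength (illustrative default, an assumption)
    """
    n = len(rewards)
    if n == 0 or len(logp_new) != n or len(logp_old) != n:
        raise ValueError("inputs must be non-empty and the same length")

    # The group mean reward stands in for the log-partition term tau * log Z.
    baseline = sum(rewards) / n

    # Residual between the centered reward and the scaled sequence log-ratio.
    residuals = [
        r - baseline - tau * (ln - lo)
        for r, ln, lo in zip(rewards, logp_new, logp_old)
    ]
    return sum(res * res for res in residuals) / n


# Before any update the log-ratio is zero, so the loss is the reward variance.
loss_at_start = smd_loss([1.0, 0.0], [-3.0, -4.0], [-3.0, -4.0])
```

The loss is zero exactly when the log-ratio of every response equals its centered reward divided by `tau`, which is why a smaller `tau` permits a larger move away from the sampling policy.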

(2) Add decoupled GRPO (importance weighting using logp from the vLLM engine) and importance-weighted SMD. MLflow runs
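A rough sketch of what the decoupled importance weighting in (2) might look like at the token level, assuming the usual decoupled-PPO structure: the clipping ratio compares the current policy to a proximal policy, and a separate weight corrects for the gap between the proximal policy and the engine that generated the samples. All names here (`decoupled_clipped_loss`, `logp_behav`, `iw_cap`) and the truncation of the weight are assumptions for illustration, not the PR's actual code.

```python
import math


def decoupled_clipped_loss(logp_new, logp_prox, logp_behav, advantages,
                           clip_eps=0.2, iw_cap=2.0):
    """Clipped surrogate with a separate behavior-policy importance weight.

    logp_new:   per-token log-probs under the policy being trained
    logp_prox:  per-token log-probs under the proximal (pre-update) policy
    logp_behav: per-token log-probs reported by the generation engine
    advantages: per-token advantage estimates
    iw_cap:     upper truncation for the importance weight (an assumption)
    """
    n = len(advantages)
    if n == 0 or not (len(logp_new) == len(logp_prox) == len(logp_behav) == n):
        raise ValueError("inputs must be non-empty and the same length")

    total = 0.0
    for ln, lp, lb, adv in zip(logp_new, logp_prox, logp_behav, advantages):
        # Correct for the mismatch between the trainer and the generation engine.
        iw = min(math.exp(lp - lb), iw_cap)
        # Standard clipped ratio, measured against the proximal policy.
        ratio = math.exp(ln - lp)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        total += iw * min(ratio * adv, clipped * adv)

    # Negate so that minimizing the loss maximizes the surrogate objective.
    return -total / n


# When all three log-probs agree, this reduces to the plain clipped objective.
on_policy = decoupled_clipped_loss([-1.0], [-1.0], [-1.0], [1.0])
```

Keeping the two ratios separate means the clip range only constrains how far the update moves from the proximal policy, while numerical differences between the trainer and the engine are handled by the weight alone.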

compose_rl/algorithms/online/generation_utils/generation_utils.py

compose_rl/algorithms/online/model_methods.py

test_single_controller_ppo.py

compose_rl/utils/ray_utils.py

bowenyang008

LGTM with some minor comments. Maybe worth having another algo expert (@jdchang1 or Jailu) cross-check it?

jdchang1

LGTM as well! Thanks for the PR

wensun added 30 commits August 25, 2025 20:20

.

c5f4b5c

.

60f016a

.

50b3a70

.

bed1179

.

c69e228

.

665b80c

.

8ae2b01

.

752995b

.

83e8a8f

.

4a3a1ac

.

0c45113

.

689fd19

.

f27e5e8

.

3cae5d1

.

3ad15d3

.

3faab3c

.

70730bc

.

d309886

.

9090acd

.

a1d5dfe

.

808dcd1

.

746310a

.

6434316

.

b578daf

.

fa59b25

.

1f6af37

.

1a8adc8

.

8cb130e

.

b65384f

.

68b9bfb

wensun added 17 commits September 1, 2025 09:15

.

e11b2a5

delete ray debug, not useful

a693b1d

remove some debug print

fec31ba

.

d810e04

first draft of decoupled ppo

8736cb0

.

3b6d7f2

.

e3212ef

.

ef2b7f4

.

f047c33

.

869923e

add importance weight option

d6a9922

.

28c10bb

clean up logging

c917afb

more comments

ed9e1f0

revert the yamls back but added importance weight

0fd0458

include all math evals

1641314

.

70b40a3