Thanks for the great work!
I wanted to run examples/train_math_python.py on an 8xA100 machine with a smaller model.
My Setup
Training script
import verifiers as vf
vf_env = vf.load_environment(env_id="math-python")
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model, tokenizer = vf.get_model_and_tokenizer(model_name)
run_name = "math-python_qwen2.5-1.5b"
training_args = vf.grpo_defaults(run_name=run_name)
training_args.per_device_train_batch_size = 4
training_args.num_generations = 8
training_args.gradient_accumulation_steps = 8
training_args.max_tokens = 2048
training_args.max_seq_len = 4096
training_args.max_steps = 200
training_args.mask_env_responses = True
training_args.max_grad_norm = 0.1
training_args.beta = 0.1
trainer = vf.GRPOTrainer(
model=model,
processing_class=tokenizer,
env=vf_env,
args=training_args,
)
trainer.train()
Commands (following docs and example)
# Shell 1 - vLLM server
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 vf-vllm --model Qwen/Qwen2.5-1.5B-Instruct \
--data-parallel-size 6 --enforce-eager --disable-log-requests
# Shell 2 - Training
CUDA_VISIBLE_DEVICES=6,7 accelerate launch --num-processes 2 \
--config-file verifiers/configs/zero3.yaml train.py
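As a basic sanity check on the GPU split (this snippet is my own, not part of verifiers; the file name check_gpus.py is just what I called it), each shell should only see its intended devices before anything is launched:
# check_gpus.py - my own quick check, not part of verifiers.
# Run in each shell with the same CUDA_VISIBLE_DEVICES as above, before
# starting vf-vllm / accelerate, to confirm the intended GPU split.
import os
import torch

visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<unset: all GPUs visible>")
print(f"CUDA_VISIBLE_DEVICES = {visible}")
print(f"torch.cuda.device_count() = {torch.cuda.device_count()}")
# Expected: 6 in the vf-vllm shell (GPUs 0-5) and 2 in the training shell (GPUs 6,7).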
However, I get an NCCL error. With NCCL_DEBUG=INFO I find that vf-vllm consistently creates 7 NCCL ranks when --data-parallel-size 6 is specified:
ncclCommInitRank comm 0x62b7b22d6c10 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 7000
This leads to:
Cuda failure 'invalid argument'
RuntimeError: NCCL error: unhandled cuda error
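To help isolate whether this is a general NCCL/driver problem or something specific to how vf-vllm sets up its data-parallel group, here is a minimal standalone NCCL smoke test (my own diagnostic script, independent of verifiers/vf-vllm; the name nccl_check.py is arbitrary) that can be run over the same six GPUs:
# nccl_check.py - minimal NCCL all-reduce smoke test (my own script, unrelated to vf-vllm).
# Run with: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 torchrun --nproc-per-node=6 nccl_check.py
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x)  # every rank should end up with world_size (6.0) if NCCL is healthy
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {x.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()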
I looked through all the closed issues related to NCCL here and tried the following:
- NCCL_P2P_DISABLE=1
- NCCL_CUMEM_ENABLE=1
- Tried on multiple providers
My machine's details are as follows:
- Platform: Linux-6.8.0-60-generic-x86_64-with-glibc2.35
- Python version: 3.11.13
- TRL version: 0.21.0
- PyTorch version: 2.7.1+cu126
- CUDA devices: 8x NVIDIA A100-SXM4-80GB
- Transformers version: 4.55.4
- Accelerate version: 1.10.1
- DeepSpeed version: 0.17.5
- vLLM version: 0.10.1.1+cu126
I started from a fresh environment like this:
uv init
uv add 'verifiers[all]'
source .venv/bin/activate
uv pip install flash-attn --no-build-isolation
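For completeness, the version numbers above can be double-checked with a quick dump like this (my own snippet, plain introspection, nothing verifiers-specific):
# versions.py - quick environment dump (my own snippet).
import torch
import transformers
import vllm

print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
print("nccl:", ".".join(str(v) for v in torch.cuda.nccl.version()))
print("transformers:", transformers.__version__)
print("vllm:", vllm.__version__)
print("visible gpus:", torch.cuda.device_count())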
I tried this on different providers to see if it reproduces, and it does. I'm not sure what the issue is. Is there a known issue with vf-vllm creating n+1 ranks?
nvidia-smi: (screenshot of the output attached)