
NCCL error when running vf-vllm with multiple GPUs #239

@hrdkbhatnagar

Description

Thanks for the great work!

I wanted to run examples/train_math_python.py on an 8xA100 machine with a smaller model.

My Setup

Training script

import verifiers as vf

vf_env = vf.load_environment(env_id="math-python")
model_name = "Qwen/Qwen2.5-1.5B-Instruct"  
model, tokenizer = vf.get_model_and_tokenizer(model_name)

run_name = "math-python_qwen2.5-1.5b"
training_args = vf.grpo_defaults(run_name=run_name)

training_args.per_device_train_batch_size = 4 
training_args.num_generations = 8  
training_args.gradient_accumulation_steps = 8
training_args.max_tokens = 2048
training_args.max_seq_len = 4096
training_args.max_steps = 200
training_args.mask_env_responses = True
training_args.max_grad_norm = 0.1
training_args.beta = 0.1

trainer = vf.GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    env=vf_env,
    args=training_args,
)
trainer.train()

Commands (following docs and example)

# Shell 1 - vLLM server
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 vf-vllm --model Qwen/Qwen2.5-1.5B-Instruct \
    --data-parallel-size 6 --enforce-eager --disable-log-requests

# Shell 2 - Training
CUDA_VISIBLE_DEVICES=6,7 accelerate launch --num-processes 2 \
    --config-file verifiers/configs/zero3.yaml train.py

However, I get an NCCL error. With NCCL_DEBUG=INFO I can see that vf-vllm consistently creates 7 NCCL ranks when --data-parallel-size 6 is specified:

ncclCommInitRank comm 0x62b7b22d6c10 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 7000

This in turn leads to:

Cuda failure 'invalid argument'
RuntimeError: NCCL error: unhandled cuda error
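
In case it helps narrow things down, here is a minimal standalone NCCL check I would use to rule out a machine-level problem (a hypothetical script, assuming only torchrun and torch.distributed; it is not part of verifiers, and I am not claiming anything about vf-vllm internals):

# nccl_check.py -- hypothetical standalone sanity check, launched with:
#   CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 torchrun --nproc_per_node=6 nccl_check.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK / WORLD_SIZE / LOCAL_RANK in the environment
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # one all_reduce across the visible GPUs; every rank should print the world size
    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce sum = {x.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()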

I looked through all the closed issues related to NCCL here and tried the following:

  • NCCL_P2P_DISABLE=1
  • NCCL_CUMEM_ENABLE=1
  • Tried on multiple providers

My machine's details are as follows:

- Platform: Linux-6.8.0-60-generic-x86_64-with-glibc2.35
- Python version: 3.11.13
- TRL version: 0.21.0
- PyTorch version: 2.7.1+cu126
- CUDA devices: 8x NVIDIA A100-SXM4-80GB
- Transformers version: 4.55.4
- Accelerate version: 1.10.1
- DeepSpeed version: 0.17.5
- vLLM version: 0.10.1.1+cu126
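
For reference, a short snippet along these lines reproduces the report above (a sketch; it assumes each package exposes a standard __version__ attribute):

import platform
import torch, trl, transformers, accelerate, deepspeed, vllm

print("Platform:", platform.platform())
print("Python:", platform.python_version())
print("TRL:", trl.__version__)
print("PyTorch:", torch.__version__)
print("CUDA devices:", torch.cuda.device_count(), "x", torch.cuda.get_device_name(0))
print("Transformers:", transformers.__version__)
print("Accelerate:", accelerate.__version__)
print("DeepSpeed:", deepspeed.__version__)
print("vLLM:", vllm.__version__)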

I started from a fresh environment like this:

uv init
uv add 'verifiers[all]'
source .venv/bin/activate
uv pip install flash-attn --no-build-isolation

I tried this on different providers to see if it reproduces, and it does. I am not sure what the issue is. Is there a known issue with vf-vllm creating n+1 NCCL ranks?

nvidia-smi output: (screenshot attached)
