
NCCL error when running vf-vllm with multiple GPUs #239

@hrdkbhatnagar

Description

Thanks for the great work!

I wanted to run examples/train_math_python.py on an 8xA100 machine with a smaller model.

My Setup

Training script

import verifiers as vf

vf_env = vf.load_environment(env_id="math-python")
model_name = "Qwen/Qwen2.5-1.5B-Instruct"  
model, tokenizer = vf.get_model_and_tokenizer(model_name)

run_name = "math-python_qwen2.5-1.5b"
training_args = vf.grpo_defaults(run_name=run_name)

training_args.per_device_train_batch_size = 4 
training_args.num_generations = 8  
training_args.gradient_accumulation_steps = 8
training_args.max_tokens = 2048
training_args.max_seq_len = 4096
training_args.max_steps = 200
training_args.mask_env_responses = True
training_args.max_grad_norm = 0.1
training_args.beta = 0.1

trainer = vf.GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    env=vf_env,
    args=training_args,
)
trainer.train()

Commands (following docs and example)

# Shell 1 - vLLM server
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 vf-vllm --model Qwen/Qwen2.5-1.5B-Instruct \
    --data-parallel-size 6 --enforce-eager --disable-log-requests

# Shell 2 - Training
CUDA_VISIBLE_DEVICES=6,7 accelerate launch --num-processes 2 \
    --config-file verifiers/configs/zero3.yaml train.py

However, I get an NCCL error. With NCCL_DEBUG=INFO I can see that vf-vllm consistently creates 7 NCCL ranks when --data-parallel-size 6 is specified:

ncclCommInitRank comm 0x62b7b22d6c10 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 7000

This in turn leads to:

Cuda failure 'invalid argument'
RuntimeError: NCCL error: unhandled cuda error
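
In case it helps narrow things down, here is a minimal standalone NCCL check I would use to rule out a machine-level problem (a hypothetical script, assuming only torchrun and torch.distributed; it is not part of verifiers, and I am not claiming anything about vf-vllm internals):

# nccl_check.py -- hypothetical standalone sanity check, launched with:
#   CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 torchrun --nproc_per_node=6 nccl_check.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK / WORLD_SIZE / LOCAL_RANK in the environment
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # one all_reduce across the visible GPUs; every rank should print the world size
    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce sum = {x.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()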

I looked through all the closed issues related to NCCL here and tried the following:

  • NCCL_P2P_DISABLE=1
  • NCCL_CUMEM_ENABLE=1
  • Tried on multiple providers

My machine's details are as follows:

- Platform: Linux-6.8.0-60-generic-x86_64-with-glibc2.35
- Python version: 3.11.13
- TRL version: 0.21.0
- PyTorch version: 2.7.1+cu126
- CUDA devices: 8x NVIDIA A100-SXM4-80GB
- Transformers version: 4.55.4
- Accelerate version: 1.10.1
- DeepSpeed version: 0.17.5
- vLLM version: 0.10.1.1+cu126
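
For reference, a short snippet along these lines reproduces the report above (a sketch; it assumes each package exposes a standard __version__ attribute):

import platform
import torch, trl, transformers, accelerate, deepspeed, vllm

print("Platform:", platform.platform())
print("Python:", platform.python_version())
print("TRL:", trl.__version__)
print("PyTorch:", torch.__version__)
print("CUDA devices:", torch.cuda.device_count(), "x", torch.cuda.get_device_name(0))
print("Transformers:", transformers.__version__)
print("Accelerate:", accelerate.__version__)
print("DeepSpeed:", deepspeed.__version__)
print("vLLM:", vllm.__version__)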

I started from a fresh environment like this:

uv init
uv add 'verifiers[all]'
source .venv/bin/activate
uv pip install flash-attn --no-build-isolation

I tried this on different providers to see if it reproduces, and it does. I am not sure what the issue is. Is there a known issue with vf-vllm creating n+1 NCCL ranks?

nvidia-smi output: (screenshot attached)
