[Bugfix] Initialize attention bias on the same device as Query/Key/Value #13468
Conversation
Force-pushed from f466344 to 87824ed.
The pre-commit CI passed once, but failed after I signed off and force-pushed. I'm not sure why.
This could solve issues like huggingface/open-r1#278 and facebookresearch/xformers#1064 (comment).
Signed-off-by: Junlin Zhou <[email protected]>
Force-pushed from 87824ed to 275d082.
Using vllm==0.7.3, I still have this issue.
Same question; how can this be solved?
You need to either install from the main branch, or wait for a release.
[Bugfix] Initialize attention bias on the same device as Query/Key/Value (vllm-project#13468) Signed-off-by: Louis Ulmer <[email protected]>
The attention bias in vLLM's xformers backend is currently initialized on the default device, rather than the device of the Q/K/V tensors:
vllm/vllm/attention/backends/xformers.py, lines 676 to 677 at b53d799
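For illustration, here is a paraphrased sketch of the pattern in question and the kind of change this PR makes. It assumes the bias is built via BlockDiagonalCausalMask.from_seqlens and that from_seqlens accepts a device keyword (as newer xformers releases do); build_causal_bias is a hypothetical helper name, and the exact lines in the diff may differ.

```python
import torch
from xformers.ops.fmha.attn_bias import BlockDiagonalCausalMask


def build_causal_bias(query: torch.Tensor, seq_lens: list[int]) -> BlockDiagonalCausalMask:
    """Build the block-diagonal causal bias on the same device as Q/K/V."""
    # Before the fix (sketch): no device was passed, so xformers fell back to
    # its default device (typically cuda:0 when CUDA is available):
    #     return BlockDiagonalCausalMask.from_seqlens(seq_lens)
    #
    # After the fix (sketch): create the bias where the query tensor lives.
    return BlockDiagonalCausalMask.from_seqlens(seq_lens, device=query.device)
```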
And here is how xformers decides which device to use:
https://github.com/facebookresearch/xformers/blob/8d91ce05a2f6a5ae059593922a631b9ff325b134/xformers/ops/fmha/attn_bias.py#L742
https://github.com/facebookresearch/xformers/blob/8d91ce05a2f6a5ae059593922a631b9ff325b134/xformers/ops/fmha/attn_bias.py#L90
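In short, when no device is supplied, xformers picks a default instead of inferring it from Q/K/V. A rough approximation of that fallback, just to make the failure mode concrete (the function name and body are illustrative, not xformers' actual internals; see the links above for the real implementation):

```python
from typing import Optional

import torch


def _default_bias_device(device: Optional[torch.device] = None) -> torch.device:
    # Approximation: if the caller supplies a device, use it; otherwise fall
    # back to the current CUDA device (effectively cuda:0 in most setups) or
    # the CPU, ignoring where the Q/K/V tensors actually live.
    if device is not None:
        return torch.device(device)
    return torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
```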
This becomes problematic when vLLM is used in conjunction with libraries like trl for GRPO training. In such cases, vLLM might be assigned to run on a specific GPU (e.g., the next available GPU after those used for training, which is the default behaviour of trl). For example, if I have 8 GPUs and use cuda:0 to cuda:6 for GRPO training, vLLM will then be assigned to cuda:7. However, the current attention bias initialization will place the bias on cuda:0, leading to a device-mismatch error at runtime. This PR should solve this issue.
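To make the failure concrete outside of vLLM/trl, here is a minimal PyTorch reproduction of the same device mismatch (assuming at least two GPUs; the device indices and tensor shapes are illustrative):

```python
import torch

# Q lives on the GPU assigned to inference (cuda:7 in the scenario above);
# cuda:1 stands in for it here.
q = torch.randn(1, 8, 16, 64, device="cuda:1", dtype=torch.float16)

# A bias created without an explicit device lands on the default device
# (cuda:0), not on q.device ...
bias_wrong = torch.zeros(1, 8, 16, 16, device="cuda", dtype=torch.float16)

try:
    _ = q @ q.transpose(-1, -2) + bias_wrong
except RuntimeError as err:
    # "Expected all tensors to be on the same device ..."
    print(err)

# ... whereas pinning the bias to q.device works as expected.
bias_ok = torch.zeros(1, 8, 16, 16, device=q.device, dtype=torch.float16)
_ = q @ q.transpose(-1, -2) + bias_ok
```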