Can we support multiple GPUs for inference?

I tried to run the 27B Gemma 3 model on two A100 GPUs with 40 GB of VRAM each. Running the script directly fails with an out-of-memory error, because the bf16 model is about 55 GB and does not fit on a single A100. I modified the script to use DeepSpeed and eventually got the model running across both A100s. However, each run takes far too long, especially with a large output_len: with output_len set to 1500, a single prompt takes about 2 hours on the 27B bf16 model.
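For anyone checking the numbers, here is a rough back-of-the-envelope sketch of the arithmetic behind this report. It assumes a nominal 27e9 parameters at 2 bytes each for bf16, counts weights only (no KV cache or activations), and assumes all 1500 tokens were actually generated in the 2-hour run. The function and variable names are made up for illustration and are not from the repository.

```python
# Back-of-the-envelope estimate; every constant here is an assumption
# taken from the figures quoted in the issue, not a measurement.

def weights_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes); bf16 uses 2 bytes."""
    return num_params * bytes_per_param / 1e9


def seconds_per_token(total_seconds: float, num_tokens: int) -> float:
    """Average wall-clock time per generated token."""
    return total_seconds / num_tokens


NUM_PARAMS = 27e9    # assumption: nominal "27B" parameter count
GPU_VRAM_GB = 40     # one A100 40 GB card
NUM_GPUS = 2

model_gb = weights_gb(NUM_PARAMS)
fits_on_one_gpu = model_gb <= GPU_VRAM_GB
fits_on_all_gpus = model_gb <= GPU_VRAM_GB * NUM_GPUS

# About 2 hours for output_len = 1500, as reported above.
latency = seconds_per_token(2 * 3600, 1500)

print(f"bf16 weights: ~{model_gb:.0f} GB")
print(f"fits on one GPU: {fits_on_one_gpu}, fits across {NUM_GPUS}: {fits_on_all_gpus}")
print(f"average latency: {latency:.1f} s/token")
```

The weights alone exceed one 40 GB card but fit within the combined 80 GB, which matches the out-of-memory error on a single GPU. The reported run time works out to several seconds per token under these assumptions.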

Can we support multiple GPUs for inference? #87
