Multi-GPU in ZeRO-3 mode: no parameter sharding is observed #7416
I tried to deploy Qwen2.5-14B on two GPUs and shard the parameters between them using DeepSpeed's ZeRO-3, to reduce the memory footprint on each GPU. But once I initialized DeepSpeed and got the engine, each GPU appeared to hold the entire model's parameter count. I added a print to see how the model was sharded on each rank, but it looked unsharded. Please help me analyze this in detail. Thank you.
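As a sanity check on what ZeRO-3 should produce: each rank stores roughly 1/world_size of every parameter's flattened elements. A minimal back-of-the-envelope sketch (the function name and parameter counts below are illustrative, not part of the DeepSpeed API):

```python
import math

def zero3_shard_numel(numel: int, world_size: int) -> int:
    """Approximate per-rank element count when ZeRO-3 partitions a
    parameter's flattened storage evenly across ranks (last shard padded)."""
    return math.ceil(numel / world_size)

# Illustrative: a ~14B-parameter model split across 2 GPUs should leave
# each rank physically holding about half of the raw parameter elements.
total = 14_000_000_000
print(zero3_shard_numel(total, 2))  # expected: 7000000000
```

If each rank reports the full 14B elements as locally resident, partitioning is indeed not happening; if most parameters instead show an empty local shape with status NOT_AVAILABLE, partitioning is working as intended.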
Print code:

```python
if rank == 0:
    total_params = 0
    available_params = 0
```
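The truncated counting loop above could be completed along these lines. `FakeParam` below is a stand-in for the per-parameter `ds_status`/`ds_numel` attributes that ZeRO-3 attaches (a simulation to show the tallying logic, not the DeepSpeed API; in a real run you would iterate `engine.module.named_parameters()` instead):

```python
from dataclasses import dataclass

@dataclass
class FakeParam:
    """Stand-in for a ZeRO-3 partitioned parameter (simulation only)."""
    name: str
    status: str   # "AVAILABLE" or "NOT_AVAILABLE", mirroring ZeroParamStatus
    numel: int    # full (unpartitioned) element count

def count_params(params):
    # Tally the full model size vs. the elements locally materialized
    # on this rank, the same split the truncated snippet started.
    total_params = sum(p.numel for p in params)
    available_params = sum(p.numel for p in params if p.status == "AVAILABLE")
    return total_params, available_params

# Hypothetical parameters shaped like the log entries below.
params = [
    FakeParam("layers.4.input_layernorm.weight", "NOT_AVAILABLE", 5120),
    FakeParam("layers.4.post_attention_layernorm.weight", "NOT_AVAILABLE", 5120),
    FakeParam("layers.5.self_attn.q_proj.weight", "NOT_AVAILABLE", 5120 * 5120),
]
total, available = count_params(params)
print(total, available)  # expected: 26224640 0
```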
Logs:
```
module.model.layers.4.input_layernorm.weight | Status: ❌ NOT_AVAILABLE (physically on another rank) | Shape: torch.Size([0]) | Device (logical): cuda:0
ZeroParamStatus.NOT_AVAILABLE
module.model.layers.4.post_attention_layernorm.weight | Status: ❌ NOT_AVAILABLE (physically on another rank) | Shape: torch.Size([0]) | Device (logical): cuda:0
ZeroParamStatus.NOT_AVAILABLE
module.model.layers.5.self_attn.q_proj.weight | Status: ❌ NOT_AVAILABLE (physically on another rank) | Shape: torch.Size([0]) | Device (logical): cuda:0
...
```