Multi-GPU in ZeRO-3 mode: no parameter sharding is observed #7416
I tried to deploy Qwen2.5-14B on two GPUs and shard the parameters between them using DeepSpeed's ZeRO-3, to reduce the memory footprint on each GPU. But once I initialized DeepSpeed and got the engine, each GPU appeared to hold the entire model's parameter count. I added a print to see how the model was sharded on each rank, but it looked unsharded. Please help me analyze this in detail. Thank you.
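As a sanity check on what ZeRO-3 should produce: each rank stores roughly 1/world_size of every parameter's flattened elements. A minimal back-of-the-envelope sketch (the function name and parameter counts below are illustrative, not part of the DeepSpeed API):

```python
import math

def zero3_shard_numel(numel: int, world_size: int) -> int:
    """Approximate per-rank element count when ZeRO-3 partitions a
    parameter's flattened storage evenly across ranks (last shard padded)."""
    return math.ceil(numel / world_size)

# Illustrative: a ~14B-parameter model split across 2 GPUs should leave
# each rank physically holding about half of the raw parameter elements.
total = 14_000_000_000
print(zero3_shard_numel(total, 2))  # expected: 7000000000
```

If each rank reports the full 14B elements as locally resident, partitioning is indeed not happening; if most parameters instead show an empty local shape with status NOT_AVAILABLE, partitioning is working as intended.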
Print code:

```python
if rank == 0:
    total_params = 0
    available_params = 0
```
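The truncated counting loop above could be completed along these lines. `FakeParam` below is a stand-in for the per-parameter `ds_status`/`ds_numel` attributes that ZeRO-3 attaches (a simulation to show the tallying logic, not the DeepSpeed API; in a real run you would iterate `engine.module.named_parameters()` instead):

```python
from dataclasses import dataclass

@dataclass
class FakeParam:
    """Stand-in for a ZeRO-3 partitioned parameter (simulation only)."""
    name: str
    status: str   # "AVAILABLE" or "NOT_AVAILABLE", mirroring ZeroParamStatus
    numel: int    # full (unpartitioned) element count

def count_params(params):
    # Tally the full model size vs. the elements locally materialized
    # on this rank, the same split the truncated snippet started.
    total_params = sum(p.numel for p in params)
    available_params = sum(p.numel for p in params if p.status == "AVAILABLE")
    return total_params, available_params

# Hypothetical parameters shaped like the log entries below.
params = [
    FakeParam("layers.4.input_layernorm.weight", "NOT_AVAILABLE", 5120),
    FakeParam("layers.4.post_attention_layernorm.weight", "NOT_AVAILABLE", 5120),
    FakeParam("layers.5.self_attn.q_proj.weight", "NOT_AVAILABLE", 5120 * 5120),
]
total, available = count_params(params)
print(total, available)  # expected: 26224640 0
```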
Logs:
```
module.model.layers.4.input_layernorm.weight | Status: ❌ NOT_AVAILABLE (physically on another rank) | Shape: torch.Size([0]) | Device (logical): cuda:0
ZeroParamStatus.NOT_AVAILABLE
module.model.layers.4.post_attention_layernorm.weight | Status: ❌ NOT_AVAILABLE (physically on another rank) | Shape: torch.Size([0]) | Device (logical): cuda:0
ZeroParamStatus.NOT_AVAILABLE
module.model.layers.5.self_attn.q_proj.weight | Status: ❌ NOT_AVAILABLE (physically on another rank) | Shape: torch.Size([0]) | Device (logical): cuda:0
...
```