
Conversation

@GuangguanWang
Contributor

@GuangguanWang GuangguanWang commented Aug 28, 2025

Introduce eRDMA support for StepMesh

@GuangguanWang GuangguanWang changed the title Introduce ERDMA support for StepMesh Introduce eRDMA support for StepMesh Aug 28, 2025
@niehao100
Collaborator

@GuangguanWang Thanks for your contribution, we will test it soon.

Signed-off-by: Guangguan Wang <[email protected]>
Signed-off-by: Guangguan Wang <[email protected]>
@GuangguanWang
Contributor Author

Force-pushed to fix potentially incorrect data length and send_flags in the WR when using inline send.

@niehao100 niehao100 changed the base branch from main to erdma September 4, 2025 11:26
@niehao100
Collaborator

Merge into eRDMA branch.

@niehao100 niehao100 merged commit f63ca39 into stepfun-ai:erdma Sep 4, 2025
1 check passed
@niehao100
Collaborator

Hi @GuangguanWang

Thanks for your contribution. We tested the single-GPU case on H20 with two eRDMA RNICs. There are two problems:

  1. Only the second tensor is pushed correctly; the other two become all zeros;
  2. When RNIC is set to eth2, the automatic interface configuration is wrong.

We need your help with these problems.

root@gpu-h20-0156:/app/ps-lite# ibv_devinfo 
hca_id: erdma_0
        transport:                      eRDMA (0)
        fw_ver:                         0.2.0
        node_guid:                      0216:3eff:fe2f:3775
        sys_image_guid:                 0216:3eff:fe2f:3775
        vendor_id:                      0x1ded
        vendor_part_id:                 4223
        hw_ver:                         0x0
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: erdma_1
        transport:                      eRDMA (0)
        fw_ver:                         0.2.0
        node_guid:                      0216:3eff:fe2f:3111
        sys_image_guid:                 0216:3eff:fe2f:3111
        vendor_id:                      0x1ded
        vendor_part_id:                 4223
        hw_ver:                         0x0
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
root@gpu-h20-0156:/app/ps-lite/fserver# root@gpu-h20-0156:/app/ps-lite# ROLE=joinst RNIC=eth1 bash tests/fserver/run_single_gpu.sh
kill all testing process of ps lite for user 
server [(0, [tensor([[0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0'), tensor([[0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0'), tensor([[0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0')], [0, 1, 2])]
worker push [tensor([[1., 1., 1., 1., 1., 1., 1., 1.]], device='cuda:0'), tensor([[2., 2., 2., 2., 2., 2., 2., 2.]], device='cuda:0'), tensor([[3., 3., 3., 3., 3., 3., 3., 3.]], device='cuda:0')]
server pull result [tensor([[0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0')]
Traceback (most recent call last):
  File "/app/ps-lite/tests/fserver/test_fserver.py", line 28, in <module>
    assert torch.allclose(sum(push_tensors), pull_tensors[0])
AssertionError
^Ckill all testing process of ps lite for user 
^Ctests/fserver/run_single_gpu.sh: line 2: 3091164 Killed                  DMLC_ROLE=scheduler python3 $THIS_DIR/$BIN.py
tests/fserver/run_single_gpu.sh: line 2: 3091165 Killed                  DMLC_ROLE=server python3 $THIS_DIR/$BIN.py $@


root@gpu-h20-0156:/app/ps-lite# PS_VERBOSE=2 ROLE=joinst RNIC=eth2 bash tests/fserver/run_single_gpu.sh
kill all testing process of ps lite for user 
[21:16:34] scheduler /app/ps-lite/src/postoffice.cc:75: Creating Van: ibverbs. group_size=1
[21:16:34] scheduler /app/ps-lite/src/./rdma_van.h:153: bind to DMLC_NODE_HOST: 10.53.6.95
[21:16:34] scheduler /app/ps-lite/src/van.cc:600: Bind to [role=scheduler, id=1, ip=10.53.6.95, port=8123, is_recovery=0, aux_id=-1, num_ports=1]
[21:16:34] scheduler /app/ps-lite/src/./rdma_van.h:178: Connecting to Node 1, My_Node=1
[21:16:34] scheduler /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7f53e4001030 , cq=0x7f53e4001100, qp=14, maxInline=96
[21:16:34] scheduler /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7f53e4001030 , cq=0x7f53e4001100, qp=15, maxInline=96
[21:16:34] scheduler /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7f53e4001030 , cq=0x7f53e4001100, qp=16, maxInline=96
[21:16:34] scheduler /app/ps-lite/src/./rdma_van.h:1014: 1 OnConnect to 1 with Transport=RDMA QP_NUM 16
[21:16:34] scheduler /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7f53e4001030 , cq=0x7f53e4001100, qp=17, maxInline=96
[21:16:34] scheduler /app/ps-lite/src/./rdma_van.h:1014: 1 OnConnect to 1 with Transport=RDMA QP_NUM 17
[21:16:34] worker /app/ps-lite/src/postoffice.cc:75: Creating Van: ibverbs. group_size=1
[21:16:34] server /app/ps-lite/src/postoffice.cc:75: Creating Van: ibverbs. group_size=1
[21:16:34] worker /app/ps-lite/src/van.cc:556: automatic detect interface and ip from gpu: eth1 (10.53.6.96)
[21:16:34] worker /app/ps-lite/src/./rdma_van.h:153: bind to DMLC_NODE_HOST: 10.53.6.96
[21:16:34] worker /app/ps-lite/src/van.cc:600: Bind to [role=worker, ip=10.53.6.96, port=36119, is_recovery=0, aux_id=-1, num_ports=1]
[21:16:34] worker /app/ps-lite/src/./rdma_van.h:178: Connecting to Node 1, My_Node=32767
[21:16:34] worker /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7f2c10001030 , cq=0x7f2c10001100, qp=924, maxInline=96
[21:16:34] worker /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7f2c10001030 , cq=0x7f2c10001100, qp=925, maxInline=96
[21:16:34] server /app/ps-lite/src/van.cc:556: automatic detect interface and ip from gpu: eth1 (10.53.6.96)
[21:16:34] server /app/ps-lite/src/./rdma_van.h:153: bind to DMLC_NODE_HOST: 10.53.6.96
[21:16:34] server /app/ps-lite/src/van.cc:600: Bind to [role=server, ip=10.53.6.96, port=47583, is_recovery=0, aux_id=-1, num_ports=1]
[21:16:34] server /app/ps-lite/src/./rdma_van.h:178: Connecting to Node 1, My_Node=32767
[21:16:34] server /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7fedb8001030 , cq=0x7fedb8001100, qp=926, maxInline=96
[21:16:34] server /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7fedb8001030 , cq=0x7fedb8001100, qp=927, maxInline=96

@GuangguanWang
Contributor Author

GuangguanWang commented Sep 5, 2025

Thanks for the test.

Hi @GuangguanWang

Thanks for your contribution. We tested the single-GPU case on H20 with two eRDMA RNICs. There are two problems:

  1. Only the second tensor is pushed correctly; the other two become all zeros;

I cannot reproduce the issue; I am not sure whether some different environment settings cause it.

(base) root@iZ2zefjxapte7lcerub0tiZ:/mnt/shangguan/StepMesh# ROLE=joint RNIC=eth0 bash tests/fserver/run_single_gpu.sh
kill all testing process of ps lite for user root
[(0, [tensor([[0.0853, 0.3159, 0.1476, ..., 0.9959, 0.0053, 0.3634]],
device='cuda:0'), tensor([[0.4815, 0.9897, 0.6918, ..., 0.3182, 0.1305, 0.9965]],
device='cuda:0'), tensor([[0.7301, 0.6911, 0.4391, ..., 0.8204, 0.3732, 0.2680]],
device='cuda:0')], [0, 1, 2])]
worker test done
kill all testing process of ps lite for user root

  2. When RNIC is set to eth2, the automatic interface configuration is wrong.

In your environment, erdma_0 is attached to eth1 (10.53.6.96) and erdma_1 is attached to eth2 (10.53.6.95), right?
If I understand correctly, the automatic interface configuration finds the RDMA device nearest to the selected GPU in the PCIe topology.
In your environment there are two CPU sockets: GPUs 0/1/2/3 and erdma_0 are under the PCIe topology of socket 0, while GPUs 4/5/6/7 and erdma_1 are under the PCIe topology of socket 1.
If STEPMESH_GPU is 0, the automatic interface configuration should select the netdev eth1 (10.53.6.96, erdma_0), which is closer to GPU 0 than erdma_1. And based on your log, it has selected the right device.
There may be another issue: doing rdma_connect with an explicit source address on eth1 and a destination address on eth2 can hang; we are debugging it now. Does mlx RDMA work well in this case?
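
As a rough cross-check of which RNIC is nearest to the GPU in the PCIe topology (a sketch only, assuming a standard sysfs layout and that nvidia-smi is available on the host), you can compare the PCIe paths of the GPU and the eRDMA devices:

# GPU/NIC affinity matrix reported by the NVIDIA driver (lists RDMA NICs if detected)
nvidia-smi topo -m
# PCIe device paths of the two eRDMA devices; the one sharing a longer
# path prefix (same root complex/switch) with the GPU is the closer one
readlink -f /sys/class/infiniband/erdma_0/device
readlink -f /sys/class/infiniband/erdma_1/device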

@GuangguanWang
Contributor Author

GuangguanWang commented Sep 8, 2025

Hi, @niehao100

Thanks for the test.

Hi @GuangguanWang
Thanks for your contribution. We tested the single-GPU case on H20 with two eRDMA RNICs. There are two problems:

  1. Only the second tensor is pushed correctly; the other two become all zeros;

I cannot reproduce the issue; I am not sure whether some different environment settings cause it.

(base) root@iZ2zefjxapte7lcerub0tiZ:/mnt/shangguan/StepMesh# ROLE=joint RNIC=eth0 bash tests/fserver/run_single_gpu.sh
kill all testing process of ps lite for user root
[(0, [tensor([[0.0853, 0.3159, 0.1476, ..., 0.9959, 0.0053, 0.3634]],
device='cuda:0'), tensor([[0.4815, 0.9897, 0.6918, ..., 0.3182, 0.1305, 0.9965]],
device='cuda:0'), tensor([[0.7301, 0.6911, 0.4391, ..., 0.8204, 0.3732, 0.2680]],
device='cuda:0')], [0, 1, 2])]
worker test done
kill all testing process of ps lite for user root

I still cannot reproduce it.

  2. When RNIC is set to eth2, the automatic interface configuration is wrong.

In your environment, erdma_0 is attached to eth1 (10.53.6.96) and erdma_1 is attached to eth2 (10.53.6.95), right? If I understand correctly, the automatic interface configuration finds the RDMA device nearest to the selected GPU in the PCIe topology. In your environment there are two CPU sockets: GPUs 0/1/2/3 and erdma_0 are under the PCIe topology of socket 0, while GPUs 4/5/6/7 and erdma_1 are under the PCIe topology of socket 1. If STEPMESH_GPU is 0, the automatic interface configuration should select the netdev eth1 (10.53.6.96, erdma_0), which is closer to GPU 0 than erdma_1. And based on your log, it has selected the right device.

I have created a new PR #34 for RDMA device auto-detection when STEPMESH_GPU is not 0. But it is not an eRDMA-specific issue.
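
For reference, a hypothetical run exercising the non-zero-GPU path (assuming the run script forwards STEPMESH_GPU to the processes, and that GPU 4 sits on the socket of erdma_1/eth2) would look like:

# expect the auto-detection to pick the erdma_1 device for a GPU on socket 1
STEPMESH_GPU=4 ROLE=joint RNIC=eth2 bash tests/fserver/run_single_gpu.sh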

There may be another issue: doing rdma_connect with an explicit source address on eth1 and a destination address on eth2 can hang; we are debugging it now. Does mlx RDMA work well in this case?

We have confirmed that it is a bug in the eRDMA driver. We have already fixed it, but we need more time to release a new eRDMA driver.

@niehao100
Collaborator

Hi @GuangguanWang
Thanks for your reply. For the first bug, did you run it on bare metal or in a k8s pod?

@GuangguanWang
Contributor Author

Hi @GuangguanWang Thanks for your reply. For the first bug, did you run it on bare metal or in a k8s pod?

bare metal

@GuangguanWang
Contributor Author

GuangguanWang commented Sep 11, 2025

Hi @GuangguanWang Thanks for your reply. For the first bug, did you run it on bare metal or in a k8s pod?

Hi @niehao100
Which OS did you test on? Alinux3, Ubuntu, or another? And which GPU did you use?

@niehao100
Collaborator

Hi @GuangguanWang Thanks for your reply. For the first bug, did you run it on bare metal or in a k8s pod?

Hi @niehao100 Which OS did you test on? Alinux3, Ubuntu, or another? And which GPU did you use?

Hi @GuangguanWang, we tested on Ubuntu in a self-hosted pod with the eRDMA controller, using 8 H20 GPUs.

@GuangguanWang
Contributor Author

Hi @GuangguanWang Thanks for your reply. For the first bug, did you run it on bare metal or in a k8s pod?

Hi @niehao100 Which OS did you test on? Alinux3, Ubuntu, or another? And which GPU did you use?

Hi @GuangguanWang, we tested on Ubuntu in a self-hosted pod with the eRDMA controller, using 8 H20 GPUs.

Hi @niehao100, which version of mlnx_ofed did you use? The version of mlnx_ofed can be shown by the command "dpkg -l | grep ofed" on Ubuntu.

@niehao100
Collaborator

Hi, I checked our container images; it seems no OFED is installed in the pod.

@GuangguanWang
Contributor Author

GuangguanWang commented Sep 17, 2025

Hi, I checked our container images; it seems no OFED is installed in the pod.

mlnx_ofed is installed on the host, not in the pod (run "dpkg -l | grep ofed" on the host). It was auto-installed at the first start of the ECS GPU instance.

I have reproduced the issue with mlnx_ofed 5.4-3.5.8 and found that it is a bug when using GDR in this version of mlnx_ofed.
The newer version 24.10-2.1.8 has fixed the bug, and it is also the default auto-installed version of mlnx_ofed on Ubuntu
for newly created ECS GPU instances. If the ECS GPU instance you test on was created earlier, the mlnx_ofed may be 5.4-3.5.8 or an earlier version.
If you confirm that an old version of mlnx_ofed is in your environment, you can update mlnx_ofed and reinstall the eRDMA driver on the host to fix the bug, referring to the doc:
https://help.aliyun.com/zh/ecs/user-guide/on-the-gpu-instance-configuration-erdma?spm=5176.12818093_47.console-base_help.dexternal.5adc2cc9VwgSbX&scm=20140722.S_help%40%40%E6%96%87%E6%A1%A3%40%402248432.S_BB1%40bl%2BBB2%40bl%2BRQW%40ag0%2Bos0.ID_2248432-RL_ofed-LOC_console~UND~help-OR_ser-PAR1_213e367917580909130901805ed39e-V_4-P0_1-P1_0#fbe56ff74e705:~:text=CUDNN_VERSION%20%24IS_INSTALL_RDMA%20%24IS_INSTALL_eRDMA-,%E6%89%8B%E5%8A%A8%E5%AE%89%E8%A3%85%E6%96%B9%E5%BC%8F,-GPU%E5%AE%9E%E4%BE%8B%E5%88%9B%E5%BB%BA
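
As a quick pre-check on the host (a sketch, assuming an Ubuntu host; the doc above covers the actual upgrade steps):

# on the host, not in the pod: list the installed mlnx_ofed packages and their versions
dpkg -l | grep ofed
# if the reported version is 5.4-3.5.8 or earlier, upgrade mlnx_ofed per the doc above,
# then reinstall the eRDMA driver on the host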

@GuangguanWang
Contributor Author

GuangguanWang commented Oct 15, 2025

There may be another issue: doing rdma_connect with an explicit source address on eth1 and a destination address on eth2 can hang; we are debugging it now. Does mlx RDMA work well in this case?

We have confirmed that it is a bug in the eRDMA driver. We have already fixed it, but we need more time to release a new eRDMA driver.

Hi @niehao100, the new eRDMA driver including this fix has been released. You can get it by running
curl -O http://mirrors.cloud.aliyuncs.com/erdma/env_setup.sh
bash env_setup.sh --url http://mirrors.cloud.aliyuncs.com/erdma/erdma_installer-1.5.0.tar.gz
on the host of your environment.
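
After installing the new driver, one way to sanity-check that the eRDMA devices came back up (a generic sketch, not specific to this fix):

# both eRDMA devices should be visible and their ports PORT_ACTIVE
ibv_devinfo | grep -E 'hca_id|state'
# loaded erdma kernel module version (if the module declares one)
modinfo erdma | grep -i version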

@niehao100
Collaborator

Hi @GuangguanWang, thanks for the comment; we will try it.
