
Conversation

@GuangguanWang
Contributor

@GuangguanWang GuangguanWang commented Aug 28, 2025

Introduce eRDMA support for StepMesh

@GuangguanWang GuangguanWang changed the title Introduce ERDMA support for StepMesh Introduce eRDMA support for StepMesh Aug 28, 2025
@niehao100
Collaborator

@GuangguanWang Thanks for your contribution, we will test it soon.

Signed-off-by: Guangguan Wang <[email protected]>
Signed-off-by: Guangguan Wang <[email protected]>
@GuangguanWang
Contributor Author

Force-pushed to fix potentially incorrect data length and send_flags in the WR when using inline send.

@niehao100 niehao100 changed the base branch from main to erdma September 4, 2025 11:26
@niehao100
Collaborator

Merge into eRDMA branch.

@niehao100 niehao100 merged commit f63ca39 into stepfun-ai:erdma Sep 4, 2025
1 check passed
@niehao100
Collaborator

Hi @GuangguanWang

Thanks for your contribution. We tested the single-GPU case on H20 with two eRDMA RNICs. There are two problems:

  1. Only the second tensor is pushed correctly; the other two become all zeros;
  2. When RNIC is set to eth2, the automatic interface configuration is wrong.

We need your help with these problems.

root@gpu-h20-0156:/app/ps-lite# ibv_devinfo 
hca_id: erdma_0
        transport:                      eRDMA (0)
        fw_ver:                         0.2.0
        node_guid:                      0216:3eff:fe2f:3775
        sys_image_guid:                 0216:3eff:fe2f:3775
        vendor_id:                      0x1ded
        vendor_part_id:                 4223
        hw_ver:                         0x0
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: erdma_1
        transport:                      eRDMA (0)
        fw_ver:                         0.2.0
        node_guid:                      0216:3eff:fe2f:3111
        sys_image_guid:                 0216:3eff:fe2f:3111
        vendor_id:                      0x1ded
        vendor_part_id:                 4223
        hw_ver:                         0x0
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
root@gpu-h20-0156:/app/ps-lite/fserver# root@gpu-h20-0156:/app/ps-lite# ROLE=joinst RNIC=eth1 bash tests/fserver/run_single_gpu.sh
kill all testing process of ps lite for user 
server [(0, [tensor([[0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0'), tensor([[0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0'), tensor([[0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0')], [0, 1, 2])]
worker push [tensor([[1., 1., 1., 1., 1., 1., 1., 1.]], device='cuda:0'), tensor([[2., 2., 2., 2., 2., 2., 2., 2.]], device='cuda:0'), tensor([[3., 3., 3., 3., 3., 3., 3., 3.]], device='cuda:0')]
server pull result [tensor([[0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0')]
Traceback (most recent call last):
  File "/app/ps-lite/tests/fserver/test_fserver.py", line 28, in <module>
    assert torch.allclose(sum(push_tensors), pull_tensors[0])
AssertionError
^Ckill all testing process of ps lite for user 
^Ctests/fserver/run_single_gpu.sh: line 2: 3091164 Killed                  DMLC_ROLE=scheduler python3 $THIS_DIR/$BIN.py
tests/fserver/run_single_gpu.sh: line 2: 3091165 Killed                  DMLC_ROLE=server python3 $THIS_DIR/$BIN.py $@


root@gpu-h20-0156:/app/ps-lite# PS_VERBOSE=2 ROLE=joinst RNIC=eth2 bash tests/fserver/run_single_gpu.sh
kill all testing process of ps lite for user 
[21:16:34] scheduler /app/ps-lite/src/postoffice.cc:75: Creating Van: ibverbs. group_size=1
[21:16:34] scheduler /app/ps-lite/src/./rdma_van.h:153: bind to DMLC_NODE_HOST: 10.53.6.95
[21:16:34] scheduler /app/ps-lite/src/van.cc:600: Bind to [role=scheduler, id=1, ip=10.53.6.95, port=8123, is_recovery=0, aux_id=-1, num_ports=1]
[21:16:34] scheduler /app/ps-lite/src/./rdma_van.h:178: Connecting to Node 1, My_Node=1
[21:16:34] scheduler /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7f53e4001030 , cq=0x7f53e4001100, qp=14, maxInline=96
[21:16:34] scheduler /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7f53e4001030 , cq=0x7f53e4001100, qp=15, maxInline=96
[21:16:34] scheduler /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7f53e4001030 , cq=0x7f53e4001100, qp=16, maxInline=96
[21:16:34] scheduler /app/ps-lite/src/./rdma_van.h:1014: 1 OnConnect to 1 with Transport=RDMA QP_NUM 16
[21:16:34] scheduler /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7f53e4001030 , cq=0x7f53e4001100, qp=17, maxInline=96
[21:16:34] scheduler /app/ps-lite/src/./rdma_van.h:1014: 1 OnConnect to 1 with Transport=RDMA QP_NUM 17
[21:16:34] worker /app/ps-lite/src/postoffice.cc:75: Creating Van: ibverbs. group_size=1
[21:16:34] server /app/ps-lite/src/postoffice.cc:75: Creating Van: ibverbs. group_size=1
[21:16:34] worker /app/ps-lite/src/van.cc:556: automatic detect interface and ip from gpu: eth1 (10.53.6.96)
[21:16:34] worker /app/ps-lite/src/./rdma_van.h:153: bind to DMLC_NODE_HOST: 10.53.6.96
[21:16:34] worker /app/ps-lite/src/van.cc:600: Bind to [role=worker, ip=10.53.6.96, port=36119, is_recovery=0, aux_id=-1, num_ports=1]
[21:16:34] worker /app/ps-lite/src/./rdma_van.h:178: Connecting to Node 1, My_Node=32767
[21:16:34] worker /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7f2c10001030 , cq=0x7f2c10001100, qp=924, maxInline=96
[21:16:34] worker /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7f2c10001030 , cq=0x7f2c10001100, qp=925, maxInline=96
[21:16:34] server /app/ps-lite/src/van.cc:556: automatic detect interface and ip from gpu: eth1 (10.53.6.96)
[21:16:34] server /app/ps-lite/src/./rdma_van.h:153: bind to DMLC_NODE_HOST: 10.53.6.96
[21:16:34] server /app/ps-lite/src/van.cc:600: Bind to [role=server, ip=10.53.6.96, port=47583, is_recovery=0, aux_id=-1, num_ports=1]
[21:16:34] server /app/ps-lite/src/./rdma_van.h:178: Connecting to Node 1, My_Node=32767
[21:16:34] server /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7fedb8001030 , cq=0x7fedb8001100, qp=926, maxInline=96
[21:16:34] server /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7fedb8001030 , cq=0x7fedb8001100, qp=927, maxInline=96

@GuangguanWang
Contributor Author

GuangguanWang commented Sep 5, 2025

Thanks for the test.

Hi @GuangguanWang

Thanks for your contribution. We tested the single-GPU case on H20 with two eRDMA RNICs. There are two problems:

  1. Only the second tensor is pushed correctly; the other two become all zeros;

I cannot reproduce the issue; I am not sure whether some different environment settings cause it.

(base) root@iZ2zefjxapte7lcerub0tiZ:/mnt/shangguan/StepMesh# ROLE=joint RNIC=eth0 bash tests/fserver/run_single_gpu.sh
kill all testing process of ps lite for user root
[(0, [tensor([[0.0853, 0.3159, 0.1476, ..., 0.9959, 0.0053, 0.3634]],
device='cuda:0'), tensor([[0.4815, 0.9897, 0.6918, ..., 0.3182, 0.1305, 0.9965]],
device='cuda:0'), tensor([[0.7301, 0.6911, 0.4391, ..., 0.8204, 0.3732, 0.2680]],
device='cuda:0')], [0, 1, 2])]
worker test done
kill all testing process of ps lite for user root

  2. When RNIC is set to eth2, the automatic interface configuration is wrong.

In your environment, erdma_0 is attached to eth1 (10.53.6.96) and erdma_1 is attached to eth2 (10.53.6.95), right?
If I understand correctly, the automatic interface configuration finds the RDMA device nearest to the selected GPU in the PCIe topology.
In your environment there are two CPU sockets: GPUs 0/1/2/3 and erdma_0 are under the PCIe topology of socket 0, while GPUs 4/5/6/7 and erdma_1 are under the PCIe topology of socket 1.
If STEPMESH_GPU is 0, the automatic interface configuration should select the netdev eth1 (10.53.6.96, erdma_0), which is closer to GPU 0 than erdma_1. And based on your log, it has selected the right device.
There may be another issue: doing rdma_connect with an explicit source address on eth1 and a destination address on eth2 can hang; we are debugging it now. Does mlx RDMA work well in this case?
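
As a rough cross-check of which RNIC is nearest to the GPU in the PCIe topology (a sketch only, assuming a standard sysfs layout and that nvidia-smi is available on the host), you can compare the PCIe paths of the GPU and the eRDMA devices:

# GPU/NIC affinity matrix reported by the NVIDIA driver (lists RDMA NICs if detected)
nvidia-smi topo -m
# PCIe device paths of the two eRDMA devices; the one sharing a longer
# path prefix (same root complex/switch) with the GPU is the closer one
readlink -f /sys/class/infiniband/erdma_0/device
readlink -f /sys/class/infiniband/erdma_1/device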

@GuangguanWang
Contributor Author

GuangguanWang commented Sep 8, 2025

Hi, @niehao100

Thanks for the test.

Hi @GuangguanWang
Thanks for your contribution. We tested the single-GPU case on H20 with two eRDMA RNICs. There are two problems:

  1. Only the second tensor is pushed correctly; the other two become all zeros;

I cannot reproduce the issue; I am not sure whether some different environment settings cause it.

(base) root@iZ2zefjxapte7lcerub0tiZ:/mnt/shangguan/StepMesh# ROLE=joint RNIC=eth0 bash tests/fserver/run_single_gpu.sh
kill all testing process of ps lite for user root
[(0, [tensor([[0.0853, 0.3159, 0.1476, ..., 0.9959, 0.0053, 0.3634]],
device='cuda:0'), tensor([[0.4815, 0.9897, 0.6918, ..., 0.3182, 0.1305, 0.9965]],
device='cuda:0'), tensor([[0.7301, 0.6911, 0.4391, ..., 0.8204, 0.3732, 0.2680]],
device='cuda:0')], [0, 1, 2])]
worker test done
kill all testing process of ps lite for user root

I still cannot reproduce it.

  2. When RNIC is set to eth2, the automatic interface configuration is wrong.

In your environment, erdma_0 is attached to eth1 (10.53.6.96) and erdma_1 is attached to eth2 (10.53.6.95), right? If I understand correctly, the automatic interface configuration finds the RDMA device nearest to the selected GPU in the PCIe topology. In your environment there are two CPU sockets: GPUs 0/1/2/3 and erdma_0 are under the PCIe topology of socket 0, while GPUs 4/5/6/7 and erdma_1 are under the PCIe topology of socket 1. If STEPMESH_GPU is 0, the automatic interface configuration should select the netdev eth1 (10.53.6.96, erdma_0), which is closer to GPU 0 than erdma_1. And based on your log, it has selected the right device.

I have created a new PR #34 for RDMA device auto-detection when STEPMESH_GPU is not 0. But it is not an eRDMA-specific issue.
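
For reference, a hypothetical run exercising the non-zero-GPU path (assuming the run script forwards STEPMESH_GPU to the processes, and that GPU 4 sits on the socket of erdma_1/eth2) would look like:

# expect the auto-detection to pick the erdma_1 device for a GPU on socket 1
STEPMESH_GPU=4 ROLE=joint RNIC=eth2 bash tests/fserver/run_single_gpu.sh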

There may be another issue: doing rdma_connect with an explicit source address on eth1 and a destination address on eth2 can hang; we are debugging it now. Does mlx RDMA work well in this case?

We have confirmed that it is a bug in the eRDMA driver. We have already fixed it, but we need more time to release a new eRDMA driver.

@niehao100
Collaborator

Hi @GuangguanWang
Thanks for your reply. For the first bug, did you run it on bare metal or in a k8s pod?

@GuangguanWang
Contributor Author

Hi @GuangguanWang Thanks for your reply. For the first bug, did you run it on bare metal or in a k8s pod?

bare metal

@GuangguanWang
Contributor Author

GuangguanWang commented Sep 11, 2025

Hi @GuangguanWang Thanks for your reply. For the first bug, did you run it on bare metal or in a k8s pod?

Hi @niehao100
Which OS did you test on? Alinux3, Ubuntu, or another? And which GPU did you use?

@niehao100
Collaborator

Hi @GuangguanWang Thanks for your reply. For the first bug, did you run it on bare metal or in a k8s pod?

Hi @niehao100 Which OS did you test on? Alinux3, Ubuntu, or another? And which GPU did you use?

Hi @GuangguanWang, we tested on Ubuntu in a self-hosted pod with the eRDMA controller, using 8 H20 GPUs.

@GuangguanWang
Contributor Author

Hi @GuangguanWang Thanks for your reply. For the first bug, did you run it on bare metal or in a k8s pod?

Hi @niehao100 Which OS did you test on? Alinux3, Ubuntu, or another? And which GPU did you use?

Hi @GuangguanWang, we tested on Ubuntu in a self-hosted pod with the eRDMA controller, using 8 H20 GPUs.

Hi @niehao100, which version of mlnx_ofed did you use? The version of mlnx_ofed can be shown by the command "dpkg -l | grep ofed" on Ubuntu.

@niehao100
Collaborator

Hi, I checked our container images; it seems no OFED is installed in the pod.

@GuangguanWang
Contributor Author

GuangguanWang commented Sep 17, 2025

Hi, I checked our container images; it seems no OFED is installed in the pod.

mlnx_ofed is installed on the host, not in the pod (run "dpkg -l | grep ofed" on the host). It was auto-installed at the first start of the ECS GPU instance.

I have reproduced the issue with mlnx_ofed 5.4-3.5.8 and found that it is a bug when using GDR in this version of mlnx_ofed.
The newer version 24.10-2.1.8 has fixed the bug, and it is also the default auto-installed version of mlnx_ofed on Ubuntu
for newly created ECS GPU instances. If the ECS GPU instance you test on was created earlier, the mlnx_ofed may be 5.4-3.5.8 or an earlier version.
If you confirm that an old version of mlnx_ofed is in your environment, you can update mlnx_ofed and reinstall the eRDMA driver on the host to fix the bug, referring to the doc:
https://help.aliyun.com/zh/ecs/user-guide/on-the-gpu-instance-configuration-erdma?spm=5176.12818093_47.console-base_help.dexternal.5adc2cc9VwgSbX&scm=20140722.S_help%40%40%E6%96%87%E6%A1%A3%40%402248432.S_BB1%40bl%2BBB2%40bl%2BRQW%40ag0%2Bos0.ID_2248432-RL_ofed-LOC_console~UND~help-OR_ser-PAR1_213e367917580909130901805ed39e-V_4-P0_1-P1_0#fbe56ff74e705:~:text=CUDNN_VERSION%20%24IS_INSTALL_RDMA%20%24IS_INSTALL_eRDMA-,%E6%89%8B%E5%8A%A8%E5%AE%89%E8%A3%85%E6%96%B9%E5%BC%8F,-GPU%E5%AE%9E%E4%BE%8B%E5%88%9B%E5%BB%BA
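
As a quick pre-check on the host (a sketch, assuming an Ubuntu host; the doc above covers the actual upgrade steps):

# on the host, not in the pod: list the installed mlnx_ofed packages and their versions
dpkg -l | grep ofed
# if the reported version is 5.4-3.5.8 or earlier, upgrade mlnx_ofed per the doc above,
# then reinstall the eRDMA driver on the host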

@GuangguanWang
Contributor Author

GuangguanWang commented Oct 15, 2025

There may be another issue: doing rdma_connect with an explicit source address on eth1 and a destination address on eth2 can hang; we are debugging it now. Does mlx RDMA work well in this case?

We have confirmed that it is a bug in the eRDMA driver. We have already fixed it, but we need more time to release a new eRDMA driver.

Hi @niehao100, the new eRDMA driver including this fix has been released. You can get it by running
curl -O http://mirrors.cloud.aliyuncs.com/erdma/env_setup.sh
bash env_setup.sh --url http://mirrors.cloud.aliyuncs.com/erdma/erdma_installer-1.5.0.tar.gz
on the host of your environment.
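
After installing the new driver, one way to sanity-check that the eRDMA devices came back up (a generic sketch, not specific to this fix):

# both eRDMA devices should be visible and their ports PORT_ACTIVE
ibv_devinfo | grep -E 'hca_id|state'
# loaded erdma kernel module version (if the module declares one)
modinfo erdma | grep -i version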

@niehao100
Collaborator

Hi @GuangguanWang, thanks for the comment; we will try it.
