Introduce eRDMA support for StepMesh #32
Conversation
@GuangguanWang Thanks for your contribution, we will test it soon.

Signed-off-by: Guangguan Wang <[email protected]>
Signed-off-by: Guangguan Wang <[email protected]>
Signed-off-by: Guangguan Wang <[email protected]>

Force push to fix potentially incorrect data len and send_flags in the WR when sending inline.
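
For context, a schematic sketch (not the actual C++ change in this PR) of why the data length and send flags have to stay consistent for inline sends: a payload may only be marked inline when it fits within the QP's max_inline_data (reported as maxInline=96 in the logs below). The flag values mirror enum ibv_send_flags; the helper name is hypothetical:

IBV_SEND_SIGNALED = 1 << 1
IBV_SEND_INLINE = 1 << 3

def fill_wr(data_len, max_inline):
    # Only request inline when the payload actually fits; the length recorded
    # in the WR must match the bytes that will be copied into the WQE.
    send_flags = IBV_SEND_SIGNALED
    if data_len <= max_inline:
        send_flags |= IBV_SEND_INLINE
    return {'length': data_len, 'send_flags': send_flags}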
    
Merge into eRDMA branch.

Thanks for your contribution. We tested the single-GPU case on H20 with two eRDMA RNICs. There are two problems; we need your help with them.

root@gpu-h20-0156:/app/ps-lite# ibv_devinfo
hca_id: erdma_0
        transport:                      eRDMA (0)
        fw_ver:                         0.2.0
        node_guid:                      0216:3eff:fe2f:3775
        sys_image_guid:                 0216:3eff:fe2f:3775
        vendor_id:                      0x1ded
        vendor_part_id:                 4223
        hw_ver:                         0x0
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
hca_id: erdma_1
        transport:                      eRDMA (0)
        fw_ver:                         0.2.0
        node_guid:                      0216:3eff:fe2f:3111
        sys_image_guid:                 0216:3eff:fe2f:3111
        vendor_id:                      0x1ded
        vendor_part_id:                 4223
        hw_ver:                         0x0
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
root@gpu-h20-0156:/app/ps-lite/fserver#
root@gpu-h20-0156:/app/ps-lite# ROLE=joinst RNIC=eth1 bash tests/fserver/run_single_gpu.sh
kill all testing process of ps lite for user 
server [(0, [tensor([[0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0'), tensor([[0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0'), tensor([[0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0')], [0, 1, 2])]
worker push [tensor([[1., 1., 1., 1., 1., 1., 1., 1.]], device='cuda:0'), tensor([[2., 2., 2., 2., 2., 2., 2., 2.]], device='cuda:0'), tensor([[3., 3., 3., 3., 3., 3., 3., 3.]], device='cuda:0')]
server pull result [tensor([[0., 0., 0., 0., 0., 0., 0., 0.]], device='cuda:0')]
Traceback (most recent call last):
  File "/app/ps-lite/tests/fserver/test_fserver.py", line 28, in <module>
    assert torch.allclose(sum(push_tensors), pull_tensors[0])
AssertionError
^Ckill all testing process of ps lite for user 
^Ctests/fserver/run_single_gpu.sh: line 2: 3091164 Killed                  DMLC_ROLE=scheduler python3 $THIS_DIR/$BIN.py
tests/fserver/run_single_gpu.sh: line 2: 3091165 Killed                  DMLC_ROLE=server python3 $THIS_DIR/$BIN.py $@
root@gpu-h20-0156:/app/ps-lite# PS_VERBOSE=2 ROLE=joinst RNIC=eth2 bash tests/fserver/run_single_gpu.sh
kill all testing process of ps lite for user 
[21:16:34] scheduler /app/ps-lite/src/postoffice.cc:75: Creating Van: ibverbs. group_size=1
[21:16:34] scheduler /app/ps-lite/src/./rdma_van.h:153: bind to DMLC_NODE_HOST: 10.53.6.95
[21:16:34] scheduler /app/ps-lite/src/van.cc:600: Bind to [role=scheduler, id=1, ip=10.53.6.95, port=8123, is_recovery=0, aux_id=-1, num_ports=1]
[21:16:34] scheduler /app/ps-lite/src/./rdma_van.h:178: Connecting to Node 1, My_Node=1
[21:16:34] scheduler /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7f53e4001030 , cq=0x7f53e4001100, qp=14, maxInline=96
[21:16:34] scheduler /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7f53e4001030 , cq=0x7f53e4001100, qp=15, maxInline=96
[21:16:34] scheduler /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7f53e4001030 , cq=0x7f53e4001100, qp=16, maxInline=96
[21:16:34] scheduler /app/ps-lite/src/./rdma_van.h:1014: 1 OnConnect to 1 with Transport=RDMA QP_NUM 16
[21:16:34] scheduler /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7f53e4001030 , cq=0x7f53e4001100, qp=17, maxInline=96
[21:16:34] scheduler /app/ps-lite/src/./rdma_van.h:1014: 1 OnConnect to 1 with Transport=RDMA QP_NUM 17
[21:16:34] worker /app/ps-lite/src/postoffice.cc:75: Creating Van: ibverbs. group_size=1
[21:16:34] server /app/ps-lite/src/postoffice.cc:75: Creating Van: ibverbs. group_size=1
[21:16:34] worker /app/ps-lite/src/van.cc:556: automatic detect interface and ip from gpu: eth1 (10.53.6.96)
[21:16:34] worker /app/ps-lite/src/./rdma_van.h:153: bind to DMLC_NODE_HOST: 10.53.6.96
[21:16:34] worker /app/ps-lite/src/van.cc:600: Bind to [role=worker, ip=10.53.6.96, port=36119, is_recovery=0, aux_id=-1, num_ports=1]
[21:16:34] worker /app/ps-lite/src/./rdma_van.h:178: Connecting to Node 1, My_Node=32767
[21:16:34] worker /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7f2c10001030 , cq=0x7f2c10001100, qp=924, maxInline=96
[21:16:34] worker /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7f2c10001030 , cq=0x7f2c10001100, qp=925, maxInline=96
[21:16:34] server /app/ps-lite/src/van.cc:556: automatic detect interface and ip from gpu: eth1 (10.53.6.96)
[21:16:34] server /app/ps-lite/src/./rdma_van.h:153: bind to DMLC_NODE_HOST: 10.53.6.96
[21:16:34] server /app/ps-lite/src/van.cc:600: Bind to [role=server, ip=10.53.6.96, port=47583, is_recovery=0, aux_id=-1, num_ports=1]
[21:16:34] server /app/ps-lite/src/./rdma_van.h:178: Connecting to Node 1, My_Node=32767
[21:16:34] server /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7fedb8001030 , cq=0x7fedb8001100, qp=926, maxInline=96
[21:16:34] server /app/ps-lite/src/././rdma_transport.h:194: qp created: pd=0x7fedb8001030 , cq=0x7fedb8001100, qp=927, maxInline=96
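
For reference, the assertion that fails above checks that the server's pull result equals the element-wise sum of the worker's pushed tensors. A minimal sketch with the values from the log (shapes and names are illustrative, not the exact test code in tests/fserver/test_fserver.py):

import torch

# Worker pushes three (1, 8) tensors filled with 1.0, 2.0 and 3.0, as printed above.
push_tensors = [torch.full((1, 8), float(v), device='cuda:0') for v in (1, 2, 3)]

# The server should return their element-wise sum (a row of 6.0); the log shows
# an all-zero pull result instead, so the assertion fires.
pull_tensors = [torch.zeros(1, 8, device='cuda:0')]

assert torch.allclose(sum(push_tensors), pull_tensors[0])
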
Thanks for the test.
I cannot reproduce the issue; I am not sure whether some different environment settings cause it.
(base) root@iZ2zefjxapte7lcerub0tiZ:/mnt/shangguan/StepMesh# ROLE=joint RNIC=eth0 bash tests/fserver/run_single_gpu.sh
In your environment, erdma_0 is attached to eth1 (10.53.6.96) and erdma_1 is attached to eth2 (10.53.6.95), right?
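
If it helps, one way to confirm that mapping is to read it from sysfs. This is only a sketch, assuming the standard /sys/class/infiniband/<ibdev>/device/net layout exposed by RDMA drivers; it is not part of this PR:

import os

def rdma_to_netdev():
    # Map each RDMA device (e.g. erdma_0) to the netdev(s) it is attached to.
    mapping = {}
    ib_root = '/sys/class/infiniband'
    if not os.path.isdir(ib_root):
        return mapping
    for ibdev in sorted(os.listdir(ib_root)):
        net_dir = os.path.join(ib_root, ibdev, 'device', 'net')
        if os.path.isdir(net_dir):
            mapping[ibdev] = sorted(os.listdir(net_dir))
    return mapping

for ibdev, netdevs in rdma_to_netdev().items():
    print(ibdev, '->', ', '.join(netdevs))  # e.g. erdma_0 -> eth1
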
Hi @niehao100,
I still cannot reproduce it.
I have created a new PR #34 for RDMA device auto-detection when STEPMESH_GPU is not 0, but that is not an eRDMA-specific issue.
We have confirmed that it is a bug in the eRDMA driver. We have already fixed it, but we need more time to release a new eRDMA driver.
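
For illustration only, not the actual code in PR #34: a common heuristic for GPU-aware RDMA device auto-detection is to pick the device whose PCI path is topologically closest to the given GPU. A hedged sketch; the GPU PCI address in the example is hypothetical:

import os

def closest_rdma_dev(gpu_pci_addr):
    # Pick the RDMA device sharing the longest PCI path prefix with the GPU.
    gpu_path = os.path.realpath(f'/sys/bus/pci/devices/{gpu_pci_addr}')
    best, best_len = None, -1
    for ibdev in os.listdir('/sys/class/infiniband'):
        dev_path = os.path.realpath(f'/sys/class/infiniband/{ibdev}/device')
        common = len(os.path.commonprefix([gpu_path, dev_path]))
        if common > best_len:
            best, best_len = ibdev, common
    return best

# Example with an illustrative GPU PCI address:
# print(closest_rdma_dev('0000:3a:00.0'))
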
Hi @GuangguanWang

bare metal

Hi @niehao100

Hi @GuangguanWang, we test on Ubuntu in a self-hosted pod with the eRDMA controller, using 8 H20 GPUs.

Hi @niehao100, which version of mlnx_ofed did you use? The mlnx_ofed version can be shown with the command "dpkg -l | grep ofed" on Ubuntu.

Hi, I checked our container images; it seems no OFED is installed in the pod.

mlnx_ofed is installed on the host, not in the pod (run "dpkg -l | grep ofed" on the host). It was auto-installed at the first start of the ECS GPU instance. I have reproduced the issue with mlnx_ofed 5.4-3.5.8 and found it is a bug when using GDR in that version of mlnx_ofed.
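
As a side note (not part of this thread's diagnosis): before blaming the OFED stack, it can be worth checking that a GPUDirect RDMA kernel module is loaded at all, i.e. nvidia_peermem on recent drivers or the legacy nv_peer_mem shipped with older MLNX_OFED. A small check:

# Report whether a GPUDirect RDMA kernel module shows up in /proc/modules.
with open('/proc/modules') as f:
    loaded = {line.split()[0] for line in f}

for mod in ('nvidia_peermem', 'nv_peer_mem'):
    print(mod, 'loaded' if mod in loaded else 'not loaded')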
    
          
Hi @niehao100, the new eRDMA driver including this fix has been released; you can get the new eRDMA driver by

Hi @GuangguanWang, thanks for the comment, we will try it.