When I run the default command, it seems to use 29500 as the master_port.
However, the master_port seems unchangeable: it stays at 29500 even when I pass "--master_port 29501" or set it with "deepspeed.init_distributed(dist_backend='nccl', distributed_port=config.master_port)".
Error message:
[W1120 21:36:50.764587163 TCPStore.cpp:358] [c10d] TCP client failed to connect/validate to host 127.0.0.1:29500 - retrying (try=3, timeout=1800000ms, delay=1496ms): Connection reset by peer
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:667 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc06bba0446 in /data/wujiahao/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/lib/libc10.so)
...
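
For anyone trying to reproduce this, here is a minimal sketch for checking whether 127.0.0.1:29500 is already taken and, if so, picking a free port before the process group is initialized. The helper names `port_is_free`, `pick_free_port`, and `resolve_master_port` are hypothetical and not part of DeepSpeed or PyTorch. It assumes the rendezvous reads the `MASTER_PORT` environment variable, as `torch.distributed`'s env-based initialization does. Whether the DeepSpeed launcher overwrites that variable in this setup is not confirmed here.

```python
import os
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if this process can bind host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another process already holds the port.
            return False
    return True


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


def resolve_master_port(preferred: int = 29500) -> int:
    """Use the preferred port if it is free, otherwise fall back to a free one.

    Assumption: the distributed backend reads MASTER_PORT from the environment.
    """
    port = preferred if port_is_free(preferred) else pick_free_port()
    os.environ["MASTER_PORT"] = str(port)
    return port


if __name__ == "__main__":
    port = resolve_master_port(29500)
    print(f"MASTER_PORT={port}")
```

The returned value could then be passed as `distributed_port` to `deepspeed.init_distributed`, the call quoted above. There is a small race between checking the port and the backend binding it, so this is a diagnostic aid rather than a guaranteed fix.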