UCT/CUDA_IPC: add remote iface addr sys_dev map #7896
base: master
Conversation
```c
} else if (khret == UCS_KH_PUT_KEY_PRESENT) {
    /* do nothing */
} else {
    ucs_error("unable to use cuda_ipc remote_iface_addr hash");
}
```
minor: I'd just do `else if (khret != UCS_KH_PUT_KEY_PRESENT)` instead
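A minimal sketch of that suggestion, assuming the standard UCS `kh_put` return codes; the hash, key, and value names here are hypothetical:

```c
/* Sketch of the suggested simplification: the "key already present" case
 * needs no branch of its own, so only the failure path gets an else-if. */
iter = kh_put(cuda_ipc_rem_iface_addr, hash, key, &khret);
if ((khret == UCS_KH_PUT_BUCKET_EMPTY) || (khret == UCS_KH_PUT_BUCKET_CLEAR)) {
    kh_value(hash, iter) = rem_sys_dev_map;   /* new entry: store the mapping */
} else if (khret != UCS_KH_PUT_KEY_PRESENT) {
    ucs_error("unable to use cuda_ipc remote_iface_addr hash");
}
```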
@brminich addressed this
```diff
     uct_base_iface_query(&iface->super, iface_attr);

-    iface_attr->iface_addr_len = sizeof(pid_t);
+    iface_attr->iface_addr_len = sizeof(uct_cuda_base_sys_dev_map_t);
```
Can we define the length according to the real number of GPUs (to avoid an unnecessary address size increase)?
@brminich Kept it this way to handle the case where different processes within the same node use a different number of devices in CUDA_VISIBLE_DEVICES. For example, if process 0 uses CUDA_VISIBLE_DEVICES=0 and process 1 uses CUDA_VISIBLE_DEVICES=1,2, then the cuda_ipc iface instances of the two processes will have different iface_addr_len. I wasn't sure whether iface address lengths have to be consistent across processes during the iface_addr exchange.
I think it is OK to have different address lengths, but it will not work with unified mode (which is disabled by default).
@yosefe, wdyt?
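For reference, a hedged sketch of the alternative being discussed, i.e. sizing the address by the number of visible devices. The struct layout and names are hypothetical illustrations, not what this PR implements (the PR keeps the fixed-size `uct_cuda_base_sys_dev_map_t`):

```c
/* Hypothetical variable-length iface address: one bus-id entry per device
 * actually visible to this process, so iface_addr_len would differ across
 * processes with different CUDA_VISIBLE_DEVICES settings. */
typedef struct {
    pid_t    pid;        /* sender pid, as in the previous fixed format */
    uint8_t  num_devs;   /* number of bus-id entries that follow */
    uint64_t bus_id[];   /* flexible array member, one entry per visible GPU */
} example_cuda_ipc_iface_addr_t;

/* The per-process address length would then be computed as: */
iface_attr->iface_addr_len = sizeof(example_cuda_ipc_iface_addr_t) +
                             (num_visible_devs * sizeof(uint64_t));
```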
```c
                                          cuda_device);
    status = ucs_topo_sys_device_set_name(sys_dev, device_name);
    ucs_assert_always(status == UCS_OK);
    ucs_spin_lock(&uct_cuda_base_lock);
```
Is it needed to protect uct_cuda_sys_dev_bus_id_map? If yes, when can it be accessed concurrently?
As this logic is called as part of the md_query functionality, and because we access the global structure uct_cuda_sys_dev_bus_id_map, I used a spin lock around the access. Is this unnecessary? Is it guaranteed that query_md_resources will be called by only one thread at any given time?
Currently we support multi-threading for UCP only, and all UCP API calls (including progress) are protected by a global lock.
Imo, there is no need for a lock here then.
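A minimal illustration of that point, with hypothetical names: if md resource query is already serialized by the caller, the map update is just a plain store and the spinlock shown in the diff above can be dropped.

```c
#define EXAMPLE_MAX_CUDA_DEVS 32

/* Process-global sys_dev -> bus_id map, filled during md resource query.
 * No spinlock is needed if all callers are serialized (e.g. by UCP's global
 * lock); a lock would only matter if md query could run concurrently from
 * several threads. */
static uint64_t example_sys_dev_bus_id_map[EXAMPLE_MAX_CUDA_DEVS];

static void example_record_bus_id(unsigned sys_dev, uint64_t bus_id)
{
    example_sys_dev_bus_id_map[sys_dev] = bus_id; /* plain store, no lock */
}
```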
```diff
     return ((uct_cuda_ipc_iface_node_guid(&iface->super) ==
-             *((const uint64_t *)dev_addr)) && ((getpid() != *(pid_t *)iface_addr)));
+             *((const uint64_t *)dev_addr)) &&
```
Will extra checks be added later?
No further checks here.
What is the reason to update the hash in this function? How is it going to be used?
@brminich updated the PR description as well. Lmk if it makes sense.
The UCT perf_estimate API takes two system devices (local and remote) to provide bandwidth/latency estimates; in the case of cuda_ipc, these are the local GPU's sys_dev and the remote GPU's sys_dev. But the remote GPU's sys_dev may not correspond to the same bus_id on the local process, because of the order in which sys_devices are populated. For this reason, we need a way to interpret the remote sys_dev, translate it to a local sys_dev, and use that in the cuda_ipc perf estimate to check whether there are NVLinks between the devices. The iface address exchange phase seemed like the most convenient place to add this logic.
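A rough sketch of the translation being described. The struct and helper names are hypothetical (the PR keeps the peer's map keyed by its iface address in a hash), and the include path is assumed from the UCS topo API:

```c
#include <stdint.h>
#include <ucs/sys/topo.h> /* ucs_sys_device_t, UCS_SYS_DEVICE_ID_UNKNOWN */

/* Hypothetical per-process sys_dev -> bus_id map, as exchanged in the
 * cuda_ipc iface address. */
typedef struct {
    unsigned count;
    uint64_t bus_id[16];
} example_sys_dev_map_t;

static example_sys_dev_map_t example_local_map; /* built from local topology */

/* Translate a peer's sys_dev into the local sys_dev that refers to the same
 * physical GPU, by matching bus ids between the peer's exchanged map and the
 * local one. */
static ucs_sys_device_t
example_remote_to_local_sys_dev(const example_sys_dev_map_t *remote_map,
                                ucs_sys_device_t remote_sys_dev)
{
    ucs_sys_device_t local_sys_dev;

    for (local_sys_dev = 0; local_sys_dev < example_local_map.count;
         ++local_sys_dev) {
        if (example_local_map.bus_id[local_sys_dev] ==
            remote_map->bus_id[remote_sys_dev]) {
            return local_sys_dev;
        }
    }
    return UCS_SYS_DEVICE_ID_UNKNOWN; /* peer GPU not visible locally */
}
```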
You mean the uct_iface_estimate_perf API, right? So we implicitly assume that iface_is_reachable for the corresponding iface (created on the needed device id) has to be called before this perf estimate routine? If so, wouldn't it be better to cache the device id in the ep creation routine rather than during the reachability check?
@brminich addressed this
```c
    } else {
        ucs_error("unable to use cuda_ipc remote_iface_addr hash");
    }
    ucs_recursive_spin_unlock(&iface->rem_iface_addr_lock);
```
When is concurrency possible here?
Similar to the other lock use, I wanted to ensure that the khash isn't written by more than one thread at any given point. Is it not possible that iface_is_reachable is called simultaneously by two threads belonging to the same process (or is there a worker-level lock already guarding this)?
Force-pushed from 8b83250 to cf17489
What
The UCT perf_estimate API takes two system devices (local and remote) to provide bandwidth/latency estimates; in the case of cuda_ipc, these are the local GPU's sys_dev and the remote GPU's sys_dev. But the remote GPU's sys_dev may not correspond to the same bus_id on the local process, because of the order in which sys_devices are populated. For this reason, we need a way to interpret the remote sys_dev, translate it to a local sys_dev, and use that in the cuda_ipc perf estimate to check whether there are NVLinks between the devices. This PR creates a local sys_device_id->bus_id map and exchanges it with peers.
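Conceptually, the exchanged address looks something like the sketch below; the field names and sizes are illustrative, not the exact uct_cuda_base_sys_dev_map_t layout:

```c
#include <sys/types.h>
#include <stdint.h>

#define EXAMPLE_MAX_CUDA_DEVS 32

/* Illustrative layout of the exchanged cuda_ipc iface address: besides the
 * sender's pid (used for the self-reachability check), it carries the
 * sender's sys_dev -> bus_id mapping so that the receiver can translate the
 * sender's sys_dev values into its own numbering. */
typedef struct {
    pid_t    pid;
    uint64_t bus_id[EXAMPLE_MAX_CUDA_DEVS]; /* indexed by the sender's sys_dev */
} example_cuda_ipc_iface_addr_t;
```

Once a remote sys_dev has been translated to a local one, the two local sys_devs can be compared (e.g. via the UCS topology distance API) to decide whether an NVLink path exists between the GPUs.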
Why?
Without this, the remote sys_device_id cannot be used meaningfully to determine whether two CUDA devices are NVLink/cuda_ipc reachable. Reachability (indicated by non-zero bandwidth) is a requirement for deciding whether device bounce buffers can be used for pipeline protocols.