Conversation

@Akshay-Venkatesh (Contributor) commented on Feb 3, 2022

What

The UCT perf_estimate API takes two system devices (local and remote) and returns bandwidth/latency estimates. For cuda_ipc these are the local GPU's sys_dev and the remote GPU's sys_dev. However, the remote GPU's sys_dev may not map to the same bus_id on the local process, because sys_devices can be populated in a different order on each process. We therefore need a way to interpret a remote sys_dev, translate it to the corresponding local sys_dev, and use that in the cuda_ipc perf estimate to check whether the devices are connected by NVLink. This PR creates a local sys_device_id->bus_id map and exchanges it with peers.

Why?

Without this, a remote sys_device_id cannot be interpreted meaningfully to determine whether two CUDA devices are NVLink/cuda_ipc reachable. Reachability (indicated by non-zero bandwidth) is required to decide whether device bounce buffers can be used for pipeline protocols.
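To make the translation concrete, here is a minimal sketch assuming a hypothetical fixed-size map struct and helper (none of these names are taken from the PR): the local process receives the peer's sys_dev->bus_id table and searches its own table for a matching bus_id to recover the local sys_dev of the same physical GPU.

#include <stdint.h>

#define MAX_SYS_DEVS 16 /* illustrative cap, not a real UCX constant */

/* hypothetical exchange payload: one bus id per sys_dev index */
typedef struct {
    unsigned count;
    uint16_t bus_id[MAX_SYS_DEVS];
} sys_dev_bus_id_map_t;

/* translate a peer's sys_dev to the local sys_dev of the same physical GPU */
static int translate_remote_sys_dev(const sys_dev_bus_id_map_t *local_map,
                                    const sys_dev_bus_id_map_t *remote_map,
                                    unsigned remote_sys_dev)
{
    unsigned i;

    if (remote_sys_dev >= remote_map->count) {
        return -1; /* peer sys_dev out of range */
    }

    for (i = 0; i < local_map->count; i++) {
        if (local_map->bus_id[i] == remote_map->bus_id[remote_sys_dev]) {
            return (int)i; /* local sys_dev with the same bus id */
        }
    }

    return -1; /* GPU not visible to this process */
}

The translated local sys_dev can then be fed into the perf estimate path, where NVLink connectivity between the two local sys_devs determines the reported bandwidth.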

Comment on lines 120 to 124
} else if (khret == UCS_KH_PUT_KEY_PRESENT) {
/* do nothing */
} else {
ucs_error("unable to use cuda_ipc remote_iface_addr hash");
}
Contributor

minor: I'd just do else if (khret != UCS_KH_PUT_KEY_PRESENT) instead

Contributor Author

@brminich addressed this
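
For reference, the collapsed form of the branch shown above, per the suggestion (a sketch of the change, not the exact final diff):

} else if (khret != UCS_KH_PUT_KEY_PRESENT) {
    ucs_error("unable to use cuda_ipc remote_iface_addr hash");
}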

uct_base_iface_query(&iface->super, iface_attr);

-iface_attr->iface_addr_len = sizeof(pid_t);
+iface_attr->iface_addr_len = sizeof(uct_cuda_base_sys_dev_map_t);
Contributor

Can we define the length according to the real number of GPUs (to avoid an unnecessary increase in address length)?

Contributor Author (Feb 8, 2022)

@brminich Kept it this way to handle the case where different processes on the same node expose different numbers of devices through CUDA_VISIBLE_DEVICES. For example, if process 0 uses CUDA_VISIBLE_DEVICES=0 and process 1 uses CUDA_VISIBLE_DEVICES=1,2, then the cuda_ipc iface instances of the two processes would have different iface_addr_len. I wasn't sure whether iface address lengths have to be consistent across processes during iface_addr exchange.
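
To make the concern concrete, a sketch of the variable-length alternative (num_visible_devices and the layout are assumptions, not code from the PR): sizing the address by the actual device count would make iface_addr_len differ between processes that set CUDA_VISIBLE_DEVICES differently.

/* hypothetical variable-length layout: a device-count byte followed by one
 * bus id per visible device; CUDA_VISIBLE_DEVICES=0 vs CUDA_VISIBLE_DEVICES=1,2
 * would yield different lengths on the two processes */
iface_attr->iface_addr_len = sizeof(uint8_t) +
                             num_visible_devices * sizeof(uint16_t);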

Contributor

I think it is OK to have different address lengths, but it will not work with unified mode (which is disabled by default).
@yosefe, wdyt?

cuda_device);
status = ucs_topo_sys_device_set_name(sys_dev, device_name);
ucs_assert_always(status == UCS_OK);
ucs_spin_lock(&uct_cuda_base_lock);
Contributor

is it needed to protect uct_cuda_sys_dev_bus_id_map? If yes, when can it be accessed concurrently?

Contributor Author

Since this logic runs as part of md_query and accesses the global structure uct_cuda_sys_dev_bus_id_map, I used a spin lock around the access. Is this unnecessary? Is it guaranteed that query_md_resources will be called by only one thread at any given time?
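
For context, roughly the pattern under discussion (a sketch; the map's element type and indexing are assumptions, only uct_cuda_base_lock and uct_cuda_sys_dev_bus_id_map appear in the diff):

ucs_spin_lock(&uct_cuda_base_lock);
/* record this device's bus id under its local sys_dev index */
uct_cuda_sys_dev_bus_id_map[sys_dev] = bus_id;
ucs_spin_unlock(&uct_cuda_base_lock);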

Contributor

Currently we support multi-threading for UCP only, and all UCP API calls (including progress) are protected by a global lock.
Imo, there is no need for a lock here then.


 return ((uct_cuda_ipc_iface_node_guid(&iface->super) ==
-         *((const uint64_t *)dev_addr)) && ((getpid() != *(pid_t *)iface_addr)));
+         *((const uint64_t *)dev_addr)) &&
Contributor

will extra checks be added later?

Contributor Author

No further checks here.

Contributor

What is the reason to update the hash in this function? How is it going to be used?

Contributor Author

@brminich updated the PR description as well. Lmk if it makes sense.

The UCT perf_estimate API takes two system devices (local and remote) and returns bandwidth/latency estimates. For cuda_ipc these are the local GPU's sys_dev and the remote GPU's sys_dev. However, the remote GPU's sys_dev may not map to the same bus_id on the local process, because sys_devices can be populated in a different order on each process. We therefore need a way to interpret a remote sys_dev, translate it to the corresponding local sys_dev, and use that in the cuda_ipc perf estimate to check whether the devices are connected by NVLink. The iface address exchange phase seemed like the most convenient place to add this logic.
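
A rough sketch of that flow (the helper and the key choice are hypothetical, and the map type reuses the illustrative sys_dev_bus_id_map_t from the earlier sketch; only the recursive lock and the purpose of the hash come from the diff): during the reachability check, the peer's sys_dev->bus_id map arrives with its iface address and is cached per remote process so that later perf-estimate calls can translate remote sys_devs.

/* hypothetical caching step inside the reachability path */
static void cache_remote_bus_id_map(uct_cuda_ipc_iface_t *iface, pid_t remote_pid,
                                    const sys_dev_bus_id_map_t *remote_map)
{
    ucs_recursive_spin_lock(&iface->rem_iface_addr_lock);
    /* store or overwrite the entry for this peer, e.g. in a khash keyed by pid */
    remote_map_hash_put(iface, remote_pid, remote_map); /* hypothetical helper */
    ucs_recursive_spin_unlock(&iface->rem_iface_addr_lock);
}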

Contributor

You mean the uct_iface_estimate_perf API, right? So we implicitly assume that iface_is_reachable for the corresponding iface (created on the needed device id) has to be called before this perf estimate routine? If yes, wouldn't it be better to cache the device id in the ep creation routine rather than during the reachability check?

Contributor Author

@brminich addressed this

} else {
ucs_error("unable to use cuda_ipc remote_iface_addr hash");
}
ucs_recursive_spin_unlock(&iface->rem_iface_addr_lock);
Contributor

when is concurrency possible?

Contributor Author

Similar to the other lock use, I wanted to ensure that the khash isn't written by more than one thread at any given point. Is it not possible for iface_is_reachable to be called simultaneously by two threads belonging to the same process (unless there is already a worker-level lock guarding this)?

@Akshay-Venkatesh force-pushed the topic/cuda-ipc-remote-iface-sys-dev-map branch from 8b83250 to cf17489 on February 14, 2022 at 18:53
@Akshay-Venkatesh (Contributor Author)

@brminich any more comments from your end? The one I've not addressed concerns the lock usage (comments 1 and 2 above). If you and @yosefe both agree that the locks won't be needed (even for standalone UCT use cases), then I'll remove the lock wrappers around the khash and md cases.
