Conversation

@Akshay-Venkatesh (Contributor) commented on Feb 3, 2022

What

The UCT perf_estimate API takes two system devices (local and remote) and returns bandwidth/latency estimates. For cuda_ipc these are the local GPU's sys_dev and the remote GPU's sys_dev. However, the remote GPU's sys_dev may not map to the same bus_id on the local process, because sys_devices can be populated in a different order on each process. We therefore need a way to interpret a remote sys_dev, translate it to the corresponding local sys_dev, and use that in the cuda_ipc perf estimate to check whether the devices are connected by NVLink. This PR creates a local sys_device_id->bus_id map and exchanges it with peers.

Why?

Without this, a remote sys_device_id cannot be interpreted meaningfully to determine whether two CUDA devices are NVLink/cuda_ipc reachable. Reachability (indicated by non-zero bandwidth) is required to decide whether device bounce buffers can be used for pipeline protocols.
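To make the translation concrete, here is a minimal sketch assuming a hypothetical fixed-size map struct and helper (none of these names are taken from the PR): the local process receives the peer's sys_dev->bus_id table and searches its own table for a matching bus_id to recover the local sys_dev of the same physical GPU.

#include <stdint.h>

#define MAX_SYS_DEVS 16 /* illustrative cap, not a real UCX constant */

/* hypothetical exchange payload: one bus id per sys_dev index */
typedef struct {
    unsigned count;
    uint16_t bus_id[MAX_SYS_DEVS];
} sys_dev_bus_id_map_t;

/* translate a peer's sys_dev to the local sys_dev of the same physical GPU */
static int translate_remote_sys_dev(const sys_dev_bus_id_map_t *local_map,
                                    const sys_dev_bus_id_map_t *remote_map,
                                    unsigned remote_sys_dev)
{
    unsigned i;

    if (remote_sys_dev >= remote_map->count) {
        return -1; /* peer sys_dev out of range */
    }

    for (i = 0; i < local_map->count; i++) {
        if (local_map->bus_id[i] == remote_map->bus_id[remote_sys_dev]) {
            return (int)i; /* local sys_dev with the same bus id */
        }
    }

    return -1; /* GPU not visible to this process */
}

The translated local sys_dev can then be fed into the perf estimate path, where NVLink connectivity between the two local sys_devs determines the reported bandwidth.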

Comment on lines 120 to 124
} else if (khret == UCS_KH_PUT_KEY_PRESENT) {
/* do nothing */
} else {
ucs_error("unable to use cuda_ipc remote_iface_addr hash");
}
Contributor

minor: I'd just do else if (khret != UCS_KH_PUT_KEY_PRESENT) instead

Contributor Author

@brminich addressed this
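
For reference, the collapsed form of the branch shown above, per the suggestion (a sketch of the change, not the exact final diff):

} else if (khret != UCS_KH_PUT_KEY_PRESENT) {
    ucs_error("unable to use cuda_ipc remote_iface_addr hash");
}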

uct_base_iface_query(&iface->super, iface_attr);

-iface_attr->iface_addr_len = sizeof(pid_t);
+iface_attr->iface_addr_len = sizeof(uct_cuda_base_sys_dev_map_t);
Contributor

Can we define the length according to the real number of GPUs (to avoid an unnecessary increase in address length)?

Contributor Author (Feb 8, 2022)

@brminich Kept it this way to handle the case where different processes on the same node expose different numbers of devices through CUDA_VISIBLE_DEVICES. For example, if process 0 uses CUDA_VISIBLE_DEVICES=0 and process 1 uses CUDA_VISIBLE_DEVICES=1,2, then the cuda_ipc iface instances of the two processes would have different iface_addr_len. I wasn't sure whether iface address lengths have to be consistent across processes during iface_addr exchange.
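
To make the concern concrete, a sketch of the variable-length alternative (num_visible_devices and the layout are assumptions, not code from the PR): sizing the address by the actual device count would make iface_addr_len differ between processes that set CUDA_VISIBLE_DEVICES differently.

/* hypothetical variable-length layout: a device-count byte followed by one
 * bus id per visible device; CUDA_VISIBLE_DEVICES=0 vs CUDA_VISIBLE_DEVICES=1,2
 * would yield different lengths on the two processes */
iface_attr->iface_addr_len = sizeof(uint8_t) +
                             num_visible_devices * sizeof(uint16_t);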

Contributor

I think it is OK to have different address lengths, but it will not work with unified mode (which is disabled by default).
@yosefe, wdyt?

cuda_device);
status = ucs_topo_sys_device_set_name(sys_dev, device_name);
ucs_assert_always(status == UCS_OK);
ucs_spin_lock(&uct_cuda_base_lock);
Contributor

is it needed to protect uct_cuda_sys_dev_bus_id_map? If yes, when can it be accessed concurrently?

Contributor Author

Since this logic runs as part of md_query and accesses the global structure uct_cuda_sys_dev_bus_id_map, I used a spin lock around the access. Is this unnecessary? Is it guaranteed that query_md_resources will be called by only one thread at any given time?
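
For context, roughly the pattern under discussion (a sketch; the map's element type and indexing are assumptions, only uct_cuda_base_lock and uct_cuda_sys_dev_bus_id_map appear in the diff):

ucs_spin_lock(&uct_cuda_base_lock);
/* record this device's bus id under its local sys_dev index */
uct_cuda_sys_dev_bus_id_map[sys_dev] = bus_id;
ucs_spin_unlock(&uct_cuda_base_lock);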

Contributor

Currently we support multi-threading for UCP only, and all UCP API calls (including progress) are protected by a global lock.
Imo, there is no need for a lock here then.


 return ((uct_cuda_ipc_iface_node_guid(&iface->super) ==
-         *((const uint64_t *)dev_addr)) && ((getpid() != *(pid_t *)iface_addr)));
+         *((const uint64_t *)dev_addr)) &&
Contributor

will extra checks be added later?

Contributor Author

No further checks here.

Contributor

What is the reason to update the hash in this function? How is it going to be used?

Contributor Author

@brminich updated the PR description as well. Lmk if it makes sense.

The UCT perf_estimate API takes two system devices (local and remote) and returns bandwidth/latency estimates. For cuda_ipc these are the local GPU's sys_dev and the remote GPU's sys_dev. However, the remote GPU's sys_dev may not map to the same bus_id on the local process, because sys_devices can be populated in a different order on each process. We therefore need a way to interpret a remote sys_dev, translate it to the corresponding local sys_dev, and use that in the cuda_ipc perf estimate to check whether the devices are connected by NVLink. The iface address exchange phase seemed like the most convenient place to add this logic.
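
A rough sketch of that flow (the helper and the key choice are hypothetical, and the map type reuses the illustrative sys_dev_bus_id_map_t from the earlier sketch; only the recursive lock and the purpose of the hash come from the diff): during the reachability check, the peer's sys_dev->bus_id map arrives with its iface address and is cached per remote process so that later perf-estimate calls can translate remote sys_devs.

/* hypothetical caching step inside the reachability path */
static void cache_remote_bus_id_map(uct_cuda_ipc_iface_t *iface, pid_t remote_pid,
                                    const sys_dev_bus_id_map_t *remote_map)
{
    ucs_recursive_spin_lock(&iface->rem_iface_addr_lock);
    /* store or overwrite the entry for this peer, e.g. in a khash keyed by pid */
    remote_map_hash_put(iface, remote_pid, remote_map); /* hypothetical helper */
    ucs_recursive_spin_unlock(&iface->rem_iface_addr_lock);
}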

Contributor

You mean the uct_iface_estimate_perf API, right? So we implicitly assume that iface_is_reachable for the corresponding iface (created on the needed device id) has to be called before this perf estimate routine? If yes, wouldn't it be better to cache the device id in the ep creation routine rather than during the reachability check?

Contributor Author

@brminich addressed this

} else {
ucs_error("unable to use cuda_ipc remote_iface_addr hash");
}
ucs_recursive_spin_unlock(&iface->rem_iface_addr_lock);
Contributor

when is concurrency possible?

Contributor Author

Similar to the other lock use, I wanted to ensure that the khash isn't written by more than one thread at any given point. Is it not possible for iface_is_reachable to be called simultaneously by two threads belonging to the same process (unless there is already a worker-level lock guarding this)?

@Akshay-Venkatesh force-pushed the topic/cuda-ipc-remote-iface-sys-dev-map branch from 8b83250 to cf17489 on February 14, 2022 at 18:53
@Akshay-Venkatesh (Contributor Author)

@brminich any more comments from your end? The one I've not addressed concerns the lock usage (comments 1 and 2 above). If you and @yosefe both agree that the locks won't be needed (even for standalone UCT use cases), then I'll remove the lock wrappers around the khash and md cases.
