Use CUDA buffer IDs to validate rcache entries #5910

SeyedMir · 2020-11-17T00:20:09Z

What

This PR associates a memory attribute with each rcache entry. The attribute lets rcache use CUDA buffer IDs to validate entries that correspond to CUDA memory allocations.

Why?

If the UCM hooks fail to install, CUDA memory release functions are not intercepted, so the corresponding rcache entries are never invalidated. If a later CUDA allocation reuses the VA range of such a stale entry, the lookup produces an invalid cache hit. This can be avoided by using CUDA buffer IDs as an external mechanism to validate rcache results.
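To make the failure mode concrete, here is a minimal sketch in C. It is not UCX code: `toy_region_t`, `toy_lookup`, and `toy_lookup_validated` are hypothetical names, and the buffer ID is passed in as a plain integer instead of being queried from CUDA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a single rcache region. */
typedef struct {
    uintptr_t start;     /* first byte of the cached VA range */
    uintptr_t end;       /* one past the last byte of the range */
    uint64_t  buffer_id; /* ID recorded when the region was registered */
    int       valid;     /* 0 if the slot is empty */
} toy_region_t;

/* Address-only lookup: hits whenever [addr, addr + len) lies in the range,
 * even if the memory behind that range was freed and reallocated. */
static int toy_lookup(const toy_region_t *r, uintptr_t addr, size_t len)
{
    assert(r != NULL);
    return r->valid && (addr >= r->start) && (addr <= r->end) &&
           (len <= (size_t)(r->end - addr));
}

/* Lookup with external validation: the hit only counts if the buffer ID of
 * the queried address still matches the ID stored in the region. */
static int toy_lookup_validated(const toy_region_t *r, uintptr_t addr,
                                size_t len, uint64_t current_id)
{
    return toy_lookup(r, addr, len) && (r->buffer_id == current_id);
}
```

With a region registered under ID 7, a query for the same VA range under ID 8 (a new allocation that reused the addresses) still hits in `toy_lookup` but is rejected by `toy_lookup_validated`.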

How?

Three new callbacks are added to rcache ops. They are used when the rcache memory hooks fail to install:

mem_reg_ext_validate: called when a region is registered
mem_dereg_ext_validate: called when a region is deregistered
mem_ext_validate: called to validate a cache hit

The IB and gdrcopy memory domains use UCT memory attributes (#6306) to implement these callbacks:

mem_reg_ext_validate queries the memory attribute and stores it in the rcache region.
mem_dereg_ext_validate destroys the memory attribute.
mem_ext_validate compares the memory attribute of the queried address with the memory attribute of the region returned by rcache.
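The callback flow can be sketched roughly as follows. Only the three callback names come from this PR; the signatures, the `sketch_*` types, and the query function are assumptions made for illustration and do not reflect the actual UCX declarations. `sketch_fake_query` stands in for whatever returns the buffer ID of an address (on CUDA this would presumably be a pointer-attribute query such as `CU_POINTER_ATTRIBUTE_BUFFER_ID`).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical memory attribute: only the buffer ID is modeled here. */
typedef struct {
    uint64_t buffer_id;
} sketch_mem_attr_t;

/* Hypothetical rcache region carrying the attribute. */
typedef struct {
    uintptr_t          start;
    uintptr_t          end;
    sketch_mem_attr_t *attr;
} sketch_region_t;

/* Assumed helper type: returns the current buffer ID of an address. */
typedef uint64_t (*sketch_query_id_fn)(uintptr_t addr);

/* Callback names are from the PR; the signatures are assumptions. */
typedef struct {
    int  (*mem_reg_ext_validate)(sketch_region_t *region,
                                 sketch_query_id_fn query);
    void (*mem_dereg_ext_validate)(sketch_region_t *region);
    int  (*mem_ext_validate)(const sketch_region_t *region, uintptr_t addr,
                             sketch_query_id_fn query);
} sketch_rcache_ops_t;

/* Registration: query the attribute and attach it to the region. */
static int sketch_reg(sketch_region_t *region, sketch_query_id_fn query)
{
    assert((region != NULL) && (query != NULL));
    region->attr = malloc(sizeof(*region->attr));
    if (region->attr == NULL) {
        return -1;
    }
    region->attr->buffer_id = query(region->start);
    return 0;
}

/* Deregistration: destroy the attribute. */
static void sketch_dereg(sketch_region_t *region)
{
    free(region->attr);
    region->attr = NULL;
}

/* Cache hit: compare the queried address's ID with the stored one. */
static int sketch_validate(const sketch_region_t *region, uintptr_t addr,
                           sketch_query_id_fn query)
{
    return (region->attr != NULL) &&
           (region->attr->buffer_id == query(addr));
}

static const sketch_rcache_ops_t sketch_ops = {
    .mem_reg_ext_validate   = sketch_reg,
    .mem_dereg_ext_validate = sketch_dereg,
    .mem_ext_validate       = sketch_validate
};

/* Fake ID source so the sketch runs without CUDA. */
static uint64_t sketch_fake_id = 1;

static uint64_t sketch_fake_query(uintptr_t addr)
{
    (void)addr;
    return sketch_fake_id;
}
```

Changing `sketch_fake_id` between registration and lookup models a free followed by a new allocation at the same address: `mem_ext_validate` then reports a mismatch, and the caller can treat the hit as invalid.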

swx-jenkins3 · 2020-11-17T00:24:28Z

Can one of the admins verify this patch?

shamisp · 2020-11-17T01:36:58Z

Hi @SeyedMir. Thanks for the patch. What is your affiliation? Thanks!

yosefe · 2020-11-17T08:17:57Z

> Hi @SeyedMir. Thanks for the patch. What is your affiliation? Thanks!

NVIDIA

yosefe · 2020-11-17T08:18:06Z

ok to test

shamisp · 2020-11-17T16:08:07Z

@yosefe - can we ask NVIDIA folks to update their GitHub profiles?

yosefe · 2020-11-17T16:09:55Z

> @yosefe - can we ask NVIDIA folks to update their GitHub profiles?

shamisp · 2020-11-17T16:33:18Z

@yosefe - it was not there yesterday, I checked.

SeyedMir · 2020-11-17T17:01:00Z

That's right. I updated it yesterday after Pavel's comment.

SeyedMir · 2021-01-19T15:43:52Z

@bureddy @yosefe can you please review when possible? Thanks

bureddy · 2021-01-20T19:23:46Z

src/ucm/cuda/cudamem.c

-            ucm_warn("failed to install cuda memory hooks on runtime API")
-        }
+        goto out_unlock;
+    }


This logic seems incorrect:

(1) If bistro fails for the driver functions, hooking only the runtime functions with bistro is not sufficient, because the application might be using the driver API.
(2) With "reloc", both the driver and the runtime functions need to be installed successfully before setting ucm_cudamem_installed = 1.
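A rough sketch of the condition described in (1) and (2), using hypothetical names (`sketch_hook_mode_t`, `sketch_cudamem_installed`) in place of the real UCM code. The comment does not say whether driver hooks alone would be enough in bistro mode, so this sketch assumes the conservative rule that both sets must succeed in every mode.

```c
#include <assert.h>

/* Hypothetical hook modes for illustration only. */
typedef enum {
    SKETCH_HOOK_NONE,
    SKETCH_HOOK_RELOC,
    SKETCH_HOOK_BISTRO
} sketch_hook_mode_t;

/* Returns 1 only if CUDA memory hooks can be considered fully installed.
 * Assumption: driver and runtime hooks must both succeed in any mode. */
static int sketch_cudamem_installed(sketch_hook_mode_t mode, int driver_ok,
                                    int runtime_ok)
{
    assert((mode == SKETCH_HOOK_NONE) || (mode == SKETCH_HOOK_RELOC) ||
           (mode == SKETCH_HOOK_BISTRO));

    if (mode == SKETCH_HOOK_NONE) {
        return 0; /* nothing was hooked at all */
    }

    /* Runtime-only hooks miss applications that call the driver API. */
    return driver_ok && runtime_ok;
}
```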

bureddy · 2021-01-20T19:24:52Z

src/ucm/cuda/cudamem.c

    ucm_cuda_func_t *func;
    ucs_status_t status;
    void *func_ptr;



if (hook_mode == UCM_MMAP_HOOK_NONE) { return; }

ucm_cudamem_install() already does that check. Do we need it again here?

Missed it. Then we are good.

SeyedMir · 2021-03-11T22:40:20Z

@yosefe @Akshay-Venkatesh @bureddy I updated this PR to use UCT memory attributes (from #6306) to validate rcache results when CUDA UCM hooks fail.
Please let me know your comments.

SeyedMir force-pushed the rcache-validation branch from 58745df to a436683 November 17, 2020 00:24

SeyedMir force-pushed the rcache-validation branch 3 times, most recently from 9116803 to cc2ae0d November 17, 2020 00:50

SeyedMir changed the title ~~Use CUDA buffer IDs to validate rcache entries~~ Use CUDA buffer IDs to validate rcache entries Nov 20, 2020

SeyedMir force-pushed the rcache-validation branch 13 times, most recently from f04fa1f to 91c70d1 November 26, 2020 18:09

SeyedMir force-pushed the rcache-validation branch 4 times, most recently from 4813fd2 to cd8efd5 December 3, 2020 16:50

SeyedMir force-pushed the rcache-validation branch 4 times, most recently from c693345 to 6be9648 January 12, 2021 14:39

SeyedMir force-pushed the rcache-validation branch 6 times, most recently from 9c514af to 3a28c2c January 15, 2021 19:15

bureddy reviewed Jan 20, 2021

SeyedMir force-pushed the rcache-validation branch 2 times, most recently from a9739a4 to f4e23ea January 21, 2021 16:41

SeyedMir force-pushed the rcache-validation branch from f4e23ea to 1108af5 February 4, 2021 16:37

SeyedMir force-pushed the rcache-validation branch 2 times, most recently from 10c033c to 521c3bd March 11, 2021 21:00

SeyedMir force-pushed the rcache-validation branch from 521c3bd to e233608 March 12, 2021 17:35

SeyedMir added 4 commits March 12, 2021 11:04

UCT/MD: Add a new UCT memory query API

2f676ac

Unlike the existing uct_md_mem_query(), the new API is not specific to a single MD. Therefore, it does not need a handle to an opened MD.

UCM/CUDA: Update UCM CUDA hooks

a747a67

UCS/RCACHE: Add rcache external validation config

ccce840

UCS/RCACHE: Add external validation callbacks to rcache

9bcc72e

SeyedMir force-pushed the rcache-validation branch from e233608 to 333f205 March 12, 2021 19:22

SeyedMir requested a review from Akshay-Venkatesh March 12, 2021 19:26

SeyedMir force-pushed the rcache-validation branch 2 times, most recently from 9e03aca to c10db76 March 13, 2021 01:20

SeyedMir added 2 commits March 15, 2021 06:58

TEST/RCACHE: Update rcache tests

05bbccb

UCT/MD: Add external rcache validation to memory domains

67d5299

SeyedMir force-pushed the rcache-validation branch from c10db76 to 67d5299 March 15, 2021 13:59

5 participants
