Use CUDA buffer IDs to validate rcache entries #5910
base: master
Can one of the admins verify this patch?
Hi @SeyedMir. Thanks for the patch. What is your affiliation? Thanks!
NVIDIA
ok to test
@yosefe - can we ask Nvidia folks to update their GitHub profiles?
@yosefe - it was not there yesterday, I checked.
That's right. I updated it yesterday after Pavel's comment.
ucm_warn("failed to install cuda memory hooks on runtime API")
}
goto out_unlock;
}
It seems this logic is incorrect:
(1) If bistro fails for the driver functions, installing bistro hooks on the runtime functions alone is not sufficient, because the application might be using the driver API.
(2) With "reloc", both the driver and the runtime functions need to be installed successfully before setting ucm_cudamem_installed = 1.
ucm_cuda_func_t *func;
ucs_status_t status;
void *func_ptr;
if (hook_mode == UCM_MMAP_HOOK_NONE) { return }
ucm_cudamem_install() already does that check. Do we need it again here?
missed it. then we are good.
@yosefe @Akshay-Venkatesh @bureddy I updated this PR to use UCT memory attributes (from #6306) to validate rcache results when CUDA UCM hooks fail.
Unlike the existing uct_md_mem_query(), the new API is not specific to a single MD. Therefore, it does not need a handle to an opened MD.

What
This PR associates a memory attribute with each rcache entry. The attribute enables rcache to use CUDA buffer IDs to validate entries that correspond to CUDA memory allocations.
Why ?
If UCM hooks fail to install successfully, CUDA memory release functions are not intercepted. Therefore, the corresponding rcache entries are not invalidated. If a new CUDA allocation happens to use the same VA range as the one in rcache, it will lead to an invalid cache hit. This can be avoided by using CUDA buffer IDs as an external mechanism to validate rcache results.
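The failure mode can be illustrated with a small simulation that needs neither CUDA nor UCX. This is only a sketch under stated assumptions: every `sim_*` name is hypothetical, the cache holds a single entry, and the allocator simply hands out a fresh counter value per allocation to stand in for the buffer ID that the CUDA driver reports (`CU_POINTER_ATTRIBUTE_BUFFER_ID`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a CUDA allocation: a VA range plus a unique ID. */
typedef struct {
    uintptr_t          start;
    size_t             length;
    unsigned long long buffer_id;
} sim_alloc_t;

/* Hypothetical single-entry registration cache keyed on the VA range. */
typedef struct {
    int                valid;
    uintptr_t          start;
    size_t             length;
    unsigned long long buffer_id; /* ID recorded at registration time */
} sim_rcache_entry_t;

unsigned long long sim_next_id = 1;

/* Simulated allocator: each allocation gets a fresh ID, even when the
 * same address range is handed out again. */
sim_alloc_t sim_alloc(uintptr_t start, size_t length)
{
    sim_alloc_t a = { start, length, sim_next_id++ };
    return a;
}

/* Record the allocation's range and ID in the cache entry. */
void sim_register(sim_rcache_entry_t *e, const sim_alloc_t *a)
{
    assert(e != NULL && a != NULL);
    e->valid     = 1;
    e->start     = a->start;
    e->length    = a->length;
    e->buffer_id = a->buffer_id;
}

/* Address-only lookup: a stale entry still matches after the memory was
 * freed and reallocated, because no release hook invalidated it. */
int sim_lookup_by_va(const sim_rcache_entry_t *e, const sim_alloc_t *a)
{
    return e->valid && (a->start >= e->start) &&
           (a->start + a->length <= e->start + e->length);
}

/* Lookup with external validation: the hit counts only if the IDs match. */
int sim_lookup_validated(const sim_rcache_entry_t *e, const sim_alloc_t *a)
{
    return sim_lookup_by_va(e, a) && (e->buffer_id == a->buffer_id);
}
```

Registering an allocation at some address, then calling `sim_alloc()` again with the same address and length, models a free followed by a reallocation that was never intercepted. `sim_lookup_by_va()` still reports a hit for the second allocation, while `sim_lookup_validated()` rejects it because the recorded ID no longer matches.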
How ?
Three new callbacks are added to rcache ops, and are used when rcache memory hooks fail to install:
- mem_reg_ext_validate: used for registration callback
- mem_dereg_ext_validate: used for deregistration callback
- mem_ext_validate: used to validate a cache hit

IB and gdrcopy memory domains will use UCT memory attributes (#6306) to implement the above callbacks.

- mem_reg_ext_validate callback queries memory attribute and adds it to rcache region
- mem_dereg_ext_validate callback destroys the memory attribute
- mem_ext_validate callback compares the memory attribute of the queried address with the memory attribute of the region returned by rcache.
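How the three callbacks could fit together is sketched below in plain C. The callback names come from the description above, but everything else is an assumption made for illustration: the signatures, the `sketch_*` types, and the simulated attribute query (a global variable standing in for whatever the UCT memory attribute API returns) are not taken from the actual patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical memory attribute; only the buffer ID is modeled here. */
typedef struct {
    unsigned long long buffer_id;
} sketch_mem_attr_t;

/* Hypothetical rcache region carrying the attribute taken at registration. */
typedef struct {
    uintptr_t          start;
    size_t             length;
    sketch_mem_attr_t *attr;
} sketch_region_t;

/* Hypothetical ops table; the real signatures may differ. */
typedef struct {
    int  (*mem_reg_ext_validate)(sketch_region_t *region);
    void (*mem_dereg_ext_validate)(sketch_region_t *region);
    int  (*mem_ext_validate)(const sketch_region_t *region, uintptr_t address);
} sketch_rcache_ops_t;

/* Simulated driver state: ID of the allocation that currently owns the
 * address. Changing it models a free followed by a new allocation. */
unsigned long long sketch_live_buffer_id = 1;

/* Simulated attribute query; the address is ignored in this sketch. */
static sketch_mem_attr_t *sketch_query_attr(uintptr_t address)
{
    sketch_mem_attr_t *attr = malloc(sizeof(*attr));
    (void)address;
    if (attr != NULL) {
        attr->buffer_id = sketch_live_buffer_id;
    }
    return attr;
}

/* Registration: query the attribute and attach it to the region. */
static int sketch_reg(sketch_region_t *region)
{
    assert(region != NULL);
    region->attr = sketch_query_attr(region->start);
    return (region->attr != NULL) ? 0 : -1;
}

/* Deregistration: destroy the attribute stored in the region. */
static void sketch_dereg(sketch_region_t *region)
{
    assert(region != NULL);
    free(region->attr);
    region->attr = NULL;
}

/* Cache hit: compare the attribute of the queried address with the one
 * stored in the region. Returns 1 if the hit is valid, 0 otherwise. */
static int sketch_validate(const sketch_region_t *region, uintptr_t address)
{
    sketch_mem_attr_t *current = sketch_query_attr(address);
    int same = 0;
    if ((current != NULL) && (region->attr != NULL)) {
        same = (current->buffer_id == region->attr->buffer_id);
    }
    free(current);
    return same;
}

const sketch_rcache_ops_t sketch_ops = {
    sketch_reg, sketch_dereg, sketch_validate
};
```

In this sketch a caller would invoke `sketch_ops.mem_reg_ext_validate()` when a region is created, `sketch_ops.mem_ext_validate()` on every cache hit, and `sketch_ops.mem_dereg_ext_validate()` when the region is destroyed. A hit that fails validation would be treated as a miss.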