Skip to content

Conversation

@SeyedMir
Copy link
Contributor

@SeyedMir SeyedMir commented Nov 17, 2020

What

This PR associates a memory attribute to each rcache entry. The attribute enables rcache to use CUDA buffer IDs to validate entries that correspond to CUDA memory allocations.

Why ?

If UCM hooks fail to install successfully, CUDA memory release functions are not intercepted. Therefore, the corresponding rcache entries are not invalidated. If a new CUDA allocation happens to use the same VA range as the one in rcache, it will lead to an invalid cache hit. This can be avoided by using CUDA buffer IDs as an external mechanism to validate rcache results.

How ?

Three new callbacks are added to rcache ops, and are used when rcache memory hooks fail to install:

  • mem_reg_ext_validate used for registration callback
  • mem_dereg_ext_validate used for deregistration callback
  • mem_ext_validate used to validate a cache hit

IB and gdrcopy memory domains will use UCT memory attributes (#6306) to implement the above callbacks.

  • mem_reg_ext_validate callback queries memory attribute and adds it to rcache region
  • mem_dereg_ext_validate callback destroys the memory attribute
  • mem_ext_validate callback compares the memory attribute of the queried address with the memory attribute of the region returned by rcache.

@swx-jenkins3
Copy link
Collaborator

Can one of the admins verify this patch?

@SeyedMir SeyedMir force-pushed the rcache-validation branch 3 times, most recently from 9116803 to cc2ae0d Compare November 17, 2020 00:50
@shamisp
Copy link
Contributor

shamisp commented Nov 17, 2020

Hi @SeyedMir . Thanks for the patch. What is your affiliation ? thanks !

@yosefe
Copy link
Contributor

yosefe commented Nov 17, 2020

Hi @SeyedMir . Thanks for the patch. What is your affiliation ? thanks !

NVIDIA

@yosefe
Copy link
Contributor

yosefe commented Nov 17, 2020

ok to test

@shamisp
Copy link
Contributor

shamisp commented Nov 17, 2020

@yosefe - can we ask Nvidia folks to update their GitHub profiles ?

@yosefe
Copy link
Contributor

yosefe commented Nov 17, 2020

@yosefe - can we ask Nvidia folks to update their GitHub profiles ?

image

@shamisp
Copy link
Contributor

shamisp commented Nov 17, 2020

@yosefe - it was not there yesterday, I checked.

@SeyedMir
Copy link
Contributor Author

That's right. I update it yesterday after Pavel's comment.

@SeyedMir SeyedMir changed the title Use CUDA buffer IDs to validate rcache entries Use CUDA buffer IDs to validate rcache entries Nov 20, 2020
@SeyedMir SeyedMir force-pushed the rcache-validation branch 13 times, most recently from f04fa1f to 91c70d1 Compare November 26, 2020 18:09
@SeyedMir SeyedMir force-pushed the rcache-validation branch 4 times, most recently from 4813fd2 to cd8efd5 Compare December 3, 2020 16:50
@SeyedMir SeyedMir force-pushed the rcache-validation branch 4 times, most recently from c693345 to 6be9648 Compare January 12, 2021 14:39
@SeyedMir SeyedMir force-pushed the rcache-validation branch 6 times, most recently from 9c514af to 3a28c2c Compare January 15, 2021 19:15
@SeyedMir
Copy link
Contributor Author

@bureddy @yosefe can you please review when possible? Thanks

ucm_warn("failed to install cuda memory hooks on runtime API")
}
goto out_unlock;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems this logic is incorrect

(1) is bistro fails with driver funcs, it is not sufficient with just runtime funcs bistro because the application might be using driver app
(2) with "reloc", both driver and runtime func needs to be successfully installed to consider ucm_cudamem_installed = 1

ucm_cuda_func_t *func;
ucs_status_t status;
void *func_ptr;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (hook_mode == UCM_MMAP_HOOK_NONE) { return }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ucm_cudamem_install() already does that check. Do we need it again here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missed it. then we are good.

@SeyedMir SeyedMir force-pushed the rcache-validation branch 2 times, most recently from a9739a4 to f4e23ea Compare January 21, 2021 16:41
@SeyedMir SeyedMir force-pushed the rcache-validation branch 2 times, most recently from 10c033c to 521c3bd Compare March 11, 2021 21:00
@SeyedMir
Copy link
Contributor Author

@yosefe @Akshay-Venkatesh @bureddy I updated this PR to use UCT memory attributes (from #6306) to validate rcache results when CUDA UCM hooks fail.
Please let me know your comments.

@SeyedMir SeyedMir force-pushed the rcache-validation branch from 521c3bd to e233608 Compare March 12, 2021 17:35
Unlike the existing uct_md_mem_query(), then new API is not specific
to a single MD. Therefore, it does not need a handle to an opened MD.
@SeyedMir SeyedMir force-pushed the rcache-validation branch from e233608 to 333f205 Compare March 12, 2021 19:22
@SeyedMir SeyedMir force-pushed the rcache-validation branch 2 times, most recently from 9e03aca to c10db76 Compare March 13, 2021 01:20
@SeyedMir SeyedMir force-pushed the rcache-validation branch from c10db76 to 67d5299 Compare March 15, 2021 13:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

WIP-DNM Work in progress / Do not review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants