CUDA Programming
Baiju Meswani edited this page Feb 15, 2023 · 3 revisions
Understand the hardware
- Architecture Generations
  - P100: Pascal / sm_60
  - V100: Volta / sm_70
  - A100: Ampere / sm_80
- CUDA Core vs. Tensor Core
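The sm_XY suffix is the compute capability (major X, minor Y) that nvcc targets, for example with -arch=sm_80. A minimal sketch for checking which generation a machine has; the function name ComputeCapability is illustrative and not part of CUDA or ORT:

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative helper (hypothetical name): returns major * 10 + minor for the
// given device, e.g. 60 for P100, 70 for V100, 80 for A100, or -1 on failure.
int ComputeCapability(int device) {
  assert(device >= 0);
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
  std::printf("%s: sm_%d%d, %d SMs\n", prop.name, prop.major, prop.minor,
              prop.multiProcessorCount);
  return prop.major * 10 + prop.minor;
}
```

Inside device code the same information is available at compile time through the __CUDA_ARCH__ macro (600, 700, 800 for the three generations above).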
Programming model
- Thread
- Block
- Grid
- Stream
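A minimal sketch of how the four concepts fit together, assuming a 1-D problem. The kernel and launcher names (AddOne, LaunchAddOne) and the block size of 256 are illustrative choices:

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Thread: each thread handles one element. Its global index is built from the
// block index, the block size, and the thread index inside the block.
__global__ void AddOne(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.0f;  // guard: the last block may be only partly used
}

// Grid: enough blocks to cover n elements. Stream: the queue the launch is
// placed on; the call returns before the kernel has finished.
cudaError_t LaunchAddOne(float* device_data, int n, cudaStream_t stream) {
  assert(n > 0);
  const int threads_per_block = 256;  // Block size (illustrative value)
  const int blocks_per_grid = (n + threads_per_block - 1) / threads_per_block;
  AddOne<<<blocks_per_grid, threads_per_block, 0, stream>>>(device_data, n);
  return cudaGetLastError();  // reports launch errors, not execution errors
}
```

Work queued on the same stream runs in order; work on different streams may overlap.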
Must-know functions
- cudaMalloc() vs. cudaFree()
- cudaMemcpy() vs. cudaMemcpyAsync()
- cudaMemset() vs. cudaMemsetAsync()
- cudaStreamSynchronize() vs. cudaDeviceSynchronize()
- cudaEventRecord() vs. cudaStreamWaitEvent()
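A rough sketch that exercises most of these calls in one place. The helper ZeroFillRoundTrip and the macro RETURN_IF_ERROR are hypothetical names introduced for illustration, and cleanup on the error paths is omitted for brevity:

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper macro: return from the enclosing function on the first error.
#define RETURN_IF_ERROR(expr)             \
  do {                                    \
    cudaError_t err_ = (expr);            \
    if (err_ != cudaSuccess) return err_; \
  } while (0)

// Zero-fills a device buffer on one stream and copies it back on another,
// using an event to order the two streams without blocking the host.
cudaError_t ZeroFillRoundTrip(std::vector<int>& host) {
  assert(!host.empty());
  const size_t bytes = host.size() * sizeof(int);
  int* device = nullptr;
  cudaStream_t producer, consumer;
  cudaEvent_t filled;

  RETURN_IF_ERROR(cudaMalloc(&device, bytes));
  RETURN_IF_ERROR(cudaStreamCreate(&producer));
  RETURN_IF_ERROR(cudaStreamCreate(&consumer));
  RETURN_IF_ERROR(cudaEventCreate(&filled));

  // Queued on `producer`; the host does not wait for it.
  RETURN_IF_ERROR(cudaMemsetAsync(device, 0, bytes, producer));
  RETURN_IF_ERROR(cudaEventRecord(filled, producer));

  // `consumer` waits on the device side until the memset has completed.
  RETURN_IF_ERROR(cudaStreamWaitEvent(consumer, filled, 0));
  // Note: with pageable host memory (std::vector) this copy may not overlap
  // with host work; pinned memory from cudaMallocHost() is needed for that.
  RETURN_IF_ERROR(cudaMemcpyAsync(host.data(), device, bytes,
                                  cudaMemcpyDeviceToHost, consumer));

  // Blocks the host on this one stream only. cudaDeviceSynchronize() would
  // wait for all outstanding work on the device.
  RETURN_IF_ERROR(cudaStreamSynchronize(consumer));

  cudaEventDestroy(filled);
  cudaStreamDestroy(producer);
  cudaStreamDestroy(consumer);
  return cudaFree(device);
}
```

The blocking variants cudaMemcpy() and cudaMemset() would give the same result here but do not take a stream argument, so the ordering between the two streams could not be expressed with them.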
-
- Avoid memcpy
- Avoid unnecessary sync
- Preprocess data on the CPU
- When to use #pragma unroll?
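On the #pragma unroll question, one common case is a short loop whose trip count is known at compile time. A minimal sketch under that assumption; the kernel name ScaleUnrolled and the value of kItemsPerThread are illustrative:

```cuda
#include <cassert>
#include <cuda_runtime.h>

constexpr int kItemsPerThread = 4;  // compile-time trip count (illustrative value)

// Each thread scales kItemsPerThread elements, strided by the block size so
// that neighbouring threads touch neighbouring addresses.
__global__ void ScaleUnrolled(float* data, float alpha, int n) {
  int base = blockIdx.x * blockDim.x * kItemsPerThread + threadIdx.x;
#pragma unroll
  for (int k = 0; k < kItemsPerThread; ++k) {
    int i = base + k * blockDim.x;
    if (i < n) data[i] *= alpha;
  }
}

cudaError_t LaunchScaleUnrolled(float* device_data, float alpha, int n,
                                cudaStream_t stream) {
  assert(n > 0);
  const int threads = 256;
  const int per_block = threads * kItemsPerThread;
  const int blocks = (n + per_block - 1) / per_block;
  ScaleUnrolled<<<blocks, threads, 0, stream>>>(device_data, alpha, n);
  return cudaGetLastError();
}
```

Unrolling removes the loop counter and branch and lets the compiler schedule the independent loads together, at the cost of more registers and code. For long loops or loops with a runtime trip count the benefit is less clear, so it is worth measuring.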
- Easy: Dropout/DropGrad
- Medium: SoftmaxCrossEntropyLoss(Grad)
- Hard: LayerNormalization, ReduceSum, GatherGrad
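To give a feel for the "Easy" tier, here is a simplified sketch of a dropout forward and gradient kernel. It is not the ORT implementation: the kernel names, the one-element-per-thread layout, and the use of the cuRAND device API with a per-thread Philox state are all assumptions made for illustration.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Forward: zero each element with probability `ratio`, scale survivors by
// 1 / (1 - ratio), and save the mask for the backward pass.
__global__ void DropoutKernel(const float* x, float* y, bool* mask, int n,
                              float ratio, unsigned long long seed) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  curandStatePhilox4_32_10_t state;
  curand_init(seed, /*subsequence=*/i, /*offset=*/0, &state);
  // curand_uniform returns a value in (0, 1], so ratio == 0 keeps everything.
  bool keep = curand_uniform(&state) > ratio;
  mask[i] = keep;
  y[i] = keep ? x[i] / (1.0f - ratio) : 0.0f;
}

// Backward: the gradient flows only through the elements that were kept.
__global__ void DropoutGradKernel(const float* dy, const bool* mask, float* dx,
                                  int n, float ratio) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) dx[i] = mask[i] ? dy[i] / (1.0f - ratio) : 0.0f;
}

cudaError_t LaunchDropout(const float* x, float* y, bool* mask, int n,
                          float ratio, unsigned long long seed,
                          cudaStream_t stream) {
  assert(n > 0 && ratio >= 0.0f && ratio < 1.0f);
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;
  DropoutKernel<<<blocks, threads, 0, stream>>>(x, y, mask, n, ratio, seed);
  return cudaGetLastError();
}
```

Both kernels are purely element-wise, which is what makes them a gentle starting point. The harder entries in the list need reductions across threads.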
- printf() works inside CUDA code
- Memcpy data to CPU for inspection?
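Both debugging approaches can be sketched in a few lines. The names SquareWithTrace and CopyToHost are illustrative helpers, not ORT functions:

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void SquareWithTrace(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  data[i] = data[i] * data[i];
  // Print from one thread only; output from every thread quickly becomes
  // unreadable and can overflow the device-side printf buffer.
  if (i == 0) {
    printf("block %d thread %d: data[0] = %f\n", blockIdx.x, threadIdx.x, data[0]);
  }
}

// Copies a device buffer to the host so it can be printed or inspected in a
// host debugger. Returns an empty vector if the copy fails.
std::vector<float> CopyToHost(const float* device_data, int n) {
  assert(n >= 0);
  std::vector<float> host(n);
  // Blocking copy: waits for earlier work on the default stream to finish.
  if (cudaMemcpy(host.data(), device_data, n * sizeof(float),
                 cudaMemcpyDeviceToHost) != cudaSuccess) {
    host.clear();
  }
  return host;
}
```

Device-side printf output is buffered and appears only at a synchronization point such as cudaDeviceSynchronize() or a blocking cudaMemcpy(), so a missing message often means the host has not synchronized yet. The copy itself is a sync and a memcpy, so it should be removed again once debugging is done.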
Please use the learning roadmap on the home wiki page to build a general understanding of ORT.