This is my final project for the CUDA Programming Course at Oxford (Prof. Mike Giles).
- Course website: https://people.maths.ox.ac.uk/gilesm/cuda/
- Practicals: https://github.com/MaxMLang/cuda-programming-notes
- Some thoughts on spatial reordering with Morton and Hilbert Curves: https://maxmlang.github.io/spatial-curves/
- Smooths noisy NDVI satellite data from Uganda on a hexagonal grid
- Compares CPU and GPU implementations (up to 70x faster for a single variable, ~170x per variable with kernel fusion)
- Documents the progression from a naive CUDA port to optimized kernels
- Explores several memory-access optimizations (coalesced layouts, texture memory, warp shuffle, kernel fusion)
- 500,104 hexagons from Kampala, Uganda (500k+ points)
- Average of 5.99 neighbors per hex (maximum 6)
- Goal: smooth each hexagon's NDVI value using its neighbors
- Problem: neighbor indices are scattered across memory, so naive gathers are uncoalesced (see the sketch below)
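A minimal CPU sketch of the smoothing step, assuming a flat NDVI array and a padded neighbor-index table (names and layout are illustrative, not the project's actual API):

```cuda
// Illustrative CPU reference for the smoothing step (assumed layout:
// `neighbors` is [n_hex * 6], padded with -1 where a hex has fewer neighbors).
void smooth_cpu(const float* ndvi, float* out,
                const int* neighbors, int n_hex) {
    for (int i = 0; i < n_hex; ++i) {
        float sum = ndvi[i];
        int count = 1;
        for (int k = 0; k < 6; ++k) {
            int j = neighbors[i * 6 + k];
            if (j >= 0) {          // skip padding slots
                sum += ndvi[j];    // neighbor indices jump around memory
                ++count;
            }
        }
        out[i] = sum / count;      // average of self + valid neighbors
    }
}
```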
src/
├── cuda/      # CUDA kernels v1-v5
└── cpu/       # CPU reference code
scripts/       # Run and benchmark scripts
data/          # Input files
results/       # Outputs and benchmarks
bin/           # Compiled executables
docs/          # Notes
- Naive CPU: 3.55 ms (baseline)
- OpenMP CPU: 4.91 ms (slower than serial; threading overhead dominates at this problem size)
- v1 (naive port): 51.4 μs (69x faster)
- v2 (coalesced layout): 50.8 μs (70x faster)
- v3 (texture memory): 56.8 μs (no significant change)
- v3 (warp shuffle): 87.4 μs (worse)
- v4 (fused, all 4 variables): 83.6 μs total
- v4 (per variable): 20.9 μs (170x faster)
- 1st-order only, Gaussian weights, single-variable kernel, reordered: 54.49 μs
- 1st-order only, Gaussian weights, fused 4-variable kernel, reordered: 90.83 μs (22.71 μs/var)
- 1st + 2nd order, Gaussian weights, single-variable kernel, reordered: 116.17 μs
- 1st + 2nd order, Gaussian weights, fused 4-variable kernel, reordered: 170.69 μs (42.67 μs/var)
- 1st + 2nd order, uniform weights, fused 4-variable kernel, reordered: 171.83 μs (42.96 μs/var)
Memory access improvement from spatial reordering:
- Average index distance between consecutive elements: 66.37 (far better than random ordering)
- Recursive bisection with 6 levels gives good spatial locality
- Reordering adds 4-7% overhead in exchange for significantly better memory-access patterns (see the sketch below)
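A host-side sketch of the recursive-bisection idea, assuming each hexagon has (x, y) centroid coordinates; the actual split criterion in the project may differ:

```cuda
// Recursive bisection sketch (host code). Splitting alternately on x and y for
// 6 levels groups spatially close hexagons into contiguous index ranges.
// Coordinates and helper names are assumptions for illustration.
#include <algorithm>
#include <numeric>
#include <vector>

void bisect(std::vector<int>& order, const std::vector<float>& x,
            const std::vector<float>& y, int lo, int hi, int level) {
    if (level == 0 || hi - lo < 2) return;
    int mid = (lo + hi) / 2;
    const std::vector<float>& axis = (level % 2 == 0) ? x : y;
    // Put the spatially "lower" half of this range first along the chosen axis.
    std::nth_element(order.begin() + lo, order.begin() + mid, order.begin() + hi,
                     [&](int a, int b) { return axis[a] < axis[b]; });
    bisect(order, x, y, lo, mid, level - 1);
    bisect(order, x, y, mid, hi, level - 1);
}

std::vector<int> reorder_hexes(const std::vector<float>& x,
                               const std::vector<float>& y) {
    std::vector<int> order(x.size());
    std::iota(order.begin(), order.end(), 0);                    // identity order
    bisect(order, x, y, 0, static_cast<int>(order.size()), 6);   // 6 levels
    return order;  // then permute per-hex arrays and remap neighbor indices
}
```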
- Direct port of the CPU loop
- One thread per hexagon
- Uncoalesced (poor) memory access
- Still far faster than the CPU (see the kernel sketch below)
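A v1-style kernel sketch (illustrative, not the exact project source), with the neighbor table stored hex-major so gathers are uncoalesced:

```cuda
// v1 sketch: one thread per hexagon, neighbor table stored hex-major
// ([i * 6 + k]), so index and value reads are scattered across memory.
__global__ void smooth_v1(const float* __restrict__ ndvi,
                          float* __restrict__ out,
                          const int* __restrict__ neighbors,  // [n_hex * 6], -1 = padding
                          int n_hex) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_hex) return;
    float sum = ndvi[i];
    int count = 1;
    for (int k = 0; k < 6; ++k) {
        int j = neighbors[i * 6 + k];
        if (j >= 0) { sum += ndvi[j]; ++count; }
    }
    out[i] = sum / count;
}
```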
- Transposed the neighbor array so threads in a warp read consecutive addresses (coalesced)
- Padded every hexagon to exactly 6 neighbor slots
- About 1% faster than v1 (layout sketch below)
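A sketch of the transposed (slot-major) neighbor layout, with assumed array names:

```cuda
// v2 sketch: neighbor table transposed to slot-major (neighbors_t[k * n_hex + i])
// so adjacent threads load adjacent addresses for each slot k.
__global__ void smooth_v2(const float* __restrict__ ndvi,
                          float* __restrict__ out,
                          const int* __restrict__ neighbors_t,  // [6 * n_hex], -1 = padding
                          int n_hex) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_hex) return;
    float sum = ndvi[i];
    int count = 1;
    for (int k = 0; k < 6; ++k) {
        int j = neighbors_t[k * n_hex + i];       // coalesced index read
        if (j >= 0) { sum += ndvi[j]; ++count; }  // value gather is still scattered
    }
    out[i] = sum / count;
}
```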
- Texture memory: no measurable benefit
- Warp shuffle: significantly worse (see the read-only-cache sketch below)
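One way to approximate the texture path is routing the scattered gathers through the read-only data cache with `__ldg`; this is an assumption about how v3 was structured, not a quote of it (the project may have used texture objects instead):

```cuda
// Read-only-cache variant sketch: same logic as v2, but loads go through
// __ldg so the scattered NDVI gathers use the texture/read-only cache path.
__global__ void smooth_v3_ldg(const float* __restrict__ ndvi,
                              float* __restrict__ out,
                              const int* __restrict__ neighbors_t,  // [6 * n_hex]
                              int n_hex) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_hex) return;
    float sum = __ldg(&ndvi[i]);
    int count = 1;
    for (int k = 0; k < 6; ++k) {
        int j = __ldg(&neighbors_t[k * n_hex + i]);
        if (j >= 0) { sum += __ldg(&ndvi[j]); ++count; }
    }
    out[i] = sum / count;
}
```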
- Processes all 4 variables in one fused kernel
- Reuses each neighbor-index lookup across the variables
- About 4x faster per variable than running the kernels separately (fusion sketch below)
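A fused-kernel sketch under the assumption that the 4 variables live in separate arrays; each neighbor index is loaded once and applied to all of them:

```cuda
// v4 fusion sketch: one neighbor-index load is amortized over 4 variables.
// Array names and layout are illustrative.
__global__ void smooth_v4_fused(const float* __restrict__ var0,
                                const float* __restrict__ var1,
                                const float* __restrict__ var2,
                                const float* __restrict__ var3,
                                float* __restrict__ out0, float* __restrict__ out1,
                                float* __restrict__ out2, float* __restrict__ out3,
                                const int* __restrict__ neighbors_t,  // [6 * n_hex]
                                int n_hex) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_hex) return;
    float s0 = var0[i], s1 = var1[i], s2 = var2[i], s3 = var3[i];
    int count = 1;
    for (int k = 0; k < 6; ++k) {
        int j = neighbors_t[k * n_hex + i];   // one lookup shared by all 4 vars
        if (j >= 0) {
            s0 += var0[j]; s1 += var1[j]; s2 += var2[j]; s3 += var3[j];
            ++count;
        }
    }
    out0[i] = s0 / count; out1[i] = s1 / count;
    out2[i] = s2 / count; out3[i] = s3 / count;
}
```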
- Extends the stencil to neighbors-of-neighbors (2nd-order)
- Gaussian-style weights: center 1.0, 1st ring 0.6, 2nd ring 0.13
- More work per hexagon, but kernel fusion still pays off (see the sketch below)
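A single-variable sketch of the weighted 2nd-order smoothing. It assumes a precomputed 2nd-ring neighbor list (up to 12 entries per hexagon, padded with -1); the project may build the 2nd ring differently, e.g. by walking neighbors-of-neighbors inside the kernel:

```cuda
// v5 sketch: Gaussian-like weights (center 1.0, 1st ring 0.6, 2nd ring 0.13),
// normalized by the total weight actually accumulated.
__global__ void smooth_v5(const float* __restrict__ ndvi,
                          float* __restrict__ out,
                          const int* __restrict__ nbr1_t,  // [ 6 * n_hex], -1 = padding
                          const int* __restrict__ nbr2_t,  // [12 * n_hex], -1 = padding (assumed)
                          int n_hex) {
    const float W0 = 1.0f, W1 = 0.6f, W2 = 0.13f;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_hex) return;
    float sum  = W0 * ndvi[i];
    float wsum = W0;
    for (int k = 0; k < 6; ++k) {            // 1st ring
        int j = nbr1_t[k * n_hex + i];
        if (j >= 0) { sum += W1 * ndvi[j]; wsum += W1; }
    }
    for (int k = 0; k < 12; ++k) {           // 2nd ring
        int j = nbr2_t[k * n_hex + i];
        if (j >= 0) { sum += W2 * ndvi[j]; wsum += W2; }
    }
    out[i] = sum / wsum;
}
```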
- Simple optimizations beat elaborate ones
- Kernel fusion pays off when several variables share the same neighbor structure
- Shared memory only helps when data is reused heavily
- Always measure; don't just guess
- With only ~6 neighbors per hexagon, there is limited data reuse to exploit
# Build everything
make all
# Just CPU
make cpu
# Just CUDA
make cuda
# Clean up
make clean
# Main run
./scripts/case-study.sh
# Test v5
./scripts/test-v5.sh
For millions of hexagons:
- OpenMP finally beats the serial CPU baseline
- Shared memory starts to matter
- Spatial reordering helps more
- Multi-GPU becomes necessary
Estimated memory footprint (roughly 70 bytes per hexagon; see the sketch after this list):
- 75k: ~5MB
- 1M: ~70MB
- 10M: ~700MB
- 100M: ~7GB
- 1B: ~70GB
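A hypothetical per-hexagon record that lands in that ~70-byte range (an assumption for illustration, not the project's actual data format):

```cuda
// Back-of-envelope check of the memory estimates above, using an assumed
// per-hexagon record (the real layout may differ).
#include <cstdio>

struct HexRecord {
    float vars[4];       // 4 input variables          (16 B)  -- assumed
    float smoothed[4];   // 4 smoothed outputs         (16 B)  -- assumed
    int   neighbors[6];  // padded neighbor indices    (24 B)
    int   n_neighbors;   // actual neighbor count <= 6 ( 4 B)
};  // ~60 B; coordinates/padding push this toward ~70 B per hexagon

int main() {
    printf("bytes per hex: %zu\n", sizeof(HexRecord));
    printf("1M hexes: ~%.0f MB\n", 1e6 * sizeof(HexRecord) / 1e6);    // ~60-70 MB
    printf("100M hexes: ~%.0f GB\n", 1e8 * sizeof(HexRecord) / 1e9);  // ~6-7 GB
    return 0;
}
```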
This was the final project for the CUDA Programming Course at Oxford. The course covered:
- Parallel programming basics
- Memory types and access patterns (global, shared, etc.)
- Performance tuning and profiling
- Advanced topics (warp shuffle, atomics)
- Real-world applications
- Prof. Mike Giles and Prof. Wes Armour for teaching the course
- Oxford ARC for GPU server access