This is my final project for the CUDA Programming Course at Oxford (Prof. Mike Giles).
- Course website: https://people.maths.ox.ac.uk/gilesm/cuda/
- Practicals: https://github.com/MaxMLang/cuda-programming-notes
- Some thoughts on spatial reordering with Morton and Hilbert Curves: https://maxmlang.github.io/spatial-curves/
- Smooths noisy NDVI satellite data from Uganda on a hexagonal grid
- Compares CPU and GPU implementations (up to 70x faster for a single variable, ~170x per variable with kernel fusion)
- Documents the progression from a naive CUDA port to optimized kernels
- Explores several memory-access optimizations (coalesced layouts, texture memory, warp shuffle, kernel fusion)
- 500,104 hexagons from Kampala, Uganda (500k+ points)
- Average of 5.99 neighbors per hex (maximum 6)
- Goal: smooth each hexagon's NDVI value using its neighbors
- Problem: neighbor indices are scattered across memory, so naive gathers are uncoalesced (see the sketch below)
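A minimal CPU sketch of the smoothing step, assuming a flat NDVI array and a padded neighbor-index table (names and layout are illustrative, not the project's actual API):

```cuda
// Illustrative CPU reference for the smoothing step (assumed layout:
// `neighbors` is [n_hex * 6], padded with -1 where a hex has fewer neighbors).
void smooth_cpu(const float* ndvi, float* out,
                const int* neighbors, int n_hex) {
    for (int i = 0; i < n_hex; ++i) {
        float sum = ndvi[i];
        int count = 1;
        for (int k = 0; k < 6; ++k) {
            int j = neighbors[i * 6 + k];
            if (j >= 0) {          // skip padding slots
                sum += ndvi[j];    // neighbor indices jump around memory
                ++count;
            }
        }
        out[i] = sum / count;      // average of self + valid neighbors
    }
}
```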
src/
├── cuda/      # CUDA kernels v1-v5
└── cpu/       # CPU reference code
scripts/       # Run and benchmark scripts
data/          # Input files
results/       # Outputs and benchmarks
bin/           # Compiled executables
docs/          # Notes
- Naive CPU: 3.55 ms (baseline)
- OpenMP CPU: 4.91 ms (slower than serial; threading overhead dominates at this problem size)
- v1 (naive port): 51.4 μs (69x faster)
- v2 (coalesced layout): 50.8 μs (70x faster)
- v3 (texture memory): 56.8 μs (no significant change)
- v3 (warp shuffle): 87.4 μs (worse)
- v4 (fused, all 4 variables): 83.6 μs total
- v4 (per variable): 20.9 μs (170x faster)
- 1st-order only, Gaussian weights, single-variable kernel, reordered: 54.49 μs
- 1st-order only, Gaussian weights, fused 4-variable kernel, reordered: 90.83 μs (22.71 μs/var)
- 1st + 2nd order, Gaussian weights, single-variable kernel, reordered: 116.17 μs
- 1st + 2nd order, Gaussian weights, fused 4-variable kernel, reordered: 170.69 μs (42.67 μs/var)
- 1st + 2nd order, uniform weights, fused 4-variable kernel, reordered: 171.83 μs (42.96 μs/var)
Memory access improvement from spatial reordering:
- Average index distance between consecutive elements: 66.37 (far better than random ordering)
- Recursive bisection with 6 levels gives good spatial locality
- Reordering adds 4-7% overhead in exchange for significantly better memory-access patterns (see the sketch below)
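A host-side sketch of the recursive-bisection idea, assuming each hexagon has (x, y) centroid coordinates; the actual split criterion in the project may differ:

```cuda
// Recursive bisection sketch (host code). Splitting alternately on x and y for
// 6 levels groups spatially close hexagons into contiguous index ranges.
// Coordinates and helper names are assumptions for illustration.
#include <algorithm>
#include <numeric>
#include <vector>

void bisect(std::vector<int>& order, const std::vector<float>& x,
            const std::vector<float>& y, int lo, int hi, int level) {
    if (level == 0 || hi - lo < 2) return;
    int mid = (lo + hi) / 2;
    const std::vector<float>& axis = (level % 2 == 0) ? x : y;
    // Put the spatially "lower" half of this range first along the chosen axis.
    std::nth_element(order.begin() + lo, order.begin() + mid, order.begin() + hi,
                     [&](int a, int b) { return axis[a] < axis[b]; });
    bisect(order, x, y, lo, mid, level - 1);
    bisect(order, x, y, mid, hi, level - 1);
}

std::vector<int> reorder_hexes(const std::vector<float>& x,
                               const std::vector<float>& y) {
    std::vector<int> order(x.size());
    std::iota(order.begin(), order.end(), 0);                    // identity order
    bisect(order, x, y, 0, static_cast<int>(order.size()), 6);   // 6 levels
    return order;  // then permute per-hex arrays and remap neighbor indices
}
```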
- Direct port of the CPU loop
- One thread per hexagon
- Uncoalesced (poor) memory access
- Still far faster than the CPU (see the kernel sketch below)
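A v1-style kernel sketch (illustrative, not the exact project source), with the neighbor table stored hex-major so gathers are uncoalesced:

```cuda
// v1 sketch: one thread per hexagon, neighbor table stored hex-major
// ([i * 6 + k]), so index and value reads are scattered across memory.
__global__ void smooth_v1(const float* __restrict__ ndvi,
                          float* __restrict__ out,
                          const int* __restrict__ neighbors,  // [n_hex * 6], -1 = padding
                          int n_hex) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_hex) return;
    float sum = ndvi[i];
    int count = 1;
    for (int k = 0; k < 6; ++k) {
        int j = neighbors[i * 6 + k];
        if (j >= 0) { sum += ndvi[j]; ++count; }
    }
    out[i] = sum / count;
}
```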
- Transposed the neighbor array so threads in a warp read consecutive addresses (coalesced)
- Padded every hexagon to exactly 6 neighbor slots
- About 1% faster than v1 (layout sketch below)
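A sketch of the transposed (slot-major) neighbor layout, with assumed array names:

```cuda
// v2 sketch: neighbor table transposed to slot-major (neighbors_t[k * n_hex + i])
// so adjacent threads load adjacent addresses for each slot k.
__global__ void smooth_v2(const float* __restrict__ ndvi,
                          float* __restrict__ out,
                          const int* __restrict__ neighbors_t,  // [6 * n_hex], -1 = padding
                          int n_hex) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_hex) return;
    float sum = ndvi[i];
    int count = 1;
    for (int k = 0; k < 6; ++k) {
        int j = neighbors_t[k * n_hex + i];       // coalesced index read
        if (j >= 0) { sum += ndvi[j]; ++count; }  // value gather is still scattered
    }
    out[i] = sum / count;
}
```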
- Texture memory: no measurable benefit
- Warp shuffle: significantly worse (see the read-only-cache sketch below)
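One way to approximate the texture path is routing the scattered gathers through the read-only data cache with `__ldg`; this is an assumption about how v3 was structured, not a quote of it (the project may have used texture objects instead):

```cuda
// Read-only-cache variant sketch: same logic as v2, but loads go through
// __ldg so the scattered NDVI gathers use the texture/read-only cache path.
__global__ void smooth_v3_ldg(const float* __restrict__ ndvi,
                              float* __restrict__ out,
                              const int* __restrict__ neighbors_t,  // [6 * n_hex]
                              int n_hex) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_hex) return;
    float sum = __ldg(&ndvi[i]);
    int count = 1;
    for (int k = 0; k < 6; ++k) {
        int j = __ldg(&neighbors_t[k * n_hex + i]);
        if (j >= 0) { sum += __ldg(&ndvi[j]); ++count; }
    }
    out[i] = sum / count;
}
```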
- Processes all 4 variables in one fused kernel
- Reuses each neighbor-index lookup across the variables
- About 4x faster per variable than running the kernels separately (fusion sketch below)
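A fused-kernel sketch under the assumption that the 4 variables live in separate arrays; each neighbor index is loaded once and applied to all of them:

```cuda
// v4 fusion sketch: one neighbor-index load is amortized over 4 variables.
// Array names and layout are illustrative.
__global__ void smooth_v4_fused(const float* __restrict__ var0,
                                const float* __restrict__ var1,
                                const float* __restrict__ var2,
                                const float* __restrict__ var3,
                                float* __restrict__ out0, float* __restrict__ out1,
                                float* __restrict__ out2, float* __restrict__ out3,
                                const int* __restrict__ neighbors_t,  // [6 * n_hex]
                                int n_hex) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_hex) return;
    float s0 = var0[i], s1 = var1[i], s2 = var2[i], s3 = var3[i];
    int count = 1;
    for (int k = 0; k < 6; ++k) {
        int j = neighbors_t[k * n_hex + i];   // one lookup shared by all 4 vars
        if (j >= 0) {
            s0 += var0[j]; s1 += var1[j]; s2 += var2[j]; s3 += var3[j];
            ++count;
        }
    }
    out0[i] = s0 / count; out1[i] = s1 / count;
    out2[i] = s2 / count; out3[i] = s3 / count;
}
```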
- Extends the stencil to neighbors-of-neighbors (2nd-order)
- Gaussian-style weights: center 1.0, 1st ring 0.6, 2nd ring 0.13
- More work per hexagon, but kernel fusion still pays off (see the sketch below)
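A single-variable sketch of the weighted 2nd-order smoothing. It assumes a precomputed 2nd-ring neighbor list (up to 12 entries per hexagon, padded with -1); the project may build the 2nd ring differently, e.g. by walking neighbors-of-neighbors inside the kernel:

```cuda
// v5 sketch: Gaussian-like weights (center 1.0, 1st ring 0.6, 2nd ring 0.13),
// normalized by the total weight actually accumulated.
__global__ void smooth_v5(const float* __restrict__ ndvi,
                          float* __restrict__ out,
                          const int* __restrict__ nbr1_t,  // [ 6 * n_hex], -1 = padding
                          const int* __restrict__ nbr2_t,  // [12 * n_hex], -1 = padding (assumed)
                          int n_hex) {
    const float W0 = 1.0f, W1 = 0.6f, W2 = 0.13f;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_hex) return;
    float sum  = W0 * ndvi[i];
    float wsum = W0;
    for (int k = 0; k < 6; ++k) {            // 1st ring
        int j = nbr1_t[k * n_hex + i];
        if (j >= 0) { sum += W1 * ndvi[j]; wsum += W1; }
    }
    for (int k = 0; k < 12; ++k) {           // 2nd ring
        int j = nbr2_t[k * n_hex + i];
        if (j >= 0) { sum += W2 * ndvi[j]; wsum += W2; }
    }
    out[i] = sum / wsum;
}
```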
- Simple optimizations beat elaborate ones
- Kernel fusion pays off when several variables share the same neighbor structure
- Shared memory only helps when data is reused heavily
- Always measure; don't just guess
- With only ~6 neighbors per hexagon, there is limited data reuse to exploit
# Build everything
make all
# Just CPU
make cpu
# Just CUDA
make cuda
# Clean up
make clean
# Main run
./scripts/case-study.sh
# Test v5
./scripts/test-v5.sh
For millions of hexagons:
- OpenMP finally beats the serial CPU baseline
- Shared memory starts to matter
- Spatial reordering helps more
- Multi-GPU becomes necessary
Estimated memory footprint (roughly 70 bytes per hexagon; see the sketch after this list):
- 75k: ~5MB
- 1M: ~70MB
- 10M: ~700MB
- 100M: ~7GB
- 1B: ~70GB
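A hypothetical per-hexagon record that lands in that ~70-byte range (an assumption for illustration, not the project's actual data format):

```cuda
// Back-of-envelope check of the memory estimates above, using an assumed
// per-hexagon record (the real layout may differ).
#include <cstdio>

struct HexRecord {
    float vars[4];       // 4 input variables          (16 B)  -- assumed
    float smoothed[4];   // 4 smoothed outputs         (16 B)  -- assumed
    int   neighbors[6];  // padded neighbor indices    (24 B)
    int   n_neighbors;   // actual neighbor count <= 6 ( 4 B)
};  // ~60 B; coordinates/padding push this toward ~70 B per hexagon

int main() {
    printf("bytes per hex: %zu\n", sizeof(HexRecord));
    printf("1M hexes: ~%.0f MB\n", 1e6 * sizeof(HexRecord) / 1e6);    // ~60-70 MB
    printf("100M hexes: ~%.0f GB\n", 1e8 * sizeof(HexRecord) / 1e9);  // ~6-7 GB
    return 0;
}
```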
This was the final project for the CUDA Programming Course at Oxford. The course covered:
- Parallel programming basics
- Memory types and access patterns (global, shared, etc.)
- Performance tuning and profiling
- Advanced topics (warp shuffle, atomics)
- Real-world applications
- Prof. Mike Giles and Prof. Wes Armour for teaching the course
- Oxford ARC for GPU server access