This was my final project for Caltech's GPU Programming class!
A comprehensive Variational Autoencoder (VAE) inference system with both CPU and GPU implementations using PyTorch and CUDA. Features pretrained weights from the gan-vae-pretrained-pytorch repository for MNIST digit reconstruction.
Variational Autoencoders are a class of deep generative models that combine neural networks with variational inference to learn meaningful latent representations of data. Originally introduced by Kingma and Welling in 2013, VAEs consist of an encoder network that maps input data to a probabilistic latent space and a decoder network that reconstructs data from latent samples. The key innovation is the reparameterization trick, which expresses latent samples as deterministic transformations of the encoder outputs and random noise. VAEs optimize a joint objective function combining reconstruction loss (measuring how well decoded samples match the original input) and KL divergence loss (ensuring the learned latent distribution approximates a prior distribution, typically standard Gaussian).
The VAE inference process is computationally intensive due to four main components:
- Forward passes through encoder and decoder networks involving multiple dense layers and activation functions
- The reparameterization trick requiring parallel sampling from Gaussian distributions
- Reconstruction loss calculations comparing high-dimensional input and output tensors
- KL divergence computations involving logarithmic and exponential operations across latent dimensions
Each operation exhibits high parallelism potential. The GPU implementation exploits this parallelism through custom CUDA kernels optimized for each computational component.
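For reference, the sketch below shows the per-element math behind these components in plain PyTorch (illustrative only, with made-up tensor shapes for a batch of 64; it is not the project's actual code):

```python
import torch
import torch.nn.functional as F

# Illustrative tensors: a batch of 64 flattened 28x28 MNIST images and encoder outputs.
x = torch.rand(64, 784)                              # pixel values in [0, 1]
mu, logvar = torch.randn(64, 20), torch.randn(64, 20)

# Reparameterization trick: z = mu + sigma * eps, sampled independently per element.
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * logvar) * eps

# Stand-in for the decoder output (sigmoid activations in [0, 1]).
recon_x = torch.rand(64, 784)

# Reconstruction loss: binary cross-entropy summed over all 784 pixels.
bce = F.binary_cross_entropy(recon_x, x, reduction="sum")

# KL divergence to a standard Gaussian prior, in closed form over the 20 latent dims.
kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

loss = bce + kld
```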
- Pure CPU implementation: Complete PyTorch-based VAE inference running entirely on CPU
- Automatic weight downloading: Downloads pretrained weights from GitHub automatically
- Reconstruction quality evaluation: Computes reconstruction loss and generates detailed analysis
- High-quality visualization: Side-by-side comparison of original vs reconstructed MNIST digits
- Comprehensive testing suite: Full test coverage with unit tests, validation tests, and benchmarks
- Configurable parameters: Batch size, number of samples, random seed, timing modes
- High-performance GPU inference: Complete CUDA C++ implementation with custom kernels
- Weight loading system: Converts PyTorch weights to optimized CUDA-compatible format
- Custom CUDA kernels: Matrix multiplication, activation functions, and loss calculations
- Multiple executables: Main inference, evaluation, and benchmarking tools
- Safetensors support: Efficient weight loading with bfloat16 precision support
The VAE model implements the following architecture:
- Encoder: 784 → 400 → 20 (mean) and 20 (log-variance)
- Latent space: 20-dimensional Gaussian distributions with reparameterization trick
- Decoder: 20 → 400 → 784
- Activation functions: ReLU for hidden layers, Sigmoid for output
- Loss function: Binary Cross-Entropy (reconstruction) + KL Divergence (regularization)
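A PyTorch module matching these layer sizes might look like the sketch below (based on the standard MNIST VAE layout; the repository's actual class and attribute names may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Sketch of the 784 -> 400 -> 20 -> 400 -> 784 VAE described above."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 400)    # encoder hidden layer
        self.fc21 = nn.Linear(400, 20)    # mean head
        self.fc22 = nn.Linear(400, 20)    # log-variance head
        self.fc3 = nn.Linear(20, 400)     # decoder hidden layer
        self.fc4 = nn.Linear(400, 784)    # decoder output layer

    def encode(self, x):
        h1 = F.relu(self.fc1(x))
        return self.fc21(h1), self.fc22(h1)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def decode(self, z):
        h3 = F.relu(self.fc3(z))
        return torch.sigmoid(self.fc4(h3))

    def forward(self, x):
        mu, logvar = self.encode(x.view(-1, 784))
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar
```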
- Python 3.10+
- NVIDIA GPU with CUDA Compute Capability 6.0+ (RTX A5000 has 8.6)
- CUDA Toolkit 11.1+ (tested with 12.5)
- CMake 3.27+ (tested with 4.0.0)
- C++20 compatible compiler (GCC 9+ or Clang 10+, tested with GCC 11.4.0)
- Install Python dependencies:
cd 179_vae_inference
pip install -r requirements.txt
- Download weights and MNIST data (required first step):
python3 vae_inference.py
This will automatically download the pretrained weights and MNIST dataset needed for both CPU and CUDA implementations.
- Convert PyTorch weights to CUDA format (required for CUDA inference):
python3 scripts/convert_vae_weights.py
- For CUDA implementation, build the project:
mkdir -p build && cd build
cmake .. && cmake --build .
cd ..
Run the main inference analysis with default parameters:
python3 vae_inference.py
This will generate:
- Reconstruction comparison image (`vae_analysis_output/reconstruction_analysis.png`): side-by-side visualization of original vs reconstructed MNIST digits
- Average reconstruction loss: computed over 200 test samples (expected: approximately 100.3107)
In the reconstruction analysis image: top row shows original MNIST digits, bottom row shows VAE reconstructions.
Note on Reconstruction Loss: The average reconstruction loss may appear high, but this is expected for MNIST VAE models. The loss is the Binary Cross-Entropy summed over all 784 pixels, so even small per-pixel errors accumulate; a total of ~100.31 corresponds to about 0.13 BCE per pixel, which better reflects the actual reconstruction quality.
Run with custom parameters, or in CPU timing-only mode:
python3 vae_inference.py --inference-batch 64 --evaluation-samples 200 --random-state 42 --output-path my_results
python3 vae_inference.py --timing-mode
- `--inference-batch`: Batch size for inference processing (default: 64, minimum: 2)
- `--random-state`: Random seed for reproducibility (default: 42)
- `--evaluation-samples`: Total samples to evaluate for metrics (default: 200)
- `--model-weights-url`: URL for downloading pretrained model weights
- `--output-path`: Directory path for saving analysis results (default: 'vae_analysis_output')
- `--timing-mode`: Run CPU timing benchmark instead of full analysis
Run the main CUDA inference with pretrained weights:
./build/cuda_vae
Run evaluation on real MNIST data with visualization:
./build/run_eval
This will generate:
- Reconstruction comparison image (`reconstruction_analysis.png`): side-by-side visualization of original vs reconstructed MNIST digits
- Average reconstruction loss: computed over 200 test samples (expected: approximately 99.6678)
In the reconstruction analysis image: top row shows original MNIST digits, bottom row shows VAE reconstructions.
To run inference with custom pretrained weights, point the `VAE_WEIGHTS_PATH` environment variable at the desired safetensors file:
export VAE_WEIGHTS_PATH="/path/to/my/weights.safetensors"
./build/cuda_vae
Run the standalone CUDA timing benchmark:
./build/cuda_timing_benchmark <batch_size> <num_iterations>
- `<batch_size>`: Batch size for benchmarking (positional argument)
- `<num_iterations>`: Number of iterations to run for timing (positional argument)
python3 scripts/convert_vae_weights.py --input pretrained_vae_weights.pth --output pretrained_vae_weights.safetensors
- `--input`, `-i`: Input .pth file path (default: 'pretrained_vae_weights.pth')
- `--output`, `-o`: Output safetensors file path (default: 'pretrained_vae_weights.safetensors')
- `--no-bfloat16`: Keep float32 instead of converting to bfloat16 (flag)
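For reference, the core of such a conversion can be expressed with `torch` and `safetensors` roughly as follows (a simplified sketch, not the script's exact code):

```python
import torch
from safetensors.torch import save_file

# Load the pretrained PyTorch checkpoint (assumed here to be a plain state_dict of tensors).
state_dict = torch.load("pretrained_vae_weights.pth", map_location="cpu")

# Optionally cast to bfloat16 to halve storage size and memory bandwidth.
converted = {name: t.to(torch.bfloat16).contiguous() for name, t in state_dict.items()}

# Write the tensors out in the safetensors format consumed by the CUDA loader.
save_file(converted, "pretrained_vae_weights.safetensors")
```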
Verify that the converted weights match the original PyTorch weights:
python3 scripts/verify_weights.py --tolerance 0.01
- `--original`, `-o`: Original .pth file path (default: 'pretrained_vae_weights.pth')
- `--converted`, `-c`: Converted safetensors file path (default: 'pretrained_vae_weights.safetensors')
- `--tolerance`, `-t`: Tolerance for comparison (default: 1e-3)
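Conceptually, the verification loads both files and compares each tensor against the tolerance; a simplified sketch:

```python
import torch
from safetensors.torch import load_file

original = torch.load("pretrained_vae_weights.pth", map_location="cpu")
converted = load_file("pretrained_vae_weights.safetensors")

tolerance = 1e-3
for name, ref in original.items():
    # Upcast the bfloat16 copy back to float32 before comparing.
    diff = (ref.float() - converted[name].float()).abs().max().item()
    status = "OK" if diff <= tolerance else "MISMATCH"
    print(f"{name}: max abs diff = {diff:.6f} ({status})")
```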
To compare CPU vs CUDA performance, use these benchmarking commands with large batch sizes:
# CPU timing (2M samples over 2000 runs)
python3 vae_inference.py --timing-mode --inference-batch 1024 --evaluation-samples 2048000
# CUDA timing (2000 runs)
./build/cuda_timing_benchmark 1024 2000
Expected Results: CPU ~7.069 ms/batch, GPU ~2.28392 ms/batch (3.10x speedup)
# CPU timing
python3 vae_inference.py --timing-mode --inference-batch 2048 --evaluation-samples 2048000
# CUDA timing
./build/cuda_timing_benchmark 2048 2000
Expected Results: CPU ~13.531 ms/batch, GPU ~3.27272 ms/batch (4.13x speedup)
# CPU timing
python3 vae_inference.py --timing-mode --inference-batch 4096 --evaluation-samples 2048000
# CUDA timing
./build/cuda_timing_benchmark 4096 2000
Expected Results: CPU ~26.788 ms/batch, GPU ~5.24699 ms/batch (5.11x speedup)
Note: Results are stochastic and may vary slightly between runs, but the speedup trends should be consistent.
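The reported speedups are simply the ratio of per-batch latencies; a small helper (hypothetical, using the numbers above) makes the relationship explicit:

```python
def report(batch_size, cpu_ms, gpu_ms):
    # Speedup is the ratio of CPU to GPU per-batch latency; throughput is samples per second.
    speedup = cpu_ms / gpu_ms
    gpu_throughput = batch_size / (gpu_ms / 1000.0)
    print(f"batch {batch_size}: {speedup:.2f}x speedup, {gpu_throughput:,.0f} samples/s on GPU")

report(1024, 7.069, 2.28392)   # -> ~3.10x speedup
report(2048, 13.531, 3.27272)  # -> ~4.13x speedup
report(4096, 26.788, 5.24699)  # -> ~5.11x speedup
```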
Run the CPU test suite with default settings, or with custom options:
python3 run_tests.py
python3 run_tests.py --output-dir custom_test_results --batch-size 32
- `--output-dir`: Directory to save test results (default: 'test_results')
- `--model-weights-url`: URL for pretrained model weights
- `--batch-size`: Batch size for data loading tests (default: 32, minimum: 2)
Run the CUDA test suite with default settings, or with custom options:
python3 run_cuda_tests.py
python3 run_cuda_tests.py --output-dir cuda_test_results --gpu-id 0 --timeout 300
- `--output-dir`: Directory to save test results (default: 'cuda_test_results')
- `--cuda-vae-exe`: Path to CUDA VAE executable (default: './build/cuda_vae')
- `--run-eval-exe`: Path to run_eval executable (default: './build/run_eval')
- `--batch-size`: Batch size for testing (default: 32, minimum: 2)
- `--gpu-id`: GPU device ID to use for testing (default: 0)
- `--timeout`: Timeout in seconds for executable runs (default: 300)
├── vae_inference.py # Main CPU inference script
├── run_tests.py # CPU testing suite
├── run_cuda_tests.py # CUDA testing suite
├── requirements.txt # Python dependencies
├── CMakeLists.txt # CMake build configuration
├── scripts/
│ ├── convert_vae_weights.py # PyTorch to safetensors conversion
│ └── verify_weights.py # Weight conversion verification
├── src/ # CUDA source files
│ ├── main_inference.cu # Main CUDA inference entry point
│ ├── run_eval.cu # CUDA evaluation with MNIST
│ ├── cuda_timing_benchmark.cu # CUDA performance benchmark
│ ├── VAEModel.cu # CUDA VAE model implementation
│ ├── VAEKernels.cu # Custom CUDA kernels
│ ├── VAELoader.cu # Weight loading system
│ ├── MatrixVectorMultiply.cu # Matrix multiplication kernels
│ ├── MnistLoader.cpp # MNIST data loading
│ └── EvalHelpers.cu # Evaluation utilities
├── include/ # CUDA header files
│ ├── VAEModel.cuh # VAE model interface
│ ├── VAEKernels.cuh # Kernel declarations
│ ├── VAELoader.h # Weight loader interface
│ └── ...
├── test_vae_inference.py # CPU unit tests
├── test_cuda_vae_inference.py # CUDA unit tests
├── vae_verification_utils.py # CPU testing utilities
└── cuda_vae_verification_utils.py # CUDA testing utilities
Similar to the previous transformer lab, this implementation uses a mixed-precision approach for optimal performance:
- Model weights and storage: `bfloat16` format to reduce memory bandwidth compared to `float32`
- Internal computations: `float32` precision for accumulation within CUDA kernels to minimize floating-point rounding errors
- Memory layout: all tensors use row-major ordering (contiguous along last dimension) for simplicity, though this may not be optimal for all operations
Based on profiling analysis using NVIDIA Nsight Compute (`ncu --export build/profile.ncu-rep --force-overwrite --set full ./build/cuda_timing_benchmark 1024 1`), we have identified specific performance bottlenecks and actionable optimization opportunities:
System Performance:
- Current GPU speedup: 3-5x over CPU baseline (1024-4096 batch sizes)
- Target potential: Additional 2-5x speedup with optimizations below
Kernel-Specific Status:
- GEMM Kernels (`gemm_bf16_relu/sigmoid`): Good compute utilization (70%+) but memory-bound
- Type Conversion (`fp32_to_bf16`): High bandwidth (401.93 GB/s) but inefficient utilization (30.31% busy)
- VAE Operations: Well-fused encoder heads and loss computations
Profiling Evidence (GEMM kernels):
- Global loads: Only 6.3 of 32 bytes per sector utilized (19.7% efficiency)
- Global stores: Only 4.0 of 32 bytes per sector utilized (12.5% efficiency)
- Memory throughput: Only 6.52 GB/s (memory-bound)
- L1/TEX Cache: 67.14% hit rate with coalescing inefficiencies
Root Cause: Current thread-to-memory mapping creates stride patterns that fragment cache lines
Profiling Evidence (`fp32_to_bf16` type conversion):
- L1 Cache: Poor 11.88% hit rate (pure streaming pattern)
- L2 Cache: Low 34.73% hit rate (no data reuse)
- Memory busy: Only 30.31% despite 401.93 GB/s throughput
Root Cause: Scalar processing without vectorization, creating numerous small memory transactions
Profiling Evidence: 0% compression success rate across all kernels
Root Cause: Data layout not optimized for hardware compression algorithms
1. Vectorize Type Conversion Kernel
- Process multiple elements per thread with wider, vectorized loads and stores instead of the current scalar accesses
- Expected gain: 2x throughput improvement
2. Fix GEMM Memory Coalescing
- Reorganize thread indexing to ensure consecutive threads access consecutive memory locations
- Optimize shared memory loading patterns for better global memory efficiency
- Expected gain: 3-5x improvement in memory-bound GEMM operations
3. Tensor Core Integration
- Implement `mma.sync` instructions for native bfloat16 matrix operations
- Restructure data layouts to match Tensor Core memory requirements
- Expected gain: 2-4x speedup for larger batch sizes (2048+)
4. Eliminate Redundant Conversions
- Cache converted bfloat16 data across multiple forward passes
- Fuse type conversion with subsequent matrix operations
- Expected gain: 50-80% reduction in conversion overhead
5. Enable L2 Compression
- Align data structures to 128-byte cache line boundaries
- Experiment with data patterns that compress effectively
- Expected gain: 10-20% memory bandwidth reduction
6. Multi-Stream Pipeline
- Overlap encoder/decoder computations using multiple CUDA streams
- Pipeline batch processing for continuous GPU utilization
- Expected gain: 15-30% overall throughput improvement
This implementation is based on the original VAE paper by Kingma and Welling (2014) and uses pretrained weights from the gan-vae-pretrained-pytorch repository.