This was my final project for Caltech's GPU Programming class!
A comprehensive Variational Autoencoder (VAE) inference system with both CPU and GPU implementations using PyTorch and CUDA. Features pretrained weights from the gan-vae-pretrained-pytorch repository for MNIST digit reconstruction.
Variational Autoencoders are a class of deep generative models that combine neural networks with variational inference to learn meaningful latent representations of data. Originally introduced by Kingma and Welling in 2013, VAEs consist of an encoder network that maps input data to a probabilistic latent space and a decoder network that reconstructs data from latent samples. The key innovation is the reparameterization trick, which expresses latent samples as deterministic transformations of the encoder outputs and random noise. VAEs optimize a joint objective function combining reconstruction loss (measuring how well decoded samples match the original input) and KL divergence loss (ensuring the learned latent distribution approximates a prior distribution, typically standard Gaussian).
The VAE inference process is computationally intensive due to four main components:
- Forward passes through encoder and decoder networks involving multiple dense layers and activation functions
- The reparameterization trick requiring parallel sampling from Gaussian distributions
- Reconstruction loss calculations comparing high-dimensional input and output tensors
- KL divergence computations involving logarithmic and exponential operations across latent dimensions
Each operation exhibits high parallelism potential. The GPU implementation exploits this parallelism through custom CUDA kernels optimized for each computational component.
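For reference, the sketch below shows the per-element math behind these components in plain PyTorch (illustrative only, with made-up tensor shapes for a batch of 64; it is not the project's actual code):

```python
import torch
import torch.nn.functional as F

# Illustrative tensors: a batch of 64 flattened 28x28 MNIST images and encoder outputs.
x = torch.rand(64, 784)                              # pixel values in [0, 1]
mu, logvar = torch.randn(64, 20), torch.randn(64, 20)

# Reparameterization trick: z = mu + sigma * eps, sampled independently per element.
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * logvar) * eps

# Stand-in for the decoder output (sigmoid activations in [0, 1]).
recon_x = torch.rand(64, 784)

# Reconstruction loss: binary cross-entropy summed over all 784 pixels.
bce = F.binary_cross_entropy(recon_x, x, reduction="sum")

# KL divergence to a standard Gaussian prior, in closed form over the 20 latent dims.
kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

loss = bce + kld
```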
- Pure CPU implementation: Complete PyTorch-based VAE inference running entirely on CPU
- Automatic weight downloading: Downloads pretrained weights from GitHub automatically
- Reconstruction quality evaluation: Computes reconstruction loss and generates detailed analysis
- High-quality visualization: Side-by-side comparison of original vs reconstructed MNIST digits
- Comprehensive testing suite: Full test coverage with unit tests, validation tests, and benchmarks
- Configurable parameters: Batch size, number of samples, random seed, timing modes
- High-performance GPU inference: Complete CUDA C++ implementation with custom kernels
- Weight loading system: Converts PyTorch weights to optimized CUDA-compatible format
- Custom CUDA kernels: Matrix multiplication, activation functions, and loss calculations
- Multiple executables: Main inference, evaluation, and benchmarking tools
- Safetensors support: Efficient weight loading with bfloat16 precision support
The VAE model implements the following architecture:
- Encoder: 784 → 400 → 20 (mean) and 20 (log-variance)
- Latent space: 20-dimensional Gaussian distributions with reparameterization trick
- Decoder: 20 → 400 → 784
- Activation functions: ReLU for hidden layers, Sigmoid for output
- Loss function: Binary Cross-Entropy (reconstruction) + KL Divergence (regularization)
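A PyTorch module matching these layer sizes might look like the sketch below (based on the standard MNIST VAE layout; the repository's actual class and attribute names may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Sketch of the 784 -> 400 -> 20 -> 400 -> 784 VAE described above."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 400)    # encoder hidden layer
        self.fc21 = nn.Linear(400, 20)    # mean head
        self.fc22 = nn.Linear(400, 20)    # log-variance head
        self.fc3 = nn.Linear(20, 400)     # decoder hidden layer
        self.fc4 = nn.Linear(400, 784)    # decoder output layer

    def encode(self, x):
        h1 = F.relu(self.fc1(x))
        return self.fc21(h1), self.fc22(h1)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def decode(self, z):
        h3 = F.relu(self.fc3(z))
        return torch.sigmoid(self.fc4(h3))

    def forward(self, x):
        mu, logvar = self.encode(x.view(-1, 784))
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar
```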
- Python 3.10+
- NVIDIA GPU with CUDA Compute Capability 6.0+ (RTX A5000 has 8.6)
- CUDA Toolkit 11.1+ (tested with 12.5)
- CMake 3.27+ (tested with 4.0.0)
- C++20 compatible compiler (GCC 9+ or Clang 10+, tested with GCC 11.4.0)
- Install Python dependencies:
cd 179_vae_inference
pip install -r requirements.txt
- Download weights and MNIST data (required first step):
python3 vae_inference.py
This will automatically download the pretrained weights and MNIST dataset needed for both CPU and CUDA implementations.
- Convert PyTorch weights to CUDA format (required for CUDA inference):
python3 scripts/convert_vae_weights.py
- For CUDA implementation, build the project:
mkdir -p build && cd build
cmake .. && cmake --build .
cd ..
Run the main inference analysis with default parameters:
python3 vae_inference.py
This will generate:
- Reconstruction comparison image (`vae_analysis_output/reconstruction_analysis.png`): side-by-side visualization of original vs reconstructed MNIST digits
- Average reconstruction loss: computed over 200 test samples (expected: approximately 100.3107)
In the reconstruction analysis image: top row shows original MNIST digits, bottom row shows VAE reconstructions.
Note on Reconstruction Loss: The average reconstruction loss may appear high, but this is expected for MNIST VAE models. The loss is the Binary Cross-Entropy summed over all 784 pixels, so even small per-pixel errors accumulate; a total of ~100.31 corresponds to about 0.13 BCE per pixel, which better reflects the actual reconstruction quality.
Run with custom parameters, or in CPU timing-only mode:
python3 vae_inference.py --inference-batch 64 --evaluation-samples 200 --random-state 42 --output-path my_results
python3 vae_inference.py --timing-mode
- `--inference-batch`: Batch size for inference processing (default: 64, minimum: 2)
- `--random-state`: Random seed for reproducibility (default: 42)
- `--evaluation-samples`: Total samples to evaluate for metrics (default: 200)
- `--model-weights-url`: URL for downloading pretrained model weights
- `--output-path`: Directory path for saving analysis results (default: 'vae_analysis_output')
- `--timing-mode`: Run CPU timing benchmark instead of full analysis
Run the main CUDA inference with pretrained weights:
./build/cuda_vae
Run evaluation on real MNIST data with visualization:
./build/run_eval
This will generate:
- Reconstruction comparison image (`reconstruction_analysis.png`): side-by-side visualization of original vs reconstructed MNIST digits
- Average reconstruction loss: computed over 200 test samples (expected: approximately 99.6678)
In the reconstruction analysis image: top row shows original MNIST digits, bottom row shows VAE reconstructions.
To run inference with custom pretrained weights, point the `VAE_WEIGHTS_PATH` environment variable at the desired safetensors file:
export VAE_WEIGHTS_PATH="/path/to/my/weights.safetensors"
./build/cuda_vae
Run the standalone CUDA timing benchmark:
./build/cuda_timing_benchmark <batch_size> <num_iterations>
- `<batch_size>`: Batch size for benchmarking (positional argument)
- `<num_iterations>`: Number of iterations to run for timing (positional argument)
python3 scripts/convert_vae_weights.py --input pretrained_vae_weights.pth --output pretrained_vae_weights.safetensors
- `--input`, `-i`: Input .pth file path (default: 'pretrained_vae_weights.pth')
- `--output`, `-o`: Output safetensors file path (default: 'pretrained_vae_weights.safetensors')
- `--no-bfloat16`: Keep float32 instead of converting to bfloat16 (flag)
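For reference, the core of such a conversion can be expressed with `torch` and `safetensors` roughly as follows (a simplified sketch, not the script's exact code):

```python
import torch
from safetensors.torch import save_file

# Load the pretrained PyTorch checkpoint (assumed here to be a plain state_dict of tensors).
state_dict = torch.load("pretrained_vae_weights.pth", map_location="cpu")

# Optionally cast to bfloat16 to halve storage size and memory bandwidth.
converted = {name: t.to(torch.bfloat16).contiguous() for name, t in state_dict.items()}

# Write the tensors out in the safetensors format consumed by the CUDA loader.
save_file(converted, "pretrained_vae_weights.safetensors")
```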
Verify that the converted weights match the original PyTorch weights:
python3 scripts/verify_weights.py --tolerance 0.01
- `--original`, `-o`: Original .pth file path (default: 'pretrained_vae_weights.pth')
- `--converted`, `-c`: Converted safetensors file path (default: 'pretrained_vae_weights.safetensors')
- `--tolerance`, `-t`: Tolerance for comparison (default: 1e-3)
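Conceptually, the verification loads both files and compares each tensor against the tolerance; a simplified sketch:

```python
import torch
from safetensors.torch import load_file

original = torch.load("pretrained_vae_weights.pth", map_location="cpu")
converted = load_file("pretrained_vae_weights.safetensors")

tolerance = 1e-3
for name, ref in original.items():
    # Upcast the bfloat16 copy back to float32 before comparing.
    diff = (ref.float() - converted[name].float()).abs().max().item()
    status = "OK" if diff <= tolerance else "MISMATCH"
    print(f"{name}: max abs diff = {diff:.6f} ({status})")
```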
To compare CPU vs CUDA performance, use these benchmarking commands with large batch sizes:
# CPU timing (2M samples over 2000 runs)
python3 vae_inference.py --timing-mode --inference-batch 1024 --evaluation-samples 2048000
# CUDA timing (2000 runs)
./build/cuda_timing_benchmark 1024 2000
Expected Results: CPU ~7.069 ms/batch, GPU ~2.28392 ms/batch (3.10x speedup)
# CPU timing
python3 vae_inference.py --timing-mode --inference-batch 2048 --evaluation-samples 2048000
# CUDA timing
./build/cuda_timing_benchmark 2048 2000
Expected Results: CPU ~13.531 ms/batch, GPU ~3.27272 ms/batch (4.13x speedup)
# CPU timing
python3 vae_inference.py --timing-mode --inference-batch 4096 --evaluation-samples 2048000
# CUDA timing
./build/cuda_timing_benchmark 4096 2000
Expected Results: CPU ~26.788 ms/batch, GPU ~5.24699 ms/batch (5.11x speedup)
Note: Results are stochastic and may vary slightly between runs, but the speedup trends should be consistent.
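The reported speedups are simply the ratio of per-batch latencies; a small helper (hypothetical, using the numbers above) makes the relationship explicit:

```python
def report(batch_size, cpu_ms, gpu_ms):
    # Speedup is the ratio of CPU to GPU per-batch latency; throughput is samples per second.
    speedup = cpu_ms / gpu_ms
    gpu_throughput = batch_size / (gpu_ms / 1000.0)
    print(f"batch {batch_size}: {speedup:.2f}x speedup, {gpu_throughput:,.0f} samples/s on GPU")

report(1024, 7.069, 2.28392)   # -> ~3.10x speedup
report(2048, 13.531, 3.27272)  # -> ~4.13x speedup
report(4096, 26.788, 5.24699)  # -> ~5.11x speedup
```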
Run the CPU test suite with default settings, or with custom options:
python3 run_tests.py
python3 run_tests.py --output-dir custom_test_results --batch-size 32
- `--output-dir`: Directory to save test results (default: 'test_results')
- `--model-weights-url`: URL for pretrained model weights
- `--batch-size`: Batch size for data loading tests (default: 32, minimum: 2)
Run the CUDA test suite with default settings, or with custom options:
python3 run_cuda_tests.py
python3 run_cuda_tests.py --output-dir cuda_test_results --gpu-id 0 --timeout 300
- `--output-dir`: Directory to save test results (default: 'cuda_test_results')
- `--cuda-vae-exe`: Path to CUDA VAE executable (default: './build/cuda_vae')
- `--run-eval-exe`: Path to run_eval executable (default: './build/run_eval')
- `--batch-size`: Batch size for testing (default: 32, minimum: 2)
- `--gpu-id`: GPU device ID to use for testing (default: 0)
- `--timeout`: Timeout in seconds for executable runs (default: 300)
├── vae_inference.py # Main CPU inference script
├── run_tests.py # CPU testing suite
├── run_cuda_tests.py # CUDA testing suite
├── requirements.txt # Python dependencies
├── CMakeLists.txt # CMake build configuration
├── scripts/
│ ├── convert_vae_weights.py # PyTorch to safetensors conversion
│ └── verify_weights.py # Weight conversion verification
├── src/ # CUDA source files
│ ├── main_inference.cu # Main CUDA inference entry point
│ ├── run_eval.cu # CUDA evaluation with MNIST
│ ├── cuda_timing_benchmark.cu # CUDA performance benchmark
│ ├── VAEModel.cu # CUDA VAE model implementation
│ ├── VAEKernels.cu # Custom CUDA kernels
│ ├── VAELoader.cu # Weight loading system
│ ├── MatrixVectorMultiply.cu # Matrix multiplication kernels
│ ├── MnistLoader.cpp # MNIST data loading
│ └── EvalHelpers.cu # Evaluation utilities
├── include/ # CUDA header files
│ ├── VAEModel.cuh # VAE model interface
│ ├── VAEKernels.cuh # Kernel declarations
│ ├── VAELoader.h # Weight loader interface
│ └── ...
├── test_vae_inference.py # CPU unit tests
├── test_cuda_vae_inference.py # CUDA unit tests
├── vae_verification_utils.py # CPU testing utilities
└── cuda_vae_verification_utils.py # CUDA testing utilities
Similar to the previous transformer lab, this implementation uses a mixed-precision approach for optimal performance:
- Model weights and storage: `bfloat16` format to reduce memory bandwidth compared to `float32`
- Internal computations: `float32` precision for accumulation within CUDA kernels to minimize floating-point rounding errors
- Memory layout: all tensors use row-major ordering (contiguous along last dimension) for simplicity, though this may not be optimal for all operations
Based on profiling analysis using NVIDIA Nsight Compute (`ncu --export build/profile.ncu-rep --force-overwrite --set full ./build/cuda_timing_benchmark 1024 1`), we have identified specific performance bottlenecks and actionable optimization opportunities:
System Performance:
- Current GPU speedup: 3-5x over CPU baseline (1024-4096 batch sizes)
- Target potential: Additional 2-5x speedup with optimizations below
Kernel-Specific Status:
- GEMM Kernels (`gemm_bf16_relu/sigmoid`): Good compute utilization (70%+) but memory-bound
- Type Conversion (`fp32_to_bf16`): High bandwidth (401.93 GB/s) but inefficient utilization (30.31% busy)
- VAE Operations: Well-fused encoder heads and loss computations
Profiling Evidence (GEMM kernels):
- Global loads: Only 6.3 of 32 bytes per sector utilized (19.7% efficiency)
- Global stores: Only 4.0 of 32 bytes per sector utilized (12.5% efficiency)
- Memory throughput: Only 6.52 GB/s (memory-bound)
- L1/TEX Cache: 67.14% hit rate with coalescing inefficiencies
Root Cause: Current thread-to-memory mapping creates stride patterns that fragment cache lines
Profiling Evidence (`fp32_to_bf16` type conversion):
- L1 Cache: Poor 11.88% hit rate (pure streaming pattern)
- L2 Cache: Low 34.73% hit rate (no data reuse)
- Memory busy: Only 30.31% despite 401.93 GB/s throughput
Root Cause: Scalar processing without vectorization, creating numerous small memory transactions
Profiling Evidence: 0% compression success rate across all kernels
Root Cause: Data layout not optimized for hardware compression algorithms
1. Vectorize Type Conversion Kernel
- Process multiple elements per thread with wider, vectorized loads and stores instead of the current scalar accesses
- Expected gain: 2x throughput improvement
2. Fix GEMM Memory Coalescing
- Reorganize thread indexing to ensure consecutive threads access consecutive memory locations
- Optimize shared memory loading patterns for better global memory efficiency
- Expected gain: 3-5x improvement in memory-bound GEMM operations
3. Tensor Core Integration
- Implement `mma.sync` instructions for native bfloat16 matrix operations
- Restructure data layouts to match Tensor Core memory requirements
- Expected gain: 2-4x speedup for larger batch sizes (2048+)
4. Eliminate Redundant Conversions
- Cache converted bfloat16 data across multiple forward passes
- Fuse type conversion with subsequent matrix operations
- Expected gain: 50-80% reduction in conversion overhead
5. Enable L2 Compression
- Align data structures to 128-byte cache line boundaries
- Experiment with data patterns that compress effectively
- Expected gain: 10-20% memory bandwidth reduction
6. Multi-Stream Pipeline
- Overlap encoder/decoder computations using multiple CUDA streams
- Pipeline batch processing for continuous GPU utilization
- Expected gain: 15-30% overall throughput improvement
This implementation is based on the original VAE paper by Kingma and Welling (2014) and uses pretrained weights from the gan-vae-pretrained-pytorch repository.