VAE Inference Implementation

This was my final project for Caltech's GPU Programming class!

A comprehensive Variational Autoencoder (VAE) inference system with both CPU and GPU implementations using PyTorch and CUDA. Features pretrained weights from the gan-vae-pretrained-pytorch repository for MNIST digit reconstruction.

Project Motivation

Understanding Variational Autoencoders

Variational Autoencoders are a class of deep generative models that combine neural networks with variational inference to learn meaningful latent representations of data. Originally introduced by Kingma and Welling in 2013, VAEs consist of an encoder network that maps input data to a probabilistic latent space and a decoder network that reconstructs data from latent samples. The key innovation is the reparameterization trick, which expresses latent samples as deterministic transformations of the encoder outputs and random noise. VAEs optimize a joint objective function combining reconstruction loss (measuring how well decoded samples match the original input) and KL divergence loss (ensuring the learned latent distribution approximates a prior distribution, typically standard Gaussian).
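
In symbols, with encoder outputs $\mu$ and $\log\sigma^2$, the reparameterization trick and the training objective take their standard forms:

$$z = \mu + \sigma \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I), \qquad \sigma = \exp\!\left(\tfrac{1}{2}\log\sigma^2\right)$$

$$\mathcal{L}(x) = \mathrm{BCE}(x, \hat{x}) + D_{\mathrm{KL}}\!\big(\mathcal{N}(\mu, \sigma^2)\,\big\|\,\mathcal{N}(0, I)\big)$$

Because $z$ is a deterministic function of $(\mu, \log\sigma^2, \varepsilon)$, gradients can flow through the sampling step during training; at inference time the same formulas are simply evaluated forward.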

Computational Challenges and GPU Acceleration

The VAE inference process is computationally intensive due to four main components:

  1. Forward passes through encoder and decoder networks involving multiple dense layers and activation functions
  2. The reparameterization trick requiring parallel sampling from Gaussian distributions
  3. Reconstruction loss calculations comparing high-dimensional input and output tensors
  4. KL divergence computations involving logarithmic and exponential operations across latent dimensions

Each of these operations is highly parallelizable, and the GPU implementation exploits this through custom CUDA kernels tuned to each component.
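
As a concrete illustration, the reparameterization step maps naturally to one thread per latent element. The kernel below is a minimal sketch, not the implementation in this repository's VAEKernels.cu; it assumes the Gaussian noise eps has already been sampled (e.g., with cuRAND):

#include <cuda_runtime.h>
#include <math.h>

// One thread per latent element: z = mu + exp(0.5 * logvar) * eps.
// mu, logvar, eps, and z are flat arrays of batch_size * latent_dim floats.
__global__ void reparameterize(const float* mu, const float* logvar,
                               const float* eps, float* z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        z[i] = mu[i] + expf(0.5f * logvar[i]) * eps[i];
    }
}

// Example launch for a batch of 64 samples and a 20-dimensional latent space:
//   int n = 64 * 20;
//   reparameterize<<<(n + 255) / 256, 256>>>(d_mu, d_logvar, d_eps, d_z, n);

Every element is independent, so the same one-thread-per-element pattern extends directly to the activation functions and elementwise loss terms.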

Features

CPU Implementation

  • Pure CPU implementation: Complete PyTorch-based VAE inference running entirely on CPU
  • Automatic weight downloading: Downloads pretrained weights from GitHub automatically
  • Reconstruction quality evaluation: Computes reconstruction loss and generates detailed analysis
  • High-quality visualization: Side-by-side comparison of original vs reconstructed MNIST digits
  • Comprehensive testing suite: Full test coverage with unit tests, validation tests, and benchmarks
  • Configurable parameters: Batch size, number of samples, random seed, timing modes

CUDA Implementation

  • High-performance GPU inference: Complete CUDA C++ implementation with custom kernels
  • Weight loading system: Converts PyTorch weights to optimized CUDA-compatible format
  • Custom CUDA kernels: Matrix multiplication, activation functions, and loss calculations
  • Multiple executables: Main inference, evaluation, and benchmarking tools
  • Safetensors support: Efficient weight loading with bfloat16 precision support

Architecture

The VAE model implements the following architecture:

  • Encoder: 784 → 400 → 20 (mean) and 20 (log-variance)
  • Latent space: 20-dimensional Gaussian distributions with reparameterization trick
  • Decoder: 20 → 400 → 784
  • Activation functions: ReLU for hidden layers, Sigmoid for output
  • Loss function: Binary Cross-Entropy (reconstruction) + KL Divergence (regularization)
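
Written out, the forward pass is as follows (the weight and bias names here are illustrative, not the identifiers stored in the checkpoint):

$$h = \mathrm{ReLU}(W_1 x + b_1) \qquad (784 \to 400)$$

$$\mu = W_{21} h + b_{21}, \qquad \log\sigma^2 = W_{22} h + b_{22} \qquad (400 \to 20 \text{ each})$$

$$z = \mu + \exp\!\left(\tfrac{1}{2}\log\sigma^2\right) \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$

$$\hat{x} = \mathrm{sigmoid}\big(W_4\,\mathrm{ReLU}(W_3 z + b_3) + b_4\big) \qquad (20 \to 400 \to 784)$$

and the KL term has the closed form

$$D_{\mathrm{KL}} = -\tfrac{1}{2}\sum_{j=1}^{20}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$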

Installation

Prerequisites

For CPU Implementation

  • Python 3.10+

For CUDA Implementation

  • NVIDIA GPU with CUDA Compute Capability 6.0+ (RTX A5000 has 8.6)
  • CUDA Toolkit 11.1+ (tested with 12.5)
  • CMake 3.27+ (tested with 4.0.0)
  • C++20 compatible compiler (GCC 9+ or Clang 10+, tested with GCC 11.4.0)

Installation Steps

  1. Install Python dependencies:
cd 179_vae_inference
pip install -r requirements.txt
  2. Download weights and MNIST data (required first step):
python3 vae_inference.py

This will automatically download the pretrained weights and MNIST dataset needed for both CPU and CUDA implementations.

  3. Convert PyTorch weights to CUDA format (required for CUDA inference):
python3 scripts/convert_vae_weights.py
  4. For the CUDA implementation, build the project:
mkdir -p build && cd build
cmake .. && cmake --build .
cd ..

Usage

CPU Implementation

Basic VAE Inference

Run the main inference analysis with default parameters:

python3 vae_inference.py

This will generate:

  • Reconstruction comparison image (vae_analysis_output/reconstruction_analysis.png): Side-by-side visualization of original vs reconstructed MNIST digits
  • Average reconstruction loss: Computed over 200 test samples (expected: approximately 100.3107)

CPU VAE Reconstruction Analysis

In the reconstruction analysis image, the top row shows the original MNIST digits and the bottom row shows the VAE reconstructions.

Note on Reconstruction Loss: The average reconstruction loss may appear high, but this is normal for MNIST VAE models. The loss is the sum of Binary Cross-Entropy over all 784 pixels, so even small per-pixel errors accumulate; the per-pixel loss better reflects the actual reconstruction quality.
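
For example, the expected average loss of approximately 100.31 works out to 100.31 / 784 ≈ 0.128 BCE per pixel.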

Advanced Usage with Custom Parameters

python3 vae_inference.py --inference-batch 64 --evaluation-samples 200 --random-state 42 --output-path my_results

CPU Timing Benchmark

python3 vae_inference.py --timing-mode

Available Arguments for CPU Implementation

  • --inference-batch: Batch size for inference processing (default: 64, minimum: 2)
  • --random-state: Random seed for reproducibility (default: 42)
  • --evaluation-samples: Total samples to evaluate for metrics (default: 200)
  • --model-weights-url: URL for downloading pretrained model weights
  • --output-path: Directory path for saving analysis results (default: 'vae_analysis_output')
  • --timing-mode: Run CPU timing benchmark instead of full analysis

CUDA Implementation

Basic CUDA Inference

Run the main CUDA inference with pretrained weights:

./build/cuda_vae

CUDA Evaluation with MNIST Data

Run evaluation on real MNIST data with visualization:

./build/run_eval

This will generate:

  • Reconstruction comparison image (reconstruction_analysis.png): Side-by-side visualization of original vs reconstructed MNIST digits
  • Average reconstruction loss: Computed over 200 test samples (expected: approximately 99.6678)

CUDA VAE Reconstruction Analysis

In the reconstruction analysis image, the top row shows the original MNIST digits and the bottom row shows the VAE reconstructions.

Using Custom Weights Path

export VAE_WEIGHTS_PATH="/path/to/my/weights.safetensors"
./build/cuda_vae

CUDA Performance Benchmark

./build/cuda_timing_benchmark <batch_size> <num_iterations>

Available Arguments for CUDA Timing Benchmark

  • <batch_size>: Batch size for benchmarking (positional argument)
  • <num_iterations>: Number of iterations to run for timing (positional argument)

Weight Conversion and Verification

Convert PyTorch weights to CUDA format:

python3 scripts/convert_vae_weights.py --input pretrained_vae_weights.pth --output pretrained_vae_weights.safetensors

Available Arguments for Weight Conversion

  • --input, -i: Input .pth file path (default: 'pretrained_vae_weights.pth')
  • --output, -o: Output safetensors file path (default: 'pretrained_vae_weights.safetensors')
  • --no-bfloat16: Keep float32 instead of converting to bfloat16 (flag)

Verify weight conversion accuracy:

python3 scripts/verify_weights.py --tolerance 0.01

Available Arguments for Weight Verification

  • --original, -o: Original .pth file path (default: 'pretrained_vae_weights.pth')
  • --converted, -c: Converted safetensors file path (default: 'pretrained_vae_weights.safetensors')
  • --tolerance, -t: Tolerance for comparison (default: 1e-3)

Performance Benchmarking

To compare CPU vs CUDA performance, use these benchmarking commands with large batch sizes:

Batch Size 1024

# CPU timing (2M samples over 2000 runs)
python3 vae_inference.py --timing-mode --inference-batch 1024 --evaluation-samples 2048000

# CUDA timing (2000 runs)
./build/cuda_timing_benchmark 1024 2000

Expected Results: CPU ~7.069 ms/batch, GPU ~2.28392 ms/batch (3.10x speedup)

Batch Size 2048

# CPU timing
python3 vae_inference.py --timing-mode --inference-batch 2048 --evaluation-samples 2048000

# CUDA timing
./build/cuda_timing_benchmark 2048 2000

Expected Results: CPU ~13.531 ms/batch, GPU ~3.27272 ms/batch (4.13x speedup)

Batch Size 4096

# CPU timing
python3 vae_inference.py --timing-mode --inference-batch 4096 --evaluation-samples 2048000

# CUDA timing
./build/cuda_timing_benchmark 4096 2000

Expected Results: CPU ~26.788 ms/batch, GPU ~5.24699 ms/batch (5.11x speedup)

Note: Results are stochastic and may vary slightly between runs, but the speedup trends should be consistent.

Testing

CPU Testing Suite

Run all CPU tests:

python3 run_tests.py

Custom test configuration:

python3 run_tests.py --output-dir custom_test_results --batch-size 32

Available Arguments for CPU Testing

  • --output-dir: Directory to save test results (default: 'test_results')
  • --model-weights-url: URL for pretrained model weights
  • --batch-size: Batch size for data loading tests (default: 32, minimum: 2)

CUDA Testing Suite

Run all CUDA tests:

python3 run_cuda_tests.py

Custom CUDA test configuration:

python3 run_cuda_tests.py --output-dir cuda_test_results --gpu-id 0 --timeout 300

Available Arguments for CUDA Testing

  • --output-dir: Directory to save test results (default: 'cuda_test_results')
  • --cuda-vae-exe: Path to CUDA VAE executable (default: './build/cuda_vae')
  • --run-eval-exe: Path to run_eval executable (default: './build/run_eval')
  • --batch-size: Batch size for testing (default: 32, minimum: 2)
  • --gpu-id: GPU device ID to use for testing (default: 0)
  • --timeout: Timeout in seconds for executable runs (default: 300)

Project Structure

Core Implementation Files

├── vae_inference.py               # Main CPU inference script
├── run_tests.py                   # CPU testing suite
├── run_cuda_tests.py              # CUDA testing suite
├── requirements.txt               # Python dependencies
├── CMakeLists.txt                 # CMake build configuration
├── scripts/
│   ├── convert_vae_weights.py     # PyTorch to safetensors conversion
│   └── verify_weights.py          # Weight conversion verification
├── src/                           # CUDA source files
│   ├── main_inference.cu          # Main CUDA inference entry point
│   ├── run_eval.cu                # CUDA evaluation with MNIST
│   ├── cuda_timing_benchmark.cu   # CUDA performance benchmark
│   ├── VAEModel.cu                # CUDA VAE model implementation
│   ├── VAEKernels.cu              # Custom CUDA kernels
│   ├── VAELoader.cu               # Weight loading system
│   ├── MatrixVectorMultiply.cu    # Matrix multiplication kernels
│   ├── MnistLoader.cpp            # MNIST data loading
│   └── EvalHelpers.cu             # Evaluation utilities
├── include/                       # CUDA header files
│   ├── VAEModel.cuh               # VAE model interface
│   ├── VAEKernels.cuh             # Kernel declarations
│   ├── VAELoader.h                # Weight loader interface
│   └── ...
├── test_vae_inference.py          # CPU unit tests
├── test_cuda_vae_inference.py     # CUDA unit tests
├── vae_verification_utils.py      # CPU testing utilities
└── cuda_vae_verification_utils.py # CUDA testing utilities

Technical Implementation Details

Similar to the previous transformer lab, this implementation uses a mixed-precision approach for optimal performance:

  • Model weights and storage: bfloat16 format, halving memory traffic relative to float32
  • Internal computations: float32 precision for accumulation within CUDA kernels to minimize floating-point rounding errors
  • Memory layout: All tensors use row-major ordering (contiguous along last dimension) for simplicity, though this may not be optimal for all operations
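
As a minimal sketch of this pattern (illustrative, not the repository's actual GEMM code), a single dot product would read bfloat16 operands but accumulate in a float32 register:

#include <cuda_bf16.h>

// Dot product of a bf16 weight row with a bf16 input vector.
// Storage is bfloat16 (2 bytes per element), halving global-memory
// traffic, while the float32 accumulator limits rounding error over
// the 784-element sums in the first encoder layer.
__device__ float dot_bf16(const __nv_bfloat16* w_row,
                          const __nv_bfloat16* x, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc += __bfloat162float(w_row[i]) * __bfloat162float(x[i]);
    }
    return acc;
}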

Potential Improvements

Based on profiling analysis using NVIDIA Nsight Compute (ncu --export build/profile.ncu-rep --force-overwrite --set full ./build/cuda_timing_benchmark 1024 1), we have identified specific performance bottlenecks and actionable optimization opportunities:

(Nsight Compute screenshots: overall profile plus per-kernel views of the GEMM, ReLU, and FP32→BF16 conversion kernels.)

Current Performance Assessment

System Performance:

  • Current GPU speedup: 3-5x over CPU baseline (1024-4096 batch sizes)
  • Target potential: Additional 2-5x speedup with optimizations below

Kernel-Specific Status:

  • GEMM Kernels (gemm_bf16_relu/sigmoid): Good compute utilization (70%+) but memory-bound
  • Type Conversion (fp32_to_bf16): High bandwidth (401.93 GB/s) but inefficient utilization (30.31% busy)
  • VAE Operations: Well-fused encoder heads and loss computations

Issues Identified

1. Memory Coalescing (GEMM Kernels)

Profiling Evidence:

  • Global loads: Only 6.3 of 32 bytes per sector utilized (19.7% efficiency)
  • Global stores: Only 4.0 of 32 bytes per sector utilized (12.5% efficiency)
  • Memory throughput: Only 6.52 GB/s (memory-bound)
  • L1/TEX Cache: 67.14% hit rate with coalescing inefficiencies

Root Cause: Current thread-to-memory mapping creates stride patterns that fragment cache lines

2. Type Conversion Inefficiencies (fp32_to_bf16 Kernel)

Profiling Evidence:

  • L1 Cache: Poor 11.88% hit rate (pure streaming pattern)
  • L2 Cache: Low 34.73% hit rate (no data reuse)
  • Memory busy: Only 30.31% despite 401.93 GB/s throughput

Root Cause: Scalar processing without vectorization, creating numerous small memory transactions

3. L2 Compression Completely Unused

Profiling Evidence: 0% compression success rate across all kernels

Root Cause: Data layout not optimized for hardware compression algorithms

Specific Optimization Plan

1. Vectorize Type Conversion Kernel

  • Expected gain: 2x throughput improvement
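
A minimal sketch of this vectorization (assuming an even element count and 8-byte-aligned buffers; this is not the current fp32_to_bf16 kernel):

#include <cuda_bf16.h>

// Each thread loads one float2 (8 bytes) and stores one __nv_bfloat162
// (4 bytes), halving the transaction count of a scalar
// one-element-per-thread conversion. n_pairs = total_elements / 2.
__global__ void fp32_to_bf16_vec2(const float2* __restrict__ in,
                                  __nv_bfloat162* __restrict__ out,
                                  int n_pairs) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_pairs) {
        out[i] = __float22bfloat162_rn(in[i]);
    }
}

Widening further to float4 loads would put even more bytes in flight per transaction.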

2. Fix GEMM Memory Coalescing

  • Reorganize thread indexing to ensure consecutive threads access consecutive memory locations
  • Optimize shared memory loading patterns for better global memory efficiency
  • Expected gain: 3-5x improvement in memory-bound GEMM operations
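
The shape of the fix is the textbook tiled GEMM below (a float32 sketch for clarity, not the repository's bf16 kernels): consecutive threadIdx.x values read consecutive addresses, so each warp's 32 loads coalesce into a few 128-byte transactions.

#define TILE 16

// C = A * B with A (M x K), B (K x N), C (M x N), all row-major.
// Launch: dim3 block(TILE, TILE), grid((N + TILE - 1) / TILE,
//                                      (M + TILE - 1) / TILE).
__global__ void gemm_tiled(const float* A, const float* B, float* C,
                           int M, int N, int K) {
    __shared__ float tA[TILE][TILE];
    __shared__ float tB[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;  // contiguous across the warp
        int b_row = t * TILE + threadIdx.y;
        tA[threadIdx.y][threadIdx.x] =
            (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        tB[threadIdx.y][threadIdx.x] =
            (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += tA[threadIdx.y][k] * tB[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N)
        C[row * N + col] = acc;
}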

3. Tensor Core Integration

  • Implement mma.sync instructions for native bfloat16 matrix operations
  • Restructure data layouts to match Tensor Core memory requirements
  • Expected gain: 2-4x speedup for larger batch sizes (2048+)
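
A sketch of the WMMA route (requires compute capability 8.0+ for __nv_bfloat16 fragments; M, N, K are assumed to be multiples of 16, and this is not code from the repository):

#include <mma.h>
#include <cuda_bf16.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of C = A * B (all row-major);
// wmma::mma_sync compiles down to mma.sync Tensor Core instructions.
// Launch: dim3 block(32), grid(N / 16, M / 16).
__global__ void wmma_gemm_bf16(const __nv_bfloat16* A,
                               const __nv_bfloat16* B,
                               float* C, int M, int N, int K) {
    int tile_m = blockIdx.y * 16;
    int tile_n = blockIdx.x * 16;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, __nv_bfloat16, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __nv_bfloat16, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k = 0; k < K; k += 16) {
        // Warp-collective loads; the last argument is the row stride.
        wmma::load_matrix_sync(a, A + tile_m * K + k, K);
        wmma::load_matrix_sync(b, B + k * N + tile_n, N);
        wmma::mma_sync(acc, a, b, acc);
    }
    wmma::store_matrix_sync(C + tile_m * N + tile_n, acc, N,
                            wmma::mem_row_major);
}

Note that this preserves the existing mixed-precision scheme: bfloat16 operands with float32 accumulation.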

4. Eliminate Redundant Conversions

  • Cache converted bfloat16 data across multiple forward passes
  • Fuse type conversion with subsequent matrix operations
  • Expected gain: 50-80% reduction in conversion overhead

5. Enable L2 Compression

  • Align data structures to 128-byte cache line boundaries
  • Experiment with data patterns that compress effectively
  • Expected gain: 10-20% memory bandwidth reduction

6. Multi-Stream Pipeline

  • Overlap encoder/decoder computations using multiple CUDA streams
  • Pipeline batch processing for continuous GPU utilization
  • Expected gain: 15-30% overall throughput improvement
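
A host-side sketch of double-buffered streaming; encoder_forward, reparameterize, decoder_forward, and the buffer names are hypothetical stand-ins for this project's kernels:

#include <cuda_runtime.h>

// Double-buffered pipeline: while one stream still runs batch i,
// the other can already copy in and process batch i + 1. Host
// buffers must be pinned (cudaMallocHost) for the async copies
// to overlap with kernel execution.
void run_pipelined(int num_batches, size_t batch_bytes,
                   float** h_in, float** h_out,
                   float* d_in[2], float* d_z[2], float* d_out[2]) {
    cudaStream_t streams[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

    for (int i = 0; i < num_batches; ++i) {
        int s = i % 2;
        cudaStreamSynchronize(streams[s]);  // device buffer set s is free
        cudaMemcpyAsync(d_in[s], h_in[i], batch_bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        // Hypothetical kernel calls, all enqueued on the same stream:
        //   encoder_forward<<<grid, block, 0, streams[s]>>>(d_in[s], ...);
        //   reparameterize<<<grid, block, 0, streams[s]>>>(..., d_z[s]);
        //   decoder_forward<<<grid, block, 0, streams[s]>>>(d_z[s], d_out[s]);
        cudaMemcpyAsync(h_out[i], d_out[s], batch_bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int s = 0; s < 2; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}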

License

This implementation is based on the original VAE paper by Kingma and Welling (2014) and uses pretrained weights from the gan-vae-pretrained-pytorch repository.
