Scaling Elections with GPUs


This repository implements tiled parallel adaptations of the Schulze voting algorithm, hardware-accelerated on CPUs and GPUs, in Mojo and in CUDA C++ wrapped into Python. The algorithm is often used by Pirate Parties and open-source foundations, and it is a good example of a combinatorial problem that can be parallelized by changing the evaluation order.
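
To make that concrete, here is a minimal NumPy sketch of the serial widest-path core that the tiled kernels accelerate; the function names are illustrative, not the repository's actual API:

import numpy as np

def strongest_paths(preferences):
    # preferences[i, j] = number of voters ranking candidate i above candidate j
    n = preferences.shape[0]
    # Keep only direct pairwise victories as the initial path strengths
    strengths = np.where(preferences > preferences.T, preferences, 0)
    np.fill_diagonal(strengths, 0)
    # Widest-path relaxation (Floyd-Warshall style): the serial order walks the
    # intermediate candidate `k` one step at a time; the tiled kernels regroup
    # this loop nest into blocks, which is what exposes the parallelism
    for k in range(n):
        strengths = np.maximum(strengths, np.minimum(strengths[:, k:k + 1], strengths[k:k + 1, :]))
    return strengths

def schulze_winners(strengths):
    # A candidate wins if no other candidate beats them on path strength
    return np.flatnonzero((strengths >= strengths.T).all(axis=1))

Regrouping that k-loop into blocks is the "change of evaluation order" mentioned above, and the tile sizes of those blocks are what the --cpu-tile-size and --gpu-tile-size options below configure.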

  • The Mojo implementation is packed into a single file, scaling_elections.mojo, containing both the CPU and GPU implementations and the benchmarking code.
  • The Python implementation is a sandwich of scaling_elections.cu, the CUDA C++ code wrapped with PyBind11, and scaling_elections.py, which packs the Python benchmarking code and the Numba reference kernels.

Not a single line of CMake is used in this repository! The entire native library build is packed into setup.py for Python, and pixi takes care of the Mojo build.
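
As a rough illustration of what that looks like, here is a hypothetical, stripped-down setup.py that compiles a CUDA C++ PyBind11 extension with nvcc straight from setuptools; the real setup.py in this repository handles flags, GPU architectures, and fallbacks more carefully, and the module name below is made up:

import os
import subprocess
import sysconfig

import pybind11
from setuptools import Extension, setup
from setuptools.command.build_ext import build_ext

class NvccBuildExt(build_ext):
    # Compile each .cu source with nvcc into a Python extension module
    def build_extension(self, ext):
        output = self.get_ext_fullpath(ext.name)
        os.makedirs(os.path.dirname(output), exist_ok=True)
        subprocess.check_call([
            "nvcc", "-O3", "-std=c++17", "--shared",
            "-Xcompiler", "-fPIC",
            "-I" + pybind11.get_include(),
            "-I" + sysconfig.get_paths()["include"],
            *ext.sources, "-o", output,
        ])

setup(
    name="scaling-elections-demo",  # hypothetical package name
    ext_modules=[Extension("schulze_cuda", ["scaling_elections.cu"])],
    cmdclass={"build_ext": NvccBuildExt},
)

With the compiler invocation spelled out in plain Python, there is nothing left for CMake to do.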

Usage

Both the Python and Mojo implementations support the same CLI arguments:

  • --no-serial: Skip serial baseline
  • --num-candidates N: Number of candidates (default: 128)
  • --num-voters N: Number of voters (default: 2000)
  • --run-cpu: Run CPU implementations
  • --run-gpu: Run GPU implementation
  • --cpu-tile-size N: CPU tile size (default: 16)
  • --gpu-tile-size N: GPU tile size (default: 32)
  • --help, -h: Show help message

Python

Clone the repository:

git clone https://github.com/ashvardanian/ScalingElections.git
cd ScalingElections

Build the environment and run with uv:

uv venv -p python3.12               # Pick a recent Python version
uv sync --extra cpu --extra gpu     # Build locally and install dependencies
uv run scaling_elections.py         # Run the default problem size
uv run scaling_elections.py --num-candidates 4096 --num-voters 4096 --run-cpu --run-gpu

Alternatively, with your local environment:

pip install -e . --force-reinstall  # Build locally and install dependencies
python scaling_elections.py         # Run the default problem size

Mojo

This repository also includes a pure Mojo implementation in scaling_elections.mojo. To install and run it, use pixi:

pixi install
pixi run mojo scaling_elections.mojo --help
pixi run mojo scaling_elections.mojo --num-candidates 4096 --num-voters 4096 --run-cpu --run-gpu

Or compile and run as a standalone binary:

pixi run mojo build scaling_elections.mojo -o schulze
./schulze --num-candidates 4096 --run-cpu --run-gpu


Throughput

Similar to measuring matrix multiplications in FLOPS, we can measure the throughput of the Schulze algorithm in cells per second, or, at this scale, in GigaCells per Second (gcs), where a cell is a single pairwise comparison between two candidates.
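
As a back-of-the-envelope illustration only: assuming the benchmark counts every update of the n-by-n path-strength matrix over all n relaxation steps, i.e. roughly n³ cell updates per run (the repository's exact accounting may differ), the metric is just a division:

def gigacells_per_second(num_candidates, seconds):
    # Assumes roughly num_candidates**3 cell updates per run; adjust this if
    # the benchmark counts cells differently
    return num_candidates ** 3 / seconds / 1e9

print(round(gigacells_per_second(4_096, 0.5), 1))  # ~137.4 gcs if the run took half a second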

Candidates    Numba 384c    Mojo 🔥 384c    Mojo 🔥 SIMD 384c    CUDA h100    Mojo 🔥 h100
2'048         34.4 gcs      37.9 gcs        62.1 gcs             182.7 gcs    153.4 gcs
4'096         86.8 gcs      59.8 gcs        171.5 gcs            264.1 gcs    232.6 gcs
8'192         74.6 gcs      76.6 gcs        357.3 gcs            495.3 gcs    408.0 gcs
16'384        76.7 gcs      80.7 gcs        369.0 gcs            600.7 gcs    635.3 gcs
32'768        101.4 gcs     82.3 gcs        293.1 gcs            921.4 gcs    893.7 gcs

The 384c columns refer to benchmarks obtained on AWS m8i instances with dual-socket Xeon 6 CPUs, totalling 384 cores. The h100 columns refer to benchmarks obtained on Nebius GPU instances with NVIDIA H100 GPUs.


With the NVIDIA Nsight Compute CLI (ncu) we can dissect the kernels and see that there is still room for improvement:

ncu uv run scaling_elections.py --num-candidates 4096 --num-voters 4096 --gpu-tile-size 32 --run-gpu
>  void _cuda_independent<32>(unsigned int, unsigned int, unsigned int *) (128, 128, 1)x(32, 32, 1), Context 1, Stream 7, Device 0, CC 9.0
>    Section: GPU Speed Of Light Throughput
>    ----------------------- ----------- ------------
>    Metric Name             Metric Unit Metric Value
>    ----------------------- ----------- ------------
>    DRAM Frequency                  Ghz         3.20
>    SM Frequency                    Ghz         1.50
>    Elapsed Cycles                cycle       307595
>    Memory Throughput                 %        78.65
>    DRAM Throughput                   %        10.69
>    Duration                         us       205.22
>    L1/TEX Cache Throughput           %        79.90
>    L2 Cache Throughput               %        15.54
>    SM Active Cycles              cycle    302305.17
>    Compute (SM) Throughput           %        66.51
>    ----------------------- ----------- ------------
>
>    OPT   Memory is more heavily utilized than Compute: Look at the Memory Workload Analysis section to identify the L1 
>          bottleneck. Check memory replay (coalescing) metrics to make sure you're efficiently utilizing the bytes      
>          transferred. Also consider whether it is possible to do more work per memory access (kernel fusion) or        
>          whether there are values you can (re)compute.                                                                 
>
>    Section: Launch Statistics
>    -------------------------------- --------------- ---------------
>    Metric Name                          Metric Unit    Metric Value
>    -------------------------------- --------------- ---------------
>    Block Size                                                  1024
>    Cluster Scheduling Policy                           PolicySpread
>    Cluster Size                                                   0
>    Function Cache Configuration                     CachePreferNone
>    Grid Size                                                  16384
>    Registers Per Thread             register/thread              31
>    Shared Memory Configuration Size           Kbyte           32.77
>    Driver Shared Memory Per Block       Kbyte/block            1.02
>    Dynamic Shared Memory Per Block       byte/block               0
>    Static Shared Memory Per Block       Kbyte/block           12.29
>    # SMs                                         SM             132
>    Stack Size                                                  1024
>    Threads                                   thread        16777216
>    # TPCs                                                        66
>    Enabled TPC IDs                                              all
>    Uses Green Context                                             0
>    Waves Per SM                                               62.06
>    -------------------------------- --------------- ---------------
>
>    Section: Occupancy
>    ------------------------------- ----------- ------------
>    Metric Name                     Metric Unit Metric Value
>    ------------------------------- ----------- ------------
>    Max Active Clusters                 cluster            0
>    Max Cluster Size                      block            8
>    Overall GPU Occupancy                     %            0
>    Cluster Occupancy                         %            0
>    Block Limit Barriers                  block           32
>    Block Limit SM                        block           32
>    Block Limit Registers                 block            2
>    Block Limit Shared Mem                block            2
>    Block Limit Warps                     block            2
>    Theoretical Active Warps per SM        warp           64
>    Theoretical Occupancy                     %          100
>    Achieved Occupancy                        %        92.97
>    Achieved Active Warps Per SM           warp        59.50
>    ------------------------------- ----------- ------------

So feel free to fork and suggest improvements 🤗
