
Conversation

Maratyszcza
Collaborator

No description provided.

@balajirajput96

This pull request introduces support for dynamic threadgroup buffer allocation for Metal compute kernels, refactors kernel launch APIs to clarify buffer usage, and improves the SDPA (scaled dot-product attention) kernel for more flexible and efficient execution. The changes span both C/C++ and Metal Shading Language code, updating function signatures, internal logic, and kernel arguments to enable these new capabilities.

Metal kernel launch API improvements

  • Refactored the kernel launch API (gptoss_metal_command_buffer_encode_launch_kernel) to distinguish device buffers from threadgroup memory, adding a new threadgroup_buffer_size argument and renaming buffer-related parameters for clarity.
  • Updated all kernel launch call sites to pass the new threadgroup_buffer_size argument, setting it to zero for kernels that require no threadgroup scratch memory.
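The split described above can be sketched as follows. This is a minimal, hypothetical model of the refactored launch call, not the actual signature of gptoss_metal_command_buffer_encode_launch_kernel (which takes more arguments); the names encode_launch_kernel, encoded_launch, and MAX_BUFFERS are invented for illustration. The key point it shows is that device-buffer bindings and the threadgroup memory size are now separate parameters, with zero meaning "no threadgroup scratch needed".

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BUFFERS 8  /* arbitrary cap for this sketch */

/* Records what a launch would bind on the real Metal compute encoder:
 * device buffers (setBuffer:offset:atIndex:) and threadgroup memory
 * (setThreadgroupMemoryLength:atIndex:). */
struct encoded_launch {
    size_t buffer_offsets[MAX_BUFFERS]; /* device buffer offsets, by index */
    size_t num_buffers;
    size_t threadgroup_buffer_size;     /* bytes of threadgroup memory; 0 if
                                           the kernel has no threadgroup arg */
};

/* Mirrors the shape of the refactored API: device buffers and threadgroup
 * memory are distinct parameters rather than being conflated. */
static void encode_launch_kernel(struct encoded_launch* out,
                                 const size_t* buffer_offsets,
                                 size_t num_buffers,
                                 size_t threadgroup_buffer_size) {
    assert(num_buffers <= MAX_BUFFERS);
    out->num_buffers = num_buffers;
    for (size_t i = 0; i < num_buffers; i++) {
        out->buffer_offsets[i] = buffer_offsets[i];
    }
    out->threadgroup_buffer_size = threadgroup_buffer_size;
}
```

A call site that needs no threadgroup scratch simply passes 0 for the last argument, which is what the PR does for most kernels.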

SDPA (scaled dot-product attention) kernel enhancements

  • Added dynamic calculation of the threadgroup size and threadgroup buffer size for the SDPA kernel, using new math utilities for power-of-two rounding, and passed the computed buffer size to the kernel launch.
  • Updated the Metal SDPA kernel (sdpa.metal) to accept a threadgroup buffer and new threadgroup/simdgroup index arguments, enabling more flexible parallelization and resource usage.
  • Changed the initialization of local variables in the SDPA kernel to depend on the simdgroup index, improving correctness under parallel execution.
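The dynamic sizing described above might look roughly like this. Everything here is an assumption for illustration: the function name sdpa_launch_params, the GPTOSS_SIMD_WIDTH constant, and the exact formula (one float row of head_dim accumulators per simdgroup) are not taken from the PR; only the general pattern (round a count down to a power of two, derive the thread count and scratch bytes, pass the bytes as threadgroup_buffer_size) is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GPTOSS_SIMD_WIDTH 32  /* threads per simdgroup on Apple GPUs */

/* Round down to the nearest power of two by clearing low set bits. */
static uint32_t round_down_po2(uint32_t n) {
    assert(n != 0);
    while (n & (n - 1)) {
        n &= n - 1;
    }
    return n;
}

/* Hypothetical sizing: pick a power-of-two number of simdgroups no larger
 * than the number of query heads, then derive the threadgroup thread count
 * and the per-threadgroup scratch buffer size in bytes. */
static void sdpa_launch_params(uint32_t num_q_heads, uint32_t head_dim,
                               size_t* threadgroup_size,
                               size_t* threadgroup_buffer_size) {
    const uint32_t num_simdgroups = round_down_po2(num_q_heads);
    *threadgroup_size = (size_t) num_simdgroups * GPTOSS_SIMD_WIDTH;
    *threadgroup_buffer_size =
        (size_t) num_simdgroups * head_dim * sizeof(float);
}
```

The computed threadgroup_buffer_size would then be forwarded to the launch call, and the kernel would carve per-simdgroup slices out of the threadgroup buffer using its simdgroup index.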

Math utilities

  • Added a new math_round_down_po2 function and improved math_round_up_po2, for rounding integers down or up to the nearest power of two, including assertions that validate their inputs.
  • Included <assert.h> in the math header to support these assertions.
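The two helpers above can be sketched with standard bit tricks. The function names match those mentioned in the summary, but these bodies are a plausible reconstruction under assumed 32-bit unsigned semantics, not the PR's actual implementations.

```c
#include <assert.h>
#include <stdint.h>

/* Round number down to the nearest power of two.
 * Requires number != 0 (asserted, matching the PR's input-validity checks). */
static inline uint32_t math_round_down_po2(uint32_t number) {
    assert(number != 0);
    /* Clear set bits from the bottom until only the top set bit remains. */
    while (number & (number - 1)) {
        number &= number - 1;
    }
    return number;
}

/* Round number up to the nearest power of two.
 * Requires 0 < number <= 0x80000000 so the result fits in 32 bits. */
static inline uint32_t math_round_up_po2(uint32_t number) {
    assert(number != 0);
    assert(number <= UINT32_C(0x80000000));
    number -= 1;
    /* Smear the top set bit downward, then add one. */
    number |= number >> 1;
    number |= number >> 2;
    number |= number >> 4;
    number |= number >> 8;
    number |= number >> 16;
    return number + 1;
}
```

Note that both functions are identities on exact powers of two, which is what makes them safe for sizing threadgroups that may already be power-of-two sized.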

@dkundel-openai dkundel-openai merged commit 995e148 into main Aug 18, 2025
3 participants