
Conversation

Maratyszcza
Collaborator

No description provided.

@balajirajput96

This pull request introduces support for dynamic threadgroup buffer allocation for Metal compute kernels, refactors kernel launch APIs to clarify buffer usage, and improves the SDPA (scaled dot-product attention) kernel for more flexible and efficient execution. The changes span both C/C++ and Metal Shading Language code, updating function signatures, internal logic, and kernel arguments to enable these new capabilities.

Metal kernel launch API improvements

  • Refactored the kernel launch API (gptoss_metal_command_buffer_encode_launch_kernel) to distinguish device buffers from threadgroup memory, adding a new threadgroup_buffer_size argument and renaming buffer-related parameters for clarity.
  • Updated all kernel launch call sites to pass the new threadgroup_buffer_size argument, setting it to zero for kernels that require no threadgroup scratch memory.
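The split described above can be sketched as follows. This is a minimal, hypothetical model of the refactored launch call, not the actual signature of gptoss_metal_command_buffer_encode_launch_kernel (which takes more arguments); the names encode_launch_kernel, encoded_launch, and MAX_BUFFERS are invented for illustration. The key point it shows is that device-buffer bindings and the threadgroup memory size are now separate parameters, with zero meaning "no threadgroup scratch needed".

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BUFFERS 8  /* arbitrary cap for this sketch */

/* Records what a launch would bind on the real Metal compute encoder:
 * device buffers (setBuffer:offset:atIndex:) and threadgroup memory
 * (setThreadgroupMemoryLength:atIndex:). */
struct encoded_launch {
    size_t buffer_offsets[MAX_BUFFERS]; /* device buffer offsets, by index */
    size_t num_buffers;
    size_t threadgroup_buffer_size;     /* bytes of threadgroup memory; 0 if
                                           the kernel has no threadgroup arg */
};

/* Mirrors the shape of the refactored API: device buffers and threadgroup
 * memory are distinct parameters rather than being conflated. */
static void encode_launch_kernel(struct encoded_launch* out,
                                 const size_t* buffer_offsets,
                                 size_t num_buffers,
                                 size_t threadgroup_buffer_size) {
    assert(num_buffers <= MAX_BUFFERS);
    out->num_buffers = num_buffers;
    for (size_t i = 0; i < num_buffers; i++) {
        out->buffer_offsets[i] = buffer_offsets[i];
    }
    out->threadgroup_buffer_size = threadgroup_buffer_size;
}
```

A call site that needs no threadgroup scratch simply passes 0 for the last argument, which is what the PR does for most kernels.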

SDPA (scaled dot-product attention) kernel enhancements

  • Added dynamic calculation of the threadgroup size and threadgroup buffer size for the SDPA kernel, using new math utilities for power-of-two rounding, and passed the computed buffer size to the kernel launch.
  • Updated the Metal SDPA kernel (sdpa.metal) to accept a threadgroup buffer and new threadgroup/simdgroup index arguments, enabling more flexible parallelization and resource usage.
  • Changed the initialization of local variables in the SDPA kernel to depend on the simdgroup index, improving correctness under parallel execution.
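The dynamic sizing described above might look roughly like this. Everything here is an assumption for illustration: the function name sdpa_launch_params, the GPTOSS_SIMD_WIDTH constant, and the exact formula (one float row of head_dim accumulators per simdgroup) are not taken from the PR; only the general pattern (round a count down to a power of two, derive the thread count and scratch bytes, pass the bytes as threadgroup_buffer_size) is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GPTOSS_SIMD_WIDTH 32  /* threads per simdgroup on Apple GPUs */

/* Round down to the nearest power of two by clearing low set bits. */
static uint32_t round_down_po2(uint32_t n) {
    assert(n != 0);
    while (n & (n - 1)) {
        n &= n - 1;
    }
    return n;
}

/* Hypothetical sizing: pick a power-of-two number of simdgroups no larger
 * than the number of query heads, then derive the threadgroup thread count
 * and the per-threadgroup scratch buffer size in bytes. */
static void sdpa_launch_params(uint32_t num_q_heads, uint32_t head_dim,
                               size_t* threadgroup_size,
                               size_t* threadgroup_buffer_size) {
    const uint32_t num_simdgroups = round_down_po2(num_q_heads);
    *threadgroup_size = (size_t) num_simdgroups * GPTOSS_SIMD_WIDTH;
    *threadgroup_buffer_size =
        (size_t) num_simdgroups * head_dim * sizeof(float);
}
```

The computed threadgroup_buffer_size would then be forwarded to the launch call, and the kernel would carve per-simdgroup slices out of the threadgroup buffer using its simdgroup index.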

Math utilities

  • Added a new math_round_down_po2 function and improved math_round_up_po2, for rounding integers down or up to the nearest power of two, including assertions that validate their inputs.
  • Included <assert.h> in the math header to support these assertions.
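The two helpers above can be sketched with standard bit tricks. The function names match those mentioned in the summary, but these bodies are a plausible reconstruction under assumed 32-bit unsigned semantics, not the PR's actual implementations.

```c
#include <assert.h>
#include <stdint.h>

/* Round number down to the nearest power of two.
 * Requires number != 0 (asserted, matching the PR's input-validity checks). */
static inline uint32_t math_round_down_po2(uint32_t number) {
    assert(number != 0);
    /* Clear set bits from the bottom until only the top set bit remains. */
    while (number & (number - 1)) {
        number &= number - 1;
    }
    return number;
}

/* Round number up to the nearest power of two.
 * Requires 0 < number <= 0x80000000 so the result fits in 32 bits. */
static inline uint32_t math_round_up_po2(uint32_t number) {
    assert(number != 0);
    assert(number <= UINT32_C(0x80000000));
    number -= 1;
    /* Smear the top set bit downward, then add one. */
    number |= number >> 1;
    number |= number >> 2;
    number |= number >> 4;
    number |= number >> 8;
    number |= number >> 16;
    return number + 1;
}
```

Note that both functions are identities on exact powers of two, which is what makes them safe for sizing threadgroups that may already be power-of-two sized.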

@dkundel-openai dkundel-openai merged commit 995e148 into main Aug 18, 2025
3 participants