Conversation

@kimishpatel
Contributor

@kimishpatel kimishpatel commented Nov 5, 2025

Stack from ghstack (oldest at bottom):

For small models, dequantizing portions of the v cache causes extra allocation overhead.

A better way to handle this would probably be to dequantize the entire v cache outside the model.

There isn't a significant perf advantage from this yet, but subsequent diffs will use a caching allocator, where this refactor helps.

Differential Revision: [D85532077](https://our.internmc.facebook.com/intern/diff/D85532077/)

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Nov 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15610

Note: Links to docs will display an error until the docs builds have been completed.

❌ 9 New Failures, 4 Unrelated Failures

As of commit 602a3a7 with merge base 7600df8:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions

github-actions bot commented Nov 5, 2025

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-cla meta-cla bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Nov 5, 2025
kimishpatel added a commit that referenced this pull request Nov 6, 2025
Pull Request resolved: #15610

ghstack-source-id: 321455128
@exported-using-ghexport

Differential Revision: [D85532077](https://our.internmc.facebook.com/intern/diff/D85532077/)
Contributor

Copilot AI left a comment


Pull Request Overview

This PR refactors the quantized scaled dot-product attention (SDPA) implementation to reduce allocation overhead by moving the dequantization buffer allocation from inside the dequant_and_gemm function to the outer cpu_flash_attention scope. Instead of allocating a new std::vector for each dequantization operation, a pre-allocated per-thread scratch buffer is now shared across iterations.

Key changes:

  • Added buf_qdq_ptr parameter to dequant_and_gemm and _qk_at_v_gemm functions to accept externally allocated dequantization buffers
  • Allocated a shared scratch buffer (scratch_for_quant_dequant) in cpu_flash_attention with per-thread partitioning
  • Removed the local std::vector<float> dequantized_v_data allocation from dequant_and_gemm


// {num_thread, qSplitSize, is_reduced_type ? kvSplitSize : 0},
// query.options());
int64_t size_per_thread_qdq_vec = qSplitSize * kvSplitSize * headSize;
// Lets align size_per_thread_qdq_vec to 64 bytes, for coalesced cache reads,

Copilot AI Nov 17, 2025


The comment says "align to 64 bytes" but kAlignment = 32 aligns to 32 elements. Since size_per_thread_qdq_vec is an element count (not byte count), and assuming accum_t is float (4 bytes), this aligns to 128 bytes (32 * 4), not 64 bytes.

Either:

  1. Change kAlignment to 16 if 64-byte alignment is desired, or
  2. Update the comment to say "align to 32 elements" or "align to 128 bytes (for float)"
Suggested change
// Lets align size_per_thread_qdq_vec to 64 bytes, for coalesced cache reads,
// Lets align size_per_thread_qdq_vec to 32 elements (128 bytes for float), for coalesced cache reads,

// by padding with right number of per thread elements
constexpr int64_t kAlignment = 32;
size_per_thread_qdq_vec =
(size_per_thread_qdq_vec + kAlignment - 1) & (-(kAlignment - 1));

Copilot AI Nov 17, 2025


The alignment calculation is incorrect. The formula (x + kAlignment - 1) & (-(kAlignment - 1)) uses the wrong mask.

For aligning to a power-of-2 boundary, the correct formula is:

(size_per_thread_qdq_vec + kAlignment - 1) & (-kAlignment)

or equivalently:

(size_per_thread_qdq_vec + kAlignment - 1) & ~(kAlignment - 1)

The current code uses -(kAlignment - 1) which equals -31 = 0xFFFFFFE1, but the correct mask should be -32 = 0xFFFFFFE0 to properly zero out the bottom 5 bits.

Suggested change
(size_per_thread_qdq_vec + kAlignment - 1) & (-(kAlignment - 1));
(size_per_thread_qdq_vec + kAlignment - 1) & -kAlignment;

Contributor


@copilot are you sure? Please double check again

// at::Tensor buf_reduced = at::empty(
// {num_thread, qSplitSize, is_reduced_type ? kvSplitSize : 0},
// query.options());
int64_t size_per_thread_qdq_vec = qSplitSize * kvSplitSize * headSize;

Copilot AI Nov 17, 2025


The buffer size calculation appears to be larger than necessary. The dequantize operation needs kvBlockSize * headSize elements (at most kvSplitSize * headSize), but this allocates qSplitSize * kvSplitSize * headSize. The extra qSplitSize factor seems unnecessary and wastes memory per thread.

Consider changing to:

int64_t size_per_thread_qdq_vec = kvSplitSize * headSize;
Suggested change
int64_t size_per_thread_qdq_vec = qSplitSize * kvSplitSize * headSize;
int64_t size_per_thread_qdq_vec = kvSplitSize * headSize;


Copilot AI commented Nov 17, 2025

@mergennachin I've opened a new pull request, #15852, to work on those changes. Once the pull request is ready, I'll request review from you.


Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported meta-exported
