Conversation

@kimishpatel
Contributor

@kimishpatel kimishpatel commented Nov 5, 2025

Stack from ghstack (oldest at bottom):

For small models, dequantizing portions of the v cache causes extra allocation overhead.

A better way to handle this would probably be to dequantize the entire v cache outside the model.

There isn't a significant perf advantage from this yet, but subsequent diffs will use a caching allocator, where this refactor helps.

Differential Revision: [D85532077](https://our.internmc.facebook.com/intern/diff/D85532077/)

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Nov 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15610

Note: Links to docs will display an error until the docs builds have been completed.

❌ 9 New Failures, 4 Unrelated Failures

As of commit 602a3a7 with merge base 7600df8:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions

github-actions bot commented Nov 5, 2025

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-cla meta-cla bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Nov 5, 2025
kimishpatel added a commit that referenced this pull request Nov 6, 2025
Pull Request resolved: #15610

ghstack-source-id: 321455128
@exported-using-ghexport

Differential Revision: [D85532077](https://our.internmc.facebook.com/intern/diff/D85532077/)
Contributor

Copilot AI left a comment


Pull Request Overview

This PR refactors the quantized scaled dot-product attention (SDPA) implementation to reduce allocation overhead by moving the dequantization buffer allocation from inside the dequant_and_gemm function to the outer cpu_flash_attention scope. Instead of allocating a new std::vector for each dequantization operation, a pre-allocated per-thread scratch buffer is now shared across iterations.

Key changes:

  • Added buf_qdq_ptr parameter to dequant_and_gemm and _qk_at_v_gemm functions to accept externally allocated dequantization buffers
  • Allocated a shared scratch buffer (scratch_for_quant_dequant) in cpu_flash_attention with per-thread partitioning
  • Removed the local std::vector<float> dequantized_v_data allocation from dequant_and_gemm


// {num_thread, qSplitSize, is_reduced_type ? kvSplitSize : 0},
// query.options());
int64_t size_per_thread_qdq_vec = qSplitSize * kvSplitSize * headSize;
// Lets align size_per_thread_qdq_vec to 64 bytes, for coalesced cache reads,

Copilot AI Nov 17, 2025


The comment says "align to 64 bytes" but kAlignment = 32 aligns to 32 elements. Since size_per_thread_qdq_vec is an element count (not byte count), and assuming accum_t is float (4 bytes), this aligns to 128 bytes (32 * 4), not 64 bytes.

Either:

  1. Change kAlignment to 16 if 64-byte alignment is desired, or
  2. Update the comment to say "align to 32 elements" or "align to 128 bytes (for float)"
Suggested change
// Lets align size_per_thread_qdq_vec to 64 bytes, for coalesced cache reads,
// Lets align size_per_thread_qdq_vec to 32 elements (128 bytes for float), for coalesced cache reads,

// by padding with right number of per thread elements
constexpr int64_t kAlignment = 32;
size_per_thread_qdq_vec =
(size_per_thread_qdq_vec + kAlignment - 1) & (-(kAlignment - 1));

Copilot AI Nov 17, 2025


The alignment calculation is incorrect. The formula (x + kAlignment - 1) & (-(kAlignment - 1)) uses the wrong mask.

For aligning to a power-of-2 boundary, the correct formula is:

(size_per_thread_qdq_vec + kAlignment - 1) & (-kAlignment)

or equivalently:

(size_per_thread_qdq_vec + kAlignment - 1) & ~(kAlignment - 1)

The current code uses -(kAlignment - 1) which equals -31 = 0xFFFFFFE1, but the correct mask should be -32 = 0xFFFFFFE0 to properly zero out the bottom 5 bits.

Suggested change
(size_per_thread_qdq_vec + kAlignment - 1) & (-(kAlignment - 1));
(size_per_thread_qdq_vec + kAlignment - 1) & -kAlignment;

Contributor


@copilot are you sure? Please double check again

// at::Tensor buf_reduced = at::empty(
// {num_thread, qSplitSize, is_reduced_type ? kvSplitSize : 0},
// query.options());
int64_t size_per_thread_qdq_vec = qSplitSize * kvSplitSize * headSize;

Copilot AI Nov 17, 2025


The buffer size calculation appears to be larger than necessary. The dequantize operation needs kvBlockSize * headSize elements (at most kvSplitSize * headSize), but this allocates qSplitSize * kvSplitSize * headSize. The extra qSplitSize factor seems unnecessary and wastes memory per thread.

Consider changing to:

int64_t size_per_thread_qdq_vec = kvSplitSize * headSize;
Suggested change
int64_t size_per_thread_qdq_vec = qSplitSize * kvSplitSize * headSize;
int64_t size_per_thread_qdq_vec = kvSplitSize * headSize;


Copilot AI commented Nov 17, 2025

@mergennachin I've opened a new pull request, #15852, to work on those changes. Once the pull request is ready, I'll request review from you.


Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported meta-exported
