[Executorch] Introduce caching cpu memory allocator #15611

kimishpatel · 2025-11-05T18:45:18Z

Stack from ghstack (oldest at bottom):

This allocator is meant to be used as the temp allocator for kernels. For SDPA specifically, allocations appear to add significant overhead on iOS.

Differential Revision: D85532079


pytorch-bot · 2025-11-05T18:45:23Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15611

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 11 New Failures, 1 Cancelled Job, 7 Unrelated Failures

As of commit 7939d44 with merge base 7600df8:

NEW FAILURES - The following jobs have failed:

Lint / lintrunner / linux-job (gh)
>>> Lint for extension/memory_allocator/test/cpu_caching_malloc_allocator_test.cpp:
pull / test-moshi-linux / linux-job (gh)
RuntimeError: Command docker exec -t fd1fd7d3a605c504c801a30ae4e968eff278e55a432525a14398c22d2aefba0f /exec failed with exit code 1
Test CUDA Builds / test-model-cuda-e2e (google, gemma-3-4b-it, non-quantized) / linux-job (gh)
RuntimeError: Command docker exec -t bbbb1bda42bfbd588df41543b32bca95174bb96a64f970801cfeba276afd0337 /exec failed with exit code 1
Test CUDA Builds / test-model-cuda-e2e (google, gemma-3-4b-it, quantized-int4-tile-packed) / linux-job (gh)
RuntimeError: Command docker exec -t 86fd1737c87a192fdadd0ed7fde18a6c75a5e98702462fbf23c954943d55892d /exec failed with exit code 1
Test CUDA Builds / test-model-cuda-e2e (openai, whisper-large-v3-turbo, non-quantized) / linux-job (gh)
RuntimeError: Command docker exec -t ec8c1d937acafaf2ede59ad71ac7de6b1314b7b947d046e7e831ab52df6edc70 /exec failed with exit code 1
Test CUDA Builds / test-model-cuda-e2e (openai, whisper-large-v3-turbo, quantized-int4-tile-packed) / linux-job (gh)
RuntimeError: Command docker exec -t 06c34798e0633e860e4efd2a600f619881b9b8cd454bccfdcb8b81d82708eee3 /exec failed with exit code 1
Test CUDA Builds / test-model-cuda-e2e (openai, whisper-large-v3-turbo, quantized-int4-weight-only) / linux-job (gh)
RuntimeError: Command docker exec -t 1d4b2f4a934660bd6e4c29191f85187a4b60655a44f543fc842ba77e3d8bf210 /exec failed with exit code 1
Test CUDA Builds / test-model-cuda-e2e (openai, whisper-small, non-quantized) / linux-job (gh)
RuntimeError: Command docker exec -t d1f20073cd734fbca14c7c16fc8938e0b3302a67f64daeda64081677b872db74 /exec failed with exit code 1
Test CUDA Builds / test-model-cuda-e2e (openai, whisper-small, quantized-int4-tile-packed) / linux-job (gh)
RuntimeError: Command docker exec -t f37d27a27f752d25c91ecffb3aec27e14de7b9fd8eff80052c1535251ccbb2c4 /exec failed with exit code 1
Test CUDA Builds / test-model-cuda-e2e (openai, whisper-small, quantized-int4-weight-only) / linux-job (gh)
RuntimeError: Command docker exec -t 89366bab81d0dba169c2b902012c355f9ead53f8dd536be1e46827b71608c2d5 /exec failed with exit code 1
trunk / test-arm-cortex-m-size-test (zephyr-preset) / linux-job (gh)
RuntimeError: Command docker exec -t 4dcd67ccbdfbeae1be1db9f86bb0c965987cdff469f5e31800fe74cffd7fc809 /exec failed with exit code 1

CANCELLED JOB - The following job was cancelled. Please retry:

trunk / test-models-macos-coreml (resnet50) / macos-job (gh)
##[error]The operation was canceled.

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / test-eval_llama-wikitext-linux / linux-job (gh) (matched linux rule in flaky-rules.json)
The process '/usr/bin/git' failed with exit code 128

BROKEN TRUNK - The following jobs failed but were already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / unittest / linux / linux-job (gh) (trunk failure)
backends/xnnpack/test/recipes/test_xnnpack_recipes.py::TestXnnpackRecipes::test_int8_static_quant_recipe
pull / unittest / macos / macos-job (gh) (trunk failure)
backends/xnnpack/test/recipes/test_xnnpack_recipes.py::TestXnnpackRecipes::test_int8_static_quant_recipe
pull / unittest-editable / linux / linux-job (gh) (trunk failure)
backends/xnnpack/test/recipes/test_xnnpack_recipes.py::TestXnnpackRecipes::test_int8_static_quant_recipe
pull / unittest-editable / macos / macos-job (gh) (trunk failure)
backends/xnnpack/test/recipes/test_xnnpack_recipes.py::TestXnnpackRecipes::test_int8_static_quant_recipe
trunk / unittest-release / linux / linux-job (gh) (trunk failure)
backends/xnnpack/test/recipes/test_xnnpack_recipes.py::TestXnnpackRecipes::test_int8_static_quant_recipe
trunk / unittest-release / macos / macos-job (gh) (trunk failure)
backends/xnnpack/test/recipes/test_xnnpack_recipes.py::TestXnnpackRecipes::test_int8_static_quant_recipe

This comment was automatically generated by Dr. CI and updates every 15 minutes.


github-actions · 2025-11-05T18:45:58Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


Copilot

Pull Request Overview

This PR introduces a thread-safe CPU caching memory allocator for ExecuTorch to reduce allocation overhead, particularly for temporary kernel allocations like SDPA on iOS. The allocator caches previously allocated memory blocks and reuses them based on size, avoiding repeated malloc/free calls.

Implements CPUCachingAllocator class with caching logic based on allocation size
Adds build system support (CMake, Bazel) for the new memory allocator extension
Provides comprehensive test coverage for various allocation scenarios including thread safety
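The size-based reuse described in this overview can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the `CPUCachingAllocator` from this PR: the class name `SketchCachingAllocator`, its members, and `bytes_owned()` are hypothetical, plain `std::malloc` is used, and both alignment handling and the `max_size` cap are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <unordered_map>
#include <vector>

// Hypothetical sketch only; not the allocator added by this PR.
class SketchCachingAllocator {
 public:
  SketchCachingAllocator() = default;

  // Assumes no other thread is still using the object at destruction time.
  ~SketchCachingAllocator() {
    for (auto& entry : in_use_) {
      std::free(entry.first);
    }
    for (auto& entry : available_) {
      for (void* ptr : entry.second) {
        std::free(ptr);
      }
    }
  }

  // Returns a cached block of exactly `size` bytes if one exists,
  // otherwise falls back to std::malloc.
  void* allocate(std::size_t size) {
    std::lock_guard<std::mutex> guard(mutex_);
    void* ptr = nullptr;
    auto it = available_.find(size);
    if (it != available_.end() && !it->second.empty()) {
      ptr = it->second.back();
      it->second.pop_back();
    } else {
      ptr = std::malloc(size);
      if (ptr == nullptr) {
        return nullptr;
      }
      bytes_owned_ += size;  // Counted only after malloc succeeded.
    }
    in_use_[ptr] = size;
    return ptr;
  }

  // Moves every live block into the per-size cache instead of freeing it.
  void reset() {
    std::lock_guard<std::mutex> guard(mutex_);
    for (auto& entry : in_use_) {
      available_[entry.second].push_back(entry.first);
    }
    in_use_.clear();
  }

  // Total bytes obtained from malloc and not yet freed.
  std::size_t bytes_owned() {
    std::lock_guard<std::mutex> guard(mutex_);
    return bytes_owned_;
  }

 private:
  std::mutex mutex_;
  std::unordered_map<void*, std::size_t> in_use_;
  std::unordered_map<std::size_t, std::vector<void*>> available_;
  std::size_t bytes_owned_ = 0;
};

// Usage: after reset(), a same-sized request reuses the earlier block.
void example_reuse() {
  SketchCachingAllocator allocator;
  void* first = allocator.allocate(256);
  allocator.reset();
  void* second = allocator.allocate(256);
  assert(first != nullptr && second == first);
}
```

Keying the cache on the exact request size keeps lookup simple, at the cost of never reusing a larger cached block for a smaller request.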

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 9 comments.


File	Description
tools/cmake/Codegen.cmake	Adds `EXTENSION_MEMORY_ALLOCATOR_SRCS` variable to build system for new allocator sources
shim_et/xplat/executorch/build/build_variables.bzl	Defines source files list for the caching allocator module
extension/memory_allocator/test/targets.bzl	Adds Bazel test target for caching allocator tests
extension/memory_allocator/test/cpu_caching_malloc_allocator_test.cpp	Comprehensive test suite covering allocation, caching, alignment, thread safety, and edge cases
extension/memory_allocator/targets.bzl	Defines Bazel library target for the caching allocator
extension/memory_allocator/cpu_caching_malloc_allocator.h	Header declaring the `CPUCachingAllocator` class with thread-safe caching interface
extension/memory_allocator/cpu_caching_malloc_allocator.cpp	Implementation of caching allocator with size-based caching and alignment support
extension/memory_allocator/CMakeLists.txt	CMake build configuration for the memory allocator extension library


Copilot · 2025-11-17T16:19:59Z

extension/memory_allocator/cpu_caching_malloc_allocator.cpp

+}
+
+CPUCachingAllocator::~CPUCachingAllocator() {
+  // destructor must be called in thread safe manner


There's a potential race condition in the destructor. While the comment states "destructor must be called in thread safe manner", the destructor doesn't lock the mutex before calling reset() and free_cached(). If another thread is still executing methods on this object when the destructor is called, this could lead to undefined behavior. Consider adding a lock guard at the start of the destructor, or document that the caller must ensure no concurrent access during destruction.

Suggested change

-  // destructor must be called in thread safe manner
+  // destructor must be called in thread safe manner
+  std::lock_guard<std::mutex> guard(mutex_);
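One caveat with taking the lock in the destructor: `std::mutex` is not recursive, so if the destructor then called a public method that locks `mutex_` again, it would deadlock. A common way around this is to have public methods lock and delegate to private helpers that do not. The sketch below illustrates only that pattern; `LockedPool` and all of its members are hypothetical and do not come from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical class illustrating the locking pattern only.
class LockedPool {
 public:
  ~LockedPool() {
    // Lock once, then call only the *_unlocked helpers.
    std::lock_guard<std::mutex> guard(mutex_);
    reset_unlocked();
    free_cached_unlocked();
  }

  void add(int value) {
    std::lock_guard<std::mutex> guard(mutex_);
    in_use_.push_back(value);
  }

  void reset() {
    std::lock_guard<std::mutex> guard(mutex_);
    reset_unlocked();
  }

  void free_cached() {
    std::lock_guard<std::mutex> guard(mutex_);
    free_cached_unlocked();
  }

  std::size_t cached_count() {
    std::lock_guard<std::mutex> guard(mutex_);
    return cached_.size();
  }

 private:
  // Callers must already hold mutex_.
  void reset_unlocked() {
    cached_.insert(cached_.end(), in_use_.begin(), in_use_.end());
    in_use_.clear();
  }

  // Callers must already hold mutex_.
  void free_cached_unlocked() {
    cached_.clear();
  }

  std::mutex mutex_;
  std::vector<int> in_use_;
  std::vector<int> cached_;
};
```

Locking in the destructor does not protect against a thread that calls into the object after destruction has finished, so the caller still has to guarantee that no concurrent users remain.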

Copilot · 2025-11-17T16:19:59Z

extension/memory_allocator/cpu_caching_malloc_allocator.cpp

+  }
+  available_map_.clear();
+  current_size_ = 0;


current_size_ should track the total size of all allocated memory (both in allocation_map_ and available_map_), but free_cached() sets it to 0 after only freeing memory in available_map_. This is incorrect when there are still active allocations in allocation_map_. Instead, current_size_ should be decremented by the sizes of the freed allocations. Consider: for (const auto& it : available_map_) { current_size_ -= it.first * it.second.size(); }

Suggested change

-  }
-  available_map_.clear();
-  current_size_ = 0;
+    current_size_ -= it.first * it.second.size();
+  }
+  available_map_.clear();
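The accounting the reviewer describes can be sketched as a free function. This assumes, based only on the expression `it.first * it.second.size()` in the comment, that the cache maps a block size to a list of pointers of that size; the names `AvailableMap` and `free_cached_blocks` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// Assumed shape of the cache: block size -> cached pointers of that size.
using AvailableMap = std::map<std::size_t, std::vector<void*>>;

// Frees every cached block and returns the byte total that remains,
// i.e. the bytes still held by live allocations.
std::size_t free_cached_blocks(AvailableMap& available,
                               std::size_t current_size) {
  for (auto& entry : available) {
    for (void* ptr : entry.second) {
      std::free(ptr);
    }
    // Subtract only what was actually released.
    current_size -= entry.first * entry.second.size();
  }
  available.clear();
  return current_size;
}
```

With 256 cached bytes and 100 bytes still in use, the function would return 100 rather than 0.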

Copilot · 2025-11-17T16:19:59Z

extension/memory_allocator/cpu_caching_malloc_allocator.cpp

+}
+
+void CPUCachingAllocator::free_cached() {
+  // We dont lock mutex_ here because it will cause deadlock otherwise


Typo in comment: "dont" should be "don't".

Suggested change

-  // We dont lock mutex_ here because it will cause deadlock otherwise
+  // We don't lock mutex_ here because it will cause deadlock otherwise

Copilot · 2025-11-17T16:20:00Z

extension/memory_allocator/CMakeLists.txt

+  set(EXECUTORCH_ROOT ${CMAKE_CURRENT_SOURCE_DIR}/../..)
+endif()
+
+list(TRANSFORM _extension_module__srcs PREPEND "${EXECUTORCH_ROOT}/")


The variable name _extension_module__srcs appears to be incorrect. This file is for extension_memory_allocator, so it should reference _extension_memory_allocator__srcs which is defined in the build variables.

Suggested change

-list(TRANSFORM _extension_module__srcs PREPEND "${EXECUTORCH_ROOT}/")
+list(TRANSFORM _extension_memory_allocator__srcs PREPEND "${EXECUTORCH_ROOT}/")

Copilot · 2025-11-17T16:20:00Z

extension/memory_allocator/cpu_caching_malloc_allocator.cpp

+    current_size_ += size;
+    if (ptr == nullptr) {
+      ET_LOG(Error, "Failed to allocate memory");
+      return nullptr;
+    }


The current_size_ is incremented before checking if the allocation succeeded. If std::aligned_alloc returns nullptr (line 46), the function returns nullptr but current_size_ has already been incremented by size. This leads to incorrect size tracking. The increment should happen after confirming the allocation succeeded.

Suggested change

-    current_size_ += size;
-    if (ptr == nullptr) {
-      ET_LOG(Error, "Failed to allocate memory");
-      return nullptr;
-    }
+    if (ptr == nullptr) {
+      ET_LOG(Error, "Failed to allocate memory");
+      return nullptr;
+    }
+    current_size_ += size;
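A standalone version of this ordering might look like the sketch below. The helper `tracked_aligned_alloc` and its `tracked_bytes` parameter are hypothetical, and rounding the size up to a multiple of the alignment is an assumption made here because `std::aligned_alloc` expects it; the PR's actual size adjustment is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: allocates aligned memory and updates the byte
// counter only when the allocation succeeded.
void* tracked_aligned_alloc(std::size_t size,
                            std::size_t alignment,
                            std::size_t& tracked_bytes) {
  // Alignment must be a non-zero power of two.
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  // Round the request up to a multiple of the alignment.
  const std::size_t adjusted = (size + alignment - 1) / alignment * alignment;
  void* ptr = std::aligned_alloc(alignment, adjusted);
  if (ptr == nullptr) {
    return nullptr;  // Counter is left untouched on failure.
  }
  tracked_bytes += adjusted;
  return ptr;
}
```

Because the counter is updated after the null check, a failed allocation leaves the tracked total unchanged.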

Copilot · 2025-11-17T16:20:00Z

extension/memory_allocator/cpu_caching_malloc_allocator.h

+  // Since allocation_map, which is a global instance, is mutated/read via
+  // all public APIs we need a global mutex.


The comment mentions "allocation_map, which is a global instance" but allocation_map_ is actually an instance member variable, not a global instance. This comment is misleading and should be corrected to reflect that the mutex protects the instance's member variables.

Suggested change

-  // Since allocation_map, which is a global instance, is mutated/read via
-  // all public APIs we need a global mutex.
+  // Since allocation_map_ and other member variables are mutated/read via
+  // all public APIs, we need a mutex to protect concurrent access to these instance members.

Copilot · 2025-11-17T16:20:01Z

extension/memory_allocator/cpu_caching_malloc_allocator.h

+  /*
+    max_size: Maximum size of memory to cache. Never cache more than that.
+  */
+  CPUCachingAllocator(uint32_t max_size);


The constructor parameter max_size is declared as uint32_t but the member variable max_size_ is of type size_t. This type mismatch could lead to implicit narrowing on platforms where size_t is larger than 32 bits. Consider changing the constructor parameter to size_t for consistency.

Suggested change

-  CPUCachingAllocator(uint32_t max_size);
+  CPUCachingAllocator(size_t max_size);
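To make the narrowing concrete, the sketch below shows what a 32-bit parameter would receive for a cap larger than 4 GiB. The function name `cap_seen_by_uint32_param` is hypothetical, and `std::uint64_t` stands in for a 64-bit `size_t` so the example behaves the same on every platform.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in: returns the value a uint32_t parameter would hold
// after a 64-bit size is converted to it (the value wraps modulo 2^32).
std::uint64_t cap_seen_by_uint32_param(std::uint64_t requested) {
  return static_cast<std::uint32_t>(requested);
}

// Usage: a 5 GiB cap silently becomes 1 GiB; small values are unaffected.
void example_narrowing() {
  assert(cap_seen_by_uint32_param(5ull << 30) == (1ull << 30));
  assert(cap_seen_by_uint32_param(1024) == 1024);
}
```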

Copilot · 2025-11-17T16:20:01Z

extension/memory_allocator/CMakeLists.txt

+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+
+# Please this file formatted by running:


Typo in comment: "Please this file" should be "Please keep this file" or "Please ensure this file is".

Suggested change

-# Please this file formatted by running:
+# Please keep this file formatted by running:

Copilot · 2025-11-17T16:20:01Z

extension/memory_allocator/test/cpu_caching_malloc_allocator_test.cpp

+  // Should not get cached pointer since size was adjusted during first
+  // allocation


The comment on lines 281-282 states "Should not get cached pointer since size was adjusted during first allocation", but this expectation is incorrect. The allocation on line 280 will use the same alignment-adjusted size as line 276, so it will allocate a new pointer (not from cache). The test then resets on line 283 and allocates again on line 285, which should indeed reuse p1. However, the intermediate allocation on line 280 is not verified and the comment is confusing. Consider removing lines 280-282 or adding an assertion to clarify the expected behavior.

Suggested change

-  // Should not get cached pointer since size was adjusted during first
-  // allocation
+  // Should allocate a new pointer since both allocations use the same alignment-adjusted size.
+  EXPECT_NE(p1, p2);

kimishpatel requested review from JacobSzwejbka, kirklandsign and larryliu0820 as code owners November 5, 2025 18:45

This was referenced Nov 5, 2025

[Executorch] parallelize op_choose_qparams #15607

Merged

[Executorch] Add simd path for op quantize #15608

Merged

[Executorch] Add multithreading for op_quantize #15609

Merged

kimishpatel mentioned this pull request Nov 5, 2025

Reduce allocation overhead in quantized sdpa #15610

Open

meta-cla bot added the CLA Signed label Nov 5, 2025

meta-codesync bot added the fb-exported and meta-exported labels Nov 5, 2025

zingo added the ciflow/trunk label Nov 6, 2025

metascroy approved these changes Nov 6, 2025


This was referenced Nov 10, 2025

[Executorch] Use temp allocator for allocating scratch memory #15708

Open

[Executorch] Make module constructors uniform across #15709

Open

[Executorch][LLM] Use caching allocator for runner #15710

Open

This was referenced Nov 11, 2025

[Executorch] Use temp allocator for allocating scratch memory #15728

Open

[Executorch] Make module constructors uniform across #15729

Open

kimishpatel mentioned this pull request Nov 11, 2025

[Executorch][LLM] Use caching allocator for runner #15730

Open

kimishpatel mentioned this pull request Nov 14, 2025

[Executorch] make slice_copy parallel #15830

Open

mergennachin requested a review from Copilot November 17, 2025 16:14

Copilot started reviewing on behalf of mergennachin November 17, 2025 16:15

Copilot finished reviewing on behalf of mergennachin November 17, 2025 16:18

Copilot AI reviewed Nov 17, 2025


