@kimishpatel kimishpatel commented Nov 5, 2025

Meant to be used as a temp allocator for kernels. Specifically for SDPA, it seems that on iOS there is significant overhead coming from allocations.

Differential Revision: [D85532079](https://our.internmc.facebook.com/intern/diff/D85532079/)

[ghstack-poisoned]
pytorch-bot bot commented Nov 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15611

Note: Links to docs will display an error until the docs builds have been completed.

❌ 11 New Failures, 1 Cancelled Job, 7 Unrelated Failures

As of commit 7939d44 with merge base 7600df8:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kimishpatel added a commit that referenced this pull request Nov 5, 2025
Meant to be used as a temp allocator for kernels. Specifically for SDPA, it seems that on iOS there is significant overhead coming from allocations.

Differential Revision: [D85532079](https://our.internmc.facebook.com/intern/diff/D85532079/)

ghstack-source-id: 321123656
Pull Request resolved: #15611
@meta-cla meta-cla bot added the CLA Signed label Nov 5, 2025. (This label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed.)
github-actions bot commented Nov 5, 2025

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

kimishpatel added a commit that referenced this pull request Nov 6, 2025
Pull Request resolved: #15611

Meant to be used as a temp allocator for kernels. Specifically for SDPA, it seems that on iOS there is significant overhead coming from allocations.
ghstack-source-id: 321455824
@exported-using-ghexport

Differential Revision: [D85532079](https://our.internmc.facebook.com/intern/diff/D85532079/)
Copilot AI left a comment

Pull Request Overview

This PR introduces a thread-safe CPU caching memory allocator for ExecuTorch to reduce allocation overhead, particularly for temporary kernel allocations like SDPA on iOS. The allocator caches previously allocated memory blocks and reuses them based on size, avoiding repeated malloc/free calls.

  • Implements CPUCachingAllocator class with caching logic based on allocation size
  • Adds build system support (CMake, Bazel) for the new memory allocator extension
  • Provides comprehensive test coverage for various allocation scenarios including thread safety

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| tools/cmake/Codegen.cmake | Adds the EXTENSION_MEMORY_ALLOCATOR_SRCS variable to the build system for the new allocator sources |
| shim_et/xplat/executorch/build/build_variables.bzl | Defines the source file list for the caching allocator module |
| extension/memory_allocator/test/targets.bzl | Adds a Bazel test target for the caching allocator tests |
| extension/memory_allocator/test/cpu_caching_malloc_allocator_test.cpp | Comprehensive test suite covering allocation, caching, alignment, thread safety, and edge cases |
| extension/memory_allocator/targets.bzl | Defines the Bazel library target for the caching allocator |
| extension/memory_allocator/cpu_caching_malloc_allocator.h | Header declaring the CPUCachingAllocator class with a thread-safe caching interface |
| extension/memory_allocator/cpu_caching_malloc_allocator.cpp | Implementation of the caching allocator with size-based caching and alignment support |
| extension/memory_allocator/CMakeLists.txt | CMake build configuration for the memory allocator extension library |


```cpp
}

CPUCachingAllocator::~CPUCachingAllocator() {
  // destructor must be called in thread safe manner
```
Copilot AI Nov 17, 2025

There's a potential race condition in the destructor. While the comment states "destructor must be called in thread safe manner", the destructor doesn't lock the mutex before calling reset() and free_cached(). If another thread is still executing methods on this object when the destructor is called, this could lead to undefined behavior. Consider adding a lock guard at the start of the destructor, or document that the caller must ensure no concurrent access during destruction.

Suggested change

```diff
 // destructor must be called in thread safe manner
+std::lock_guard<std::mutex> guard(mutex_);
```

Copilot uses AI. Check for mistakes.
Comment on lines +67 to +69

```cpp
  }
  available_map_.clear();
  current_size_ = 0;
```
Copilot AI Nov 17, 2025

current_size_ should track the total size of all allocated memory (both in allocation_map_ and available_map_), but free_cached() sets it to 0 after only freeing memory in available_map_. This is incorrect when there are still active allocations in allocation_map_. Instead, current_size_ should be decremented by the sizes of the freed allocations. Consider: for (const auto& it : available_map_) { current_size_ -= it.first * it.second.size(); }

Suggested change

```diff
-  }
-  available_map_.clear();
-  current_size_ = 0;
+    current_size_ -= it.first * it.second.size();
+  }
+  available_map_.clear();
```
```cpp
}

void CPUCachingAllocator::free_cached() {
  // We dont lock mutex_ here because it will cause deadlock otherwise
```
Copilot AI Nov 17, 2025

Typo in comment: "dont" should be "don't".

Suggested change

```diff
-// We dont lock mutex_ here because it will cause deadlock otherwise
+// We don't lock mutex_ here because it will cause deadlock otherwise
```

```cmake
  set(EXECUTORCH_ROOT ${CMAKE_CURRENT_SOURCE_DIR}/../..)
endif()

list(TRANSFORM _extension_module__srcs PREPEND "${EXECUTORCH_ROOT}/")
```
Copilot AI Nov 17, 2025

The variable name _extension_module__srcs appears to be incorrect. This file is for extension_memory_allocator, so it should reference _extension_memory_allocator__srcs which is defined in the build variables.

Suggested change

```diff
-list(TRANSFORM _extension_module__srcs PREPEND "${EXECUTORCH_ROOT}/")
+list(TRANSFORM _extension_memory_allocator__srcs PREPEND "${EXECUTORCH_ROOT}/")
```

Comment on lines +45 to +49

```cpp
  current_size_ += size;
  if (ptr == nullptr) {
    ET_LOG(Error, "Failed to allocate memory");
    return nullptr;
  }
```
Copilot AI Nov 17, 2025

The current_size_ is incremented before checking if the allocation succeeded. If std::aligned_alloc returns nullptr (line 46), the function returns nullptr but current_size_ has already been incremented by size. This leads to incorrect size tracking. The increment should happen after confirming the allocation succeeded.

Suggested change

```diff
-  current_size_ += size;
   if (ptr == nullptr) {
     ET_LOG(Error, "Failed to allocate memory");
     return nullptr;
   }
+  current_size_ += size;
```
Comment on lines +62 to +63

```cpp
// Since allocation_map, which is a global instance, is mutated/read via
// all public APIs we need a global mutex.
```
Copilot AI Nov 17, 2025

The comment mentions "allocation_map, which is a global instance" but allocation_map_ is actually an instance member variable, not a global instance. This comment is misleading and should be corrected to reflect that the mutex protects the instance's member variables.

Suggested change

```diff
-// Since allocation_map, which is a global instance, is mutated/read via
-// all public APIs we need a global mutex.
+// Since allocation_map_ and other member variables are mutated/read via
+// all public APIs, we need a mutex to protect concurrent access to these
+// instance members.
```

```cpp
/*
max_size: Maximum size of memory to cache. Never cache more than that.
*/
CPUCachingAllocator(uint32_t max_size);
```
Copilot AI Nov 17, 2025

The constructor parameter max_size is declared as uint32_t but the member variable max_size_ is of type size_t. This type mismatch could lead to implicit narrowing on platforms where size_t is larger than 32 bits. Consider changing the constructor parameter to size_t for consistency.

Suggested change

```diff
-CPUCachingAllocator(uint32_t max_size);
+CPUCachingAllocator(size_t max_size);
```

```python
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

# Please this file formatted by running:
```
Copilot AI Nov 17, 2025

Typo in comment: "Please this file" should be "Please keep this file" or "Please ensure this file is".

Suggested change

```diff
-# Please this file formatted by running:
+# Please keep this file formatted by running:
```

Comment on lines +281 to +282

```cpp
// Should not get cached pointer since size was adjusted during first
// allocation
```
Copilot AI Nov 17, 2025

The comment on lines 281-282 states "Should not get cached pointer since size was adjusted during first allocation", but this expectation is incorrect. The allocation on line 280 will use the same alignment-adjusted size as line 276, so it will allocate a new pointer (not from cache). The test then resets on line 283 and allocates again on line 285, which should indeed reuse p1. However, the intermediate allocation on line 280 is not verified and the comment is confusing. Consider removing lines 280-282 or adding an assertion to clarify the expected behavior.

Suggested change

```diff
-// Should not get cached pointer since size was adjusted during first
-// allocation
+// Should allocate a new pointer since both allocations use the same
+// alignment-adjusted size.
+EXPECT_NE(p1, p2);
```


Labels

ciflow/trunk, CLA Signed, fb-exported, meta-exported
