-
Notifications
You must be signed in to change notification settings - Fork 728
[Executorch] Introduce caching cpu memory allocator #15611
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: gh/kimishpatel/203/base
Are you sure you want to change the base?
Conversation
Meant to use this for temp allocator for kernels. Specifically for sdpa, it seems that on iOS there is a significant overhead coming from allocations Differential Revision: [D85532079](https://our.internmc.facebook.com/intern/diff/D85532079/) [ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15611
Note: Links to docs will display an error until the docs builds have been completed. ❌ 11 New Failures, 1 Cancelled Job, 7 Unrelated FailuresAs of commit 7939d44 with merge base 7600df8 ( NEW FAILURES - The following jobs have failed:
CANCELLED JOB - The following job was cancelled. Please retry:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
BROKEN TRUNK - The following jobs failed but was present on the merge base:👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
Meant to use this for temp allocator for kernels. Specifically for sdpa, it seems that on iOS there is a significant overhead coming from allocations Differential Revision: [D85532079](https://our.internmc.facebook.com/intern/diff/D85532079/) ghstack-source-id: 321123656 Pull Request resolved: #15611
This PR needs a
|
Meant to use this for temp allocator for kernels. Specifically for sdpa, it seems that on iOS there is a significant overhead coming from allocations Differential Revision: [D85532079](https://our.internmc.facebook.com/intern/diff/D85532079/) [ghstack-poisoned]
Pull Request resolved: #15611 Meant to use this for temp allocator for kernels. Specifically for sdpa, it seems that on iOS there is a significant overhead coming from allocations ghstack-source-id: 321455824 @exported-using-ghexport Differential Revision: [D85532079](https://our.internmc.facebook.com/intern/diff/D85532079/)
Meant to use this for temp allocator for kernels. Specifically for sdpa, it seems that on iOS there is a significant overhead coming from allocations Differential Revision: [D85532079](https://our.internmc.facebook.com/intern/diff/D85532079/) [ghstack-poisoned]
Meant to use this for temp allocator for kernels. Specifically for sdpa, it seems that on iOS there is a significant overhead coming from allocations Differential Revision: [D85532079](https://our.internmc.facebook.com/intern/diff/D85532079/) [ghstack-poisoned]
Meant to use this for temp allocator for kernels. Specifically for sdpa, it seems that on iOS there is a significant overhead coming from allocations Differential Revision: [D85532079](https://our.internmc.facebook.com/intern/diff/D85532079/) [ghstack-poisoned]
Meant to use this for temp allocator for kernels. Specifically for sdpa, it seems that on iOS there is a significant overhead coming from allocations Differential Revision: [D85532079](https://our.internmc.facebook.com/intern/diff/D85532079/) [ghstack-poisoned]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces a thread-safe CPU caching memory allocator for ExecuTorch to reduce allocation overhead, particularly for temporary kernel allocations like SDPA on iOS. The allocator caches previously allocated memory blocks and reuses them based on size, avoiding repeated malloc/free calls.
- Implements
CPUCachingAllocatorclass with caching logic based on allocation size - Adds build system support (CMake, Bazel) for the new memory allocator extension
- Provides comprehensive test coverage for various allocation scenarios including thread safety
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/cmake/Codegen.cmake | Adds EXTENSION_MEMORY_ALLOCATOR_SRCS variable to build system for new allocator sources |
| shim_et/xplat/executorch/build/build_variables.bzl | Defines source files list for the caching allocator module |
| extension/memory_allocator/test/targets.bzl | Adds Bazel test target for caching allocator tests |
| extension/memory_allocator/test/cpu_caching_malloc_allocator_test.cpp | Comprehensive test suite covering allocation, caching, alignment, thread safety, and edge cases |
| extension/memory_allocator/targets.bzl | Defines Bazel library target for the caching allocator |
| extension/memory_allocator/cpu_caching_malloc_allocator.h | Header declaring the CPUCachingAllocator class with thread-safe caching interface |
| extension/memory_allocator/cpu_caching_malloc_allocator.cpp | Implementation of caching allocator with size-based caching and alignment support |
| extension/memory_allocator/CMakeLists.txt | CMake build configuration for the memory allocator extension library |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
|
|
||
| CPUCachingAllocator::~CPUCachingAllocator() { | ||
| // destructor must be called in thread safe manner |
Copilot
AI
Nov 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a potential race condition in the destructor. While the comment states "destructor must be called in thread safe manner", the destructor doesn't lock the mutex before calling reset() and free_cached(). If another thread is still executing methods on this object when the destructor is called, this could lead to undefined behavior. Consider adding a lock guard at the start of the destructor, or document that the caller must ensure no concurrent access during destruction.
| // destructor must be called in thread safe manner | |
| // destructor must be called in thread safe manner | |
| std::lock_guard<std::mutex> guard(mutex_); |
| } | ||
| available_map_.clear(); | ||
| current_size_ = 0; |
Copilot
AI
Nov 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
current_size_ should track the total size of all allocated memory (both in allocation_map_ and available_map_), but free_cached() sets it to 0 after only freeing memory in available_map_. This is incorrect when there are still active allocations in allocation_map_. Instead, current_size_ should be decremented by the sizes of the freed allocations. Consider: for (const auto& it : available_map_) { current_size_ -= it.first * it.second.size(); }
| } | |
| available_map_.clear(); | |
| current_size_ = 0; | |
| current_size_ -= it.first * it.second.size(); | |
| } | |
| available_map_.clear(); |
| } | ||
|
|
||
| void CPUCachingAllocator::free_cached() { | ||
| // We dont lock mutex_ here because it will cause deadlock otherwise |
Copilot
AI
Nov 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo in comment: "dont" should be "don't".
| // We dont lock mutex_ here because it will cause deadlock otherwise | |
| // We don't lock mutex_ here because it will cause deadlock otherwise |
| set(EXECUTORCH_ROOT ${CMAKE_CURRENT_SOURCE_DIR}/../..) | ||
| endif() | ||
|
|
||
| list(TRANSFORM _extension_module__srcs PREPEND "${EXECUTORCH_ROOT}/") |
Copilot
AI
Nov 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The variable name _extension_module__srcs appears to be incorrect. This file is for extension_memory_allocator, so it should reference _extension_memory_allocator__srcs which is defined in the build variables.
| list(TRANSFORM _extension_module__srcs PREPEND "${EXECUTORCH_ROOT}/") | |
| list(TRANSFORM _extension_memory_allocator__srcs PREPEND "${EXECUTORCH_ROOT}/") |
| current_size_ += size; | ||
| if (ptr == nullptr) { | ||
| ET_LOG(Error, "Failed to allocate memory"); | ||
| return nullptr; | ||
| } |
Copilot
AI
Nov 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current_size_ is incremented before checking if the allocation succeeded. If std::aligned_alloc returns nullptr (line 46), the function returns nullptr but current_size_ has already been incremented by size. This leads to incorrect size tracking. The increment should happen after confirming the allocation succeeded.
| current_size_ += size; | |
| if (ptr == nullptr) { | |
| ET_LOG(Error, "Failed to allocate memory"); | |
| return nullptr; | |
| } | |
| if (ptr == nullptr) { | |
| ET_LOG(Error, "Failed to allocate memory"); | |
| return nullptr; | |
| } | |
| current_size_ += size; |
| // Since allocation_map, which is a global instance, is mutated/read via | ||
| // all public APIs we need a global mutex. |
Copilot
AI
Nov 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment mentions "allocation_map, which is a global instance" but allocation_map_ is actually an instance member variable, not a global instance. This comment is misleading and should be corrected to reflect that the mutex protects the instance's member variables.
| // Since allocation_map, which is a global instance, is mutated/read via | |
| // all public APIs we need a global mutex. | |
| // Since allocation_map_ and other member variables are mutated/read via | |
| // all public APIs, we need a mutex to protect concurrent access to these instance members. |
| /* | ||
| max_size: Maximum size of memory to cache. Never cache more than that. | ||
| */ | ||
| CPUCachingAllocator(uint32_t max_size); |
Copilot
AI
Nov 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constructor parameter max_size is declared as uint32_t but the member variable max_size_ is of type size_t. This type mismatch could lead to implicit narrowing on platforms where size_t is larger than 32 bits. Consider changing the constructor parameter to size_t for consistency.
| CPUCachingAllocator(uint32_t max_size); | |
| CPUCachingAllocator(size_t max_size); |
| # This source code is licensed under the BSD-style license found in the | ||
| # LICENSE file in the root directory of this source tree. | ||
|
|
||
| # Please this file formatted by running: |
Copilot
AI
Nov 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo in comment: "Please this file" should be "Please keep this file" or "Please ensure this file is".
| # Please this file formatted by running: | |
| # Please keep this file formatted by running: |
| // Should not get cached pointer since size was adjusted during first | ||
| // allocation |
Copilot
AI
Nov 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment on lines 281-282 states "Should not get cached pointer since size was adjusted during first allocation", but this expectation is incorrect. The allocation on line 280 will use the same alignment-adjusted size as line 276, so it will allocate a new pointer (not from cache). The test then resets on line 283 and allocates again on line 285, which should indeed reuse p1. However, the intermediate allocation on line 280 is not verified and the comment is confusing. Consider removing lines 280-282 or adding an assertion to clarify the expected behavior.
| // Should not get cached pointer since size was adjusted during first | |
| // allocation | |
| // Should allocate a new pointer since both allocations use the same alignment-adjusted size. | |
| EXPECT_NE(p1, p2); |
Stack from ghstack (oldest at bottom):
Meant to use this for temp allocator for kernels. Specifically for sdpa, it seems that on iOS there is a significant overhead coming from allocations
Differential Revision: D85532079