cloud_storage: reduce calls to segment_meta_cstore::end() #27311
base: dev
Conversation
This PR demonstrates the difference between a simple `for` loop over a `segment_meta_cstore` from `.begin()` to `.end()` and one that precomputes `.end()`, showing that precomputing `.end()` yields a valuable performance improvement. Put differently, the naive

```cpp
for (auto it = store.begin(); it != store.end(); ++it) { ... }
```

is a potential performance footgun, and would be better written as e.g.

```cpp
for (auto it = store.begin(), e_it = store.end(); it != e_it; ++it) { ... }
```

if possible.

Results from the benchmark on my machine:

```
test                                                  iterations     median        mad       min       max     allocs   tasks         inst       cycles
cstore_bench.cs_iteration_recompute_end_test_1000            108    4.354ms  837.139ns   4.354ms   4.362ms   1530.000   0.000   59711228.3   22991621.4
cstore_bench.cs_iteration_recompute_end_test_10000             9   41.775ms   15.943us  41.710ms  41.816ms  14833.000   0.000  573555849.7  220456245.3
cstore_bench.cs_iteration_precompute_end_test_1000           160    1.293ms  507.575ns   1.293ms   1.295ms    530.000   0.000   18506437.3    6820267.5
cstore_bench.cs_iteration_precompute_end_test_10000           11   11.178ms   28.978us  11.126ms  11.207ms   4833.000   0.000  161778399.1   58919071.7
```

And with a test bumped up to `100000` entries:

```
test                                                  iterations     median        mad        min        max      allocs   tasks          inst        cycles
cstore_bench.cs_iteration_recompute_end_test_100000            1  399.905ms  234.335us  398.822ms  400.803ms  147819.000   0.000  5717966516.0  2109965829.4
cstore_bench.cs_iteration_precompute_end_test_100000           1  101.535ms  359.658us  101.176ms  101.966ms   47819.000   0.000  1594675168.8   536228334.2
```

This benchmark should serve as motivation for finding sub-optimal loops over `segment_meta_cstore` objects in the code and altering them to be closer to the precomputed case.

For a summary as to why `segment_meta_cstore::end()` is not a cheap operation, check the following definitions:

- `segment_meta_cstore::impl::end()`: redpanda/src/v/cloud_storage/segment_meta_cstore.cc, lines 938 to 942 in 083e363
- `column_store::end()`: redpanda/src/v/cloud_storage/segment_meta_cstore.cc, lines 531 to 535 in 083e363
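To make the shape of the problem concrete outside of the cstore, here is a minimal standalone C++ sketch (not redpanda code; the toy container and call counter are invented for illustration) showing that the naive loop re-evaluates `end()` on every iteration, while the hoisted form evaluates it once:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Toy container whose end() is "expensive": it counts how often it is called,
// standing in for flushing a write buffer / materializing a fat iterator.
struct expensive_end_container {
    std::vector<int> data;
    mutable std::size_t end_calls = 0;

    auto begin() const { return data.begin(); }
    auto end() const {
        ++end_calls;
        return data.end();
    }
};

int main() {
    expensive_end_container c{std::vector<int>(1000, 1)};

    long sum = 0;
    // Naive form: the loop condition re-evaluates end() every iteration,
    // so end() runs 1001 times for 1000 elements.
    for (auto it = c.begin(); it != c.end(); ++it) {
        sum += *it;
    }
    std::cout << "naive:   " << c.end_calls << " end() calls\n";

    c.end_calls = 0;
    // Hoisted form: end() is evaluated exactly once.
    for (auto it = c.begin(), e_it = c.end(); it != e_it; ++it) {
        sum += *it;
    }
    std::cout << "hoisted: " << c.end_calls << " end() calls\n";
    return sum == 2000 ? 0 : 1;
}
```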
CI test results: test results on build #71083
nice find. i'm not opposed to accepting the exact change in this PR, but it would be nice to see if we can improve cstore so that new users don't need to know this performance trick.

- i would imagine that using a range-based for-loop would not suffer from this problem because it caches its call to end(), too (see the desugaring sketch after the diff below). but that might be a lot of churn.
- could end() be cached internally? something like this
```diff
diff --git a/src/v/cloud_storage/segment_meta_cstore.cc b/src/v/cloud_storage/segment_meta_cstore.cc
index 4982f02e7f..679f09f9a7 100644
--- a/src/v/cloud_storage/segment_meta_cstore.cc
+++ b/src/v/cloud_storage/segment_meta_cstore.cc
@@ -935,10 +935,12 @@ public:
           _col.begin());
     }
 
-    std::unique_ptr<segment_meta_materializing_iterator::impl> end() const {
-        flush_write_buffer();
-        return std::make_unique<segment_meta_materializing_iterator::impl>(
-          _col.end());
+    std::unique_ptr<segment_meta_materializing_iterator::impl>& end() const {
+        if (flush_write_buffer()) {
+            _end = std::make_unique<segment_meta_materializing_iterator::impl>(
+              _col.end());
+        }
+        return _end;
     }
 
     std::unique_ptr<segment_meta_materializing_iterator::impl>
@@ -1092,12 +1094,13 @@ public:
         *this = serde::from_iobuf<impl>(std::move(in));
     }
 
-    void flush_write_buffer() const {
+    bool flush_write_buffer() const {
         if (_write_buffer.empty()) {
-            return;
+            return false;
         }
         _col.insert_entries(_write_buffer.begin(), _write_buffer.end());
         _write_buffer.clear();
+        return true;
     }
 
 private:
```
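For reference on the first bullet: the C++ standard specifies that a range-based for loop desugars to roughly the following, evaluating end() exactly once before the loop (a sketch; the `__`-prefixed names follow the standard's exposition):

```cpp
// for (auto&& elem : store) { /* body */ } is roughly equivalent to:
{
    auto&& __range = store;
    auto __begin = __range.begin(); // or begin(__range) via ADL
    auto __end = __range.end();     // evaluated once, before the loop
    for (; __begin != __end; ++__begin) {
        auto&& elem = *__begin;
        // body
    }
}
```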
Yes, range-based `for` loops cache the result of end() and would not suffer from this problem.
While I like the idea, I don't think it's safe. The idea of caching within `end()` risks handing out a stale iterator. Even worse, the footgun now becomes a potential correctness bug, rather than just a slower operation, due to a new user not knowing about how they should be writing their `for` loops.
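A standalone sketch of that hazard, using an invented toy container rather than the cstore code: if the cached iterator is only refreshed when a flush actually happens, a call with an empty write buffer, or any mutation that bypasses the buffer, can hand back a bogus end():

```cpp
#include <vector>

// Toy analogue of caching end() behind a flush check.
struct caching_container {
    mutable std::vector<int> data;
    mutable std::vector<int> write_buffer;
    mutable std::vector<int>::iterator cached_end{}; // singular until first flush

    bool flush() const {
        if (write_buffer.empty()) {
            return false;
        }
        data.insert(data.end(), write_buffer.begin(), write_buffer.end());
        write_buffer.clear();
        return true;
    }

    auto end() const {
        if (flush()) {
            cached_end = data.end();
        }
        // Hazard 1: if the buffer was empty on the very first call, cached_end
        // is still a singular iterator. Hazard 2: if data was modified through
        // another code path since the last flush, cached_end is stale.
        return cached_end;
    }
};
```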
are you saying that the way you cache `end()` in the loops in this PR has the same problem?
If the
No, the way I have it written should be safe.
oh right duh. thanks
- all of the changes look good.
- can we write a comment on begin/end about how to use them in a performant way? (a sketch of such a comment follows below)
The biggest concern I have is that we now have, AFAICT, a significant amount of LOC using a cached end() iterator. How do we have confidence that all of these critical sections do not violate the new assumption, namely that end() remains constant for the duration of their execution?
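On the begin/end comment, something along these lines could capture the guidance (wording is only a sketch, paired with the impl declaration from the diff above):

```cpp
/// NOTE: end() flushes the pending write buffer and materializes a
/// heavyweight iterator, so avoid re-evaluating it in a loop condition.
/// Prefer hoisting it:
///
///     for (auto it = store.begin(), e_it = store.end(); it != e_it; ++it) {
///         ...
///     }
std::unique_ptr<segment_meta_materializing_iterator::impl> end() const;
```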
I would consider a concurrent modification of any underlying container's contents during iteration to be a bug in its own right. Truthfully though, I cannot in full confidence say that this guarantee is present. I'd defer to an SME on that, @Lazin.
Furthermore, do you now have the same concern for currently present range-based for loops over `segment_meta_cstore` objects?
hmm, probably not. i mean, the only reason is that those have been there for a long time.
oh yes of course. it's just that we were reevaluating end() at every loop iteration, so it seems like there is at least some chance that undefined behavior ended up being safe by chance. anyway, if the perf savings is good we should consider merging soon, because then we'll at least get a ton of test time booked. let's wait to see what @Lazin thinks.
The flush in end() looks unnecessary to me. It should be safe to not flush inside the end call.
I would expect that flushing is not the source of the performance issue here, since AFAICT it would only occur on the first call to end() in a for-loop. But maybe that first call is the biggest? @WillemKauf do you know if the performance cost is incurred by the first call, or is it a small cost that adds up?
It was surprising for me too. The flushing happens if the write buffer is not empty. Maybe there is a concurrent process that appends some data?
These benchmarks are just sampling a simple `for` loop after inserting the entries, so there is no concurrent process appending data.
Altering the benchmark to flush before iterating, i.e.

```diff
--- a/src/v/cloud_storage/tests/segment_meta_cstore_bench.cc
+++ b/src/v/cloud_storage/tests/segment_meta_cstore_bench.cc
@@ -291,6 +291,7 @@ void cs_iteration_recompute_end_test(auto& store, size_t sz) {
         store.insert(s);
     }
 
+    store.flush_write_buffer();
     perf_tests::start_measuring_time();
     for (auto it = store.begin(); it != store.end(); ++it) {
         auto v = it->size_bytes;
@@ -304,6 +305,7 @@ void cs_iteration_precompute_end_test(auto& store, size_t sz) {
         store.insert(s);
     }
 
+    store.flush_write_buffer();
     perf_tests::start_measuring_time();
     for (auto it = store.begin(), e_it = store.end(); it != e_it; ++it) {
         auto v = it->size_bytes;
```

yields the following results
Looks like the iteration and materialization of values is still the expensive part, even with the flush hoisted out of the measured loop.
oh never mind, of course std::list::end is O(1)
if this is true then we can fix the performance problem without changing the for-loops to cache end(), right?
Yeah I suppose so. I'll take a look at reworking this soon
Cool, let me know what you find. I think this is the key insight from @Lazin. But this, IIRC, is doing some heavy lifting, and we should make sure it holds.
The issue now is that the materializing iterator is expensive to construct and pass around. Not sure how extensive the changes would have to be to all the call sites.
Isn't that why it's in a unique_ptr?
yeah i guess so. this area of the code is pretty gnarly in its implementation of iterators/facades. I'm not sure if you think it's worth spelunking down deep to try to fix this, or if the knowledge that "this is a performance footgun (though not a correctness one)" is enough, along with the optimizations in this PR, to make it acceptable.
One way to improve this is to use the 'is_end' method of the column store iterator. Even the 'end' iterator is a relatively heavy data structure that stores a collection of iterators internally (one per column). You can avoid creating the 'end' iterator altogether by checking 'is_end'.
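A sketch of what that could look like at a call site, assuming the materializing iterator exposes the is_end() predicate described above (the exact name and signature are taken on trust here):

```cpp
// Iterate without ever constructing an end() iterator.
for (auto it = store.begin(); !it.is_end(); ++it) {
    auto v = it->size_bytes;
    // ...
}
```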
I just double checked and it looks like I was wrong about flushing (the one at the top level, which is needed to support replacement of elements). The delta_for "frame" iterator is quite heavy: it contains a copy of the write buffer, the write buffer is 128 bytes, and we have 12 columns. This is the main reason why the materializing iterator is so "heavy". The column store exposes individual columns, so cstore users can iterate on columns to avoid materializing every record. So another way to fix the performance issues is to operate at the column level.
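For illustration, column-level iteration might look roughly like this; the size_bytes_column() accessor below is an invented name, not the actual cstore API:

```cpp
// Hypothetical: walk a single column instead of materializing a full
// segment_meta record per element; hoist end() as elsewhere in this PR.
uint64_t total_bytes = 0;
const auto& col = store.size_bytes_column(); // invented accessor name
for (auto it = col.begin(), e_it = col.end(); it != e_it; ++it) {
    total_bytes += *it;
}
```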
Should this PR be reworked to do that, or do you (and @dotnwat) think that the proposed changes are acceptable in their current form? Happy to make any changes if there's a clear path here.
The PR looks good.
Backports Required
Release Notes