
Conversation

WillemKauf
Contributor

@WillemKauf WillemKauf commented Aug 20, 2025

This PR demonstrates the difference between a simple for loop over a segment_meta_cstore from .begin() to .end() and one that precomputes .end(), showing that precomputing .end() yields a significant performance improvement. In other words, the naive

```cpp
for (auto it = store.begin(); it != store.end(); ++it) {
    ...
}
```

is a potential performance footgun, and would be better written as e.g.

```cpp
for (auto it = store.begin(), e_it = store.end(); it != e_it; ++it) {
    ...
}
```

if possible.

Results from the benchmark on my machine:

```
test                                                 iterations      median         mad         min         max      allocs       tasks        inst      cycles
cstore_bench.cs_iteration_recompute_end_test_1000           108     4.354ms   837.139ns     4.354ms     4.362ms    1530.000       0.000  59711228.3  22991621.4
cstore_bench.cs_iteration_recompute_end_test_10000            9    41.775ms    15.943us    41.710ms    41.816ms   14833.000       0.000 573555849.7 220456245.3
cstore_bench.cs_iteration_precompute_end_test_1000          160     1.293ms   507.575ns     1.293ms     1.295ms     530.000       0.000  18506437.3   6820267.5
cstore_bench.cs_iteration_precompute_end_test_10000          11    11.178ms    28.978us    11.126ms    11.207ms    4833.000       0.000 161778399.1  58919071.7

```

And with a test bumped up to 100000 entries:

```
test                                                 iterations      median         mad         min         max      allocs       tasks         inst       cycles
cstore_bench.cs_iteration_recompute_end_test_100000           1   399.905ms   234.335us   398.822ms   400.803ms  147819.000       0.000 5717966516.0 2109965829.4
cstore_bench.cs_iteration_precompute_end_test_100000          1   101.535ms   359.658us   101.176ms   101.966ms   47819.000       0.000 1594675168.8  536228334.2

```

The added benchmark should serve as motivation for finding sub-optimal loops over segment_meta_cstore objects in the code and altering them to be closer to the precomputed case.

To see why segment_meta_cstore::end() is not a cheap operation, check the following definitions:

segment_meta_cstore::impl::end():

```cpp
std::unique_ptr<segment_meta_materializing_iterator::impl> end() const {
    flush_write_buffer();
    return std::make_unique<segment_meta_materializing_iterator::impl>(
      _col.end());
}
```

column_store::end():

```cpp
/// Return iterator to the first element after the end of the sequence
auto end() const -> iterators_t {
    return std::apply(
      [](auto&&... col) { return iterators_t(col.end()...); }, columns());
}
```
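
To make the call-count asymmetry concrete, here is a small self-contained toy (not Redpanda code, just an instrumented stand-in for the cstore) that counts how many times each loop shape evaluates end():

```cpp
#include <cstdio>

// Toy stand-in (not Redpanda code): a range whose end() is instrumented so we
// can count how often each loop shape evaluates it.
struct counting_end_range {
    struct iter {
        int pos;
        bool operator!=(const iter& o) const { return pos != o.pos; }
        iter& operator++() { ++pos; return *this; }
        int operator*() const { return pos; }
    };
    int size;
    mutable int end_calls = 0;
    iter begin() const { return {0}; }
    iter end() const { ++end_calls; return {size}; }  // imagine this also flushes a buffer and allocates
};

int main() {
    counting_end_range r{1000};
    for (auto it = r.begin(); it != r.end(); ++it) { /* ... */ }
    std::printf("naive loop:       end() evaluated %d times\n", r.end_calls);  // 1001

    r.end_calls = 0;
    for (auto it = r.begin(), e_it = r.end(); it != e_it; ++it) { /* ... */ }
    std::printf("precomputed loop: end() evaluated %d times\n", r.end_calls);  // 1
}
```

For 1000 elements the naive loop evaluates end() 1001 times (once per comparison), which lines up with the benchmark above showing roughly one extra allocation per element in the allocs column for the recompute case.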

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

  • none

@WillemKauf WillemKauf changed the title cloud_storage: reduce calls to segmnet_meta_cstore::end() cloud_storage: reduce calls to segment_meta_cstore::end() Aug 20, 2025
@vbotbuildovich
Collaborator

CI test results

test results on build#71083
test_class test_method test_arguments test_kind job_url test_status passed reason
CrashReporterTest test_redpanda_crash_reporting null integration https://buildkite.com/redpanda/redpanda/builds/71083#0198c8e5-9ab5-4863-96cc-d356945dc85b FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS
DataMigrationsMultiClusterTest test_basic null integration https://buildkite.com/redpanda/redpanda/builds/71083#0198c889-94b2-4346-bdca-6812a563ef40 FLAKY 20/21 upstream reliability is '96.62027833001989'. current run reliability is '95.23809523809523'. drift is 1.38218 and the allowed drift is set to 50. The test should PASS

Member

@dotnwat dotnwat left a comment

nice find. i'm not opposed to accepting the exact change in this PR, but it would be nice to see if we can improve cstore so that new users don't need to know this performance trick.

  1. i would imagine that using a range-based for-loop would not suffer from this problem because it caches its call to end(), too. but that might be a lot of churn.

  2. could end() be cached internally? something like this

```diff
diff --git a/src/v/cloud_storage/segment_meta_cstore.cc b/src/v/cloud_storage/segment_meta_cstore.cc
index 4982f02e7f..679f09f9a7 100644
--- a/src/v/cloud_storage/segment_meta_cstore.cc
+++ b/src/v/cloud_storage/segment_meta_cstore.cc
@@ -935,10 +935,12 @@ public:
           _col.begin());
     }

-    std::unique_ptr<segment_meta_materializing_iterator::impl> end() const {
-        flush_write_buffer();
-        return std::make_unique<segment_meta_materializing_iterator::impl>(
-          _col.end());
+    std::unique_ptr<segment_meta_materializing_iterator::impl>& end() const {
+        if (flush_write_buffer()) {
+            _end = std::make_unique<segment_meta_materializing_iterator::impl>(
+              _col.end());
+        }
+        return _end;
     }

     std::unique_ptr<segment_meta_materializing_iterator::impl>
@@ -1092,12 +1094,13 @@ public:
         *this = serde::from_iobuf<impl>(std::move(in));
     }

-    void flush_write_buffer() const {
+    bool flush_write_buffer() const {
         if (_write_buffer.empty()) {
-            return;
+            return false;
         }
         _col.insert_entries(_write_buffer.begin(), _write_buffer.end());
         _write_buffer.clear();
+        return true;
     }

 private:
```

@WillemKauf
Contributor Author

would imagine that using a range-based for-loop would not suffer from this problem because it caches its call to end(), too. but that might be a lot of churn.

Yes, range-based for shouldn't be affected by this pessimization.
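
For reference, the range-for statement desugars (per the C++ standard) into a form that evaluates begin() and end() exactly once; a minimal sketch with a std::vector stand-in rather than the cstore:

```cpp
#include <vector>

int main() {
    std::vector<int> store_like{1, 2, 3};

    // `for (auto v : store_like) { ... }` desugars (per the C++ standard) into
    // roughly the following: begin() and end() are each evaluated exactly once,
    // before the loop runs.
    {
        auto&& range = store_like;
        auto first = range.begin();
        auto last = range.end();  // single evaluation, same effect as the precomputed e_it
        for (; first != last; ++first) {
            auto v = *first;
            (void)v;  // loop body would go here
        }
    }
}
```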

could end() be cached internally? something like this

While I like the idea, I don't think flush_write_buffer() provides absolute coverage for all the ways _col can be altered, see _col.append().

The idea of caching within end() feels nice, but it probably adds more complexity throughout cstore (i.e. we have to ensure that any operation which mutates _col correctly invalidates the cached value so it gets recomputed).

Even worse, the footgun then becomes a potential correctness bug, rather than just a slower operation caused by a new user not knowing how they should write their for loops.
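
As a toy illustration of that hazard (deliberately not the cstore, just a minimal container): a cache that only one mutation path invalidates goes stale as soon as another path is used:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy illustration (not the cstore): a container that caches its end but has
// one mutation path that forgets to invalidate the cache.
struct cached_end_container {
    std::vector<int> data;
    mutable std::size_t cached_end = 0;
    mutable bool cache_valid = false;

    std::size_t end_index() const {
        if (!cache_valid) {
            cached_end = data.size();
            cache_valid = true;
        }
        return cached_end;
    }
    void insert_buffered(int v) {
        data.push_back(v);
        cache_valid = false;  // well-behaved path: invalidates the cache
    }
    void append(int v) {
        data.push_back(v);    // footgun path: the cache silently goes stale
    }
};

int main() {
    cached_end_container c;
    c.insert_buffered(1);
    auto cached = c.end_index();      // caches end == 1
    c.append(2);                      // bypasses invalidation
    assert(c.end_index() == cached);  // still 1: a loop bounded by this would skip the new element
}
```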

@dotnwat
Member

dotnwat commented Aug 21, 2025

would imagine that using a range-based for-loop would not suffer from this problem because it caches its call to end(), too. but that might be a lot of churn.

Yes, range-based for shouldn't be affected by this pessimization.

could end() be cached internally? something like this

While I like the idea, I don't think flush_write_buffer() provides absolute coverage for all the ways _col can be altered, see _col.append().

are you saying that the way you cache end() is safe as compared to hiding the caching of end() because they seem identical. or are you saying that neither are safe in general, and we don't want the general case to be unsafe? for that matter, range-based for would also be unsafe?

@WillemKauf
Contributor Author

are you saying that the way you cache end() is safe as compared to hiding the caching of end() because they seem identical. or are you saying that neither are safe in general, and we don't want the general case to be unsafe? for that matter, range-based for would also be unsafe?

If the segment_meta_cstore is reused elsewhere after caching end(), and the underlying _col is then altered through means other than the write buffer, then the cache invalidation you have doesn't provide full coverage for the potential mutations. The two solutions seem identical in the context of a for loop without yield points, but I don't think this is true in general?

or are you saying that neither are safe in general, and we don't want the general case to be unsafe?

No, the way I have it written should be safe.

@dotnwat
Member

dotnwat commented Aug 21, 2025

If the segment_meta_cstore is reused elsewhere after caching end(), and the underlying _col is then altered through means other than the write buffer, then the cache invalidation you have doesn't provide full coverage for the potential mutations.

oh right duh. thanks

Member

@dotnwat dotnwat left a comment

  • all of the changes look good.
  • can we write a comment on begin/end about how to use them in a performant way?

The biggest concern I have is that we now have, AFAICT, a significant amount of LOC using a cached end() iterator. How do we have confidence that all of these critical sections do not violate the new assumption, namely that end() remains constant for the duration of their execution?

@WillemKauf
Contributor Author

The biggest concern I have is that we now have, AFAICT, a significant amount of LOC using a cached end() iterator. How do we have confidence that all of these critical sections do not violate the new assumption, namely that end() remains constant for the duration of their execution?

I would consider a concurrent modification of any underlying container's end() during iteration to be undefined behavior; if we previously lacked the guarantee that end() remained constant during iteration, I'm not sure the new behavior could/would be any worse.

Truthfully though, I cannot say with full confidence that this guarantee is present. I'd defer to an SME on that, @Lazin.

@WillemKauf
Contributor Author

The biggest concern I have is that we now have, AFAICT, a significant amount of LOC using a cached end() iterator. How do we have confidence that all of these critical sections do not violate the new assumption, namely that end() remains constant for the duration of their execution?

Furthermore, do you have the same concern for the range-based for loops already present over segment_meta_cstores in the code as for the existing iterator-based for loops?

@dotnwat
Member

dotnwat commented Aug 21, 2025

Furthermore, do you have the same concern for the range-based for loops already present over segment_meta_cstores in the code as for the existing iterator-based for loops?

hmm, probably not. i mean those have been there for a long time is the only reason.

I would consider a concurrent modification of any underlying container's end() during iteration as undefined behavior

oh yes of course. it's just that we were reevaluating end() at every loop so it seems like there is at least some chance that undefined behavior ended up being safe by chance.

anyway, if the perf savings is good we should consider merging soon because then we'll at least get a ton of test time booked.

let's wait to see what @Lazin thinks.

@Lazin
Contributor

Lazin commented Aug 21, 2025

The biggest concern I have is that we now have, AFAICT, a significant amount of LOC using a cached end() iterator. How do we have confidence that all of these critical sections do not violate the new assumption, namely that end() remains constant for the duration of their execution?

end() returns a special value that indicates the end of the range. You can even use the same end iterator with different containers. So it's totally safe to make end const and to not invoke flush from it.

@Lazin
Contributor

Lazin commented Aug 21, 2025

I would consider a concurrent modification of any underlying container's end() during iteration to be undefined behavior; if we previously lacked the guarantee that end() remained constant during iteration, I'm not sure the new behavior could/would be any worse.

The cstore is safe to modify concurrently. The iterator operates on a copy of the cstore. The end IIRC is just a special value. It doesn't track the actual end of the range.

It should be safe to not flush inside the end call.

@dotnwat
Member

dotnwat commented Aug 21, 2025

It should be safe to not flush inside the end call.

I would expect that flushing is not the source of the performance issue here, since AFAICT it would only occur on the first call to end() in a for-loop. But maybe that first call is the biggest? @WillemKauf do you know if the performance cost is incurred by the first call or if it's a small cost that adds up?

@Lazin
Contributor

Lazin commented Aug 21, 2025

I would expect that flushing is not the source of the performance issue here, since AFAICT it would only occur on the first call to end() in a for-loop. But maybe that first call is the biggest? @WillemKauf do you know if the performance cost is incurred by the first call or if it's a small cost that adds up?

It was surprising for me too. The flushing happens if the write buffer is not empty. Maybe there is a concurrent process that appends some data?

@WillemKauf
Contributor Author

WillemKauf commented Aug 21, 2025

It was surprising for me too. The flushing happens if the write buffer is not empty. Maybe there is a concurrent process that appends some data?

These benchmarks are just sampling a simple for loop, no concurrent processes involved.

do you know if the performance cost is incurred by the first call or if it's a small cost that adds up?

Altering the benchmark to flush before iterating, i.e.

```diff
--- a/src/v/cloud_storage/tests/segment_meta_cstore_bench.cc
+++ b/src/v/cloud_storage/tests/segment_meta_cstore_bench.cc
@@ -291,6 +291,7 @@ void cs_iteration_recompute_end_test(auto& store, size_t sz) {
         store.insert(s);
     }
 
+    store.flush_write_buffer();
     perf_tests::start_measuring_time();
     for (auto it = store.begin(); it != store.end(); ++it) {
         auto v = it->size_bytes;
@@ -304,6 +305,7 @@ void cs_iteration_precompute_end_test(auto& store, size_t sz) {
         store.insert(s);
     }
 
+    store.flush_write_buffer();
     perf_tests::start_measuring_time();
     for (auto it = store.begin(), e_it = store.end(); it != e_it; ++it) {
         auto v = it->size_bytes;

```

yields the following results:

```
cstore_bench.cs_iteration_recompute_end_test_1000           111     3.937ms     1.712us     3.897ms     4.638ms    1249.000       0.000  56539043.5  20523800.4
cstore_bench.cs_iteration_recompute_end_test_10000            9    39.004ms     9.258us    38.878ms    39.139ms   12368.000       0.000 564275282.4 204212833.3
cstore_bench.cs_iteration_precompute_end_test_1000          163     1.041ms     3.812us     1.036ms     1.046ms     249.000       0.000  15319031.4   5427449.2
cstore_bench.cs_iteration_precompute_end_test_10000          12    10.193ms     8.181us    10.185ms    10.233ms    2368.000       0.000 152148394.5  53416316.2

```

Looks like the iteration and materialization of end() is the dominant cost.

@oleiman
Member

oleiman commented Aug 21, 2025

this is the first time I'm looking at this code, but it looks to me like segment_meta_cstore::end() calls std::list<frame_iter_t>::end a bunch of times.

oh never mind, of course std::list::end is O(1)

@WillemKauf WillemKauf requested a review from dotnwat August 21, 2025 20:33
@dotnwat
Member

dotnwat commented Aug 21, 2025

end() returns a special value that indicates the end of the range. You can even use the same end iterator with different containers. So it's totally safe to make end const and to not invoke flush from it.

if this is true then we can fix the performance problem without changing the for-loops to cache end(), right?

@WillemKauf
Contributor Author

if this is true then we can fix the performance problem without changing the for-loops to cache end(), right?

Yeah I suppose so. I'll take a look at reworking this soon

@dotnwat
Member

dotnwat commented Aug 21, 2025

if this is true then we can fix the performance problem without changing the for-loops to cache end(), right?

Yeah I suppose so. I'll take a look at reworking this soon

Cool, let me know what you find.

I think this is the key insight from @Lazin

The iterator operates on a copy of the cstore.

But this "IIRC" is doing some heavy lifting

The end IIRC is just a special value

And we should make sure it holds.

@WillemKauf
Contributor Author

if this is true then we can fix the performance problem without changing the for-loops to cache end(), right?

The issue now is that end() returns a tuple of iterators (iterators_t) which has an implicitly deleted copy constructor, so one cannot simply cache and return copies of this special value for end()...

Not sure how extensive the changes would have to be to all the delta_for impls to make this work, but the iterators_t type makes this a non-trivial change, I think.

@dotnwat
Copy link
Member

dotnwat commented Aug 25, 2025

if this is true then we can fix the performance problem without changing the for-loops to cache end(), right?

The issue now is that end() returns a tuple of iterators (iterators_t) which has an implicitly deleted copy constructor, so one cannot simply cache and return copies of this special value for end()...

Isn't that why it's in a unique_ptr?

@WillemKauf
Contributor Author

Isn't that why it's in a unique_ptr?

yeah i guess so. this area of the code is pretty gnarly in its implementation of iterators/facades.

I'm not sure if you think it's worth spelunking down deep to try to fix this, or if the knowledge that "this is a performance footgun (though not a correctness one)" is enough, along with the optimizations in this PR, to make it acceptable.

@Lazin
Contributor

Lazin commented Aug 26, 2025

Isn't that why it's in a unique_ptr?

yeah i guess so. this area of the code is pretty gnarly in its implementation of iterators/facades.

I'm not sure if you think it's worth spelunking down deep to try to fix this, or if the knowledge that "this is a performance footgun (though not a correctness one)" is enough, along with the optimizations in this PR, to make it acceptable.

One way to improve this is to use the 'is_end' method of the column store iterator. Even the 'end' iterator is a relatively heavy data structure that stores a collection of iterators internally (one per column). You can avoid creating the 'end' iterator altogether by checking 'is_end'.
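
A rough sketch of what that could look like at a call site, assuming the is_end() check can be reached from (or forwarded by) the materializing iterator; the method name is taken from the comment above, and the exact spelling in the real API may differ:

```cpp
// Sketch only: avoids constructing an end iterator at all. is_end() is assumed
// to be available on (or reachable from) the iterator, per the suggestion above.
for (auto it = store.begin(); !it.is_end(); ++it) {
    auto v = it->size_bytes;
    // ...
}
```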

@Lazin
Contributor

Lazin commented Aug 26, 2025

I just double-checked and it looks like I was wrong about flushing (the one at the top level, which is needed to support replacement of elements). The delta_for "frame" iterator is quite heavy. It contains a copy of the write buffer. The write buffer is 128 bytes. And we have 12 columns. This is the main reason why the materializing iterator is so "heavy". The column store exposes individual columns, so cstore users can iterate on columns to avoid materializing every record. So another way to fix the performance issues is to operate at the column level.
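
For a loop like the benchmark's, which only reads size_bytes, that could look roughly like the sketch below; the column accessor name is hypothetical, the point is just that only a single column's iterators get materialized:

```cpp
// Hypothetical sketch: iterate a single column instead of materializing whole
// segment_meta records. size_bytes_column() is an assumed accessor name, not
// necessarily the real API.
const auto& col = store.size_bytes_column();
for (auto it = col.begin(), e_it = col.end(); it != e_it; ++it) {
    auto v = *it;
    // ...
}
```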

@WillemKauf
Contributor Author

So another way to fix the performance issues is to operate at the column level.

Should this PR be reworked to do that, or do you (and @dotnwat) think that the proposed changes are acceptable in their current form?

Happy to make any changes if there's a clear path here.

Contributor

@Lazin Lazin left a comment

The PR looks good.
