
Conversation

WillemKauf
Contributor

@WillemKauf WillemKauf commented Aug 20, 2025

This PR demonstrates the difference between a simple for loop over a segment_meta_cstore from .begin() to .end() and one that precomputes .end(), showing that precomputing .end() yields a significant performance improvement. In other words, the naive

```cpp
for (auto it = store.begin(); it != store.end(); ++it) {
    ...
}
```

is a potential performance footgun, and would be better written as e.g.

```cpp
for (auto it = store.begin(), e_it = store.end(); it != e_it; ++it) {
    ...
}
```

if possible.

Results from the benchmark on my machine:

```
test                                                 iterations      median         mad         min         max      allocs       tasks        inst      cycles
cstore_bench.cs_iteration_recompute_end_test_1000           108     4.354ms   837.139ns     4.354ms     4.362ms    1530.000       0.000  59711228.3  22991621.4
cstore_bench.cs_iteration_recompute_end_test_10000            9    41.775ms    15.943us    41.710ms    41.816ms   14833.000       0.000 573555849.7 220456245.3
cstore_bench.cs_iteration_precompute_end_test_1000          160     1.293ms   507.575ns     1.293ms     1.295ms     530.000       0.000  18506437.3   6820267.5
cstore_bench.cs_iteration_precompute_end_test_10000          11    11.178ms    28.978us    11.126ms    11.207ms    4833.000       0.000 161778399.1  58919071.7

```

And with a test bumped up to 100000 entries:

```
test                                                 iterations      median         mad         min         max      allocs       tasks         inst       cycles
cstore_bench.cs_iteration_recompute_end_test_100000           1   399.905ms   234.335us   398.822ms   400.803ms  147819.000       0.000 5717966516.0 2109965829.4
cstore_bench.cs_iteration_precompute_end_test_100000          1   101.535ms   359.658us   101.176ms   101.966ms   47819.000       0.000 1594675168.8  536228334.2

```

The added benchmark should serve as motivation for finding sub-optimal loops over segment_meta_cstore objects in the code and altering them to be closer to the precomputed case.

To see why segment_meta_cstore::end() is not a cheap operation, check the following definitions:

segment_meta_cstore::impl::end():

```cpp
std::unique_ptr<segment_meta_materializing_iterator::impl> end() const {
    flush_write_buffer();
    return std::make_unique<segment_meta_materializing_iterator::impl>(
      _col.end());
}
```

column_store::end():

```cpp
/// Return iterator to the first element after the end of the sequence
auto end() const -> iterators_t {
    return std::apply(
      [](auto&&... col) { return iterators_t(col.end()...); }, columns());
}
```
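
To make the call-count asymmetry concrete, here is a small self-contained toy (not Redpanda code, just an instrumented stand-in for the cstore) that counts how many times each loop shape evaluates end():

```cpp
#include <cstdio>

// Toy stand-in (not Redpanda code): a range whose end() is instrumented so we
// can count how often each loop shape evaluates it.
struct counting_end_range {
    struct iter {
        int pos;
        bool operator!=(const iter& o) const { return pos != o.pos; }
        iter& operator++() { ++pos; return *this; }
        int operator*() const { return pos; }
    };
    int size;
    mutable int end_calls = 0;
    iter begin() const { return {0}; }
    iter end() const { ++end_calls; return {size}; }  // imagine this also flushes a buffer and allocates
};

int main() {
    counting_end_range r{1000};
    for (auto it = r.begin(); it != r.end(); ++it) { /* ... */ }
    std::printf("naive loop:       end() evaluated %d times\n", r.end_calls);  // 1001

    r.end_calls = 0;
    for (auto it = r.begin(), e_it = r.end(); it != e_it; ++it) { /* ... */ }
    std::printf("precomputed loop: end() evaluated %d times\n", r.end_calls);  // 1
}
```

For 1000 elements the naive loop evaluates end() 1001 times (once per comparison), which lines up with the benchmark above showing roughly one extra allocation per element in the allocs column for the recompute case.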

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

  • none

@WillemKauf WillemKauf changed the title cloud_storage: reduce calls to segmnet_meta_cstore::end() cloud_storage: reduce calls to segment_meta_cstore::end() Aug 20, 2025
@vbotbuildovich
Collaborator

CI test results

test results on build#71083
test_class test_method test_arguments test_kind job_url test_status passed reason
CrashReporterTest test_redpanda_crash_reporting null integration https://buildkite.com/redpanda/redpanda/builds/71083#0198c8e5-9ab5-4863-96cc-d356945dc85b FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS
DataMigrationsMultiClusterTest test_basic null integration https://buildkite.com/redpanda/redpanda/builds/71083#0198c889-94b2-4346-bdca-6812a563ef40 FLAKY 20/21 upstream reliability is '96.62027833001989'. current run reliability is '95.23809523809523'. drift is 1.38218 and the allowed drift is set to 50. The test should PASS

Member

@dotnwat dotnwat left a comment

nice find. i'm not opposed to accepting the exact change in this PR, but it would be nice to see if we can improve cstore so that new users don't need to know this performance trick.

  1. i would imagine that using a range-based for-loop would not suffer from this problem because it caches its call to end(), too. but that might be a lot of churn.

  2. could end() be cached internally? something like this

```diff
diff --git a/src/v/cloud_storage/segment_meta_cstore.cc b/src/v/cloud_storage/segment_meta_cstore.cc
index 4982f02e7f..679f09f9a7 100644
--- a/src/v/cloud_storage/segment_meta_cstore.cc
+++ b/src/v/cloud_storage/segment_meta_cstore.cc
@@ -935,10 +935,12 @@ public:
           _col.begin());
     }

-    std::unique_ptr<segment_meta_materializing_iterator::impl> end() const {
-        flush_write_buffer();
-        return std::make_unique<segment_meta_materializing_iterator::impl>(
-          _col.end());
+    std::unique_ptr<segment_meta_materializing_iterator::impl>& end() const {
+        if (flush_write_buffer()) {
+            _end = std::make_unique<segment_meta_materializing_iterator::impl>(
+              _col.end());
+        }
+        return _end;
     }

     std::unique_ptr<segment_meta_materializing_iterator::impl>
@@ -1092,12 +1094,13 @@ public:
         *this = serde::from_iobuf<impl>(std::move(in));
     }

-    void flush_write_buffer() const {
+    bool flush_write_buffer() const {
         if (_write_buffer.empty()) {
-            return;
+            return false;
         }
         _col.insert_entries(_write_buffer.begin(), _write_buffer.end());
         _write_buffer.clear();
+        return true;
     }

 private:
```

@WillemKauf
Contributor Author

would imagine that using a range-based for-loop would not suffer from this problem because it caches its call to end(), too. but that might be a lot of churn.

Yes, range-based for shouldn't be affected by this pessimization.
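
For reference, the range-for statement desugars (per the C++ standard) into a form that evaluates begin() and end() exactly once; a minimal sketch with a std::vector stand-in rather than the cstore:

```cpp
#include <vector>

int main() {
    std::vector<int> store_like{1, 2, 3};

    // `for (auto v : store_like) { ... }` desugars (per the C++ standard) into
    // roughly the following: begin() and end() are each evaluated exactly once,
    // before the loop runs.
    {
        auto&& range = store_like;
        auto first = range.begin();
        auto last = range.end();  // single evaluation, same effect as the precomputed e_it
        for (; first != last; ++first) {
            auto v = *first;
            (void)v;  // loop body would go here
        }
    }
}
```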

could end() be cached internally? something like this

While I like the idea, I don't think flush_write_buffer() provides absolute coverage for all the ways _col can be altered, see _col.append().

The idea of caching within end() feels nice, but it probably adds more complexity throughout cstore (i.e. we have to ensure that any operation which mutates _col correctly invalidates the cached value so it gets recomputed).

Even worse, the footgun then becomes a potential correctness bug, rather than just a slower operation caused by a new user not knowing how they should write their for loops.
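
As a toy illustration of that hazard (deliberately not the cstore, just a minimal container): a cache that only one mutation path invalidates goes stale as soon as another path is used:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy illustration (not the cstore): a container that caches its end but has
// one mutation path that forgets to invalidate the cache.
struct cached_end_container {
    std::vector<int> data;
    mutable std::size_t cached_end = 0;
    mutable bool cache_valid = false;

    std::size_t end_index() const {
        if (!cache_valid) {
            cached_end = data.size();
            cache_valid = true;
        }
        return cached_end;
    }
    void insert_buffered(int v) {
        data.push_back(v);
        cache_valid = false;  // well-behaved path: invalidates the cache
    }
    void append(int v) {
        data.push_back(v);    // footgun path: the cache silently goes stale
    }
};

int main() {
    cached_end_container c;
    c.insert_buffered(1);
    auto cached = c.end_index();      // caches end == 1
    c.append(2);                      // bypasses invalidation
    assert(c.end_index() == cached);  // still 1: a loop bounded by this would skip the new element
}
```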

@dotnwat
Member

dotnwat commented Aug 21, 2025

would imagine that using a range-based for-loop would not suffer from this problem because it caches its call to end(), too. but that might be a lot of churn.

Yes, range-based for shouldn't be affected by this pessimization.

could end() be cached internally? something like this

While I like the idea, I don't think flush_write_buffer() provides absolute coverage for all the ways _col can be altered, see _col.append().

are you saying that the way you cache end() is safe as compared to hiding the caching of end() because they seem identical. or are you saying that neither are safe in general, and we don't want the general case to be unsafe? for that matter, range-based for would also be unsafe?

@WillemKauf
Contributor Author

are you saying that the way you cache end() is safe as compared to hiding the caching of end() because they seem identical. or are you saying that neither are safe in general, and we don't want the general case to be unsafe? for that matter, range-based for would also be unsafe?

If the segment_meta_cstore is reused elsewhere after caching end(), and the underlying _col is then altered through means other than the write buffer, then the cache invalidation you have doesn't provide full coverage for the potential mutations. The two solutions seem identical in the context of a for loop without yield points, but I don't think this is true in general?

or are you saying that neither are safe in general, and we don't want the general case to be unsafe?

No, the way I have it written should be safe.

@dotnwat
Member

dotnwat commented Aug 21, 2025

If the segment_meta_cstore is reused elsewhere after caching end(), and the underlying _col is then altered through means other than the write buffer, then the cache invalidation you have doesn't provide full coverage for the potential mutations.

oh right duh. thanks

Member

@dotnwat dotnwat left a comment

  • all of the changes look good.
  • can we write a comment on begin/end about how to use them in a performant way?

The biggest concern I have is that we now have, AFAICT, a significant amount of LOC using a cached end() iterator. How do we have confidence that all of these critical sections do not violate the new assumption, namely that end() remains constant for the duration of their execution?

@WillemKauf
Contributor Author

The biggest concern I have is that we now have, AFAICT, a significant amount of LOC using a cached end() iterator. How do we have confidence that all of these critical sections do not violate the new assumption, namely that end() remains constant for the duration of their execution?

I would consider a concurrent modification of any underlying container's end() during iteration to be undefined behavior; if we previously lacked the guarantee that end() remained constant during iteration, I'm not sure the new behavior could/would be any worse.

Truthfully though, I cannot say with full confidence that this guarantee is present. I'd defer to an SME on that, @Lazin.

@WillemKauf
Contributor Author

The biggest concern I have is that we now have, AFAICT, a significant amount of LOC using a cached end() iterator. How do we have confidence that all of these critical sections do not violate the new assumption, namely that end() remains constant for the duration of their execution?

Furthermore, do you have the same concern for the range-based for loops already present over segment_meta_cstores in the code as for the existing iterator-based for loops?

@dotnwat
Member

dotnwat commented Aug 21, 2025

Furthermore, do you have the same concern for the range-based for loops already present over segment_meta_cstores in the code as for the existing iterator-based for loops?

hmm, probably not. i mean those have been there for a long time is the only reason.

I would consider a concurrent modification of any underlying container's end() during iteration as undefined behavior

oh yes of course. it's just that we were reevaluating end() at every loop so it seems like there is at least some chance that undefined behavior ended up being safe by chance.

anyway, if the perf savings is good we should consider merging soon because then we'll at least get a ton of test time booked.

let's wait to see what @Lazin thinks.

@Lazin
Contributor

Lazin commented Aug 21, 2025

The biggest concern I have is that we now have, AFAICT, a significant amount of LOC using a cached end() iterator. How do we have confidence that all of these critical sections do not violate the new assumption, namely that end() remains constant for the duration of their execution?

end() returns a special value that indicates the end of the range. You can even use the same end iterator with different containers. So it's totally safe to make end const and to not invoke flush from it.

@Lazin
Contributor

Lazin commented Aug 21, 2025

I would consider a concurrent modification of any underlying container's end() during iteration to be undefined behavior; if we previously lacked the guarantee that end() remained constant during iteration, I'm not sure the new behavior could/would be any worse.

The cstore is safe to modify concurrently. The iterator operates on a copy of the cstore. The end IIRC is just a special value. It doesn't track the actual end of the range.

It should be safe to not flush inside the end call.

@dotnwat
Member

dotnwat commented Aug 21, 2025

It should be safe to not flush inside the end call.

I would expect that flushing is not the source of the performance issue here, since AFAICT it would only occur on the first call to end() in a for-loop. But maybe that first call is the biggest? @WillemKauf do you know if the performance cost is incurred by the first call or if it's a small cost that adds up?

@Lazin
Contributor

Lazin commented Aug 21, 2025

I would expect that flushing is not the source of the performance issue here, since AFAICT it would only occur on the first call to end() in a for-loop. But maybe that first call is the biggest? @WillemKauf do you know if the performance cost is incurred by the first call or if it's a small cost that adds up?

It was surprising for me too. The flushing happens if the write buffer is not empty. Maybe there is a concurrent process that appends some data?

@WillemKauf
Contributor Author

WillemKauf commented Aug 21, 2025

It was surprising for me too. The flushing happens if the write buffer is not empty. Maybe there is a concurrent process that appends some data?

These benchmarks are just sampling a simple for loop, no concurrent processes involved.

do you know if the performance cost is incurred by the first call or if it's a small cost that adds up?

Altering the benchmark to flush before iterating, i.e.

```diff
--- a/src/v/cloud_storage/tests/segment_meta_cstore_bench.cc
+++ b/src/v/cloud_storage/tests/segment_meta_cstore_bench.cc
@@ -291,6 +291,7 @@ void cs_iteration_recompute_end_test(auto& store, size_t sz) {
         store.insert(s);
     }
 
+    store.flush_write_buffer();
     perf_tests::start_measuring_time();
     for (auto it = store.begin(); it != store.end(); ++it) {
         auto v = it->size_bytes;
@@ -304,6 +305,7 @@ void cs_iteration_precompute_end_test(auto& store, size_t sz) {
         store.insert(s);
     }
 
+    store.flush_write_buffer();
     perf_tests::start_measuring_time();
     for (auto it = store.begin(), e_it = store.end(); it != e_it; ++it) {
         auto v = it->size_bytes;

```

yields the following results:

```
cstore_bench.cs_iteration_recompute_end_test_1000           111     3.937ms     1.712us     3.897ms     4.638ms    1249.000       0.000  56539043.5  20523800.4
cstore_bench.cs_iteration_recompute_end_test_10000            9    39.004ms     9.258us    38.878ms    39.139ms   12368.000       0.000 564275282.4 204212833.3
cstore_bench.cs_iteration_precompute_end_test_1000          163     1.041ms     3.812us     1.036ms     1.046ms     249.000       0.000  15319031.4   5427449.2
cstore_bench.cs_iteration_precompute_end_test_10000          12    10.193ms     8.181us    10.185ms    10.233ms    2368.000       0.000 152148394.5  53416316.2

```

Looks like the iteration and materialization of end() is the dominant cost.

@oleiman
Member

oleiman commented Aug 21, 2025

this is the first time I'm looking at this code, but it looks to me like segment_meta_cstore::end() calls std::list<frame_iter_t>::end a bunch of times.

oh never mind, of course std::list::end is O(1)

@WillemKauf WillemKauf requested a review from dotnwat August 21, 2025 20:33
@dotnwat
Member

dotnwat commented Aug 21, 2025

end() returns a special value that indicates the end of the range. You can even use the same end iterator with different containers. So it's totally safe to make end const and to not invoke flush from it.

if this is true then we can fix the performance problem without changing the for-loops to cache end(), right?

@WillemKauf
Contributor Author

if this is true then we can fix the performance problem without changing the for-loops to cache end(), right?

Yeah I suppose so. I'll take a look at reworking this soon

@dotnwat
Member

dotnwat commented Aug 21, 2025

if this is true then we can fix the performance problem without changing the for-loops to cache end(), right?

Yeah I suppose so. I'll take a look at reworking this soon

Cool, let me know what you find.

I think this is the key insight from @Lazin

The iterator operates on a copy of the cstore.

But this "IIRC" is doing some heavy lifting

The end IIRC is just a special value

And we should make sure it holds.

@WillemKauf
Contributor Author

if this is true then we can fix the performance problem without changing the for-loops to cache end(), right?

The issue now is that end() returns a tuple of iterators (iterators_t) which has an implicitly deleted copy constructor, so one cannot simply cache and return copies of this special value for end()...

Not sure how extensive the changes would have to be to all the delta_for impls to make this work, but the iterators_t type makes this a non-trivial change, I think.

@dotnwat
Copy link
Member

dotnwat commented Aug 25, 2025

if this is true then we can fix the performance problem without changing the for-loops to cache end(), right?

The issue now is that end() returns a tuple of iterators (iterators_t) which has an implicitly deleted copy constructor, so one cannot simply cache and return copies of this special value for end()...

Isn't that why it's in a unique_ptr?

@WillemKauf
Contributor Author

Isn't that why it's in a unique_ptr?

yeah i guess so. this area of the code is pretty gnarly in its implementation of iterators/facades.

I'm not sure if you think it's worth spelunking down deep to try to fix this, or if the knowledge that "this is a performance footgun (though not a correctness one)" is enough, along with the optimizations in this PR, to make it acceptable.

@Lazin
Contributor

Lazin commented Aug 26, 2025

Isn't that why it's in a unique_ptr?

yeah i guess so. this area of the code is pretty gnarly in its implementation of iterators/facades.

I'm not sure if you think it's worth spelunking down deep to try to fix this, or if the knowledge that "this is a performance footgun (though not a correctness one)" is enough, along with the optimizations in this PR, to make it acceptable.

One way to improve this is to use the 'is_end' method of the column store iterator. Even the 'end' iterator is a relatively heavy data structure that stores a collection of iterators internally (one per column). You can avoid creating the 'end' iterator altogether by checking 'is_end'.
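
A rough sketch of what that could look like at a call site, assuming the is_end() check can be reached from (or forwarded by) the materializing iterator; the method name is taken from the comment above, and the exact spelling in the real API may differ:

```cpp
// Sketch only: avoids constructing an end iterator at all. is_end() is assumed
// to be available on (or reachable from) the iterator, per the suggestion above.
for (auto it = store.begin(); !it.is_end(); ++it) {
    auto v = it->size_bytes;
    // ...
}
```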

@Lazin
Contributor

Lazin commented Aug 26, 2025

I just double-checked and it looks like I was wrong about flushing (the one at the top level, which is needed to support replacement of elements). The delta_for "frame" iterator is quite heavy. It contains a copy of the write buffer. The write buffer is 128 bytes. And we have 12 columns. This is the main reason why the materializing iterator is so "heavy". The column store exposes individual columns, so cstore users can iterate on columns to avoid materializing every record. So another way to fix the performance issues is to operate at the column level.
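
For a loop like the benchmark's, which only reads size_bytes, that could look roughly like the sketch below; the column accessor name is hypothetical, the point is just that only a single column's iterators get materialized:

```cpp
// Hypothetical sketch: iterate a single column instead of materializing whole
// segment_meta records. size_bytes_column() is an assumed accessor name, not
// necessarily the real API.
const auto& col = store.size_bytes_column();
for (auto it = col.begin(), e_it = col.end(); it != e_it; ++it) {
    auto v = *it;
    // ...
}
```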

@WillemKauf
Contributor Author

So another way to fix the performance issues is to operate at the column level.

Should this PR be reworked to do that, or do you (and @dotnwat) think that the proposed changes are acceptable in their current form?

Happy to make any changes if there's a clear path here.

Contributor

@Lazin Lazin left a comment

The PR looks good.
