Conversation

Contributor

@nvartolomei commented Aug 21, 2025

ref https://redpandadata.atlassian.net/browse/INC-923

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

Bug Fixes

  • datalake: handle corrupted records by writing them to DLQ table.

@nvartolomei requested review from andrwng and bharathv and removed the request for andrwng (August 21, 2025 15:05)
@nvartolomei requested a review from andrwng (August 21, 2025 15:05)
@nvartolomei force-pushed the nv/datalake-dlq-corrupted-records branch from 6c68128 to 0d6313c (August 21, 2025 17:50)
@@ -609,4 +639,42 @@ record_multiplexer::handle_invalid_record(
co_return std::nullopt;
}
}

ss::future<result<void, writer_error>>
record_multiplexer::handle_corrupted_record(
Contributor
IMO this would be less surprising for callers / more accurate if it were called handle_corrupted_batch(), given we're treating the rest of the batch as corrupted, not just one record

Comment on lines +147 to +149
_log.warn,
"Error reading record from batch: {} at index {}. Err: {}",
batch.header(),
Contributor

nit: probably warrants an "error" log given the rarity of this situation.

translation_probe::invalid_record_cause::corrupted_record,
o,
std::nullopt,
data_copy.share(0, data_copy.size_bytes()),
Contributor

I'm wondering if we could keep the data payload only for the first offset.

In the particular case where this crashed, the batch was > 100 MiB, seems wasteful to copy it over so many times.

3 participants