Iceberg value_schema_latest mode #1068
✅ Deploy Preview for redpanda-docs-preview ready!
PR Change Summary
Enhanced Iceberg integration documentation in Redpanda with a focus on the new
@@ -43,7 +43,7 @@ endif::[]
 {"user_id": 2324, "event_type": "BUTTON_CLICK", "ts": "2024-11-25T20:23:59.380Z"}
 ----

-=== Topic with schema (`value_schema_id_prefix` mode)
+=== Topic with schema (`value_schema_id_prefix` or `value_schema_latest` mode)
Do we need a separate section for value_schema_latest? See https://deploy-preview-1068--redpanda-docs-preview.netlify.app/current/manage/iceberg/query-iceberg-topics/#topic-with-schema-value_schema_id_prefix-or-value_schema_latest-mode
Yes, because rpk only produces using the Schema Registry wire format, and the other mode is how to do it without the wire format.
Discussed with Tyler last week and agreed that a new section for value_schema_latest would be a nice-to-have for later, if we want to demonstrate producing to a topic without using rpk.
LGTM, a couple of small suggestions.
=== value_schema_latest

Creates an Iceberg table whose structure matches the latest schema registered for the subject in the Schema Registry. You must register a schema in the xref:manage:schema-reg/schema-reg-overview.adoc[Schema Registry]. Unlike the `value_schema_id_prefix` mode, `value_schema_latest` does not require that producers use the wire format.
The latest schema is cached periodically. The cache period is defined by the cluster config iceberg_latest_schema_cache_ttl_ms, which defaults to 5 minutes.
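To make the caching comment concrete, here is a minimal Python sketch of a time-to-live cache for the latest schema. It is not Redpanda's implementation: the class name `LatestSchemaCache`, the `fetch_latest` callback, and the injectable `clock` are all hypothetical, and only the 5-minute default (300,000 ms) comes from the comment above.

```python
import time
from typing import Callable, Dict, Tuple


class LatestSchemaCache:
    """Hypothetical TTL cache: re-fetch a subject's latest schema only after the TTL expires."""

    def __init__(
        self,
        fetch_latest: Callable[[str], str],  # hypothetical lookup against a schema registry
        ttl_ms: int = 300_000,  # 5 minutes, the default mentioned in the review comment
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch_latest
        self._ttl_s = ttl_ms / 1000.0
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, subject: str) -> str:
        now = self._clock()
        entry = self._entries.get(subject)
        # Serve the cached schema while it is younger than the TTL.
        if entry is not None and now - entry[0] < self._ttl_s:
            return entry[1]
        # Otherwise fetch again and remember when the fetch happened.
        schema = self._fetch(subject)
        self._entries[subject] = (now, schema)
        return schema
```

Under this model, a schema registered shortly after a fetch would not be observed until the cached entry expires, which is presumably why the TTL is worth documenting.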
We don't have this config in our docs yet - we'll have to re-run our config script and double check that it gets pulled in.
[[override-value-schema-latest-default]]
=== Override `value_schema_latest` default

In `value_schema_latest` mode, only the string `value_schema_latest` is required in the property value. This sets `value_schema_latest` mode to its default behavior, which derives the subject for the topic using xref:manage:schema-reg/schema-id-validation.adoc#set-subject-name-strategy-per-topic[TopicNameStrategy]. For Protobuf data, the default behavior also deserializes records using the first message within the corresponding Protobuf schema in the Schema Registry.
Worthwhile to give an example of TopicNameStrategy: if your topic is named foo, the schema is looked up in foo-value.
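The example the reviewer asks for can be sketched in a few lines of Python. The function name `subject_for_topic` is hypothetical; the naming rule itself (topic name plus a `-value` or `-key` suffix) is the standard TopicNameStrategy convention.

```python
def subject_for_topic(topic: str, is_key: bool = False) -> str:
    """Derive the Schema Registry subject for a topic under TopicNameStrategy."""
    suffix = "key" if is_key else "value"
    return f"{topic}-{suffix}"


# A topic named "foo" resolves its value schema under the subject "foo-value".
print(subject_for_topic("foo"))
```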
@@ -76,8 +76,7 @@ rpk registry schema create ClickEvent-value --schema path/to/schema.avsc --type
 echo '"key1" {"user_id":2324,"event_type":"BUTTON_CLICK","ts":"2024-11-25T20:23:59.380Z"}' | rpk topic produce ClickEvent --format='%k %v\n' --schema-id=topic
 ----
 +
-The `value_schema_id_prefix` requires that you produce to a topic using the Schema Registry wire format, which includes the magic byte and schema ID in the prefix of the message payload. This allows Redpanda to identify the correct schema version in the Schema Registry for a record. See the https://www.redpanda.com/blog/schema-registry-kafka-streaming#how-does-serialization-work-with-schema-registry-in-kafka[Understanding Apache Kafka Schema Registry^] blog post to learn more.
+The `value_schema_id_prefix` mode requires that you produce to a topic using the Schema Registry wire format, which includes the magic byte and schema ID in the prefix of the message payload. This allows Redpanda to identify the correct schema version in the Schema Registry for a record.
link to examples like in the modes doc?
Added link to new section on wire format
Redpanda Schema Registry uses the default port 8081.

== Serialization and deserialization
@rockwotj @mattschumpert Does this subheading make sense or does it need to specifically mention the wire format?
Calling it the wire format makes sense, because you can serialize/deserialize without it by having another mechanism to map a topic record to a schema: static mapping of topic to latest schema in your producer/consumer, or communicating the schema ID using some other out-of-band mechanism (message header, control messages, etc.).
Generally this is the "ecosystem standard" way of doing it.
The wire format is a sequence of bytes consisting of the following:

. The "magic byte," a single byte that always contains the value of 0.
. A four-byte integer containing the schema ID.
Technically, for Protobuf there is additionally a series of varints encoding which Protobuf message in the Protobuf schema was used. I don't feel strongly about whether we need to call that out, however.
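The two-part prefix described in the diff can be sketched in Python as follows. This is an illustration, not Redpanda or rpk code: the function names are hypothetical, and big-endian byte order for the schema ID is an assumption taken from the common Schema Registry client convention rather than from this PR. The extra Protobuf message-index bytes mentioned in the comment above are deliberately not handled.

```python
import struct
from typing import Tuple

MAGIC_BYTE = 0  # a single byte that always contains the value 0
PREFIX_LEN = 5  # 1 magic byte + 4-byte schema ID


def encode_wire_format(schema_id: int, payload: bytes) -> bytes:
    """Prepend the magic byte and schema ID to an already-serialized payload."""
    # ">bI": big-endian (assumed), one signed byte, one unsigned 32-bit integer.
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload


def decode_wire_format(message: bytes) -> Tuple[int, bytes]:
    """Split a message into (schema_id, payload), validating the magic byte."""
    if len(message) < PREFIX_LEN:
        raise ValueError("message is too short to carry the wire-format prefix")
    magic, schema_id = struct.unpack(">bI", message[:PREFIX_LEN])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[PREFIX_LEN:]
```

A consumer that knows only the topic can use the decoded schema ID to look up the exact schema version for each record, which is what `value_schema_id_prefix` mode relies on.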
Creates an Iceberg table whose structure matches the Redpanda schema for the topic, with columns corresponding to each field. You must register a schema in the xref:manage:schema-reg/schema-reg-overview.adoc[Schema Registry] and producers must write to the topic using the Schema Registry wire format.

In the xref:manage:schema-reg/schema-reg-overview.adoc#serialization-and-deserialization[Schema Registry wire format], a "magic byte" and schema ID are embedded in the message payload header. Producers to the topic must use the wire format in the serialization process so Redpanda can determine the schema used for each record, use the schema to define the Iceberg table, and store the topic values in the corresponding table columns.
no def/link for "magic byte"?
)
----

Use `key_value` mode if the topic data is in JSON or if you can use the Iceberg data in its semi-structured format.
I had to read this sentence 3-4 times, and am not 100% clear on its meaning.
Use key_value mode if the topic data is in JSON, or if you can, use the Iceberg data in its semi-structured format.
Use key_value mode if the topic data is in JSON, or the Iceberg data in its semi-structured format.
Rephrased
The wire format is a sequence of bytes consisting of the following:

. The "magic byte," a single byte that always contains the value of 0.
oh good--you defined it here. thx
Nice job Kat.
Walkthrough
This update introduces a new documentation page detailing the supported Iceberg integration modes in Redpanda, updates navigation and cross-references to include this new content, and refines existing documentation to clarify the configuration and schema translation for Iceberg-enabled topics. The release notes and topic property references are expanded to enumerate new features and modes, including support for a new
Changes
Sequence Diagram(s)
sequenceDiagram
participant User
participant Redpanda
participant SchemaRegistry
User->>Redpanda: Create or alter topic with redpanda.iceberg.mode
alt value_schema_id_prefix mode
User->>SchemaRegistry: Register schema (if needed)
User->>Redpanda: Produce message with Schema Registry wire format
Redpanda->>Redpanda: Parse message using schema ID from header
Redpanda->>Iceberg: Map fields to table columns
else value_schema_latest mode
User->>SchemaRegistry: Register schema (if needed)
User->>Redpanda: Produce message (no wire format required)
Redpanda->>SchemaRegistry: Fetch latest schema for subject
Redpanda->>Iceberg: Map fields to table columns
else key_value mode
User->>Redpanda: Produce message
Redpanda->>Iceberg: Store key and value as columns
else disabled mode
User->>Redpanda: Produce message
Redpanda->>Iceberg: Iceberg integration disabled
end
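The branches of the diagram can be summarized as a small lookup, sketched here in Python. The function name `schema_source` and the description strings are hypothetical; only the four mode names and what each branch does come from the diagram.

```python
def schema_source(mode: str) -> str:
    """Describe where the table schema comes from for a redpanda.iceberg.mode value."""
    sources = {
        # Schema ID is read from the wire-format prefix of each record.
        "value_schema_id_prefix": "schema ID embedded in each record's wire-format prefix",
        # No prefix is needed; the latest schema for the subject is fetched.
        "value_schema_latest": "latest schema registered for the topic's subject",
        # No schema lookup; the key and value are stored as columns.
        "key_value": "none (key and value stored as columns)",
        # Iceberg integration is off for the topic.
        "disabled": "none (Iceberg integration disabled)",
    }
    if mode not in sources:
        raise ValueError(f"unknown Iceberg mode: {mode!r}")
    return sources[mode]
```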
Actionable comments posted: 5
🧹 Nitpick comments (11)
modules/manage/partials/iceberg/query-iceberg-topics.adoc (1)
4-5: Simplify table naming description.
The sentence "Redpanda generates an Iceberg table that has the same name as the topic name." is wordy and repetitive. Consider refactoring to:
Redpanda generates an Iceberg table with the same name as the topic.
This improves readability.
modules/manage/partials/iceberg/about-iceberg-topics.adoc (2)
121-122: Link new modes to the detailed mode guide.
You’ve added value_schema_id_prefix and value_schema_latest modes here, but these entries lack cross-references to the more detailed configuration and schema-translation guidance on the new “Choose an Iceberg Mode” page. Consider xref-linking each mode name to that page (e.g., xref:manage/iceberg/choose-iceberg-mode.adoc[value_schema_latest]).
139-139: Include link to Schema Registry doc.
The step to register a schema is clear, but you may want to xref the exact Schema Registry API or UI page (e.g., xref:manage:schema-reg/schema-reg-overview.adoc[Schema Registry wire format]) so users know where to go next.
modules/manage/pages/schema-reg/schema-reg-overview.adoc (4)
7-7: Consider relocating this paragraph.
The new sentence on message exchange sits just above the design overview. It might fit more naturally under the “Serialization format” section to maintain topical flow.
36-36: Clarify default port note.
You’ve added “Redpanda Schema Registry uses the default port 8081.” To highlight this, consider wrapping it in an AsciiDoc [NOTE] block for greater visibility.
50-56: Unify conditional serialization blocks.
The non-cloud and cloud variants for the serializer description are identical except for minor naming differences. Consider merging them into one block using xref macros or a single conditional, to reduce duplication.
60-61: Use precise terminology for prefixing.
Instead of “pads the beginning of the message,” you may want to say “prepends the magic byte and schema ID to the message payload” to avoid ambiguity.
modules/manage/pages/iceberg/choose-iceberg-mode.adoc (4)
3-3: Trim page categories.
You’ve listed six categories; consider narrowing this to the most relevant (e.g., Iceberg and Integration) to avoid over-categorization.
36-37: Clarify “message payload header.”
There’s no separate header wrapper; this is simply prefixed data. Consider rephrasing to “embedded at the start of the message payload” for accuracy.
42-43: Link the TTL configuration.
You mention iceberg_latest_schema_cache_ttl_ms; xref the cluster property reference (e.g., xref:reference/cluster-properties.adoc#iceberg_latest_schema_cache_ttl_ms) so users can find details on adjusting this TTL.
67-73: Merge override blocks for clarity.
The ifndef::env-cloud[] and ifdef::env-cloud[] sections are identical. Combining them, or moving the shared content outside the conditional, would simplify maintenance.
⛔ Files ignored due to path filters (1)
modules/shared/images/schema-registry-wire-format.png is excluded by !**/*.png
📒 Files selected for processing (8)
modules/ROOT/nav.adoc (1 hunks)
modules/get-started/pages/release-notes/redpanda.adoc (1 hunks)
modules/manage/pages/iceberg/choose-iceberg-mode.adoc (1 hunks)
modules/manage/pages/iceberg/query-iceberg-topics.adoc (1 hunks)
modules/manage/pages/schema-reg/schema-reg-overview.adoc (2 hunks)
modules/manage/partials/iceberg/about-iceberg-topics.adoc (3 hunks)
modules/manage/partials/iceberg/query-iceberg-topics.adoc (2 hunks)
modules/reference/pages/properties/topic-properties.adoc (4 hunks)
🔇 Additional comments (10)
modules/ROOT/nav.adoc (1)
184-184: Added Iceberg mode navigation entry looks correct.
The new Choose Iceberg Mode page link is properly placed under the Iceberg section and follows the existing indentation and naming conventions.
modules/manage/pages/iceberg/query-iceberg-topics.adoc (1)
10-10: Updated cross-reference to the new Iceberg mode page is accurate.
The paragraph now points to the choose-iceberg-mode.adoc page for mode details, which improves discoverability. The removal of the redundant phrase enhances clarity.
modules/manage/partials/iceberg/query-iceberg-topics.adoc (1)
79-80: ✅ Verification successful
Verify anchor for the Schema Registry wire format link.
Please confirm that the target anchor #wire-format exists in modules/manage/pages/schema-reg/schema-reg-overview.adoc. If the actual anchor differs (e.g., #serialization-and-deserialization), update the xref accordingly.
🏁 Script executed:
#!/bin/bash
# Verify the anchor for wire format in the Schema Registry overview
grep -R "wire-format" modules/manage/pages/schema-reg/schema-reg-overview.adoc
Length of output: 159
Continuing verification to locate the actual section anchor:
🏁 Script executed:
#!/bin/bash
# List all explicit Asciidoc anchors
grep -R "\[\[" -n modules/manage/pages/schema-reg/schema-reg-overview.adoc
# List all section headings
grep -R "^=+" -n modules/manage/pages/schema-reg/schema-reg-overview.adoc
# Search for serialization-related headings
grep -R "[sS]erialization" -n modules/manage/pages/schema-reg/schema-reg-overview.adoc
# Search for any occurrence of “format”
grep -R "format" -n modules/manage/pages/schema-reg/schema-reg-overview.adoc
Length of output: 5430
Anchor Confirmed – No Change Required
The == Wire format heading in modules/manage/pages/schema-reg/schema-reg-overview.adoc (line 38) auto-generates the wire-format anchor. The existing xref:manage:schema-reg/schema-reg-overview.adoc#wire-format[Schema Registry wire format] is therefore correct.
modules/manage/partials/iceberg/about-iceberg-topics.adoc (3)
22-22: Verify the manifest file format.
You’ve streamlined the manifest files section to state they are in JSON format. Please confirm that Redpanda’s Iceberg integration indeed emits JSON manifest files (the Iceberg spec defaults to Avro manifests).
136-137: Good addition: cross-reference to the new guide.
Linking out to choose-iceberg-mode.adoc here helps users find the in-depth mode explanations.
166-166: Validate the schema-translation cross-reference.
Ensure that the anchor #schema-types-translation exists in the target page (choose-iceberg-mode.adoc) so the link resolves correctly.
modules/manage/pages/schema-reg/schema-reg-overview.adoc (1)
38-47: Great addition of the wire-format section.
This new “Wire format” section clearly defines the magic byte and schema ID prefix. It fills the previous documentation gap regarding message framing.
modules/manage/pages/iceberg/choose-iceberg-mode.adoc (3)
1-5: File header looks good.
The title, description, and single-source tag are correctly set for this new page.
14-16: Clear introduction.
The link to the about-iceberg-topics page and the property description provides good context.
52-56: Verify CLI syntax.
Please confirm that the --topic-config= and --set flags shown here match the latest rpk version’s syntax, as these occasionally change between releases.
Iceberg-enabled topics now support the following:

- xref:manage:iceberg/about-iceberg-topics.adoc#use-custom-partitioning[Custom partitioning] for improved query performance.
- Snapshot expiry.
- xref:manage:iceberg/about-iceberg-topics.adoc#manage-dead-letter-queue[Dead-letter queue] for invalid records.
🛠️ Refactor suggestion
Link snapshot expiry to its documentation anchor.
The bullet for Snapshot expiry is missing a cross-reference. Please add an xref to the corresponding section in about-iceberg-topics.adoc (e.g., #snapshot-expiry) for consistency with the other bullets.
| sfixed64 | int
| string | string
Correct Protobuf fixed‑width mapping.
Currently sfixed64 is mapped to int; it should map to long.
-| sfixed64 | int
+| sfixed64 | long
| string | string
| record | struct
| array | list
| maps | list
Fix Avro map mapping.
The table lists Avro maps → Iceberg list, but Iceberg supports native map types. Please update to:
-| maps | list
+| maps | map
@@ -7,7 +7,7 @@
 include::shared:partial$enterprise-license.adoc[]
 ====

-When you access Iceberg topics from a data lakehouse or other Iceberg-compatible tools, how you consume the data depends on the topic xref:manage:iceberg/about-iceberg-topics.adoc#enable-iceberg-integration[Iceberg mode] and whether you've registered a schema for the topic in the xref:manage:schema-reg/schema-reg-overview.adoc[Redpanda Schema Registry]. In either mode, you do not need to rely on complex ETL jobs or pipelines to access real-time data from Redpanda.
+When you access Iceberg topics from a data lakehouse or other Iceberg-compatible tools, how you consume the data depends on the topic xref:manage:iceberg/choose-iceberg-mode.adoc[Iceberg mode] and whether you've registered a schema for the topic in the xref:manage:schema-reg/schema-reg-overview.adoc[Redpanda Schema Registry]. You do not need to rely on complex ETL jobs or pipelines to access real-time data from Redpanda.
@kbatuigas this page has examples using the other mode but no mention of "value_schema_latest" mode at all. Even if we don't have an example it's probably worth a mention how querying works in this mode (essentially the same as with value_schema_id_prefix)
Added note
Description
PR to add to Cloud docs: redpanda-data/cloud-docs#260
This pull request introduces significant enhancements to Iceberg integration in Redpanda, including new documentation on supported Iceberg modes, updates to existing Iceberg-related pages, and improvements to the Schema Registry documentation. The changes aim to provide clearer guidance on configuring and using Iceberg modes, enhance usability, and ensure consistency across documentation.
Iceberg Integration Enhancements:
- choose-iceberg-mode.adoc, detailing supported Iceberg modes (key_value, value_schema_id_prefix, value_schema_latest, and disabled), their configurations, and how they translate to table formats. This page provides examples and explains schema translation for Avro and Protobuf data.
- nav.adoc to include a link to the new "Choose Iceberg Mode" page.
Documentation Updates:
- redpanda.adoc to list new features for Iceberg-enabled topics, such as custom partitioning, snapshot expiry, dead-letter queues, schema evolution, and structured Iceberg tables for Avro/Protobuf data without Schema Registry wire format.
- about-iceberg-topics.adoc to reflect changes in supported Iceberg modes and removed outdated details about custom partitioning. Added a cross-reference to the new "Choose Iceberg Mode" page. [1] [2] [3]
- query-iceberg-topics.adoc to reference the new "Choose Iceberg Mode" page for clarity on consuming Iceberg topics.
Schema Registry Documentation:
- schema-reg-overview.adoc with a new section on serialization and deserialization, explaining the Schema Registry wire format and its role in message processing.
Resolves https://redpandadata.atlassian.net/browse/
Review deadline: 10 April
Page previews
Choose an Iceberg Mode
Checks
Summary by CodeRabbit
New Features
- value_schema_latest, enabling Iceberg table creation from the latest schema in the Schema Registry without requiring the wire format.
Documentation