-
-
Notifications
You must be signed in to change notification settings - Fork 2.4k
fix: Fix incorrect native Iceberg scan from tables with renamed/dropped columns/fields #23713
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Fix incorrect native Iceberg scan from tables with renamed/dropped columns/fields #23713
Conversation
predicate: None, | ||
}; | ||
|
||
let schema_before_selection = if incoming_schema.len() == final_output_schema.len() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently the column selectors are applied after attaching the row_index
column. This is prone to name collisions as the underlying file can contain a column called "row_index" that gets renamed to a different column.
This PR changes the ordering of operations to always apply the column selectors first before any other operation that adds/removes columns (e.g. row_index) to resolve this problem. As a result this code here is no longer needed.
@@ -586,4 +597,194 @@ impl ColumnSelectorBuilder { | |||
|
|||
mismatch_err("") | |||
} | |||
|
|||
/// Adds transforms on top of the `input_selector` if necessary. | |||
pub fn attach_iceberg_transforms( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the existing attach_transforms
, but builds the transform based the (physical_id: u32, dtype: IcebergColumnType)
instead of (name: PlSmallStr, dtype: DataType)
.
&mut incoming_schema.iter_names().map(|x| x.as_str()), | ||
) | ||
#[derive(Debug, Clone)] | ||
pub enum ForbidExtraColumns { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initialize and use a ForbidExtraColumns
instead of passing the ExtraColumnPolicy
- for Iceberg this will properly check for extra columns using the physical ID.
@@ -604,58 +605,14 @@ impl ReaderStarter { | |||
..Default::default() | |||
}; | |||
|
|||
let mut extra_ops_post = extra_ops_this_file; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code has been moved below to the spawned async fn start_reader_impl
to reduce blocking in this loop.
[Upstream Issue] | ||
PyIceberg writes NULL as empty lists into the Parquet file. | ||
* Issue on Polars repo - https://github.com/pola-rs/polars/issues/23715 | ||
* Issue on PyIceberg repo - https://github.com/apache/iceberg-python/issues/2246 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
/// Provides projections for columns that are sourced from the file. | ||
#[derive(Debug, Clone)] | ||
pub enum ProjectionBuilder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This builds Projection
per-file based on the schema in the file.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #23713 +/- ##
==========================================
+ Coverage 81.28% 81.37% +0.08%
==========================================
Files 1644 1650 +6
Lines 223249 224060 +811
Branches 2841 2851 +10
==========================================
+ Hits 181479 182338 +859
+ Misses 41071 41013 -58
- Partials 699 709 +10 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
5cdf788
to
70eb121
Compare
Iceberg complexity. :') |
…ed columns/fields (pola-rs#23713)
…ed columns/fields (pola-rs#23713)
Adds execution-level implementation to properly apply Iceberg column mappings.
Related prior work
Changes
Projection::(Plain|Mapped)
Introduces a new
Projection
enum that is used instead of the existingprojected_file_schema: SchemaRef
. This encapsulates e.g. resolved Iceberg column transforms (ColumnSelector
), which can be retrieved viaget_mapped_projection_ref_*
. This interface is designed to also be used by the Parquet reader in the future to support filters with casting/renaming.This enum also replaces the existing
projection: SchemaRef
parameter inside ofBeginReadArgs
. Readers that do not indicate the newMAPPED_COLUMN_PROJECTION
capability are guaranteed to be passedProjection::Plain
and can use the innerSchemaRef
directly.missing_columns.rs
has been deleted - insertion of missing columns is now handled byColumnSelector::Constant
.Column Transform