Description
Sometimes we want to generate Parquet files to be read by tools that expect a specific Parquet schema, and at the moment this requires exporting to Arrow and then using PyArrow to cast to the desired format first. It would be great if we could write Parquet directly from Polars.
In particular, it would be useful to be able to specify the nullability of columns, and also of any types nested within other columns, like array elements. Currently Polars assumes everything is nullable. This causes problems when reading Parquet files in C#/.NET with ParquetSharp or Parquet.Net, for example, as non-nullable integers will be read as the `int` type, but nullable integers need to be read as `Nullable<int>`.
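For context, here is a minimal sketch of the current workaround described above: export to Arrow, cast to a schema with explicit nullability using PyArrow, then write with PyArrow. The column and file names are only illustrative.

```python
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq

df = pl.DataFrame({"id": [1, 2, 3]})

# Target Arrow schema with an explicitly non-nullable column.
target_schema = pa.schema([pa.field("id", pa.int64(), nullable=False)])

# Today: export to Arrow, cast to the desired schema, then write with PyArrow.
table = df.to_arrow().cast(target_schema)
pq.write_table(table, "ids.parquet")
```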
One potential solution would be to add a new `arrow_schema` parameter to `write_parquet` and `sink_parquet`, which can be `None` or an instance of `pyarrow.Schema`. There are some edge cases where the Parquet schema can't be fully constrained by the Arrow schema; for example, I think writing decimals as integer Parquet primitives rather than fixed-length byte arrays is one thing that can't be expressed in the Arrow schema. But PyArrow doesn't expose this level of customisation either, so I think that's fine.
If `write_parquet(use_pyarrow=True)` is used, the implementation can simply cast the Arrow table to the desired schema before writing it. This won't unnecessarily copy any buffers, and it will check that any non-nullable columns don't contain nulls. Handling the Rust `ParquetWriter` is a bit trickier. We could perhaps enforce that the schema matches the result of `schema_to_arrow_checked` apart from any nullability differences, and then validate that non-nullable columns do not contain nulls when writing.
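For the Rust writer path, the nullability validation could look something like this Python-level sketch; `validate_nullability` is a hypothetical helper for illustration, not existing code, and the real check would presumably live in Rust.

```python
import polars as pl
import pyarrow as pa

def validate_nullability(df: pl.DataFrame, arrow_schema: pa.Schema) -> None:
    """Hypothetical pre-write check: reject nulls in columns declared non-nullable."""
    for field in arrow_schema:
        if not field.nullable and df[field.name].null_count() > 0:
            raise ValueError(
                f"column {field.name!r} is declared non-nullable "
                "but contains null values"
            )
```

This would run once per write, after checking that the schema otherwise matches the frame, and before any data is handed to the writer.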
I'm happy to implement this, but want to check that this feature would be accepted first.