Description
Sometimes we want to generate Parquet files to be read by tools that expect a specific Parquet schema, and at the moment this requires exporting to Arrow and then using PyArrow to cast to the desired format first. It would be great if we could write Parquet directly from Polars.
In particular, it would be useful to be able to specify the nullability of columns, and also of any types nested within other columns, like array elements. Currently Polars assumes everything is nullable. This causes problems when reading Parquet files in C#/.NET with ParquetSharp or Parquet.Net, for example, as non-nullable integers will be read as the `int` type, but nullable integers need to be read as `Nullable<int>`.
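For context, here is a minimal sketch of the current workaround described above: export to Arrow, cast to a schema with explicit nullability using PyArrow, then write with PyArrow. The column and file names are only illustrative.

```python
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq

df = pl.DataFrame({"id": [1, 2, 3]})

# Target Arrow schema with an explicitly non-nullable column.
target_schema = pa.schema([pa.field("id", pa.int64(), nullable=False)])

# Today: export to Arrow, cast to the desired schema, then write with PyArrow.
table = df.to_arrow().cast(target_schema)
pq.write_table(table, "ids.parquet")
```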
One potential solution would be to add a new `arrow_schema` parameter to `write_parquet` and `sink_parquet`, which can be `None` or an instance of `pyarrow.Schema`. There are some edge cases where the Parquet schema can't be fully constrained by the Arrow schema; for example, I think writing decimals as integer Parquet primitives rather than fixed-length byte arrays is one thing that can't be expressed in the Arrow schema. But PyArrow doesn't expose this level of customisation either, so I think that's fine.
If `write_parquet(use_pyarrow=True)` is used, the implementation can simply cast the Arrow table to the desired schema before writing it. This won't unnecessarily copy any buffers, and it will check that any non-nullable columns don't contain nulls. Handling the Rust `ParquetWriter` is a bit trickier. We could perhaps enforce that the schema matches the result of `schema_to_arrow_checked` apart from any nullability differences, and then validate that non-nullable columns do not contain nulls when writing.
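For the Rust writer path, the nullability validation could look something like this Python-level sketch; `validate_nullability` is a hypothetical helper for illustration, not existing code, and the real check would presumably live in Rust.

```python
import polars as pl
import pyarrow as pa

def validate_nullability(df: pl.DataFrame, arrow_schema: pa.Schema) -> None:
    """Hypothetical pre-write check: reject nulls in columns declared non-nullable."""
    for field in arrow_schema:
        if not field.nullable and df[field.name].null_count() > 0:
            raise ValueError(
                f"column {field.name!r} is declared non-nullable "
                "but contains null values"
            )
```

This would run once per write, after checking that the schema otherwise matches the frame, and before any data is handed to the writer.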
I'm happy to implement this, but want to check that this feature would be accepted first.