Provide more control over the Parquet schema when writing to Parquet #17418

@adamreeve

Description

Sometimes we want to generate Parquet files to be read by tools that expect a specific Parquet schema. At the moment this requires exporting to Arrow and then using PyArrow to cast to the desired schema before writing. It would be great if we could write such files directly from Polars.

In particular, it would be useful to be able to specify the nullability of columns, and also of any types nested within other columns, like array elements. Currently Polars assumes everything is nullable. This causes problems when reading Parquet files in C#/.NET with ParquetSharp or Parquet.Net, for example: non-nullable integers are read as the int type, but nullable integers need to be read as Nullable<int>.

One potential solution would be to add a new arrow_schema parameter to write_parquet and sink_parquet, which can be None or an instance of pyarrow.Schema. There are some edge cases where the Parquet schema can't be fully constrained by the Arrow schema. For example, I think writing decimals with integer Parquet primitives rather than fixed-length byte arrays can't be expressed in the Arrow schema, but PyArrow doesn't expose this level of customisation either, so I think that's fine.

If write_parquet(use_pyarrow=True) is used, the implementation can simply cast the Arrow table to the desired schema before writing it. This won't unnecessarily copy any buffers and will check that non-nullable columns don't contain nulls. Handling the Rust ParquetWriter is a bit trickier. We could maybe enforce that the schema matches the result of schema_to_arrow_checked apart from any nullability differences, and then validate when writing that non-nullable columns do not contain nulls.

I'm happy to implement this, but want to check that this feature would be accepted first.

Labels: enhancement (new feature or an improvement of an existing feature), needs decision (awaiting decision by a maintainer)
