[Parquet] Concurrent writes with ArrowWriter.get_column_writers should parallelize across row groups #8115

@rok

Description

#8029 introduced ArrowWriter.get_column_writers to expose the Vec<ArrowColumnWriter> of the "in progress" ArrowRowGroupWriter. The goal was to let downstream libraries write columns and row groups concurrently. However, only one ArrowRowGroupWriter exists at a time, and all of its ArrowColumnWriters must complete before the next RowGroup can be serialized. Downstream code can work around this with locking, but that is not ideal. See apache/datafusion#16738 (comment).

We could:

  1. Have downstream users lock and serialize only one RowGroup at a time.
  2. Have ArrowWriter keep a Vec<ArrowRowGroupWriter> for all RowGroups currently being serialized.
  3. Expose the ArrowRowGroupWriterFactory of the active ArrowWriter.

Additionally, we should introduce a write_parquet_with_small_rg_size with encryption to test this code path sufficiently.
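
To make option 2 more concrete, here is a rough, std-only sketch of the shape it could take: several row groups are encoded at the same time (and the columns within each one too), and the finished row groups are then appended to the output in order. It does not use the parquet crate at all. RowGroupSketch, encode_column, encode_row_groups and append_in_order are hypothetical names made up for illustration, and the "encoding" is just a placeholder byte layout, not the Parquet format.

```rust
use std::thread;

// Hypothetical stand-in for a finished row group: one encoded buffer per column.
struct RowGroupSketch {
    index: usize,
    columns: Vec<Vec<u8>>,
}

// Placeholder "encoder": tags the buffer with (row group, column) and appends
// the values as little-endian bytes. Real column encoding is far more involved.
fn encode_column(rg: usize, col: usize, values: &[i32]) -> Vec<u8> {
    let mut out = vec![rg as u8, col as u8];
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

// Encode every row group on its own thread; inside each row group, encode
// every column on its own thread. No row group waits for another to finish.
fn encode_row_groups(batches: &[Vec<Vec<i32>>]) -> Vec<RowGroupSketch> {
    thread::scope(|s| {
        let handles: Vec<_> = batches
            .iter()
            .enumerate()
            .map(|(rg, cols)| {
                s.spawn(move || {
                    let columns: Vec<Vec<u8>> = thread::scope(|inner| {
                        let col_handles: Vec<_> = cols
                            .iter()
                            .enumerate()
                            .map(|(c, vals)| inner.spawn(move || encode_column(rg, c, vals)))
                            .collect();
                        col_handles.into_iter().map(|h| h.join().unwrap()).collect()
                    });
                    RowGroupSketch { index: rg, columns }
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

// Only this final step is sequential: row groups are appended in index order.
fn append_in_order(mut groups: Vec<RowGroupSketch>) -> Vec<u8> {
    groups.sort_by_key(|g| g.index);
    let mut file = Vec::new();
    for g in groups {
        for col in g.columns {
            file.extend(col);
        }
    }
    file
}
```

Calling encode_row_groups(&batches) and passing the result to append_in_order yields a single buffer with the row groups in their original order. The point is only that encoding can overlap across row groups while the append stays ordered; how this maps onto ArrowWriter internals is left open.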

Metadata

Assignees

No one assigned

    Labels

    enhancement: Any new improvement worthy of an entry in the changelog
