Iterating over parquet dataset in batches #47988

ikrommyd · 2025-10-28T20:00:43Z

Hello,

I would like to iterate over a parquet dataset in batches.
I get my parquet dataset like this

import pyarrow.parquet as pq

dataset = pq.ParquetDataset(source_path, filters=cat_dict[cat]["cat_filter"])

However, the pyarrow.parquet.ParquetDataset class doesn't seem to have a batched iteration method.
After briefly looking at its source code, I found that I can access the underlying pyarrow.dataset.Dataset through the _dataset attribute, and that it has a to_batches method.
So this is what I'm doing to iterate:

for batch in dataset._dataset.to_batches(filter=dataset._filter_expression):
    ...

That feels a bit hacky to me because I'm depending on internals like dataset._dataset and dataset._filter_expression. Is this the proper way to do this? Is there a better API that I couldn't find that users should be using?

Thanks in advance!

Answered by sidneymau

sidneymau · 2025-10-28T20:14:13Z

I believe the dataset module is the preferred way to do this:

import pyarrow.dataset as ds

dataset = ds.dataset(source_path)

for batch in dataset.to_batches(filter=cat_dict[cat]["cat_filter"]):
    ...

See https://arrow.apache.org/docs/python/dataset.html for some more docs

5 replies

ikrommyd Oct 28, 2025
Author

Yeah I had seen that. The problem I had encountered was this

In [11]: filters = [("pt", ">", -1.0)]

In [12]: dataset = pq.ParquetDataset("storage/NTuples/BBHto2G_M-125/nominal/", filters=filters)

In [13]: batch = next(dataset._dataset.to_batches(filter=dataset._filter_expression))

In [14]: type(batch)
Out[14]: pyarrow.lib.RecordBatch

In [15]: dataset = ds.dataset("storage/NTuples/BBHto2G_M-125/nominal/")

In [16]: type(dataset)
Out[16]: pyarrow._dataset.FileSystemDataset

In [17]: batch = next(dataset.to_batches(filter=filters))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[17], line 1
----> 1 batch = next(dataset.to_batches(filter=filters))

TypeError: Argument 'filter' has incorrect type (expected pyarrow._compute.Expression, got list)

ParquetDataset accepts a list for its filters argument, while to_batches expects an Expression for its filter argument.
Is there a public API to compile a list into an Expression? There definitely is a private one, since ParquetDataset can do it.

ikrommyd Oct 28, 2025
Author

Ah....looking at the source code, I found this:

arrow/python/pyarrow/parquet/core.py

Line 134 in 7623887

def filters_to_expression(filters):

It's not in the documentation, but it looks "public" enough, since it is accessible as pyarrow.parquet.filters_to_expression

ikrommyd Oct 28, 2025
Author

That looks like it solves my problem, and I can use "public" API only, but maybe filters_to_expression should be added to the API reference in that case?

sidneymau Oct 28, 2025

Yes, pq.filters_to_expression is the only way I am aware of for converting filters written in that way to compute expressions, and I have used it to do similar things. I think the more idiomatic solution would be to write the compute expression directly (e.g., filter = ds.field("pt") > -1.0), but I agree that pq.filters_to_expression is useful and ought to be documented.

ikrommyd Oct 28, 2025
Author

Great! Closing the discussion then. I will open an issue about adding filters_to_expression to the API reference. Thanks!
