Hello, I would like to iterate over a Parquet dataset in batches. I create the dataset like this:

    import pyarrow.parquet as pq
    dataset = pq.ParquetDataset(source_path, filters=cat_dict[cat]["cat_filter"])

However, the approach I ended up with is:

    for batch in dataset._dataset.to_batches(filter=dataset._filter_expression):
        ...

That feels a bit hacky to me because I'm depending on internals like `_dataset` and `_filter_expression`. Thanks in advance!
Replies: 1 comment 5 replies
I believe the `dataset` module is the preferred way to do this:

    import pyarrow.dataset as ds
    dataset = ds.dataset(source_path)
    for batch in dataset.to_batches(filter=cat_dict[cat]["cat_filter"]):
        ...

See https://arrow.apache.org/docs/python/dataset.html for some more docs.