Suggestions and problems about ArrowReaderBuilder (or ParquetRecordBatchStreamBuilder)
#4674
-
`ArrowReaderBuilder` reads and provides access to the `ParquetMetaData`, including the page index if you enable it? I would recommend checking out DataFusion's `ParquetExec`, which shows how these APIs can be used.

I'm not sure why you got this impression, but it is not true. If you provide a `RowSelection`, derived from the page index or otherwise, it will use this to elide IO and decode.

Note: I do hope to provide better APIs for interacting with the parquet statistics in the future (#4328), but I've not had sufficient bandwidth lately.
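To make the `RowSelection` idea concrete, here is a minimal, self-contained sketch of how page-level min/max statistics could be turned into alternating skip/select runs. `PageStats`, `Selector`, and `selection_from_pages` are hypothetical stand-ins written for this illustration, not the parquet crate's types; in the crate, the equivalent runs would be expressed with `RowSelector::skip` / `RowSelector::select` and handed to the builder via `with_row_selection`.

```rust
/// Hypothetical stand-in for one page's entry in a column's page index.
#[derive(Debug, Clone, Copy)]
struct PageStats {
    row_count: usize,
    min: i64,
    max: i64,
}

/// Hypothetical stand-in for a row selector: a run of rows to skip or to decode.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Selector {
    Skip(usize),
    Select(usize),
}

/// Build skip/select runs for the predicate `lo <= value <= hi`.
/// A page is selected when its [min, max] range overlaps [lo, hi];
/// adjacent runs of the same kind are merged.
fn selection_from_pages(pages: &[PageStats], lo: i64, hi: i64) -> Vec<Selector> {
    let mut out: Vec<Selector> = Vec::new();
    for p in pages {
        if p.row_count == 0 {
            continue;
        }
        let may_match = p.max >= lo && p.min <= hi;
        // Try to extend the previous run if it is of the same kind.
        let merged = match (out.last_mut(), may_match) {
            (Some(Selector::Select(n)), true) | (Some(Selector::Skip(n)), false) => {
                *n += p.row_count;
                true
            }
            _ => false,
        };
        if !merged {
            out.push(if may_match {
                Selector::Select(p.row_count)
            } else {
                Selector::Skip(p.row_count)
            });
        }
    }
    out
}
```

Pages that fall entirely inside a skip run never need to be fetched or decoded, which is where the IO savings come from.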
-
For my points 2 and 3, I have no more questions now.
-
1. Make `new_builder` public for more flexible operations.

   It's more flexible to allow users to pass `ParquetMetaData` manually. For example:

   - If we want to analyze `ParquetMetaData` first (for collecting stats, pruning row groups...), we can pass this `ParquetMetaData` to build a reader directly and avoid reading it twice.
   - If we want to prune row groups, we need to call `with_row_groups` on `ArrowReaderBuilder`. But we can only know which row groups to prune after reading the parquet metadata.

2. `ArrowReaderOptions` contains `page_index` but `ArrowReaderBuilder` doesn't use it.

   After reading the code I found that neither the sync nor the async `ParquetRecordBatchReader`s can use the page index to optimize IO.

3. The sync and async `ArrowReader`s have different read options, and the APIs are quite confusing.

   If we create a reader with `ArrowReaderBuilder`, we pass `ArrowReaderOptions` to it. However, if we want to create a sync reader, `ArrowReaderOptions` is converted to `ReadOptions` (https://github.com/apache/arrow-rs/blob/master/parquet/src/file/serialized_reader.rs#L172).

   I think there are some problems:

   - Only `page_index` from `ArrowReaderOptions` is used to construct `ReadOptions`. Other options like `ReadGroupPredicate` do not exist, so we cannot prune row groups by passing predicates if we create the reader with `ArrowReaderBuilder`.
   - The async reader does not use `ReadOptions`, which may cause the async reader to miss some optimizations.

   I think we should unify them and expose more reasonable APIs.
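The workflow from point 1 (read the metadata once, decide which row groups to keep, then build the reader) can be sketched as below. This is a self-contained illustration under stated assumptions: `RowGroupMeta` and `select_row_groups` are hypothetical names, and the closure shape (row-group metadata plus its index, returning `bool`) is only assumed to resemble a row-group predicate. The resulting indices are the kind of value one would pass to `with_row_groups`.

```rust
/// Hypothetical stand-in for the statistics of one row group,
/// as they might be collected from already-loaded file metadata.
#[derive(Debug, Clone)]
struct RowGroupMeta {
    num_rows: usize,
    min: i64,
    max: i64,
}

/// Return the indices of the row groups for which `keep` returns true.
/// The predicate receives the row group's metadata and its index,
/// so pruning can be driven by statistics, by position, or both.
fn select_row_groups<P>(groups: &[RowGroupMeta], mut keep: P) -> Vec<usize>
where
    P: FnMut(&RowGroupMeta, usize) -> bool,
{
    let mut selected = Vec::new();
    for (idx, group) in groups.iter().enumerate() {
        if keep(group, idx) {
            selected.push(idx);
        }
    }
    selected
}
```

For a filter such as `value >= 150`, the predicate would be `|g, _| g.max >= 150`. Because the metadata is inspected before the reader is built, the same loaded metadata could in principle serve both the pruning step and the reader construction, which is the point of the request above.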