@Evolving public interface ParquetHandler
| Modifier and Type | Method and Description |
|---|---|
CloseableIterator<ColumnarBatch> |
readParquetFiles(CloseableIterator<FileStatus> fileIter,
StructType physicalSchema,
java.util.Optional<Predicate> predicate)
Read the Parquet format files at the given locations and return the data as a
ColumnarBatch with the columns requested by physicalSchema. |
CloseableIterator<ColumnarBatch> readParquetFiles(CloseableIterator<FileStatus> fileIter, StructType physicalSchema, java.util.Optional<Predicate> predicate) throws java.io.IOException
ColumnarBatch with the columns requested by physicalSchema.
If physicalSchema has a StructField with column name
StructField.METADATA_ROW_INDEX_COLUMN_NAME and the field is a metadata column
StructField.isMetadataColumn() the column must be populated with the file row index.
How does a column in physicalSchema match to the column in the Parquet file?
If the StructField has a field id in the metadata with key `parquet.field.id`
the column is attempted to match by id. If the column is not found by id, the column is
matched by name. When trying to find the column in Parquet by name,
first case-sensitive match is used. If not found then a case-insensitive match is attempted.
fileIter - Iterator of files to read data from.physicalSchema - Select list of columns to read from the Parquet file.predicate - Optional predicate which the Parquet reader can optionally use to prune
rows that don't satisfy the predicate. Because pruning is optional and
may be incomplete, caller is still responsible apply the predicate on
the data returned by this method.ColumnarBatchs containing the data in columnar format.
It is the responsibility of the caller to close the iterator. The data returned is in the
same as the order of files given in scanFileIter.java.io.IOException - if an I/O error occurs during the read.