package parquet
Type Members
- case class ParquetColumn(sparkType: DataType, descriptor: Option[ColumnDescriptor], repetitionLevel: Int, definitionLevel: Int, required: Boolean, path: Seq[String], children: Seq[ParquetColumn]) extends Product with Serializable
Rich information for a Parquet column together with its SparkSQL type.
- final class ParquetDictionary extends Dictionary
- class ParquetFileFormat extends FileFormat with DataSourceRegister with Logging with Serializable
- class ParquetFilters extends AnyRef
Some utility functions for converting Spark data source filters to Parquet filters.
- class ParquetFooterReader extends AnyRef
ParquetFooterReader is a utility class that encapsulates helper methods for reading Parquet file footers.
- class ParquetOptions extends FileSourceOptions with Logging
Options for the Parquet data source.
- class ParquetOutputWriter extends OutputWriter
- class ParquetReadSupport extends ReadSupport[InternalRow] with Logging
A Parquet ReadSupport implementation for reading Parquet records as Catalyst InternalRows.
The API interface of ReadSupport is a little over-complicated for historical reasons. In older versions of parquet-mr (say, 1.6.0rc3 and prior), ReadSupport needed to be instantiated and initialized twice, on both the driver side and the executor side: the init() method handled driver-side initialization, while prepareForRead() handled the executor side. Starting from parquet-mr 1.6.0, however, this is no longer the case, and ReadSupport is instantiated and initialized only on the executor side. So, in principle, it would now be fine to combine the two methods into a single initialization method; the only apparent reason to keep both is parquet-mr API backwards compatibility.
For this reason, we no longer rely on ReadContext to pass the requested schema from init() to prepareForRead(), but use a private var for simplicity.
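The private-var pattern described above might look roughly like the sketch below, written against parquet-mr's public ReadSupport API. SketchReadSupport and its body are hypothetical (they are not Spark's actual implementation), and the RecordMaterializer construction is elided.

```scala
// Sketch only: illustrates carrying the requested schema from init() to
// prepareForRead() via a private var. With parquet-mr >= 1.6.0 both methods
// run on the same executor-side instance, so this is safe.
import java.util.{Map => JMap}
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.api.{InitContext, ReadSupport}
import org.apache.parquet.hadoop.api.ReadSupport.ReadContext
import org.apache.parquet.io.api.RecordMaterializer
import org.apache.parquet.schema.MessageType

class SketchReadSupport[T] extends ReadSupport[T] {
  // Set by init(), read by prepareForRead(); no ReadContext round-trip needed.
  private var requestedSchema: MessageType = _

  override def init(context: InitContext): ReadContext = {
    requestedSchema = context.getFileSchema // a real reader would prune this
    new ReadContext(requestedSchema)
  }

  override def prepareForRead(
      conf: Configuration,
      keyValueMetaData: JMap[String, String],
      fileSchema: MessageType,
      readContext: ReadContext): RecordMaterializer[T] = {
    // requestedSchema set by init() is directly visible here.
    ??? // build a RecordMaterializer[T] for requestedSchema (elided)
  }
}
```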
- class ParquetToSparkSchemaConverter extends AnyRef
This converter class is used to convert a Parquet MessageType to a Spark SQL StructType (via the convert method) as well as to a ParquetColumn (via the convertParquetColumn method). The latter contains richer information about the Parquet type, including its associated repetition and definition levels, column path, column descriptor, etc. Parquet format backwards-compatibility rules are respected when converting Parquet MessageType schemas.
- See also
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
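A conversion along the lines this entry describes might look as follows. Note that ParquetToSparkSchemaConverter is a private Spark internal, so the constructor and convert signature here are assumptions based on recent Spark versions; MessageTypeParser is from parquet-mr.

```scala
// Sketch only: assumes Spark's internal ParquetToSparkSchemaConverter API.
import org.apache.parquet.schema.MessageTypeParser
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter

// Parse a Parquet schema from its textual message-type form.
val parquetSchema = MessageTypeParser.parseMessageType(
  """message spark_schema {
    |  required int32 id;
    |  optional binary name (UTF8);
    |}""".stripMargin)

val converter = new ParquetToSparkSchemaConverter(new SQLConf)
// convert returns a Spark SQL StructType mirroring the Parquet schema,
// applying the backwards-compatibility rules mentioned above.
val structType = converter.convert(parquetSchema)
```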
- trait ParquetVectorUpdater extends AnyRef
- class ParquetVectorUpdaterFactory extends AnyRef
- class ParquetWriteSupport extends WriteSupport[InternalRow] with Logging
A Parquet WriteSupport implementation that writes Catalyst InternalRows as Parquet messages. This class can write Parquet data in two modes:
- Standard mode: Parquet data are written in the standard format defined in the parquet-format spec.
- Legacy mode: Parquet data are written in a legacy format compatible with Spark 1.4 and prior.
This behavior is controlled by the SQL option spark.sql.parquet.writeLegacyFormat. The value of this option is propagated to this class by the init() method and its Hadoop configuration argument.
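From user code, the mode is selected through the public SQL option named above; the option name is real Spark configuration, while the app name and output path below are illustrative.

```scala
// spark.sql.parquet.writeLegacyFormat is a public Spark SQL option; the
// session name and output path here are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("legacy-parquet-demo").getOrCreate()

// Legacy mode: write decimals/arrays/maps in the Spark 1.4-compatible layout,
// e.g. for consumers that predate the standardized parquet-format layout.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
spark.range(10).toDF("id").write.parquet("/tmp/legacy_parquet_demo")

// Standard mode (the default): layout follows the parquet-format spec.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "false")
```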
- class SparkToParquetSchemaConverter extends AnyRef
This converter class is used to convert a Spark SQL StructType to a Parquet MessageType.
- abstract class SpecificParquetRecordReaderBase[T] extends RecordReader[Void, T]
Base class for custom Parquet RecordReaders that materialize directly to T. This class handles computing row groups, filtering them, setting up the column readers, etc. It is heavily based on parquet-mr's RecordReader. TODO: move this to the parquet-mr project. There are performance benefits to doing it this way, albeit at a higher implementation cost. This base class is reusable.
- class VectorizedColumnReader extends AnyRef
Decoder to return values from a single column.
- class VectorizedDeltaBinaryPackedReader extends VectorizedReaderBase
An implementation of the Parquet DELTA_BINARY_PACKED decoder that supports the vectorized interface. DELTA_BINARY_PACKED is a delta encoding for integer and long types that stores each value as a delta from the previous value; the delta values are themselves bit-packed. It is similar to RLE but more effective when the encoded column has large variation in its values.
DELTA_BINARY_PACKED is the default encoding for integer and long columns in Parquet V2.
Supported types: INT32, INT64
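The delta idea behind this encoding can be sketched in a few lines. This is a conceptual illustration only: the real DELTA_BINARY_PACKED format additionally bit-packs the deltas in blocks of miniblocks with per-miniblock bit widths, none of which is modeled here.

```scala
// Conceptual sketch of delta encoding, the core of DELTA_BINARY_PACKED.
// The real format's block/miniblock structure and bit-packing are omitted.
object DeltaSketch {
  // Encode a non-empty sequence as (firstValue, deltas between neighbors).
  def encode(values: Seq[Long]): (Long, Seq[Long]) =
    (values.head, values.zip(values.tail).map { case (a, b) => b - a })

  // Decode by cumulatively summing the deltas back onto the first value.
  def decode(first: Long, deltas: Seq[Long]): Seq[Long] =
    deltas.scanLeft(first)(_ + _)
}
```

For a slowly varying column such as Seq(100L, 101L, 103L, 106L), the deltas Seq(1L, 2L, 3L) are small regardless of the magnitude of the values themselves, and therefore bit-pack into very few bits, which is where the encoding wins.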
- class VectorizedDeltaByteArrayReader extends VectorizedReaderBase with VectorizedValuesReader with RequiresPreviousReader
An implementation of the Parquet DELTA_BYTE_ARRAY decoder that supports the vectorized interface.
- class VectorizedDeltaLengthByteArrayReader extends VectorizedReaderBase with VectorizedValuesReader
An implementation of the Parquet DELTA_LENGTH_BYTE_ARRAY decoder that supports the vectorized interface.
- class VectorizedParquetRecordReader extends SpecificParquetRecordReaderBase[AnyRef]
A specialized RecordReader that reads into InternalRows or ColumnarBatches directly using the Parquet column APIs. This is somewhat based on parquet-mr's ColumnReader.
TODO: handle decimals requiring more than 8 bytes, INT96, and schema mismatches; all of these can be handled efficiently and easily with codegen.
This class can return either InternalRows or ColumnarBatches. With whole-stage codegen enabled, it returns ColumnarBatches, which offer significant performance gains. TODO: make this always return ColumnarBatches.
- class VectorizedPlainValuesReader extends ValuesReader with VectorizedValuesReader
An implementation of the Parquet PLAIN decoder that supports the vectorized interface.
- class VectorizedReaderBase extends ValuesReader with VectorizedValuesReader
Base class for implementations of VectorizedValuesReader, mainly to avoid duplicating methods that are not supported by concrete implementations.
- final class VectorizedRleValuesReader extends ValuesReader with VectorizedValuesReader
A values reader for Parquet's run-length encoded data. This is based on the version in parquet-mr, with these changes:
- Supports the vectorized interface.
- Works on byte arrays (byte[]) instead of byte streams.
This encoding is used in multiple places:
- Definition/repetition levels
- Dictionary ids
- Boolean values in Parquet DataPageV2
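The run-length half of this encoding can be sketched as follows. This is a conceptual illustration only: Parquet's actual format is an RLE/bit-packing hybrid in which runs and bit-packed groups are interleaved and selected by the low bit of a varint header, none of which is modeled here.

```scala
// Conceptual sketch of run-length encoding, as used for levels and
// dictionary ids. The bit-packed half of Parquet's hybrid is omitted.
object RleSketch {
  // Encode a sequence of small ints (e.g. definition levels) as
  // (runLength, value) pairs.
  def encode(values: Seq[Int]): Seq[(Int, Int)] =
    values.foldLeft(List.empty[(Int, Int)]) {
      case ((n, v) :: rest, x) if x == v => (n + 1, v) :: rest
      case (runs, x)                     => (1, x) :: runs
    }.reverse

  // Decode by expanding each run back into repeated values.
  def decode(runs: Seq[(Int, Int)]): Seq[Int] =
    runs.flatMap { case (n, v) => Seq.fill(n)(v) }
}
```

Definition levels of a mostly non-null optional column, e.g. Seq(1, 1, 1, 1, 0, 1), collapse to Seq((4, 1), (1, 0), (1, 1)), which is why level data compresses so well under this scheme.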
- trait VectorizedValuesReader extends AnyRef
Interface for value decoding that supports vectorized (aka batched) decoding. TODO: merge this into parquet-mr.
Value Members
- object ParquetColumn extends Serializable
- object ParquetFileFormat extends Logging with Serializable
- object ParquetOptions extends DataSourceOptions with Serializable
- object ParquetReadSupport extends Logging
- object ParquetRowIndexUtil
- object ParquetUtils extends Logging
- object ParquetWriteSupport