Packages

package parquet


Type Members

  1. case class ParquetColumn(sparkType: DataType, descriptor: Option[ColumnDescriptor], repetitionLevel: Int, definitionLevel: Int, required: Boolean, path: Seq[String], children: Seq[ParquetColumn]) extends Product with Serializable

    Rich information for a Parquet column together with its SparkSQL type.
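As an illustration of how the children tree can be traversed, here is a minimal sketch using a simplified stand-in for ParquetColumn (the real class also carries sparkType, descriptor, repetition/definition levels, and so on, which are omitted here):

```scala
// Simplified stand-in for ParquetColumn; only `path` and `children`
// are modeled. A leaf column has no children.
case class Col(path: Seq[String], children: Seq[Col])

// Recursively collect the dotted path of every leaf column.
def leafPaths(c: Col): Seq[String] =
  if (c.children.isEmpty) Seq(c.path.mkString("."))
  else c.children.flatMap(leafPaths)

// A root with a flat column `id` and a nested struct `name`.
val root = Col(Nil, Seq(
  Col(Seq("id"), Nil),
  Col(Seq("name"), Seq(
    Col(Seq("name", "first"), Nil),
    Col(Seq("name", "last"), Nil)))))

val paths = leafPaths(root)  // Seq("id", "name.first", "name.last")
```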

  2. final class ParquetDictionary extends Dictionary
  3. class ParquetFileFormat extends FileFormat with DataSourceRegister with Logging with Serializable
  4. class ParquetFilters extends AnyRef

Some utility functions to convert Spark data source filters to Parquet filters.

  5. class ParquetFooterReader extends AnyRef

ParquetFooterReader is a utility class that encapsulates helper methods for reading Parquet file footers.

  6. class ParquetOptions extends FileSourceOptions

    Options for the Parquet data source.
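For illustration, options such as compression and mergeSchema (both real Parquet data source options) can be passed per read or write. This sketch assumes an existing SparkSession named spark and a DataFrame named df:

```scala
// `compression` is parsed by ParquetOptions; "snappy" is one of the
// supported codec names.
df.write
  .option("compression", "snappy")
  .parquet("/tmp/parquet-out")

// `mergeSchema` controls whether the schemas of all part-files are
// merged when reading.
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("/tmp/parquet-out")
```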

  7. class ParquetOutputWriter extends OutputWriter
  8. class ParquetReadSupport extends ReadSupport[InternalRow] with Logging

A Parquet ReadSupport implementation for reading Parquet records as Catalyst InternalRows.

The ReadSupport API is a little overcomplicated for historical reasons. In older versions of parquet-mr (say, 1.6.0rc3 and prior), ReadSupport needed to be instantiated and initialized twice, on both the driver side and the executor side: the init() method handled driver-side initialization, while prepareForRead() handled the executor side. Starting from parquet-mr 1.6.0, however, this is no longer the case, and ReadSupport is instantiated and initialized only on the executor side. So, theoretically, it would now be fine to combine these two methods into a single initialization method. The only apparent reason to keep both is backwards compatibility with the parquet-mr API.

For this reason, we no longer rely on ReadContext to pass the requested schema from init() to prepareForRead(), but use a private var for simplicity.

  9. class ParquetToSparkSchemaConverter extends AnyRef

This converter class is used to convert Parquet MessageType to Spark SQL StructType (via the convert method) as well as ParquetColumn (via the convertParquetColumn method). The latter contains richer information about the Parquet type, including its associated repetition & definition levels, column path, column descriptor, etc.

    Parquet format backwards-compatibility rules are respected when converting Parquet MessageType schemas.

    See also

    https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

  10. trait ParquetVectorUpdater extends AnyRef
  11. class ParquetVectorUpdaterFactory extends AnyRef
  12. class ParquetWriteSupport extends WriteSupport[InternalRow] with Logging

    A Parquet WriteSupport implementation that writes Catalyst InternalRows as Parquet messages.

    A Parquet WriteSupport implementation that writes Catalyst InternalRows as Parquet messages. This class can write Parquet data in two modes:

    • Standard mode: Parquet data are written in standard format defined in parquet-format spec.
    • Legacy mode: Parquet data are written in legacy format compatible with Spark 1.4 and prior.

    This behavior can be controlled by SQL option spark.sql.parquet.writeLegacyFormat. The value of this option is propagated to this class by the init() method and its Hadoop configuration argument.
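The option above can be set in several ways; for example (a sketch, assuming a running SparkSession named spark):

```scala
// Session-level setting; affects subsequent Parquet writes in this session.
// spark.sql.parquet.writeLegacyFormat is a real Spark SQL config
// (default false, i.e. standard mode).
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
```

The same key can also be supplied at submission time, e.g. via `--conf spark.sql.parquet.writeLegacyFormat=true`.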

  13. class SparkToParquetSchemaConverter extends AnyRef

    This converter class is used to convert Spark SQL StructType to Parquet MessageType.

  14. abstract class SpecificParquetRecordReaderBase[T] extends RecordReader[Void, T]

Base class for custom RecordReaders for Parquet that directly materialize to T. This class handles computing row groups, filtering them, setting up the column readers, etc. It is heavily based on parquet-mr's RecordReader. There are performance benefits to doing it this way, albeit at a higher implementation cost; this base class is reusable. TODO: move this to the parquet-mr project.

  15. class VectorizedColumnReader extends AnyRef

    Decoder to return values from a single column.

  16. class VectorizedDeltaBinaryPackedReader extends VectorizedReaderBase

An implementation of the Parquet DELTA_BINARY_PACKED decoder that supports the vectorized interface. DELTA_BINARY_PACKED is a delta encoding for integer and long types that stores each value as a delta from the previous value; the delta values are themselves bit packed. It is similar to RLE, but more effective when the values in the encoded column vary widely.

    DELTA_BINARY_PACKED is the default encoding for integer and long columns in Parquet V2.

    Supported Types: INT32, INT64

    See also

    Parquet format encodings: DELTA_BINARY_PACKED
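The delta reconstruction itself is simple to illustrate. A minimal sketch of the running-sum idea, which deliberately ignores the real format's block/miniblock layout, zigzag-encoded minimum deltas, and bit packing:

```scala
// DELTA_BINARY_PACKED conceptually stores a first value plus the deltas
// between consecutive values; decoding is a running sum over the deltas.
def decodeDeltas(firstValue: Long, deltas: Seq[Long]): Seq[Long] =
  deltas.scanLeft(firstValue)(_ + _)

// 7, then 7 + (-2) = 5, 5 + 3 = 8, 8 + 1 = 9
val values = decodeDeltas(7L, Seq(-2L, 3L, 1L))  // Seq(7, 5, 8, 9)
```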

  17. class VectorizedDeltaByteArrayReader extends VectorizedReaderBase with VectorizedValuesReader with RequiresPreviousReader

    An implementation of the Parquet DELTA_BYTE_ARRAY decoder that supports the vectorized interface.

  18. class VectorizedDeltaLengthByteArrayReader extends VectorizedReaderBase with VectorizedValuesReader

    An implementation of the Parquet DELTA_LENGTH_BYTE_ARRAY decoder that supports the vectorized interface.

  19. class VectorizedParquetRecordReader extends SpecificParquetRecordReaderBase[AnyRef]

A specialized RecordReader that reads into InternalRows or ColumnarBatches directly using the Parquet column APIs. This is somewhat based on parquet-mr's ColumnReader.

TODO: decimals requiring more than 8 bytes, INT96, schema mismatches. All of these can be handled efficiently and easily with codegen.

This class can return either InternalRows or ColumnarBatches. With whole-stage codegen enabled, it returns ColumnarBatches, which offer significant performance gains. TODO: make this always return ColumnarBatches.
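Whether this vectorized path is used at all is controlled by session configuration; for example (a sketch, assuming a running SparkSession named spark — verify the defaults against your Spark version):

```scala
// Enable the vectorized Parquet reader (the default in recent Spark versions).
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")

// Number of rows per ColumnarBatch produced by the vectorized reader.
spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", "4096")
```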

  20. class VectorizedPlainValuesReader extends ValuesReader with VectorizedValuesReader

    An implementation of the Parquet PLAIN decoder that supports the vectorized interface.

  21. class VectorizedReaderBase extends ValuesReader with VectorizedValuesReader

Base class for implementations of VectorizedValuesReader, mainly to avoid duplicating methods that are not supported by concrete implementations.

  22. final class VectorizedRleValuesReader extends ValuesReader with VectorizedValuesReader

A values reader for Parquet's run-length encoded data. This is based on the version in parquet-mr, with these changes:

    • Supports the vectorized interface.
    • Works on byte arrays (byte[]) instead of byte streams.

    This encoding is used in multiple places:

    • Definition/repetition levels
    • Dictionary ids
    • Boolean type values of Parquet DataPageV2
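Conceptually, the RLE half of the encoding expands (run length, value) pairs. A minimal sketch of that idea — the real reader also handles the hybrid format's bit-packed runs and the vectorized interface:

```scala
// Expand run-length encoded (count, value) pairs, e.g. for definition
// levels or dictionary ids. Parquet's actual encoding is a hybrid of
// RLE runs and bit-packed runs; only the RLE half is sketched here.
def decodeRle(runs: Seq[(Int, Int)]): Seq[Int] =
  runs.flatMap { case (count, value) => Seq.fill(count)(value) }

// Three 1s followed by two 0s: definition levels for three present
// values and two nulls in an optional column.
val levels = decodeRle(Seq((3, 1), (2, 0)))  // Seq(1, 1, 1, 0, 0)
```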
  23. trait VectorizedValuesReader extends AnyRef

Interface for value decoding that supports vectorized (aka batched) decoding. TODO: merge this into parquet-mr.

Value Members

  1. object ParquetColumn extends Serializable
  2. object ParquetFileFormat extends Logging with Serializable
  3. object ParquetOptions extends DataSourceOptions with Serializable
  4. object ParquetReadSupport extends Logging
  5. object ParquetRowIndexUtil
  6. object ParquetUtils extends Logging
  7. object ParquetWriteSupport
