class MultiFileCloudParquetPartitionReader extends MultiFileCloudPartitionReaderBase with ParquetPartitionReaderBase
A PartitionReader that can read multiple Parquet files in parallel. This is most efficient when running in a cloud environment where the I/O of reading is slow.
Efficiently reading a Parquet split on the GPU requires reconstructing, in memory, a Parquet file that contains just the column chunks that are needed. This avoids sending unnecessary data to the GPU and saves GPU memory.
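The rebuild described above can be sketched roughly as follows. This is an illustrative outline, not the actual implementation; `neededColumns` and `copyRange` are hypothetical placeholders.

```scala
import scala.collection.JavaConverters._
import java.io.OutputStream
import org.apache.hadoop.fs.FSDataInputStream
import org.apache.parquet.hadoop.metadata.BlockMetaData

// Illustrative sketch only: rebuild a Parquet file that holds just the
// column chunks needed by the read schema.
def clipAndRebuild(
    blocks: Seq[BlockMetaData],   // row-group metadata from the footer
    neededColumns: Set[String],   // columns requested by readDataSchema
    in: FSDataInputStream,
    out: OutputStream): Unit = {
  out.write("PAR1".getBytes)      // leading Parquet magic
  for (block <- blocks;
       chunk <- block.getColumns.asScala
       if neededColumns.contains(chunk.getPath.toDotString)) {
    // copy only the bytes of the wanted column chunk (hypothetical helper)
    copyRange(in, out, chunk.getStartingPos, chunk.getTotalSize)
  }
  // A real implementation must also write a footer whose chunk offsets
  // reflect the new positions, then the footer length and trailing magic.
}
```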
Linear Supertypes
- MultiFileCloudParquetPartitionReader
- ParquetPartitionReaderBase
- MultiFileReaderFunctions
- MultiFileCloudPartitionReaderBase
- FilePartitionReaderBase
- Arm
- ScanWithMetrics
- Logging
- PartitionReader
- Closeable
- AutoCloseable
- AnyRef
- Any
Instance Constructors
-
new
MultiFileCloudParquetPartitionReader(conf: Configuration, files: Array[PartitionedFile], isSchemaCaseSensitive: Boolean, readDataSchema: StructType, debugDumpPrefix: String, maxReadBatchSizeRows: Integer, maxReadBatchSizeBytes: Long, execMetrics: Map[String, GpuMetric], partitionSchema: StructType, numThreads: Int, maxNumFileProcessed: Int, filterHandler: GpuParquetFileFilterHandler, filters: Array[Filter])
- conf
the Hadoop configuration
- files
the partitioned files to read
- isSchemaCaseSensitive
whether schema is case sensitive
- readDataSchema
the Spark schema describing what will be read
- debugDumpPrefix
a path prefix to use for dumping the fabricated Parquet data or null
- maxReadBatchSizeRows
soft limit on the maximum number of rows the reader reads per batch
- maxReadBatchSizeBytes
soft limit on the maximum number of bytes the reader reads per batch
- execMetrics
the execution metrics to update while reading
- partitionSchema
the schema of the partition columns
- numThreads
the size of the threadpool
- maxNumFileProcessed
the maximum number of files to read on the CPU side and hold while waiting to be processed on the GPU. This affects the amount of host memory used.
- filterHandler
the GpuParquetFileFilterHandler used to filter the Parquet blocks
- filters
filters passed into the filterHandler
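Taken together, the constructor parameters might be wired up as below. This is a hypothetical sketch: every value (`hadoopConf`, `partitionedFiles`, and so on) is a placeholder, not a recommended configuration.

```scala
// Hypothetical construction; all values are placeholders.
val reader = new MultiFileCloudParquetPartitionReader(
  hadoopConf,                        // conf: Hadoop Configuration
  partitionedFiles,                  // files: Array[PartitionedFile]
  false,                             // isSchemaCaseSensitive
  sparkReadSchema,                   // readDataSchema: StructType
  null,                              // debugDumpPrefix: disable debug dumps
  Integer.MAX_VALUE,                 // maxReadBatchSizeRows (soft limit)
  2L * 1024 * 1024 * 1024,           // maxReadBatchSizeBytes (soft limit)
  execMetrics,                       // Map[String, GpuMetric]
  partSchema,                        // partitionSchema: StructType
  20,                                // numThreads in the reader pool
  2,                                 // maxNumFileProcessed: bounds host memory
  filterHandler,                     // GpuParquetFileFilterHandler
  pushedFilters)                     // Array[Filter]

// Standard PartitionReader loop: next()/get() until exhausted, then close().
while (reader.next()) {
  val batch = reader.get()
  // ... consume the ColumnarBatch ...
}
reader.close()
```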
Type Members
- case class HostMemoryBuffersWithMetaData(partitionedFile: PartitionedFile, memBuffersAndSizes: Array[(HostMemoryBuffer, Long)], bytesRead: Long, isCorrectRebaseMode: Boolean, clippedSchema: MessageType) extends HostMemoryBuffersWithMetaDataBase with Product with Serializable
Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
addPartitionValues(batch: Option[ColumnarBatch], inPartitionValues: InternalRow, partitionSchema: StructType): Option[ColumnarBatch]
- Attributes
- protected
- Definition Classes
- MultiFileReaderFunctions
-
def
areNamesEquiv(groups: GroupType, index: Int, otherName: String, isCaseSensitive: Boolean): Boolean
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
var
batch: Option[ColumnarBatch]
- Attributes
- protected
- Definition Classes
- FilePartitionReaderBase
-
def
calculateParquetFooterSize(currentChunkedBlocks: Seq[BlockMetaData], schema: MessageType): Long
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
-
def
calculateParquetOutputSize(currentChunkedBlocks: Seq[BlockMetaData], schema: MessageType, handleCoalesceFiles: Boolean): Long
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
-
def
clone(): AnyRef
- Attributes
- protected[java.lang]
- Definition Classes
- AnyRef
- Annotations
- @native() @throws( ... )
-
def
close(): Unit
- Definition Classes
- MultiFileCloudPartitionReaderBase → FilePartitionReaderBase → Closeable → AutoCloseable
-
def
closeOnExcept[T <: AutoCloseable, V](r: ArrayBuffer[T])(block: (ArrayBuffer[T]) ⇒ V): V
Executes the provided code block, closing the resources only if an exception occurs.
- Definition Classes
- Arm
-
def
closeOnExcept[T <: AutoCloseable, V](r: Array[T])(block: (Array[T]) ⇒ V): V
Executes the provided code block, closing the resources only if an exception occurs.
- Definition Classes
- Arm
-
def
closeOnExcept[T <: AutoCloseable, V](r: Seq[T])(block: (Seq[T]) ⇒ V): V
Executes the provided code block, closing the resources only if an exception occurs.
- Definition Classes
- Arm
-
def
closeOnExcept[T <: AutoCloseable, V](r: T)(block: (T) ⇒ V): V
Executes the provided code block, closing the resource only if an exception occurs.
- Definition Classes
- Arm
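closeOnExcept is typically used when building a resource that should be returned on success but must not leak on failure; a minimal sketch, assuming a hypothetical `fillFromStream` helper:

```scala
// The buffer is closed only if fillFromStream throws; on success the
// caller takes ownership of the still-open buffer.
val filled: HostMemoryBuffer =
  closeOnExcept(HostMemoryBuffer.allocate(size)) { hmb =>
    fillFromStream(hmb, in)   // hypothetical helper; may throw
    hmb
  }
```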
-
val
conf: Configuration
- Definition Classes
- MultiFileCloudParquetPartitionReader → ParquetPartitionReaderBase
-
def
copyBlocksData(in: FSDataInputStream, out: HostMemoryOutputStream, blocks: Seq[BlockMetaData], realStartOffset: Long): Seq[BlockMetaData]
Copies the data corresponding to the clipped blocks in the original file and computes the block metadata for the output. The output blocks will contain the same column chunk metadata but with the file offsets updated to reflect the new position of the column data as written to the output.
- in
the input stream for the original Parquet file
- out
the output stream to receive the data
- returns
updated block metadata corresponding to the output
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
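The offset fix-up described above amounts to walking the chunks in output order and rebasing each starting position. A simplified, self-contained sketch (Parquet's real metadata model carries more fields than this):

```scala
// Simplified stand-in for Parquet column-chunk metadata.
case class ChunkMeta(path: String, startingPos: Long, totalSize: Long)

// Rebase chunk offsets so they describe positions in the output file,
// where the chunks are written back-to-back starting at outStart.
def relocate(chunks: Seq[ChunkMeta], outStart: Long): Seq[ChunkMeta] = {
  var pos = outStart
  chunks.map { c =>
    val moved = c.copy(startingPos = pos)
    pos += c.totalSize
    moved
  }
}
```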
-
val
copyBufferSize: Int
- Definition Classes
- ParquetPartitionReaderBase
-
def
copyDataRange(range: CopyRange, in: FSDataInputStream, out: OutputStream, copyBuffer: Array[Byte]): Unit
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
-
var
currentFileHostBuffers: Option[HostMemoryBuffersWithMetaDataBase]
- Attributes
- protected
- Definition Classes
- MultiFileCloudPartitionReaderBase
-
def
dumpDataToFile(hmb: HostMemoryBuffer, dataLength: Long, splits: Array[PartitionedFile], debugDumpPrefix: Option[String] = None, format: Option[String] = None): Unit
Dumps the data from a HostMemoryBuffer to a file named by debugDumpPrefix + random + format.
- hmb
host data to be dumped
- dataLength
data size
- splits
PartitionedFile to be handled
- debugDumpPrefix
file name prefix; if it is None, nothing is dumped
- format
file name suffix; if it is None, nothing is dumped
- Attributes
- protected
- Definition Classes
- FilePartitionReaderBase
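A debug dump call might look like the following; the path and format here are hypothetical examples:

```scala
// Hypothetical: with this prefix and format the buffer is dumped to a file
// like /tmp/gpu-parquet-debug<random>.parquet; if either option is None,
// nothing is written.
dumpDataToFile(
  hostBuffer,                          // hmb: host data to dump
  bufferSize,                          // dataLength
  Array(partitionedFile),              // splits
  Some("/tmp/gpu-parquet-debug"),      // debugDumpPrefix
  Some("parquet"))                     // format
```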
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
evolveSchemaIfNeededAndClose(inputTable: Table, filePath: String, clippedSchema: MessageType): Table
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
-
def
fileSystemBytesRead(): Long
- Attributes
- protected
- Definition Classes
- MultiFileReaderFunctions
-
def
finalize(): Unit
- Attributes
- protected[java.lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
-
def
freeOnExcept[T <: RapidsBuffer, V](r: T)(block: (T) ⇒ V): V
Executes the provided code block, freeing the RapidsBuffer only if an exception occurs.
- Definition Classes
- Arm
-
def
get(): ColumnarBatch
- Definition Classes
- FilePartitionReaderBase → PartitionReader
-
def
getBatchRunner(file: PartitionedFile, conf: Configuration, filters: Array[Filter]): Callable[HostMemoryBuffersWithMetaDataBase]
The file reading logic in a Callable that will run in a thread pool.
- file
file to be read
- conf
the Hadoop configuration
- filters
the pushed-down filters
- returns
Callable[HostMemoryBuffersWithMetaDataBase]
- Definition Classes
- MultiFileCloudParquetPartitionReader → MultiFileCloudPartitionReaderBase
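The base class drives these Callables roughly as below. This is an illustrative sketch of the pipeline, not the actual MultiFileCloudPartitionReaderBase code:

```scala
// Illustrative: submit one read task per file, bounded by maxNumFileProcessed,
// then decode each completed host buffer on the GPU.
val pool: ThreadPoolExecutor = getThreadPool(numThreads)
val futures = files.take(maxNumFileProcessed).map { file =>
  pool.submit(getBatchRunner(file, conf, filters))
}
for (f <- futures) {
  val bufsAndMeta: HostMemoryBuffersWithMetaDataBase = f.get()
  val maybeBatch: Option[ColumnarBatch] = readBatch(bufsAndMeta)
  // ... emit maybeBatch downstream, then submit the next pending file ...
}
```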
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
final
def
getFileFormatShortName: String
File format short name used for logging and other things to uniquely identify which file format is being used.
- returns
the file format short name
- Definition Classes
- MultiFileCloudParquetPartitionReader → MultiFileCloudPartitionReaderBase
-
def
getPrecisionsList(fields: Seq[Type]): Seq[Int]
- Definition Classes
- ParquetPartitionReaderBase
-
def
getThreadPool(numThreads: Int): ThreadPoolExecutor
Get the ThreadPoolExecutor to run the Callable.
- numThreads
max number of threads to create
- returns
ThreadPoolExecutor
- Definition Classes
- MultiFileCloudParquetPartitionReader → MultiFileCloudPartitionReaderBase
-
def
hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
def
initializeLogIfNecessary(isInterpreter: Boolean): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
var
isDone: Boolean
- Attributes
- protected
- Definition Classes
- FilePartitionReaderBase
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
val
isSchemaCaseSensitive: Boolean
- Definition Classes
- MultiFileCloudParquetPartitionReader → ParquetPartitionReaderBase
-
def
isTraceEnabled(): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
def
log: Logger
- Attributes
- protected
- Definition Classes
- Logging
-
def
logDebug(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logDebug(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logName: String
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
var
maxDeviceMemory: Long
- Attributes
- protected
- Definition Classes
- FilePartitionReaderBase
-
val
metrics: Map[String, GpuMetric]
- Definition Classes
- ScanWithMetrics
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
next(): Boolean
- Definition Classes
- MultiFileCloudPartitionReaderBase → PartitionReader
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
def
populateCurrentBlockChunk(blockIter: BufferedIterator[BlockMetaData], maxReadBatchSizeRows: Int, maxReadBatchSizeBytes: Long): Seq[BlockMetaData]
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
-
def
readBatch(fileBufsAndMeta: HostMemoryBuffersWithMetaDataBase): Option[ColumnarBatch]
Decodes the HostMemoryBuffers on the GPU.
- fileBufsAndMeta
the file HostMemoryBuffer read from a PartitionedFile
- returns
Option[ColumnarBatch]
- Definition Classes
- MultiFileCloudParquetPartitionReader → MultiFileCloudPartitionReaderBase
-
val
readDataSchema: StructType
- Definition Classes
- MultiFileCloudParquetPartitionReader → ParquetPartitionReaderBase
-
def
readPartFile(blocks: Seq[BlockMetaData], clippedSchema: MessageType, filePath: Path): (HostMemoryBuffer, Long)
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
def
toString(): String
- Definition Classes
- AnyRef → Any
-
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @native() @throws( ... )
-
def
withResource[T <: AutoCloseable, V](r: ArrayBuffer[T])(block: (ArrayBuffer[T]) ⇒ V): V
Executes the provided code block and then closes the array buffer of resources.
- Definition Classes
- Arm
-
def
withResource[T <: AutoCloseable, V](r: Array[T])(block: (Array[T]) ⇒ V): V
Executes the provided code block and then closes the array of resources.
- Definition Classes
- Arm
-
def
withResource[T <: AutoCloseable, V](r: Seq[T])(block: (Seq[T]) ⇒ V): V
Executes the provided code block and then closes the sequence of resources.
- Definition Classes
- Arm
-
def
withResource[T <: AutoCloseable, V](r: Option[T])(block: (Option[T]) ⇒ V): V
Executes the provided code block and then closes the Option[resource].
- Definition Classes
- Arm
-
def
withResource[T <: AutoCloseable, V](r: T)(block: (T) ⇒ V): V
Executes the provided code block and then closes the resource.
- Definition Classes
- Arm
-
def
withResourceIfAllowed[T, V](r: T)(block: (T) ⇒ V): V
Executes the provided code block and then closes the value if it is AutoCloseable.
- Definition Classes
- Arm
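Unlike closeOnExcept, which closes only on failure, withResource closes the resource unconditionally when the block exits; a minimal sketch:

```scala
// Both resources are closed when their blocks exit, normally or
// exceptionally, so neither can leak.
withResource(fs.open(filePath)) { in =>
  withResource(HostMemoryBuffer.allocate(len)) { hmb =>
    // ... read from in into hmb and hand results on before both close ...
  }
}
```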
-
def
writeFooter(out: OutputStream, blocks: Seq[BlockMetaData], schema: MessageType): Unit
- Attributes
- protected
- Definition Classes
- ParquetPartitionReaderBase