public class DFSHoodieDatasetInputReader extends DFSDeltaInputReader
| Constructor and Description |
|---|
| DFSHoodieDatasetInputReader(org.apache.spark.api.java.JavaSparkContext jsc, String basePath, String schemaStr) |
| Modifier and Type | Method and Description |
|---|---|
| protected long | analyzeSingleFile(String filePath) Implementation of DeltaInputReaders to provide a way to read a single file on DFS and provide an average number of records across N files. |
| protected List<String> | getPartitions(org.apache.hudi.common.util.Option<Integer> partitionsLimit) |
| org.apache.spark.api.java.JavaRDD<org.apache.avro.generic.GenericRecord> | read(int numPartitions, int numFiles, double percentageRecordsPerFile) |
| org.apache.spark.api.java.JavaRDD<org.apache.avro.generic.GenericRecord> | read(int numPartitions, int numFiles, long numRecords) |
| org.apache.spark.api.java.JavaRDD<org.apache.avro.generic.GenericRecord> | read(int numPartitions, long approxNumRecords) |
| org.apache.spark.api.java.JavaRDD<org.apache.avro.generic.GenericRecord> | read(long numRecords) Attempts to read an approximate number of records close to numRecords. |
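The summary above pairs analyzeSingleFile, which yields an average record count per file, with read overloads that target a record budget. As a rough illustration of how such a budget maps to a number of files, here is a hypothetical helper; the method name and rounding choice are assumptions for illustration, not part of the Hudi class:

```java
// Hypothetical sketch: estimate how many files must be read to cover a
// record budget, given the average records-per-file that a method like
// analyzeSingleFile reports. Not the actual Hudi implementation.
public class FileBudget {
    static long filesNeeded(long numRecords, long avgRecordsPerFile) {
        if (avgRecordsPerFile <= 0) {
            throw new IllegalArgumentException("average must be positive");
        }
        // Round up: reading one file too many is fine, too few is not.
        return (numRecords + avgRecordsPerFile - 1) / avgRecordsPerFile;
    }
}
```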
Methods inherited from class DFSDeltaInputReader:
getFilePathsToRead, getFileStatusIndexRange

Method Detail

getPartitions

protected List<String> getPartitions(org.apache.hudi.common.util.Option<Integer> partitionsLimit)
                              throws IOException

Throws:
IOException

analyzeSingleFile

protected long analyzeSingleFile(String filePath)

Description copied from class DFSDeltaInputReader: Implementation of DeltaInputReaders to provide a way to read a single file on DFS and provide an average number of records across N files.

Overrides:
analyzeSingleFile in class DFSDeltaInputReader

read

public org.apache.spark.api.java.JavaRDD<org.apache.avro.generic.GenericRecord> read(long numRecords)
                                                                          throws IOException

Specified by:
read in interface DeltaInputReader

Throws:
IOException

read

public org.apache.spark.api.java.JavaRDD<org.apache.avro.generic.GenericRecord> read(int numPartitions,
                                                                                     long approxNumRecords)
                                                                              throws IOException

Attempts to read an approximate number of records (exact if equal or more records are available) across the requested number of partitions.

Throws:
IOException

read

public org.apache.spark.api.java.JavaRDD<org.apache.avro.generic.GenericRecord> read(int numPartitions,
                                                                                     int numFiles,
                                                                                     long numRecords)
                                                                              throws IOException

Attempts to read an approximate number of records (exact if equal or more records are available) across the requested number of partitions and number of files:
1. Find numFiles across numPartitions.
2. numRecordsToReadPerFile = approxNumRecords / numFiles

Throws:
IOException

read

public org.apache.spark.api.java.JavaRDD<org.apache.avro.generic.GenericRecord> read(int numPartitions,
                                                                                     int numFiles,
                                                                                     double percentageRecordsPerFile)
                                                                              throws IOException

Attempts to read a percentage of records per file across the requested number of partitions and number of files:
1. Find numFiles across numPartitions.
2. numRecordsToReadPerFile = approxNumRecordsPerFile * percentageRecordsPerFile

Throws:
IOException

Copyright © 2023 The Apache Software Foundation. All rights reserved.
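Steps 1 and 2 of the two multi-file read overloads reduce to simple per-file budgeting. The class and method names below are hypothetical, chosen only to mirror the formulas stated in the documentation; this is a sketch, not the actual Hudi implementation:

```java
// Hypothetical helpers mirroring the per-file budgeting formulas given in
// the read(...) method descriptions above.
public class PerFileBudget {
    // read(numPartitions, numFiles, numRecords):
    // numRecordsToReadPerFile = approxNumRecords / numFiles
    static long recordsPerFile(long approxNumRecords, int numFiles) {
        return approxNumRecords / numFiles;
    }

    // read(numPartitions, numFiles, percentageRecordsPerFile):
    // numRecordsToReadPerFile = approxNumRecordsPerFile * percentageRecordsPerFile
    static long recordsPerFile(long approxNumRecordsPerFile, double percentage) {
        return (long) (approxNumRecordsPerFile * percentage);
    }
}
```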