public static class RewriteOptions.Builder extends Object
RewriteOptions is used for constructing a ParquetRewriter.

| Constructor and Description |
|---|
| `Builder(org.apache.hadoop.conf.Configuration conf, List<org.apache.hadoop.fs.Path> inputFiles, List<org.apache.hadoop.fs.Path> inputFilesToJoin, org.apache.hadoop.fs.Path outputFile)` Create a builder to create a RewriteOptions. |
| `Builder(org.apache.hadoop.conf.Configuration conf, List<org.apache.hadoop.fs.Path> inputFiles, org.apache.hadoop.fs.Path outputFile)` Create a builder to create a RewriteOptions. |
| `Builder(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path inputFile, org.apache.hadoop.fs.Path outputFile)` Create a builder to create a RewriteOptions. |
| `Builder(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path inputFile, org.apache.hadoop.fs.Path inputFileToJoin, org.apache.hadoop.fs.Path outputFile)` Create a builder to create a RewriteOptions. |
| `Builder(ParquetConfiguration conf, InputFile inputFile, InputFile inputFileToJoin, OutputFile outputFile)` Create a builder to create a RewriteOptions. |
| `Builder(ParquetConfiguration conf, InputFile inputFile, OutputFile outputFile)` Create a builder to create a RewriteOptions. |
| `Builder(ParquetConfiguration conf, List<InputFile> inputFiles, List<InputFile> inputFilesToJoin, OutputFile outputFile)` Create a builder to create a RewriteOptions. |
| `Builder(ParquetConfiguration conf, List<InputFile> inputFiles, OutputFile outputFile)` Create a builder to create a RewriteOptions. |
| Modifier and Type | Method and Description |
|---|---|
| `RewriteOptions.Builder` | `addInputFile(InputFile inputFile)` Add an input file to read from. |
| `RewriteOptions.Builder` | `addInputFile(org.apache.hadoop.fs.Path path)` Add an input file to read from. |
| `RewriteOptions.Builder` | `addInputFilesToJoin(InputFile fileToJoin)` Add an input file to join. |
| `RewriteOptions.Builder` | `addInputFileToJoinColumns(org.apache.hadoop.fs.Path path)` Add an input join file to read from. |
| `RewriteOptions` | `build()` Build the RewriteOptions. |
| `RewriteOptions.Builder` | `encrypt(List<String> encryptColumns)` Set the columns to encrypt. |
| `RewriteOptions.Builder` | `encryptionProperties(FileEncryptionProperties fileEncryptionProperties)` Set the encryption properties to use for the output file. |
| `RewriteOptions.Builder` | `ignoreJoinFilesMetadata(boolean ignoreJoinFilesMetadata)` Set whether metadata from join files should be ignored. |
| `RewriteOptions.Builder` | `indexCacheStrategy(IndexCache.CacheStrategy cacheStrategy)` Set the index (ColumnIndex, Offset, and BloomFilter) cache strategy. |
| `RewriteOptions.Builder` | `mask(Map<String,MaskMode> maskColumns)` Set the columns to mask. |
| `RewriteOptions.Builder` | `overwriteInputWithJoinColumns(boolean overwriteInputWithJoinColumns)` Set whether columns from join files should overwrite columns from the main input files. |
| `RewriteOptions.Builder` | `prune(List<String> columns)` Set the columns to prune. |
| `RewriteOptions.Builder` | `renameColumns(Map<String,String> renameColumns)` Set the columns to be renamed. |
| `RewriteOptions.Builder` | `transform(CompressionCodecName newCodecName)` Set the compression codec to use for the output file. |
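As a sketch of typical single-file usage (a hedged example: the file paths and the pruned column name are hypothetical, and parquet-hadoop is assumed to be on the classpath), the builder chains option setters and ends with `build()`; the resulting options are then handed to `ParquetRewriter`:

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.hadoop.rewrite.ParquetRewriter;
import org.apache.parquet.hadoop.rewrite.RewriteOptions;

public class RewriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    RewriteOptions options =
        new RewriteOptions.Builder(conf, new Path("in.parquet"), new Path("out.parquet"))
            .prune(Arrays.asList("debug_payload")) // drop this (hypothetical) column from the output
            .transform(CompressionCodecName.ZSTD)  // recompress the output with ZSTD
            .build();

    // ParquetRewriter performs the actual rewrite described by the options.
    ParquetRewriter rewriter = new ParquetRewriter(options);
    rewriter.processBlocks();
    rewriter.close();
  }
}
```

Each setter returns the builder itself, so options can be chained in any order before the final `build()`.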
public Builder(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path inputFile, org.apache.hadoop.fs.Path inputFileToJoin, org.apache.hadoop.fs.Path outputFile)

Parameters:
- conf - configuration for reading from input files and writing to the output file
- inputFile - input file path to read from
- inputFileToJoin - input join file path to read from
- outputFile - output file path to rewrite to

public Builder(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path inputFile, org.apache.hadoop.fs.Path outputFile)

Parameters:
- conf - configuration for reading from input files and writing to the output file
- inputFile - input file path to read from
- outputFile - output file path to rewrite to

public Builder(ParquetConfiguration conf, InputFile inputFile, OutputFile outputFile)

Parameters:
- conf - configuration for reading from input files and writing to the output file
- inputFile - input file to read from
- outputFile - output file to rewrite to

public Builder(ParquetConfiguration conf, InputFile inputFile, InputFile inputFileToJoin, OutputFile outputFile)

Parameters:
- conf - configuration for reading from input files and writing to the output file
- inputFile - input file to read from
- inputFileToJoin - input join file to read from
- outputFile - output file to rewrite to

public Builder(org.apache.hadoop.conf.Configuration conf, List<org.apache.hadoop.fs.Path> inputFiles, org.apache.hadoop.fs.Path outputFile)

Please note that when merging more than one file, all files must have the same schema; otherwise, the rewrite will fail.

The rewrite keeps the original row groups from all input files. This may not be optimal when row groups are very small, and it does not solve the small-file problem; instead, it makes things worse by producing a large footer in the output file. TODO: support rewriting by record to break the original row groups into reasonably sized ones.

See ParquetRewriter for more details.

Parameters:
- conf - configuration for reading from input files and writing to the output file
- inputFiles - list of input file paths to read from
- outputFile - output file path to rewrite to

public Builder(ParquetConfiguration conf, List<InputFile> inputFiles, OutputFile outputFile)

Please note that when merging more than one file, all files must have the same schema; otherwise, the rewrite will fail.

The rewrite keeps the original row groups from all input files. This may not be optimal when row groups are very small, and it does not solve the small-file problem; instead, it makes things worse by producing a large footer in the output file. TODO: support rewriting by record to break the original row groups into reasonably sized ones.

See ParquetRewriter for more details.

Parameters:
- conf - configuration for reading from input files and writing to the output file
- inputFiles - list of input files to read from
- outputFile - output file to rewrite to

public Builder(org.apache.hadoop.conf.Configuration conf, List<org.apache.hadoop.fs.Path> inputFiles, List<org.apache.hadoop.fs.Path> inputFilesToJoin, org.apache.hadoop.fs.Path outputFile)

Please note that the schema of all files within each file group (inputFiles and inputFilesToJoin) must be the same, while the schemas of the two groups may differ from each other. Otherwise, the rewrite will fail.

The rewrite keeps the original row groups from all input files. This may not be optimal when row groups are very small, and it does not solve the small-file problem; instead, it makes things worse by producing a large footer in the output file. TODO: support rewriting by record to break the original row groups into reasonably sized ones.

See ParquetRewriter for more details.

Parameters:
- conf - configuration for reading from input files and writing to the output file
- inputFiles - list of input file paths to read from
- inputFilesToJoin - list of input join file paths to read from
- outputFile - output file path to rewrite to

public Builder(ParquetConfiguration conf, List<InputFile> inputFiles, List<InputFile> inputFilesToJoin, OutputFile outputFile)

Please note that the schema of all files within each file group (inputFiles and inputFilesToJoin) must be the same, while the schemas of the two groups may differ from each other. Otherwise, the rewrite will fail.

The rewrite keeps the original row groups from all input files. This may not be optimal when row groups are very small, and it does not solve the small-file problem; instead, it makes things worse by producing a large footer in the output file.

See ParquetRewriter for more details.

Parameters:
- conf - configuration for reading from input files and writing to the output file
- inputFiles - list of input files to read from
- inputFilesToJoin - list of input join files to read from
- outputFile - output file to rewrite to

public RewriteOptions.Builder prune(List<String> columns)
By default, all columns are kept.
Parameters:
- columns - list of columns to prune

public RewriteOptions.Builder transform(CompressionCodecName newCodecName)

By default, the codec is the same as that of the input file.

Parameters:
- newCodecName - compression codec to use

public RewriteOptions.Builder mask(Map<String,MaskMode> maskColumns)

By default, no columns are masked.

Parameters:
- maskColumns - map from column name to masking mode

public RewriteOptions.Builder renameColumns(Map<String,String> renameColumns)

Note that nested columns can't be renamed; for a GroupType column, only the top-level column can be renamed.

Parameters:
- renameColumns - map where keys are original names and values are new names

public RewriteOptions.Builder encrypt(List<String> encryptColumns)

By default, no columns are encrypted.

Parameters:
- encryptColumns - list of columns to encrypt

public RewriteOptions.Builder encryptionProperties(FileEncryptionProperties fileEncryptionProperties)

This is required if the list of columns to encrypt is not empty.
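Putting the two encryption setters together, here is a hedged sketch. The column names, file paths, and hard-coded key are illustrative assumptions only; in real use the footer key would come from a key-management service, and `FileEncryptionProperties` would typically carry per-column keys as well:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.crypto.FileEncryptionProperties;
import org.apache.parquet.hadoop.rewrite.RewriteOptions;

public class EncryptExample {
  public static void main(String[] args) {
    // 16-byte AES key; a literal here only for the sketch, never in production.
    byte[] footerKey = "0123456789012345".getBytes(StandardCharsets.UTF_8);

    FileEncryptionProperties encryptionProps =
        FileEncryptionProperties.builder(footerKey).build();

    RewriteOptions options =
        new RewriteOptions.Builder(new Configuration(),
                new Path("plain.parquet"), new Path("encrypted.parquet"))
            .encrypt(Arrays.asList("ssn", "email")) // hypothetical column names
            .encryptionProperties(encryptionProps)  // required because encrypt() is non-empty
            .build();
  }
}
```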
Parameters:
- fileEncryptionProperties - encryption properties to use

public RewriteOptions.Builder addInputFile(org.apache.hadoop.fs.Path path)

Parameters:
- path - input file path to read from

public RewriteOptions.Builder addInputFileToJoinColumns(org.apache.hadoop.fs.Path path)

Parameters:
- path - input join file path to read from

public RewriteOptions.Builder addInputFile(InputFile inputFile)

Parameters:
- inputFile - input file to read from

public RewriteOptions.Builder addInputFilesToJoin(InputFile fileToJoin)

Parameters:
- fileToJoin - input file to join

public RewriteOptions.Builder indexCacheStrategy(IndexCache.CacheStrategy cacheStrategy)

With the PREFETCH_BLOCK strategy this can reduce random seeks during the rewrite; the default is NONE.

Parameters:
- cacheStrategy - the index cache strategy; supported values are IndexCache.CacheStrategy#NONE and IndexCache.CacheStrategy#PREFETCH_BLOCK

public RewriteOptions.Builder overwriteInputWithJoinColumns(boolean overwriteInputWithJoinColumns)

By default, columns from join files do not overwrite columns from the main input files.

Parameters:
- overwriteInputWithJoinColumns - whether columns from join files should overwrite columns from the main input files

public RewriteOptions.Builder ignoreJoinFilesMetadata(boolean ignoreJoinFilesMetadata)

By default, metadata from join files is not ignored.

Parameters:
- ignoreJoinFilesMetadata - whether metadata from join files should be ignored

public RewriteOptions build()

Build the RewriteOptions.
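To close, an assumed end-to-end sketch of the join workflow (all file names hypothetical): the main input files supply most columns, a join file supplies replacement columns, and overwriteInputWithJoinColumns decides which side wins on a name clash. Recall the schema constraint from the constructor notes: all files within inputFiles must share one schema, all files within inputFilesToJoin another, but the two schemas may differ.

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.IndexCache;
import org.apache.parquet.hadoop.rewrite.ParquetRewriter;
import org.apache.parquet.hadoop.rewrite.RewriteOptions;

public class JoinRewriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    RewriteOptions options =
        new RewriteOptions.Builder(conf,
                Arrays.asList(new Path("part-0.parquet"), new Path("part-1.parquet")),
                Arrays.asList(new Path("recomputed-columns.parquet")),
                new Path("joined.parquet"))
            .overwriteInputWithJoinColumns(true)                         // join columns win on name clash
            .indexCacheStrategy(IndexCache.CacheStrategy.PREFETCH_BLOCK) // fewer random seeks
            .build();

    // Run the rewrite and release resources when done.
    ParquetRewriter rewriter = new ParquetRewriter(options);
    rewriter.processBlocks();
    rewriter.close();
  }
}
```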
Copyright © 2024 The Apache Software Foundation. All rights reserved.