public class ParquetOutputFormat<T>
extends org.apache.hadoop.mapreduce.lib.output.FileOutputFormat<Void,T>

Type Parameters: T - the type of the materialized records
It requires a WriteSupport to convert the actual records to the underlying format.
It requires the schema of the incoming records (provided by the write support).
It allows storing extra metadata in the footer (for example, for schema compatibility purposes when converting from a different schema language).
The format configuration settings in the job configuration:
```
# The block size is the size of a row group being buffered in memory.
# This limits the memory usage when writing. Larger values will improve
# the IO when reading but consume more memory when writing.
parquet.block.size=134217728 # in bytes, default = 128 * 1024 * 1024

# The page size is for compression. When reading, each page can be
# decompressed independently. A block is composed of pages. The page is
# the smallest unit that must be read fully to access a single record.
# If this value is too small, the compression will deteriorate.
parquet.page.size=1048576 # in bytes, default = 1 * 1024 * 1024

# There is one dictionary page per column per row group when dictionary
# encoding is used. The dictionary page size works like the page size,
# but for dictionaries.
parquet.dictionary.page.size=1048576 # in bytes, default = 1 * 1024 * 1024

# The compression algorithm used to compress pages.
parquet.compression=UNCOMPRESSED # one of: UNCOMPRESSED, SNAPPY, GZIP, LZO, ZSTD. Default: UNCOMPRESSED. Supersedes mapred.output.compress*

# The write support class to convert the records written to the
# OutputFormat into the events accepted by the record consumer.
# Usually provided by a specific ParquetOutputFormat subclass.
parquet.write.support.class= # fully qualified name

# To enable/disable dictionary encoding.
parquet.enable.dictionary=true # false to disable dictionary encoding

# To enable/disable summary metadata aggregation at the end of an MR job.
# The default is true (enabled).
parquet.enable.summary-metadata=true # false to disable summary aggregation

# Maximum size (in bytes) allowed as padding to align row groups.
# This is also the minimum size of a row group. Default: 8388608.
parquet.writer.max-padding=8388608 # 8 MB
```
If parquet.compression is not set, the following properties are checked (FileOutputFormat behavior). Note that custom codecs are explicitly disallowed:
```
mapred.output.compress=true
mapred.output.compression.codec=org.apache.hadoop.io.compress.SomeCodec # the codec must be one of Snappy, GZip or LZO
```
If none of those is set, the data is uncompressed.
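These properties can also be set programmatically through the static setters on this class. A minimal sketch of a job setup, assuming the GroupWriteSupport and MessageTypeParser helpers that ship with parquet-hadoop's example module; the schema, job name, and output path are arbitrary placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.hadoop.ParquetOutputFormat;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ParquetJobSetup {
  public static Job configure(Configuration conf) throws Exception {
    Job job = Job.getInstance(conf, "parquet-write-example");
    job.setOutputFormatClass(ParquetOutputFormat.class);

    // The write support converts records into events for the record consumer;
    // GroupWriteSupport (from the example module) needs a schema to be set.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message example { required int32 id; required binary name (UTF8); }");
    GroupWriteSupport.setSchema(schema, job.getConfiguration());
    ParquetOutputFormat.setWriteSupportClass(job, GroupWriteSupport.class);

    // Programmatic equivalents of the parquet.* properties listed above.
    ParquetOutputFormat.setBlockSize(job, 128 * 1024 * 1024);    // parquet.block.size
    ParquetOutputFormat.setPageSize(job, 1024 * 1024);           // parquet.page.size
    ParquetOutputFormat.setDictionaryPageSize(job, 1024 * 1024); // parquet.dictionary.page.size
    ParquetOutputFormat.setEnableDictionary(job, true);          // parquet.enable.dictionary
    ParquetOutputFormat.setCompression(job, CompressionCodecName.SNAPPY); // parquet.compression

    FileOutputFormat.setOutputPath(job, new Path("/tmp/parquet-out"));
    return job;
  }
}
```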
| Modifier and Type | Class and Description |
|---|---|
| static class | ParquetOutputFormat.JobSummaryLevel |
| Constructor and Description |
|---|
| ParquetOutputFormat(): used when directly using the output format and configuring the write support implementation via parquet.write.support.class |
| ParquetOutputFormat(S writeSupport): used when this OutputFormat is wrapped in another one (in Pig, for example) |
| Modifier and Type | Method and Description |
|---|---|
| static FileEncryptionProperties | createEncryptionProperties(org.apache.hadoop.conf.Configuration fileHadoopConfig, org.apache.hadoop.fs.Path tempFilePath, WriteSupport.WriteContext fileWriteContext) |
| static boolean | getAdaptiveBloomFilterEnabled(org.apache.hadoop.conf.Configuration conf) |
| static int | getBlockSize(org.apache.hadoop.conf.Configuration configuration) Deprecated. |
| static int | getBlockSize(org.apache.hadoop.mapreduce.JobContext jobContext) |
| static boolean | getBloomFilterEnabled(org.apache.hadoop.conf.Configuration conf) |
| static int | getBloomFilterMaxBytes(org.apache.hadoop.conf.Configuration conf) |
| static CompressionCodecName | getCompression(org.apache.hadoop.conf.Configuration configuration) |
| static CompressionCodecName | getCompression(org.apache.hadoop.mapreduce.JobContext jobContext) |
| static int | getDictionaryPageSize(org.apache.hadoop.conf.Configuration configuration) |
| static int | getDictionaryPageSize(org.apache.hadoop.mapreduce.JobContext jobContext) |
| static boolean | getEnableDictionary(org.apache.hadoop.conf.Configuration configuration) |
| static boolean | getEnableDictionary(org.apache.hadoop.mapreduce.JobContext jobContext) |
| static boolean | getEstimatePageSizeCheck(org.apache.hadoop.conf.Configuration configuration) |
| static ParquetOutputFormat.JobSummaryLevel | getJobSummaryLevel(org.apache.hadoop.conf.Configuration conf) |
| static long | getLongBlockSize(org.apache.hadoop.conf.Configuration configuration) |
| static int | getMaxRowCountForPageSizeCheck(org.apache.hadoop.conf.Configuration configuration) |
| static MemoryManager | getMemoryManager() |
| static int | getMinRowCountForPageSizeCheck(org.apache.hadoop.conf.Configuration configuration) |
| org.apache.hadoop.mapreduce.OutputCommitter | getOutputCommitter(org.apache.hadoop.mapreduce.TaskAttemptContext context) |
| static int | getPageSize(org.apache.hadoop.conf.Configuration configuration) |
| static int | getPageSize(org.apache.hadoop.mapreduce.JobContext jobContext) |
| static boolean | getPageWriteChecksumEnabled(org.apache.hadoop.conf.Configuration conf) |
| org.apache.hadoop.mapreduce.RecordWriter<Void,T> | getRecordWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path file, CompressionCodecName codec) |
| org.apache.hadoop.mapreduce.RecordWriter<Void,T> | getRecordWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path file, CompressionCodecName codec, ParquetFileWriter.Mode mode) |
| org.apache.hadoop.mapreduce.RecordWriter<Void,T> | getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext) |
| org.apache.hadoop.mapreduce.RecordWriter<Void,T> | getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext, ParquetFileWriter.Mode mode) |
| org.apache.hadoop.mapreduce.RecordWriter<Void,T> | getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext, org.apache.hadoop.fs.Path file) |
| org.apache.hadoop.mapreduce.RecordWriter<Void,T> | getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext, org.apache.hadoop.fs.Path file, ParquetFileWriter.Mode mode) |
| static boolean | getValidation(org.apache.hadoop.conf.Configuration configuration) |
| static boolean | getValidation(org.apache.hadoop.mapreduce.JobContext jobContext) |
| static int | getValueCountThreshold(org.apache.hadoop.conf.Configuration configuration) |
| static ParquetProperties.WriterVersion | getWriterVersion(org.apache.hadoop.conf.Configuration configuration) |
| WriteSupport<T> | getWriteSupport(org.apache.hadoop.conf.Configuration configuration) |
| static Class<?> | getWriteSupportClass(org.apache.hadoop.conf.Configuration configuration) |
| static boolean | isCompressionSet(org.apache.hadoop.conf.Configuration configuration) |
| static boolean | isCompressionSet(org.apache.hadoop.mapreduce.JobContext jobContext) |
| static void | setBlockSize(org.apache.hadoop.mapreduce.Job job, int blockSize) |
| static void | setColumnIndexTruncateLength(org.apache.hadoop.conf.Configuration conf, int length) |
| static void | setColumnIndexTruncateLength(org.apache.hadoop.mapreduce.JobContext jobContext, int length) |
| static void | setCompression(org.apache.hadoop.mapreduce.Job job, CompressionCodecName compression) |
| static void | setDictionaryPageSize(org.apache.hadoop.mapreduce.Job job, int pageSize) |
| static void | setEnableDictionary(org.apache.hadoop.mapreduce.Job job, boolean enableDictionary) |
| static void | setMaxPaddingSize(org.apache.hadoop.conf.Configuration conf, int maxPaddingSize) |
| static void | setMaxPaddingSize(org.apache.hadoop.mapreduce.JobContext jobContext, int maxPaddingSize) |
| static void | setPageRowCountLimit(org.apache.hadoop.conf.Configuration conf, int rowCount) |
| static void | setPageRowCountLimit(org.apache.hadoop.mapreduce.JobContext jobContext, int rowCount) |
| static void | setPageSize(org.apache.hadoop.mapreduce.Job job, int pageSize) |
| static void | setPageWriteChecksumEnabled(org.apache.hadoop.conf.Configuration conf, boolean val) |
| static void | setPageWriteChecksumEnabled(org.apache.hadoop.mapreduce.JobContext jobContext, boolean val) |
| static void | setStatisticsTruncateLength(org.apache.hadoop.mapreduce.JobContext jobContext, int length) |
| static void | setValidation(org.apache.hadoop.conf.Configuration configuration, boolean validating) |
| static void | setValidation(org.apache.hadoop.mapreduce.JobContext jobContext, boolean validating) |
| static void | setWriteSupportClass(org.apache.hadoop.mapreduce.Job job, Class<?> writeSupportClass) |
| static void | setWriteSupportClass(org.apache.hadoop.mapred.JobConf job, Class<?> writeSupportClass) |
Methods inherited from class org.apache.hadoop.mapreduce.lib.output.FileOutputFormat:
checkOutputSpecs, getCompressOutput, getDefaultWorkFile, getOutputCompressorClass, getOutputName, getOutputPath, getPathForWorkFile, getUniqueFile, getWorkOutputPath, setCompressOutput, setOutputCompressorClass, setOutputName, setOutputPath

@Deprecated
public static final String ENABLE_JOB_SUMMARY
public static final String JOB_SUMMARY_LEVEL
ParquetOutputFormat.JobSummaryLevel (case insensitive)

public static final String BLOCK_SIZE
public static final String PAGE_SIZE
public static final String COMPRESSION
public static final String WRITE_SUPPORT_CLASS
public static final String DICTIONARY_PAGE_SIZE
public static final String ENABLE_DICTIONARY
public static final String VALIDATION
public static final String WRITER_VERSION
public static final String MEMORY_POOL_RATIO
public static final String MIN_MEMORY_ALLOCATION
public static final String MAX_PADDING_BYTES
public static final String MIN_ROW_COUNT_FOR_PAGE_SIZE_CHECK
public static final String MAX_ROW_COUNT_FOR_PAGE_SIZE_CHECK
public static final String PAGE_VALUE_COUNT_THRESHOLD
public static final String ESTIMATE_PAGE_SIZE_CHECK
public static final String COLUMN_INDEX_TRUNCATE_LENGTH
public static final String STATISTICS_TRUNCATE_LENGTH
public static final String BLOOM_FILTER_ENABLED
public static final String BLOOM_FILTER_EXPECTED_NDV
public static final String BLOOM_FILTER_MAX_BYTES
public static final String BLOOM_FILTER_FPP
public static final String ADAPTIVE_BLOOM_FILTER_ENABLED
public static final String BLOOM_FILTER_CANDIDATES_NUMBER
public static final String PAGE_ROW_COUNT_LIMIT
public static final String PAGE_WRITE_CHECKSUM_ENABLED
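These constants hold the configuration keys for the parquet.* properties documented above, so they can be used to set values on a raw Configuration. A minimal sketch; the chosen values are arbitrary:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.ParquetOutputFormat;

public class ParquetConfKeys {
  public static Configuration build() {
    Configuration conf = new Configuration();
    conf.setLong(ParquetOutputFormat.BLOCK_SIZE, 128L * 1024 * 1024); // parquet.block.size
    conf.setInt(ParquetOutputFormat.PAGE_SIZE, 1024 * 1024);          // parquet.page.size
    conf.set(ParquetOutputFormat.COMPRESSION, "SNAPPY");              // parquet.compression
    conf.setBoolean(ParquetOutputFormat.ENABLE_DICTIONARY, true);     // parquet.enable.dictionary
    return conf;
  }
}
```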
public ParquetOutputFormat(S writeSupport)
Type Parameters: S - the Java write support type
Parameters: writeSupport - the class used to convert the incoming records

public ParquetOutputFormat()
Type Parameters: S - the Java write support type

public static ParquetOutputFormat.JobSummaryLevel getJobSummaryLevel(org.apache.hadoop.conf.Configuration conf)
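The constructor taking a write support is the one used when this OutputFormat is wrapped in another one (as in Pig). A minimal sketch of that pattern, where MyRecord and MyWriteSupport (a WriteSupport<MyRecord> implementation) are hypothetical placeholders:

```java
import org.apache.parquet.hadoop.ParquetOutputFormat;

// MyRecord and MyWriteSupport are hypothetical placeholders for the
// wrapping engine's own record type and WriteSupport implementation.
public class MyParquetOutputFormat extends ParquetOutputFormat<MyRecord> {
  public MyParquetOutputFormat() {
    // Supply the write support directly instead of configuring
    // parquet.write.support.class in the job configuration.
    super(new MyWriteSupport());
  }
}
```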
public static void setWriteSupportClass(org.apache.hadoop.mapreduce.Job job,
Class<?> writeSupportClass)
public static void setWriteSupportClass(org.apache.hadoop.mapred.JobConf job,
Class<?> writeSupportClass)
public static Class<?> getWriteSupportClass(org.apache.hadoop.conf.Configuration configuration)
public static void setBlockSize(org.apache.hadoop.mapreduce.Job job,
int blockSize)
public static void setPageSize(org.apache.hadoop.mapreduce.Job job,
int pageSize)
public static void setDictionaryPageSize(org.apache.hadoop.mapreduce.Job job,
int pageSize)
public static void setCompression(org.apache.hadoop.mapreduce.Job job,
CompressionCodecName compression)
public static void setEnableDictionary(org.apache.hadoop.mapreduce.Job job,
boolean enableDictionary)
public static boolean getEnableDictionary(org.apache.hadoop.mapreduce.JobContext jobContext)
public static int getBloomFilterMaxBytes(org.apache.hadoop.conf.Configuration conf)
public static boolean getBloomFilterEnabled(org.apache.hadoop.conf.Configuration conf)
public static boolean getAdaptiveBloomFilterEnabled(org.apache.hadoop.conf.Configuration conf)
public static int getBlockSize(org.apache.hadoop.mapreduce.JobContext jobContext)
public static int getPageSize(org.apache.hadoop.mapreduce.JobContext jobContext)
public static int getDictionaryPageSize(org.apache.hadoop.mapreduce.JobContext jobContext)
public static CompressionCodecName getCompression(org.apache.hadoop.mapreduce.JobContext jobContext)
public static boolean isCompressionSet(org.apache.hadoop.mapreduce.JobContext jobContext)
public static void setValidation(org.apache.hadoop.mapreduce.JobContext jobContext,
boolean validating)
public static boolean getValidation(org.apache.hadoop.mapreduce.JobContext jobContext)
public static boolean getEnableDictionary(org.apache.hadoop.conf.Configuration configuration)
public static int getMinRowCountForPageSizeCheck(org.apache.hadoop.conf.Configuration configuration)
public static int getMaxRowCountForPageSizeCheck(org.apache.hadoop.conf.Configuration configuration)
public static int getValueCountThreshold(org.apache.hadoop.conf.Configuration configuration)
public static boolean getEstimatePageSizeCheck(org.apache.hadoop.conf.Configuration configuration)
@Deprecated public static int getBlockSize(org.apache.hadoop.conf.Configuration configuration)
public static long getLongBlockSize(org.apache.hadoop.conf.Configuration configuration)
public static int getPageSize(org.apache.hadoop.conf.Configuration configuration)
public static int getDictionaryPageSize(org.apache.hadoop.conf.Configuration configuration)
public static ParquetProperties.WriterVersion getWriterVersion(org.apache.hadoop.conf.Configuration configuration)
public static CompressionCodecName getCompression(org.apache.hadoop.conf.Configuration configuration)
public static boolean isCompressionSet(org.apache.hadoop.conf.Configuration configuration)
public static void setValidation(org.apache.hadoop.conf.Configuration configuration,
boolean validating)
public static boolean getValidation(org.apache.hadoop.conf.Configuration configuration)
public static void setMaxPaddingSize(org.apache.hadoop.mapreduce.JobContext jobContext,
int maxPaddingSize)
public static void setMaxPaddingSize(org.apache.hadoop.conf.Configuration conf,
int maxPaddingSize)
public static void setColumnIndexTruncateLength(org.apache.hadoop.mapreduce.JobContext jobContext,
int length)
public static void setColumnIndexTruncateLength(org.apache.hadoop.conf.Configuration conf,
int length)
public static void setStatisticsTruncateLength(org.apache.hadoop.mapreduce.JobContext jobContext,
int length)
public static void setPageRowCountLimit(org.apache.hadoop.mapreduce.JobContext jobContext,
int rowCount)
public static void setPageRowCountLimit(org.apache.hadoop.conf.Configuration conf,
int rowCount)
public static void setPageWriteChecksumEnabled(org.apache.hadoop.mapreduce.JobContext jobContext,
boolean val)
public static void setPageWriteChecksumEnabled(org.apache.hadoop.conf.Configuration conf,
boolean val)
public static boolean getPageWriteChecksumEnabled(org.apache.hadoop.conf.Configuration conf)
public org.apache.hadoop.mapreduce.RecordWriter<Void,T> getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException
Specified by: getRecordWriter in class org.apache.hadoop.mapreduce.lib.output.FileOutputFormat<Void,T>
Throws: IOException, InterruptedException

public org.apache.hadoop.mapreduce.RecordWriter<Void,T> getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext, ParquetFileWriter.Mode mode) throws IOException, InterruptedException
Throws: IOException, InterruptedException

public org.apache.hadoop.mapreduce.RecordWriter<Void,T> getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext, org.apache.hadoop.fs.Path file) throws IOException, InterruptedException
Throws: IOException, InterruptedException

public org.apache.hadoop.mapreduce.RecordWriter<Void,T> getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext, org.apache.hadoop.fs.Path file, ParquetFileWriter.Mode mode) throws IOException, InterruptedException
Throws: IOException, InterruptedException

public org.apache.hadoop.mapreduce.RecordWriter<Void,T> getRecordWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path file, CompressionCodecName codec) throws IOException, InterruptedException
Throws: IOException, InterruptedException

public org.apache.hadoop.mapreduce.RecordWriter<Void,T> getRecordWriter(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path file, CompressionCodecName codec, ParquetFileWriter.Mode mode) throws IOException, InterruptedException
Throws: IOException, InterruptedException

public WriteSupport<T> getWriteSupport(org.apache.hadoop.conf.Configuration configuration)
Parameters: configuration - to find the configuration for the write support class

public org.apache.hadoop.mapreduce.OutputCommitter getOutputCommitter(org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException
Overrides: getOutputCommitter in class org.apache.hadoop.mapreduce.lib.output.FileOutputFormat<Void,T>
Throws: IOException

public static MemoryManager getMemoryManager()
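Outside a full MapReduce task, the getRecordWriter(Configuration, Path, CompressionCodecName) overload can produce a writer directly. A minimal sketch, reusing the hypothetical MyParquetOutputFormat and MyRecord from the constructor example above; since the key type is Void, records are written with a null key:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.parquet.hadoop.ParquetOutputFormat;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class DirectWriteSketch {
  // MyRecord and MyParquetOutputFormat are hypothetical placeholders;
  // the output format must supply a WriteSupport<MyRecord>.
  static void writeAll(Configuration conf, Iterable<MyRecord> records)
      throws IOException, InterruptedException {
    ParquetOutputFormat<MyRecord> format = new MyParquetOutputFormat();
    RecordWriter<Void, MyRecord> writer = format.getRecordWriter(
        conf, new Path("/tmp/example.parquet"), CompressionCodecName.SNAPPY);
    try {
      for (MyRecord r : records) {
        writer.write(null, r); // the key type is Void, so the key is always null
      }
    } finally {
      // Closing flushes buffered row groups and writes the file footer.
      // Passing null assumes this direct-write path does not consult the
      // task attempt context on close (an assumption in this sketch).
      writer.close(null);
    }
  }
}
```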
public static FileEncryptionProperties createEncryptionProperties(org.apache.hadoop.conf.Configuration fileHadoopConfig, org.apache.hadoop.fs.Path tempFilePath, WriteSupport.WriteContext fileWriteContext)