Class AvroSource<T>
java.lang.Object
  org.apache.beam.sdk.io.Source<T>
    org.apache.beam.sdk.io.BoundedSource<T>
      org.apache.beam.sdk.io.OffsetBasedSource<T>
        org.apache.beam.sdk.io.FileBasedSource<T>
          org.apache.beam.sdk.io.BlockBasedSource<T>
            org.apache.beam.sdk.extensions.avro.io.AvroSource<T>
- Type Parameters:
T - The type of records to be read from the source.
- All Implemented Interfaces:
java.io.Serializable, org.apache.beam.sdk.transforms.display.HasDisplayData
public class AvroSource<T> extends org.apache.beam.sdk.io.BlockBasedSource<T>
Do not use in pipelines directly: most users should use AvroIO.Read.
A FileBasedSource for reading Avro files.
To read a PCollection of objects from one or more Avro files, use from(org.apache.beam.sdk.options.ValueProvider<java.lang.String>) to specify the path(s) of the files to read. The AvroSource that is returned will read objects of type GenericRecord with the schema(s) that were written at file creation. To further configure the AvroSource to read with a user-defined schema, or to return records of a type other than GenericRecord, use withSchema(Schema) (using an Avro Schema), withSchema(String) (using a JSON schema), or withSchema(Class) (to return objects of the Avro-generated class specified).
An AvroSource can be read from using the Read transform. For example (assuming an existing Pipeline named pipeline and a java.io.File named file):
AvroSource<MyType> source = AvroSource.from(file.getPath()).withSchema(MyType.class);
PCollection<MyType> records = pipeline.apply(Read.from(source));
This class's implementation is based on the Avro 1.7.7 specification and implements parsing of some parts of Avro Object Container Files. The rationale for doing so is that the Avro API does not provide efficient ways of computing the precise offsets of blocks within a file, which is necessary to support dynamic work rebalancing. However, whenever it is possible to use the Avro API in a way that supports maintaining precise offsets, this class uses the Avro API.
Avro Object Container Files store records in blocks. Each block contains a collection of records. Blocks may be encoded (e.g., with bzip2, deflate, or snappy). Blocks are delineated from one another by a 16-byte sync marker.
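The role of the sync marker can be illustrated with a small standalone sketch. It is not how AvroSource parses files: the class name SyncMarkerScan, the helper sampleFile, and the byte layout it builds are all hypothetical, and a real container file also has a header and per-block record counts and sizes that this sketch ignores. It only shows that scanning for a known 16-byte marker is enough to find block boundaries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: finds occurrences of a 16-byte sync marker in a byte array. */
final class SyncMarkerScan {
  static final int SYNC_SIZE = 16;

  /** Returns the offset of every non-overlapping occurrence of the marker. */
  static List<Long> findSyncMarkers(byte[] data, byte[] syncMarker) {
    if (syncMarker.length != SYNC_SIZE) {
      throw new IllegalArgumentException("sync marker must be 16 bytes");
    }
    List<Long> positions = new ArrayList<>();
    int i = 0;
    while (i + SYNC_SIZE <= data.length) {
      byte[] window = Arrays.copyOfRange(data, i, i + SYNC_SIZE);
      if (Arrays.equals(window, syncMarker)) {
        positions.add((long) i);
        i += SYNC_SIZE; // continue scanning after the marker
      } else {
        i++;
      }
    }
    return positions;
  }

  /** Hypothetical layout: [marker][3 zero bytes][marker][2 zero bytes][marker]. */
  static byte[] sampleFile(byte[] marker) {
    byte[] data = new byte[SYNC_SIZE + 3 + SYNC_SIZE + 2 + SYNC_SIZE];
    System.arraycopy(marker, 0, data, 0, SYNC_SIZE);
    System.arraycopy(marker, 0, data, 19, SYNC_SIZE);
    System.arraycopy(marker, 0, data, 37, SYNC_SIZE);
    return data;
  }

  public static void main(String[] args) {
    byte[] marker = new byte[SYNC_SIZE];
    Arrays.fill(marker, (byte) 0x7F);
    System.out.println(findSyncMarkers(sampleFile(marker), marker)); // prints [0, 19, 37]
  }
}
```

In this toy layout the bytes between two consecutive markers stand in for one block body.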
An AvroSource for a subrange of a single file contains records in the blocks such that the start offset of the block is greater than or equal to the start offset of the source and less than the end offset of the source.
To use XZ-encoded Avro files, please include an explicit dependency on xz-1.8.jar, which has been marked as optional in the Maven sdk/pom.xml.
<dependency>
  <groupId>org.tukaani</groupId>
  <artifactId>xz</artifactId>
  <version>1.8</version>
</dependency>
Permissions
Permission requirements depend on the PipelineRunner that is used to execute the pipeline. Please refer to the documentation of the corresponding PipelineRunners for more details.
- See Also:
- Serialized Form
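The subrange rule from the class description (a block belongs to the source whose range contains the block's start offset) can be sketched in plain Java. This is a minimal illustration, not Beam code: the class SubrangeBlocks, the method blocksOwnedBy, and the block offsets are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: assigns blocks to a source range by block start offset. */
final class SubrangeBlocks {

  /** Returns the start offsets of blocks owned by a source covering [start, end). */
  static List<Long> blocksOwnedBy(long[] blockStarts, long start, long end) {
    List<Long> owned = new ArrayList<>();
    for (long blockStart : blockStarts) {
      // Owned when start <= blockStart < end, regardless of where the block ends.
      if (blockStart >= start && blockStart < end) {
        owned.add(blockStart);
      }
    }
    return owned;
  }

  public static void main(String[] args) {
    long[] blockStarts = {100, 400, 900, 1500}; // hypothetical block offsets
    System.out.println(blocksOwnedBy(blockStarts, 0, 500));     // prints [100, 400]
    System.out.println(blocksOwnedBy(blockStarts, 500, 1000));  // prints [900]
    System.out.println(blocksOwnedBy(blockStarts, 1000, 2000)); // prints [1500]
  }
}
```

Because ownership depends only on the start offset, a block that begins at 400 and extends past 500 is still read entirely by the first source, and adjacent ranges never read the same block twice.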
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class | AvroSource.AvroReader<T> | A BlockBasedSource.BlockBasedReader for reading blocks from Avro files.
static interface | AvroSource.DatumReaderFactory<T> |
-
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.BlockBasedSource
org.apache.beam.sdk.io.BlockBasedSource.Block<T extends java.lang.Object>, org.apache.beam.sdk.io.BlockBasedSource.BlockBasedReader<T extends java.lang.Object>
-
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.FileBasedSource
org.apache.beam.sdk.io.FileBasedSource.FileBasedReader<T extends java.lang.Object>
-
Nested classes/interfaces inherited from class org.apache.beam.sdk.io.OffsetBasedSource
org.apache.beam.sdk.io.OffsetBasedSource.OffsetBasedReader<T extends java.lang.Object>
Method Summary
Modifier and Type | Method | Description
org.apache.beam.sdk.io.BlockBasedSource<T> | createForSubrangeOfFile(java.lang.String fileName, long start, long end) | Deprecated. Used by Dataflow worker
org.apache.beam.sdk.io.BlockBasedSource<T> | createForSubrangeOfFile(org.apache.beam.sdk.io.fs.MatchResult.Metadata fileMetadata, long start, long end) |
protected org.apache.beam.sdk.io.BlockBasedSource.BlockBasedReader<T> | createSingleFileReader(org.apache.beam.sdk.options.PipelineOptions options) |
static AvroSource<org.apache.avro.generic.GenericRecord> | from(java.lang.String fileNameOrPattern) | Like from(ValueProvider).
static AvroSource<org.apache.avro.generic.GenericRecord> | from(org.apache.beam.sdk.io.fs.MatchResult.Metadata metadata) |
static AvroSource<org.apache.avro.generic.GenericRecord> | from(org.apache.beam.sdk.options.ValueProvider<java.lang.String> fileNameOrPattern) | Reads from the given file name or pattern ("glob").
org.apache.beam.sdk.coders.Coder<T> | getOutputCoder() |
void | validate() |
AvroSource<T> | withCoder(org.apache.beam.sdk.coders.Coder<T> coder) | Specifies the coder for the result of the AvroSource.
AvroSource<T> | withDatumReaderFactory(AvroSource.DatumReaderFactory<?> factory) | Sets a custom AvroSource.DatumReaderFactory for reading.
AvroSource<T> | withEmptyMatchTreatment(org.apache.beam.sdk.io.fs.EmptyMatchTreatment emptyMatchTreatment) |
AvroSource<T> | withMinBundleSize(long minBundleSize) | Sets the minimum bundle size.
<X> AvroSource<X> | withParseFn(org.apache.beam.sdk.transforms.SerializableFunction<org.apache.avro.generic.GenericRecord,X> parseFn, org.apache.beam.sdk.coders.Coder<X> coder) | Reads GenericRecord of unspecified schema and maps them to instances of a custom type using the given parseFn and encoded using the given coder.
<X> AvroSource<X> | withSchema(java.lang.Class<X> clazz) | Reads files containing records of the given class.
AvroSource<org.apache.avro.generic.GenericRecord> | withSchema(java.lang.String schema) | Reads files containing records that conform to the given schema.
AvroSource<org.apache.avro.generic.GenericRecord> | withSchema(org.apache.avro.Schema schema) | Like withSchema(String).
-
Methods inherited from class org.apache.beam.sdk.io.FileBasedSource
createReader, createSourceForSubrange, getEmptyMatchTreatment, getEstimatedSizeBytes, getFileOrPatternSpec, getFileOrPatternSpecProvider, getMaxEndOffset, getMode, getSingleFileMetadata, isSplittable, populateDisplayData, split, toString
Method Detail
-
from
public static AvroSource<org.apache.avro.generic.GenericRecord> from(org.apache.beam.sdk.options.ValueProvider<java.lang.String> fileNameOrPattern)
Reads from the given file name or pattern ("glob"). To return a type other than GenericRecord, the returned source must be further configured by calling withSchema(java.lang.Class) or withParseFn.
-
from
public static AvroSource<org.apache.avro.generic.GenericRecord> from(org.apache.beam.sdk.io.fs.MatchResult.Metadata metadata)
-
from
public static AvroSource<org.apache.avro.generic.GenericRecord> from(java.lang.String fileNameOrPattern)
Like from(ValueProvider).
-
withEmptyMatchTreatment
public AvroSource<T> withEmptyMatchTreatment(org.apache.beam.sdk.io.fs.EmptyMatchTreatment emptyMatchTreatment)
-
withSchema
public AvroSource<org.apache.avro.generic.GenericRecord> withSchema(java.lang.String schema)
Reads files containing records that conform to the given schema.
-
withSchema
public AvroSource<org.apache.avro.generic.GenericRecord> withSchema(org.apache.avro.Schema schema)
Like withSchema(String).
-
withSchema
public <X> AvroSource<X> withSchema(java.lang.Class<X> clazz)
Reads files containing records of the given class.
-
withParseFn
public <X> AvroSource<X> withParseFn(org.apache.beam.sdk.transforms.SerializableFunction<org.apache.avro.generic.GenericRecord,X> parseFn, org.apache.beam.sdk.coders.Coder<X> coder)
Reads GenericRecords of unspecified schema and maps them to instances of a custom type using the given parseFn. The results are encoded using the given coder.
-
withMinBundleSize
public AvroSource<T> withMinBundleSize(long minBundleSize)
Sets the minimum bundle size. Refer to OffsetBasedSource for a description of minBundleSize and its use.
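As a rough intuition for what a minimum bundle size does during splitting, the following standalone sketch splits an offset range so that no bundle is smaller than the minimum unless the whole range is. It is a simplified assumption about the general idea, not Beam's actual splitting algorithm; the class MinBundleSplit and its split method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: splits [start, end) while respecting a minimum bundle size. */
final class MinBundleSplit {

  /** Returns {from, to} pairs covering [start, end) without gaps or overlaps. */
  static List<long[]> split(long start, long end, long desiredBundleSize, long minBundleSize) {
    // The effective step is never smaller than the minimum bundle size.
    long step = Math.max(Math.max(1L, desiredBundleSize), minBundleSize);
    List<long[]> ranges = new ArrayList<>();
    long from = start;
    while (from < end) {
      long to = (end - from <= step) ? end : from + step;
      // Fold a too-small tail into the current bundle instead of emitting it alone.
      if (end - to < minBundleSize) {
        to = end;
      }
      ranges.add(new long[] {from, to});
      from = to;
    }
    return ranges;
  }

  public static void main(String[] args) {
    // Desired size 100, but the minimum of 300 wins.
    for (long[] r : split(0, 1000, 100, 300)) {
      System.out.println(Arrays.toString(r));
    }
    // prints [0, 300], [300, 600], [600, 1000] on separate lines
  }
}
```

A larger minimum produces fewer, larger bundles, which limits how finely a runner can divide the file.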
-
withDatumReaderFactory
public AvroSource<T> withDatumReaderFactory(AvroSource.DatumReaderFactory<?> factory)
Sets a custom AvroSource.DatumReaderFactory for reading. Pass an AvroDatumFactory to also use the factory for the AvroCoder.
-
withCoder
public AvroSource<T> withCoder(org.apache.beam.sdk.coders.Coder<T> coder)
Specifies the coder for the result of the AvroSource.
-
validate
public void validate()
- Overrides:
validate in class org.apache.beam.sdk.io.FileBasedSource<T>
-
createForSubrangeOfFile
@Deprecated public org.apache.beam.sdk.io.BlockBasedSource<T> createForSubrangeOfFile(java.lang.String fileName, long start, long end) throws java.io.IOException
Deprecated. Used by Dataflow worker
Used by the Dataflow worker. Do not introduce new usages. Do not delete without confirming that Dataflow ValidatesRunner tests pass.
- Throws:
java.io.IOException
-
createForSubrangeOfFile
public org.apache.beam.sdk.io.BlockBasedSource<T> createForSubrangeOfFile(org.apache.beam.sdk.io.fs.MatchResult.Metadata fileMetadata, long start, long end)
- Specified by:
createForSubrangeOfFile in class org.apache.beam.sdk.io.BlockBasedSource<T>
-
createSingleFileReader
protected org.apache.beam.sdk.io.BlockBasedSource.BlockBasedReader<T> createSingleFileReader(org.apache.beam.sdk.options.PipelineOptions options)
- Specified by:
createSingleFileReader in class org.apache.beam.sdk.io.BlockBasedSource<T>