package writer
Type Members
-
class
DeltaWriter[IN] extends SinkWriter[IN, DeltaCommittable, DeltaWriterBucketState] with ProcessingTimeCallback
A SinkWriter implementation for io.delta.flink.sink.DeltaSink.
It writes data to and manages the different active buckets in the io.delta.flink.sink.DeltaSink.
Most of the logic for this class was sourced from FileWriter, as the behaviour is very similar. The main differences are the use of custom implementations for some member classes and the management of the io.delta.standalone.DeltaLog transactional ids: DeltaWriter#appId and DeltaWriter#nextCheckpointId.
Lifecycle of instances of this class is as follows:
- Every instance is created via the io.delta.flink.sink.DeltaSink#createWriter method.
- A writer's life span is the same as the application's (unless the worker node becomes unresponsive and the job manager needs to create a new instance to satisfy the parallelism).
- The number of instances is managed globally by the job manager and is equal to the parallelism of the sink.
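The role of the two transactional ids can be illustrated with a small sketch. It assumes that the checkpoint id acts as a monotonically increasing version recorded per application id, so that a commit replayed after a failure can be detected and skipped. FakeLog and its methods are hypothetical in-memory stand-ins and are not part of the io.delta.standalone.DeltaLog API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: idempotent commits keyed by (appId, checkpointId). */
final class TxnIdSketch {

    /** In-memory stand-in for a transaction log; NOT the DeltaLog API. */
    static final class FakeLog {
        // Last committed checkpoint id per application id (assumed bookkeeping).
        private final Map<String, Long> lastVersionByApp = new HashMap<>();
        private int commitCount = 0;

        /** Returns true if the commit was applied, false if it was skipped as a replay. */
        boolean commit(String appId, long checkpointId) {
            long last = lastVersionByApp.getOrDefault(appId, -1L);
            if (checkpointId <= last) {
                return false; // already committed earlier; skip the duplicate
            }
            lastVersionByApp.put(appId, checkpointId);
            commitCount++;
            return true;
        }

        int commitCount() {
            return commitCount;
        }
    }

    public static void main(String[] args) {
        FakeLog log = new FakeLog();
        String appId = "app-1"; // hypothetical application id

        System.out.println(log.commit(appId, 1L)); // first commit for checkpoint 1
        System.out.println(log.commit(appId, 2L)); // first commit for checkpoint 2
        System.out.println(log.commit(appId, 2L)); // replay of checkpoint 2 after a restart
        System.out.println(log.commitCount());     // number of commits actually applied
    }
}
```

In this sketch the replayed commit for checkpoint 2 is rejected, so only two commits are applied. Whether the real connector performs the check in the writer or in a later commit stage is not stated here.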
-
class
DeltaWriterBucket[IN] extends AnyRef
Internal implementation for writing the actual events to the underlying files in the correct buckets / partitions.
In Flink's org.apache.flink.api.connector.sink.Sink topology, one of the main components is org.apache.flink.api.connector.sink.SinkWriter, which in the case of DeltaSink is implemented as DeltaWriter. However, to comply with Delta Lake's support for partitioned tables, a new component was added in the form of DeltaWriterBucket, which is responsible for handling writes to only one of the buckets (aka partitions). Such bucket writers are managed by DeltaWriter, which works as a proxy between the higher-order framework's commands (write, prepareCommit, etc.) and the actual write implementation in DeltaWriterBucket. Thanks to this solution, events within one DeltaWriter operator received during a particular checkpoint interval are always grouped and flushed to the currently open in-progress file.
The implementation was sourced from org.apache.flink.connector.file.sink.FileSink, which uses the same concept and implements org.apache.flink.connector.file.sink.writer.FileWriter with its FileWriterBucket implementation. All differences between DeltaSink's and FileSink's writer buckets are explained in the particular methods below.
Lifecycle of instances of this class is as follows:
- Every instance is created via the DeltaWriter#write method whenever the writer receives the first event that belongs to the bucket represented by the given DeltaWriterBucket instance, or, in the case of non-partitioned tables, whenever the writer receives the very first event, as in such cases there is only one DeltaWriterBucket representing the root path of the table.
- A DeltaWriter instance can create zero, one or multiple instances of DeltaWriterBucket during one checkpoint interval. It creates none if it hasn't received any events (and thus didn't have to create buckets for them). It creates one when it has received events belonging to only one bucket (the same applies if the table is not partitioned). Finally, it creates multiple when it has received events belonging to more than one bucket.
- The life span of one DeltaWriterBucket may extend through one or more checkpoint intervals. It remains "active" as long as it receives data. If, for example, during a given checkpoint interval an instance of DeltaWriter hasn't received any events belonging to a given bucket, then the DeltaWriterBucket representing this bucket is de-listed from the writer's internal bucket iterator. If in a future checkpoint interval the given DeltaWriter receives more events for that bucket, it will create a new instance of DeltaWriterBucket representing this bucket.
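The bucket lifecycle described above can be sketched as a writer that lazily creates one bucket per partition and de-lists buckets that received no events during a checkpoint interval. This is a simplified illustration, not the connector's implementation: the Writer and Bucket classes, their method names, and the use of plain strings for events and bucket ids are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of lazy bucket creation and de-listing of idle buckets. */
final class BucketRoutingSketch {

    /** Stand-in for a per-partition bucket writer. */
    static final class Bucket {
        final String bucketId;
        private final List<String> pendingEvents = new ArrayList<>();

        Bucket(String bucketId) {
            this.bucketId = bucketId;
        }

        void write(String event) {
            pendingEvents.add(event);
        }

        /** A bucket counts as active if it received events in the current interval. */
        boolean isActive() {
            return !pendingEvents.isEmpty();
        }

        /** Flushes the events of the current interval and returns how many were flushed. */
        int flush() {
            int flushed = pendingEvents.size();
            pendingEvents.clear();
            return flushed;
        }
    }

    /** Stand-in for the writer that proxies framework calls to its buckets. */
    static final class Writer {
        private final Map<String, Bucket> activeBuckets = new HashMap<>();

        /** Creates the bucket on the first event for its id, then delegates the write. */
        void write(String bucketId, String event) {
            activeBuckets.computeIfAbsent(bucketId, Bucket::new).write(event);
        }

        /** Called once per checkpoint: flush active buckets, de-list idle ones. */
        int prepareCommit() {
            int flushed = 0;
            Iterator<Map.Entry<String, Bucket>> it = activeBuckets.entrySet().iterator();
            while (it.hasNext()) {
                Bucket bucket = it.next().getValue();
                if (bucket.isActive()) {
                    flushed += bucket.flush();
                } else {
                    it.remove(); // no events this interval: drop the bucket
                }
            }
            return flushed;
        }

        int activeBucketCount() {
            return activeBuckets.size();
        }
    }

    public static void main(String[] args) {
        Writer writer = new Writer();

        // Interval 1: events for two partitions create two buckets.
        writer.write("date=2024-01-01", "e1");
        writer.write("date=2024-01-02", "e2");
        writer.write("date=2024-01-01", "e3");
        System.out.println(writer.prepareCommit() + " events flushed");

        // Interval 2: only one partition receives data, so the other is de-listed.
        writer.write("date=2024-01-01", "e4");
        System.out.println(writer.prepareCommit() + " events flushed");
        System.out.println(writer.activeBucketCount() + " active bucket(s)");
    }
}
```

For a non-partitioned table the same sketch applies with a single bucket id standing for the root path of the table.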
-
class
DeltaWriterBucketState extends AnyRef
State of a DeltaWriterBucket that becomes part of each application snapshot, created during the pre-commit phase of a checkpoint or manually on demand by the user. See also the Fault Tolerance via State Snapshots section of the Flink documentation.
This class is partially inspired by org.apache.flink.connector.file.sink.writer.FileWriterBucketState, with the following modifications:
- Snapshotting of the in-progress file's state was removed, because io.delta.flink.sink.DeltaSink is supposed to always roll part files on checkpoint, so there is no need to recover any in-progress files' states.
- The state was extended with the application's unique identifier to guarantee idempotent file writes and commits to the io.delta.standalone.DeltaLog.
Lifecycle of instances of this class is as follows:
- Every instance is created via the DeltaWriter#snapshotState() method at the end of each checkpoint interval and serialized as part of the application's snapshotted state.
- An instance can also be created by the Flink framework itself during failure/snapshot recovery, when it is deserialized from the snapshotted state and provided in the input parameter collection to io.delta.flink.sink.DeltaSink#createWriter.
-
class
DeltaWriterBucketStateSerializer extends SimpleVersionedSerializer[DeltaWriterBucketState]
Versioned serializer for DeltaWriterBucketState.
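As a rough illustration of what a versioned serializer for such a state involves, the sketch below round-trips a simplified state object through bytes and rejects unknown versions on read. The VersionedSerializer interface here is a local stand-in that only mirrors the getVersion / serialize / deserialize shape of Flink's SimpleVersionedSerializer (with unchecked exceptions for brevity), and the BucketState fields (bucketId, bucketPath, appId) are assumptions for the example rather than the class's documented layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch of a versioned serializer for a bucket state. */
final class VersionedSerializerSketch {

    /** Simplified bucket state; the field set is an assumption. */
    static final class BucketState {
        final String bucketId;
        final String bucketPath;
        final String appId;

        BucketState(String bucketId, String bucketPath, String appId) {
            this.bucketId = bucketId;
            this.bucketPath = bucketPath;
            this.appId = appId;
        }
    }

    /** Local stand-in mirroring the shape of a versioned serializer interface. */
    interface VersionedSerializer<T> {
        int getVersion();

        byte[] serialize(T value);

        T deserialize(int version, byte[] serialized);
    }

    static final class BucketStateSerializer implements VersionedSerializer<BucketState> {

        @Override
        public int getVersion() {
            return 1;
        }

        @Override
        public byte[] serialize(BucketState state) {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                out.writeUTF(state.bucketId);
                out.writeUTF(state.bucketPath);
                out.writeUTF(state.appId);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return bytes.toByteArray();
        }

        @Override
        public BucketState deserialize(int version, byte[] serialized) {
            // The version is supplied by the caller, so old layouts can be handled here.
            if (version != 1) {
                throw new IllegalArgumentException("Unrecognized version: " + version);
            }
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(serialized))) {
                return new BucketState(in.readUTF(), in.readUTF(), in.readUTF());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) {
        BucketStateSerializer serializer = new BucketStateSerializer();
        BucketState original =
                new BucketState("date=2024-01-01", "/tmp/table/date=2024-01-01", "app-1");

        byte[] bytes = serializer.serialize(original);
        BucketState restored = serializer.deserialize(serializer.getVersion(), bytes);

        System.out.println(restored.bucketId + " " + restored.bucketPath + " " + restored.appId);
    }
}
```

Keeping the version outside the payload lets a newer serializer branch on the version it is handed and still read state written by an older release.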