package writer

Type Members

  1. class DeltaWriter[IN] extends SinkWriter[IN, DeltaCommittable, DeltaWriterBucketState] with ProcessingTimeCallback

    A SinkWriter implementation for io.delta.flink.sink.DeltaSink.

    It writes data to and manages the different active buckets in the io.delta.flink.sink.DeltaSink.

    Most of the logic in this class was sourced from FileWriter, as the behaviour is very similar. The main differences are the use of custom implementations for some member classes and the management of io.delta.standalone.DeltaLog transactional ids: DeltaWriter#appId and DeltaWriter#nextCheckpointId.

    Lifecycle of instances of this class is as follows:

    • Every instance is created via the io.delta.flink.sink.DeltaSink#createWriter method.
    • A writer's life span is the same as the application's (unless the worker node becomes unresponsive and the job manager needs to create a new instance to satisfy the parallelism).
    • The number of instances is managed globally by the job manager and is equal to the parallelism of the sink.
    See also

    Flink's parallel execution
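The role of the two transactional ids can be illustrated with a small in-memory model. This is a minimal sketch, not the connector's code: `SketchCommittable`, `SketchTxnLog` and `commitIfNew` are hypothetical names, and the map stands in for whatever per-application version the transaction log records. It assumes that a commit is skipped when its checkpoint id is not greater than the last one already recorded for the same appId.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a committable: the files written during one
// checkpoint interval, tagged with the writer's transactional ids.
final class SketchCommittable {
    final String appId;
    final long checkpointId;
    final List<String> files;

    SketchCommittable(String appId, long checkpointId, List<String> files) {
        this.appId = appId;
        this.checkpointId = checkpointId;
        this.files = files;
    }
}

// Hypothetical in-memory model of a transaction log that remembers, per appId,
// the highest checkpoint id that has already been committed.
final class SketchTxnLog {
    private final Map<String, Long> lastCommitted = new HashMap<>();
    private final List<String> committedFiles = new ArrayList<>();

    // Returns true if the committable was applied, false if it was a replay.
    boolean commitIfNew(SketchCommittable c) {
        long last = lastCommitted.getOrDefault(c.appId, -1L);
        if (c.checkpointId <= last) {
            return false; // already committed before a restart, so skip it
        }
        committedFiles.addAll(c.files);
        lastCommitted.put(c.appId, c.checkpointId);
        return true;
    }

    List<String> committedFiles() {
        return committedFiles;
    }
}
```

In this model, committing the same `SketchCommittable` twice (for example after a recovery replays a checkpoint) adds its files only once, which is the kind of idempotence the appId and checkpoint id pair is meant to support.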

  2. class DeltaWriterBucket[IN] extends AnyRef

    Internal implementation for writing the actual events to the underlying files in the correct buckets / partitions.

    In Flink's org.apache.flink.api.connector.sink.Sink topology, one of the main components is org.apache.flink.api.connector.sink.SinkWriter, which in the case of DeltaSink is implemented as DeltaWriter. However, to comply with Delta Lake's support for partitioned tables, a new component was added in the form of DeltaWriterBucket, which is responsible for handling writes to only one of the buckets (aka partitions). Such bucket writers are managed by DeltaWriter, which works as a proxy between the higher-order framework's commands (write, prepareCommit, etc.) and the actual write implementation in DeltaWriterBucket. Thanks to this solution, events received within one DeltaWriter operator during a particular checkpoint interval are always grouped and flushed to the currently open in-progress file.

    The implementation was sourced from org.apache.flink.connector.file.sink.FileSink, which uses the same concept and implements org.apache.flink.connector.file.sink.writer.FileWriter with its FileWriterBucket implementation. All differences between DeltaSink's and FileSink's writer buckets are explained in the documentation of the particular methods below.

    Lifecycle of instances of this class is as follows:

    • Every instance is created via the DeltaWriter#write method whenever the writer receives the first event that belongs to the bucket represented by the given DeltaWriterBucket instance. For non-partitioned tables, it is created whenever the writer receives its very first event, as in such cases there is only one DeltaWriterBucket, representing the root path of the table.
    • A DeltaWriter instance can create zero, one or multiple instances of DeltaWriterBucket during one checkpoint interval. It creates none if it has not received any events (and thus did not have to create buckets for them). It creates one when it has received events belonging to only one bucket (the same applies if the table is not partitioned). Finally, it creates multiple when it has received events belonging to more than one bucket.
    • The life span of one DeltaWriterBucket may extend over one or more checkpoint intervals. It remains "active" as long as it receives data. If, for a given checkpoint interval, an instance of DeltaWriter has not received any events belonging to a given bucket, then the DeltaWriterBucket representing this bucket is de-listed from the writer's internal bucket iterator. If in a future checkpoint interval the DeltaWriter receives more events for that bucket, it will create a new instance of DeltaWriterBucket representing it.
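The lifecycle described above can be sketched with a simplified model. This is an illustration only, under stated assumptions: `SketchBucket`, `SketchWriter` and their methods are hypothetical names, events are plain strings, and the bucket id is passed in directly rather than derived from the event. The real DeltaWriter and DeltaWriterBucket work with files and Flink's sink interfaces, which this sketch does not model.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical simplified bucket: buffers the events of one partition.
final class SketchBucket {
    final String bucketId;
    private final List<String> pending = new ArrayList<>();

    SketchBucket(String bucketId) {
        this.bucketId = bucketId;
    }

    void write(String event) {
        pending.add(event);
    }

    // A bucket counts as active if it received data in the current interval.
    boolean isActive() {
        return !pending.isEmpty();
    }

    // Hands over the buffered events, modelling the roll of the in-progress file.
    List<String> flush() {
        List<String> out = new ArrayList<>(pending);
        pending.clear();
        return out;
    }
}

// Hypothetical simplified writer that proxies writes to per-bucket instances.
final class SketchWriter {
    private final Map<String, SketchBucket> activeBuckets = new HashMap<>();

    // Creates the bucket on the first event that belongs to it.
    void write(String bucketId, String event) {
        activeBuckets.computeIfAbsent(bucketId, SketchBucket::new).write(event);
    }

    // Called once per checkpoint interval: flushes buckets that received data
    // and de-lists those that received none.
    Map<String, List<String>> prepareCommit() {
        Map<String, List<String>> flushed = new HashMap<>();
        Iterator<Map.Entry<String, SketchBucket>> it = activeBuckets.entrySet().iterator();
        while (it.hasNext()) {
            SketchBucket bucket = it.next().getValue();
            if (bucket.isActive()) {
                flushed.put(bucket.bucketId, bucket.flush());
            } else {
                it.remove();
            }
        }
        return flushed;
    }

    int activeBucketCount() {
        return activeBuckets.size();
    }
}
```

In this model a bucket that receives no events during an interval is dropped at the next `prepareCommit()` call, and a later `write` for the same bucket id creates a fresh instance.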
  3. class DeltaWriterBucketState extends AnyRef

    State of a DeltaWriterBucket that will become part of each application's snapshot created during pre-commit phase of a checkpoint process or manually on demand by the user.

    See the Fault Tolerance via State Snapshots section of the Flink documentation.

    This class is partially inspired by org.apache.flink.connector.file.sink.writer.FileWriterBucketState, with some modifications:

    • Snapshotting of the in-progress file's state was removed, because io.delta.flink.sink.DeltaSink is supposed to always roll part files on checkpoint, so there is no need to recover any in-progress files' states.
    • The state was extended with the application's unique identifier to guarantee idempotent file writes and commits to the io.delta.standalone.DeltaLog.

    Lifecycle of instances of this class is as follows:

    • Every instance is created via the DeltaWriter#snapshotState() method at the final phase of each checkpoint interval and serialized as part of the application's snapshotted state.
    • It can also be created by the Flink framework itself during failure/snapshot recovery, when it is deserialized from the snapshotted state and provided in the input parameter collection to io.delta.flink.sink.DeltaSink#createWriter.
  4. class DeltaWriterBucketStateSerializer extends SimpleVersionedSerializer[DeltaWriterBucketState]

    Versioned serializer for DeltaWriterBucketState.
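A versioned serializer of this kind can be sketched as follows. This is a minimal illustration, not the connector's implementation: `VersionedSerializer` is a local stand-in that mirrors the shape of Flink's SimpleVersionedSerializer so the example compiles without Flink, and the state fields (`bucketId`, `bucketPath`, `appId`) as well as the byte layout are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Local stand-in for the serializer contract (assumption: same three methods
// as Flink's SimpleVersionedSerializer).
interface VersionedSerializer<E> {
    int getVersion();
    byte[] serialize(E obj) throws IOException;
    E deserialize(int version, byte[] serialized) throws IOException;
}

// Hypothetical bucket state: no in-progress file state, plus the application id.
final class SketchBucketState {
    final String bucketId;
    final String bucketPath;
    final String appId;

    SketchBucketState(String bucketId, String bucketPath, String appId) {
        this.bucketId = bucketId;
        this.bucketPath = bucketPath;
        this.appId = appId;
    }
}

final class SketchBucketStateSerializer implements VersionedSerializer<SketchBucketState> {
    private static final int VERSION = 1;

    @Override
    public int getVersion() {
        return VERSION;
    }

    @Override
    public byte[] serialize(SketchBucketState state) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(state.bucketId);
            out.writeUTF(state.bucketPath);
            out.writeUTF(state.appId);
        }
        return bytes.toByteArray();
    }

    @Override
    public SketchBucketState deserialize(int version, byte[] serialized) throws IOException {
        // Reject bytes written by a format version this serializer does not know.
        if (version != VERSION) {
            throw new IOException("Unrecognized version: " + version);
        }
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(serialized))) {
            return new SketchBucketState(in.readUTF(), in.readUTF(), in.readUTF());
        }
    }
}
```

Keeping the version separate from the payload lets a later format change add a new branch in `deserialize` while still reading state written by older snapshots.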
