Packages

package filesystem

Type Members

  1. class DeltaBulkBucketWriter[IN, BucketID] extends BulkBucketWriter[IN, BucketID]

    A factory that creates DeltaBulkPartWriters.

    This class is provided as part of a workaround for getting the actual file size.

    Compared to the original BulkBucketWriter, it changes only the return types of the methods DeltaBulkBucketWriter#resumeFrom and DeltaBulkBucketWriter#openNew to DeltaBulkPartWriter, which is a custom implementation of BulkPartWriter.

  2. class DeltaBulkPartWriter[IN, BucketID] extends AbstractPartFileWriter[IN, BucketID]

    This class is an implementation of InProgressFileWriter for writing elements to a part using BulkPartWriter. It also implements PartFileInfo.

    An instance of this class represents one in-progress file that is currently "opened" by one of the io.delta.flink.sink.internal.writer.DeltaWriterBucket instances.

    It is provided as a workaround for getting the actual size of an in-progress file right before transitioning it to a pending state ("closing").

    Compared to the original BulkPartWriter, the changed behaviour is the addition of the DeltaBulkPartWriter#closeWriter method, which is called first during the "close" operation for an in-progress file. After calling it, we can safely get the actual file size and then call the DeltaBulkPartWriter#closeForCommit() method.

    This workaround is needed because, for the Parquet format, the writer's buffer must be explicitly flushed before getting the file size (and there is no easy way to track the bytes sent to the writer). If this flush is not performed, PartFileInfo#getSize reports the file size without the data buffered in the writer's memory (which in most cases is all the events consumed within a given checkpoint interval).
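
    The effect described above can be reproduced with plain JDK streams. The following is a minimal sketch, not the connector's code: BufferedSizeSketch and observeSizes are hypothetical names, a ByteArrayOutputStream stands in for the part file, and a BufferedOutputStream stands in for the Parquet writer's internal buffer.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it is not part of the Delta connector or of Flink.
final class BufferedSizeSketch {

    // Returns {size visible before flush, size visible after flush}.
    static int[] observeSizes(String payload) {
        // Stand-in for the part file whose size we want to read.
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        // Stand-in for the writer's in-memory buffer (8 KiB here).
        BufferedOutputStream writer = new BufferedOutputStream(file, 8192);
        try {
            writer.write(payload.getBytes(StandardCharsets.UTF_8));
            int before = file.size(); // small payloads are still held in the buffer
            writer.flush();           // push the buffered bytes to the "file"
            int after = file.size();  // now the size reflects everything written
            writer.close();
            return new int[] {before, after};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] sizes = observeSizes("event-1,event-2,event-3");
        System.out.println("before flush: " + sizes[0] + ", after flush: " + sizes[1]);
    }
}
```

    With a payload smaller than the buffer, the size read before the flush is zero, which mirrors a size query made before the explicit flush.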

    Lifecycle of instances of this class is as follows:

    • Since it is a member of DeltaInProgressPart, it shares that object's life span
    • Instances of this class are created inside a method of io.delta.flink.sink.internal.writer.DeltaWriterBucket every time a bucket processes the first event or the previously opened file meets the conditions for rolling (e.g. size threshold)
    • Its life span holds as long as the underlying file stays in an in-progress state (so until it's "rolled"), but no longer than a single checkpoint interval.
    • During pre-commit phase every existing DeltaInProgressPart instance is automatically transformed ("rolled") into a DeltaPendingFile instance

    This class is an almost exact copy of OutputStreamBasedPartFileWriter. The only modified behaviour is that the DeltaBulkPartWriter#closeWriter() method additionally flushes the internal buffer.
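
    The close sequence described for this class (closeWriter, then reading the size, then closeForCommit) can be sketched as follows. This is an illustration under assumptions, not the actual implementation: PartWriterSketch is a hypothetical class, its method names only mirror the ones in the text, and the rule that closeForCommit rejects calls made before closeWriter is an assumption of this sketch.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a part writer; not the connector's code.
final class PartWriterSketch {
    private final ByteArrayOutputStream file = new ByteArrayOutputStream();
    private final BufferedOutputStream buffer = new BufferedOutputStream(file);
    private boolean writerClosed = false;

    void write(byte[] element) {
        try {
            buffer.write(element);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Step 1: flush buffered data so the file contains everything written.
    void closeWriter() {
        try {
            buffer.flush();
            writerClosed = true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Step 2: only meaningful once closeWriter() has been called.
    long getSize() {
        return file.size();
    }

    // Step 3: finish the part; this sketch simply returns the final size.
    long closeForCommit() {
        if (!writerClosed) {
            throw new IllegalStateException("call closeWriter() before closeForCommit()");
        }
        try {
            buffer.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return file.size();
    }

    // Returns {size before closeWriter, size after closeWriter, size at commit}.
    static long[] demo() {
        PartWriterSketch writer = new PartWriterSketch();
        writer.write(new byte[] {1, 2, 3, 4, 5});
        long before = writer.getSize();
        writer.closeWriter();
        long after = writer.getSize();
        long committed = writer.closeForCommit();
        return new long[] {before, after, committed};
    }
}
```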

  3. class DeltaInProgressPart[IN] extends AnyRef

    Wrapper class for part files in the io.delta.flink.sink.DeltaSink. Part files are files that are currently "opened" for writing new data. Similar behaviour can be observed in org.apache.flink.connector.file.sink.FileSink. However, in contrast to FileSink, DeltaSink needs to keep the file name attached to the opened file so that it can later transform the DeltaInProgressPart instance into a DeltaPendingFile instance and finally commit the written file to the io.delta.standalone.DeltaLog during the global commit phase.

    Additionally, we need a custom implementation, DeltaBulkPartWriter, as a workaround for getting the actual file size (which is currently not possible for bulk formats when operating at the interface level of PartFileInfo; see DeltaBulkPartWriter for details).

    Lifecycle of instances of this class is as follows:

    • Instances of this class are created inside the io.delta.flink.sink.internal.writer.DeltaWriterBucket#rollPartFile method every time a bucket processes the first event or the previously opened file meets the conditions for rolling (e.g. size threshold)
    • Its life span holds as long as the underlying file stays in an in-progress state (so until it's "rolled"), but no longer than a single checkpoint interval.
    • During pre-commit phase every existing DeltaInProgressPart instance is automatically transformed ("rolled") into a DeltaPendingFile instance

  4. class DeltaPendingFile extends AnyRef

    Wrapper class for an InProgressFileWriter.PendingFileRecoverable object. This class carries the internal committable information to be used during the checkpoint/commit phase.

    As in org.apache.flink.connector.file.sink.FileSink, we need to carry the InProgressFileWriter.PendingFileRecoverable information to perform a "local" commit on the file that the sink has written data to. However, in contrast to FileSink, DeltaSink also needs to perform a "global" commit to the io.delta.standalone.DeltaLog, and for that additional file metadata must be provided. Hence, this class provides the required information for both types of commits by wrapping the pending file and attaching the file's metadata.

    Lifecycle of instances of this class is as follows:

    • Instances of this class are created inside the io.delta.flink.sink.internal.writer.DeltaWriterBucket#closePartFile method every time an in-progress file is closed. This happens either when the conditions for closing are met or at the end of every checkpoint interval, during the pre-commit phase, when all open files in all buckets are closed
    • Its life span holds only until the end of a checkpoint interval
    • During pre-commit phase (and after closing every in-progress file) every existing DeltaPendingFile instance is automatically transformed into a DeltaCommittable instance
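
    The lifecycle described for these wrappers (in-progress part, then pending file, then committable) can be sketched as a chain of small value classes. This is only an illustration: LifecycleSketch and its nested classes, fields, and methods are hypothetical simplifications, and the real classes carry more state (for example the PendingFileRecoverable object).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the three lifecycle stages.
final class LifecycleSketch {

    // An "opened" file: keeps its name so later stages can refer to it.
    static final class InProgressPart {
        final String fileName;
        long bytesWritten;

        InProgressPart(String fileName) {
            this.fileName = fileName;
        }

        void write(byte[] element) {
            bytesWritten += element.length;
        }
    }

    // A closed file plus the metadata needed for the "global" commit.
    static final class PendingFile {
        final String fileName;
        final long fileSize;

        PendingFile(String fileName, long fileSize) {
            this.fileName = fileName;
            this.fileSize = fileSize;
        }
    }

    // What is handed over to the commit phase.
    static final class Committable {
        final PendingFile pendingFile;

        Committable(PendingFile pendingFile) {
            this.pendingFile = pendingFile;
        }
    }

    // "Rolls" one in-progress part into a pending file.
    static PendingFile closePartFile(InProgressPart part) {
        return new PendingFile(part.fileName, part.bytesWritten);
    }

    // Pre-commit: close every open part and wrap each result as a committable.
    static List<Committable> preCommit(List<InProgressPart> openParts) {
        List<Committable> committables = new ArrayList<>();
        for (InProgressPart part : openParts) {
            committables.add(new Committable(closePartFile(part)));
        }
        openParts.clear(); // no in-progress part outlives the checkpoint interval
        return committables;
    }
}
```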
