Packages

io.delta.flink.sink

internal

package internal

Type Members

  1. class DeltaBucketAssigner[T] extends BucketAssigner[T, String]

    A custom implementation of the BucketAssigner class, required to define how particular events map to buckets (aka partitions).

    This implementation can be seen as a utility class for complying with Delta Lake's partitioning style, which follows Apache Hive's partitioning style by encoding the partition columns and their values as filesystem directory paths, e.g. "/some_path/table_1/date=2020-01-01". It is still possible for users to roll out their own implementation of BucketAssigner and pass it to the DeltaSinkBuilder during creation of the sink.
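
    For illustration, the Hive-style path layout mentioned above can be sketched with plain Java and no Flink dependencies; the partition map below mirrors the shape of what a DeltaPartitionComputer returns, and the class and method names are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class PartitionPathSketch {

    // Joins partition columns and values into a Hive-style directory path,
    // preserving the insertion order of the map (order defines directory nesting).
    static String toPartitionPath(LinkedHashMap<String, String> partitionSpec) {
        StringJoiner joiner = new StringJoiner("/");
        for (Map.Entry<String, String> entry : partitionSpec.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        LinkedHashMap<String, String> spec = new LinkedHashMap<>();
        spec.put("date", "2020-01-01");
        spec.put("country", "PL");
        // prints: date=2020-01-01/country=PL
        System.out.println(toPartitionPath(spec));
    }
}
```

    Note that a LinkedHashMap is used rather than a plain HashMap because the order of the partition columns determines the nesting of the resulting directories.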

    This DeltaBucketAssigner is applicable only to DeltaSinkBuilder and not to RowDataDeltaSinkBuilder. The former lets you use this DeltaBucketAssigner to provide the required custom bucketing behaviour, while the latter doesn't expose a custom bucketing API, and you can provide the partition column keys only.

    Thus, this DeltaBucketAssigner is currently not exposed to the user through any public API.

    In the future, if you'd like to implement your own custom bucketing, it could look as follows:

        /////////////////////////////////////////////////////////////////////////////////
        // implements a custom partition computer
        /////////////////////////////////////////////////////////////////////////////////
        static class CustomPartitionColumnComputer implements DeltaPartitionComputer<RowData> {
    
            @Override
            public LinkedHashMap<String, String> generatePartitionValues(
                    RowData element, BucketAssigner.Context context) {
                String f1 = element.getString(0).toString();
                int f3 = element.getInt(2);
                LinkedHashMap<String, String> partitionSpec = new LinkedHashMap<>();
                partitionSpec.put("f1", f1);
                partitionSpec.put("f3", Integer.toString(f3));
                return partitionSpec;
            }
        }
        ...
        /////////////////////////////////////////
        // creates partition assigner for a custom partition computer
        /////////////////////////////////////////
        DeltaBucketAssignerInternal<RowData> partitionAssigner =
                    new DeltaBucketAssignerInternal<>(new CustomPartitionColumnComputer());
    
        ...
    
        /////////////////////////////////////////////////////////////////////////////////
        // create the builder
        /////////////////////////////////////////////////////////////////////////////////
    
        DeltaSinkBuilder<RowData> builder =
            new DeltaSinkBuilder.DefaultDeltaFormatBuilder<>(
            ...,
            partitionAssigner,
            ...)
    

  2. trait DeltaPartitionComputer[T] extends Serializable
  3. class DeltaSinkBuilder[IN] extends Serializable

    A builder class for DeltaSinkInternal.

    For the most common use cases, use the DeltaSink#forRowData utility method to instantiate the sink. This builder should be used only if you need to provide a custom writer factory instance or configure some low-level settings for the sink.

    Example how to use this class for the stream of RowData:

        RowType rowType = ...;
        Configuration conf = new Configuration();
        conf.set("parquet.compression", "SNAPPY");
        ParquetWriterFactory<RowData> writerFactory =
            ParquetRowDataBuilder.createWriterFactory(rowType, conf, true);
    
        DeltaSinkBuilder<RowData> sinkBuilder = new DeltaSinkBuilder<>(
            basePath,
            conf,
            bucketCheckInterval,
            writerFactory,
            new BasePathBucketAssigner<>(),
            OnCheckpointRollingPolicy.build(),
            OutputFileConfig.builder().withPartSuffix(".snappy.parquet").build(),
            appId,
            rowType,
            mergeSchema
        );
    
        DeltaSink<RowData> sink = sinkBuilder.build();
    
    

  4. class DeltaSinkInternal[IN] extends Sink[IN, DeltaCommittable, DeltaWriterBucketState, DeltaGlobalCommittable]

    A unified sink that emits its input elements to file system files within buckets using Parquet format and commits those files to the io.delta.standalone.DeltaLog. This sink achieves exactly-once semantics for both BATCH and STREAMING.

    The behaviour of this sink splits into two phases. The first phase takes place between the application's checkpoints, when records are flushed to files (or appended to writers' buffers); here the behaviour is almost identical to that of org.apache.flink.connector.file.sink.FileSink.

    Next, during the checkpoint phase, files are "closed" (renamed) by independent instances of io.delta.flink.sink.internal.committer.DeltaCommitter, which behave very similarly to org.apache.flink.connector.file.sink.committer.FileCommitter. When all the parallel committers are done, all the files are committed at once by the single-parallelism io.delta.flink.sink.internal.committer.DeltaGlobalCommitter.

    This DeltaSinkInternal borrows many implementation details from org.apache.flink.connector.file.sink.FileSink, so for most of the low-level behaviour one may refer to the docs of that module. The most notable differences from the FileSink are:

    • tightly coupling DeltaSink to the Bulk-/ParquetFormat
    • extending committable information with files metadata (name, size, rows, last update timestamp)
    • providing Delta Lake-specific behaviour, mostly contained in io.delta.flink.sink.internal.committer.DeltaGlobalCommitter, which implements the commit to the io.delta.standalone.DeltaLog at the final stage of each checkpoint.
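
    The two-phase flow described above (per-checkpoint file closing by parallel committers, followed by a single global commit) can be illustrated with a dependency-free sketch. All names here are hypothetical and only mirror the roles of DeltaCommitter and DeltaGlobalCommitter; the real implementations operate on Flink committables, not strings:

```java
import java.util.ArrayList;
import java.util.List;

public class TwoPhaseCommitSketch {

    // Phase 1: each parallel committer "closes" (renames) its in-progress files.
    // Hypothetical stand-in for io.delta.flink.sink.internal.committer.DeltaCommitter.
    static List<String> closeFiles(List<String> inProgressFiles) {
        List<String> committables = new ArrayList<>();
        for (String file : inProgressFiles) {
            committables.add(file.replace(".inprogress", ""));
        }
        return committables;
    }

    // Phase 2: a single global committer gathers the output of all parallel
    // committers and commits it at once. Hypothetical stand-in for the
    // single-parallelism DeltaGlobalCommitter; in the real sink this is one
    // transactional commit to the DeltaLog.
    static int globalCommit(List<List<String>> perCommitterFiles) {
        List<String> allFiles = new ArrayList<>();
        for (List<String> files : perCommitterFiles) {
            allFiles.addAll(files);
        }
        return allFiles.size();
    }

    public static void main(String[] args) {
        List<String> committer1 = closeFiles(List.of("part-0.parquet.inprogress"));
        List<String> committer2 = closeFiles(List.of("part-1.parquet.inprogress"));
        int committed = globalCommit(List.of(committer1, committer2));
        System.out.println("committed " + committed + " files");  // committed 2 files
    }
}
```

    The key property this structure provides is that no file becomes visible in the table until the single global commit succeeds, which is what makes exactly-once semantics possible.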
  5. class DeltaSinkOptions extends AnyRef

    This class contains all available options for io.delta.flink.sink.DeltaSink, with their types and default values.

  6. class SchemaConverter extends AnyRef

    This is a utility class for converting Flink's RowType into Delta Lake's StructType, which is used for schema-matching comparisons during io.delta.standalone.DeltaLog commits.
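
    As a rough illustration of what such a conversion involves (and not the actual SchemaConverter API, which operates on Flink LogicalType instances and Delta StructType/DataType objects), a dependency-free sketch mapping a few Flink logical type names to plausible Delta Lake counterparts might look like this; the mapping table and all names are assumptions for illustration only:

```java
import java.util.Map;

public class SchemaConversionSketch {

    // Illustrative name-level mapping only; the real converter walks the
    // RowType's fields recursively and builds typed StructType objects.
    static final Map<String, String> FLINK_TO_DELTA = Map.of(
        "INT", "integer",
        "BIGINT", "long",
        "VARCHAR", "string",
        "DOUBLE", "double",
        "BOOLEAN", "boolean"
    );

    static String toDeltaTypeName(String flinkTypeName) {
        String deltaType = FLINK_TO_DELTA.get(flinkTypeName);
        if (deltaType == null) {
            throw new IllegalArgumentException("Unsupported type: " + flinkTypeName);
        }
        return deltaType;
    }

    public static void main(String[] args) {
        System.out.println(toDeltaTypeName("VARCHAR"));  // string
    }
}
```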
