Class KafkaIO.WriteRecords<K,​V>

  • All Implemented Interfaces:
    java.io.Serializable, org.apache.beam.sdk.transforms.display.HasDisplayData
    Enclosing class:
    KafkaIO

    public abstract static class KafkaIO.WriteRecords<K,​V>
    extends org.apache.beam.sdk.transforms.PTransform<org.apache.beam.sdk.values.PCollection<org.apache.kafka.clients.producer.ProducerRecord<K,​V>>,​org.apache.beam.sdk.values.PDone>
A PTransform to write to a Kafka topic with ProducerRecords. See KafkaIO for more information on usage and configuration.
    See Also:
    Serialized Form
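A minimal usage sketch (not from this Javadoc): it assumes a pipeline that already produces a `PCollection<ProducerRecord<String, String>>` named `records`; the broker address and topic name are placeholders.

```java
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.common.serialization.StringSerializer;

// records: an existing PCollection<ProducerRecord<String, String>>.
// "localhost:9092" and "results" are placeholder values.
records.apply(KafkaIO.<String, String>writeRecords()
    .withBootstrapServers("localhost:9092")
    .withTopic("results")                        // default topic for records without one
    .withKeySerializer(StringSerializer.class)
    .withValueSerializer(StringSerializer.class));
```

The transform consumes ProducerRecords directly, so per-record topic, partition, and timestamp can be carried on the records themselves.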
    • Field Summary

      • Fields inherited from class org.apache.beam.sdk.transforms.PTransform

        annotations, displayData, name, resourceHints
    • Constructor Summary

      Constructors 
      Constructor Description
      WriteRecords()  
    • Method Summary

      All Methods Instance Methods Abstract Methods Concrete Methods Deprecated Methods 
      Modifier and Type Method Description
      org.apache.beam.sdk.values.PDone expand​(org.apache.beam.sdk.values.PCollection<org.apache.kafka.clients.producer.ProducerRecord<K,​V>> input)  
      abstract org.apache.beam.sdk.transforms.errorhandling.ErrorHandler<org.apache.beam.sdk.transforms.errorhandling.BadRecord,​?> getBadRecordErrorHandler()  
      abstract org.apache.beam.sdk.transforms.errorhandling.BadRecordRouter getBadRecordRouter()  
      abstract @Nullable org.apache.beam.sdk.transforms.SerializableFunction<java.util.Map<java.lang.String,​java.lang.Object>,​? extends org.apache.kafka.clients.consumer.Consumer<?,​?>> getConsumerFactoryFn()  
      abstract @Nullable java.lang.Class<? extends org.apache.kafka.common.serialization.Serializer<K>> getKeySerializer()  
      abstract int getNumShards()  
      abstract java.util.Map<java.lang.String,​java.lang.Object> getProducerConfig()  
      abstract @Nullable org.apache.beam.sdk.transforms.SerializableFunction<java.util.Map<java.lang.String,​java.lang.Object>,​org.apache.kafka.clients.producer.Producer<K,​V>> getProducerFactoryFn()  
      abstract @Nullable KafkaPublishTimestampFunction<org.apache.kafka.clients.producer.ProducerRecord<K,​V>> getPublishTimestampFunction()  
      abstract @Nullable java.lang.String getSinkGroupId()  
      abstract @Nullable java.lang.String getTopic()  
      abstract @Nullable java.lang.Class<? extends org.apache.kafka.common.serialization.Serializer<V>> getValueSerializer()  
      abstract boolean isEOS()  
      void populateDisplayData​(org.apache.beam.sdk.transforms.display.DisplayData.Builder builder)  
      KafkaIO.WriteRecords<K,​V> updateProducerProperties​(java.util.Map<java.lang.String,​java.lang.Object> configUpdates)
Deprecated.
As of version 2.13, use withProducerConfigUpdates(Map) instead.
      void validate​(@Nullable org.apache.beam.sdk.options.PipelineOptions options)  
      KafkaIO.WriteRecords<K,​V> withBadRecordErrorHandler​(org.apache.beam.sdk.transforms.errorhandling.ErrorHandler<org.apache.beam.sdk.transforms.errorhandling.BadRecord,​?> badRecordErrorHandler)  
      KafkaIO.WriteRecords<K,​V> withBootstrapServers​(java.lang.String bootstrapServers)
Returns a new KafkaIO.Write transform with the Kafka producer pointing to bootstrapServers.
      KafkaIO.WriteRecords<K,​V> withConsumerFactoryFn​(org.apache.beam.sdk.transforms.SerializableFunction<java.util.Map<java.lang.String,​java.lang.Object>,​? extends org.apache.kafka.clients.consumer.Consumer<?,​?>> consumerFactoryFn)
When exactly-once semantics are enabled (see withEOS(int, String)), the sink needs to fetch previously stored state from the Kafka topic.
      KafkaIO.WriteRecords<K,​V> withEOS​(int numShards, java.lang.String sinkGroupId)
      Provides exactly-once semantics while writing to Kafka, which enables applications with end-to-end exactly-once guarantees on top of exactly-once semantics within Beam pipelines.
      KafkaIO.WriteRecords<K,​V> withInputTimestamp()
The timestamp for each record being published is set to the timestamp of the element in the pipeline.
      KafkaIO.WriteRecords<K,​V> withKeySerializer​(java.lang.Class<? extends org.apache.kafka.common.serialization.Serializer<K>> keySerializer)
Sets a Serializer for serializing the key (if any) to bytes.
      KafkaIO.WriteRecords<K,​V> withProducerConfigUpdates​(java.util.Map<java.lang.String,​java.lang.Object> configUpdates)
Updates configuration for the producer.
      KafkaIO.WriteRecords<K,​V> withProducerFactoryFn​(org.apache.beam.sdk.transforms.SerializableFunction<java.util.Map<java.lang.String,​java.lang.Object>,​org.apache.kafka.clients.producer.Producer<K,​V>> producerFactoryFn)
Sets a custom function to create the Kafka producer.
      KafkaIO.WriteRecords<K,​V> withPublishTimestampFunction​(KafkaPublishTimestampFunction<org.apache.kafka.clients.producer.ProducerRecord<K,​V>> timestampFunction)
Deprecated.
Use ProducerRecords to set the publish timestamp.
      KafkaIO.WriteRecords<K,​V> withTopic​(java.lang.String topic)
      Sets the default Kafka topic to write to.
      KafkaIO.WriteRecords<K,​V> withValueSerializer​(java.lang.Class<? extends org.apache.kafka.common.serialization.Serializer<V>> valueSerializer)
      Sets a Serializer for serializing value to bytes.
      • Methods inherited from class org.apache.beam.sdk.transforms.PTransform

        addAnnotation, compose, compose, getAdditionalInputs, getAnnotations, getDefaultOutputCoder, getDefaultOutputCoder, getDefaultOutputCoder, getKindString, getName, getResourceHints, setDisplayData, setResourceHints, toString, validate
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
    • Constructor Detail

      • WriteRecords

        public WriteRecords()
    • Method Detail

      • getTopic

        @Pure
        public abstract @Nullable java.lang.String getTopic()
      • getProducerConfig

        @Pure
        public abstract java.util.Map<java.lang.String,​java.lang.Object> getProducerConfig()
      • getProducerFactoryFn

        @Pure
        public abstract @Nullable org.apache.beam.sdk.transforms.SerializableFunction<java.util.Map<java.lang.String,​java.lang.Object>,​org.apache.kafka.clients.producer.Producer<K,​V>> getProducerFactoryFn()
      • getKeySerializer

        @Pure
        public abstract @Nullable java.lang.Class<? extends org.apache.kafka.common.serialization.Serializer<K>> getKeySerializer()
      • getValueSerializer

        @Pure
        public abstract @Nullable java.lang.Class<? extends org.apache.kafka.common.serialization.Serializer<V>> getValueSerializer()
      • getPublishTimestampFunction

        @Pure
        public abstract @Nullable KafkaPublishTimestampFunction<org.apache.kafka.clients.producer.ProducerRecord<K,​V>> getPublishTimestampFunction()
      • isEOS

        @Pure
        public abstract boolean isEOS()
      • getSinkGroupId

        @Pure
        public abstract @Nullable java.lang.String getSinkGroupId()
      • getNumShards

        @Pure
        public abstract int getNumShards()
      • getConsumerFactoryFn

        @Pure
        public abstract @Nullable org.apache.beam.sdk.transforms.SerializableFunction<java.util.Map<java.lang.String,​java.lang.Object>,​? extends org.apache.kafka.clients.consumer.Consumer<?,​?>> getConsumerFactoryFn()
      • getBadRecordRouter

        @Pure
        public abstract org.apache.beam.sdk.transforms.errorhandling.BadRecordRouter getBadRecordRouter()
      • getBadRecordErrorHandler

        @Pure
        public abstract org.apache.beam.sdk.transforms.errorhandling.ErrorHandler<org.apache.beam.sdk.transforms.errorhandling.BadRecord,​?> getBadRecordErrorHandler()
      • withBootstrapServers

        public KafkaIO.WriteRecords<K,​V> withBootstrapServers​(java.lang.String bootstrapServers)
Returns a new KafkaIO.Write transform with the Kafka producer pointing to bootstrapServers.
      • withTopic

        public KafkaIO.WriteRecords<K,​V> withTopic​(java.lang.String topic)
Sets the default Kafka topic to write to. Use ProducerRecords to set the topic name per published record.
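For illustration (not from this Javadoc), a record can carry its own destination topic via the ProducerRecord constructor, in which case the record's topic takes precedence over the default set here; the topic names below are placeholders.

```java
import org.apache.kafka.clients.producer.ProducerRecord;

// A record carrying its own destination topic ("audit"). Records created
// without a topic fall back to the default configured via withTopic(...).
ProducerRecord<String, String> rec =
    new ProducerRecord<>("audit", "key-1", "value-1");
```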
      • withKeySerializer

        public KafkaIO.WriteRecords<K,​V> withKeySerializer​(java.lang.Class<? extends org.apache.kafka.common.serialization.Serializer<K>> keySerializer)
Sets a Serializer for serializing the key (if any) to bytes.

A key is optional when writing to Kafka. Note that when a key is set, its hash is used to determine the partition in Kafka (see ProducerRecord for more details).

      • withValueSerializer

        public KafkaIO.WriteRecords<K,​V> withValueSerializer​(java.lang.Class<? extends org.apache.kafka.common.serialization.Serializer<V>> valueSerializer)
Sets a Serializer for serializing the value to bytes.
      • updateProducerProperties

        @Deprecated
        public KafkaIO.WriteRecords<K,​V> updateProducerProperties​(java.util.Map<java.lang.String,​java.lang.Object> configUpdates)
        Deprecated.
As of version 2.13, use withProducerConfigUpdates(Map) instead.
        Adds the given producer properties, overriding old values of properties with the same key.
      • withProducerConfigUpdates

        public KafkaIO.WriteRecords<K,​V> withProducerConfigUpdates​(java.util.Map<java.lang.String,​java.lang.Object> configUpdates)
Updates configuration for the producer. Note that the default producer properties are not completely overridden; this method only replaces the values of keys that appear in configUpdates.

        By default, the producer uses the configuration from DEFAULT_PRODUCER_PROPERTIES.
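The merge semantics can be illustrated with plain Map operations; this is an illustrative sketch, not the actual implementation, and the default values shown are stand-ins (though `retries` and `compression.type` are real Kafka producer property names).

```java
import java.util.HashMap;
import java.util.Map;

public class ConfigMergeSketch {
  public static void main(String[] args) {
    // Stand-in for the sink's default producer properties.
    Map<String, Object> defaults = new HashMap<>();
    defaults.put("retries", 3);
    defaults.put("acks", "all");

    // Updates passed to withProducerConfigUpdates(...).
    Map<String, Object> updates = new HashMap<>();
    updates.put("retries", 10);
    updates.put("compression.type", "lz4");

    // Matching keys are overridden; all other defaults are kept.
    Map<String, Object> merged = new HashMap<>(defaults);
    merged.putAll(updates);

    System.out.println(merged.get("retries"));          // 10 (overridden)
    System.out.println(merged.get("acks"));             // all (kept from defaults)
    System.out.println(merged.get("compression.type")); // lz4 (added)
  }
}
```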

      • withProducerFactoryFn

        public KafkaIO.WriteRecords<K,​V> withProducerFactoryFn​(org.apache.beam.sdk.transforms.SerializableFunction<java.util.Map<java.lang.String,​java.lang.Object>,​org.apache.kafka.clients.producer.Producer<K,​V>> producerFactoryFn)
Sets a custom function to create the Kafka producer. Primarily used for tests. The default is KafkaProducer.
      • withInputTimestamp

        public KafkaIO.WriteRecords<K,​V> withInputTimestamp()
The timestamp for each record being published is set to the timestamp of the element in the pipeline. This is equivalent to withPublishTimestampFunction((e, ts) -> ts).
NOTE: Kafka's retention policies are based on message timestamps. If the pipeline is processing messages from the past, Kafka might delete them immediately after they are published if their timestamps are older than the Kafka cluster's log.retention.hours.
      • withPublishTimestampFunction

        @Deprecated
        public KafkaIO.WriteRecords<K,​V> withPublishTimestampFunction​(KafkaPublishTimestampFunction<org.apache.kafka.clients.producer.ProducerRecord<K,​V>> timestampFunction)
Deprecated.
Use ProducerRecords to set the publish timestamp.
A function to provide the timestamp for records being published.
NOTE: Kafka's retention policies are based on message timestamps. If the pipeline is processing messages from the past, Kafka might delete them immediately after they are published if their timestamps are older than the Kafka cluster's log.retention.hours.
      • withEOS

        public KafkaIO.WriteRecords<K,​V> withEOS​(int numShards,
                                                       java.lang.String sinkGroupId)
Provides exactly-once semantics while writing to Kafka, which enables applications with end-to-end exactly-once guarantees on top of the exactly-once semantics within Beam pipelines. It ensures that records written to the sink are committed to Kafka exactly once, even when parts of the processing are retried during pipeline execution. Retries typically occur when workers restart (as in failure recovery) or when work is redistributed (as in an autoscaling event).

Beam runners typically provide exactly-once semantics for the results of a pipeline, but not for side effects from user code in a transform. If a transform such as a Kafka sink writes to an external system, those writes might occur more than once. When EOS is enabled here, the sink transform ties checkpointing semantics in compatible Beam runners to transactions in Kafka (version 0.11+) to ensure a record is written only once. Because the implementation relies on a runner's checkpoint semantics, not all runners are compatible; the sink throws an exception during initialization if the runner is not explicitly allowed. The Dataflow, Flink, and Spark runners are compatible.

Note on performance: the exactly-once sink involves two shuffles of the records. In addition to the cost of shuffling the records among workers, the records go through two serialization-deserialization cycles. Depending on the volume and cost of serialization, the CPU cost might be noticeable. This cost can be reduced by writing byte arrays, i.e. serializing records to bytes before handing them to the Kafka sink.

        Parameters:
numShards - Sets sink parallelism. The state metadata stored on Kafka is spread across this many virtual partitions using sinkGroupId. A good rule of thumb is to set this to roughly the number of partitions in the Kafka topic.
sinkGroupId - The group id used to store a small amount of state as metadata on Kafka. It is similar to the consumer group id used with a KafkaConsumer. Each job should use a unique group id so that restarts/updates of the job preserve the state and ensure exactly-once semantics. The state is committed atomically with sink transactions on Kafka. See KafkaProducer.sendOffsetsToTransaction(Map, String) for more information. The sink performs multiple sanity checks during initialization to catch common mistakes so that it does not end up using state that does not appear to have been written by the same job.
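A hedged configuration sketch (not from this Javadoc) for the exactly-once mode: it assumes an existing `PCollection<ProducerRecord<String, String>>` named `records` and a Kafka 0.11+ cluster; the broker address, topic, shard count, and group id are placeholders.

```java
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.common.serialization.StringSerializer;

// records: an existing PCollection<ProducerRecord<String, String>>.
records.apply(KafkaIO.<String, String>writeRecords()
    .withBootstrapServers("broker-1:9092")
    .withTopic("results")
    .withKeySerializer(StringSerializer.class)
    .withValueSerializer(StringSerializer.class)
    // ~number of topic partitions; the group id must be unique per job so
    // restarts/updates find the state written by the same job.
    .withEOS(10, "my-job-sink-group"));
```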
      • withConsumerFactoryFn

        public KafkaIO.WriteRecords<K,​V> withConsumerFactoryFn​(org.apache.beam.sdk.transforms.SerializableFunction<java.util.Map<java.lang.String,​java.lang.Object>,​? extends org.apache.kafka.clients.consumer.Consumer<?,​?>> consumerFactoryFn)
When exactly-once semantics are enabled (see withEOS(int, String)), the sink needs to fetch previously stored state from the Kafka topic. Fetching the metadata requires a consumer. Similar to KafkaIO.Read.withConsumerFactoryFn(SerializableFunction), a factory function can be supplied if required for a specific case. The default is KafkaConsumer.
      • withBadRecordErrorHandler

        public KafkaIO.WriteRecords<K,​V> withBadRecordErrorHandler​(org.apache.beam.sdk.transforms.errorhandling.ErrorHandler<org.apache.beam.sdk.transforms.errorhandling.BadRecord,​?> badRecordErrorHandler)
      • expand

        public org.apache.beam.sdk.values.PDone expand​(org.apache.beam.sdk.values.PCollection<org.apache.kafka.clients.producer.ProducerRecord<K,​V>> input)
        Specified by:
        expand in class org.apache.beam.sdk.transforms.PTransform<org.apache.beam.sdk.values.PCollection<org.apache.kafka.clients.producer.ProducerRecord<K,​V>>,​org.apache.beam.sdk.values.PDone>
      • validate

        public void validate​(@Nullable org.apache.beam.sdk.options.PipelineOptions options)
        Overrides:
        validate in class org.apache.beam.sdk.transforms.PTransform<org.apache.beam.sdk.values.PCollection<org.apache.kafka.clients.producer.ProducerRecord<K,​V>>,​org.apache.beam.sdk.values.PDone>
      • populateDisplayData

        public void populateDisplayData​(org.apache.beam.sdk.transforms.display.DisplayData.Builder builder)
        Specified by:
        populateDisplayData in interface org.apache.beam.sdk.transforms.display.HasDisplayData
        Overrides:
        populateDisplayData in class org.apache.beam.sdk.transforms.PTransform<org.apache.beam.sdk.values.PCollection<org.apache.kafka.clients.producer.ProducerRecord<K,​V>>,​org.apache.beam.sdk.values.PDone>