Class BoundedDatasetFactory


  • public class BoundedDatasetFactory
    extends java.lang.Object
    • Method Summary

      static <T> org.apache.spark.sql.Dataset<org.apache.beam.sdk.util.WindowedValue<T>> createDatasetFromRDD(org.apache.spark.sql.SparkSession session, org.apache.beam.sdk.io.BoundedSource<T> source, java.util.function.Supplier<org.apache.beam.sdk.options.PipelineOptions> options, org.apache.spark.sql.Encoder<org.apache.beam.sdk.util.WindowedValue<T>> encoder)
          Create a Dataset for a BoundedSource via a Spark RDD.
      static <T> org.apache.spark.sql.Dataset<org.apache.beam.sdk.util.WindowedValue<T>> createDatasetFromRows(org.apache.spark.sql.SparkSession session, org.apache.beam.sdk.io.BoundedSource<T> source, java.util.function.Supplier<org.apache.beam.sdk.options.PipelineOptions> options, org.apache.spark.sql.Encoder<org.apache.beam.sdk.util.WindowedValue<T>> encoder)
          Create a Dataset for a BoundedSource via a Spark Table.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • createDatasetFromRows

        public static <T> org.apache.spark.sql.Dataset<org.apache.beam.sdk.util.WindowedValue<T>> createDatasetFromRows(org.apache.spark.sql.SparkSession session,
                                                                                                                        org.apache.beam.sdk.io.BoundedSource<T> source,
                                                                                                                        java.util.function.Supplier<org.apache.beam.sdk.options.PipelineOptions> options,
                                                                                                                        org.apache.spark.sql.Encoder<org.apache.beam.sdk.util.WindowedValue<T>> encoder)
        Create a Dataset for a BoundedSource via a Spark Table.

        Unfortunately, tables are expected to return an InternalRow, which requires serialization. For the time being, this makes this approach significantly less performant than creating a Dataset from an RDD.

      • createDatasetFromRDD

        public static <T> org.apache.spark.sql.Dataset<org.apache.beam.sdk.util.WindowedValue<T>> createDatasetFromRDD(org.apache.spark.sql.SparkSession session,
                                                                                                                       org.apache.beam.sdk.io.BoundedSource<T> source,
                                                                                                                       java.util.function.Supplier<org.apache.beam.sdk.options.PipelineOptions> options,
                                                                                                                       org.apache.spark.sql.Encoder<org.apache.beam.sdk.util.WindowedValue<T>> encoder)
        Create a Dataset for a BoundedSource via a Spark RDD.

        This is currently the most efficient approach, as it avoids any serialization overhead.
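        As a minimal sketch of how the RDD-based factory method might be called: the helper below is illustrative only (`mySource` and `myEncoder` are hypothetical stand-ins for a concrete BoundedSource and a matching WindowedValue encoder, and it assumes Spark and Beam are on the classpath).

```java
import java.util.function.Supplier;

import org.apache.beam.sdk.io.BoundedSource;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.util.WindowedValue;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.SparkSession;

public class BoundedDatasetFactoryExample {

  // Sketch only: reads a BoundedSource into a Dataset of WindowedValues via
  // the RDD path, which the docs above describe as the more efficient option.
  static <T> Dataset<WindowedValue<T>> read(
      SparkSession session,
      BoundedSource<T> mySource,              // hypothetical source instance
      Encoder<WindowedValue<T>> myEncoder) {  // hypothetical matching encoder
    // Options are supplied via a Supplier, so they can be materialized lazily;
    // using default options here purely for illustration.
    Supplier<PipelineOptions> options = PipelineOptionsFactory::create;
    return BoundedDatasetFactory.createDatasetFromRDD(session, mySource, options, myEncoder);
  }
}
```

        The table-based variant, createDatasetFromRows, takes the same arguments; the trade-off is the InternalRow serialization cost noted above.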