Class BoundedDatasetFactory
- java.lang.Object
  - org.apache.beam.runners.spark.structuredstreaming.io.BoundedDatasetFactory

public class BoundedDatasetFactory extends java.lang.Object
Method Summary
- static <T> org.apache.spark.sql.Dataset<org.apache.beam.sdk.util.WindowedValue<T>> createDatasetFromRDD(org.apache.spark.sql.SparkSession session, org.apache.beam.sdk.io.BoundedSource<T> source, java.util.function.Supplier<org.apache.beam.sdk.options.PipelineOptions> options, org.apache.spark.sql.Encoder<org.apache.beam.sdk.util.WindowedValue<T>> encoder)
  Create a Dataset for a BoundedSource via a Spark RDD.
- static <T> org.apache.spark.sql.Dataset<org.apache.beam.sdk.util.WindowedValue<T>> createDatasetFromRows(org.apache.spark.sql.SparkSession session, org.apache.beam.sdk.io.BoundedSource<T> source, java.util.function.Supplier<org.apache.beam.sdk.options.PipelineOptions> options, org.apache.spark.sql.Encoder<org.apache.beam.sdk.util.WindowedValue<T>> encoder)
  Create a Dataset for a BoundedSource via a Spark Table.
Method Detail
createDatasetFromRows
public static <T> org.apache.spark.sql.Dataset<org.apache.beam.sdk.util.WindowedValue<T>> createDatasetFromRows(org.apache.spark.sql.SparkSession session, org.apache.beam.sdk.io.BoundedSource<T> source, java.util.function.Supplier<org.apache.beam.sdk.options.PipelineOptions> options, org.apache.spark.sql.Encoder<org.apache.beam.sdk.util.WindowedValue<T>> encoder)

Create a Dataset for a BoundedSource via a Spark Table. Unfortunately, tables are expected to return an InternalRow, which requires serialization. This currently makes the approach significantly less performant than creating a dataset from an RDD.
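
The cost described above can be illustrated with a rough, JDK-only sketch that uses neither Spark nor Beam. All names in it are hypothetical, and Java serialization is only a stand-in for whatever encoding Spark applies when producing an InternalRow. `viaBytes` mimics the table path, where each element is encoded to bytes and decoded again; `viaReferences` mimics the RDD path, where element references are handed over unchanged.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Beam or Spark.
public class SerializationOverheadSketch {

  // Stand-in for the table path: every element makes a round trip through bytes.
  static List<String> viaBytes(List<String> elements)
      throws IOException, ClassNotFoundException {
    List<String> out = new ArrayList<>();
    for (String element : elements) {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      try (ObjectOutputStream encoder = new ObjectOutputStream(buffer)) {
        encoder.writeObject(element);
      }
      try (ObjectInputStream decoder =
          new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
        out.add((String) decoder.readObject());
      }
    }
    return out;
  }

  // Stand-in for the RDD path: the same object references are passed along.
  static List<String> viaReferences(List<String> elements) {
    return new ArrayList<>(elements);
  }

  public static void main(String[] args) throws Exception {
    List<String> input = List.of("a", "b", "c");
    // Both paths yield equal values; only the byte path pays for encoding and decoding.
    System.out.println(viaBytes(input).equals(viaReferences(input)));
  }
}
```

Both paths produce the same elements. The difference is the per-element encode and decode work, which is the kind of overhead the RDD-based method is described as avoiding.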
createDatasetFromRDD
public static <T> org.apache.spark.sql.Dataset<org.apache.beam.sdk.util.WindowedValue<T>> createDatasetFromRDD(org.apache.spark.sql.SparkSession session, org.apache.beam.sdk.io.BoundedSource<T> source, java.util.function.Supplier<org.apache.beam.sdk.options.PipelineOptions> options, org.apache.spark.sql.Encoder<org.apache.beam.sdk.util.WindowedValue<T>> encoder)

Create a Dataset for a BoundedSource via a Spark RDD. This is currently the most efficient approach, as it avoids any serialization overhead.
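
A call might look like the following minimal sketch, which needs the Beam Spark runner and Spark on the classpath. The wrapper class `BoundedDatasetExample` and its method `readBounded` are hypothetical names. The caller is assumed to already hold a suitable `Encoder<WindowedValue<T>>`, since this page does not say how one is obtained. The `Serializable` intersection cast on the supplier is also an assumption, made in case Spark has to ship the supplier to executors; the signature itself only asks for a `Supplier`.

```java
import java.io.Serializable;
import java.util.function.Supplier;
import org.apache.beam.runners.spark.structuredstreaming.io.BoundedDatasetFactory;
import org.apache.beam.sdk.io.BoundedSource;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.util.WindowedValue;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.SparkSession;

// Hypothetical helper; only BoundedDatasetFactory.createDatasetFromRDD comes from this page.
public final class BoundedDatasetExample {

  private BoundedDatasetExample() {}

  static <T> Dataset<WindowedValue<T>> readBounded(
      SparkSession session,
      BoundedSource<T> source,
      Encoder<WindowedValue<T>> encoder) {
    // Assumption: default options are sufficient and the supplier should be serializable.
    Supplier<PipelineOptions> options =
        (Supplier<PipelineOptions> & Serializable) () -> PipelineOptionsFactory.create();
    // Prefer the RDD-based factory method, described above as the most efficient.
    return BoundedDatasetFactory.createDatasetFromRDD(session, source, options, encoder);
  }
}
```

Swapping in `createDatasetFromRows` would need no other change, because both methods take the same four parameters.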