Class SparkRunner


  • public final class SparkRunner
    extends org.apache.beam.sdk.PipelineRunner<SparkPipelineResult>
    The SparkRunner translate operations defined on a pipeline to a representation executable by Spark, and then submitting the job to Spark to be executed. If we wanted to run a Beam pipeline with the default options of a single threaded spark instance in local mode, we would do the following:

    Pipeline p = [logic for pipeline creation] SparkPipelineResult result = (SparkPipelineResult) p.run();

    To create a pipeline runner to run against a different spark cluster, with a custom master url we would do the following:

    Pipeline p = [logic for pipeline creation] SparkPipelineOptions options = SparkPipelineOptionsFactory.create(); options.setSparkMaster("spark://host:port"); SparkPipelineResult result = (SparkPipelineResult) p.run();

    • Method Detail

      • create

        public static SparkRunner create()
        Creates and returns a new SparkRunner with default options. In particular, against a spark instance running in local mode.
        Returns:
        A pipeline runner with default options.
      • create

        public static SparkRunner create​(SparkPipelineOptions options)
        Creates and returns a new SparkRunner with specified options.
        Parameters:
        options - The SparkPipelineOptions to use when executing the job.
        Returns:
        A pipeline runner that will execute with specified options.
      • fromOptions

        public static SparkRunner fromOptions​(org.apache.beam.sdk.options.PipelineOptions options)
        Creates and returns a new SparkRunner with specified options.
        Parameters:
        options - The PipelineOptions to use when executing the job.
        Returns:
        A pipeline runner that will execute with specified options.
      • initAccumulators

        public static void initAccumulators​(SparkPipelineOptions opts,
                                            org.apache.spark.api.java.JavaSparkContext jsc)
        Init Metrics/Aggregators accumulators. This method is idempotent.
      • updateCacheCandidates

        public static void updateCacheCandidates​(org.apache.beam.sdk.Pipeline pipeline,
                                                 SparkPipelineTranslator translator,
                                                 EvaluationContext evaluationContext)
        Evaluator that update/populate the cache candidates.