public class HiveIncrPullSource extends AvroSource

Reads data written by HiveIncrementalPuller, commit by commit, and applies it to the target table. The general idea is to have commits sync across the data pipeline:

[Source Table(s)] ====> HiveIncrementalScanner ====> incrPullRootPath ====> targetTable
  {c1,c2,c3,...}                                      {c1,c2,c3,...}        {c1,c2,c3,...}

This produces beautiful causality, which makes data issues in ETLs very easy to debug.
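The commit-by-commit pull above can be sketched as follows. This is a simplified, self-contained illustration of the checkpointing idea behind `fetchNewData`, not Hudi's actual implementation: the class, method, and folder-naming conventions here are assumptions, and the `sourceLimit` bound is omitted for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified sketch: commit folders under incrPullRootPath are assumed to be
// named by commit time, so lexicographic order equals commit order.
public class IncrPullSketch {

    /** Returns the commit folders strictly newer than the last checkpoint. */
    static List<String> commitsToPull(List<String> commitFolders, String lastCheckpoint) {
        List<String> sorted = new ArrayList<>(commitFolders);
        Collections.sort(sorted); // commit times sort lexicographically
        List<String> newCommits = new ArrayList<>();
        for (String commit : sorted) {
            // A null checkpoint means a first run: every commit is new.
            if (lastCheckpoint == null || commit.compareTo(lastCheckpoint) > 0) {
                newCommits.add(commit); // not yet applied to the target table
            }
        }
        return newCommits;
    }

    public static void main(String[] args) {
        List<String> folders = List.of("20190101", "20190102", "20190103");
        // First run: no checkpoint, so all commits are pulled.
        System.out.println(commitsToPull(folders, null));        // [20190101, 20190102, 20190103]
        // Later run: only commits after the checkpoint are pulled, and the
        // last one returned becomes the new checkpoint.
        System.out.println(commitsToPull(folders, "20190101"));  // [20190102, 20190103]
    }
}
```

Because each run resumes strictly after the previous checkpoint, every commit is applied exactly once to the target table, which is what keeps the causal chain across the pipeline intact.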
Nested classes inherited from class Source: `Source.SourceType`

Fields inherited from class Source: `props`, `sparkContext`, `sparkSession`

| Constructor and Description |
|---|
| `HiveIncrPullSource(TypedProperties props, org.apache.spark.api.java.JavaSparkContext sparkContext, org.apache.spark.sql.SparkSession sparkSession, SchemaProvider schemaProvider)` |
| Modifier and Type | Method and Description |
|---|---|
| `protected InputBatch<org.apache.spark.api.java.JavaRDD<org.apache.avro.generic.GenericRecord>>` | `fetchNewData(Option<String> lastCheckpointStr, long sourceLimit)` |
Methods inherited from class Source: `fetchNext`, `getSourceType`, `getSparkSession`

Constructor detail:

`public HiveIncrPullSource(TypedProperties props, org.apache.spark.api.java.JavaSparkContext sparkContext, org.apache.spark.sql.SparkSession sparkSession, SchemaProvider schemaProvider)`
Method detail:

`protected InputBatch<org.apache.spark.api.java.JavaRDD<org.apache.avro.generic.GenericRecord>> fetchNewData(Option<String> lastCheckpointStr, long sourceLimit)`

Overrides: `fetchNewData` in class `Source<org.apache.spark.api.java.JavaRDD<org.apache.avro.generic.GenericRecord>>`

Copyright © 2019 The Apache Software Foundation. All rights reserved.