org.apache.spark.sql.delta.commands.merge
MergeIntoMaterializeSource
Companion object MergeIntoMaterializeSource
trait MergeIntoMaterializeSource extends DeltaLogging with DeltaSparkPlanUtils
Trait with logic and utilities used for materializing a snapshot of the MERGE source in case we can't guarantee deterministic repeated reads from it.
We materialize the source if it is not safe to assume that it is deterministic (override with MERGE_SOURCE_MATERIALIZATION). Otherwise, if the source changes between the phases of the MERGE, it can produce wrong results. We use local checkpointing for the materialization, which saves the source as a materialized RDD[InternalRow] on the executor local disks.
The first concern is that if an executor is lost, this data can be lost with it. When the Spark executor decommissioning API is used, it should attempt to move this materialized data safely out before removing the executor.
The second concern is that if an executor is lost for another reason (e.g. a spot instance kill), we will still lose that data. To mitigate this, we implement a retry loop: the whole MERGE operation needs to be restarted from the beginning in this case. When we retry, we increase the replication level of the materialized data from 1 to 2 (override with MERGE_SOURCE_MATERIALIZATION_RDD_STORAGE_LEVEL_RETRY). If the operation still fails after the maximum number of attempts (MERGE_MATERIALIZE_SOURCE_MAX_ATTEMPTS), we record the failure for tracking purposes.
The third concern is that executors may run out of disk space due to the extra materialization. We record such failures for tracking purposes.
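The retry strategy described above can be sketched in plain Scala. All names below are illustrative, not the actual Delta implementation: the point is only that a lost materialized source aborts the current attempt, the whole operation restarts from the beginning, and the replication level is raised from 1 to 2 on retries.

```scala
// Hypothetical sketch of the retry loop described above; not the real
// Delta code. A lost materialized source aborts the current attempt and
// the whole operation restarts from scratch with higher replication.
object RetrySketch {
  final case class SourceLostException(attempt: Int) extends Exception

  /** Runs `merge` up to `maxAttempts` times, passing it the attempt number
    * and the replication level to use when materializing the source. */
  def runWithRetries[A](maxAttempts: Int)(merge: (Int, Int) => A): A = {
    var attempt = 1
    while (true) {
      val replication = if (attempt == 1) 1 else 2 // escalate on retry
      try {
        return merge(attempt, replication)
      } catch {
        case SourceLostException(_) if attempt < maxAttempts =>
          attempt += 1 // restart the whole operation from the beginning
      }
    }
    sys.error("unreachable")
  }
}
```

On the final attempt the exception is allowed to propagate, which is the point where the real implementation records the failure for tracking.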
- By Inheritance
- MergeIntoMaterializeSource
- DeltaSparkPlanUtils
- DeltaLogging
- DatabricksLogging
- DeltaProgressReporter
- Logging
- AnyRef
- Any
Type Members
- type PlanOrExpression = Either[LogicalPlan, Expression]
- Definition Classes
- DeltaSparkPlanUtils
Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##: Int
- Definition Classes
- AnyRef → Any
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- val attempt: Int
Tracks which attempt or retry it is in runWithMaterializedSourceLostRetries.
- Attributes
- protected
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @native()
- def collectFirst[In, Out](input: Iterable[In], recurse: (In) => Option[Out]): Option[Out]
- Attributes
- protected
- Definition Classes
- DeltaSparkPlanUtils
- def containsDeterministicUDF(expr: Expression): Boolean
Returns whether an expression contains any deterministic UDFs.
- Definition Classes
- DeltaSparkPlanUtils
- def containsDeterministicUDF(predicates: Seq[DeltaTableReadPredicate], partitionedOnly: Boolean): Boolean
Returns whether the read predicates of a transaction contain any deterministic UDFs.
- Definition Classes
- DeltaSparkPlanUtils
- def deltaAssert(check: => Boolean, name: String, msg: String, deltaLog: DeltaLog = null, data: AnyRef = null, path: Option[Path] = None): Unit
Helper method to check invariants in Delta code. Fails when running in tests, records a delta assertion event and logs a warning otherwise.
- Attributes
- protected
- Definition Classes
- DeltaLogging
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef → Any
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.Throwable])
- def findFirstNonDeltaScan(source: LogicalPlan): Option[LogicalPlan]
- Attributes
- protected
- Definition Classes
- DeltaSparkPlanUtils
- def findFirstNonDeterministicChildNode(children: Seq[Expression], checkDeterministicOptions: CheckDeterministicOptions): Option[PlanOrExpression]
- Attributes
- protected
- Definition Classes
- DeltaSparkPlanUtils
- def findFirstNonDeterministicNode(child: Expression, checkDeterministicOptions: CheckDeterministicOptions): Option[PlanOrExpression]
- Attributes
- protected
- Definition Classes
- DeltaSparkPlanUtils
- def findFirstNonDeterministicNode(plan: LogicalPlan, checkDeterministicOptions: CheckDeterministicOptions): Option[PlanOrExpression]
Returns a part of the plan that does not have a safe level of determinism. This is a conservative approximation of plan being a truly deterministic query.
- Attributes
- protected
- Definition Classes
- DeltaSparkPlanUtils
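The search performed by findFirstNonDeterministicNode can be illustrated on a toy expression tree. This is a hand-rolled sketch, not the Catalyst types the real method walks: return the first node not known to be deterministic, or None when the whole tree is conservatively safe.

```scala
// Toy sketch of a conservative determinism scan; the real method operates
// on Catalyst plans/expressions, here replaced by a minimal tree.
object DeterminismSketch {
  sealed trait Expr { def deterministic: Boolean; def children: Seq[Expr] }
  final case class Literal(value: Any) extends Expr {
    val deterministic = true
    val children = Seq.empty
  }
  final case class Add(left: Expr, right: Expr) extends Expr {
    val deterministic = true
    val children = Seq(left, right)
  }
  final case class Rand() extends Expr { // a non-deterministic leaf
    val deterministic = false
    val children = Seq.empty
  }

  /** Depth-first search for the first node that is not known deterministic.
    * None means repeated reads of the tree are (conservatively) safe. */
  def findFirstNonDeterministicNode(e: Expr): Option[Expr] =
    if (!e.deterministic) Some(e)
    else e.children.view.flatMap(findFirstNonDeterministicNode).headOption
}
```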
- final def getClass(): Class[_ <: AnyRef]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def getCommonTags(deltaLog: DeltaLog, tahoeId: String): Map[TagDefinition, String]
- Definition Classes
- DeltaLogging
- def getErrorData(e: Throwable): Map[String, Any]
- Definition Classes
- DeltaLogging
- def getMergeSource: MergeSource
Returns the prepared merge source.
- Attributes
- protected
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
- Attributes
- protected
- Definition Classes
- Logging
- def initializeLogIfNecessary(isInterpreter: Boolean): Unit
- Attributes
- protected
- Definition Classes
- Logging
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- def isTraceEnabled(): Boolean
- Attributes
- protected
- Definition Classes
- Logging
- def log: Logger
- Attributes
- protected
- Definition Classes
- Logging
- def logConsole(line: String): Unit
- Definition Classes
- DatabricksLogging
- def logDebug(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logDebug(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logError(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logError(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logInfo(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logInfo(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logName: String
- Attributes
- protected
- Definition Classes
- Logging
- def logTrace(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logTrace(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logWarning(msg: => String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
- def logWarning(msg: => String): Unit
- Attributes
- protected
- Definition Classes
- Logging
- val materializedSourceRDD: Option[RDD[InternalRow]]
If the source was materialized, reference to the checkpointed RDD.
- Attributes
- protected
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- def planContainsOnlyDeltaScans(source: LogicalPlan): Boolean
- Attributes
- protected
- Definition Classes
- DeltaSparkPlanUtils
- def planContainsUdf(plan: LogicalPlan): Boolean
- Attributes
- protected
- Definition Classes
- DeltaSparkPlanUtils
- def planIsDeterministic(plan: LogicalPlan, checkDeterministicOptions: CheckDeterministicOptions): Boolean
Returns true if plan has a safe level of determinism. This is a conservative approximation of plan being a truly deterministic query.
- Attributes
- protected
- Definition Classes
- DeltaSparkPlanUtils
- def prepareMergeSource(spark: SparkSession, source: LogicalPlan, condition: Expression, matchedClauses: Seq[DeltaMergeIntoMatchedClause], notMatchedClauses: Seq[DeltaMergeIntoNotMatchedClause], isInsertOnly: Boolean): Unit
If the source needs to be materialized, prepares the materialized DataFrame in sourceDF; otherwise, prepares a regular DataFrame.
- returns
the source materialization reason
- Attributes
- protected
- def recordDeltaEvent(deltaLog: DeltaLog, opType: String, tags: Map[TagDefinition, String] = Map.empty, data: AnyRef = null, path: Option[Path] = None): Unit
Used to record the occurrence of a single event or report detailed, operation specific statistics.
- path
Used to log the path of the delta table when deltaLog is null.
- Attributes
- protected
- Definition Classes
- DeltaLogging
- def recordDeltaOperation[A](deltaLog: DeltaLog, opType: String, tags: Map[TagDefinition, String] = Map.empty)(thunk: => A): A
Used to report the duration as well as the success or failure of an operation on a deltaLog.
- Attributes
- protected
- Definition Classes
- DeltaLogging
- def recordDeltaOperationForTablePath[A](tablePath: String, opType: String, tags: Map[TagDefinition, String] = Map.empty)(thunk: => A): A
Used to report the duration as well as the success or failure of an operation on a tahoePath.
- Attributes
- protected
- Definition Classes
- DeltaLogging
- def recordEvent(metric: MetricDefinition, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, trimBlob: Boolean = true): Unit
- Definition Classes
- DatabricksLogging
- def recordFrameProfile[T](group: String, name: String)(thunk: => T): T
- Attributes
- protected
- Definition Classes
- DeltaLogging
- def recordOperation[S](opType: OpType, opTarget: String = null, extraTags: Map[TagDefinition, String], isSynchronous: Boolean = true, alwaysRecordStats: Boolean = false, allowAuthTags: Boolean = false, killJvmIfStuck: Boolean = false, outputMetric: MetricDefinition = METRIC_OPERATION_DURATION, silent: Boolean = true)(thunk: => S): S
- Definition Classes
- DatabricksLogging
- def recordProductEvent(metric: MetricDefinition with CentralizableMetric, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, trimBlob: Boolean = true): Unit
- Definition Classes
- DatabricksLogging
- def recordProductUsage(metric: MetricDefinition with CentralizableMetric, quantity: Double, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, forceSample: Boolean = false, trimBlob: Boolean = true, silent: Boolean = false): Unit
- Definition Classes
- DatabricksLogging
- def recordUsage(metric: MetricDefinition, quantity: Double, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, forceSample: Boolean = false, trimBlob: Boolean = true, silent: Boolean = false): Unit
- Definition Classes
- DatabricksLogging
- def runWithMaterializedSourceLostRetries(spark: SparkSession, deltaLog: DeltaLog, metrics: Map[String, SQLMetric], runMergeFunc: (SparkSession) => Seq[Row]): Seq[Row]
Run the MERGE with retries in case it detects an RDD block lost error of the materialized source RDD. It will also record an out-of-disk error if one happens, possibly because of increased disk pressure from the materialized source RDD.
- Attributes
- protected
- def shouldMaterializeSource(spark: SparkSession, source: LogicalPlan, isInsertOnly: Boolean): (Boolean, MergeIntoMaterializeSourceReason)
- returns
a pair of a boolean indicating whether the source should be materialized, and the source materialization reason
- Attributes
- protected
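The (Boolean, reason) return shape of shouldMaterializeSource can be sketched as follows. The mode strings, reason names, and exact precedence here are illustrative assumptions, not the actual Delta configuration values or enum cases.

```scala
// Illustrative sketch of a materialization decision with the same return
// shape as shouldMaterializeSource. Modes and reasons are hypothetical.
object MaterializeDecisionSketch {
  sealed trait Reason
  case object MaterializeForcedByConfig extends Reason
  case object NotMaterializedDisabledByConfig extends Reason
  case object NotMaterializedInsertOnly extends Reason
  case object NotMaterializedDeterministic extends Reason
  case object MaterializeNonDeterministic extends Reason

  def shouldMaterialize(
      mode: String,                 // assumed "auto" | "all" | "none"
      planIsDeterministic: Boolean, // result of a conservative plan check
      isInsertOnly: Boolean): (Boolean, Reason) =
    mode match {
      case "all"  => (true, MaterializeForcedByConfig)
      case "none" => (false, NotMaterializedDisabledByConfig)
      case _ =>
        // An insert-only merge reads the source only once, so repeated-read
        // determinism is not a concern for it.
        if (isInsertOnly) (false, NotMaterializedInsertOnly)
        else if (planIsDeterministic) (false, NotMaterializedDeterministic)
        else (true, MaterializeNonDeterministic)
    }
}
```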
- final def synchronized[T0](arg0: => T0): T0
- Definition Classes
- AnyRef
- def toString(): String
- Definition Classes
- AnyRef → Any
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()
- def withStatusCode[T](statusCode: String, defaultMessage: String, data: Map[String, Any] = Map.empty)(body: => T): T
Report a log to indicate some command is running.
- Definition Classes
- DeltaProgressReporter
- object RetryHandling extends Enumeration
- object SubqueryExpression
Extractor object for the subquery plan of expressions that contain subqueries.
- Definition Classes
- DeltaSparkPlanUtils