org.apache.spark.sql.delta.commands.merge
MergeIntoMaterializeSource
Companion object MergeIntoMaterializeSource
trait MergeIntoMaterializeSource extends DeltaLogging with DeltaSparkPlanUtils
Trait with logic and utilities for materializing a snapshot of the MERGE source when deterministic repeated reads from it cannot be guaranteed.
We materialize the source if it is not safe to assume that it is deterministic (override with MERGE_SOURCE_MATERIALIZATION). Otherwise, if the source changes between the phases of the MERGE, it can produce wrong results. We use local checkpointing for the materialization, which saves the source as a materialized RDD[InternalRow] on the executor-local disks.
The first concern is that if an executor is lost, this data can be lost with it. When the Spark executor decommissioning API is used, it should attempt to move this materialized data safely off the executor before removing it.
The second concern is that if an executor is lost for another reason (e.g. a spot kill), we will still lose that data. To mitigate that, we implement a retry loop: the whole MERGE operation is restarted from the beginning. On retry, we increase the replication level of the materialized data from 1 to 2 (override with MERGE_SOURCE_MATERIALIZATION_RDD_STORAGE_LEVEL_RETRY). If it still fails after the maximum number of attempts (MERGE_MATERIALIZE_SOURCE_MAX_ATTEMPTS), we record the failure for tracking purposes.
The third concern is that executors may run out of disk space due to the extra materialization. We record such failures for tracking purposes.
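The materialization-with-retry approach described above can be approximated in user code with Spark's public Dataset API. This is an illustrative sketch only, not the trait's private implementation; the attempt-dependent storage level mirrors the behavior described for MERGE_SOURCE_MATERIALIZATION_RDD_STORAGE_LEVEL_RETRY:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

// Sketch: pin a possibly non-deterministic MERGE source to executor-local
// disks so that every phase of the command reads the same snapshot.
def materializeSketch(source: DataFrame, attempt: Int): DataFrame = {
  // On a retry (attempt > 1), replicate blocks to two executors so that a
  // single lost executor no longer loses the materialized data.
  val level =
    if (attempt > 1) StorageLevel.DISK_ONLY_2 else StorageLevel.DISK_ONLY
  source.persist(level)
  // localCheckpoint truncates lineage and keeps the data only as
  // executor-local RDD blocks (RDD[InternalRow] under the hood).
  source.localCheckpoint(eager = true)
}
```

With lineage truncated, a lost block cannot be recomputed from the original source, which is exactly why the retry loop below restarts the whole MERGE.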
- By Inheritance
- MergeIntoMaterializeSource
- DeltaSparkPlanUtils
- DeltaLogging
- DatabricksLogging
- DeltaProgressReporter
- LoggingShims
- Logging
- AnyRef
- Any
Type Members
-
implicit
class
LogStringContext extends AnyRef
- Definition Classes
- LoggingShims
-
type
PlanOrExpression = Either[LogicalPlan, Expression]
- Definition Classes
- DeltaSparkPlanUtils
Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
val
attempt: Int
Tracks which attempt or retry it is in runWithMaterializedSourceAndRetries.
- Attributes
- protected
-
def
clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
def
collectFirst[In, Out](input: Iterable[In], recurse: (In) ⇒ Option[Out]): Option[Out]
- Attributes
- protected
- Definition Classes
- DeltaSparkPlanUtils
-
def
containsDeterministicUDF(expr: Expression): Boolean
Returns whether an expression contains any deterministic UDFs.
- Definition Classes
- DeltaSparkPlanUtils
-
def
containsDeterministicUDF(predicates: Seq[DeltaTableReadPredicate], partitionedOnly: Boolean): Boolean
Returns whether the read predicates of a transaction contain any deterministic UDFs.
- Definition Classes
- DeltaSparkPlanUtils
-
def
deltaAssert(check: ⇒ Boolean, name: String, msg: String, deltaLog: DeltaLog = null, data: AnyRef = null, path: Option[Path] = None): Unit
Helper method to check invariants in Delta code. Fails when running in tests, records a delta assertion event and logs a warning otherwise.
- Attributes
- protected
- Definition Classes
- DeltaLogging
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
-
def
findFirstNonDeltaScan(source: LogicalPlan): Option[LogicalPlan]
- Attributes
- protected
- Definition Classes
- DeltaSparkPlanUtils
-
def
findFirstNonDeterministicChildNode(children: Seq[Expression], checkDeterministicOptions: CheckDeterministicOptions): Option[PlanOrExpression]
- Attributes
- protected
- Definition Classes
- DeltaSparkPlanUtils
-
def
findFirstNonDeterministicNode(child: Expression, checkDeterministicOptions: CheckDeterministicOptions): Option[PlanOrExpression]
- Attributes
- protected
- Definition Classes
- DeltaSparkPlanUtils
-
def
findFirstNonDeterministicNode(plan: LogicalPlan, checkDeterministicOptions: CheckDeterministicOptions): Option[PlanOrExpression]
Returns a part of the plan that does not have a safe level of determinism. This is a conservative approximation of plan being a truly deterministic query.
- Attributes
- protected
- Definition Classes
- DeltaSparkPlanUtils
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
getCommonTags(deltaLog: DeltaLog, tahoeId: String): Map[TagDefinition, String]
- Definition Classes
- DeltaLogging
-
def
getErrorData(e: Throwable): Map[String, Any]
- Definition Classes
- DeltaLogging
-
def
getMergeSource: MergeSource
Returns the prepared merge source.
- Attributes
- protected
-
def
hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
def
initializeLogIfNecessary(isInterpreter: Boolean): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
def
isTraceEnabled(): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
def
log: Logger
- Attributes
- protected
- Definition Classes
- Logging
-
def
logConsole(line: String): Unit
- Definition Classes
- DatabricksLogging
-
def
logDebug(entry: LogEntry, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logDebug(entry: LogEntry): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logDebug(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logDebug(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(entry: LogEntry, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logError(entry: LogEntry): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logError(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(entry: LogEntry, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logInfo(entry: LogEntry): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logInfo(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logName: String
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(entry: LogEntry, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logTrace(entry: LogEntry): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logTrace(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(entry: LogEntry, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logWarning(entry: LogEntry): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logWarning(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
val
materializedSourceRDD: Option[RDD[InternalRow]]
If the source was materialized, reference to the checkpointed RDD.
- Attributes
- protected
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
def
planContainsOnlyDeltaScans(source: LogicalPlan): Boolean
- Attributes
- protected
- Definition Classes
- DeltaSparkPlanUtils
-
def
planContainsUdf(plan: LogicalPlan): Boolean
- Attributes
- protected
- Definition Classes
- DeltaSparkPlanUtils
-
def
planIsDeterministic(plan: LogicalPlan, checkDeterministicOptions: CheckDeterministicOptions): Boolean
Returns true if plan has a safe level of determinism. This is a conservative approximation of plan being a truly deterministic query.
- Attributes
- protected
- Definition Classes
- DeltaSparkPlanUtils
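As an illustration of the kind of non-determinism this check guards against (a hypothetical example using public Spark APIs; the method itself is protected and operates on logical plans):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

// Assumes a local SparkSession for the example.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

// A plan built only from deterministic operators re-reads identically.
val deterministic = spark.range(10).toDF("id")

// rand() makes the plan non-deterministic: scanning it twice between MERGE
// phases could yield different rows, so such a source would be materialized.
val nonDeterministic = deterministic.withColumn("r", rand())
```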
-
def
prepareMergeSource(spark: SparkSession, source: LogicalPlan, condition: Expression, matchedClauses: Seq[DeltaMergeIntoMatchedClause], notMatchedClauses: Seq[DeltaMergeIntoNotMatchedClause], isInsertOnly: Boolean): Unit
If the source needs to be materialized, prepares the materialized dataframe in sourceDF; otherwise, prepares a regular dataframe.
- returns
the source materialization reason
- Attributes
- protected
-
def
recordDeltaEvent(deltaLog: DeltaLog, opType: String, tags: Map[TagDefinition, String] = Map.empty, data: AnyRef = null, path: Option[Path] = None): Unit
Used to record the occurrence of a single event or report detailed, operation specific statistics.
- path
Used to log the path of the delta table when deltaLog is null.
- Attributes
- protected
- Definition Classes
- DeltaLogging
-
def
recordDeltaOperation[A](deltaLog: DeltaLog, opType: String, tags: Map[TagDefinition, String] = Map.empty)(thunk: ⇒ A): A
Used to report the duration as well as the success or failure of an operation on a deltaLog.
- Attributes
- protected
- Definition Classes
- DeltaLogging
-
def
recordDeltaOperationForTablePath[A](tablePath: String, opType: String, tags: Map[TagDefinition, String] = Map.empty)(thunk: ⇒ A): A
Used to report the duration as well as the success or failure of an operation on a tahoePath.
- Attributes
- protected
- Definition Classes
- DeltaLogging
-
def
recordEvent(metric: MetricDefinition, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, trimBlob: Boolean = true): Unit
- Definition Classes
- DatabricksLogging
-
def
recordFrameProfile[T](group: String, name: String)(thunk: ⇒ T): T
- Attributes
- protected
- Definition Classes
- DeltaLogging
-
def
recordOperation[S](opType: OpType, opTarget: String = null, extraTags: Map[TagDefinition, String], isSynchronous: Boolean = true, alwaysRecordStats: Boolean = false, allowAuthTags: Boolean = false, killJvmIfStuck: Boolean = false, outputMetric: MetricDefinition = METRIC_OPERATION_DURATION, silent: Boolean = true)(thunk: ⇒ S): S
- Definition Classes
- DatabricksLogging
-
def
recordProductEvent(metric: MetricDefinition with CentralizableMetric, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, trimBlob: Boolean = true): Unit
- Definition Classes
- DatabricksLogging
-
def
recordProductUsage(metric: MetricDefinition with CentralizableMetric, quantity: Double, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, forceSample: Boolean = false, trimBlob: Boolean = true, silent: Boolean = false): Unit
- Definition Classes
- DatabricksLogging
-
def
recordUsage(metric: MetricDefinition, quantity: Double, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, forceSample: Boolean = false, trimBlob: Boolean = true, silent: Boolean = false): Unit
- Definition Classes
- DatabricksLogging
-
def
runWithMaterializedSourceLostRetries(spark: SparkSession, deltaLog: DeltaLog, metrics: Map[String, SQLMetric], runMergeFunc: (SparkSession) ⇒ Seq[Row]): Seq[Row]
Run the MERGE with retries in case it detects an RDD-block-lost error for the materialized source RDD. It also records out-of-disk errors if they occur, possibly caused by increased disk pressure from the materialized source RDD.
- Attributes
- protected
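The retry behavior this method documents can be pictured as a plain retry loop. This is a simplified sketch, not the actual implementation; in particular, isMaterializedSourceLost is a hypothetical placeholder for the trait's internal error classification, and maxAttempts stands in for MERGE_MATERIALIZE_SOURCE_MAX_ATTEMPTS:

```scala
// Hypothetical classifier: the real trait inspects the failure for lost
// materialized-RDD-block errors; here it is only a placeholder.
def isMaterializedSourceLost(e: Throwable): Boolean = false

// Restart the whole MERGE from the beginning when the materialized source
// is lost; each new attempt re-materializes with higher replication.
def runWithRetriesSketch[A](maxAttempts: Int)(runMerge: Int => A): A = {
  var attempt = 1
  var result: Option[A] = None
  while (result.isEmpty) {
    try {
      result = Some(runMerge(attempt))
    } catch {
      case e: Exception if attempt < maxAttempts && isMaterializedSourceLost(e) =>
        attempt += 1 // retry the entire operation from scratch
    }
  }
  result.get
}
```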
-
def
shouldMaterializeSource(spark: SparkSession, source: LogicalPlan, isInsertOnly: Boolean): (Boolean, MergeIntoMaterializeSourceReason)
- returns
pair of boolean whether source should be materialized and the source materialization reason
- Attributes
- protected
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
def
toString(): String
- Definition Classes
- AnyRef → Any
-
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
def
withStatusCode[T](statusCode: String, defaultMessage: String, data: Map[String, Any] = Map.empty)(body: ⇒ T): T
Report a log to indicate some command is running.
- Definition Classes
- DeltaProgressReporter
- object RetryHandling extends Enumeration
-
object
SubqueryExpression
Extractor object for the subquery plan of expressions that contain subqueries.
- Definition Classes
- DeltaSparkPlanUtils