package merge
Type Members
- trait ClassicMergeExecutor extends MergeOutputGeneration
Trait with merge execution in two phases:
Phase 1: Find the input files in the target that are touched by the rows satisfying the condition, and verify that no two source rows match the same target row. This is implemented as an inner join using the given condition (see findTouchedFiles). In the special case that there is no update clause, we write all the non-matching source data as new files and skip phase 2. An error is raised when the ON search_condition of the MERGE statement can match a single row of the target table with multiple rows of the source table reference.
Phase 2: Read the touched files again and write new files with updated and/or inserted rows. If there are updates, then use an outer join using the given condition to write the updates and inserts (see writeAllChanges()). If there are no matches for updates, only inserts, then write them directly (see writeInsertsOnlyWhenNoMatches()).
Note, when deletion vectors are enabled, phase 2 is split into two parts: 2.a. Read the touched files again and only write modified and new rows (see writeAllChanges()). 2.b. Read the touched files and generate deletion vectors for the modified rows (see writeDVs()).
If there are no matches for updates, only inserts, then they are written directly (see writeInsertsOnlyWhenNoMatches()). This remains the same when DVs are enabled, since there are no modified rows. Furthermore, see InsertOnlyMergeExecutor for the optimized executor used when there are only inserts.
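The phase 1 steps above can be illustrated with a small plain-Scala sketch (a hypothetical simplification, not the Delta implementation): after the inner join on the merge condition, we collect the distinct target files that were touched and fail if any target row matched more than one source row.

```scala
// Hypothetical sketch of phase 1's file discovery and cardinality check.
// JoinedRow models one result of the inner join on the merge condition.
case class JoinedRow(targetFile: String, targetRowId: Long, sourceRowId: Long)

def findTouchedFilesSketch(matches: Seq[JoinedRow]): Set[String] = {
  // No target row may be matched by more than one source row.
  val multiMatches = matches.groupBy(_.targetRowId).filter(_._2.size > 1)
  require(multiMatches.isEmpty,
    s"ON condition matched target rows ${multiMatches.keys.mkString(", ")} " +
      "with multiple source rows")
  // The touched files are simply the distinct files of the matched rows.
  matches.map(_.targetFile).toSet
}

val touched = findTouchedFilesSketch(Seq(
  JoinedRow("part-0", targetRowId = 1, sourceRowId = 10),
  JoinedRow("part-1", targetRowId = 2, sourceRowId = 11)))
// touched == Set("part-0", "part-1")
```

In the real executor the join, grouping, and error reporting happen distributed over Spark DataFrames; the sketch only shows the shape of the check.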
- case class DeduplicateCDFDeletes(enabled: Boolean, includesInserts: Boolean) extends Product with Serializable
This class enables and configures the deduplication of CDF deletes in case the merge statement contains an unconditional delete statement that matches multiple target rows.
- enabled
CDF generation should be enabled and duplicate target matches are detected
- includesInserts
in addition to the unconditional deletes the merge also inserts rows
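A minimal sketch of how the two flags relate to the shape of the MERGE statement, using a local mirror of the case class (the helper and its derivation are assumptions for illustration, not the actual construction site in Delta):

```scala
// Local mirror of the case class, for illustration only.
case class DeduplicateCDFDeletes(enabled: Boolean, includesInserts: Boolean)

// Hypothetical helper: dedup is needed only when CDF is on and the merge
// has an unconditional delete (which can match multiple target rows);
// includesInserts records whether the merge also inserts rows.
def cdfDedupConfig(
    cdfEnabled: Boolean,
    hasUnconditionalDelete: Boolean,
    hasInserts: Boolean): DeduplicateCDFDeletes =
  DeduplicateCDFDeletes(
    enabled = cdfEnabled && hasUnconditionalDelete,
    includesInserts = hasInserts)

val cfg = cdfDedupConfig(
  cdfEnabled = true, hasUnconditionalDelete = true, hasInserts = false)
// cfg == DeduplicateCDFDeletes(true, false)
```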
- trait InsertOnlyMergeExecutor extends MergeOutputGeneration
Trait with optimized execution for merges that only insert new data. There are two insert-only cases: when the merge command has no matched clauses, and when it has matched clauses but no rows match them.
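The choice between the two executors can be sketched in plain Scala (a hypothetical simplification of the dispatch logic, not the actual Delta code):

```scala
// Hypothetical sketch: the insert-only fast path applies when the merge has
// only WHEN NOT MATCHED clauses, or when matched clauses exist but no target
// rows actually matched; otherwise the classic two-phase executor runs.
sealed trait MergeClause
case object Matched extends MergeClause
case object NotMatched extends MergeClause

def useInsertOnlyExecutor(
    clauses: Seq[MergeClause],
    anyRowMatched: Boolean): Boolean =
  clauses.nonEmpty && (clauses.forall(_ == NotMatched) || !anyRowMatched)

// MERGE with only insert clauses -> optimized InsertOnlyMergeExecutor:
val insertOnly = useInsertOnlyExecutor(Seq(NotMatched), anyRowMatched = false)
// MERGE with matched updates that found matches -> ClassicMergeExecutor:
val classic = useInsertOnlyExecutor(Seq(Matched, NotMatched), anyRowMatched = true)
// insertOnly == true, classic == false
```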
- case class MergeClauseStats(condition: Option[String], actionType: String, actionExpr: Seq[String]) extends Product with Serializable
Represents the state of a single merge clause:
- the clause's (optional) predicate
- the action type (insert, update, delete)
- the action's expressions
- case class MergeDataSizes(rows: Option[Long] = None, files: Option[Long] = None, bytes: Option[Long] = None, partitions: Option[Long] = None) extends Product with Serializable
- trait MergeIntoMaterializeSource extends DeltaLogging with DeltaSparkPlanUtils
Trait with logic and utilities used for materializing a snapshot of MERGE source in case we can't guarantee deterministic repeated reads from it.
We materialize the source if it is not safe to assume that it is deterministic (override with MERGE_SOURCE_MATERIALIZATION). Otherwise, if the source changes between the phases of the MERGE, it can produce wrong results. We use local checkpointing for the materialization, which saves the source as a materialized RDD[InternalRow] on the executor-local disks.
The first concern is that if an executor is lost, this data can be lost with it. When the Spark executor decommissioning API is used, it should attempt to move this materialized data safely out before removing the executor.
The second concern is that if an executor is lost for another reason (e.g. a spot kill), we still lose that data. To mitigate this, we implement a retry loop: the whole MERGE operation is restarted from the beginning, and on retry we increase the replication level of the materialized data from 1 to 2 (override with MERGE_SOURCE_MATERIALIZATION_RDD_STORAGE_LEVEL_RETRY). If it still fails after the maximum number of attempts (MERGE_MATERIALIZE_SOURCE_MAX_ATTEMPTS), we record the failure for tracking purposes.
The third concern is that executors may run out of disk space due to the extra materialization. We record such failures for tracking purposes.
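The retry loop described above can be sketched in plain Scala (a hypothetical simplification: the real trait drives Spark storage levels and logging, while here the replication level is just an integer passed to the body):

```scala
import scala.util.{Try, Success, Failure}

// Hypothetical sketch of the retry loop: replication 1 on the first attempt,
// raised to 2 on retries; the whole MERGE restarts from the beginning each
// time, and we give up after maxAttempts.
def runWithMaterializeRetry[T](maxAttempts: Int)(body: Int => T): T = {
  var attempt = 1
  var result: Option[T] = None
  while (result.isEmpty) {
    val replication = if (attempt == 1) 1 else 2
    Try(body(replication)) match {
      case Success(v) => result = Some(v)
      case Failure(_) if attempt < maxAttempts =>
        attempt += 1 // retry the whole operation with higher replication
      case Failure(e) =>
        throw e // out of attempts: record the failure and give up
    }
  }
  result.get
}

// Simulated MERGE that loses its materialized RDD on the first attempt:
var calls = 0
val out = runWithMaterializeRetry(maxAttempts = 4) { replication =>
  calls += 1
  if (calls == 1) throw new RuntimeException("lost RDD block")
  s"committed with replication $replication"
}
// out == "committed with replication 2", after 2 attempts
```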
- case class MergeIntoMaterializeSourceError(errorType: String, attempt: Int, materializedSourceRDDStorageLevel: String) extends Product with Serializable
Structure with data for "delta.dml.merge.materializeSourceError" event. Note: We log only errors that we want to track (out of disk or lost RDD blocks).
- trait MergeOutputGeneration extends AnyRef
Contains logic to transform the merge clauses into expressions that can be evaluated to obtain the output of the merge operation.
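The idea of compiling clauses into a single output expression can be sketched in plain Scala (a hypothetical model: the real trait builds Catalyst expressions such as chained conditionals over joined rows, while here clauses are ordinary functions over a row value):

```scala
// Hypothetical sketch: each merge clause has a predicate and an action;
// the output expression tries clauses in order and falls back to copying
// the row unchanged, like a chained CASE WHEN.
case class Clause(condition: Long => Boolean, action: Long => Long)

def outputExpr(clauses: Seq[Clause], copyUnchanged: Long => Long): Long => Long =
  row => clauses.find(_.condition(row)).map(_.action(row))
    .getOrElse(copyUnchanged(row))

val expr = outputExpr(
  Seq(Clause(_ % 2 == 0, _ * 10), // WHEN MATCHED AND even THEN value * 10
      Clause(_ > 5, _ + 1)),      // WHEN MATCHED AND > 5  THEN value + 1
  identity)                       // otherwise copy the row unchanged

// expr(4) == 40, expr(7) == 8, expr(3) == 3
```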
- case class MergeStats(conditionExpr: String, updateConditionExpr: String, updateExprs: Seq[String], insertConditionExpr: String, insertExprs: Seq[String], deleteConditionExpr: String, matchedStats: Seq[MergeClauseStats], notMatchedStats: Seq[MergeClauseStats], notMatchedBySourceStats: Seq[MergeClauseStats], executionTimeMs: Long, materializeSourceTimeMs: Long, scanTimeMs: Long, rewriteTimeMs: Long, source: MergeDataSizes, targetBeforeSkipping: MergeDataSizes, targetAfterSkipping: MergeDataSizes, sourceRowsInSecondScan: Option[Long], targetFilesRemoved: Long, targetFilesAdded: Long, targetChangeFilesAdded: Option[Long], targetChangeFileBytes: Option[Long], targetBytesRemoved: Option[Long], targetBytesAdded: Option[Long], targetPartitionsRemovedFrom: Option[Long], targetPartitionsAddedTo: Option[Long], targetRowsCopied: Long, targetRowsUpdated: Long, targetRowsMatchedUpdated: Long, targetRowsNotMatchedBySourceUpdated: Long, targetRowsInserted: Long, targetRowsDeleted: Long, targetRowsMatchedDeleted: Long, targetRowsNotMatchedBySourceDeleted: Long, numTargetDeletionVectorsAdded: Long, numTargetDeletionVectorsRemoved: Long, numTargetDeletionVectorsUpdated: Long, materializeSourceReason: Option[String] = None, materializeSourceAttempts: Option[Long] = None, numLogicalRecordsAdded: Option[Long], numLogicalRecordsRemoved: Option[Long], commitVersion: Option[Long] = None) extends Product with Serializable
State for a merge operation
Value Members
- object MergeClauseStats extends Serializable
- object MergeIntoMaterializeSource
- object MergeIntoMaterializeSourceError extends Serializable
- object MergeIntoMaterializeSourceErrorType extends Enumeration
- object MergeIntoMaterializeSourceReason extends Enumeration
Enumeration with possible reasons that source may be materialized in a MERGE command.
- object MergeOutputGeneration
- object MergeStats extends Serializable