Packages

package merge


Type Members

  1. trait ClassicMergeExecutor extends MergeOutputGeneration

    Trait with merge execution in two phases:

    Phase 1: Find the input files in target that are touched by the rows that satisfy the condition and verify that no two source rows match with the same target row. This is implemented as an inner-join using the given condition (see findTouchedFiles). In the special case that there is no update clause we write all the non-matching source data as new files and skip phase 2. Issues an error message when the ON search_condition of the MERGE statement can match a single row from the target table with multiple rows of the source table-reference.

    Phase 2: Read the touched files again and write new files with updated and/or inserted rows. If there are updates, then use an outer join using the given condition to write the updates and inserts (see writeAllChanges()). If there are no matches for updates, only inserts, then write them directly (see writeInsertsOnlyWhenNoMatches()).

    Note: when deletion vectors are enabled, phase 2 is split into two parts:

    2.a. Read the touched files again and only write modified and new rows (see writeAllChanges()).
    2.b. Read the touched files and generate deletion vectors for the modified rows (see writeDVs()).

    If there are no matches for updates, only inserts, then write them directly (see writeInsertsOnlyWhenNoMatches()). This remains the same when DVs are enabled since there are no modified rows. Furthermore, see InsertOnlyMergeExecutor for the optimized executor used when there are only inserts.
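
    The two phases described above can be sketched with plain Scala collections. This is a minimal illustration, not the trait's implementation: Row, DataFile, findTouched and merge are hypothetical names, a "file" is just a named batch of rows, and the merge condition is assumed to be key equality with an unconditional update and insert.

```scala
object TwoPhaseMergeSketch {
  // Hypothetical, simplified model: a target "file" is a named batch of rows.
  final case class Row(key: Int, value: String)
  final case class DataFile(name: String, rows: Seq[Row])

  /** Phase 1: inner-join target and source on key and return the names of
    * the touched files. Fails if several source rows match one target row. */
  def findTouched(target: Seq[DataFile], source: Seq[Row]): Set[String] = {
    val sourceCountByKey: Map[Int, Int] =
      source.groupBy(_.key).map { case (k, rs) => k -> rs.size }
    val matches: Seq[(String, Int)] = for {
      file <- target
      row  <- file.rows
      if sourceCountByKey.contains(row.key)
    } yield (file.name, row.key)
    matches.find { case (_, key) => sourceCountByKey(key) > 1 }.foreach {
      case (_, key) =>
        throw new IllegalStateException(
          s"Multiple source rows match the target row with key $key")
    }
    matches.map(_._1).toSet
  }

  /** Phase 2: rewrite only the touched files with updated rows, then add a
    * new file holding the source rows that matched nothing. */
  def merge(target: Seq[DataFile], source: Seq[Row]): Seq[DataFile] = {
    val touched     = findTouched(target, source)
    val sourceByKey = source.map(r => r.key -> r).toMap
    val targetKeys  = target.flatMap(_.rows.map(_.key)).toSet
    val (toRewrite, untouched) = target.partition(f => touched.contains(f.name))
    val rewritten = toRewrite.map { f =>
      DataFile(f.name + "-rewritten", f.rows.map(r => sourceByKey.getOrElse(r.key, r)))
    }
    val inserts = source.filterNot(r => targetKeys.contains(r.key))
    val insertFile =
      if (inserts.isEmpty) Seq.empty[DataFile] else Seq(DataFile("inserted", inserts))
    untouched ++ rewritten ++ insertFile
  }
}
```

    Files that phase 1 does not report as touched are carried over unchanged, which is the point of scanning for touched files before rewriting anything.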

  2. case class DeduplicateCDFDeletes(enabled: Boolean, includesInserts: Boolean) extends Product with Serializable

    This class enables and configures the deduplication of CDF deletes in case the merge statement contains an unconditional delete statement that matches multiple target rows.

    enabled

    Whether CDF generation is enabled and duplicate target matches are detected.

    includesInserts

    Whether the merge also inserts rows in addition to the unconditional deletes.
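
    As a rough illustration of why deduplication matters here, the sketch below assumes that a join can report the same target row more than once and that each report would otherwise yield its own CDF delete record. ChangeRecord, targetRowId and deleteRecords are hypothetical names and do not come from the library.

```scala
object CdfDeleteDedupSketch {
  // Hypothetical CDF record: the id of the affected target row plus a change type.
  final case class ChangeRecord(targetRowId: Long, changeType: String)

  /** Builds delete records for the matched target rows. With dedup enabled,
    * a target row that was matched several times yields a single record. */
  def deleteRecords(matchedTargetRowIds: Seq[Long], dedup: Boolean): Seq[ChangeRecord] = {
    val ids = if (dedup) matchedTargetRowIds.distinct else matchedTargetRowIds
    ids.map(id => ChangeRecord(id, "delete"))
  }
}
```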

  3. trait InsertOnlyMergeExecutor extends MergeOutputGeneration

    Trait with optimized execution for merges that only insert new data. There are two insert-only cases: when the merge command has no matched clauses, and when the merge command has matched clauses but nothing matches them.
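
    The two insert-only cases can be expressed as a small predicate, and the rows to insert as an anti-join of source against target. This is a sketch under stated assumptions: isInsertOnly, rowsToInsert and Row are hypothetical names, the merge condition is taken to be key equality, and clauses other than matched and not-matched ones are ignored.

```scala
object InsertOnlySketch {
  final case class Row(key: Int, value: String)

  /** Case 1: the merge has no matched clauses.
    * Case 2: it has matched clauses, but no target row matched. */
  def isInsertOnly(hasMatchedClauses: Boolean, matchedRowCount: Long): Boolean =
    !hasMatchedClauses || matchedRowCount == 0L

  /** Anti-join: keep the source rows whose key is absent from the target. */
  def rowsToInsert(target: Seq[Row], source: Seq[Row]): Seq[Row] = {
    val targetKeys = target.map(_.key).toSet
    source.filterNot(r => targetKeys.contains(r.key))
  }
}
```

    In both cases no existing target data has to be rewritten, so only new rows are written.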

  4. case class MergeClauseStats(condition: Option[String], actionType: String, actionExpr: Seq[String]) extends Product with Serializable

    Represents the state of a single merge clause:

    - merge clause's (optional) predicate
    - action type (insert, update, delete)
    - action's expressions

  5. case class MergeDataSizes(rows: Option[Long] = None, files: Option[Long] = None, bytes: Option[Long] = None, partitions: Option[Long] = None) extends Product with Serializable
  6. trait MergeIntoMaterializeSource extends DeltaLogging with DeltaSparkPlanUtils

    Trait with logic and utilities used for materializing a snapshot of MERGE source in case we can't guarantee deterministic repeated reads from it.

    We materialize source if it is not safe to assume that it's deterministic (override with MERGE_SOURCE_MATERIALIZATION). Otherwise, if source changes between the phases of the MERGE, it can produce wrong results. We use local checkpointing for the materialization, which saves the source as a materialized RDD[InternalRow] on the executor local disks.

    1st concern is that if an executor is lost, this data can be lost. When Spark executor decommissioning API is used, it should attempt to move this materialized data safely out before removing the executor.

    2nd concern is that if an executor is lost for another reason (e.g. spot kill), we will still lose that data. To mitigate that, we implement a retry loop. The whole Merge operation needs to be restarted from the beginning in this case. When we retry, we increase the replication level of the materialized data from 1 to 2 (override with MERGE_SOURCE_MATERIALIZATION_RDD_STORAGE_LEVEL_RETRY). If it still fails after the maximum number of attempts (MERGE_MATERIALIZE_SOURCE_MAX_ATTEMPTS), we record the failure for tracking purposes.

    3rd concern is that executors run out of disk space with the extra materialization. We record such failures for tracking purposes.
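
    The retry behaviour described for the 2nd concern can be sketched without Spark. Everything named below is a hypothetical stand-in: Level models a storage level with one or two replicas, SourceLostException models the loss of materialized blocks, and runWithRetry restarts the whole operation with the higher replication level from the second attempt on.

```scala
import scala.util.{Failure, Success, Try}

object MaterializeRetrySketch {
  // Hypothetical stand-ins for storage levels with 1 and 2 replicas.
  sealed trait Level
  case object DiskOnly extends Level
  case object DiskOnlyReplicated extends Level

  // Hypothetical signal that materialized source blocks were lost.
  final class SourceLostException(msg: String) extends RuntimeException(msg)

  /** Runs the whole operation, restarting it from the beginning when the
    * materialized source is lost, up to maxAttempts attempts in total.
    * Attempt 1 uses one replica; every retry uses two. */
  def runWithRetry[A](maxAttempts: Int)(runMerge: (Int, Level) => A): A = {
    require(maxAttempts >= 1, "maxAttempts must be at least 1")
    def loop(attempt: Int): A = {
      val level = if (attempt == 1) DiskOnly else DiskOnlyReplicated
      Try(runMerge(attempt, level)) match {
        case Success(result) => result
        case Failure(_: SourceLostException) if attempt < maxAttempts =>
          loop(attempt + 1)
        // Other errors, or the last failed attempt, are rethrown to the caller.
        case Failure(e) => throw e
      }
    }
    loop(1)
  }
}
```

    Only the loss of materialized data triggers a retry in this sketch; any other failure is surfaced immediately, as is the failure of the final attempt.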

  7. case class MergeIntoMaterializeSourceError(errorType: String, attempt: Int, materializedSourceRDDStorageLevel: String) extends Product with Serializable

    Structure with data for "delta.dml.merge.materializeSourceError" event. Note: We log only errors that we want to track (out of disk or lost RDD blocks).

  8. trait MergeOutputGeneration extends AnyRef

    Contains logic to transform the merge clauses into expressions that can be evaluated to obtain the output of the merge operation.

  9. case class MergeStats(conditionExpr: String, updateConditionExpr: String, updateExprs: Seq[String], insertConditionExpr: String, insertExprs: Seq[String], deleteConditionExpr: String, matchedStats: Seq[MergeClauseStats], notMatchedStats: Seq[MergeClauseStats], notMatchedBySourceStats: Seq[MergeClauseStats], executionTimeMs: Long, materializeSourceTimeMs: Long, scanTimeMs: Long, rewriteTimeMs: Long, source: MergeDataSizes, targetBeforeSkipping: MergeDataSizes, targetAfterSkipping: MergeDataSizes, sourceRowsInSecondScan: Option[Long], targetFilesRemoved: Long, targetFilesAdded: Long, targetChangeFilesAdded: Option[Long], targetChangeFileBytes: Option[Long], targetBytesRemoved: Option[Long], targetBytesAdded: Option[Long], targetPartitionsRemovedFrom: Option[Long], targetPartitionsAddedTo: Option[Long], targetRowsCopied: Long, targetRowsUpdated: Long, targetRowsMatchedUpdated: Long, targetRowsNotMatchedBySourceUpdated: Long, targetRowsInserted: Long, targetRowsDeleted: Long, targetRowsMatchedDeleted: Long, targetRowsNotMatchedBySourceDeleted: Long, numTargetDeletionVectorsAdded: Long, numTargetDeletionVectorsRemoved: Long, numTargetDeletionVectorsUpdated: Long, materializeSourceReason: Option[String] = None, materializeSourceAttempts: Option[Long] = None, numLogicalRecordsAdded: Option[Long], numLogicalRecordsRemoved: Option[Long], commitVersion: Option[Long] = None) extends Product with Serializable

    State for a merge operation

Value Members

  1. object MergeClauseStats extends Serializable
  2. object MergeIntoMaterializeSource
  3. object MergeIntoMaterializeSourceError extends Serializable
  4. object MergeIntoMaterializeSourceErrorType extends Enumeration
  5. object MergeIntoMaterializeSourceReason extends Enumeration

    Enumeration with possible reasons that source may be materialized in a MERGE command.

  6. object MergeOutputGeneration
  7. object MergeStats extends Serializable
