org.apache.spark.sql.delta.commands.merge

ClassicMergeExecutor

trait ClassicMergeExecutor extends MergeOutputGeneration

Trait with merge execution in two phases:

Phase 1: Find the input files in the target that are touched by the rows that satisfy the merge condition, and verify that no two source rows match the same target row. This is implemented as an inner join using the given condition (see findTouchedFiles). In the special case that there is no update clause, we write all the non-matching source data as new files and skip phase 2. An error is raised when the ON search_condition of the MERGE statement can match a single row from the target table with multiple rows of the source table-reference.

Phase 2: Read the touched files again and write new files with updated and/or inserted rows. If there are updates, then use an outer join using the given condition to write the updates and inserts (see writeAllChanges()). If there are no matches for updates, only inserts, then write them directly (see writeInsertsOnlyWhenNoMatches()).

Note: when deletion vectors are enabled, phase 2 is split into two parts:
2.a. Read the touched files again and only write modified and new rows (see writeAllChanges()).
2.b. Read the touched files and generate deletion vectors for the modified rows (see writeDVs()).

If there are no matches for updates, only inserts, then write them directly (see writeInsertsOnlyWhenNoMatches()). This remains the same when DVs are enabled since there are no modified rows. Furthermore, see InsertOnlyMergeExecutor for the optimized executor used when there are only inserts.
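
The choice of phase 2 writer described above can be summarized in a few lines. This is a minimal sketch, not the actual control flow of the trait: `Phase1Result` and `phase2Steps` are hypothetical names, and treating "no matches for updates" as "phase 1 found no touched files" is an assumption made for illustration.

```scala
object MergePhasesSketch {
  // Hypothetical summary of what phase 1 found; not a Delta Lake type.
  final case class Phase1Result(touchedFileCount: Int, deletionVectorsEnabled: Boolean)

  // Names the phase 2 writers that the description above suggests would run, in order.
  def phase2Steps(r: Phase1Result): Seq[String] =
    if (r.touchedFileCount == 0) {
      // Assumption: no touched files means there is nothing to update, only inserts.
      Seq("writeInsertsOnlyWhenNoMatches")
    } else if (r.deletionVectorsEnabled) {
      // Phase 2.a followed by phase 2.b.
      Seq("writeAllChanges (modified and new rows only)", "writeDVs")
    } else {
      Seq("writeAllChanges")
    }
}
```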

Self Type
ClassicMergeExecutor with MergeIntoCommandBase

Type Members

  1. case class ProcessedClause(condition: Option[Expression], actions: Seq[Expression]) extends Product with Serializable

    Represents a merge clause after its condition and action expressions have been processed before generating the final output expression.

    condition

    Optional precomputed condition.

    actions

    List of output expressions generated from every action of the clause.

    Attributes
    protected
    Definition Classes
    MergeOutputGeneration

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  5. def clauseDisjunction(clauses: Seq[DeltaMergeIntoClause]): Expression

    Helper function that produces an expression by combining a sequence of clauses with OR. Requires the sequence to be non-empty.
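
    As a rough illustration of combining clauses with OR, the sketch below folds a non-empty sequence of optional conditions into one expression. The `Expr` types are hypothetical stand-ins rather than Catalyst classes, and treating a clause without a condition as always true is an assumption.

    ```scala
    object ClauseDisjunctionSketch {
      // Hypothetical, minimal expression tree; not Spark's Expression.
      sealed trait Expr
      case object TrueLit extends Expr
      final case class Pred(sql: String) extends Expr
      final case class Or(left: Expr, right: Expr) extends Expr

      // Combines the clause conditions with OR, left to right.
      // Assumption: a clause with no condition always applies, so it maps to TrueLit.
      def disjunction(conditions: Seq[Option[Expr]]): Expr = {
        require(conditions.nonEmpty, "at least one clause is required")
        conditions
          .map(_.getOrElse(TrueLit))
          .reduceLeft[Expr]((l, r) => Or(l, r))
      }
    }
    ```

    The explicit `require` mirrors the documented precondition instead of relying on the exception that `reduceLeft` throws for an empty sequence.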

    Attributes
    protected
  6. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  7. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  8. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  9. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  10. def findTouchedFiles(spark: SparkSession, deltaTxn: OptimisticTransaction): (Seq[AddFile], DeduplicateCDFDeletes)

    Find the target table files that contain the rows that satisfy the merge condition. This is implemented as an inner-join between the source query/table and the target table using the merge condition.
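
    The inner join can be pictured with plain Scala collections. This is only a sketch under simplifying assumptions: the merge condition is equality on `id`, each target row already carries its file name and position, and `TargetRow`, `SourceRow`, and `findTouched` are hypothetical names.

    ```scala
    object FindTouchedFilesSketch {
      // Hypothetical row shapes; not Delta Lake types.
      final case class TargetRow(file: String, rowIndex: Long, id: Int)
      final case class SourceRow(id: Int)

      // Returns the touched files and whether any target row matched more than one source row.
      def findTouched(source: Seq[SourceRow], target: Seq[TargetRow]): (Set[String], Boolean) = {
        // Inner join on the merge condition (here: s.id == t.id), keeping the target side.
        val matches = for { t <- target; s <- source if s.id == t.id } yield t
        val touched = matches.map(_.file).toSet
        // A target row that appears more than once in the join output has multiple source matches.
        val multipleMatches =
          matches.groupBy(t => (t.file, t.rowIndex)).exists { case (_, rows) => rows.size > 1 }
        (touched, multipleMatches)
      }
    }
    ```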

    Attributes
    protected
  11. def generateAllActionExprs(targetWriteCols: Seq[Expression], rowIdColumnExpressionOpt: Option[NamedExpression], rowCommitVersionColumnExpressionOpt: Option[NamedExpression], clausesWithPrecompConditions: Seq[DeltaMergeIntoClause], cdcEnabled: Boolean, shouldCountDeletedRows: Boolean): Seq[(ClassicMergeExecutor.this)#ProcessedClause]

    Generate expressions for every output column and every merge clause based on the corresponding UPDATE, DELETE and/or INSERT action(s).

    targetWriteCols

    List of output column expressions from the target table. Used to generate CDC data for DELETE.

    rowIdColumnExpressionOpt

    The optional Row ID preservation column with the physical Row ID name. It stores the stable Row IDs of the table.

    rowCommitVersionColumnExpressionOpt

    The optional Row Commit Version preservation column with the physical Row Commit Version name. It stores stable Row Commit Versions.

    clausesWithPrecompConditions

    List of merge clauses with precomputed conditions. Action expressions are generated for each of these clauses.

    cdcEnabled

    Whether the generated expressions should include CDC information.

    shouldCountDeletedRows

    Whether metrics for number of deleted rows should be incremented here.

    returns

    A list with one ProcessedClause per merge clause, each holding a precomputed condition and N + 2 action expressions (N output columns + ROW_DROPPED_COL + CDC_TYPE_COLUMN_NAME) to apply on a row when that clause matches.

    Attributes
    protected
    Definition Classes
    MergeOutputGeneration
  12. def generateCdcAndOutputRows(sourceDf: DataFrame, outputCols: Seq[Column], outputColNames: Seq[String], noopCopyExprs: Seq[Expression], rowIdColumnNameOpt: Option[String], rowCommitVersionColumnNameOpt: Option[String], deduplicateDeletes: DeduplicateCDFDeletes): DataFrame

    Build the full output as an array of packed rows, then explode into the final result. Based on the CDC type as originally marked, we produce both rows for the CDC_TYPE_NOT_CDC partition to be written to the main table and rows for the CDC partitions to be written as CDC files.

    See CDCReader for general details on how partitioning on the CDC type column works.
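
    The pack-then-explode idea can be sketched without Spark. Everything here is an illustrative assumption: `Processed` and `Out` are hypothetical row shapes, `None` stands in for the CDC_TYPE_NOT_CDC partition, and the change type strings and the exact set of rows emitted per type are assumed rather than taken from this page.

    ```scala
    object CdcExplodeSketch {
      // Hypothetical processed row: value before and after the merge, plus the CDC type it was marked with.
      final case class Processed(before: Int, after: Int, cdcType: Option[String])
      // Output row; cdcType = None stands in for the main-table (not CDC) partition.
      final case class Out(value: Int, cdcType: Option[String])

      // Packs every output row derived from one processed row into a single sequence.
      def pack(p: Processed): Seq[Out] = p.cdcType match {
        case Some("update_postimage") =>
          Seq(
            Out(p.before, Some("update_preimage")),
            Out(p.after, Some("update_postimage")),
            Out(p.after, None))
        case Some("insert") => Seq(Out(p.after, Some("insert")), Out(p.after, None))
        case Some("delete") => Seq(Out(p.before, Some("delete")))
        case _              => Seq(Out(p.after, None)) // unchanged copy, main table only
      }

      // Exploding the packed arrays yields the final flat result.
      def explode(rows: Seq[Processed]): Seq[Out] = rows.flatMap(pack)
    }
    ```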

    Attributes
    protected
    Definition Classes
    MergeOutputGeneration
  13. def generateClauseOutputExprs(numOutputCols: Int, clauses: Seq[(ClassicMergeExecutor.this)#ProcessedClause], noopExprs: Seq[Expression]): Seq[Expression]

    Generate the output expression for each output column to apply the correct action for a type of merge clause. For each output column, the resulting expression dispatches the correct action based on all clause conditions.

    numOutputCols

    Number of output columns.

    clauses

    List of preprocessed merge clauses to bind together.

    noopExprs

    Default expression to apply when no condition holds.

    returns

    A list of one expression per output column to apply for a type of merge clause.
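
    One way to picture the per-column dispatch is as a chain of "first clause whose condition holds wins, otherwise apply the no-op". The sketch below models expressions as plain functions; `Clause` is a hypothetical stand-in for ProcessedClause, and first-match ordering is an assumption about how the conditions are combined.

    ```scala
    object ClauseDispatchSketch {
      type Row = Map[String, Int]

      // Hypothetical stand-in for ProcessedClause: a condition and one action per output column.
      final case class Clause(condition: Row => Boolean, actions: Seq[Row => Int])

      // For each output column, build a function that applies the action of the first
      // clause whose condition holds, or the no-op expression when none does.
      def outputExprs(numOutputCols: Int, clauses: Seq[Clause], noop: Seq[Row => Int]): Seq[Row => Int] =
        (0 until numOutputCols).map { i =>
          (row: Row) =>
            clauses.find(_.condition(row)) match {
              case Some(clause) => clause.actions(i)(row)
              case None         => noop(i)(row)
            }
        }
    }
    ```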

    Attributes
    protected
    Definition Classes
    MergeOutputGeneration
  14. def generateFilterForModifiedRows(): Expression

    Returns the expression that can be used for selecting the modified rows generated by the merge operation. The expression is designed to work irrespective of the join type used between the source and target tables.

    The expression consists of two parts, one for each of the action clause types that produce row modifications: MATCHED and NOT MATCHED BY SOURCE. All actions of the same clause type form a disjunctive clause. The result is then conjoined with an expression that filters the rows of the particular action clause type. For example:

    MERGE INTO t
    USING s
    ON s.id = t.id
    WHEN MATCHED AND id < 5 THEN ...
    WHEN MATCHED AND id > 10 THEN ...
    WHEN NOT MATCHED BY SOURCE AND id > 20 THEN ...

    Produces the following expression:

    ((s.id = t.id) AND (id < 5 OR id > 10)) OR ((SOURCE TABLE IS NULL) AND (id > 20))
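
    The shape of this filter can be reproduced with a tiny expression tree. This is a sketch, not the method's implementation: `Pred`, `And`, `Or`, `anyOf`, and `modifiedRowsFilter` are hypothetical names, and both clause lists are assumed to be non-empty.

    ```scala
    object ModifiedRowsFilterSketch {
      // Hypothetical, minimal expression tree that renders itself as SQL-like text.
      sealed trait Expr { def sql: String }
      final case class Pred(sql: String) extends Expr
      final case class And(l: Expr, r: Expr) extends Expr { def sql: String = s"(${l.sql} AND ${r.sql})" }
      final case class Or(l: Expr, r: Expr) extends Expr { def sql: String = s"(${l.sql} OR ${r.sql})" }

      // Disjunction over the conditions of one clause type.
      def anyOf(conds: Seq[Expr]): Expr = {
        require(conds.nonEmpty, "expected at least one clause condition")
        conds.reduceLeft[Expr]((a, b) => Or(a, b))
      }

      // (merge condition AND any MATCHED condition) OR (source is null AND any NOT MATCHED BY SOURCE condition).
      def modifiedRowsFilter(
          mergeCondition: Expr,
          matched: Seq[Expr],
          sourceIsNull: Expr,
          notMatchedBySource: Seq[Expr]): Expr =
        Or(And(mergeCondition, anyOf(matched)), And(sourceIsNull, anyOf(notMatchedBySource)))
    }
    ```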

    Attributes
    protected
  15. def generateFilterForNewRows(): Expression

    Returns the expression that can be used for selecting the new rows generated by the merge operation.

    Attributes
    protected
  16. def generatePrecomputedConditionsAndDF(sourceDF: DataFrame, clauses: Seq[DeltaMergeIntoClause]): (DataFrame, Seq[DeltaMergeIntoClause])

    Precompute conditions in MATCHED and NOT MATCHED clauses and generate the source data frame with precomputed boolean columns.

    sourceDF

    The source DataFrame.

    clauses

    The merge clauses to precompute.

    returns

    The generated sourceDF with precomputed boolean columns, together with the matched clauses and insert clauses, whose conditions may have been rewritten.
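
    The idea of precomputing clause conditions into boolean columns can be sketched as follows. All names (`SourceRow`, `Clause`, `precompute`, the `_cond_N` column names) are hypothetical, and the sketch assumes every conditional clause is precomputed, which need not hold in the real method.

    ```scala
    object PrecomputeSketch {
      // Hypothetical source row: regular columns plus the precomputed boolean columns.
      final case class SourceRow(cols: Map[String, Int], precomputed: Map[String, Boolean] = Map.empty)
      // Hypothetical clause with an optional condition over a source row.
      final case class Clause(condition: Option[SourceRow => Boolean])

      def precompute(source: Seq[SourceRow], clauses: Seq[Clause]): (Seq[SourceRow], Seq[Clause]) = {
        // One made-up column name per clause.
        val flagNames = clauses.indices.map(i => s"_cond_$i")
        // Evaluate each clause condition once per source row and store the result.
        val newSource = source.map { row =>
          val flags = clauses.zip(flagNames).collect {
            case (Clause(Some(cond)), name) => name -> cond(row)
          }
          row.copy(precomputed = flags.toMap)
        }
        // Rewrite each conditional clause to read the stored boolean instead.
        val rewritten = clauses.zip(flagNames).map {
          case (Clause(Some(_)), name) => Clause(Some((r: SourceRow) => r.precomputed(name)))
          case (unconditional, _)      => unconditional
        }
        (newSource, rewritten)
      }
    }
    ```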

    Attributes
    protected
    Definition Classes
    MergeOutputGeneration
  17. def generateWriteAllChangesOutputCols(targetWriteCols: Seq[Expression], rowIdColumnExpressionOpt: Option[NamedExpression], rowCommitVersionColumnExpressionOpt: Option[NamedExpression], targetWriteColNames: Seq[String], noopCopyExprs: Seq[Expression], clausesWithPrecompConditions: Seq[DeltaMergeIntoClause], cdcEnabled: Boolean, shouldCountDeletedRows: Boolean = true): IndexedSeq[Column]

    Generate the expressions to process full-outer join output and generate target rows.

    The output has N + 2 columns: the N target output columns plus the two control columns ROW_DROPPED_COL and CDC_TYPE_COLUMN_NAME. To generate them, we generate N + 2 expressions and apply them on the joinedDF. The CDC column will be either used for CDC generation or dropped before performing the final write, and ROW_DROPPED_COL will always be dropped after executing the increment metric expression and filtering on it.
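
    A minimal sketch of how the control columns are consumed when CDC is not in play: rows flagged as dropped are filtered out, then both control columns are discarded so only the N data columns remain. `OutRow` and `finalRows` are hypothetical names used only for illustration.

    ```scala
    object ControlColumnsSketch {
      // N data values plus the two control columns.
      final case class OutRow(data: Seq[Int], rowDropped: Boolean, cdcType: Option[String])

      // Filter on the row-dropped flag, then keep only the data columns.
      def finalRows(rows: Seq[OutRow]): Seq[Seq[Int]] =
        rows.filterNot(_.rowDropped).map(_.data)
    }
    ```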

    Attributes
    protected
    Definition Classes
    MergeOutputGeneration
  18. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  19. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  20. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  21. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  22. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  23. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  24. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  25. def toString(): String
    Definition Classes
    AnyRef → Any
  26. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  27. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  28. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  29. def writeAllChanges(spark: SparkSession, deltaTxn: OptimisticTransaction, filesToRewrite: Seq[AddFile], deduplicateCDFDeletes: DeduplicateCDFDeletes, writeUnmodifiedRows: Boolean): Seq[FileAction]

    Write new files by reading the touched files and updating/inserting data using the source query/table. This is implemented using a full-outer-join using the merge condition.

    Note that unlike the insert-only code paths with just one control column ROW_DROPPED_COL, this method has a second control column CDC_TYPE_COL_NAME used for handling CDC when enabled.
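
    The full outer join at the heart of this method can be illustrated with in-memory collections. This sketch assumes a merge condition of equality on `id`, one UPDATE clause that copies the source value, and one INSERT clause; `Row`, `fullOuterJoin`, and `merge` are hypothetical names, and CDC and metrics are left out.

    ```scala
    object FullOuterJoinSketch {
      final case class Row(id: Int, value: String)
      // One joined row: an optional target side and an optional source side.
      type Joined = (Option[Row], Option[Row])

      def fullOuterJoin(target: Seq[Row], source: Seq[Row]): Seq[Joined] = {
        // Target rows, paired with each matching source row or with None.
        val targetSide: Seq[Joined] = target.flatMap { t =>
          val ms = source.filter(_.id == t.id)
          if (ms.isEmpty) Seq[Joined]((Some(t), None))
          else ms.map(s => (Option(t), Option(s)))
        }
        // Source rows that matched no target row.
        val sourceOnly: Seq[Joined] =
          source.filterNot(s => target.exists(_.id == s.id)).map(s => (Option.empty[Row], Option(s)))
        targetSide ++ sourceOnly
      }

      // WHEN MATCHED THEN UPDATE SET value = s.value; WHEN NOT MATCHED THEN INSERT.
      def merge(target: Seq[Row], source: Seq[Row]): Seq[Row] =
        fullOuterJoin(target, source).collect {
          case (Some(t), Some(s)) => t.copy(value = s.value) // updated row
          case (Some(t), None)    => t                       // unmodified row, copied as is
          case (None, Some(s))    => s                       // inserted row
        }
    }
    ```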

    Attributes
    protected
  30. def writeDVs(spark: SparkSession, deltaTxn: OptimisticTransaction, filesToRewrite: Seq[AddFile]): Seq[FileAction]

    Writes Deletion Vectors for rows modified by the merge operation.
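
    Conceptually, a deletion vector marks which row positions of a file are no longer valid. The sketch below uses a sorted list of row indexes per file as a stand-in for the real encoding; `TargetRow` and `deletionVectors` are hypothetical names, and the `isModified` predicate stands in for the modified-rows filter.

    ```scala
    object DeletionVectorSketch {
      // Hypothetical target row that knows its file and its position within that file.
      final case class TargetRow(file: String, rowIndex: Long, id: Int)

      // Maps each touched file to the sorted positions of the rows the merge modified.
      def deletionVectors(target: Seq[TargetRow], isModified: TargetRow => Boolean): Map[String, Seq[Long]] =
        target
          .filter(isModified)
          .groupBy(_.file)
          .map { case (file, rows) => file -> rows.map(_.rowIndex).sorted }
    }
    ```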

    Attributes
    protected
