trait ClassicMergeExecutor extends MergeOutputGeneration
Trait with merge execution in two phases:
Phase 1: Find the input files in the target that are touched by the rows that satisfy the condition, and verify that no two source rows match the same target row. This is implemented as an inner join using the given condition (see findTouchedFiles). In the special case where there is no update clause, all the non-matching source data is written as new files and phase 2 is skipped. An error is raised when the ON search_condition of the MERGE statement can match a single row from the target table with multiple rows of the source table-reference.
Phase 2: Read the touched files again and write new files with updated and/or inserted rows. If there are updates, then use an outer join using the given condition to write the updates and inserts (see writeAllChanges()). If there are no matches for updates, only inserts, then write them directly (see writeInsertsOnlyWhenNoMatches()).
Note: when deletion vectors are enabled, phase 2 is split into two parts:
2.a. Read the touched files again and only write modified and new rows (see writeAllChanges()).
2.b. Read the touched files and generate deletion vectors for the modified rows (see writeDVs()).
If there are no matches for updates, only inserts, then they are written directly (see writeInsertsOnlyWhenNoMatches()). This remains the same when DVs are enabled, since there are no modified rows. Furthermore, see InsertOnlyMergeExecutor for the optimized executor used when there are only inserts.
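The choice of phase 2 steps described above can be sketched as a small decision function. This is only an illustration, not the trait's actual control flow: `MergePhaseTwoSketch`, `Step`, `phaseTwoSteps` and both parameter names are hypothetical, and the order of the two deletion-vector steps simply follows the 2.a/2.b numbering in the text.

```scala
// Hypothetical sketch: which phase 2 steps run, as described in the text above.
object MergePhaseTwoSketch {
  sealed trait Step
  case object WriteInsertsOnlyWhenNoMatches extends Step
  case object WriteAllChanges extends Step
  case object WriteDVs extends Step

  def phaseTwoSteps(anyTargetRowMatched: Boolean, deletionVectorsEnabled: Boolean): Seq[Step] =
    if (!anyTargetRowMatched) {
      // No matches for updates, only inserts: write them directly (same with or without DVs).
      Seq(WriteInsertsOnlyWhenNoMatches)
    } else if (deletionVectorsEnabled) {
      // 2.a writes modified and new rows, 2.b marks the modified rows in the old files.
      Seq(WriteAllChanges, WriteDVs)
    } else {
      // Without DVs, the touched files are rewritten in a single step.
      Seq(WriteAllChanges)
    }
}
```

For example, `MergePhaseTwoSketch.phaseTwoSteps(anyTargetRowMatched = true, deletionVectorsEnabled = true)` yields the two-part variant of phase 2.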
- Self Type
- ClassicMergeExecutor with MergeIntoCommandBase
Type Members
-
case class
ProcessedClause(condition: Option[Expression], actions: Seq[Expression]) extends Product with Serializable
Represents a merge clause after its condition and action expressions have been processed before generating the final output expression.
- condition
Optional precomputed condition.
- actions
List of output expressions generated from every action of the clause.
- Attributes
- protected
- Definition Classes
- MergeOutputGeneration
Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
def
clauseDisjunction(clauses: Seq[DeltaMergeIntoClause]): Expression
Helper function that produces an expression by combining a sequence of clauses with OR. Requires the sequence to be non-empty.
- Attributes
- protected
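A minimal sketch of combining clause conditions with OR, using a tiny expression type in place of Catalyst expressions. `Expr`, `Pred`, `TrueLit`, `Or` and `Clause` are hypothetical stand-ins, and treating a clause without a condition as always true is an assumption of this sketch rather than something the description states.

```scala
// Hypothetical in-memory model; not the Spark/Delta types.
object ClauseDisjunctionSketch {
  sealed trait Expr
  final case class Pred(sql: String) extends Expr
  case object TrueLit extends Expr
  final case class Or(left: Expr, right: Expr) extends Expr

  // Stand-in for a merge clause with an optional condition.
  final case class Clause(condition: Option[Expr])

  def clauseDisjunction(clauses: Seq[Clause]): Expr = {
    // Mirrors the documented requirement that the sequence is non-empty.
    require(clauses.nonEmpty, "clauses must be non-empty")
    clauses
      .map(_.condition.getOrElse(TrueLit)) // assumption: no condition means "always matches"
      .reduceLeft[Expr]((acc, next) => Or(acc, next))
  }
}
```

With two conditional clauses this returns a single `Or` node; with one clause it returns that clause's condition unchanged.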
-
def
clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
-
def
findTouchedFiles(spark: SparkSession, deltaTxn: OptimisticTransaction): (Seq[AddFile], DeduplicateCDFDeletes)
Find the target table files that contain the rows that satisfy the merge condition. This is implemented as an inner-join between the source query/table and the target table using the merge condition.
- Attributes
- protected
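The inner-join idea can be illustrated on plain collections. This sketch assumes a merge condition of the form `s.id = t.id` and identifies a target row by its file and position; `TargetRow`, `SourceRow` and the exception type are hypothetical, and the real method works on DataFrames and returns `AddFile` entries.

```scala
// Hypothetical in-memory model of phase 1; not the Spark implementation.
object FindTouchedFilesSketch {
  final case class TargetRow(file: String, rowIndex: Long, id: Int)
  final case class SourceRow(id: Int)

  def findTouchedFiles(target: Seq[TargetRow], source: Seq[SourceRow]): Set[String] = {
    // Inner join on the (assumed) merge condition s.id = t.id.
    val matched = for (t <- target; s <- source if s.id == t.id) yield t
    // A target row that appears more than once was matched by several source rows.
    val multipleMatches =
      matched.groupBy(t => (t.file, t.rowIndex)).values.exists(_.size > 1)
    if (multipleMatches) {
      throw new IllegalStateException("multiple source rows matched the same target row")
    }
    // Only files containing at least one matched row need to be read again in phase 2.
    matched.map(_.file).toSet
  }
}
```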
-
def
generateAllActionExprs(targetWriteCols: Seq[Expression], rowIdColumnExpressionOpt: Option[NamedExpression], rowCommitVersionColumnExpressionOpt: Option[NamedExpression], clausesWithPrecompConditions: Seq[DeltaMergeIntoClause], cdcEnabled: Boolean, shouldCountDeletedRows: Boolean): Seq[(ClassicMergeExecutor.this)#ProcessedClause]
Generate expressions for every output column and every merge clause based on the corresponding UPDATE, DELETE and/or INSERT action(s).
- targetWriteCols
List of output column expressions from the target table. Used to generate CDC data for DELETE.
- rowIdColumnExpressionOpt
The optional Row ID preservation column with the physical Row ID name, it stores stable Row IDs of the table.
- rowCommitVersionColumnExpressionOpt
The optional Row Commit Version preservation column with the physical Row Commit Version name, it stores stable Row Commit Versions.
- clausesWithPrecompConditions
List of merge clauses with precomputed conditions. Action expressions are generated for each of these clauses.
- cdcEnabled
Whether the generated expressions should include CDC information.
- shouldCountDeletedRows
Whether metrics for number of deleted rows should be incremented here.
- returns
For each merge clause, a list of ProcessedClause each with a precomputed condition and N+2 action expressions (N output columns + ROW_DROPPED_COL + CDC_TYPE_COLUMN_NAME) to apply on a row when that clause matches.
- Attributes
- protected
- Definition Classes
- MergeOutputGeneration
-
def
generateCdcAndOutputRows(sourceDf: DataFrame, outputCols: Seq[Column], outputColNames: Seq[String], noopCopyExprs: Seq[Expression], rowIdColumnNameOpt: Option[String], rowCommitVersionColumnNameOpt: Option[String], deduplicateDeletes: DeduplicateCDFDeletes): DataFrame
Build the full output as an array of packed rows, then explode into the final result. Based on the CDC type as originally marked, we produce both rows for the CDC_TYPE_NOT_CDC partition to be written to the main table and rows for the CDC partitions to be written as CDC files.
See CDCReader for general details on how partitioning on the CDC type column works.
- Attributes
- protected
- Definition Classes
- MergeOutputGeneration
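The "pack into an array, then explode" pattern corresponds to a `flatMap` on plain collections. The sketch below shows it for an updated row only; `Update`, `Out` and the change-type strings are illustrative assumptions, with `None` standing in for the CDC_TYPE_NOT_CDC partition.

```scala
// Hypothetical in-memory model; the real method builds Spark Column expressions.
object CdcExplodeSketch {
  final case class Update(id: Int, oldValue: String, newValue: String)
  final case class Out(id: Int, value: String, cdcType: Option[String])

  // One input row becomes an array of packed output rows.
  def packed(u: Update): Seq[Out] = Seq(
    Out(u.id, u.newValue, None),                     // row for the main table (not CDC)
    Out(u.id, u.oldValue, Some("update_preimage")),  // CDC row: value before the update
    Out(u.id, u.newValue, Some("update_postimage"))  // CDC row: value after the update
  )

  // Exploding the arrays yields the final flat result.
  def explode(updates: Seq[Update]): Seq[Out] = updates.flatMap(packed)
}
```

Rows with `cdcType == None` would go to the main table, the others to CDC files.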
-
def
generateClauseOutputExprs(numOutputCols: Int, clauses: Seq[(ClassicMergeExecutor.this)#ProcessedClause], noopExprs: Seq[Expression]): Seq[Expression]
Generate the output expression for each output column to apply the correct action for a type of merge clause. For each output column, the resulting expression dispatches the correct action based on all clause conditions.
- numOutputCols
Number of output columns.
- clauses
List of preprocessed merge clauses to bind together.
- noopExprs
Default expression to apply when no condition holds.
- returns
A list of one expression per output column to apply for a type of merge clause.
- Attributes
- protected
- Definition Classes
- MergeOutputGeneration
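The per-column dispatch can be pictured as a CASE WHEN chain: for each output column, apply the action of the first clause whose condition holds, otherwise the no-op expression. The sketch below uses plain functions instead of Catalyst expressions; `Processed` is a hypothetical stand-in for ProcessedClause, and first-match-wins ordering is an assumption of this sketch.

```scala
// Hypothetical in-memory model; rows are maps from column name to Int value.
object ClauseDispatchSketch {
  type Row = Map[String, Int]

  // A condition (None means "always applies") plus one action per output column.
  final case class Processed(condition: Option[Row => Boolean], actions: Seq[Row => Int])

  def generateOutputFns(
      numOutputCols: Int,
      clauses: Seq[Processed],
      noop: Seq[Row => Int]): Seq[Row => Int] =
    (0 until numOutputCols).map { col =>
      (row: Row) =>
        clauses
          .find(_.condition.forall(cond => cond(row))) // first clause whose condition holds
          .map(_.actions(col)(row))                    // apply its action for this column
          .getOrElse(noop(col)(row))                   // default when no condition holds
    }
}
```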
-
def
generateFilterForModifiedRows(): Expression
Returns the expression that can be used for selecting the modified rows generated by the merge operation. The expression is designed to work irrespective of the join type used between the source and target tables.
The expression consists of two parts, one for each of the action clause types that produce row modifications: MATCHED and NOT MATCHED BY SOURCE. All actions of the same clause type form a disjunction. The result is then conjoined with an expression that filters the rows of that particular clause type. For example:
MERGE INTO t
USING s
ON s.id = t.id
WHEN MATCHED AND id < 5 THEN ...
WHEN MATCHED AND id > 10 THEN ...
WHEN NOT MATCHED BY SOURCE AND id > 20 THEN ...
Produces the following expression:
((s.id = t.id) AND (id < 5 OR id > 10)) OR ((SOURCE TABLE IS NULL) AND (id > 20))
- Attributes
- protected
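The filter from the example above can be written as an ordinary predicate over a joined row. In this sketch `Joined` is a hypothetical row shape in which the source side is absent when no source row matched, and the unqualified `id` of the example is read as the target id.

```scala
// Hypothetical in-memory model of the example filter; not the Catalyst expression.
object ModifiedRowsFilterSketch {
  // A joined row: the target id is always present, the source id only when a source row matched.
  final case class Joined(targetId: Int, sourceId: Option[Int])

  // Conditions of the two WHEN MATCHED clauses and of the WHEN NOT MATCHED BY SOURCE clause.
  val matchedConds: Seq[Joined => Boolean] =
    Seq((r: Joined) => r.targetId < 5, (r: Joined) => r.targetId > 10)
  val notMatchedBySourceConds: Seq[Joined => Boolean] =
    Seq((r: Joined) => r.targetId > 20)

  def isModified(row: Joined): Boolean = {
    val matched = row.sourceId.contains(row.targetId) // s.id = t.id
    val sourceMissing = row.sourceId.isEmpty          // SOURCE TABLE IS NULL
    (matched && matchedConds.exists(cond => cond(row))) ||
      (sourceMissing && notMatchedBySourceConds.exists(cond => cond(row)))
  }
}
```

Because the predicate only inspects whether the source side is present, it gives the same answer for any join type that preserves the target rows.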
-
def
generateFilterForNewRows(): Expression
Returns the expression that can be used for selecting the new rows generated by the merge operation.
- Attributes
- protected
-
def
generatePrecomputedConditionsAndDF(sourceDF: DataFrame, clauses: Seq[DeltaMergeIntoClause]): (DataFrame, Seq[DeltaMergeIntoClause])
Precompute conditions in MATCHED and NOT MATCHED clauses and generate the source data frame with precomputed boolean columns.
- sourceDF
the source DataFrame.
- clauses
the merge clauses to precompute.
- returns
Generated sourceDF with precomputed boolean columns, matched clauses with possible rewritten clause conditions, insert clauses with possible rewritten clause conditions
- Attributes
- protected
- Definition Classes
- MergeOutputGeneration
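A rough sketch of the precomputation idea: evaluate each clause condition once per source row, store the result in an extra boolean column, and rewrite the clause so that its condition only reads that column. Everything here is hypothetical (the `Clause` type, the `_cond_<i>` column names, and the assumption that conditions depend only on the source row); which conditions the real method precomputes is not stated in the description.

```scala
// Hypothetical in-memory model; rows are maps from column name to value.
object PrecomputeSketch {
  type Row = Map[String, Any]
  final case class Clause(name: String, condition: Option[Row => Boolean])

  def precompute(source: Seq[Row], clauses: Seq[Clause]): (Seq[Row], Seq[Clause]) = {
    val indexed = clauses.zipWithIndex
    // Add one boolean column per conditional clause to every source row.
    val newSource = source.map { row =>
      indexed.foldLeft(row) {
        case (acc, (Clause(_, Some(cond)), i)) => acc + (s"_cond_$i" -> cond(row))
        case (acc, _)                          => acc
      }
    }
    // Rewrite each conditional clause to read its precomputed column instead.
    val rewritten = indexed.map {
      case (c @ Clause(_, Some(_)), i) =>
        c.copy(condition = Some((r: Row) => r(s"_cond_$i").asInstanceOf[Boolean]))
      case (c, _) => c
    }
    (newSource, rewritten)
  }
}
```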
-
def
generateWriteAllChangesOutputCols(targetWriteCols: Seq[Expression], rowIdColumnExpressionOpt: Option[NamedExpression], rowCommitVersionColumnExpressionOpt: Option[NamedExpression], targetWriteColNames: Seq[String], noopCopyExprs: Seq[Expression], clausesWithPrecompConditions: Seq[DeltaMergeIntoClause], cdcEnabled: Boolean, shouldCountDeletedRows: Boolean = true): IndexedSeq[Column]
Generate the expressions to process full-outer join output and generate target rows.
The output has N + 2 columns: the N target output columns plus two control columns, ROW_DROPPED_COL and the CDC type column. To generate them, we generate N + 2 expressions and apply them on the joinedDF. The CDC column will either be used for CDC generation or be dropped before performing the final write; the other control column will always be dropped after executing the increment metric expression and filtering on ROW_DROPPED_COL.
- Attributes
- protected
- Definition Classes
- MergeOutputGeneration
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
def
toString(): String
- Definition Classes
- AnyRef → Any
-
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
def
writeAllChanges(spark: SparkSession, deltaTxn: OptimisticTransaction, filesToRewrite: Seq[AddFile], deduplicateCDFDeletes: DeduplicateCDFDeletes, writeUnmodifiedRows: Boolean): Seq[FileAction]
Write new files by reading the touched files and updating/inserting data using the source query/table. This is implemented using a full-outer-join using the merge condition.
Note that unlike the insert-only code paths with just one control column ROW_DROPPED_COL, this method has a second control column CDC_TYPE_COL_NAME used for handling CDC when enabled.
- Attributes
- protected
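The full-outer-join approach can be illustrated with two maps keyed by id. This is a simplified model under several assumptions: the merge has one unconditional update clause and one unconditional insert clause, CDC is ignored, and `writeUnmodifiedRows = false` is read as "drop target rows that no source row matched". `Out` and its `rowDropped` field are hypothetical stand-ins for the output row and the ROW_DROPPED_COL control column.

```scala
// Hypothetical in-memory model of the full outer join; not the Spark implementation.
object FullOuterJoinSketch {
  final case class Out(id: Int, value: String, rowDropped: Boolean)

  def writeAllChanges(
      target: Map[Int, String],
      source: Map[Int, String],
      writeUnmodifiedRows: Boolean): Seq[(Int, String)] = {
    // Full outer join on id: every id from either side appears exactly once.
    val ids = (target.keySet ++ source.keySet).toSeq.sorted
    val joined = ids.map { id =>
      (target.get(id), source.get(id)) match {
        case (Some(_), Some(s)) => Out(id, s, rowDropped = false)                // matched: update
        case (None, Some(s))    => Out(id, s, rowDropped = false)                // not matched: insert
        case (Some(t), None)    => Out(id, t, rowDropped = !writeUnmodifiedRows) // unmodified copy
        case (None, None)       => sys.error("unreachable: every id comes from one side")
      }
    }
    // The control column is used for filtering and then discarded.
    joined.filterNot(_.rowDropped).map(o => o.id -> o.value)
  }
}
```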
-
def
writeDVs(spark: SparkSession, deltaTxn: OptimisticTransaction, filesToRewrite: Seq[AddFile]): Seq[FileAction]
Writes Deletion Vectors for rows modified by the merge operation.
- Attributes
- protected
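Conceptually, a deletion vector records which row positions of a file are no longer valid. A minimal sketch, assuming each target row knows its file and position and that a predicate identifies the modified rows; `TargetRow` and the `Map[String, Set[Long]]` result are hypothetical simplifications of the real `FileAction` output.

```scala
// Hypothetical in-memory model; real deletion vectors are stored as compressed bitmaps.
object DeletionVectorSketch {
  final case class TargetRow(file: String, rowIndex: Long, id: Int)

  def deletionVectors(
      touched: Seq[TargetRow],
      isModified: TargetRow => Boolean): Map[String, Set[Long]] =
    touched
      .filter(isModified)  // keep only rows changed by the merge
      .groupBy(_.file)     // one deletion vector per touched file
      .map { case (file, rows) => file -> rows.map(_.rowIndex).toSet }
}
```

Files without any modified row get no entry, so they are left untouched.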