trait CDCReaderImpl extends DeltaLogging
Linear Supertypes
- DeltaLogging
- DatabricksLogging
- DeltaProgressReporter
- LoggingShims
- Logging
- AnyRef
- Any
Type Members
-
implicit
class
LogStringContext extends AnyRef
- Definition Classes
- LoggingShims
-
case class
CDCVersionDiffInfo(fileChangeDf: DataFrame, numFiles: Long, numBytes: Long) extends Product with Serializable
Represents the changes between some start and end version of a Delta table
- fileChangeDf
contains all of the file changes (AddFile, RemoveFile, AddCDCFile)
- numFiles
the number of AddFile + RemoveFile + AddCDCFiles that are in the df
- numBytes
the total size of the AddFile + RemoveFile + AddCDCFiles that are in the df
Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
def
cdcReadSchema(deltaSchema: StructType): StructType
Append CDC metadata columns to the provided schema.
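As a rough illustration of what appending CDC metadata columns means, the sketch below models a schema as a plain list of fields instead of a Spark StructType. The Field type and the CdcSchemaSketch object are hypothetical stand-ins. The three column names follow the Delta change data feed documentation, and their placement after the table columns is an assumption of this sketch.

```scala
object CdcSchemaSketch {
  // Hypothetical stand-in for Spark's StructField, for illustration only.
  final case class Field(name: String, dataType: String)

  // Column names as documented for the Delta change data feed.
  val cdcColumns: Seq[Field] = Seq(
    Field("_change_type", "string"),
    Field("_commit_version", "long"),
    Field("_commit_timestamp", "timestamp")
  )

  // Append the CDC metadata columns after the table's own columns.
  def cdcReadSchema(tableSchema: Seq[Field]): Seq[Field] =
    tableSchema ++ cdcColumns
}
```

For a table with a single id column, the modeled read schema would hold id followed by the three metadata columns.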
-
def
changesToBatchDF(deltaLog: DeltaLog, start: Long, end: Long, spark: SparkSession, readSchemaSnapshot: Option[Snapshot] = None, useCoarseGrainedCDC: Boolean = false, startVersionSnapshot: Option[SnapshotDescriptor] = None): DataFrame
Get the block of change data from start to end Delta log versions (both sides inclusive). The returned DataFrame has isStreaming set to false.
- readSchemaSnapshot
The snapshot with the desired schema that will be used to serve this CDF batch. It is usually passed from upstream, e.g. from DeltaTableV2, in an effort to stabilize the schema used for the batch DF. We don't actually use its data. If not set, it will fall back to the legacy behavior of using whatever deltaLog.unsafeVolatileSnapshot is. This should be avoided in production.
-
def
changesToDF(readSchemaSnapshot: SnapshotDescriptor, start: Long, end: Long, changes: Iterator[(Long, Seq[Action])], spark: SparkSession, isStreaming: Boolean = false, useCoarseGrainedCDC: Boolean = false, startVersionSnapshot: Option[SnapshotDescriptor] = None): CDCVersionDiffInfo
For a sequence of changes (AddFile, RemoveFile, AddCDCFile), create a DataFrame that represents the captured change data between start and end, inclusive.
Builds the DataFrame using the following logic: for each change of type (Long, Seq[Action]) in changes, iterates over the actions and handles two cases.
- If there are any CDC actions, then we ignore the AddFile and RemoveFile actions in that version and create an AddCDCFile instead.
- If there are no CDC actions, then we must infer the CDC data from the AddFile and RemoveFile actions, taking only those with dataChange = true.
These buffers of AddFile, RemoveFile, and AddCDCFile actions are then used to create corresponding FileIndexes (e.g. TahoeChangeFileIndex), where each is suited to use the given action type to read CDC data. These FileIndexes are then unioned to produce the final DataFrame.
- readSchemaSnapshot
- Snapshot of the table for which we are creating a CDF DataFrame. The schema of the snapshot is expected to be the change DF's schema. This snapshot has already been adjusted with the schema mode, if any. We don't actually use its data.
- start
- startingVersion of the changes
- end
- endingVersion of the changes
- changes
- changes is an iterator of all FileActions for a particular commit version. Note that for log files where InCommitTimestamps are enabled, the iterator must also contain the CommitInfo action.
- spark
- SparkSession
- isStreaming
- indicates whether the DataFrame returned is a streaming DataFrame
- useCoarseGrainedCDC
- ignores checks related to CDC being disabled in any of the versions and computes CDC entirely from AddFiles/RemoveFiles (ignoring AddCDCFile actions)
- startVersionSnapshot
- The snapshot of the starting version.
- returns
CDCVersionDiffInfo, which contains the DataFrame of the changes as well as the statistics related to the changes
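The per-version selection described for changesToDF can be sketched without Spark. The action classes below are hypothetical, reduced stand-ins for Delta's AddFile, RemoveFile and AddCDCFile. Buffers and selectActions are illustrative names, not part of the trait. The sketch covers only the choice of which actions feed each buffer, not the construction of file indexes or DataFrames.

```scala
object ChangeSelectionSketch {
  // Hypothetical, simplified stand-ins for Delta's file actions.
  sealed trait Action
  final case class AddFile(path: String, dataChange: Boolean) extends Action
  final case class RemoveFile(path: String, dataChange: Boolean) extends Action
  final case class AddCDCFile(path: String) extends Action

  // One buffer per action type; each would later back its own file index.
  final case class Buffers(
      adds: Seq[(Long, AddFile)],
      removes: Seq[(Long, RemoveFile)],
      cdcFiles: Seq[(Long, AddCDCFile)])

  def selectActions(changes: Iterator[(Long, Seq[Action])]): Buffers =
    changes.foldLeft(Buffers(Nil, Nil, Nil)) { case (acc, (version, actions)) =>
      val cdc = actions.collect { case c: AddCDCFile => version -> c }
      if (cdc.nonEmpty) {
        // Case 1: explicit CDC files exist, so AddFile/RemoveFile in this version are ignored.
        acc.copy(cdcFiles = acc.cdcFiles ++ cdc)
      } else {
        // Case 2: infer changes from AddFile/RemoveFile actions with dataChange = true.
        val adds = actions.collect { case a: AddFile if a.dataChange => version -> a }
        val removes = actions.collect { case r: RemoveFile if r.dataChange => version -> r }
        acc.copy(adds = acc.adds ++ adds, removes = acc.removes ++ removes)
      }
    }
}
```

In this model a version that carries both an AddFile and an AddCDCFile contributes only to the cdcFiles buffer, while an AddFile with dataChange = false is dropped entirely.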
-
def
clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
def
deltaAssert(check: ⇒ Boolean, name: String, msg: String, deltaLog: DeltaLog = null, data: AnyRef = null, path: Option[Path] = None): Unit
Helper method to check invariants in Delta code. Fails when running in tests, records a delta assertion event and logs a warning otherwise.
- Attributes
- protected
- Definition Classes
- DeltaLogging
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
-
def
getBatchSchemaModeForTable(spark: SparkSession, columnMappingEnabled: Boolean): DeltaBatchCDFSchemaMode
Get the batch cdf schema mode for a table, considering whether it has column mapping enabled or not.
-
def
getCDCRelation(spark: SparkSession, snapshotToUse: Snapshot, isTimeTravelQuery: Boolean, conf: SQLConf, options: CaseInsensitiveStringMap): BaseRelation
Get a Relation that represents change data between two snapshots of the table.
- spark
Spark session
- snapshotToUse
Snapshot to use to provide read schema and version
- isTimeTravelQuery
Whether this CDC scan is used in conjunction with time-travel args
- conf
SQL conf
- options
CDC specific options
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
getCommonTags(deltaLog: DeltaLog, tahoeId: String): Map[TagDefinition, String]
- Definition Classes
- DeltaLogging
-
def
getDeletedAndAddedRows(addFileSpecs: Seq[CDCDataSpec[AddFile]], removeFileSpecs: Seq[CDCDataSpec[RemoveFile]], deltaLog: DeltaLog, snapshot: SnapshotDescriptor, isStreaming: Boolean, spark: SparkSession): Seq[DataFrame]
Generate CDC rows by looking at added and removed files, together with any Deletion Vectors (DVs) they may have.
When DVs are used, the same file can be removed and then added in the same version, with the only difference being the assigned DVs. The base method does not consider DVs in this case, and thus produces CDC output in which *all* rows in the file are removed and then *some* are re-added. The correct answer, however, is to compare the two DVs and apply the diff to the file to get the removed and re-added rows.
Currently it is always the case that in the log "remove" comes first, followed by "add", which means that the file stays alive with a new DV. There is another possibility, though it does not make much sense, that a file is "added" to the log and then "removed" in the same version. If this becomes possible in the future, we will have to reconstruct the timeline considering the order of actions rather than simply matching files by path.
- Attributes
- protected
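The DV comparison can be illustrated with a small model in which a deletion vector is just the set of row indexes marked deleted in one file. DvDiffSketch, RowChange and diff are hypothetical names. The real method operates on DataFrames and file actions rather than in-memory sets.

```scala
object DvDiffSketch {
  // A deletion vector modeled as the set of row indexes marked deleted in one file.
  type DV = Set[Long]

  final case class RowChange(path: String, rowIndex: Long, changeType: String)

  // removedDv: DV attached to the RemoveFile; addedDv: DV attached to the AddFile
  // for the same path in the same version.
  def diff(path: String, removedDv: DV, addedDv: DV): Seq[RowChange] = {
    // Rows newly marked deleted in this version.
    val deleted = (addedDv -- removedDv).toSeq.sorted.map(RowChange(path, _, "delete"))
    // Rows that were marked deleted before and are visible again.
    val reAdded = (removedDv -- addedDv).toSeq.sorted.map(RowChange(path, _, "insert"))
    deleted ++ reAdded
  }
}
```

Rows present in both DVs, or in neither, produce no change rows, which is the difference from treating the whole file as removed and then re-added.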
-
def
getErrorData(e: Throwable): Map[String, Any]
- Definition Classes
- DeltaLogging
-
def
getNonICTTimestampsByVersion(deltaLog: DeltaLog, start: Long, end: Long): Map[Long, Timestamp]
Builds a map from commit versions to associated commit timestamps where the timestamp is the modification time of the commit file. Note that this function will not return InCommitTimestamps. It is up to the consumer of this function to decide whether the file modification time is the correct commit timestamp or whether they need to read the ICT.
- start
start commit version
- end
end commit version (inclusive)
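A minimal sketch of the version-to-timestamp mapping, assuming the commit files are already listed as (file name, modification time in milliseconds) pairs. It relies on the Delta convention that commit files are named with a 20-digit zero-padded version followed by .json. CommitTimestampSketch and timestampsByVersion are hypothetical names, and any timestamp adjustment the real implementation may perform is not modeled.

```scala
object CommitTimestampSketch {
  // Matches commit file names such as 00000000000000000007.json.
  private val CommitFile = """(\d{20})\.json""".r

  // files: (file name, modification time in epoch milliseconds).
  // Both start and end are inclusive.
  def timestampsByVersion(
      files: Seq[(String, Long)],
      start: Long,
      end: Long): Map[Long, java.sql.Timestamp] =
    files.collect {
      case (CommitFile(v), mtime) if v.toLong >= start && v.toLong <= end =>
        v.toLong -> new java.sql.Timestamp(mtime)
    }.toMap
}
```

Files that are not commit files, such as checkpoints, do not match the pattern and are skipped.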
-
def
hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
def
initializeLogIfNecessary(isInterpreter: Boolean): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
isCDCEnabledOnTable(metadata: Metadata, spark: SparkSession): Boolean
Determine if the metadata provided has cdc enabled or not.
-
def
isCDCRead(options: CaseInsensitiveStringMap): Boolean
Indicates, based on the read options passed, whether the read is a CDC read.
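A sketch of such an option check, using a plain Map in place of CaseInsensitiveStringMap. The option keys readChangeFeed and readChangeData are taken from the Delta change data feed documentation and are assumptions about what this method inspects. CdcReadOptionSketch is a hypothetical name.

```scala
object CdcReadOptionSketch {
  // Option keys assumed from the Delta change data feed documentation.
  val CdcEnabledKey = "readChangeFeed"
  val CdcEnabledKeyLegacy = "readChangeData"

  // Plain Map stand-in for CaseInsensitiveStringMap: keys are lower-cased before lookup.
  def isCDCRead(options: Map[String, String]): Boolean = {
    val normalized = options.map { case (k, v) => k.toLowerCase(java.util.Locale.ROOT) -> v }
    Seq(CdcEnabledKey, CdcEnabledKeyLegacy).exists { key =>
      normalized
        .get(key.toLowerCase(java.util.Locale.ROOT))
        .exists(_.trim.equalsIgnoreCase("true"))
    }
  }
}
```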
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
def
isTraceEnabled(): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
def
log: Logger
- Attributes
- protected
- Definition Classes
- Logging
-
def
logConsole(line: String): Unit
- Definition Classes
- DatabricksLogging
-
def
logDebug(entry: LogEntry, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logDebug(entry: LogEntry): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logDebug(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logDebug(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(entry: LogEntry, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logError(entry: LogEntry): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logError(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(entry: LogEntry, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logInfo(entry: LogEntry): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logInfo(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logName: String
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(entry: LogEntry, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logTrace(entry: LogEntry): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logTrace(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(entry: LogEntry, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logWarning(entry: LogEntry): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logWarning(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
def
processDeletionVectorActions(addFilesMap: Map[FilePathWithTableVersion, AddFile], removeFilesMap: Map[FilePathWithTableVersion, RemoveFile], versionToCommitInfo: Map[Long, CommitInfo], deltaLog: DeltaLog, snapshot: SnapshotDescriptor, isStreaming: Boolean, spark: SparkSession): Seq[DataFrame]
-
def
recordDeltaEvent(deltaLog: DeltaLog, opType: String, tags: Map[TagDefinition, String] = Map.empty, data: AnyRef = null, path: Option[Path] = None): Unit
Used to record the occurrence of a single event or report detailed, operation specific statistics.
- path
Used to log the path of the delta table when deltaLog is null.
- Attributes
- protected
- Definition Classes
- DeltaLogging
-
def
recordDeltaOperation[A](deltaLog: DeltaLog, opType: String, tags: Map[TagDefinition, String] = Map.empty)(thunk: ⇒ A): A
Used to report the duration as well as the success or failure of an operation on a deltaLog.
- Attributes
- protected
- Definition Classes
- DeltaLogging
-
def
recordDeltaOperationForTablePath[A](tablePath: String, opType: String, tags: Map[TagDefinition, String] = Map.empty)(thunk: ⇒ A): A
Used to report the duration as well as the success or failure of an operation on a tahoePath.
- Attributes
- protected
- Definition Classes
- DeltaLogging
-
def
recordEvent(metric: MetricDefinition, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, trimBlob: Boolean = true): Unit
- Definition Classes
- DatabricksLogging
-
def
recordFrameProfile[T](group: String, name: String)(thunk: ⇒ T): T
- Attributes
- protected
- Definition Classes
- DeltaLogging
-
def
recordOperation[S](opType: OpType, opTarget: String = null, extraTags: Map[TagDefinition, String], isSynchronous: Boolean = true, alwaysRecordStats: Boolean = false, allowAuthTags: Boolean = false, killJvmIfStuck: Boolean = false, outputMetric: MetricDefinition = METRIC_OPERATION_DURATION, silent: Boolean = true)(thunk: ⇒ S): S
- Definition Classes
- DatabricksLogging
-
def
recordProductEvent(metric: MetricDefinition with CentralizableMetric, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, trimBlob: Boolean = true): Unit
- Definition Classes
- DatabricksLogging
-
def
recordProductUsage(metric: MetricDefinition with CentralizableMetric, quantity: Double, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, forceSample: Boolean = false, trimBlob: Boolean = true, silent: Boolean = false): Unit
- Definition Classes
- DatabricksLogging
-
def
recordUsage(metric: MetricDefinition, quantity: Double, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, forceSample: Boolean = false, trimBlob: Boolean = true, silent: Boolean = false): Unit
- Definition Classes
- DatabricksLogging
-
def
scanIndex(spark: SparkSession, index: TahoeFileIndexWithSnapshotDescriptor, isStreaming: Boolean = false): DataFrame
Build a dataframe from the specified file index. We can't use a DataFrame scan directly on the file names because that scan wouldn't include partition columns.
It can optionally take a customReadSchema for the dataframe generated.
- Attributes
- protected
-
def
shouldSkipFileActionsInCommit(commitInfo: CommitInfo): Boolean
Function to check if file actions should be skipped for no-op merges based on CommitInfo metrics. MERGE will sometimes rewrite files in a way which *could* have changed data (so dataChange = true) but did not actually do so (so no CDC will be produced). In this case the correct CDC output is empty - we shouldn't serve it from those files. This should be handled within the command, but as a hotfix-safe fix, we check the metrics. If the command reported 0 rows inserted, updated, or deleted, then CDC shouldn't be produced.
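The metrics check can be sketched as follows. CommitInfoLite is a hypothetical, reduced view of CommitInfo. The operation name MERGE and the metric names numTargetRowsInserted, numTargetRowsUpdated and numTargetRowsDeleted are assumptions about what the commit records. Treating missing metrics as "do not skip" is a conservative choice made for this sketch only.

```scala
object NoOpMergeSketch {
  // Hypothetical, reduced view of CommitInfo: operation name plus string-valued metrics.
  final case class CommitInfoLite(
      operation: String,
      operationMetrics: Option[Map[String, String]])

  // Metric names are assumptions for this sketch.
  private val rowMetrics =
    Seq("numTargetRowsInserted", "numTargetRowsUpdated", "numTargetRowsDeleted")

  // Skip file actions only for a MERGE that reports zero inserted, updated and deleted rows.
  // Commits without metrics, or with any metric missing, are never skipped.
  def shouldSkipFileActions(ci: CommitInfoLite): Boolean =
    ci.operation == "MERGE" && ci.operationMetrics.exists { metrics =>
      rowMetrics.forall(name => metrics.get(name).contains("0"))
    }
}
```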
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
def
toString(): String
- Definition Classes
- AnyRef → Any
-
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
def
withStatusCode[T](statusCode: String, defaultMessage: String, data: Map[String, Any] = Map.empty)(body: ⇒ T): T
Report a log to indicate some command is running.
- Definition Classes
- DeltaProgressReporter