
object CDCReader extends CDCReaderImpl

The API for reading change data between two versions of a table.

The basic abstraction here is the CDC type column defined by CDCReader.CDC_TYPE_COLUMN_NAME. When CDC is enabled, our writer will treat this column as a special partition column even though it's not part of the table. Writers should generate a query that has two types of rows in it: the main data in partition CDC_TYPE_NOT_CDC and the CDC data with the appropriate CDC type value.

org.apache.spark.sql.delta.files.DelayedCommitProtocol does special handling for this column, dispatching the main data to its normal location while the CDC data is sent to AddCDCFile entries.
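
As a rough illustration of this contract (not the actual writer code), a CDC-aware command produces a single query whose rows carry the CDC type column so the commit protocol can route them. The DataFrames and helper below are hypothetical, and the null "not CDC" value is only illustrative:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Hypothetical helper: tag main table data and CDC rows so that
// DelayedCommitProtocol can route them by the CDC type column.
// "_change_type" is the value of CDCReader.CDC_TYPE_COLUMN_NAME; the
// "not CDC" sentinel is shown here as a null change type for illustration.
def tagRowsForCommit(mainData: DataFrame, cdcRows: DataFrame): DataFrame = {
  val main = mainData.withColumn("_change_type", lit(null).cast("string"))
  val cdc  = cdcRows // assumed to already carry "insert", "delete",
                     // "update_preimage" or "update_postimage"
  main.unionByName(cdc)
}
```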

Linear Supertypes
  CDCReaderImpl, DeltaLogging, DatabricksLogging, DeltaProgressReporter, LoggingShims, Logging, AnyRef, Any

Type Members

  1. implicit class LogStringContext extends AnyRef
    Definition Classes
    LoggingShims
  2. case class CDCDataSpec[T <: FileAction](version: Long, timestamp: Timestamp, actions: Seq[T], commitInfo: Option[CommitInfo]) extends Product with Serializable
  3. case class DeltaCDFRelation(snapshotWithSchemaMode: SnapshotWithSchemaMode, sqlContext: SQLContext, startingVersion: Option[Long], endingVersion: Option[Long]) extends BaseRelation with CatalystScan with Product with Serializable

    A special BaseRelation wrapper for CDF reads.

  4. case class FilePathWithTableVersion(path: String, commitInfo: Option[CommitInfo], version: Long, timestamp: Timestamp) extends Product with Serializable

    Path of a file of a Delta table, together with its origin table version and timestamp.

  5. case class SnapshotWithSchemaMode(snapshot: Snapshot, schemaMode: DeltaBatchCDFSchemaMode) extends Product with Serializable
  6. case class TableVersion(version: Long, timestamp: Timestamp) extends Product with Serializable

    A version number of a Delta table, with the version's timestamp.

  7. case class CDCVersionDiffInfo(fileChangeDf: DataFrame, numFiles: Long, numBytes: Long) extends Product with Serializable

    Represents the changes between some start and end version of a Delta table.

    fileChangeDf
      contains all of the file changes (AddFile, RemoveFile, AddCDCFile)
    numFiles
      the number of AddFile, RemoveFile, and AddCDCFile actions in the DataFrame
    numBytes
      the total size of the AddFile, RemoveFile, and AddCDCFile actions in the DataFrame

    Definition Classes
    CDCReaderImpl

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. val CDC_COLUMNS_IN_DATA: Seq[String]
  5. val CDC_COMMIT_TIMESTAMP: String
  6. val CDC_COMMIT_VERSION: String
  7. val CDC_LOCATION: String
  8. val CDC_PARTITION_COL: String
  9. val CDC_TYPE_COLUMN_NAME: String
  10. val CDC_TYPE_DELETE: Literal
  11. val CDC_TYPE_DELETE_STRING: String
  12. val CDC_TYPE_INSERT: String
  13. val CDC_TYPE_NOT_CDC: Literal
  14. val CDC_TYPE_UPDATE_POSTIMAGE: String
  15. val CDC_TYPE_UPDATE_PREIMAGE: String
  16. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  17. def cdcAttributes: Seq[Attribute]

    The CDC metadata columns, as attributes.

  18. def cdcReadSchema(deltaSchema: StructType): StructType

    Append CDC metadata columns (change type, commit version, commit timestamp) to the provided schema. A small usage sketch follows this entry.

    Definition Classes
    CDCReaderImpl
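
A minimal sketch of calling cdcReadSchema, assuming the documented CDF metadata column names (_change_type, _commit_version, _commit_timestamp):

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import org.apache.spark.sql.delta.commands.cdc.CDCReader

val tableSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("value", StringType)))

val cdfSchema = CDCReader.cdcReadSchema(tableSchema)
// cdfSchema: the table columns followed by the CDC metadata columns,
// e.g. _change_type (string), _commit_version (long), _commit_timestamp (timestamp).
```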
  19. def changesToBatchDF(deltaLog: DeltaLog, start: Long, end: Long, spark: SparkSession, readSchemaSnapshot: Option[Snapshot] = None, useCoarseGrainedCDC: Boolean = false, startVersionSnapshot: Option[SnapshotDescriptor] = None): DataFrame

    Get the block of change data from start to end Delta log versions (both sides inclusive). The returned DataFrame has isStreaming set to false. A usage sketch follows this entry.

    readSchemaSnapshot
      The snapshot with the desired schema that will be used to serve this CDF batch. It is usually passed down from e.g. DeltaTableV2 in an effort to stabilize the schema used for the batch DataFrame; its data is not actually used. If not set, it falls back to the legacy behavior of using whatever deltaLog.unsafeVolatileSnapshot is, which should be avoided in production.

    Definition Classes
    CDCReaderImpl
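
A hedged usage sketch for changesToBatchDF; the table path and version range are placeholders, and the remaining parameters keep their defaults:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.delta.DeltaLog
import org.apache.spark.sql.delta.commands.cdc.CDCReader

val spark = SparkSession.active
val deltaLog = DeltaLog.forTable(spark, "/path/to/table")

// Changes committed in versions 5 through 10, inclusive.
val changesDf = CDCReader.changesToBatchDF(deltaLog, 5L, 10L, spark)
// changesDf.isStreaming == false, and the CDC metadata columns are included.
```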
  20. def changesToDF(readSchemaSnapshot: SnapshotDescriptor, start: Long, end: Long, changes: Iterator[(Long, Seq[Action])], spark: SparkSession, isStreaming: Boolean = false, useCoarseGrainedCDC: Boolean = false, startVersionSnapshot: Option[SnapshotDescriptor] = None): CDCVersionDiffInfo

    For a sequence of changes (AddFile, RemoveFile, AddCDCFile), create a DataFrame that represents the captured change data between start and end, inclusive.

    Builds the DataFrame using the following logic. For each change of type (Long, Seq[Action]) in changes, iterate over the actions and handle two cases:
    - If there are any CDC actions, ignore the AddFile and RemoveFile actions in that version and use the AddCDCFile actions instead.
    - If there are no CDC actions, infer the CDC data from the AddFile and RemoveFile actions, taking only those with dataChange = true.

    These buffers of AddFile, RemoveFile, and AddCDCFile actions are then used to create corresponding FileIndexes (e.g. TahoeChangeFileIndex), each suited to reading CDC data from its action type. The FileIndexes are then unioned to produce the final DataFrame. A sketch of this per-version branching follows this entry.

    readSchemaSnapshot
      Snapshot of the table for which we are creating the CDF DataFrame; its schema is expected to be the change DataFrame's schema, already adjusted for the schema mode, if any. Its data is not actually used.
    start
      startingVersion of the changes
    end
      endingVersion of the changes
    changes
      an iterator of all FileActions per commit version. Note that for log files where InCommitTimestamps are enabled, the iterator must also contain the CommitInfo action.
    spark
      the SparkSession
    isStreaming
      indicates whether the returned DataFrame is a streaming DataFrame
    useCoarseGrainedCDC
      ignores checks related to CDC being disabled in any of the versions and computes CDC entirely from AddFile/RemoveFile actions (ignoring AddCDCFile actions)
    startVersionSnapshot
      the snapshot of the starting version
    returns
      a CDCVersionDiffInfo which contains the DataFrame of the changes as well as statistics about the changes

    Definition Classes
    CDCReaderImpl
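
A simplified sketch of the per-version branching described above; it mirrors the documented logic rather than the actual implementation, and the classifyVersion helper is hypothetical:

```scala
import org.apache.spark.sql.delta.actions.{Action, AddCDCFile, AddFile, RemoveFile}

// For one committed version: prefer explicit AddCDCFile actions; otherwise
// fall back to AddFile/RemoveFile actions that actually changed data.
def classifyVersion(
    actions: Seq[Action]): (Seq[AddCDCFile], Seq[AddFile], Seq[RemoveFile]) = {
  val cdcFiles = actions.collect { case c: AddCDCFile => c }
  if (cdcFiles.nonEmpty) {
    // CDC actions are authoritative for this version; data file actions are ignored.
    (cdcFiles, Nil, Nil)
  } else {
    // Infer CDC from the data file actions with dataChange = true.
    val adds = actions.collect { case a: AddFile if a.dataChange => a }
    val removes = actions.collect { case r: RemoveFile if r.dataChange => r }
    (Nil, adds, removes)
  }
}
```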
  21. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  22. def deltaAssert(check: ⇒ Boolean, name: String, msg: String, deltaLog: DeltaLog = null, data: AnyRef = null, path: Option[Path] = None): Unit

    Helper method to check invariants in Delta code. Fails when running in tests; otherwise records a delta assertion event and logs a warning.

    Attributes
    protected
    Definition Classes
    DeltaLogging
  23. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  24. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  25. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  26. def getBatchSchemaModeForTable(spark: SparkSession, columnMappingEnabled: Boolean): DeltaBatchCDFSchemaMode

    Get the batch CDF schema mode for a table, considering whether it has column mapping enabled or not.

    Definition Classes
    CDCReaderImpl
  27. def getCDCRelation(spark: SparkSession, snapshotToUse: Snapshot, isTimeTravelQuery: Boolean, conf: SQLConf, options: CaseInsensitiveStringMap): BaseRelation

    Get a Relation that represents change data between two snapshots of the table. The user-facing equivalent of this read path is sketched after this entry.

    spark
      Spark session
    snapshotToUse
      snapshot to use to provide the read schema and version
    isTimeTravelQuery
      whether this CDC scan is used in conjunction with time-travel arguments
    conf
      SQL conf
    options
      CDC-specific options

    Definition Classes
    CDCReaderImpl
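
For context, the user-facing batch read that ends up going through this relation can be expressed with the public DataFrame reader options (the path and versions are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.active
// readChangeFeed is the documented Delta option that switches a read to CDF.
val cdf = spark.read.format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", "5")
  .option("endingVersion", "10")
  .load("/path/to/table")
```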
  28. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  29. def getCommonTags(deltaLog: DeltaLog, tahoeId: String): Map[TagDefinition, String]
    Definition Classes
    DeltaLogging
  30. def getDeletedAndAddedRows(addFileSpecs: Seq[CDCDataSpec[AddFile]], removeFileSpecs: Seq[CDCDataSpec[RemoveFile]], deltaLog: DeltaLog, snapshot: SnapshotDescriptor, isStreaming: Boolean, spark: SparkSession): Seq[DataFrame]

    Generate CDC rows by looking at added and removed files, together with any Deletion Vectors they may have. A conceptual sketch of the DV diff follows this entry.

    When DVs are used, the same file can be removed and then added back in the same version, with the assigned DVs being the only difference. The base method does not consider DVs in this case, and would therefore produce CDC saying that *all* rows in the file were removed and *some* were re-added. The correct answer, however, is obtained by comparing the two DVs and applying the diff to the file to get the removed and re-added rows.

    Currently it is always the case that in the log the "remove" comes first, followed by the "add" -- which means that the file stays alive with a new DV. There is another possibility, though it does not make much sense: a file could be "added" to the log and then "removed" in the same version. If this becomes possible in the future, we would have to reconstruct the timeline considering the order of actions rather than simply matching files by path.

    Attributes
    protected
    Definition Classes
    CDCReaderImpl
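
A conceptual sketch of the DV diff described above, with deletion vectors reduced to plain sets of row indexes; dvDiff is a hypothetical helper, not part of this API:

```scala
// Rows newly marked deleted become "delete" CDC rows; rows that were already
// deleted in the old DV produce no change.
def dvDiff(oldDeleted: Set[Long], newDeleted: Set[Long]): (Set[Long], Set[Long]) = {
  val newlyDeleted = newDeleted -- oldDeleted  // emit as "delete" change rows
  val restored     = oldDeleted -- newDeleted  // would be re-inserted rows (not expected today)
  (newlyDeleted, restored)
}
```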
  31. def getErrorData(e: Throwable): Map[String, Any]
    Definition Classes
    DeltaLogging
  32. def getNonICTTimestampsByVersion(deltaLog: DeltaLog, start: Long, end: Long): Map[Long, Timestamp]

    Builds a map from commit versions to their associated commit timestamps, where the timestamp is the modification time of the commit file. Note that this function will not return InCommitTimestamps; it is up to the consumer of this function to decide whether the file modification time is the correct commit timestamp or whether they need to read the ICT.

    start
      start commit version
    end
      end commit version (inclusive)

    Definition Classes
    CDCReaderImpl
  33. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  34. def initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
    Attributes
    protected
    Definition Classes
    Logging
  35. def initializeLogIfNecessary(isInterpreter: Boolean): Unit
    Attributes
    protected
    Definition Classes
    Logging
  36. def isCDCEnabledOnTable(metadata: Metadata, spark: SparkSession): Boolean

    Determine whether the provided metadata has CDC enabled or not. A sketch of enabling and checking this follows this entry.

    Definition Classes
    CDCReaderImpl
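
A hedged end-to-end sketch: CDF is controlled by the delta.enableChangeDataFeed table property, and this helper checks it from the table metadata (the table path is a placeholder):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.delta.DeltaLog
import org.apache.spark.sql.delta.commands.cdc.CDCReader

val spark = SparkSession.active

// Enable the change data feed on the table.
spark.sql(
  "ALTER TABLE delta.`/path/to/table` " +
  "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

// Check the flag from the latest snapshot's metadata.
val snapshot = DeltaLog.forTable(spark, "/path/to/table").update()
val cdcEnabled = CDCReader.isCDCEnabledOnTable(snapshot.metadata, spark)
```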
  37. def isCDCRead(options: CaseInsensitiveStringMap): Boolean

    Based on the read options passed, indicates whether the read is a CDC read or not. A small example follows this entry.

    Definition Classes
    CDCReaderImpl
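
A small example, assuming the documented readChangeFeed option name:

```scala
import java.util.Collections
import org.apache.spark.sql.util.CaseInsensitiveStringMap
import org.apache.spark.sql.delta.commands.cdc.CDCReader

// Passing readChangeFeed=true marks the read as a CDC read.
val options = new CaseInsensitiveStringMap(
  Collections.singletonMap("readChangeFeed", "true"))
assert(CDCReader.isCDCRead(options))
```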
  38. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  39. def isTraceEnabled(): Boolean
    Attributes
    protected
    Definition Classes
    Logging
  40. def log: Logger
    Attributes
    protected
    Definition Classes
    Logging
  41. def logConsole(line: String): Unit
    Definition Classes
    DatabricksLogging
  42. def logDebug(entry: LogEntry, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    LoggingShims
  43. def logDebug(entry: LogEntry): Unit
    Attributes
    protected
    Definition Classes
    LoggingShims
  44. def logDebug(msg: ⇒ String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  45. def logDebug(msg: ⇒ String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  46. def logError(entry: LogEntry, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    LoggingShims
  47. def logError(entry: LogEntry): Unit
    Attributes
    protected
    Definition Classes
    LoggingShims
  48. def logError(msg: ⇒ String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  49. def logError(msg: ⇒ String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  50. def logInfo(entry: LogEntry, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    LoggingShims
  51. def logInfo(entry: LogEntry): Unit
    Attributes
    protected
    Definition Classes
    LoggingShims
  52. def logInfo(msg: ⇒ String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  53. def logInfo(msg: ⇒ String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  54. def logName: String
    Attributes
    protected
    Definition Classes
    Logging
  55. def logTrace(entry: LogEntry, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    LoggingShims
  56. def logTrace(entry: LogEntry): Unit
    Attributes
    protected
    Definition Classes
    LoggingShims
  57. def logTrace(msg: ⇒ String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  58. def logTrace(msg: ⇒ String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  59. def logWarning(entry: LogEntry, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    LoggingShims
  60. def logWarning(entry: LogEntry): Unit
    Attributes
    protected
    Definition Classes
    LoggingShims
  61. def logWarning(msg: ⇒ String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  62. def logWarning(msg: ⇒ String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  63. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  64. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  65. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  66. def processDeletionVectorActions(addFilesMap: Map[FilePathWithTableVersion, AddFile], removeFilesMap: Map[FilePathWithTableVersion, RemoveFile], versionToCommitInfo: Map[Long, CommitInfo], deltaLog: DeltaLog, snapshot: SnapshotDescriptor, isStreaming: Boolean, spark: SparkSession): Seq[DataFrame]
    Definition Classes
    CDCReaderImpl
  67. def recordDeltaEvent(deltaLog: DeltaLog, opType: String, tags: Map[TagDefinition, String] = Map.empty, data: AnyRef = null, path: Option[Path] = None): Unit

    Used to record the occurrence of a single event or to report detailed, operation-specific statistics.

    path
      Used to log the path of the Delta table when deltaLog is null.

    Attributes
    protected
    Definition Classes
    DeltaLogging
  68. def recordDeltaOperation[A](deltaLog: DeltaLog, opType: String, tags: Map[TagDefinition, String] = Map.empty)(thunk: ⇒ A): A

    Used to report the duration as well as the success or failure of an operation on a deltaLog.

    Attributes
    protected
    Definition Classes
    DeltaLogging
  69. def recordDeltaOperationForTablePath[A](tablePath: String, opType: String, tags: Map[TagDefinition, String] = Map.empty)(thunk: ⇒ A): A

    Used to report the duration as well as the success or failure of an operation on a tahoePath.

    Attributes
    protected
    Definition Classes
    DeltaLogging
  70. def recordEvent(metric: MetricDefinition, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, trimBlob: Boolean = true): Unit
    Definition Classes
    DatabricksLogging
  71. def recordFrameProfile[T](group: String, name: String)(thunk: ⇒ T): T
    Attributes
    protected
    Definition Classes
    DeltaLogging
  72. def recordOperation[S](opType: OpType, opTarget: String = null, extraTags: Map[TagDefinition, String], isSynchronous: Boolean = true, alwaysRecordStats: Boolean = false, allowAuthTags: Boolean = false, killJvmIfStuck: Boolean = false, outputMetric: MetricDefinition = METRIC_OPERATION_DURATION, silent: Boolean = true)(thunk: ⇒ S): S
    Definition Classes
    DatabricksLogging
  73. def recordProductEvent(metric: MetricDefinition with CentralizableMetric, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, trimBlob: Boolean = true): Unit
    Definition Classes
    DatabricksLogging
  74. def recordProductUsage(metric: MetricDefinition with CentralizableMetric, quantity: Double, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, forceSample: Boolean = false, trimBlob: Boolean = true, silent: Boolean = false): Unit
    Definition Classes
    DatabricksLogging
  75. def recordUsage(metric: MetricDefinition, quantity: Double, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, forceSample: Boolean = false, trimBlob: Boolean = true, silent: Boolean = false): Unit
    Definition Classes
    DatabricksLogging
  76. def scanIndex(spark: SparkSession, index: TahoeFileIndexWithSnapshotDescriptor, isStreaming: Boolean = false): DataFrame

    Build a DataFrame from the specified file index. We can't use a DataFrame scan directly on the file names because that scan wouldn't include partition columns. It can optionally take a customReadSchema for the generated DataFrame.

    Attributes
    protected
    Definition Classes
    CDCReaderImpl
  77. def shouldSkipFileActionsInCommit(commitInfo: CommitInfo): Boolean

    Checks whether file actions should be skipped for no-op merges, based on CommitInfo metrics. MERGE will sometimes rewrite files in a way which *could* have changed data (so dataChange = true) but did not actually do so (so no CDC will be produced). In this case the correct CDC output is empty -- we shouldn't serve it from those files. This should be handled within the command, but as a hotfix-safe fix we check the metrics: if the command reported 0 rows inserted, updated, or deleted, then no CDC should be produced. A sketch of this check follows this entry.

    Definition Classes
    CDCReaderImpl
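
A hedged sketch of this metric check; looksLikeNoOpMerge is a hypothetical stand-in for the real logic in CDCReaderImpl, and the metric names are the documented MERGE operation metrics:

```scala
import org.apache.spark.sql.delta.actions.CommitInfo

// A MERGE commit that reported zero inserted, updated, and deleted rows
// should not produce any CDC, even if it rewrote files with dataChange = true.
def looksLikeNoOpMerge(commitInfo: CommitInfo): Boolean = {
  val metrics = commitInfo.operationMetrics.getOrElse(Map.empty[String, String])
  def asLong(name: String): Long = metrics.get(name).map(_.toLong).getOrElse(0L)
  commitInfo.operation == "MERGE" &&
    asLong("numTargetRowsInserted") == 0L &&
    asLong("numTargetRowsUpdated") == 0L &&
    asLong("numTargetRowsDeleted") == 0L
}
```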
  78. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  79. def toString(): String
    Definition Classes
    AnyRef → Any
  80. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  81. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  82. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  83. def withStatusCode[T](statusCode: String, defaultMessage: String, data: Map[String, Any] = Map.empty)(body: ⇒ T): T

    Report a log to indicate some command is running.

    Definition Classes
    DeltaProgressReporter
