package sources
Type Members
-
case class
CompositeLimit(bytes: ReadMaxBytes, maxFiles: ReadMaxFiles, minFiles: ReadMinFiles = ReadMinFiles(-1)) extends ReadLimit with Product with Serializable
A read limit that admits the given soft-max of bytes or max of maxFiles, once minFiles has been reached. Prior to that, anything is admitted.
-
class
DeltaDataSource extends RelationProvider with StreamSourceProvider with StreamSinkProvider with CreatableRelationProviderShim with DataSourceRegister with TableProvider with DeltaLogging
A DataSource V1 for integrating Delta into Spark SQL batch and Streaming APIs.
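Since DeltaDataSource registers the "delta" short name through DataSourceRegister, batch usage goes through the ordinary DataFrame reader/writer. A minimal sketch, assuming delta-spark is on the classpath; the path and app name are placeholders:

```scala
// Illustrative only: the path below is a placeholder, and the delta-spark
// artifact is assumed to be on the classpath.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delta-batch-example")
  // Session extensions and catalog required by delta-spark.
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
          "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

// Batch write: format("delta") resolves to DeltaDataSource.
spark.range(0, 10).toDF("id")
  .write.format("delta").mode("overwrite").save("/tmp/delta/events")

// Batch read through the same source.
val df = spark.read.format("delta").load("/tmp/delta/events")
df.show()
```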
-
trait
DeltaSQLConfBase extends AnyRef
SQLConf entries for Delta features.
-
case class
DeltaSink(sqlContext: SQLContext, path: Path, partitionColumns: Seq[String], outputMode: OutputMode, options: DeltaOptions, catalogTable: Option[CatalogTable] = None) extends Sink with ImplicitMetadataOperation with UpdateExpressionsSupport with DeltaLogging with Product with Serializable
A streaming sink that writes data into a Delta Table.
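Streaming writes reach DeltaSink through the standard writeStream API. A hedged sketch, assuming a SparkSession named spark with Delta configured; paths are placeholders:

```scala
// Illustrative sketch: paths are placeholders; assumes an existing
// SparkSession in scope as `spark`, configured for Delta.
import org.apache.spark.sql.streaming.Trigger

val stream = spark.readStream
  .format("rate").load()                     // toy source: (timestamp, value)
  .writeStream
  .format("delta")                           // resolved to DeltaSink
  .outputMode("append")
  .option("checkpointLocation", "/tmp/delta/_checkpoints/sink")
  .trigger(Trigger.AvailableNow())
  .start("/tmp/delta/sink-table")

stream.awaitTermination()
```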
-
case class
DeltaSource(spark: SparkSession, deltaLog: DeltaLog, options: DeltaOptions, snapshotAtSourceInit: SnapshotDescriptor, metadataPath: String, metadataTrackingLog: Option[DeltaSourceMetadataTrackingLog] = None, filters: Seq[Expression] = Nil) extends DeltaSourceBase with DeltaSourceCDCSupport with DeltaSourceMetadataEvolutionSupport with Product with Serializable
A streaming source for a Delta table.
When a new stream is started, Delta starts by constructing an org.apache.spark.sql.delta.Snapshot at the current version of the table. This snapshot is broken up into batches until all existing data has been processed. Subsequent processing is done by tailing the change log, looking for new data. This results in the streaming query returning the same answer as a batch query that had processed the entire dataset at any given point.
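The snapshot-then-tail behavior above is what you get from an ordinary streaming read of a Delta path. A minimal sketch, assuming a SparkSession named spark and placeholder paths; startingVersion is the documented option for skipping the initial snapshot and tailing from a given table version:

```scala
// Illustrative sketch (placeholder paths; assumes an existing SparkSession
// in scope as `spark`, configured for Delta). format("delta") on readStream
// is backed by DeltaSource.
val changes = spark.readStream
  .format("delta")
  // Optional: tail the change log from a specific version instead of
  // processing the initial snapshot.
  .option("startingVersion", "0")
  .load("/tmp/delta/events")

changes.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/delta/_checkpoints/source")
  .start()
```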
-
trait
DeltaSourceBase extends Source with SupportsAdmissionControl with SupportsTriggerAvailableNow with DeltaLogging
Base trait for the Delta source that contains methods for getting changes from the delta log.
-
trait
DeltaSourceCDCSupport extends AnyRef
Helper functions for CDC-specific handling for DeltaSource.
-
trait
DeltaSourceMetadataEvolutionSupport extends DeltaSourceBase
Helper functions for metadata evolution related handling for DeltaSource. A metadata change is one of:
1. Schema change
2. Delta table configuration change
3. Delta protocol change
The documentation below uses schema change as the example throughout.
To achieve schema evolution, we intercept at different stages of the normal streaming process to:
1. Capture all schema changes inside a stream.
2. Stop latestOffset from crossing the schema change boundary.
3. Ensure the batch prior to the schema change can still be served correctly.
4. Ensure the stream fails if and only if the prior batch is served successfully.
5. Write the new schema to the schema tracking log prior to stream failure, so that the next time the stream restarts it will use the updated schema.
Specifically:
1. During latestOffset calls, if we detect a schema change at version V, we generate a special barrier DeltaSourceOffset X that has ver=V and index=INDEX_METADATA_CHANGE. (We first generate an IndexedFile at this index, and that gets converted into an equivalent DeltaSourceOffset.) INDEX_METADATA_CHANGE comes after INDEX_VERSION_BASE (the first offset index that exists for any reservoir version) and before the offsets that represent data changes. This ensures that we apply the schema change before processing the data that uses that schema.
2. When we see a schema change offset X, it is treated as a barrier that ends the current batch. The remaining data is effectively unavailable until all the source data before the schema change has been committed.
3. When commit is invoked on the schema change barrier offset X, we can then officially write the new schema into the schema tracking log and fail the stream. commit is only called after the batch ending at X is completed, so it is safe to fail there.
4. Between when offset X is generated and when it is committed, there could be an arbitrary number of calls to latestOffset attempting to fetch a new latest offset. These calls must not generate new offsets until the schema change barrier offset has been committed, the new schema has been written to the schema tracking log, and the stream has been aborted and restarted. A nuance here: the streaming engine won't commit until it sees a new offset that is semantically different, which is why we first generate an offset X with index INDEX_METADATA_CHANGE, followed immediately by a second barrier offset X' with index INDEX_POST_SCHEMA_CHANGE.
In this way, we ensure:
a) The offset with index INDEX_METADATA_CHANGE is always committed (typically).
b) Even if the streaming engine changed its behavior and ONLY the offset with index INDEX_POST_SCHEMA_CHANGE is committed, we can still see this is a schema change barrier with a schema change ready to be evolved.
c) Whenever latestOffset sees a startOffset with a schema change barrier index, we can easily tell that we should not progress past the schema change, unless the schema change has actually happened.
When a stream is restarted after a schema evolution (not initialization), it is guaranteed to have >= 2 entries in the schema log. To prevent users from shooting themselves in the foot by blindly restarting a stream without considering the implications for downstream tables, by default we do not allow the stream to restart without a magic SQL conf that the user has to set to allow non-additive schema changes to propagate. We detect such non-additive schema changes during stream start by comparing the last schema log entry with the current one.
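The ordering invariant in step 1 above — the metadata-change barrier sorts after a version's base index but before its data indices — can be sketched with a toy (version, index) offset model. The constant values and the MiniOffset class below are hypothetical stand-ins, not Delta's actual definitions; only the ordering relationship mirrors the documented design:

```scala
// Toy model of the barrier-offset ordering described above. Constants and
// the MiniOffset class are hypothetical; only the ordering
// (base < metadata-change barrier < post-change barrier < data) mirrors
// the documented design.
object MiniOffsets {
  val INDEX_VERSION_BASE       = -100L // first index at any reservoir version
  val INDEX_METADATA_CHANGE    = -99L  // schema-change barrier
  val INDEX_POST_SCHEMA_CHANGE = -98L  // second barrier; gives the engine a
                                       // semantically distinct offset to commit

  final case class MiniOffset(version: Long, index: Long)

  implicit val ordering: Ordering[MiniOffset] =
    Ordering.by(o => (o.version, o.index))
}

import MiniOffsets._

// At version 5, the barriers sort after the base index but before any data
// offset (data indices start at 0), so the schema change is applied before
// the data that uses the new schema.
val offsets = List(
  MiniOffset(5, 0),                        // first data file at version 5
  MiniOffset(5, INDEX_POST_SCHEMA_CHANGE),
  MiniOffset(5, INDEX_VERSION_BASE),
  MiniOffset(5, INDEX_METADATA_CHANGE)
).sorted

println(offsets.map(_.index))
// List(-100, -99, -98, 0)
```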
-
class
DeltaSourceMetadataTrackingLog extends AnyRef
Tracks the metadata changes for a particular Delta streaming source in a particular stream. It is used to save and look up the correct metadata while streaming from a Delta table. This schema log is NOT meant to be shared across different Delta streaming source instances.
-
case class
DeltaSourceOffset extends Offset with Comparable[DeltaSourceOffset] with Product with Serializable
Tracks how far we have processed when reading changes from the DeltaLog.
Note this class retains the naming of Reservoir to maintain compatibility with serialized offsets from the beta period.
- Annotations
- @JsonDeserialize() @JsonSerialize()
-
case class
PersistedMetadata(tableId: String, deltaCommitVersion: Long, dataSchemaJson: String, partitionSchemaJson: String, sourceMetadataPath: String, tableConfigurations: Option[Map[String, String]] = None, protocolJson: Option[String] = None, previousMetadataSeqNum: Option[Long] = None) extends PartitionAndDataSchema with Product with Serializable
A PersistedMetadata is an entry in the Delta streaming source schema log, which can be used to read data files during streaming.
- tableId
Delta table id
- deltaCommitVersion
Delta commit version in which this change is captured. It does not necessarily have to be the commit when there's an actual change, e.g. during initialization. The invariant is that the metadata must be read-compatible with the table snapshot at this version.
- dataSchemaJson
Full schema json
- partitionSchemaJson
Partition schema json
- sourceMetadataPath
The checkpoint path that is unique to each source.
- tableConfigurations
The configurations of the table inside the metadata when the schema change was detected. It is used to create the correct file format when we use a particular schema to read. Defaults to None for backward compatibility.
- protocolJson
JSON of the protocol change, if any. Defaults to None for backward compatibility.
- previousMetadataSeqNum
When defined, it points to the batch ID / seq num of the previous metadata in the log sequence. It is used when we cannot reliably tell whether currentBatchId - 1 is indeed the previous schema evolution, e.g. when we merge consecutive schema changes during the analysis phase and append an extra schema after the merge to the log. Defaults to None for backward compatibility.
-
case class
ReadMaxBytes(maxBytes: Long) extends ReadLimit with Product with Serializable
A read limit that admits a soft-max of maxBytes per micro-batch.
-
case class
ReadMinFiles(minFiles: Int) extends ReadLimit with Product with Serializable
A read limit that admits a min of minFiles per micro-batch.
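These ReadLimit types surface as reader options on a Delta streaming source. A hedged sketch, assuming a SparkSession named spark and a placeholder path; maxFilesPerTrigger and maxBytesPerTrigger are the documented Delta rate-limit options, and setting both yields a composite limit:

```scala
// Illustrative sketch (placeholder path; assumes an existing SparkSession
// in scope as `spark`, configured for Delta). These options map onto the
// ReadLimit types above: maxBytesPerTrigger -> ReadMaxBytes, and combining
// it with maxFilesPerTrigger yields a composite limit.
val limited = spark.readStream
  .format("delta")
  .option("maxFilesPerTrigger", "100")   // soft cap on files per micro-batch
  .option("maxBytesPerTrigger", "1g")    // soft cap on bytes per micro-batch
  .load("/tmp/delta/events")
```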
Value Members
- object DeltaDataSource extends DatabricksLogging
- object DeltaSQLConf extends DeltaSQLConfBase
- object DeltaSource extends Serializable
- object DeltaSourceMetadataEvolutionSupport
- object DeltaSourceMetadataTrackingLog extends Logging
- object DeltaSourceOffset extends Logging with Serializable
- object DeltaSourceUtils
- object DeltaStreamUtils
- object NonAdditiveSchemaChangeTypes
- object PersistedMetadata extends Serializable