trait DataSkippingReaderBase extends DeltaScanGenerator with StatisticsCollection with ReadsMetadataFields with StateCache with DeltaLogging
Adds to an org.apache.spark.sql.delta.Snapshot of a given Delta table the ability to use statistics to filter the set of files based on predicates.
Linear Supertypes
- DataSkippingReaderBase
- StateCache
- ReadsMetadataFields
- StatisticsCollection
- DeltaLogging
- DatabricksLogging
- DeltaProgressReporter
- LoggingShims
- Logging
- DeltaScanGenerator
- AnyRef
- Any
Type Members
-
implicit
class
LogStringContext extends AnyRef
- Definition Classes
- LoggingShims
-
class
DataFiltersBuilder extends AnyRef
Builds the data filters for data skipping.
-
class
CachedDS[A] extends AnyRef
- Definition Classes
- StateCache
Abstract Value Members
- abstract def allFiles: Dataset[AddFile]
-
abstract
def
columnMappingMode: DeltaColumnMappingMode
The column mapping mode of the target delta table.
- Definition Classes
- StatisticsCollection
- abstract def deltaLog: DeltaLog
- abstract def metadata: Metadata
-
abstract
def
outputAttributeSchema: StructType
The schema of the output attributes of the write queries that need to collect statistics. The partition columns' definitions are not included in this schema.
- Definition Classes
- StatisticsCollection
-
abstract
def
outputTableStatsSchema: StructType
The output attributes (outputAttributeSchema) replaced with the table schema carrying the physical mapping information. NOTE: The partition columns' definitions are not included in this schema.
- Definition Classes
- StatisticsCollection
- abstract def path: Path
-
abstract
def
protocol: Protocol
- Attributes
- protected
- Definition Classes
- StatisticsCollection
- abstract def redactedPath: String
- abstract def schema: StructType
-
abstract
val
snapshotToScan: Snapshot
The snapshot that the scan is being generated on.
- Definition Classes
- DeltaScanGenerator
-
abstract
def
spark: SparkSession
- Attributes
- protected
- Definition Classes
- StateCache
-
abstract
val
statsColumnSpec: DeltaStatsColumnSpec
The statistic indexed column specification of the target delta table.
- Definition Classes
- StatisticsCollection
-
abstract
def
tableSchema: StructType
The schema of the target table of this statistics collection.
- Definition Classes
- StatisticsCollection
- abstract def version: Long
Concrete Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
applyFuncToStatisticsColumn(statisticsSchema: StructType, statisticsColumn: Column)(function: PartialFunction[(Column, StructField), Option[Column]]): Seq[Column]
Traverses the statisticsSchema for the provided statisticsColumn and applies function to leaves.
Note: for values outside the domain of the partial function, the original column is kept. If the caller wants to drop a column, the function must explicitly return None.
- Definition Classes
- StatisticsCollection
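As a hypothetical sketch (not from the source) of how an implementor mixing in StatisticsCollection might call applyFuncToStatisticsColumn, dropping the stats of one leaf column while leaving all others untouched; the helper name and target column are illustrative assumptions:

```scala
// Hypothetical sketch: leaves outside the partial function's domain keep
// their original column; returning None explicitly drops that leaf.
import org.apache.spark.sql.Column
import org.apache.spark.sql.types.StructType

def statsWithoutColumn(statsSchema: StructType, statsCol: Column, target: String): Seq[Column] =
  applyFuncToStatisticsColumn(statsSchema, statsCol) {
    case (_, field) if field.name == target => None // drop stats for `target`
  }
```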
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
def
cacheDS[A](ds: Dataset[A], name: String): CachedDS[A]
Create a CachedDS instance for the given Dataset and the name.
- Definition Classes
- StateCache
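A minimal hedged sketch of caching a Dataset from within an implementation of this trait; the cache name shown is an illustrative assumption:

```scala
// Sketch: wrap the AddFile dataset in a CachedDS so repeated reads reuse
// the cached state instead of recomputing it; uncache() releases it.
val cachedAddFiles: CachedDS[AddFile] =
  cacheDS(allFiles, name = s"Delta scan state v$version")
```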
-
def
clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
def
constructNotNullFilter(statsProvider: StatsProvider, pathToColumn: Seq[String]): Option[DataSkippingPredicate]
Constructs a DataSkippingPredicate for isNotNull predicates.
- Attributes
- protected
-
def
constructPartitionFilters(filters: Seq[Expression]): Column
Given the partition filters on the data, rewrite these filters by pointing to the metadata columns.
- Attributes
- protected
-
def
convertDataFrameToAddFiles(df: DataFrame): Array[AddFile]
- Attributes
- protected
-
def
datasetRefCache[A](creator: () ⇒ Dataset[A]): DatasetRefCache[A]
- Definition Classes
- StateCache
-
lazy val
deletionVectorsSupported: Boolean
- Definition Classes
- StatisticsCollection
-
def
deltaAssert(check: ⇒ Boolean, name: String, msg: String, deltaLog: DeltaLog = null, data: AnyRef = null, path: Option[Path] = None): Unit
Helper method to check invariants in Delta code. Fails when running in tests; otherwise records a delta assertion event and logs a warning.
- Attributes
- protected
- Definition Classes
- DeltaLogging
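A hedged usage sketch (the invariant and event name are illustrative): in tests the check fails fast, while in production it records a delta assertion event and logs a warning instead.

```scala
// Sketch: verify a snapshot-level invariant without crashing production jobs.
deltaAssert(
  check = version >= 0,
  name = "snapshot.nonNegativeVersion", // hypothetical event name
  msg = s"Snapshot version must be non-negative, got $version",
  deltaLog = deltaLog)
```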
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
filesForScan(limit: Long, partitionFilters: Seq[Expression]): DeltaScan
Gathers files that should be included in a scan based on the given predicates and limit. This will be called only when all predicates are on partitioning columns. Statistics about the amount of data that will be read are gathered and returned.
- Definition Classes
- DataSkippingReaderBase → DeltaScanGenerator
-
def
filesForScan(filters: Seq[Expression], keepNumRecords: Boolean): DeltaScan
Gathers files that should be included in a scan based on the given predicates. Statistics about the amount of data that will be read are gathered and returned. Note: the statistics column added when keepNumRecords = true should NOT take DVs into account. Consumers of this method might commit the file, and the semantics of the statistics need to be consistent across all files.
- Definition Classes
- DataSkippingReaderBase → DeltaScanGenerator
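A hedged sketch of invoking this overload from within the trait; the column name is an illustrative assumption, and real callers resolve expressions against the snapshot's schema:

```scala
// Sketch: data-skip on a simple predicate and report how many files survive.
import org.apache.spark.sql.functions.col

val scan: DeltaScan = filesForScan(
  filters = Seq(col("id") > 100).map(_.expr),
  keepNumRecords = false)
logInfo(s"Data skipping selected ${scan.files.size} files")
```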
-
def
filesWithStatsForScan(partitionFilters: Seq[Expression]): DataFrame
Returns a DataFrame for the given partition filters. The schema of the returned DataFrame is nearly the same as AddFile, except that the stats field is parsed into a struct from a JSON string.
- Definition Classes
- DataSkippingReaderBase → DeltaScanGenerator
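A hedged sketch of inspecting the parsed stats struct returned by this method; the partition column name is an assumption for illustration:

```scala
// Sketch: the returned DataFrame mirrors AddFile, but `stats` is a parsed
// struct, so nested fields such as numRecords can be selected directly.
import org.apache.spark.sql.functions.col

val parsed = filesWithStatsForScan(Seq((col("date") === "2024-01-01").expr))
parsed.select(col("path"), col("stats.numRecords")).show()
```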
-
def
filterOnPartitions(partitionFilters: Seq[Expression], keepNumRecords: Boolean): (Seq[AddFile], DataSize)
Get all the files in this table matching the given partition filter, together with the corresponding size of the scan.
- keepNumRecords
Also select stats.numRecords in the query. This may slow down the query because it has to parse JSON.
- Attributes
- protected
-
def
finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
-
def
getAllFiles(keepNumRecords: Boolean): Seq[AddFile]
Get all the files in this table.
- keepNumRecords
Also select stats.numRecords in the query. This may slow down the query because it has to parse JSON.
- Attributes
- protected
-
def
getBaseStatsColumn: Column
Returns a Column that references the stats field data skipping should use.
- Definition Classes
- ReadsMetadataFields
-
def
getBaseStatsColumnName: String
- Definition Classes
- ReadsMetadataFields
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
getCommonTags(deltaLog: DeltaLog, tahoeId: String): Map[TagDefinition, String]
- Definition Classes
- DeltaLogging
-
def
getDataSkippedFiles(partitionFilters: Column, dataFilters: DataSkippingPredicate, keepNumRecords: Boolean): (Seq[AddFile], Seq[DataSize])
Given the partition and data filters, leverage data skipping statistics to find the set of files that need to be queried. Returns a tuple of the selected files and, optionally, the sizes of the scan under no filters, under partition filters only, and under the combined partition and data filters, respectively.
- Attributes
- protected
-
def
getErrorData(e: Throwable): Map[String, Any]
- Definition Classes
- DeltaLogging
-
def
getFilesAndNumRecords(df: DataFrame): Iterator[(AddFile, NumRecords)] with Closeable
Get the files and number of records within each file, to perform limit pushdown.
-
def
getSpecificFilesWithStats(paths: Seq[String]): Seq[AddFile]
Get AddFile (with stats) actions corresponding to the given set of paths in the Snapshot. If a path doesn't exist in the snapshot, it is ignored and no AddFile is returned for it.
- paths
Sequence of paths for which we want to get AddFile actions
- returns
a sequence of AddFiles for the given paths
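A hedged sketch of looking up files by path; the file name below is a hypothetical placeholder, not a real path from the source:

```scala
// Sketch: fetch AddFile actions (with stats) for known paths; paths missing
// from the snapshot are silently skipped.
val found: Seq[AddFile] =
  getSpecificFilesWithStats(Seq("part-00000-example.snappy.parquet"))
```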
-
final
def
getStatsColumnOpt(stat: StatsColumn): Option[Column]
Convenience overload for working with StatsColumn helpers.
- Attributes
- protected
-
final
def
getStatsColumnOpt(statType: String, pathToColumn: Seq[String] = Nil): Option[Column]
Convenience overload for single element stat type paths.
- Attributes
- protected
-
final
def
getStatsColumnOpt(pathToStatType: Seq[String], pathToColumn: Seq[String]): Option[Column]
Returns an expression to access the given statistics for a specific column, or None if that stats column does not exist.
- pathToStatType
Path components of one of the fields declared by the DeltaStatistics object. For statistics of collated strings, this path contains the versioned collation identifier. In all other cases the path has only one element. The path is in reverse order.
- pathToColumn
The components of the nested column name to get stats for. The components are in reverse order.
- Attributes
- protected
-
final
def
getStatsColumnOrNullLiteral(stat: StatsColumn): Column
Convenience overload for working with StatsColumn helpers.
- Attributes
- protected[delta]
-
final
def
getStatsColumnOrNullLiteral(statType: String, pathToColumn: Seq[String] = Nil): Column
Returns an expression to access the given statistics for a specific column, or a NULL literal expression if that column does not exist.
- Attributes
- protected[delta]
-
def
hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
def
initializeLogIfNecessary(isInterpreter: Boolean): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
def
isTraceEnabled(): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
def
log: Logger
- Attributes
- protected
- Definition Classes
- Logging
-
def
logConsole(line: String): Unit
- Definition Classes
- DatabricksLogging
-
def
logDebug(entry: LogEntry, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logDebug(entry: LogEntry): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logDebug(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logDebug(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(entry: LogEntry, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logError(entry: LogEntry): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logError(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(entry: LogEntry, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logInfo(entry: LogEntry): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logInfo(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logName: String
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(entry: LogEntry, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logTrace(entry: LogEntry): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logTrace(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(entry: LogEntry, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logWarning(entry: LogEntry): Unit
- Attributes
- protected
- Definition Classes
- LoggingShims
-
def
logWarning(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
def
pruneFilesByLimit(df: DataFrame, limit: Long): ScanAfterLimit
- Attributes
- protected[delta]
-
def
recordDeltaEvent(deltaLog: DeltaLog, opType: String, tags: Map[TagDefinition, String] = Map.empty, data: AnyRef = null, path: Option[Path] = None): Unit
Used to record the occurrence of a single event or to report detailed, operation-specific statistics.
- path
Used to log the path of the delta table when deltaLog is null.
- Attributes
- protected
- Definition Classes
- DeltaLogging
-
def
recordDeltaOperation[A](deltaLog: DeltaLog, opType: String, tags: Map[TagDefinition, String] = Map.empty)(thunk: ⇒ A): A
Used to report the duration as well as the success or failure of an operation on a deltaLog.
- Attributes
- protected
- Definition Classes
- DeltaLogging
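A hedged sketch of wrapping an expensive call so its duration and success or failure are reported against the deltaLog; the opType string is a hypothetical example:

```scala
// Sketch: time and report an operation; the thunk's result is passed through.
val scan = recordDeltaOperation(deltaLog, opType = "delta.skipping.exampleScan") {
  filesForScan(Seq.empty, keepNumRecords = false)
}
```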
-
def
recordDeltaOperationForTablePath[A](tablePath: String, opType: String, tags: Map[TagDefinition, String] = Map.empty)(thunk: ⇒ A): A
Used to report the duration as well as the success or failure of an operation on a tahoePath.
- Attributes
- protected
- Definition Classes
- DeltaLogging
-
def
recordEvent(metric: MetricDefinition, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, trimBlob: Boolean = true): Unit
- Definition Classes
- DatabricksLogging
-
def
recordFrameProfile[T](group: String, name: String)(thunk: ⇒ T): T
- Attributes
- protected
- Definition Classes
- DeltaLogging
-
def
recordOperation[S](opType: OpType, opTarget: String = null, extraTags: Map[TagDefinition, String], isSynchronous: Boolean = true, alwaysRecordStats: Boolean = false, allowAuthTags: Boolean = false, killJvmIfStuck: Boolean = false, outputMetric: MetricDefinition = METRIC_OPERATION_DURATION, silent: Boolean = true)(thunk: ⇒ S): S
- Definition Classes
- DatabricksLogging
-
def
recordProductEvent(metric: MetricDefinition with CentralizableMetric, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, trimBlob: Boolean = true): Unit
- Definition Classes
- DatabricksLogging
-
def
recordProductUsage(metric: MetricDefinition with CentralizableMetric, quantity: Double, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, forceSample: Boolean = false, trimBlob: Boolean = true, silent: Boolean = false): Unit
- Definition Classes
- DatabricksLogging
-
def
recordUsage(metric: MetricDefinition, quantity: Double, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, forceSample: Boolean = false, trimBlob: Boolean = true, silent: Boolean = false): Unit
- Definition Classes
- DatabricksLogging
-
lazy val
statCollectionLogicalSchema: StructType
statCollectionLogicalSchema is the logical schema composed of all the columns for which stats are collected under the current table configuration.
- Definition Classes
- StatisticsCollection
-
lazy val
statCollectionPhysicalSchema: StructType
statCollectionPhysicalSchema is the physical schema composed of all the columns for which stats are collected under the current table configuration.
- Definition Classes
- StatisticsCollection
-
lazy val
statsCollector: Column
Returns a struct column that can be used to collect statistics for the current schema of the table. The types we keep stats on must be consistent with DataSkippingReader.SkippingEligibleLiteral. If a column is missing from dataSchema (which will be filled with nulls), we will only collect the NULL_COUNT stats for it as the number of rows.
- Definition Classes
- StatisticsCollection
-
lazy val
statsSchema: StructType
Returns schema of the statistics collected.
- Definition Classes
- StatisticsCollection
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
def
toString(): String
- Definition Classes
- AnyRef → Any
-
def
uncache(): Unit
Drop any cached data for this Snapshot.
- Definition Classes
- StateCache
-
def
updateStatsToWideBounds(withStats: DataFrame, statsColName: String): DataFrame
Sets the TIGHT_BOUNDS column to false and converts the logical nullCount to a tri-state nullCount. The nullCount states are the following: 1) For "all-nulls" columns we set the physical nullCount, which is equal to the physical numRecords. 2) "No-nulls" columns remain unchanged, i.e. a zero nullCount is the same for both physical and logical representations. 3) For "some-nulls" columns, we leave the existing value. In files with wide bounds, the nullCount in SOME_NULLs columns is considered unknown.
The file's state can transition back to tight when statistics are recomputed. In that case, TIGHT_BOUNDS is set back to true and nullCount back to the logical value.
Note: this function takes parsed statistics as input and returns a JSON document, similarly to allFiles. To further match the behavior of allFiles we always return a column named stats instead of statsColName.
- withStats
A DataFrame of actions with parsed statistics.
- statsColName
The name of the parsed statistics column.
- Definition Classes
- StatisticsCollection
-
def
verifyStatsForFilter(referencedStats: Set[StatsColumn]): Column
Returns an expression that can be used to check that the required statistics are present for a given file. If any required statistics are missing we must include the corresponding file.
NOTE: We intentionally choose to disable skipping for any file if any required stat is missing, because doing it that way allows us to check each stat only once (rather than once per use). Checking per-use would anyway only help for tables where the number of indexed columns has changed over time, producing add.stats_parsed records with differing schemas. That should be a rare enough case to not worry about optimizing for, given that the fix requires more complex skipping predicates that would penalize the common case.
- Attributes
- protected
-
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
def
withNoStats: DataFrame
All files with the statistics column dropped completely.
-
final
def
withStats: DataFrame
Returns a parsed and cached representation of files with statistics.
- def withStatsDeduplicated: DataFrame
-
def
withStatsInternal: DataFrame
- Attributes
- protected
-
def
withStatusCode[T](statusCode: String, defaultMessage: String, data: Map[String, Any] = Map.empty)(body: ⇒ T): T
Report a log to indicate some command is running.
- Definition Classes
- DeltaProgressReporter