
org.apache.spark.sql.delta.stats

DataSkippingReaderBase

trait DataSkippingReaderBase extends DeltaScanGenerator with StatisticsCollection with ReadsMetadataFields with StateCache with DeltaLogging

Adds the ability to use statistics to filter the set of files of an org.apache.spark.sql.delta.Snapshot of a given Delta table, based on predicates.

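The members below are typically exercised through a Snapshot, which mixes in this trait. A minimal, hypothetical sketch (the table path and column names are illustrative; assumes the Delta Lake DeltaLog entry point):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.delta.DeltaLog

// Illustrative sketch: Snapshot mixes in DataSkippingReaderBase, so a
// fresh snapshot can prune files using min/max statistics before reading.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val deltaLog = DeltaLog.forTable(spark, "/tmp/delta/events") // path is illustrative
val snapshot = deltaLog.update()

// Only files whose statistics could satisfy the predicate are returned.
val scan = snapshot.filesForScan(
  filters = Seq(col("date") === "2024-01-01").map(_.expr),
  keepNumRecords = false)
println(s"Files selected by data skipping: ${scan.files.size}")
```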
Inherited
  1. DataSkippingReaderBase
  2. StateCache
  3. ReadsMetadataFields
  4. StatisticsCollection
  5. DeltaLogging
  6. DatabricksLogging
  7. DeltaProgressReporter
  8. LoggingShims
  9. Logging
  10. DeltaScanGenerator
  11. AnyRef
  12. Any

Type Members

  1. implicit class LogStringContext extends AnyRef
    Definition Classes
    LoggingShims
  2. class DataFiltersBuilder extends AnyRef

    Builds the data filters for data skipping.

  3. class CachedDS[A] extends AnyRef
    Definition Classes
    StateCache

Abstract Value Members

  1. abstract def allFiles: Dataset[AddFile]
  2. abstract def columnMappingMode: DeltaColumnMappingMode

    The column mapping mode of the target delta table.

    Definition Classes
    StatisticsCollection
  3. abstract def deltaLog: DeltaLog
  4. abstract def metadata: Metadata
  5. abstract def outputAttributeSchema: StructType

    The schema of the output attributes of the write queries that need to collect statistics. The partition columns' definitions are not included in this schema.

    Definition Classes
    StatisticsCollection
  6. abstract def outputTableStatsSchema: StructType

    The output attributes (outputAttributeSchema) mapped onto the table schema with the physical mapping information applied. NOTE: The partition columns' definitions are not included in this schema.

    Definition Classes
    StatisticsCollection
  7. abstract def path: Path
  8. abstract def protocol: Protocol
    Attributes
    protected
    Definition Classes
    StatisticsCollection
  9. abstract def redactedPath: String
  10. abstract def schema: StructType
  11. abstract val snapshotToScan: Snapshot

    The snapshot that the scan is being generated on.

    Definition Classes
    DeltaScanGenerator
  12. abstract def spark: SparkSession
    Attributes
    protected
    Definition Classes
    StateCache
  13. abstract val statsColumnSpec: DeltaStatsColumnSpec

    The statistics-indexed column specification of the target delta table.

    Definition Classes
    StatisticsCollection
  14. abstract def tableSchema: StructType

    The schema of the target table of this statistics collection.

    Definition Classes
    StatisticsCollection
  15. abstract def version: Long

Concrete Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. def applyFuncToStatisticsColumn(statisticsSchema: StructType, statisticsColumn: Column)(function: PartialFunction[(Column, StructField), Option[Column]]): Seq[Column]

    Traverses the statisticsSchema for the provided statisticsColumn and applies function to the leaves.

    Note: for values outside the domain of the partial function, the original column is kept. If the caller wants to drop a column, the function must explicitly return None.

    Definition Classes
    StatisticsCollection
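The None semantics above can be made concrete with a small, hypothetical sketch (to be read inside an implementation of StatisticsCollection; the choice of statsSchema and getBaseStatsColumn as arguments is an assumption for illustration):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.types.{StringType, StructField}

// Illustrative sketch: drop every string-typed leaf statistic and keep
// the rest untouched. Leaves outside the partial function's domain keep
// their original column; returning None drops the column explicitly.
val pruned: Seq[Column] =
  applyFuncToStatisticsColumn(statsSchema, getBaseStatsColumn) {
    case (_, field: StructField) if field.dataType == StringType => None
  }
```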
  5. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  6. def cacheDS[A](ds: Dataset[A], name: String): CachedDS[A]

    Creates a CachedDS instance for the given Dataset and name.

    Definition Classes
    StateCache
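A hypothetical sketch of how a Snapshot-like class mixing in StateCache might use this member (the use of allFiles as the cached Dataset is an assumption for illustration):

```scala
// Illustrative sketch: cache an expensively-computed Dataset once per
// snapshot version and drop it together with the snapshot.
val cachedFiles: CachedDS[AddFile] =
  cacheDS(allFiles, name = s"all files for version $version")

// ... reuse cachedFiles across queries instead of recomputing allFiles ...

uncache() // drops any cached data for this Snapshot
```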
  7. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  8. def constructNotNullFilter(statsProvider: StatsProvider, pathToColumn: Seq[String]): Option[DataSkippingPredicate]

    Constructs a DataSkippingPredicate for isNotNull predicates.

    Attributes
    protected
  9. def constructPartitionFilters(filters: Seq[Expression]): Column

    Given the partition filters on the data, rewrite these filters by pointing to the metadata columns.

    Attributes
    protected
  10. def convertDataFrameToAddFiles(df: DataFrame): Array[AddFile]
    Attributes
    protected
  11. def datasetRefCache[A](creator: () ⇒ Dataset[A]): DatasetRefCache[A]
    Definition Classes
    StateCache
  12. lazy val deletionVectorsSupported: Boolean
    Definition Classes
    StatisticsCollection
  13. def deltaAssert(check: ⇒ Boolean, name: String, msg: String, deltaLog: DeltaLog = null, data: AnyRef = null, path: Option[Path] = None): Unit

    Helper method to check invariants in Delta code. Fails when running in tests; otherwise records a delta assertion event and logs a warning.

    Attributes
    protected
    Definition Classes
    DeltaLogging
  14. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  15. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  16. def filesForScan(limit: Long, partitionFilters: Seq[Expression]): DeltaScan

    Gathers files that should be included in a scan based on the given predicates and limit. This will be called only when all predicates are on partitioning columns. Statistics about the amount of data that will be read are gathered and returned.

    Definition Classes
    DataSkippingReaderBase → DeltaScanGenerator
  17. def filesForScan(filters: Seq[Expression], keepNumRecords: Boolean): DeltaScan

    Gathers files that should be included in a scan based on the given predicates. Statistics about the amount of data that will be read are gathered and returned. Note, the statistics column that is added when keepNumRecords = true should NOT take deletion vectors (DVs) into account. Consumers of this method might commit the file. The semantics of the statistics need to be consistent across all files.

    Definition Classes
    DataSkippingReaderBase → DeltaScanGenerator
  18. def filesWithStatsForScan(partitionFilters: Seq[Expression]): DataFrame

    Returns a DataFrame for the given partition filters. The schema of the returned DataFrame is nearly the same as AddFile, except that the stats field is parsed into a struct from a JSON string.

    Definition Classes
    DataSkippingReaderBase → DeltaScanGenerator
  19. def filterOnPartitions(partitionFilters: Seq[Expression], keepNumRecords: Boolean): (Seq[AddFile], DataSize)

    Get all the files in this table matching the given partition filter, together with the corresponding size of the scan.

    keepNumRecords

    Also select stats.numRecords in the query. This may slow down the query as it has to parse json.

    Attributes
    protected
  20. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  21. def getAllFiles(keepNumRecords: Boolean): Seq[AddFile]

    Get all the files in this table.

    keepNumRecords

    Also select stats.numRecords in the query. This may slow down the query as it has to parse json.

    Attributes
    protected
  22. def getBaseStatsColumn: Column

    Returns a Column that references the stats field that data skipping should use.

    Definition Classes
    ReadsMetadataFields
  23. def getBaseStatsColumnName: String
    Definition Classes
    ReadsMetadataFields
  24. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  25. def getCommonTags(deltaLog: DeltaLog, tahoeId: String): Map[TagDefinition, String]
    Definition Classes
    DeltaLogging
  26. def getDataSkippedFiles(partitionFilters: Column, dataFilters: DataSkippingPredicate, keepNumRecords: Boolean): (Seq[AddFile], Seq[DataSize])

    Given the partition and data filters, leverage data skipping statistics to find the set of files that need to be queried. Returns a tuple of the files and the sizes of the scan with no filters, with only partition filters, and with the combined effect of partition and data filters, respectively.

    Attributes
    protected
  27. def getErrorData(e: Throwable): Map[String, Any]
    Definition Classes
    DeltaLogging
  28. def getFilesAndNumRecords(df: DataFrame): Iterator[(AddFile, NumRecords)] with Closeable

    Get the files and number of records within each file, to perform limit pushdown.

  29. def getSpecificFilesWithStats(paths: Seq[String]): Seq[AddFile]

    Get AddFile (with stats) actions corresponding to the given set of paths in the Snapshot. If a path doesn't exist in the snapshot, it is ignored and no AddFile is returned for it.

    paths

    Sequence of paths for which we want to get AddFile actions

    returns

    a sequence of AddFiles for the given paths

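For example, a caller that already knows which file paths it cares about might look them up directly (a hypothetical sketch; snapshot and the path string are illustrative):

```scala
// Illustrative sketch: fetch stats-bearing AddFile actions for known
// paths, relative to the table root. Unknown paths are silently ignored.
val addFiles: Seq[AddFile] = snapshot.getSpecificFilesWithStats(
  Seq("date=2024-01-01/part-00000.snappy.parquet"))
addFiles.foreach(f => println(s"${f.path}: ${f.stats}"))
```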
  30. final def getStatsColumnOpt(stat: StatsColumn): Option[Column]

    Overload for convenience when working with StatsColumn helpers.

    Attributes
    protected
  31. final def getStatsColumnOpt(statType: String, pathToColumn: Seq[String] = Nil): Option[Column]

    Convenience overload for single-element stat type paths.

    Attributes
    protected
  32. final def getStatsColumnOpt(pathToStatType: Seq[String], pathToColumn: Seq[String]): Option[Column]

    Returns an expression to access the given statistics for a specific column, or None if that stats column does not exist.

    pathToStatType

    Path components of one of the fields declared by the DeltaStatistics object. For statistics of collated strings, this path contains the versioned collation identifier. In all other cases the path only has one element. The path is in reverse order.

    pathToColumn

    The components of the nested column name to get stats for. The components are in reverse order.

    Attributes
    protected
  33. final def getStatsColumnOrNullLiteral(stat: StatsColumn): Column

    Overload for convenience when working with StatsColumn helpers.

    Attributes
    protected[delta]
  34. final def getStatsColumnOrNullLiteral(statType: String, pathToColumn: Seq[String] = Nil): Column

    Returns an expression to access the given statistics for a specific column, or a NULL literal expression if that column does not exist.

    Attributes
    protected[delta]
  35. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  36. def initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
    Attributes
    protected
    Definition Classes
    Logging
  37. def initializeLogIfNecessary(isInterpreter: Boolean): Unit
    Attributes
    protected
    Definition Classes
    Logging
  38. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  39. def isTraceEnabled(): Boolean
    Attributes
    protected
    Definition Classes
    Logging
  40. def log: Logger
    Attributes
    protected
    Definition Classes
    Logging
  41. def logConsole(line: String): Unit
    Definition Classes
    DatabricksLogging
  42. def logDebug(entry: LogEntry, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    LoggingShims
  43. def logDebug(entry: LogEntry): Unit
    Attributes
    protected
    Definition Classes
    LoggingShims
  44. def logDebug(msg: ⇒ String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  45. def logDebug(msg: ⇒ String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  46. def logError(entry: LogEntry, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    LoggingShims
  47. def logError(entry: LogEntry): Unit
    Attributes
    protected
    Definition Classes
    LoggingShims
  48. def logError(msg: ⇒ String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  49. def logError(msg: ⇒ String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  50. def logInfo(entry: LogEntry, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    LoggingShims
  51. def logInfo(entry: LogEntry): Unit
    Attributes
    protected
    Definition Classes
    LoggingShims
  52. def logInfo(msg: ⇒ String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  53. def logInfo(msg: ⇒ String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  54. def logName: String
    Attributes
    protected
    Definition Classes
    Logging
  55. def logTrace(entry: LogEntry, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    LoggingShims
  56. def logTrace(entry: LogEntry): Unit
    Attributes
    protected
    Definition Classes
    LoggingShims
  57. def logTrace(msg: ⇒ String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  58. def logTrace(msg: ⇒ String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  59. def logWarning(entry: LogEntry, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    LoggingShims
  60. def logWarning(entry: LogEntry): Unit
    Attributes
    protected
    Definition Classes
    LoggingShims
  61. def logWarning(msg: ⇒ String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  62. def logWarning(msg: ⇒ String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  63. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  64. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  65. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  66. def pruneFilesByLimit(df: DataFrame, limit: Long): ScanAfterLimit
    Attributes
    protected[delta]
  67. def recordDeltaEvent(deltaLog: DeltaLog, opType: String, tags: Map[TagDefinition, String] = Map.empty, data: AnyRef = null, path: Option[Path] = None): Unit

    Used to record the occurrence of a single event or to report detailed, operation-specific statistics.

    path

    Used to log the path of the delta table when deltaLog is null.

    Attributes
    protected
    Definition Classes
    DeltaLogging
  68. def recordDeltaOperation[A](deltaLog: DeltaLog, opType: String, tags: Map[TagDefinition, String] = Map.empty)(thunk: ⇒ A): A

    Used to report the duration as well as the success or failure of an operation on a deltaLog.

    Attributes
    protected
    Definition Classes
    DeltaLogging
  69. def recordDeltaOperationForTablePath[A](tablePath: String, opType: String, tags: Map[TagDefinition, String] = Map.empty)(thunk: ⇒ A): A

    Used to report the duration as well as the success or failure of an operation on a tahoePath.

    Attributes
    protected
    Definition Classes
    DeltaLogging
  70. def recordEvent(metric: MetricDefinition, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, trimBlob: Boolean = true): Unit
    Definition Classes
    DatabricksLogging
  71. def recordFrameProfile[T](group: String, name: String)(thunk: ⇒ T): T
    Attributes
    protected
    Definition Classes
    DeltaLogging
  72. def recordOperation[S](opType: OpType, opTarget: String = null, extraTags: Map[TagDefinition, String], isSynchronous: Boolean = true, alwaysRecordStats: Boolean = false, allowAuthTags: Boolean = false, killJvmIfStuck: Boolean = false, outputMetric: MetricDefinition = METRIC_OPERATION_DURATION, silent: Boolean = true)(thunk: ⇒ S): S
    Definition Classes
    DatabricksLogging
  73. def recordProductEvent(metric: MetricDefinition with CentralizableMetric, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, trimBlob: Boolean = true): Unit
    Definition Classes
    DatabricksLogging
  74. def recordProductUsage(metric: MetricDefinition with CentralizableMetric, quantity: Double, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, forceSample: Boolean = false, trimBlob: Boolean = true, silent: Boolean = false): Unit
    Definition Classes
    DatabricksLogging
  75. def recordUsage(metric: MetricDefinition, quantity: Double, additionalTags: Map[TagDefinition, String] = Map.empty, blob: String = null, forceSample: Boolean = false, trimBlob: Boolean = true, silent: Boolean = false): Unit
    Definition Classes
    DatabricksLogging
  76. lazy val statCollectionLogicalSchema: StructType

    statCollectionLogicalSchema is the logical schema composed of all the columns that have stats collected under the current table configuration.

    Definition Classes
    StatisticsCollection
  77. lazy val statCollectionPhysicalSchema: StructType

    statCollectionPhysicalSchema is the schema composed of all the columns that have stats collected under the current table configuration.

    Definition Classes
    StatisticsCollection
  78. lazy val statsCollector: Column

    Returns a struct column that can be used to collect statistics for the current schema of the table. The types we keep stats on must be consistent with DataSkippingReader.SkippingEligibleLiteral. If a column is missing from dataSchema (which will be filled with nulls), we will only collect the NULL_COUNT stats for it as the number of rows.

    Definition Classes
    StatisticsCollection
  79. lazy val statsSchema: StructType

    Returns the schema of the collected statistics.

    Definition Classes
    StatisticsCollection
  80. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  81. def toString(): String
    Definition Classes
    AnyRef → Any
  82. def uncache(): Unit

    Drop any cached data for this Snapshot.

    Definition Classes
    StateCache
  83. def updateStatsToWideBounds(withStats: DataFrame, statsColName: String): DataFrame

    Sets the TIGHT_BOUNDS column to false and converts the logical nullCount to a tri-state nullCount. The nullCount states are the following: 1) For "all-nulls" columns we set the physical nullCount, which is equal to the physical numRecords. 2) "no-nulls" columns remain unchanged, i.e. a zero nullCount is the same for both physical and logical representations. 3) For "some-nulls" columns, we leave the existing value. In files with wide bounds, the nullCount of "some-nulls" columns is considered unknown.

    The file's state can transition back to tight when statistics are recomputed. In that case, TIGHT_BOUNDS is set back to true and nullCount back to the logical value.

    Note, this function takes parsed statistics as input and returns a JSON document, similar to allFiles. To further match the behavior of allFiles, we always return a column named stats instead of statsColName.

    withStats

    A dataFrame of actions with parsed statistics.

    statsColName

    The name of the parsed statistics column.

    Definition Classes
    StatisticsCollection
  84. def verifyStatsForFilter(referencedStats: Set[StatsColumn]): Column

    Returns an expression that can be used to check that the required statistics are present for a given file. If any required statistics are missing we must include the corresponding file.

    NOTE: We intentionally choose to disable skipping for any file if any required stat is missing, because doing it that way allows us to check each stat only once (rather than once per use). Checking per-use would anyway only help for tables where the number of indexed columns has changed over time, producing add.stats_parsed records with differing schemas. That should be a rare enough case to not worry about optimizing for, given that the fix requires more complex skipping predicates that would penalize the common case.

    Attributes
    protected
  85. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  86. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  87. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  88. def withNoStats: DataFrame

    All files with the statistics column dropped completely.

  89. final def withStats: DataFrame

    Returns a parsed and cached representation of files with statistics.

    returns

    DataFrame
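A hypothetical sketch of querying the parsed statistics (snapshot and the selected stat fields are assumptions; the available stats fields depend on the table's schema):

```scala
import org.apache.spark.sql.functions.col

// Illustrative sketch: withStats exposes the AddFile columns plus a
// parsed `stats` struct, so per-file statistics can be queried directly.
snapshot.withStats
  .select(col("path"), col("stats.numRecords"), col("stats.nullCount"))
  .show(truncate = false)
```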

  90. def withStatsDeduplicated: DataFrame
  91. def withStatsInternal: DataFrame
    Attributes
    protected
  92. def withStatusCode[T](statusCode: String, defaultMessage: String, data: Map[String, Any] = Map.empty)(body: ⇒ T): T

    Reports a log message to indicate that some command is running.

    Definition Classes
    DeltaProgressReporter
