package stats
Type Members
- class ArrayAccumulator extends AccumulatorV2[(Int, Long), Array[Long]]
  An accumulator that keeps arrays of counts. Counts from multiple partitions are merged by index; -1 indicates a null and is handled using TVL (-1 + N = -1).
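The index-wise merge with null propagation described above can be sketched in plain Scala. This is a hypothetical illustration of the stated rule (-1 + N = -1), not the real accumulator's implementation; the object and method names are invented.

```scala
// Hypothetical sketch of ArrayAccumulator's index-wise merge rule, where
// -1 marks a null count and absorbs any addend (-1 + N = -1).
object ArrayMergeSketch {
  def mergeCounts(a: Array[Long], b: Array[Long]): Array[Long] = {
    require(a.length == b.length, "count arrays must have the same length")
    a.zip(b).map { case (x, y) =>
      if (x == -1L || y == -1L) -1L // a null in either side propagates
      else x + y                    // otherwise counts add normally
    }
  }
}
```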
- class AutoCompactPartitionStats extends AnyRef
  This singleton collects the table partition statistics for each commit that creates AddFile or RemoveFile objects. To control memory usage, it keeps at most 'maxNumTablePartitions' partition entries per table and 'maxNumPartitions' partition entries across all tables. Note:
  1. Since the number of partitions tracked per table is limited, when that limit is reached the least recently used table partitions are evicted.
  2. If all 'maxNumPartitions' entries are occupied, the partition stats of the least recently used tables are evicted until the number of used entries falls back below 'maxNumPartitions'.
  3. Un-partitioned tables are treated as tables with a single partition.
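The global eviction policy (rule 2 above) can be sketched with a recency-ordered map. This is a minimal illustration of least-recently-used eviction under a partition budget; the class and method names are invented, and the real class also enforces the per-table limit and thread safety.

```scala
import scala.collection.mutable

// Hypothetical sketch of the global LRU eviction described above: once the
// total number of tracked partition entries exceeds the budget, the least
// recently used tables' stats are dropped.
class PartitionStatsCacheSketch(maxNumPartitions: Int) {
  // LinkedHashMap preserves insertion order; re-inserting a key moves it
  // to the back, so the head is always the least recently used table.
  private val tables = mutable.LinkedHashMap.empty[String, Int] // table -> #partitions

  def recordTable(name: String, numPartitions: Int): Unit = {
    tables.remove(name)            // refresh recency
    tables.put(name, numPartitions)
    // Evict least recently used tables until back under the budget,
    // always keeping the table that was just recorded.
    while (tables.values.sum > maxNumPartitions && tables.size > 1) {
      tables.remove(tables.head._1)
    }
  }

  def trackedTables: Seq[String] = tables.keys.toSeq
}
```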
- trait AutoCompactPartitionStatsCollector extends AnyRef
  A collector used to aggregate auto-compaction stats for a single commit. The expectation is to spin this up for a commit and then merge those local stats with the global stats.
-
case class
DataSize(bytesCompressed: Option[Long] = None, rows: Option[Long] = None, files: Option[Long] = None, logicalRows: Option[Long] = None) extends Product with Serializable
DataSize describes following attributes for data that consists of a list of input files
DataSize describes following attributes for data that consists of a list of input files
- bytesCompressed
total size of the data
- rows
number of rows in the data
- files
number of input files Note: Please don't add any new constructor to this class.
jackson-module-scalaalways picks up the first constructor returned byClass.getConstructorsbut the order of the constructors list is non-deterministic. (SC-13343)
- trait DataSkippingReader extends DataSkippingReaderBase
- trait DataSkippingReaderBase extends DeltaScanGenerator with StatisticsCollection with ReadsMetadataFields with StateCache with DeltaLogging
  Adds the ability to use statistics to filter the set of files based on predicates to a org.apache.spark.sql.delta.Snapshot of a given Delta table.
- case class DeletedRecordCountsHistogram(deletedRecordCounts: Array[Long]) extends Product with Serializable
  A histogram class tracking the deleted record count distribution for all files in a table.
  - deletedRecordCounts: An array with 10 bins where each slot represents the number of files whose deleted record count falls within the range of the particular bin. The bin ranges are: bin1 -> [0,0], bin2 -> [1,9], bin3 -> [10,99], bin4 -> [100,999], bin5 -> [1000,9999], bin6 -> [10000,99999], bin7 -> [100000,999999], bin8 -> [1000000,9999999], bin9 -> [10000000,Int.Max - 1], bin10 -> [Int.Max,Long.Max].
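Mapping a deleted-record count onto those ten bins amounts to finding the last lower bound that does not exceed the count. The sketch below is a hypothetical illustration derived only from the bin ranges listed above; the real class may compute the index differently.

```scala
// Hypothetical bin lookup for the 10 ranges listed above. Each entry is the
// inclusive lower bound of its bin; the next entry starts the next bin.
object DeletedRecordBinSketch {
  private val lowerBounds: IndexedSeq[Long] = IndexedSeq(
    0L, 1L, 10L, 100L, 1000L, 10000L, 100000L, 1000000L,
    10000000L, Int.MaxValue.toLong)

  /** Returns the 0-based bin index for a non-negative deleted-record count. */
  def binIndex(deleted: Long): Int = {
    require(deleted >= 0, "deleted record count must be non-negative")
    lowerBounds.lastIndexWhere(_ <= deleted)
  }
}
```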
- case class DeltaFileStatistics(stats: Map[String, String]) extends WriteTaskStats with Product with Serializable
  A WriteTaskStats that contains a map from file name to the JSON representation of the collected statistics.
- class DeltaJobStatisticsTracker extends WriteJobStatsTracker
  Serializable factory class that holds together all required parameters for being able to instantiate a DeltaTaskStatisticsTracker on an executor.
- case class DeltaScan(version: Long, files: Seq[AddFile], total: DataSize, partition: DataSize, scanned: DataSize)(scannedSnapshot: Snapshot, partitionFilters: ExpressionSet, dataFilters: ExpressionSet, partitionLikeDataFilters: ExpressionSet, rewrittenPartitionLikeDataFilters: Set[Expression], unusedFilters: ExpressionSet, scanDurationMs: Long, dataSkippingType: DeltaDataSkippingType) extends Product with Serializable
  Used to hold details of the files and stats for a scan where we have already applied filters and a limit.
- trait DeltaScanGenerator extends AnyRef
  Trait representing a class that can generate a DeltaScan given filters and a limit.
- case class DeltaStatsColumnSpec(deltaStatsColumnNamesOpt: Option[Seq[UnresolvedAttribute]], numIndexedColsOpt: Option[Int]) extends Product with Serializable
  Specifies the set of columns to be used for stats collection on a table. The deltaStatsColumnNamesOpt has higher priority than numIndexedColsOpt. Thus, if deltaStatsColumnNamesOpt is not None, StatisticsCollection only collects file statistics for the columns inside it; otherwise, numIndexedColsOpt is used.
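The priority rule (explicit column list wins over a numeric limit) can be sketched with plain strings standing in for UnresolvedAttribute. This is an invented illustration of the described resolution order, not the real API.

```scala
// Hypothetical sketch of DeltaStatsColumnSpec's resolution order: an
// explicit column list takes priority; the numeric limit is the fallback.
case class StatsColumnSpecSketch(
    namesOpt: Option[Seq[String]],   // stands in for Option[Seq[UnresolvedAttribute]]
    numIndexedColsOpt: Option[Int])

object StatsColumnSpecSketch {
  def columnsToIndex(spec: StatsColumnSpecSketch, schema: Seq[String]): Seq[String] =
    spec.namesOpt match {
      case Some(names) => schema.filter(names.contains) // explicit list wins
      case None =>
        // Fall back to the first N schema columns, or all of them if no limit.
        spec.numIndexedColsOpt.fold(schema)(n => schema.take(n))
    }
}
```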
- class DeltaTaskStatisticsTracker extends WriteTaskStatsTracker
  A per-task (i.e. one instance per executor) WriteTaskStatsTracker that collects the statistics defined by StatisticsCollection for files that are being written into a Delta table.
- case class FileSizeHistogram(sortedBinBoundaries: IndexedSeq[Long], fileCounts: Array[Long], totalBytes: Array[Long]) extends Product with Serializable
  A histogram class tracking the file counts and total bytes in different size ranges.
  - sortedBinBoundaries: a sorted list of bin boundaries where each element represents the start of the bin (inclusive) and the next element represents the end of the bin (exclusive)
  - fileCounts: an array of Long representing the total number of files in different bins
  - totalBytes: an array of Long representing the total number of bytes in different bins
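Given sorted boundaries with inclusive starts and exclusive ends, locating a file's bin is a binary search. The sketch below is a hypothetical illustration of that lookup, not the histogram's actual code.

```scala
import java.util.Arrays

// Hypothetical sketch of bucketing a file size against sorted bin
// boundaries: each boundary starts a bin (inclusive) and the next
// boundary ends it (exclusive).
object FileSizeBinSketch {
  def binFor(sortedBinBoundaries: Array[Long], size: Long): Int = {
    val i = Arrays.binarySearch(sortedBinBoundaries, size)
    // Exact hit: that boundary starts the bin. Otherwise binarySearch
    // returns -(insertionPoint) - 1, and the bin is the one before the
    // insertion point.
    if (i >= 0) i else -i - 2
  }
}
```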
- case class FilterMetric(numFiles: Long, predicates: Seq[QueryPredicateReport]) extends Product with Serializable
  Used to report details about pre-query filtering of what data is scanned.
- case class NumRecords(numPhysicalRecords: Long, numLogicalRecords: Long) extends Product with Serializable
  Used in the deduplicateAndFilterRemovedLocally/getFilesAndNumRecords iterators for grouping the physical and logical number of records.
  - numPhysicalRecords: The number of records physically present in the file.
  - numLogicalRecords: The physical number of records minus the Deletion Vector cardinality.
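The relationship between the two fields can be made concrete with a tiny sketch. The constructor helper below is invented for illustration; only the arithmetic (logical = physical - deletion vector cardinality) comes from the description above.

```scala
// Hypothetical sketch of the invariant stated above:
// logical records = physical records - deletion vector cardinality.
case class NumRecordsSketch(numPhysicalRecords: Long, numLogicalRecords: Long)

object NumRecordsSketch {
  def fromDeletionVector(physical: Long, dvCardinality: Long): NumRecordsSketch = {
    require(dvCardinality <= physical, "DV cannot delete more rows than exist")
    NumRecordsSketch(physical, physical - dvCardinality)
  }
}
```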
- class PrepareDeltaScan extends Rule[LogicalPlan] with PrepareDeltaScanBase
- trait PrepareDeltaScanBase extends Rule[LogicalPlan] with PredicateHelper with DeltaLogging with OptimizeMetadataOnlyDeltaQuery with SubqueryTransformerHelper
  Before query planning, we prepare any scans over Delta tables by pushing any projections or filters in, allowing us to gather more accurate statistics for CBO and metering.
  Note the following:
  - This rule also ensures that all reads from the same Delta log use the same snapshot of the log, thus providing snapshot isolation.
  - If this rule is invoked within an active OptimisticTransaction, then the scans are generated using the transaction.
- case class PreparedDeltaFileIndex(spark: SparkSession, deltaLog: DeltaLog, path: Path, preparedScan: DeltaScan, versionScanned: Option[Long]) extends TahoeFileIndexWithSnapshotDescriptor with DeltaLogging with Product with Serializable
  A TahoeFileIndex that uses a prepared scan to return the list of relevant files. This is injected into a query right before query planning by PrepareDeltaScan so that CBO and metering can accurately understand how much data will be read.
  - versionScanned: The version of the table that is being scanned, if a specific version has been explicitly requested, e.g. by time travel.
- case class QueryPredicateReport(predicate: String, pruningType: String, filesMissingStats: Long, filesDropped: Long) extends Product with Serializable
  Used to report metrics on how predicates are used to prune the set of files that are read by a query.
  - predicate: A user-readable version of the predicate.
  - pruningType: One of {partition, dataStats, none}.
  - filesMissingStats: The number of files that were included due to missing statistics.
  - filesDropped: The number of files that were dropped by this predicate.
- trait ReadsMetadataFields extends AnyRef
  A mixin trait that provides access to the stats fields in the transaction log.
- case class ScanAfterLimit(files: Seq[AddFile], byteSize: Option[Long], numPhysicalRecords: Option[Long], numLogicalRecords: Option[Long]) extends Product with Serializable
  Used to hold the list of files and scan stats after pruning files using the limit.
- trait StatisticsCollection extends DeltaLogging
  A helper trait that constructs expressions that can be used to collect global and column level statistics for a collection of data, given its schema.
  Global statistics (such as the number of records) are stored as top level columns. Per-column statistics (such as min/max) are stored in a struct that mirrors the schema of the data.
  To illustrate, here is an example of a data schema along with the schema of the statistics that would be collected.
  Data Schema:
  |-- a: struct (nullable = true)
  |    |-- b: struct (nullable = true)
  |    |    |-- c: long (nullable = true)
  Collected Statistics:
  |-- stats: struct (nullable = true)
  |    |-- numRecords: long (nullable = false)
  |    |-- minValues: struct (nullable = false)
  |    |    |-- a: struct (nullable = false)
  |    |    |    |-- b: struct (nullable = false)
  |    |    |    |    |-- c: long (nullable = true)
  |    |-- maxValues: struct (nullable = false)
  |    |    |-- a: struct (nullable = false)
  |    |    |    |-- b: struct (nullable = false)
  |    |    |    |    |-- c: long (nullable = true)
  |    |-- nullCount: struct (nullable = false)
  |    |    |-- a: struct (nullable = false)
  |    |    |    |-- b: struct (nullable = false)
  |    |    |    |    |-- c: long (nullable = true)
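The statistics shapes above (numRecords, minValues, maxValues, nullCount) can be illustrated for a single nullable column in plain Scala. This is a toy model of what the trait's expressions compute, not the Spark expressions themselves; the names are invented.

```scala
// Hypothetical sketch of the per-column statistics described above,
// computed for one nullable long column.
object StatsSketch {
  case class ColumnStats(
      numRecords: Long,         // global: total row count
      min: Option[Long],        // per-column: minimum of non-null values
      max: Option[Long],        // per-column: maximum of non-null values
      nullCount: Long)          // per-column: number of nulls

  def collect(values: Seq[Option[Long]]): ColumnStats = {
    val defined = values.flatten
    ColumnStats(
      numRecords = values.size.toLong,
      min = if (defined.nonEmpty) Some(defined.min) else None,
      max = if (defined.nonEmpty) Some(defined.max) else None,
      nullCount = values.count(_.isEmpty).toLong)
  }
}
```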
- abstract class StatsCollector extends Serializable
  A helper class to collect stats of Parquet data files for a Delta table and its equivalents (tables that can be converted into a Delta table, such as Parquet/Iceberg tables).
Value Members
- object AutoCompactPartitionStats
- object DataSize extends Serializable
- object DataSkippingPredicateBuilder
  A collection of supported data skipping predicate builders.
- object DeltaDataSkippingType extends Enumeration
- object DeltaStatistics
  A singleton of the Delta statistics field names.
- object ParallelFetchPool
- object PrepareDeltaScanBase
- object SkippingEligibleColumn
  An extractor that matches on access of a skipping-eligible column. We only collect stats for leaf columns, so internal columns of nested types are ineligible for skipping.
  NOTE: This check is sufficient for safe use of NULL_COUNT stats, but safe use of MIN and MAX stats requires additional restrictions on column data type (see SkippingEligibleLiteral).
  - returns: The path to the column and the column's data type if it exists and is eligible. Otherwise, return None.
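The extractor pattern can be illustrated with a toy schema model: only a path that resolves to a leaf (non-struct) field matches, mirroring the "leaf columns only" rule above. Everything here (the toy types, LeafColumn) is invented for illustration; the real extractor works on Spark expressions and DataTypes.

```scala
// Toy schema model standing in for Spark's DataType hierarchy.
sealed trait ToyType
case object ToyLongType extends ToyType
case class ToyStructType(fields: Map[String, ToyType]) extends ToyType

// Hypothetical extractor in the spirit of SkippingEligibleColumn: matches
// only when the path resolves to a leaf column, returning its dotted path
// and type; internal struct columns do not match.
object LeafColumn {
  def unapply(arg: (ToyType, Seq[String])): Option[(String, ToyType)] = {
    val (schema, path) = arg
    def resolve(t: ToyType, p: Seq[String]): Option[ToyType] = (t, p) match {
      case (ToyStructType(fs), head +: tail) => fs.get(head).flatMap(resolve(_, tail))
      case (ToyStructType(_), _)             => None       // path ends at a struct
      case (leaf, Seq())                     => Some(leaf) // leaf, path consumed
      case _                                 => None       // path continues past a leaf
    }
    resolve(schema, path).map(t => (path.mkString("."), t))
  }
}
```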
- object SkippingEligibleDataType
- object SkippingEligibleLiteral
  An extractor that matches on access of a skipping-eligible Literal. Delta tables track min/max stats for a limited set of data types, and only Literals of those types are skipping-eligible.
  - returns: The Literal, if it is eligible. Otherwise, return None.
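A type-whitelist extractor of this kind can be sketched in a few lines. The whitelist below is illustrative only; the real set of skipping-eligible Spark SQL types differs, and the real extractor operates on Spark Literal expressions.

```scala
// Hypothetical extractor in the spirit of SkippingEligibleLiteral: only
// values whose types have a meaningful min/max ordering match.
object EligibleLiteral {
  def unapply(v: Any): Option[Any] = v match {
    case _: Int | _: Long | _: Double | _: String => Some(v) // ordered types
    case _ => None // e.g. arrays, maps, structs carry no usable min/max
  }
}
```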
- object StatisticsCollection extends DeltaCommand
- object StatsCollectionUtils extends LoggingShims
- object StatsCollector extends Serializable