Packages

package stats


Type Members

  1. class ArrayAccumulator extends AccumulatorV2[(Int, Long), Array[Long]]

    An accumulator that keeps arrays of counts. Counts from multiple partitions are merged by index; -1 indicates a null and is handled using three-valued logic (TVL): -1 + N = -1.
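    The merge rule above can be sketched as follows. This is a hypothetical helper, not the actual ArrayAccumulator implementation: counts are summed element-wise, and -1 propagates as "null" under three-valued logic.

    ```scala
    object CountMerge {
      // Merge two per-partition count arrays by index. -1 marks a null count
      // and absorbs any addend (-1 + N = -1), mirroring three-valued logic.
      def mergeCounts(a: Array[Long], b: Array[Long]): Array[Long] = {
        require(a.length == b.length, "count arrays must have the same length")
        a.zip(b).map { case (x, y) =>
          if (x == -1L || y == -1L) -1L else x + y
        }
      }
    }
    ```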

  2. class AutoCompactPartitionStats extends AnyRef

    This singleton object collects the table partition statistics for each commit that creates AddFile or RemoveFile objects. To control memory usage, there are at most 'maxNumTablePartitions' partition entries per table and 'maxNumPartitions' partition entries across all tables. Note:

    1. Since the number of partition entries per table is limited, once this limit is reached the least recently used table partitions are evicted.
    2. If all 'maxNumPartitions' entries are occupied, the partition stats of the least recently used tables are evicted until the number of used partitions falls back below 'maxNumPartitions'.
    3. Un-partitioned tables are treated as tables with a single partition.
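    The LRU eviction policy described above can be sketched with a LinkedHashMap (a simplified illustration, not the actual AutoCompactPartitionStats code; the class name and methods here are hypothetical):

    ```scala
    import scala.collection.mutable

    class PartitionStatsCache(maxNumPartitions: Int) {
      // LinkedHashMap preserves insertion order; removing and re-inserting a
      // table on access moves it to the most-recently-used end.
      private val tables = mutable.LinkedHashMap.empty[String, Int]

      private def usedPartitions: Int = tables.values.sum

      def record(table: String, numPartitions: Int): Unit = {
        tables.remove(table)            // refresh LRU position
        tables.put(table, numPartitions)
        // Evict least recently used tables until usage is back under budget.
        while (usedPartitions > maxNumPartitions && tables.size > 1) {
          tables.remove(tables.head._1)
        }
      }

      def trackedTables: Seq[String] = tables.keys.toSeq
    }
    ```

    For example, with a budget of 4 partition entries, recording three tables with 2 partitions each evicts the oldest table.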
  3. trait AutoCompactPartitionStatsCollector extends AnyRef

    A collector used to aggregate auto-compaction stats for a single commit. The expectation is to spin this up for a commit and then merge those local stats with the global stats.

  4. case class DataSize(bytesCompressed: Option[Long] = None, rows: Option[Long] = None, files: Option[Long] = None, logicalRows: Option[Long] = None) extends Product with Serializable

    DataSize describes the following attributes for data that consists of a list of input files.

    bytesCompressed

    total size of the data

    rows

    number of rows in the data

    files

    number of input files

    Note: Please don't add any new constructor to this class. jackson-module-scala always picks up the first constructor returned by Class.getConstructors, but the order of the constructors list is non-deterministic. (SC-13343)

  5. trait DataSkippingReader extends DataSkippingReaderBase
  6. trait DataSkippingReaderBase extends DeltaScanGenerator with StatisticsCollection with ReadsMetadataFields with StateCache with DeltaLogging

    Adds the ability to use statistics to filter the set of files based on predicates to a org.apache.spark.sql.delta.Snapshot of a given Delta table.

  7. case class DeletedRecordCountsHistogram(deletedRecordCounts: Array[Long]) extends Product with Serializable

    A Histogram class tracking the deleted record count distribution for all files in a table.

    deletedRecordCounts

    An array with 10 bins, where each slot represents the number of files whose deleted record count falls within the range of that bin. The bin ranges are:

    bin1  -> [0, 0]
    bin2  -> [1, 9]
    bin3  -> [10, 99]
    bin4  -> [100, 999]
    bin5  -> [1000, 9999]
    bin6  -> [10000, 99999]
    bin7  -> [100000, 999999]
    bin8  -> [1000000, 9999999]
    bin9  -> [10000000, Int.Max - 1]
    bin10 -> [Int.Max, Long.Max]
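    Mapping a deleted-record count to its bin can be sketched as below (a hypothetical helper, not part of DeletedRecordCountsHistogram's API):

    ```scala
    object DeletedRecordBins {
      // Lower bounds of the 10 bins described above: [0,0], [1,9], [10,99],
      // ..., [10000000, Int.Max - 1], [Int.Max, Long.Max].
      private val lowerBounds: Array[Long] = Array(
        0L, 1L, 10L, 100L, 1000L, 10000L, 100000L, 1000000L, 10000000L,
        Int.MaxValue.toLong)

      // Index of the last bin whose lower bound does not exceed the count.
      def binIndex(deletedRecords: Long): Int =
        lowerBounds.lastIndexWhere(_ <= deletedRecords)
    }
    ```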

  8. case class DeltaFileStatistics(stats: Map[String, String]) extends WriteTaskStats with Product with Serializable

    A WriteTaskStats that contains a map from file name to the json representation of the collected statistics.

  9. class DeltaJobStatisticsTracker extends WriteJobStatsTracker

    Serializable factory class that holds together all required parameters for being able to instantiate a DeltaTaskStatisticsTracker on an executor.

  10. case class DeltaScan(version: Long, files: Seq[AddFile], total: DataSize, partition: DataSize, scanned: DataSize)(scannedSnapshot: Snapshot, partitionFilters: ExpressionSet, dataFilters: ExpressionSet, partitionLikeDataFilters: ExpressionSet, rewrittenPartitionLikeDataFilters: Set[Expression], unusedFilters: ExpressionSet, scanDurationMs: Long, dataSkippingType: DeltaDataSkippingType) extends Product with Serializable

    Used to hold details of the files and stats for a scan where we have already applied filters and a limit.

  11. trait DeltaScanGenerator extends AnyRef

    Trait representing a class that can generate DeltaScan given filters and a limit.

  12. case class DeltaStatsColumnSpec(deltaStatsColumnNamesOpt: Option[Seq[UnresolvedAttribute]], numIndexedColsOpt: Option[Int]) extends Product with Serializable

    Specifies the set of columns to be used for stats collection on a table. deltaStatsColumnNamesOpt has higher priority than numIndexedColsOpt: if deltaStatsColumnNamesOpt is not None, StatisticsCollection collects file statistics only for the columns inside it; otherwise, numIndexedColsOpt is used.

  13. class DeltaTaskStatisticsTracker extends WriteTaskStatsTracker

    A per-task (i.e. one instance per executor) WriteTaskStatsTracker that collects the statistics defined by StatisticsCollection for files that are being written into a delta table.

  14. case class FileSizeHistogram(sortedBinBoundaries: IndexedSeq[Long], fileCounts: Array[Long], totalBytes: Array[Long]) extends Product with Serializable

    A Histogram class tracking the file counts and total bytes in different size ranges

    sortedBinBoundaries

    - a sorted list of bin boundaries where each element represents the start of the bin (included) and the next element represents the end of the bin (excluded)

    fileCounts

    - an array of Long representing the total number of files in different bins

    totalBytes

    - an array of Long representing total number of bytes in different bins
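    Given the semantics above (each boundary starts a bin inclusively, and the next boundary ends it exclusively), locating a file's bin is a binary search over sortedBinBoundaries. This is a sketch with a hypothetical helper, not FileSizeHistogram's actual method:

    ```scala
    object FileSizeBins {
      def binIndex(sortedBinBoundaries: IndexedSeq[Long], size: Long): Int = {
        val i = java.util.Arrays.binarySearch(sortedBinBoundaries.toArray, size)
        // An exact hit lands on the boundary that starts the bin. Otherwise
        // binarySearch returns -(insertionPoint) - 1, and the bin is the one
        // just before the insertion point.
        if (i >= 0) i else -i - 2
      }
    }
    ```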

  15. case class FilterMetric(numFiles: Long, predicates: Seq[QueryPredicateReport]) extends Product with Serializable

    Used to report details about pre-query filtering of what data is scanned.

  16. case class NumRecords(numPhysicalRecords: Long, numLogicalRecords: Long) extends Product with Serializable

    Used in deduplicateAndFilterRemovedLocally/getFilesAndNumRecords iterator for grouping physical and logical number of records.

    numPhysicalRecords

    The number of records physically present in the file.

    numLogicalRecords

    The physical number of records minus the Deletion Vector cardinality.
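    The relationship between the two counts can be sketched as below. `fromDeletionVector` is a hypothetical constructor for illustration, not Delta's API:

    ```scala
    // Logical records = physical records minus the Deletion Vector cardinality
    // (the number of rows marked as deleted in the file).
    final case class NumRecordsSketch(numPhysicalRecords: Long, numLogicalRecords: Long)

    object NumRecordsSketch {
      def fromDeletionVector(numPhysicalRecords: Long, dvCardinality: Long): NumRecordsSketch =
        NumRecordsSketch(numPhysicalRecords, numPhysicalRecords - dvCardinality)
    }
    ```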

  17. class PrepareDeltaScan extends Rule[LogicalPlan] with PrepareDeltaScanBase
  18. trait PrepareDeltaScanBase extends Rule[LogicalPlan] with PredicateHelper with DeltaLogging with OptimizeMetadataOnlyDeltaQuery with SubqueryTransformerHelper

    Before query planning, we prepare any scans over delta tables by pushing in any projections or filters, allowing us to gather more accurate statistics for CBO and metering.

    Note the following:

    - This rule also ensures that all reads from the same delta log use the same snapshot of the log, thus providing snapshot isolation.
    - If this rule is invoked within an active OptimisticTransaction, then the scans are generated using the transaction.

  19. case class PreparedDeltaFileIndex(spark: SparkSession, deltaLog: DeltaLog, path: Path, preparedScan: DeltaScan, versionScanned: Option[Long]) extends TahoeFileIndexWithSnapshotDescriptor with DeltaLogging with Product with Serializable

    A TahoeFileIndex that uses a prepared scan to return the list of relevant files. This is injected into a query right before query planning by PrepareDeltaScan so that CBO and metering can accurately understand how much data will be read.

    versionScanned

    The version of the table that is being scanned, if a specific version has been requested, e.g. by time travel.

  20. case class QueryPredicateReport(predicate: String, pruningType: String, filesMissingStats: Long, filesDropped: Long) extends Product with Serializable

    Used to report metrics on how predicates are used to prune the set of files that are read by a query.

    predicate

    A user readable version of the predicate.

    pruningType

    One of {partition, dataStats, none}.

    filesMissingStats

    The number of files that were included due to missing statistics.

    filesDropped

    The number of files that were dropped by this predicate.

  21. trait ReadsMetadataFields extends AnyRef

    A mixin trait that provides access to the stats fields in the transaction log.

  22. case class ScanAfterLimit(files: Seq[AddFile], byteSize: Option[Long], numPhysicalRecords: Option[Long], numLogicalRecords: Option[Long]) extends Product with Serializable

    Used to hold the list of files and scan stats after pruning files using the limit.

  23. trait StatisticsCollection extends DeltaLogging

    A helper trait that constructs expressions that can be used to collect global and column level statistics for a collection of data, given its schema.

    Global statistics (such as the number of records) are stored as top level columns. Per-column statistics (such as min/max) are stored in a struct that mirrors the schema of the data.

    To illustrate, here is an example of a data schema along with the schema of the statistics that would be collected.

    Data Schema:

    |-- a: struct (nullable = true)
    |    |-- b: struct (nullable = true)
    |    |    |-- c: long (nullable = true)

    Collected Statistics:

    |-- stats: struct (nullable = true)
    |    |-- numRecords: long (nullable = false)
    |    |-- minValues: struct (nullable = false)
    |    |    |-- a: struct (nullable = false)
    |    |    |    |-- b: struct (nullable = false)
    |    |    |    |    |-- c: long (nullable = true)
    |    |-- maxValues: struct (nullable = false)
    |    |    |-- a: struct (nullable = false)
    |    |    |    |-- b: struct (nullable = false)
    |    |    |    |    |-- c: long (nullable = true)
    |    |-- nullCount: struct (nullable = false)
    |    |    |-- a: struct (nullable = false)
    |    |    |    |-- b: struct (nullable = false)
    |    |    |    |    |-- c: long (nullable = true)
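    For the schema above, the per-file statistics would serialize to a JSON document of the same shape. The values here are illustrative only:

    ```scala
    // Example stats string for a file whose column a.b.c ranges from 1 to 9
    // across 3 records with no nulls.
    val statsJson: String =
      """{"numRecords":3,
        |"minValues":{"a":{"b":{"c":1}}},
        |"maxValues":{"a":{"b":{"c":9}}},
        |"nullCount":{"a":{"b":{"c":0}}}}""".stripMargin
    ```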
  24. abstract class StatsCollector extends Serializable

    A helper class to collect stats of parquet data files for a Delta table and its equivalents (tables that can be converted into a Delta table, such as Parquet or Iceberg tables).

Value Members

  1. object AutoCompactPartitionStats
  2. object DataSize extends Serializable
  3. object DataSkippingPredicateBuilder

    A collection of supported data skipping predicate builders.

  4. object DeltaDataSkippingType extends Enumeration
  5. object DeltaStatistics

    A singleton of the Delta statistics field names.

  6. object ParallelFetchPool
  7. object PrepareDeltaScanBase
  8. object SkippingEligibleColumn

    An extractor that matches on access of a skipping-eligible column. We only collect stats for leaf columns, so internal columns of nested types are ineligible for skipping.

    NOTE: This check is sufficient for safe use of NULL_COUNT stats, but safe use of MIN and MAX stats requires additional restrictions on column data type (see SkippingEligibleLiteral).

    returns

    The path to the column and the column's data type if it exists and is eligible. Otherwise, return None.
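    The extractor pattern described above can be illustrated with a toy expression tree (this is not Catalyst code; the types and names here are hypothetical): unapply walks nested field accesses down to a leaf and returns the column path, or None for anything that is not a plain column access.

    ```scala
    sealed trait Expr
    final case class ColumnRef(name: String) extends Expr
    final case class GetField(child: Expr, field: String) extends Expr
    final case class Lit(value: Any) extends Expr

    object EligibleColumn {
      // Returns the path to the column if the expression is a (possibly
      // nested) column access; otherwise None.
      def unapply(e: Expr): Option[Seq[String]] = e match {
        case ColumnRef(name)    => Some(Seq(name))
        case GetField(child, f) => unapply(child).map(_ :+ f)
        case _                  => None
      }
    }
    ```

    This lets callers write `case EligibleColumn(path) => ...` in a pattern match, which is how Catalyst-style extractors are typically consumed.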

  9. object SkippingEligibleDataType
  10. object SkippingEligibleLiteral

    An extractor that matches on access of a skipping-eligible Literal. Delta tables track min/max stats for a limited set of data types, and only Literals of those types are skipping-eligible.

    returns

    The Literal, if it is eligible. Otherwise, return None.
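    The eligibility check can be sketched as a toy extractor over plain values (the type list here is illustrative, not Delta's exact set of min/max-eligible types):

    ```scala
    object EligibleLiteral {
      // Matches only values whose types could carry min/max stats; everything
      // else (arrays, maps, structs, ...) is rejected.
      def unapply(v: Any): Option[Any] = v match {
        case _: Int | _: Long | _: Double | _: String |
             _: java.sql.Date | _: java.sql.Timestamp => Some(v)
        case _ => None
      }
    }
    ```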

  11. object StatisticsCollection extends DeltaCommand
  12. object StatsCollectionUtils extends LoggingShims
  13. object StatsCollector extends Serializable
