abstract class StatsCollector extends Serializable

A helper class that collects statistics from Parquet data files for a Delta table or an equivalent table (one that can be converted into a Delta table, such as a Parquet or Iceberg table).

Linear Supertypes
Serializable, Serializable, AnyRef, Any

Instance Constructors

  1. new StatsCollector(dataSchema: StructType, statsSchema: StructType, parquetRebaseMode: String, ignoreMissingStats: Boolean, stringTruncateLength: Option[Int])

    dataSchema

    The data schema from table metadata, which is the logical schema with logical to physical mapping per schema field. It is used to map statsSchema to parquet metadata.

    statsSchema

    The schema of the stats to collect. statsSchema should follow the physical schema and must be generated by StatisticsCollection.

    parquetRebaseMode

    The Parquet rebase mode used to parse dates and timestamps.

    ignoreMissingStats

    Indicates whether to return a partial result by ignoring missing stats, or to throw an exception.

    stringTruncateLength

    The optional maximum length to which string stats are truncated.

    Scala Example:

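    import org.apache.parquet.hadoop.ParquetFileReader
    import org.apache.spark.sql.delta.sources.DeltaSQLConf
    import org.apache.spark.sql.delta.stats.StatsCollector
    import org.apache.spark.sql.delta.util.{DeltaFileOperations, JsonUtils}

    // Assumes spark, snapshot, dataPath, and hadoopConf are already in scope.
    val stringTruncateLength =
      spark.sessionState.conf.getConf(DeltaSQLConf.DATA_SKIPPING_STRING_PREFIX_LENGTH)

    // Build a collector for the table's column mapping mode.
    // Note: the constructor documented above also takes parquetRebaseMode,
    // which this call does not pass.
    val statsCollector = StatsCollector(
      snapshot.metadata.columnMappingMode, snapshot.metadata.dataSchema, snapshot.statsSchema,
      ignoreMissingStats = false, Some(stringTruncateLength))

    val filesWithStats = snapshot.allFiles.map { file =>
      val path = DeltaFileOperations.absolutePath(dataPath, file.path)
      val fileSystem = path.getFileSystem(hadoopConf)
      val fileStatus = fileSystem.listStatus(path).head

      // Read the Parquet footer and derive the stats from its row-group metadata.
      val footer = ParquetFileReader.readFooter(hadoopConf, fileStatus)
      val (stats, _) = statsCollector.collect(footer)
      file.copy(stats = JsonUtils.toJson(stats))
    }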

Type Members

  1. case class StatsCollectionMetrics(numMissingMax: Long, numMissingMin: Long, numMissingNullCount: Long, numMissingTypes: Long) extends Product with Serializable

    Reports the number of missing fields per supported stats type and the number of unsupported types requested in the collected statistics. Statistics collection currently supports four types of stats: NUM_RECORDS, MAX, MIN, and NULL_COUNT.

    numMissingMax

    The number of missing fields for MAX

    numMissingMin

    The number of missing fields for MIN

    numMissingNullCount

    The number of missing fields for NULL_COUNT

    numMissingTypes

    The number of unsupported types requested.
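
As a rough illustration of how these counters might be consumed, the sketch below defines a stand-in case class with the same fields as the documented signature, so it runs without Delta on the classpath. The `MetricsSketch` object and its helpers (`totalMissing`, `isComplete`, `merge`) are hypothetical names introduced here and are not part of the Delta API.

```scala
// Hypothetical stand-in mirroring the documented case class signature.
final case class StatsCollectionMetrics(
    numMissingMax: Long,
    numMissingMin: Long,
    numMissingNullCount: Long,
    numMissingTypes: Long)

// Hypothetical helpers; not part of the Delta API.
object MetricsSketch {
  // Total number of requested stats that could not be collected.
  def totalMissing(m: StatsCollectionMetrics): Long =
    m.numMissingMax + m.numMissingMin + m.numMissingNullCount + m.numMissingTypes

  // True when every requested stat was found.
  def isComplete(m: StatsCollectionMetrics): Boolean = totalMissing(m) == 0L

  // Field-wise sum, e.g. to combine the metrics of several files.
  def merge(a: StatsCollectionMetrics, b: StatsCollectionMetrics): StatsCollectionMetrics =
    StatsCollectionMetrics(
      a.numMissingMax + b.numMissingMax,
      a.numMissingMin + b.numMissingMin,
      a.numMissingNullCount + b.numMissingNullCount,
      a.numMissingTypes + b.numMissingTypes)
}
```

A caller could, for example, merge the metrics returned for each file and log a warning when `isComplete` is false for the merged result.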

Abstract Value Members

  1. abstract def getSchemaPhysicalPathToParquetIndex(blockMetaData: BlockMetaData): Map[Seq[String], Int]

    Returns the map from each schema physical field path (a field for which to collect stats) to the Parquet metadata column index (where to collect stats). The statsSchema generated by StatisticsCollection always uses physical field paths, so these paths are the same as the ones used in statsSchema. Child classes must implement this method based on the Delta column mapping mode.
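
The shape of the returned map can be sketched as follows. This is a simplified illustration, not the Delta implementation: it assumes a name-based mapping over primitive columns only, and it replaces BlockMetaData with a plain sequence of column paths. `PathIndexSketch` and `physicalPathToIndex` are hypothetical names.

```scala
// Hypothetical sketch; a real implementation would read the column chunk
// paths from BlockMetaData and honor the Delta column mapping mode.
object PathIndexSketch {
  // Each inner Seq is the physical path of one column chunk, listed in the
  // order in which the chunks appear in the row group.
  def physicalPathToIndex(columnPaths: Seq[Seq[String]]): Map[Seq[String], Int] =
    columnPaths.zipWithIndex.toMap
}
```

With the paths `Seq("id")` and `Seq("address", "city")`, the nested field `address.city` maps to column index 1.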

Concrete Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. final val NUM_MISSING_TYPES: String("numMissingTypes")
  5. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  6. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  7. final def collect(parquetMetadata: ParquetMetadata): (Map[String, Any], StatsCollectionMetrics)

    Collects the stats from ParquetMetadata.

    parquetMetadata

    The metadata of a Parquet file that follows the physical schema. It contains the statistics of the row groups.

    returns

    A nested Map[String, Any] from the requested stats field names to their stats field values, together with a StatsCollectionMetrics counting the number of missing fields/types.

  8. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  9. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  10. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  11. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  12. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  13. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  14. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  15. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  16. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  17. lazy val schemaPhysicalPathAndSchemaField: Seq[(Seq[String], StructField)]

    A list of schema physical paths and the corresponding struct fields of leaf fields. Besides primitive types, Map and Array (instead of their sub-columns) are also treated as leaf fields, since only their null count is computed and nulls are counted on the Map or Array itself rather than on its sub-fields.

    Attributes
    protected
  18. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  19. def toString(): String
    Definition Classes
    AnyRef → Any
  20. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  21. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  22. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  23. object StatsCollectionMetrics extends Serializable
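
Because the ParquetMetadata passed to collect carries statistics per row group, file-level stats have to be combined across row groups. The sketch below shows one plausible way to do that under simplifying assumptions: `ColumnStats`, `RowGroup`, and `CollectSketch` are hypothetical stand-ins for the Parquet metadata types, all values are Long, columns are flat, and a column whose stat is missing in any row group is silently omitted (roughly the behavior one might expect with ignoreMissingStats set to true). The key names in the result map are illustrative.

```scala
// Hypothetical, simplified view of one column's stats in one row group.
final case class ColumnStats(min: Option[Long], max: Option[Long], nullCount: Option[Long])

// Hypothetical stand-in for a row group: its row count plus per-column stats.
final case class RowGroup(rowCount: Long, columns: Map[String, ColumnStats])

object CollectSketch {
  // Combines one stat across row groups; None if any row group lacks it.
  private def combine(values: Seq[Option[Long]])(op: (Long, Long) => Long): Option[Long] =
    if (values.isEmpty || values.exists(_.isEmpty)) None
    else Some(values.flatten.reduce(op))

  def collect(rowGroups: Seq[RowGroup], columns: Seq[String]): Map[String, Any] = {
    // Builds the per-column map for one stat type, skipping incomplete columns.
    def perColumn(pick: ColumnStats => Option[Long])(op: (Long, Long) => Long): Map[String, Long] =
      columns.flatMap { name =>
        val values = rowGroups.map(_.columns.get(name).flatMap(pick))
        combine(values)(op).map(v => name -> v)
      }.toMap

    Map(
      "numRecords" -> rowGroups.map(_.rowCount).sum, // row counts add up
      "minValues" -> perColumn(_.min)(_ min _),      // minimum of the minimums
      "maxValues" -> perColumn(_.max)(_ max _),      // maximum of the maximums
      "nullCount" -> perColumn(_.nullCount)(_ + _))  // null counts add up
  }
}
```

The real collector additionally reports what was skipped through StatsCollectionMetrics, which this sketch leaves out for brevity.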
