abstract class StatsCollector extends Serializable
A helper class to collect stats of parquet data files for a Delta table and its equivalents (tables that can be converted into a Delta table, such as Parquet or Iceberg tables).
Instance Constructors
- new StatsCollector(dataSchema: StructType, statsSchema: StructType, parquetRebaseMode: String, ignoreMissingStats: Boolean, stringTruncateLength: Option[Int])
- dataSchema
The data schema from the table metadata: the logical schema, with a logical-to-physical mapping per schema field. It is used to map statsSchema to the parquet metadata.
- statsSchema
The schema of the stats to be collected. statsSchema should follow the physical schema and must be generated by StatisticsCollection.
- parquetRebaseMode
The parquet rebase mode used to parse dates and timestamps.
- ignoreMissingStats
Whether to return a partial result by ignoring missing stats, or to throw an exception.
- stringTruncateLength
The optional maximum length to which string stats are truncated. Scala Example:
```scala
import org.apache.spark.sql.delta.stats.StatsCollector

val stringTruncateLength =
  spark.sessionState.conf.getConf(DeltaSQLConf.DATA_SKIPPING_STRING_PREFIX_LENGTH)
val statsCollector = StatsCollector(
  snapshot.metadata.columnMappingMode,
  snapshot.metadata.dataSchema,
  snapshot.statsSchema,
  ignoreMissingStats = false,
  Some(stringTruncateLength))
val filesWithStats = snapshot.allFiles.map { file =>
  val path = DeltaFileOperations.absolutePath(dataPath, file.path)
  val fileSystem = path.getFileSystem(hadoopConf)
  val fileStatus = fileSystem.listStatus(path).head
  val footer = ParquetFileReader.readFooter(hadoopConf, fileStatus)
  val (stats, _) = statsCollector.collect(footer)
  file.copy(stats = JsonUtils.toJson(stats))
}
```
Type Members
- case class StatsCollectionMetrics(numMissingMax: Long, numMissingMin: Long, numMissingNullCount: Long, numMissingTypes: Long) extends Product with Serializable
Used to report the number of missing fields per supported stats type, and the number of requested but unsupported types, in the collected statistics. The statistics collection currently supports 4 types of stats: NUM_RECORDS, MAX, MIN, NULL_COUNT.
- numMissingMax
The number of missing fields for MAX
- numMissingMin
The number of missing fields for MIN
- numMissingNullCount
The number of missing fields for NULL_COUNT
- numMissingTypes
The number of unsupported types being requested.
Abstract Value Members
- abstract def getSchemaPhysicalPathToParquetIndex(blockMetaData: BlockMetaData): Map[Seq[String], Int]
Returns the map from schema physical field path (the field for which to collect stats) to the parquet metadata column index (where to collect stats). Since statsSchema generated by StatisticsCollection always uses physical field paths, the physical field paths here are the same as the ones used in statsSchema. Child classes must implement this method based on the Delta column mapping mode.
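As an illustration, a subclass for tables without column mapping (where physical field paths equal the parquet column paths) could derive this map directly from the row group's column chunk metadata. This is a hypothetical sketch, not the actual Delta implementation:

```scala
import scala.collection.JavaConverters._
import org.apache.parquet.hadoop.metadata.BlockMetaData

// Hypothetical sketch: with no column mapping, the physical field path
// equals the parquet column path, so the map can be built by indexing the
// row group's column chunks in order.
def getSchemaPhysicalPathToParquetIndex(blockMetaData: BlockMetaData): Map[Seq[String], Int] =
  blockMetaData.getColumns.asScala.zipWithIndex.map { case (column, index) =>
    // ColumnChunkMetaData.getPath returns the column's path in the parquet schema.
    (column.getPath.toArray.toSeq, index)
  }.toMap
```

A name-mapping or id-mapping subclass would instead translate each physical path through the column mapping metadata before the lookup.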
Concrete Value Members
- final def !=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final def ##: Int
- Definition Classes
- AnyRef → Any
- final def ==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final val NUM_MISSING_TYPES: String("numMissingTypes")
- final def asInstanceOf[T0]: T0
- Definition Classes
- Any
- def clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.CloneNotSupportedException]) @native()
- final def collect(parquetMetadata: ParquetMetadata): (Map[String, Any], StatsCollectionMetrics)
Collects the stats from ParquetMetadata.
- parquetMetadata
The metadata of a parquet file following the physical schema; it contains the statistics of its row groups.
- returns
A nested Map[String, Any] from the requested stats field names to their stats values, and a StatsCollectionMetrics counting the number of missing fields/types.
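Reusing `statsCollector` and `footer` from the constructor example above, the returned pair can be pictured as follows. The field names mirror Delta's stats JSON; the concrete values are illustrative only:

```scala
// Hedged sketch of the return shape of collect (values are made up).
val (stats, metrics) = statsCollector.collect(footer)
// stats, e.g.:
//   Map("numRecords" -> 1000L,
//       "minValues"  -> Map("id" -> 1L),
//       "maxValues"  -> Map("id" -> 999L),
//       "nullCount"  -> Map("id" -> 0L))
// metrics counts how many MAX / MIN / NULL_COUNT fields, and how many
// unsupported types, were missing; with ignoreMissingStats = false a
// missing field raises an exception instead of being counted.
```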
- final def eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- def equals(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef → Any
- def finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.Throwable])
- final def getClass(): Class[_ <: AnyRef]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- def hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
- final def isInstanceOf[T0]: Boolean
- Definition Classes
- Any
- final def ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
- final def notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- final def notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
- lazy val schemaPhysicalPathAndSchemaField: Seq[(Seq[String], StructField)]
A list of schema physical paths and the corresponding struct fields of leaf fields. Besides primitive types, Map and Array (rather than their sub-columns) are also treated as leaf fields, since we only compute their null count, and null is counted on the field itself rather than on its sub-fields.
- Attributes
- protected
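The leaf-field enumeration can be pictured with a sketch like the following. It is illustrative only: it uses logical field names and ignores column mapping, whereas the real member pairs each field with its physical path:

```scala
import org.apache.spark.sql.types._

// Illustrative sketch (not the actual implementation): flatten a schema
// into (path, field) pairs for its leaf fields. Structs are recursed into;
// Map and Array stop the recursion because only their null count is needed.
def leafFields(
    schema: StructType,
    prefix: Seq[String] = Nil): Seq[(Seq[String], StructField)] =
  schema.fields.toSeq.flatMap { field =>
    val path = prefix :+ field.name
    field.dataType match {
      case struct: StructType => leafFields(struct, path)
      case _ => Seq(path -> field) // primitives, Map, and Array are leaves
    }
  }
```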
- final def synchronized[T0](arg0: => T0): T0
- Definition Classes
- AnyRef
- def toString(): String
- Definition Classes
- AnyRef → Any
- final def wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException])
- final def wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws(classOf[java.lang.InterruptedException]) @native()
- object StatsCollectionMetrics extends Serializable