abstract class StatsCollector extends Serializable
A helper class that collects stats from the parquet data files of a Delta table, or of an equivalent table (one that can be converted into a Delta table, such as a Parquet or Iceberg table).
Linear Supertypes: Serializable, Serializable, AnyRef, Any
Instance Constructors
-
new
StatsCollector(dataSchema: StructType, statsSchema: StructType, parquetRebaseMode: String, ignoreMissingStats: Boolean, stringTruncateLength: Option[Int])
- dataSchema
The data schema from the table metadata, i.e. the logical schema with a logical-to-physical mapping per schema field. It is used to map statsSchema onto the parquet metadata.
- statsSchema
The schema of the stats to be collected. statsSchema should follow the physical schema and must be generated by StatisticsCollection.
- parquetRebaseMode
The parquet rebase mode used to parse dates and timestamps.
- ignoreMissingStats
Indicates whether to return a partial result by ignoring missing stats, or to throw an exception instead.
- stringTruncateLength
The optional maximum length to which string stats are truncated. Scala example:
import org.apache.spark.sql.delta.stats.StatsCollector

val stringTruncateLength =
  spark.sessionState.conf.getConf(DeltaSQLConf.DATA_SKIPPING_STRING_PREFIX_LENGTH)
val statsCollector = StatsCollector(
  snapshot.metadata.columnMappingMode,
  snapshot.metadata.dataSchema,
  snapshot.statsSchema,
  ignoreMissingStats = false,
  Some(stringTruncateLength))
val filesWithStats = snapshot.allFiles.map { file =>
  val path = DeltaFileOperations.absolutePath(dataPath, file.path)
  val fileSystem = path.getFileSystem(hadoopConf)
  val fileStatus = fileSystem.listStatus(path).head
  val footer = ParquetFileReader.readFooter(hadoopConf, fileStatus)
  val (stats, _) = statsCollector.collect(footer)
  file.copy(stats = JsonUtils.toJson(stats))
}
Type Members
-
case class
StatsCollectionMetrics(numMissingMax: Long, numMissingMin: Long, numMissingNullCount: Long, numMissingTypes: Long) extends Product with Serializable
Used to report the number of missing fields per supported stats type, and the number of requested but unsupported types. Statistics collection currently supports 4 types of stats: NUM_RECORDS, MAX, MIN, NULL_COUNT.
- numMissingMax
The number of missing fields for MAX
- numMissingMin
The number of missing fields for MIN
- numMissingNullCount
The number of missing fields for NULL_COUNT
- numMissingTypes
The number of unsupported stats types being requested.
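For illustration, a sketch of reacting to the reported metrics after a collection pass. It assumes a `statsCollector` instance and a parquet `footer` are already in scope (as in the constructor example above); the `logWarning` call stands in for whatever logging the caller uses.

```scala
// Sketch: inspect the metrics returned alongside the stats.
val (stats, metrics) = statsCollector.collect(footer)
val totalMissing =
  metrics.numMissingMax + metrics.numMissingMin + metrics.numMissingNullCount
if (metrics.numMissingTypes > 0 || totalMissing > 0) {
  // With ignoreMissingStats = true, collect() returns a partial result,
  // so callers may want to record how incomplete it is.
  logWarning(s"Collected partial stats: $totalMissing missing fields, " +
    s"${metrics.numMissingTypes} unsupported types requested")
}
```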
Abstract Value Members
-
abstract
def
getSchemaPhysicalPathToParquetIndex(blockMetaData: BlockMetaData): Map[Seq[String], Int]
Returns the map from schema physical field path (the field for which to collect stats) to the parquet metadata column index (where to collect stats). Since statsSchema generated by StatisticsCollection always uses physical field paths, these paths are the same as the ones used in statsSchema. A child class must implement this method based on the Delta column mapping mode.
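As a hypothetical sketch (not the actual implementation), a child class for a table without column mapping could build this map directly from the parquet column paths, since in that mode the physical field path is simply the parquet column path. The function name below is illustrative only.

```scala
import scala.collection.JavaConverters._
import org.apache.parquet.hadoop.metadata.BlockMetaData

// With no column mapping, the parquet column path IS the physical
// field path, so the map is a straightforward zip with the index.
def physicalPathToParquetIndex(block: BlockMetaData): Map[Seq[String], Int] =
  block.getColumns.asScala.zipWithIndex.map { case (col, idx) =>
    col.getPath.toArray.toSeq -> idx
  }.toMap
```

A column-mapping-aware subclass would instead translate each logical path through the field metadata before looking up the parquet index.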
Concrete Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
- final val NUM_MISSING_TYPES: String("numMissingTypes")
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
def
clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
final
def
collect(parquetMetadata: ParquetMetadata): (Map[String, Any], StatsCollectionMetrics)
Collects the stats from ParquetMetadata.
- parquetMetadata
The metadata of a parquet file following the physical schema; it contains the statistics of its row groups.
- returns
A nested Map[String, Any] from requested stats field names to their stats values, together with a StatsCollectionMetrics counting the number of missing fields/types.
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
lazy val
schemaPhysicalPathAndSchemaField: Seq[(Seq[String], StructField)]
A list of (schema physical path, struct field) pairs for the leaf fields. Besides primitive types, Map and Array fields (rather than their sub-columns) are also treated as leaf fields, since we only compute their null count, and nulls are counted on the fields themselves instead of on their sub-fields.
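The leaf-field rule above can be sketched as follows. This is an illustrative recursion (not the actual implementation, which also applies the physical-name mapping): structs are descended into, while Map and Array stop the recursion because only their null count is collected.

```scala
import org.apache.spark.sql.types._

// Enumerate leaf fields with their paths: recurse into StructType,
// treat everything else (primitives, MapType, ArrayType) as a leaf.
def leafFields(
    schema: StructType,
    prefix: Seq[String] = Nil): Seq[(Seq[String], StructField)] =
  schema.fields.toSeq.flatMap { field =>
    val path = prefix :+ field.name
    field.dataType match {
      case s: StructType => leafFields(s, path) // descend into structs
      case _ => Seq(path -> field)              // primitives, Map, Array
    }
  }
```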
- Attributes
- protected
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
def
toString(): String
- Definition Classes
- AnyRef → Any
-
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
- object StatsCollectionMetrics extends Serializable