Package io.delta.kernel.internal.util

Class PartitionUtils

java.lang.Object
    io.delta.kernel.internal.util.PartitionUtils
-
Method Summary
static String getTargetDirectory(String dataRoot, List<String> partitionColNames, Map<String, Literal> partitionValues)
    Get the target directory for writing data for given partition values.

static StructType physicalSchemaWithoutPartitionColumns(StructType logicalSchema, StructType physicalSchema, Set<String> columnsToRemove)
    Utility method to remove the given columns (as columnsToRemove) from the given physicalSchema.

static Predicate rewritePartitionPredicateOnCheckpointFileSchema(Predicate predicate, Map<String, StructField> partitionColNameToField)
    Rewrite the given predicate on partition columns as a predicate on `partitionValues_parsed` in the checkpoint schema.

static Predicate rewritePartitionPredicateOnScanFileSchema(Predicate predicate, Map<String, StructField> partitionColMetadata)
    Utility method to rewrite the partition predicate referring to the table schema as a predicate referring to the partitionValues in scan files read from the Delta log.

static MapValue serializePartitionMap(Map<String, Literal> partitionValueMap)
    Convert the given partition values to a MapValue that can be serialized to a Delta commit file.

static Tuple2<Predicate,Predicate> splitMetadataAndDataPredicates(Predicate predicate, Set<String> partitionColNames)
    Split the given predicate into a predicate on partition columns and a predicate on data columns.

static Map<String,Literal> validateAndSanitizePartitionValues(StructType tableSchema, List<String> partitionColNames, Map<String, Literal> partitionValues)
    Validate that partitionValues contains a value for every partition column in the table and that each value has the correct type.

static ColumnarBatch withPartitionColumns(ExpressionHandler expressionHandler, ColumnarBatch dataBatch, Map<String, String> partitionValues, StructType schemaWithPartitionCols)
-
Method Details
-
physicalSchemaWithoutPartitionColumns
public static StructType physicalSchemaWithoutPartitionColumns(StructType logicalSchema, StructType physicalSchema, Set<String> columnsToRemove)

Utility method to remove the given columns (as columnsToRemove) from the given physicalSchema.

Parameters:
logicalSchema - Used to create a logical name to physical name map; partition column names are in the logical space, and we need to identify the equivalent physical column names.
physicalSchema - Physical schema to remove the columns from.
columnsToRemove - Logical names of the columns to remove.
Returns:
The physicalSchema with the given columns removed.
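The removal logic can be sketched with plain maps standing in for Kernel's StructType. The class and method names here are hypothetical, and the positional pairing of logical to physical fields is an assumption made for illustration (column mapping renames fields but keeps their order):

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a schema is modeled as an ordered map of name -> type.
public class SchemaPruneSketch {

    // Assumes the i-th logical field corresponds to the i-th physical field.
    public static Map<String, String> withoutColumns(
            Map<String, String> logicalSchema,   // logical name -> type
            Map<String, String> physicalSchema,  // physical name -> type
            Set<String> columnsToRemove) {       // logical names to drop
        String[] logicalNames = logicalSchema.keySet().toArray(new String[0]);
        String[] physicalNames = physicalSchema.keySet().toArray(new String[0]);
        // Translate the logical names to remove into physical names by position.
        Set<String> physicalToRemove = new HashSet<>();
        for (int i = 0; i < logicalNames.length; i++) {
            if (columnsToRemove.contains(logicalNames[i])) {
                physicalToRemove.add(physicalNames[i]);
            }
        }
        Map<String, String> pruned = new LinkedHashMap<>(physicalSchema);
        pruned.keySet().removeAll(physicalToRemove);
        return pruned;
    }
}
```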
-
withPartitionColumns
public static ColumnarBatch withPartitionColumns(ExpressionHandler expressionHandler, ColumnarBatch dataBatch, Map<String, String> partitionValues, StructType schemaWithPartitionCols)
-
serializePartitionMap
public static MapValue serializePartitionMap(Map<String, Literal> partitionValueMap)

Convert the given partition values to a MapValue that can be serialized to a Delta commit file.

Parameters:
partitionValueMap - Partition values keyed by column name. The column names are expected to have the same case as in the schema, since we want to preserve that case when serializing to the Delta commit file.
Returns:
MapValue representing the serialized partition values that can be written to a Delta commit file.
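A minimal sketch of the serialization step, with plain Object values standing in for Kernel's Literal and a plain string map standing in for MapValue. The class name is hypothetical; the null handling and string form follow the Delta protocol's convention for `partitionValues` entries:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of converting partition values to the string map
// stored in an AddFile action's partitionValues field.
public class PartitionSerializeSketch {

    public static Map<String, String> serialize(Map<String, Object> partitionValues) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : partitionValues.entrySet()) {
            // A null partition value stays null; everything else is stored
            // in its string serialization form.
            out.put(e.getKey(), e.getValue() == null ? null : e.getValue().toString());
        }
        return out;
    }
}
```

Note the keys are copied through unchanged, which is why the caller must supply column names already in the schema's case.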
-
validateAndSanitizePartitionValues
public static Map<String,Literal> validateAndSanitizePartitionValues(StructType tableSchema, List<String> partitionColNames, Map<String, Literal> partitionValues)

Validate that partitionValues contains a value for every partition column in the table and that each value has the correct type. Once validated, the partition values are sanitized to match the case of the partition column names in the table schema and returned.

Parameters:
tableSchema - Schema of the table.
partitionColNames - Partition column names. These should come from the table metadata, which retains the same case as in the table schema.
partitionValues - Map of partition column to value, given by the connector.
Returns:
Sanitized partition values.
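The case-sanitizing step can be sketched as follows. The class name is hypothetical and plain Object stands in for Literal; the point is re-keying the connector's map to the schema's case via a case-insensitive lookup:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: validate presence of every partition value and
// re-key the result map to the table schema's column-name case.
public class PartitionValidateSketch {

    public static Map<String, Object> validateAndSanitize(
            List<String> partitionColNames, Map<String, Object> partitionValues) {
        if (partitionValues.size() != partitionColNames.size()) {
            throw new IllegalArgumentException("Each partition column needs exactly one value");
        }
        // Index the connector-provided keys case-insensitively.
        Map<String, String> lowerToGiven = new HashMap<>();
        for (String key : partitionValues.keySet()) {
            lowerToGiven.put(key.toLowerCase(Locale.ROOT), key);
        }
        Map<String, Object> sanitized = new LinkedHashMap<>();
        for (String col : partitionColNames) {
            String given = lowerToGiven.get(col.toLowerCase(Locale.ROOT));
            if (given == null) {
                throw new IllegalArgumentException("Missing value for partition column: " + col);
            }
            // Output key uses the schema's case, regardless of the input case.
            sanitized.put(col, partitionValues.get(given));
        }
        return sanitized;
    }
}
```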
-
splitMetadataAndDataPredicates
public static Tuple2<Predicate,Predicate> splitMetadataAndDataPredicates(Predicate predicate, Set<String> partitionColNames)

Split the given predicate into a predicate on partition columns and a predicate on data columns.

Parameters:
predicate - Predicate to split.
partitionColNames - Partition column names.
Returns:
Tuple of partition column predicate and data column predicate.
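A sketch of the split, under the simplifying assumption that the predicate is a conjunction of leaves each touching a single column (the real implementation walks a Predicate expression tree; all names here are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: model an AND-ed predicate as column -> condition
// leaves, and partition them into (metadata predicate, data predicate).
public class PredicateSplitSketch {

    public static List<Map<String, String>> split(
            Map<String, String> leaves, Set<String> partitionColNames) {
        Map<String, String> metadata = new LinkedHashMap<>();
        Map<String, String> data = new LinkedHashMap<>();
        leaves.forEach((col, cond) -> {
            // Leaves on partition columns can be evaluated against file
            // metadata alone; everything else needs the data itself.
            if (partitionColNames.contains(col)) {
                metadata.put(col, cond);
            } else {
                data.put(col, cond);
            }
        });
        return List.of(metadata, data);
    }
}
```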
-
rewritePartitionPredicateOnCheckpointFileSchema
public static Predicate rewritePartitionPredicateOnCheckpointFileSchema(Predicate predicate, Map<String, StructField> partitionColNameToField)

Rewrite the given predicate on partition columns as a predicate on `partitionValues_parsed` in the checkpoint schema. The rewritten predicate can be pushed to the Parquet reader when reading the checkpoint files.

Parameters:
predicate - Predicate on partition columns.
partitionColNameToField - Map of partition column name (in lower case) to its StructField.
Returns:
Rewritten Predicate on `partitionValues_parsed` in `add`.
-
rewritePartitionPredicateOnScanFileSchema
public static Predicate rewritePartitionPredicateOnScanFileSchema(Predicate predicate, Map<String, StructField> partitionColMetadata)

Utility method to rewrite the partition predicate referring to the table schema as a predicate referring to the partitionValues in scan files read from the Delta log. The scan file batch is returned by Scan.getScanFiles(Engine).

E.g. given a predicate on partition columns:

    p1 = 'new york' && p2 >= 26

where p1 is of type string and p2 is of type int, the rewritten expression looks like:

    element_at(Column('add', 'partitionValues'), 'p1') = 'new york' && partition_value(element_at(Column('add', 'partitionValues'), 'p2'), 'integer') >= 26

The column `add.partitionValues` is of map(string -> string) type. Each partition value is in string serialization format according to the Delta protocol. The expression `partition_value` deserializes the string value into the given partition column type value. String type partition values don't need any deserialization.

Parameters:
predicate - Predicate containing filters only on partition columns.
partitionColMetadata - Map of partition column name (in lower case) to its type.
Returns:
Rewritten Predicate referring to the partitionValues in scan files.
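The rewrite of a single comparison leaf from the example above can be sketched as a string transformation (the real method builds Predicate expression trees; the class name here is hypothetical):

```java
// Hypothetical sketch: rewrite one comparison on a partition column into a
// lookup on add.partitionValues, wrapping non-string columns in
// partition_value(...) to deserialize the protocol's string form.
public class PredicateRewriteSketch {

    public static String rewriteLeaf(String colName, String colType, String op, String literal) {
        String lookup = "element_at(Column('add', 'partitionValues'), '" + colName + "')";
        // String-typed partition values need no deserialization.
        String lhs = colType.equals("string")
                ? lookup
                : "partition_value(" + lookup + ", '" + colType + "')";
        return lhs + " " + op + " " + literal;
    }
}
```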
-
getTargetDirectory
public static String getTargetDirectory(String dataRoot, List<String> partitionColNames, Map<String, Literal> partitionValues)

Get the target directory for writing data for the given partition values. Example: given partition values (part1=1, part2='abc') and a table rooted at 's3://bucket/table', the target directory will be 's3://bucket/table/part1=1/part2=abc'.

Parameters:
dataRoot - Root directory where the data is stored.
partitionColNames - Partition column names. We need these to create a target directory structure with consistent levels of directories.
partitionValues - Partition values to create the target directory.
Returns:
Target directory path.
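The directory construction can be sketched as below, with string values standing in for Literal. The class name is hypothetical, and this sketch deliberately omits the escaping of special characters in partition values that a real implementation must perform:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of building the Hive-style partition directory path.
public class TargetDirSketch {

    public static String targetDirectory(
            String dataRoot, List<String> partitionColNames, Map<String, String> partitionValues) {
        StringBuilder path = new StringBuilder(dataRoot);
        // Iterate the declared partition columns, not the value map, so the
        // directory levels come out in a consistent order for every file.
        for (String col : partitionColNames) {
            path.append('/').append(col).append('=').append(partitionValues.get(col));
        }
        return path.toString();
    }
}
```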
-