Packages

package physical


Type Members

  1. case class BroadcastDistribution(mode: BroadcastMode) extends Distribution with Product with Serializable

    Represents data where tuples are broadcasted to every node. It is quite common that the entire set of tuples is transformed into a different data structure.

  2. trait BroadcastMode extends AnyRef

    Marker trait to identify the shape in which tuples are broadcasted. Typical examples of this are identity (tuples remain unchanged) or hashed (tuples are converted into some hash index).

  3. case class BroadcastPartitioning(mode: BroadcastMode) extends Partitioning with Product with Serializable

    Represents a partitioning where rows are collected, transformed and broadcasted to each node in the cluster.

  4. case class ClusteredDistribution(clustering: Seq[Expression], requireAllClusterKeys: Boolean = ..., requiredNumPartitions: Option[Int] = None) extends Distribution with Product with Serializable

    Represents data where tuples that share the same values for the clustering Expressions will be co-located in the same partition.

    requireAllClusterKeys

    When true, a Partitioning that satisfies this distribution must match all clustering expressions in the same order.
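    The co-location guarantee above can be sketched in plain Scala. This is illustrative only; Row, partitionOf, and clusterRows are hypothetical names, not Spark's internal API:

```scala
// Illustrative sketch of ClusteredDistribution's guarantee: all rows sharing
// the same clustering-key value land in the same partition.
case class Row(key: String, value: Int)

// Deterministic mapping from a clustering key to a partition id.
def partitionOf(key: String, numPartitions: Int): Int =
  math.floorMod(key.hashCode, numPartitions)

// Partition a dataset; rows with equal keys are necessarily co-located,
// because partitionOf is a pure function of the key.
def clusterRows(rows: Seq[Row], numPartitions: Int): Map[Int, Seq[Row]] =
  rows.groupBy(r => partitionOf(r.key, numPartitions))
```

    Because the partition id depends only on the key, operators such as Aggregate can then work partition-locally on each group.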

  5. sealed trait Distribution extends AnyRef

    Specifies how tuples that share common expressions will be distributed when a query is executed in parallel on many machines.

    Distribution here refers to inter-node partitioning of data. That is, it describes how tuples are partitioned across physical machines in a cluster. Knowing this property allows some operators (e.g., Aggregate) to perform partition local operations instead of global ones.

  6. case class HashPartitioning(expressions: Seq[Expression], numPartitions: Int) extends Expression with Partitioning with Unevaluable with Product with Serializable

    Represents a partitioning where rows are split up across partitions based on the hash of expressions. All rows where expressions evaluate to the same values are guaranteed to be in the same partition.

    Since StatefulOpClusteredDistribution relies on this partitioning and Spark requires stateful operators to retain the same physical partitioning during the lifetime of the query (including restarts), the result of evaluating partitionIdExpression must be unchanged across Spark versions. Violating this requirement may cause silent correctness issues.
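    The core idea can be sketched as follows. This is a simplification, not Spark's Murmur3-based implementation; hashPartitionId is a hypothetical name:

```scala
// Illustrative sketch of hash partitioning: the partition id is a pure
// function of the evaluated partitioning-expression values, so rows with
// equal values always co-locate. The stability requirement above means this
// function's output must never change across Spark versions.
def hashPartitionId(exprValues: Seq[Any], numPartitions: Int): Int =
  math.floorMod(exprValues.hashCode, numPartitions)
```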

  7. case class HashShuffleSpec(partitioning: HashPartitioning, distribution: ClusteredDistribution) extends ShuffleSpec with Product with Serializable
  8. case class KeyGroupedPartitioning(expressions: Seq[Expression], numPartitions: Int, partitionValuesOpt: Option[Seq[InternalRow]] = None) extends Partitioning with Product with Serializable

    Represents a partitioning where rows are split across partitions based on transforms defined by expressions. partitionValuesOpt, if defined, should contain the value of the partition key(s), in ascending order after being evaluated by the transforms in expressions, for each input partition. In addition, its length must be the same as the number of input partitions (and thus is a 1-1 mapping), and each row in partitionValuesOpt must be unique.

    For example, if expressions is [years(ts_col)], then a valid value of partitionValuesOpt is [0, 1, 2], which represents 3 input partitions with distinct partition values. All rows in each partition have the same value for column ts_col (which is of timestamp type), after the years transform is applied.

    On the other hand, [0, 0, 1] is not a valid value for partitionValuesOpt since 0 appears twice.

    expressions

    partition expressions for the partitioning.

    numPartitions

    the number of partitions

    partitionValuesOpt

    if set, the values for the cluster keys of the distribution; must be in ascending order.
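    The validity rules above (one value per input partition, ascending and unique, hence strictly increasing) can be checked with a small sketch. isValidPartitionValues is a hypothetical helper, not Spark's API, and partition values are simplified to Int:

```scala
// Illustrative check of the rules for partitionValuesOpt: its length must
// equal the number of input partitions, and the values must be strictly
// increasing (ascending + unique).
def isValidPartitionValues(values: Seq[Int], numPartitions: Int): Boolean =
  values.length == numPartitions &&
    values.zip(values.drop(1)).forall { case (a, b) => a < b }
```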

  9. case class KeyGroupedShuffleSpec(partitioning: KeyGroupedPartitioning, distribution: ClusteredDistribution) extends ShuffleSpec with Product with Serializable
  10. case class OrderedDistribution(ordering: Seq[SortOrder]) extends Distribution with Product with Serializable

    Represents data where tuples have been ordered according to the ordering Expressions. Its requirement is defined as the following:

    • Given any 2 adjacent partitions, all the rows of the second partition must be larger than or equal to any row in the first partition, according to the ordering expressions.

    In other words, this distribution requires the rows to be ordered across partitions, but not necessarily within a partition.
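    The requirement can be stated as a predicate: the minimum of each partition must be at least the maximum of the preceding one. A sketch, with satisfiesOrdered as a hypothetical helper (not Spark's API) and rows simplified to Int:

```scala
// Illustrative check of OrderedDistribution's requirement: every row of a
// partition is >= every row of the preceding partition, while rows within a
// partition may be unsorted.
def satisfiesOrdered(partitions: Seq[Seq[Int]]): Boolean =
  partitions.zip(partitions.drop(1)).forall { case (first, second) =>
    first.isEmpty || second.isEmpty || second.min >= first.max
  }
```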

  11. trait Partitioning extends AnyRef

    Describes how an operator's output is split across partitions. It has 2 major properties:

    1. the number of partitions;
    2. whether it can satisfy a given distribution.

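
    The two properties can be sketched as a minimal trait. The names mirror, but are not identical to, Spark's internal traits:

```scala
// Minimal sketch of Partitioning's two properties: a partition count and a
// predicate deciding whether a required Distribution is satisfied.
sealed trait Distribution
case object AllTuples extends Distribution

trait Partitioning {
  def numPartitions: Int                          // property 1
  def satisfies(required: Distribution): Boolean  // property 2
}

// A single partition trivially co-locates all tuples of the dataset.
case object SinglePartition extends Partitioning {
  val numPartitions: Int = 1
  def satisfies(required: Distribution): Boolean = required match {
    case AllTuples => true
  }
}
```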
  12. case class PartitioningCollection(partitionings: Seq[Partitioning]) extends Expression with Partitioning with Unevaluable with Product with Serializable

    A collection of Partitionings that can be used to describe the partitioning scheme of the output of a physical operator. It is usually used for an operator that has multiple children. In this case, a Partitioning in this collection describes how this operator's output is partitioned based on expressions from a child. For example, for a Join operator on two tables A and B with a join condition A.key1 = B.key2, assuming we use the HashPartitioning scheme, two Partitionings can be used to describe how the output of this Join operator is partitioned: HashPartitioning(A.key1) and HashPartitioning(B.key2). It is also worth noting that the partitionings in this collection do not need to be equivalent, which is useful for Outer Join operators.

  13. case class RangePartitioning(ordering: Seq[SortOrder], numPartitions: Int) extends Expression with Partitioning with Unevaluable with Product with Serializable

    Represents a partitioning where rows are split across partitions based on some total ordering of the expressions specified in ordering. When data is partitioned in this manner, it guarantees that, given any 2 adjacent partitions, all the rows of the second partition must be larger than any row in the first partition, according to the ordering expressions.

    This is a strictly stronger guarantee than what OrderedDistribution(ordering) requires, as there is no overlap between partitions.

    This class extends expression primarily so that transformations over expression will descend into its child.
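    Range partitioning assigns each row to one of a set of non-overlapping ranges. A sketch under simplifying assumptions (keys are Int, bounds are precomputed; rangePartition is a hypothetical helper, not Spark's API):

```scala
// Illustrative range partitioning: `bounds` are upper bounds splitting the
// key space into bounds.length + 1 non-overlapping ranges; a key goes to the
// first range whose bound exceeds it. Because ranges do not overlap, this is
// strictly stronger than OrderedDistribution's guarantee.
def rangePartition(key: Int, bounds: Seq[Int]): Int = {
  val i = bounds.indexWhere(key < _)
  if (i == -1) bounds.length else i  // past the last bound -> last partition
}
```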

  14. case class RangeShuffleSpec(numPartitions: Int, distribution: ClusteredDistribution) extends ShuffleSpec with Product with Serializable
  15. case class RoundRobinPartitioning(numPartitions: Int) extends Partitioning with Product with Serializable

    Represents a partitioning where rows are distributed evenly across output partitions by starting from a random target partition number and distributing rows in a round-robin fashion. This partitioning is used when implementing the DataFrame.repartition() operator.
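    A sketch of the assignment scheme (roundRobinAssign is a hypothetical helper, not Spark's API; the random starting partition is passed in explicitly here):

```scala
// Illustrative round-robin assignment: rows are dealt out cyclically starting
// from a (possibly random) partition, giving an even spread regardless of
// row content.
def roundRobinAssign(numRows: Int, numPartitions: Int, start: Int): Seq[Int] =
  (0 until numRows).map(i => (start + i) % numPartitions)
```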

  16. trait ShuffleSpec extends AnyRef

    This is used in the scenario where an operator has multiple children (e.g., a join) and one or more of them have their own requirements regarding whether their data can be considered co-partitioned with the others. It offers APIs for:

    • Comparing with specs from other children of the operator and checking whether they are compatible. When two specs are compatible, we can say their data are co-partitioned, and Spark will potentially be able to eliminate a shuffle.
    • Creating a partitioning that can be used to re-partition another child, so that it has a partitioning compatible with this node.
  17. case class ShuffleSpecCollection(specs: Seq[ShuffleSpec]) extends ShuffleSpec with Product with Serializable
  18. case class StatefulOpClusteredDistribution(expressions: Seq[Expression], _requiredNumPartitions: Int) extends Distribution with Product with Serializable

    Represents the requirement of distribution on the stateful operator in Structured Streaming.

    Each partition in a stateful operator initializes state store(s), which are independent of the state store(s) in other partitions. Since it is not possible to repartition the data in a state store, Spark should make sure the physical partitioning of the stateful operator is unchanged across Spark versions. Violating this requirement may cause silent correctness issues.

    Since this distribution relies on HashPartitioning for the physical partitioning of the stateful operator, only HashPartitioning (and HashPartitioning inside a PartitioningCollection) can satisfy this distribution. When _requiredNumPartitions is 1, SinglePartition is essentially the same as HashPartitioning, so it can satisfy this distribution as well.

    NOTE: This is applied only to stream-stream join as of now. For other stateful operators, we have been using ClusteredDistribution, which could construct the physical partitioning of the state in a different way (ClusteredDistribution requires a relaxed condition, and multiple partitionings can satisfy the requirement). We need to construct a way to fix this while minimizing the possibility of breaking existing checkpoints.

    TODO(SPARK-38204): address the issue explained in above note.

  19. case class UnknownPartitioning(numPartitions: Int) extends Partitioning with Product with Serializable

Value Members

  1. object AllTuples extends Distribution with Product with Serializable

    Represents a distribution that only has a single partition and all tuples of the dataset are co-located.

  2. object IdentityBroadcastMode extends BroadcastMode with Product with Serializable

    IdentityBroadcastMode requires that rows are broadcasted in their original form.

  3. object KeyGroupedPartitioning extends Serializable
  4. object SinglePartition extends Partitioning with Product with Serializable
  5. object SinglePartitionShuffleSpec extends ShuffleSpec with Product with Serializable
  6. object UnspecifiedDistribution extends Distribution with Product with Serializable

    Represents a distribution where no promises are made about co-location of data.
