package read
Package Members
- package partitioning
- package streaming
Type Members
- trait Batch extends AnyRef
A physical representation of a data source scan for batch queries. This interface is used to provide physical information, like how many partitions the scanned data has, and how to read records from the partitions.
- Annotations
- @Evolving()
- Since
3.0.0
- trait InputPartition extends Serializable
A serializable representation of an input partition returned by Batch#planInputPartitions() and the corresponding ones in streaming.
Note that InputPartition will be serialized and sent to executors, then PartitionReader will be created by PartitionReaderFactory#createReader(InputPartition) or PartitionReaderFactory#createColumnarReader(InputPartition) on executors to do the actual reading. So InputPartition must be serializable while PartitionReader doesn't need to be.
- Annotations
- @Evolving()
- Since
3.0.0
- trait LocalScan extends Scan
A special Scan which will be executed locally on the driver instead of on executors.
- Annotations
- @Experimental()
- Since
3.2.0
- trait PartitionReader[T] extends Closeable
A partition reader returned by PartitionReaderFactory#createReader(InputPartition) or PartitionReaderFactory#createColumnarReader(InputPartition). It's responsible for outputting data for an RDD partition.
Note that currently the type T can only be org.apache.spark.sql.catalyst.InternalRow for normal data sources, or org.apache.spark.sql.vectorized.ColumnarBatch for columnar data sources (whose PartitionReaderFactory#supportColumnarReads(InputPartition) returns true).
- Annotations
- @Evolving()
- Since
3.0.0
- trait PartitionReaderFactory extends Serializable
A factory used to create PartitionReader instances.
If Spark fails to execute any method in an implementation of this interface or in the returned PartitionReader (by throwing an exception), the corresponding Spark task will fail and be retried until the maximum number of retries is reached.
- Annotations
- @Evolving()
- Since
3.0.0
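The division of labor among Batch, InputPartition, PartitionReader and PartitionReaderFactory can be sketched end to end. The traits below are simplified stand-ins for the real org.apache.spark.sql.connector.read interfaces (which involve InternalRow, columnar reads and a Spark runtime), and the RangePartition/RangeBatch source is purely hypothetical; only the shape of the contract is kept: the driver plans serializable partitions, and readers are created per partition on the executor side by the factory.

```scala
// Simplified stand-ins for the real interfaces (illustration only).
trait InputPartition extends Serializable
trait PartitionReader[T] extends AutoCloseable {
  def next(): Boolean // advance to the next record, false when exhausted
  def get(): T        // return the current record
}
trait PartitionReaderFactory extends Serializable {
  def createReader(partition: InputPartition): PartitionReader[Int]
}
trait Batch {
  def planInputPartitions(): Array[InputPartition]
  def createReaderFactory(): PartitionReaderFactory
}

// A hypothetical source scanning [0, max), split into range partitions.
// Only the InputPartition travels to executors, so it carries just the
// bounds; the (non-serializable) reader is created executor-side.
case class RangePartition(start: Int, end: Int) extends InputPartition

class RangeBatch(max: Int, numPartitions: Int) extends Batch {
  def planInputPartitions(): Array[InputPartition] = {
    val step = math.max(1, max / numPartitions)
    (0 until max by step)
      .map(s => RangePartition(s, math.min(s + step, max)): InputPartition)
      .toArray
  }
  def createReaderFactory(): PartitionReaderFactory = new PartitionReaderFactory {
    def createReader(p: InputPartition): PartitionReader[Int] = p match {
      case RangePartition(start, end) =>
        new PartitionReader[Int] {
          private var current = start - 1
          def next(): Boolean = { current += 1; current < end }
          def get(): Int = current
          def close(): Unit = ()
        }
    }
  }
}

// Driver side plans partitions; "executors" then read each one.
val batch = new RangeBatch(max = 10, numPartitions = 3)
val factory = batch.createReaderFactory()
val rows = batch.planInputPartitions().toSeq.flatMap { p =>
  val reader = factory.createReader(p)
  try Iterator.continually(reader).takeWhile(_.next()).map(_.get()).toList
  finally reader.close()
}
```

The next()/get()/close() loop at the end mirrors how Spark drives a PartitionReader for one RDD partition.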
- trait Scan extends AnyRef
A logical representation of a data source scan. This interface is used to provide logical information, like what the actual read schema is.
This logical representation is shared between batch scan, micro-batch streaming scan and continuous streaming scan. Data sources must implement the corresponding methods in this interface to match what the table promises to support. For example, #toBatch() must be implemented if the Table that creates this Scan returns TableCapability#BATCH_READ support in its Table#capabilities().
- Annotations
- @Evolving()
- Since
3.0.0
- trait ScanBuilder extends AnyRef
An interface for building the Scan. Implementations can mix in SupportsPushDownXYZ interfaces to do operator pushdown, and keep the operator pushdown result in the returned Scan. When pushing down operators, Spark pushes down filters first, then pushes down aggregates or applies column pruning.
- Annotations
- @Evolving()
- Since
3.0.0
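The builder pattern described above can be sketched with simplified stand-ins. The real SupportsPushDownXYZ mix-ins work with Spark's Filter and StructType types; the MyScanBuilder below is a hypothetical builder using plain strings, kept only to show how pushed-down state accumulates in the builder and is baked into the returned Scan.

```scala
// Simplified stand-ins for Scan and ScanBuilder (illustration only).
trait Scan { def description(): String }
trait ScanBuilder { def build(): Scan }

// A hypothetical builder that records pushdown state, mirroring the
// SupportsPushDownXYZ mix-in pattern; Spark calls pushFilters before
// column pruning, matching the order described above.
class MyScanBuilder extends ScanBuilder {
  private var pushedFilters: Seq[String] = Nil
  private var requiredColumns: Seq[String] = Nil

  // Returns the filters Spark still has to evaluate itself; Nil means
  // the source handles all of them.
  def pushFilters(filters: Seq[String]): Seq[String] = {
    pushedFilters = filters
    Nil
  }
  def pruneColumns(columns: Seq[String]): Unit = requiredColumns = columns

  // The pushdown result is kept in the returned Scan.
  def build(): Scan = new Scan {
    def description(): String =
      s"scan(cols=${requiredColumns.mkString(",")}, filters=${pushedFilters.mkString(",")})"
  }
}

val builder = new MyScanBuilder
builder.pushFilters(Seq("key > 5"))
builder.pruneColumns(Seq("key", "value"))
val scan = builder.build()
```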
- trait Statistics extends AnyRef
An interface to represent statistics for a data source, which is returned by SupportsReportStatistics#estimateStatistics().
- Annotations
- @Evolving()
- Since
3.0.0
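Both estimates in the real Statistics interface are java.util.OptionalLong, so a source can report only what it knows. A minimal sketch, using a stand-in trait with the same method shapes and a hypothetical file-based source that knows its byte size but not its row count:

```scala
import java.util.OptionalLong

// Stand-in for the Statistics interface (illustration only): both
// estimates are optional, so unknown values are simply left empty.
trait Statistics {
  def sizeInBytes(): OptionalLong
  def numRows(): OptionalLong
}

// A hypothetical file-based source: total bytes are cheap to compute
// from file metadata, but the row count would require a full scan.
class FileStatistics(totalBytes: Long) extends Statistics {
  def sizeInBytes(): OptionalLong = OptionalLong.of(totalBytes)
  def numRows(): OptionalLong = OptionalLong.empty()
}

val stats = new FileStatistics(1024L)
```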
- trait SupportsPushDownAggregates extends ScanBuilder
A mix-in interface for ScanBuilder. Data sources can implement this interface to push down aggregates.
Spark assumes that the data source can't fully complete the grouping work, and will group the data source output again. For queries like "SELECT min(value) AS m FROM t GROUP BY key", after pushing down the aggregate to the data source, the data source can still output data with duplicated keys, which is OK as Spark will do GROUP BY key again. The final query plan can be something like this:
Aggregate [key#1], [min(min(value)#2) AS m#3]
+- RelationV2[key#1, min(value)#2]
Similarly, if there is no grouping expression, the data source can still output more than one row.
When pushing down operators, Spark pushes down filters to the data source first, then pushes down aggregates or applies column pruning. Depending on the data source implementation, aggregates may or may not be pushed down together with filters. If pushed filters still need to be evaluated after scanning, aggregates can't be pushed down.
- Annotations
- @Evolving()
- Since
3.2.0
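The re-grouping step can be illustrated without any Spark APIs. Assume a hypothetical source with two partitions whose pushed-down "min(value) GROUP BY key" output still contains duplicated keys across partitions; the final Aggregate then takes the min of the partial mins, exactly as in the plan above:

```scala
// Partial per-partition results of: SELECT min(value) AS m FROM t GROUP BY key.
// Each pair is (key, partial min(value)); the data is hypothetical.
val partition1 = Seq(("a", 3), ("b", 7))
val partition2 = Seq(("a", 1), ("b", 9)) // keys "a" and "b" appear again

// The source output has duplicated keys, which is fine: the final
// Aggregate [key], [min(min(value))] groups the partial results again.
val finalResult = (partition1 ++ partition2)
  .groupBy { case (key, _) => key }
  .map { case (key, rows) => key -> rows.map(_._2).min }
```

Because min of partial mins equals the global min, the pushed-down partial result stays correct even though grouping was not completed at the source.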
- trait SupportsPushDownFilters extends ScanBuilder
A mix-in interface for ScanBuilder. Data sources can implement this interface to push down filters to the data source and reduce the size of the data to be read.
- Annotations
- @Evolving()
- Since
3.0.0
- trait SupportsPushDownRequiredColumns extends ScanBuilder
A mix-in interface for ScanBuilder. Data sources can implement this interface to push down required columns to the data source and only read these columns during scan to reduce the size of the data to be read.
- Annotations
- @Evolving()
- Since
3.0.0
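The effect of column pruning can be shown with a toy source (the real interface receives the required schema as a Spark StructType via pruneColumns; maps of strings stand in for rows here, and all names are illustrative):

```scala
// Rows of a hypothetical three-column source; "payload" is wide and
// expensive to read.
val rows = Seq(
  Map("id" -> "1", "name" -> "ann", "payload" -> "..."),
  Map("id" -> "2", "name" -> "bob", "payload" -> "...")
)

// Sketch of what honoring pruneColumns(required) means for the scan:
// only the required columns are materialized.
def prune(required: Seq[String]): Seq[Map[String, String]] =
  rows.map(_.filter { case (col, _) => required.contains(col) })

// A query touching only id and name never pays for payload.
val pruned = prune(Seq("id", "name"))
```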
- trait SupportsReportPartitioning extends Scan
A mix-in interface for Scan. Data sources can implement this interface to report data partitioning and try to avoid a shuffle on the Spark side.
Note that when a Scan implementation creates exactly one InputPartition, Spark may avoid adding a shuffle even if the reader does not implement this interface.
- Annotations
- @Evolving()
- Since
3.0.0
- trait SupportsReportStatistics extends Scan
A mix-in interface for Scan. Data sources can implement this interface to report statistics to Spark.
As of Spark 3.0, statistics are reported to the optimizer after operators are pushed to the data source. Implementations may return more accurate statistics based on pushed operators, which may improve query performance by providing better information to the optimizer.
- Annotations
- @Evolving()
- Since
3.0.0
- trait SupportsRuntimeFiltering extends Scan
A mix-in interface for Scan. Data sources can implement this interface if they can filter initially planned InputPartitions using predicates Spark infers at runtime.
Note that Spark will push runtime filters only if they are beneficial.
- Annotations
- @Experimental()
- Since
3.2.0
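The idea of filtering already-planned partitions can be sketched in isolation (the real interface exposes filterAttributes() and receives Spark Filter predicates; a date-partitioned source with a set of runtime-inferred dates is assumed here purely for illustration):

```scala
// Hypothetical partitions of a source partitioned by date, as planned
// before execution started.
case class DatePartition(date: String) extends Serializable

val planned = Seq(
  DatePartition("2024-01-01"),
  DatePartition("2024-01-02"),
  DatePartition("2024-01-03")
)

// A predicate Spark might infer at runtime (e.g. from the build side of
// a join): only one date can possibly match. The source drops the other
// planned partitions before any data is read.
val runtimeDates = Set("2024-01-02")
val filtered = planned.filter(p => runtimeDates.contains(p.date))
```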