public class SuperSorter extends Object
Sorts and partitions a dataset using parallel external merge sort. Input is provided as a list of
ReadableFrameChannel and output is provided as OutputChannels.
Work is performed on a provided FrameProcessorExecutor.
The central point of SuperSorter logic is the runWorkersIfPossible() method, which determines what
needs to be done next based on the current state of the SuperSorter. The logic is:
1) Read input channels into inputBuffer using FrameChannelBatcher, launched via
runNextBatcher(), up to a limit of maxChannelsPerProcessor per batcher.
2) Merge and write frames from inputBuffer into FrameFile scratch files using
FrameChannelMerger launched via runNextLevelZeroMerger().
3a) Merge level 0 scratch files into level 1 scratch files using FrameChannelMerger launched from
runNextMiddleMerger(), processing up to maxChannelsPerProcessor files per merger.
Continue this process through increasing level numbers, with the size of scratch files increasing by a factor
of maxChannelsPerProcessor each level.
3b) For the penultimate level, the FrameChannelMerger launched by runNextMiddleMerger() writes
partitioned FrameFile scratch files. The penultimate level cannot be written until
outputPartitionsFuture resolves, so if it has not resolved yet by this point, the SuperSorter pauses.
The SuperSorter resumes and writes the penultimate level's files when the future resolves.
4) Write the final level using FrameChannelMerger launched from runNextUltimateMerger().
Outputs for this level are written to channels provided by outputChannelFactory, rather than scratch files.
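The level-by-level merging in steps 2 through 4 can be sketched with plain in-memory lists. This is a minimal illustration, not the SuperSorter implementation: sorted integer lists stand in for FrameFile scratch files, the class and method names (MergeLevelsSketch, mergeRuns, mergeAll) are hypothetical, and partitioning at the penultimate level is not modeled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical stand-in for multi-level merging; sorted lists play the role of scratch files. */
class MergeLevelsSketch {
    /** Number of merge levels performed by the most recent call to mergeAll. */
    static int levelsUsed;

    /** k-way merge of already-sorted runs using a min-heap of cursors. */
    static List<Integer> mergeRuns(List<List<Integer>> runs) {
        // Each heap entry is {value, runIndex, positionInRun}, ordered by value.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[0]));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) {
                heap.add(new int[] {runs.get(r).get(0), r, 0});
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            merged.add(head[0]);
            // Advance the cursor of the run that supplied the smallest value.
            List<Integer> run = runs.get(head[1]);
            int next = head[2] + 1;
            if (next < run.size()) {
                heap.add(new int[] {run.get(next), head[1], next});
            }
        }
        return merged;
    }

    /** Merges groups of at most maxChannelsPerProcessor runs per level until one run remains. */
    static List<Integer> mergeAll(List<List<Integer>> levelZeroRuns, int maxChannelsPerProcessor) {
        if (maxChannelsPerProcessor < 2) {
            throw new IllegalArgumentException("maxChannelsPerProcessor must be at least 2");
        }
        List<List<Integer>> current = new ArrayList<>(levelZeroRuns);
        levelsUsed = 0;
        while (current.size() > 1) {
            List<List<Integer>> nextLevel = new ArrayList<>();
            for (int i = 0; i < current.size(); i += maxChannelsPerProcessor) {
                int end = Math.min(i + maxChannelsPerProcessor, current.size());
                nextLevel.add(mergeRuns(current.subList(i, end)));
            }
            current = nextLevel;
            levelsUsed++;
        }
        return current.isEmpty() ? new ArrayList<>() : current.get(0);
    }
}
```

In this sketch, five level-zero runs with a fan-in of 2 take three merge levels (5 runs, then 3, then 2, then 1), while a fan-in of 3 takes two. A larger maxChannelsPerProcessor means fewer passes over the data, at the cost of more channels open per merger.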
At all points, higher-level processing is preferred over lower-level processing. Writing final outputs
is preferred over writing intermediate files, and writing intermediate files is preferred over reading inputs.
These preferences keep the amount of data buffered in memory from growing too large.
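The preference order can be pictured as a simple priority check. The following is a rough sketch under stated assumptions, not the actual runWorkersIfPossible() code: the State fields and the Action enum are hypothetical, and java.util.concurrent.CompletableFuture stands in for the Guava ListenableFuture used by outputPartitionsFuture.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical model of "prefer the highest level of work that is currently possible". */
class NextActionSketch {
    enum Action { ULTIMATE_MERGE, MIDDLE_MERGE, LEVEL_ZERO_MERGE, READ_INPUT, WAIT }

    /** Hypothetical snapshot of sorter state; the real class tracks far more detail. */
    static final class State {
        boolean ultimateMergeReady;       // penultimate-level files are ready for final output
        boolean middleMergeReady;         // enough scratch files exist for a middle merge
        boolean middleMergeIsPenultimate; // that middle merge would write the penultimate level
        boolean inputBufferHasFrames;     // buffered frames are waiting for a level-zero merge
        boolean inputRemaining;           // some input channels have not been fully read
        // Stand-in for outputPartitionsFuture (assumption: CompletableFuture instead of Guava).
        CompletableFuture<Object> outputPartitions = new CompletableFuture<>();
    }

    /** Returns the highest-priority action that can run now, or WAIT if none can. */
    static Action choose(State s) {
        if (s.ultimateMergeReady) {
            return Action.ULTIMATE_MERGE;
        }
        // The penultimate level needs the output partitions, so it is gated on the future.
        boolean middleAllowed = !s.middleMergeIsPenultimate || s.outputPartitions.isDone();
        if (s.middleMergeReady && middleAllowed) {
            return Action.MIDDLE_MERGE;
        }
        if (s.inputBufferHasFrames) {
            return Action.LEVEL_ZERO_MERGE;
        }
        if (s.inputRemaining) {
            return Action.READ_INPUT;
        }
        return Action.WAIT;
    }
}
```

Checking the highest level first drains data toward the output before more input is pulled in, which is what bounds memory use. When the penultimate merge is blocked on the future, the sketch falls through to lower-level work and returns WAIT only when nothing else is runnable.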
Potential future work (things we could optimize if necessary):
- Collapse merging to a single level if level zero has one merger, and we want to write one output partition.
- Skip batching, and inject directly into level 0, if input channels are already individually fully-sorted.
- Combine (for example: aggregate) while merging.

| Modifier and Type | Field and Description |
|---|---|
| static int | UNKNOWN_LEVEL |
| static long | UNKNOWN_TOTAL |

| Constructor and Description |
|---|
| SuperSorter(List<ReadableFrameChannel> inputChannels, FrameReader frameReader, ClusterBy clusterBy, com.google.common.util.concurrent.ListenableFuture<ClusterByPartitions> outputPartitionsFuture, FrameProcessorExecutor exec, File temporaryDirectory, OutputChannelFactory outputChannelFactory, Supplier<MemoryAllocator> innerFrameAllocatorMaker, int maxActiveProcessors, int maxChannelsPerProcessor, long rowLimit, String cancellationId, SuperSorterProgressTracker superSorterProgressTracker) Initializes a SuperSorter. |

| Modifier and Type | Method and Description |
|---|---|
| com.google.common.util.concurrent.ListenableFuture<OutputChannels> | run() Starts sorting. |
| String | stateString() Returns a string encapsulating the current state of this object. |

public static final int UNKNOWN_LEVEL
public static final long UNKNOWN_TOTAL
public SuperSorter(List<ReadableFrameChannel> inputChannels, FrameReader frameReader, ClusterBy clusterBy, com.google.common.util.concurrent.ListenableFuture<ClusterByPartitions> outputPartitionsFuture, FrameProcessorExecutor exec, File temporaryDirectory, OutputChannelFactory outputChannelFactory, Supplier<MemoryAllocator> innerFrameAllocatorMaker, int maxActiveProcessors, int maxChannelsPerProcessor, long rowLimit, @Nullable String cancellationId, SuperSorterProgressTracker superSorterProgressTracker)
inputChannels - input channels. All frames in these channels must be sorted according to the ClusterBy.getColumns(), or else sorting will not produce correct output.
frameReader - frame reader for the input channels
clusterBy - desired sorting order
outputPartitionsFuture - a future that resolves to the desired output partitions. Sorting will block prior to writing out final outputs until this future resolves. However, the sorter will be able to read all inputs even if this future is unresolved. If output need not be partitioned, use ClusterByPartitions.oneUniversalPartition(). In this case a single sorted channel is generated.
exec - executor to perform work in
temporaryDirectory - directory to use for scratch files. This must have enough space to store at least two copies of the dataset in FrameFile format.
outputChannelFactory - factory for partitioned, sorted output channels
innerFrameAllocatorMaker - supplier for allocators that are used to make merged frames for intermediate levels of merging, prior to the final output. Final output frame allocation is controlled by outputChannelFactory. One allocator is created per intermediate scratch file.
maxActiveProcessors - maximum number of merging processors to execute at once in the provided FrameProcessorExecutor
maxChannelsPerProcessor - maximum number of channels to merge at once per merging processor
rowLimit - limit to apply during sorting. The limit is merely advisory: the actual number of rows returned may be larger than the limit. The limit is applied across all partitions, not to each partition individually.
cancellationId - cancellation id to use when running processors in the provided FrameProcessorExecutor.
superSorterProgressTracker - progress tracker
public com.google.common.util.concurrent.ListenableFuture<OutputChannels> run()
Starts sorting. Work is performed on the FrameProcessorExecutor that was passed to the constructor.
Returns a future containing partitioned sorted output channels.
public String stateString()
Copyright © 2011–2022 The Apache Software Foundation. All rights reserved.