public abstract class PrefetchableTextFilesFirehoseFactory<T> extends AbstractTextFilesFirehoseFactory<T>
- Caching: on the first call of connect(StringInputRowParser, File), it caches objects on local disk
up to maxCacheCapacityBytes. These caches are NOT deleted until the process terminates, so they can be reused by
future reads.
- Fetching: once all cached data has been read, it fetches the remaining objects to local disk and reads data from
them. For performance, it prefetches: when the amount of fetched data that is still unread drops below
PrefetchConfig.prefetchTriggerBytes, a background prefetch thread automatically starts fetching the remaining
objects.
- Retry: if an exception occurs while downloading an object, it retries up to maxFetchRetry times.
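The prefetch trigger described above can be illustrated with a minimal sketch. The class `PrefetchTriggerSketch` and its methods are hypothetical and are not part of the Druid API. It also assumes that prefetching counts as enabled only when maxFetchCapacityBytes is greater than zero, which this page does not state.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper illustrating the prefetch trigger; not part of the Druid API. */
final class PrefetchTriggerSketch {
    private final long maxFetchCapacityBytes;
    private final long prefetchTriggerBytes;
    // Bytes already fetched to local disk but not yet consumed by the reader.
    private final AtomicLong fetchedBytes = new AtomicLong();

    PrefetchTriggerSketch(long maxFetchCapacityBytes, long prefetchTriggerBytes) {
        this.maxFetchCapacityBytes = maxFetchCapacityBytes;
        this.prefetchTriggerBytes = prefetchTriggerBytes;
    }

    /** Assumption: a fetch capacity of zero means prefetching is turned off. */
    boolean isPrefetchEnabled() {
        return maxFetchCapacityBytes > 0;
    }

    /** True when a background fetch should start: unread fetched data is below the trigger. */
    boolean shouldStartPrefetch(boolean objectsRemaining) {
        return isPrefetchEnabled()
            && objectsRemaining
            && fetchedBytes.get() < prefetchTriggerBytes;
    }

    /** Called after an object of the given size has been downloaded. */
    void onFetched(long bytes) {
        fetchedBytes.addAndGet(bytes);
    }

    /** Called after the reader has finished with a fetched file of the given size. */
    void onConsumed(long bytes) {
        fetchedBytes.addAndGet(-bytes);
    }
}
```

In this sketch the reader would call `shouldStartPrefetch` after each consumed file and hand the work to a background thread when it returns true.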
This implementation is useful when reading input objects is expensive, for example when reading from AWS S3, because
batch tasks such as IndexTask or HadoopIndexTask may read the whole data set twice (once to determine partition specs
and once to generate segments) if the intervals of the GranularitySpec are not specified.
Prefetching can be turned on or off by setting maxFetchCapacityBytes. The behavior of the firehose depends on
whether prefetching is enabled, as follows.
1. If prefetch is enabled, this firehose can fetch input objects in the background.
2. When next() is called, it first checks whether there are already fetched files in local storage.
2.1 If there are, it simply chooses a fetched file and returns a LineIterator reading that file.
2.2 If there are no fetched files in local storage but some objects remain to be read, the firehose
immediately fetches one of the input objects in the background. If an IOException occurs while downloading the object,
it retries up to the maximum retry count. The firehose returns a LineIterator only after the
download has finished successfully.
3. If prefetch is disabled, the firehose returns a LineIterator that reads directly from the stream opened by
openObjectStream(T, long). If an IOException occurs, it is thrown and the read fails.
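The retry behavior in step 2.2 can be sketched roughly as follows. `RetrySketch` and `callWithRetry` are hypothetical names, not Druid classes. The sketch uses `java.util.function.Predicate` as a dependency-free stand-in for the Guava predicate returned by getRetryCondition(), and it assumes that the maximum retry count means retries after the first attempt.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Hypothetical illustration of retrying a download; not the Druid implementation. */
final class RetrySketch {
    /**
     * Runs the download, retrying while the failure satisfies retryCondition.
     * Assumption: maxFetchRetry counts retries after the first attempt,
     * so the download is tried at most maxFetchRetry + 1 times.
     */
    static <V> V callWithRetry(
        Callable<V> download,
        Predicate<Throwable> retryCondition,
        int maxFetchRetry
    ) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return download.call();
            } catch (Exception e) {
                // Give up when retries are exhausted or the error is not retryable.
                if (attempt >= maxFetchRetry || !retryCondition.test(e)) {
                    throw e;
                }
            }
        }
    }
}
```

A caller would pass the download as the `Callable` and a condition such as `e -> e instanceof IOException`; a production version would normally also wait between attempts.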
| Modifier and Type | Field and Description |
|---|---|
| static int | DEFAULT_MAX_FETCH_RETRY |
| Constructor and Description |
|---|
| PrefetchableTextFilesFirehoseFactory(Long maxCacheCapacityBytes, Long maxFetchCapacityBytes, Long prefetchTriggerBytes, Long fetchTimeout, Integer maxFetchRetry) |
| Modifier and Type | Method and Description |
|---|---|
| Firehose | connect(StringInputRowParser firehoseParser, File temporaryDirectory): Initialization method that connects up the fire hose. |
| long | getFetchTimeout() |
| long | getMaxCacheCapacityBytes() |
| long | getMaxFetchCapacityBytes() |
| int | getMaxFetchRetry() |
| long | getPrefetchTriggerBytes() |
| protected abstract com.google.common.base.Predicate<Throwable> | getRetryCondition(): Returns a predicate describing retry conditions. |
| protected abstract InputStream | openObjectStream(T object, long start): Open an input stream from the given object. |
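To show what an openObjectStream(T object, long start) implementation might look like, here is a minimal self-contained sketch that treats local files as the input objects. `OffsetStreamSketch` is a hypothetical class that does not extend the Druid factory, and it assumes that start is a byte offset into the object.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stand-in for a subclass's openObjectStream; local files play the role of objects. */
final class OffsetStreamSketch {
    /** Opens the object and positions the stream at the given byte offset (assumed meaning of start). */
    static InputStream openObjectStream(Path object, long start) throws IOException {
        InputStream in = Files.newInputStream(object);
        try {
            long remaining = start;
            while (remaining > 0) {
                long skipped = in.skip(remaining);
                if (skipped == 0) {
                    // skip() may return 0 before end of stream; read one byte to tell the cases apart.
                    if (in.read() < 0) {
                        throw new EOFException("start offset " + start + " is past the end of " + object);
                    }
                    skipped = 1;
                }
                remaining -= skipped;
            }
            return in;
        } catch (IOException e) {
            // Do not leak the stream if positioning fails.
            in.close();
            throw e;
        }
    }
}
```

Accepting a start offset lets a caller reopen an object partway through, for example after a failed read, instead of starting again from the beginning. A remote store would more likely request a byte range from the service than skip bytes locally.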
Methods inherited from class AbstractTextFilesFirehoseFactory: getNumSplits, getObjects, getSplits, initializeObjectsIfNeeded, initObjects, openObjectStream, wrapObjectStream
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from implemented interfaces: isSplittable, withSplit, connect

public static final int DEFAULT_MAX_FETCH_RETRY

public long getMaxCacheCapacityBytes()

public long getMaxFetchCapacityBytes()

public long getPrefetchTriggerBytes()

public long getFetchTimeout()

public int getMaxFetchRetry()

public Firehose connect(StringInputRowParser firehoseParser, @Nullable File temporaryDirectory) throws IOException
Description copied from interface: FirehoseFactory
PrefetchableTextFilesFirehoseFactory may use a temporary directory to cache data in it.
- Specified by: connect in interface FirehoseFactory<StringInputRowParser>
- Overrides: connect in class AbstractTextFilesFirehoseFactory<T>
- Parameters: firehoseParser - an input row parser; temporaryDirectory - a directory where temporary files are stored
- Throws: IOException

protected abstract com.google.common.base.Predicate<Throwable> getRetryCondition()
Returns a predicate describing retry conditions. Fetcher and RetryingInputStream will retry on the
errors satisfying this condition.

protected abstract InputStream openObjectStream(T object, long start) throws IOException
Open an input stream from the given object.
- See also: AbstractTextFilesFirehoseFactory.wrapObjectStream(Object, InputStream)
- Parameters: object - an object to be read; start - start offset
- Throws: IOException

Copyright © 2011–2018 The Apache Software Foundation. All rights reserved.