Type Parameters: T - type of object.

public class HoodieListData<T> extends HoodieBaseListData<T> implements HoodieData<T>

An implementation of HoodieData that internally holds a Stream of objects.

HoodieListData can have either of 2 execution semantics:

1. Eager: every operation is executed right away.
2. Lazy: operations are stacked up, and their execution is postponed until a terminal operation is invoked.

NOTE: This is an in-memory counterpart of HoodieJavaRDD, and it strives to provide
semantics similar to those of the RDD container: all intermediate operations (non-terminal ones,
i.e. those not de-referencing the stream the way "collect", "groupBy", etc. do) are executed *lazily*.
This keeps compute/memory churn minimal, since only the necessary
computations are ultimately performed.

Please note, however, that while the RDD container allows the same collection to be
de-referenced more than once (i.e. a terminal operation invoked more than once),
HoodieListData allows that only when instantiated with an eager execution semantic.

Nested classes/interfaces inherited from interface HoodieData: HoodieData.HoodieDataCacheKey
Fields inherited from class HoodieBaseListData: data, lazy

| Modifier and Type | Method and Description |
|---|---|
| List<T> | collectAsList(): Collects results of the underlying collection into a List. This is a terminal operation. |
| long | count(): Returns the number of objects held in the collection. |
| HoodieData<T> | distinct(): Returns a new HoodieData collection holding only distinct objects of the original one. This is a stateful intermediate operation. |
| HoodieData<T> | distinct(int parallelism): Returns a new HoodieData collection holding only distinct objects of the original one. This is a stateful intermediate operation. |
| <O> HoodieData<T> | distinctWithKey(SerializableFunction<T,O> keyGetter, int parallelism) |
| static <T> HoodieListData<T> | eager(List<T> listData): Creates an instance of HoodieListData bearing *eager* execution semantic. |
| HoodieData<T> | filter(SerializableFunction<T,Boolean> filterFunc): Returns a new instance of HoodieData collection containing only elements matching the provided filterFunc (i.e. ones it returns true on). |
| <O> HoodieData<O> | flatMap(SerializableFunction<T,Iterator<O>> func): Maps every element in the collection into a collection of new elements using the provided mapping func, subsequently flattening the result (by concatenating) into a single collection. This is an intermediate operation. |
| <K,V> HoodiePairData<K,V> | flatMapToPair(SerializableFunction<T,Iterator<? extends Pair<K,V>>> func): Maps every element in the collection into a collection of Pairs of new elements using the provided mapping func, subsequently flattening the result (by concatenating) into a single collection. NOTE: This operation converts the container from HoodieData to HoodiePairData. This is an intermediate operation. |
| int | getId(): Gets the HoodieData's unique non-negative identifier. |
| int | getNumPartitions() |
| boolean | isEmpty(): Returns whether the collection is empty. |
| static <T> HoodieListData<T> | lazy(List<T> listData): Creates an instance of HoodieListData bearing *lazy* execution semantic. |
| <O> HoodieData<O> | map(SerializableFunction<T,O> func): Maps every element in the collection using the provided mapping func. |
| <O> HoodieData<O> | mapPartitions(SerializableFunction<Iterator<T>,Iterator<O>> func, boolean preservesPartitioning): Maps every element in the collection's partition (if applicable) by applying the provided mapping func to every partition of the collection. This is an intermediate operation. |
| <K,V> HoodiePairData<K,V> | mapToPair(SerializablePairFunction<T,K,V> func): Maps every element in the collection using the provided mapping func into a Pair of elements K and V. |
| void | persist(String level): Persists the data with the provided level (if applicable). |
| void | persist(String level, HoodieEngineContext engineContext, HoodieData.HoodieDataCacheKey cacheKey): Persists the data with the provided level (if applicable), and caches the data's ids within the engineContext. |
| HoodieData<T> | repartition(int parallelism): Re-partitions the underlying collection (if applicable), making sure the new HoodieData has exactly parallelism partitions. |
| HoodieData<T> | union(HoodieData<T> other): Unions HoodieData with another instance of HoodieData. |
| void | unpersist(): Un-persists the data (if previously persisted). |

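The difference between the eager and lazy execution semantics described in the class overview can be illustrated with plain Java streams. The sketch below is not Hudi code: `MiniListData` and its `secondCollectFails` helper are hypothetical stand-ins, written under the assumption that a lazy instance wraps a single-use `java.util.stream.Stream` while an eager instance wraps a materialized `List`. It only mirrors the method names `eager`, `lazy`, `map`, and `collectAsList` from the summary above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for illustration only; this is NOT Hudi's HoodieListData.
final class MiniListData<T> {
    private final List<T> list;     // non-null only for eager instances
    private final Stream<T> stream; // non-null only for lazy instances

    private MiniListData(List<T> list, Stream<T> stream) {
        this.list = list;
        this.stream = stream;
    }

    static <T> MiniListData<T> eager(List<T> data) {
        return new MiniListData<>(data, null);
    }

    static <T> MiniListData<T> lazy(List<T> data) {
        return new MiniListData<>(null, data.stream());
    }

    <O> MiniListData<O> map(Function<T, O> func) {
        if (stream != null) {
            // Lazy: the operation is only stacked onto the stream pipeline.
            return new MiniListData<>(null, stream.map(func));
        }
        // Eager: the operation runs right away and the result is materialized.
        List<O> mapped = list.stream().map(func).collect(Collectors.toList());
        return new MiniListData<>(mapped, null);
    }

    // Terminal operation: a lazy instance consumes its stream here.
    List<T> collectAsList() {
        return stream != null ? stream.collect(Collectors.toList()) : list;
    }

    // Invokes the terminal operation twice and reports whether the second call failed.
    static boolean secondCollectFails(MiniListData<?> data) {
        data.collectAsList();
        try {
            data.collectAsList();
            return false;
        } catch (IllegalStateException e) {
            return true; // a Java Stream cannot be consumed twice
        }
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3);
        MiniListData<Integer> eagerData = MiniListData.eager(input).map(x -> x * 2);
        MiniListData<Integer> lazyData = MiniListData.lazy(input).map(x -> x * 2);
        System.out.println("eager fails on second collect: " + secondCollectFails(eagerData));
        System.out.println("lazy fails on second collect: " + secondCollectFails(lazyData));
    }
}
```

In this sketch the eager instance can be collected repeatedly because it holds a list, whereas the lazy instance fails on the second terminal call. That matches the note in the overview that only an eagerly instantiated collection may be de-referenced more than once.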
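Methods inherited from class HoodieBaseListData: asStream

public static <T> HoodieListData<T> eager(List<T> listData)
Creates an instance of HoodieListData bearing *eager* execution semantic.
Type Parameters: T - type of object
Parameters: listData - a List of objects of type T
Returns: a HoodieListData instance holding the List reference

public static <T> HoodieListData<T> lazy(List<T> listData)
Creates an instance of HoodieListData bearing *lazy* execution semantic.
Type Parameters: T - type of object
Parameters: listData - a List of objects of type T
Returns: a HoodieListData instance holding the List reference

public int getId()
Gets the HoodieData's unique non-negative identifier. -1 indicates an invalid id.
Specified by: getId in interface HoodieData<T>

public void persist(String level)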
Persists the data with the provided level (if applicable).
Use this method only when you call HoodieData.unpersist() at some later point for the same HoodieData.
Otherwise, use HoodieData.persist(String, HoodieEngineContext, HoodieDataCacheKey) instead for auto-unpersist
at the end of a client write operation.
Specified by: persist in interface HoodieData<T>

public void persist(String level, HoodieEngineContext engineContext, HoodieData.HoodieDataCacheKey cacheKey)
Persists the data with the provided level (if applicable), and caches the data's ids within the engineContext.
Specified by: persist in interface HoodieData<T>

public void unpersist()
Un-persists the data (if previously persisted).
Specified by: unpersist in interface HoodieData<T>

public <O> HoodieData<O> map(SerializableFunction<T,O> func)
Maps every element in the collection using the provided mapping func.
This is an intermediate operation.
Specified by: map in interface HoodieData<T>
Type Parameters: O - output object type
Parameters: func - serializable map function
Returns: HoodieData holding mapped elements

public <O> HoodieData<O> mapPartitions(SerializableFunction<Iterator<T>,Iterator<O>> func, boolean preservesPartitioning)
Maps every element in the collection's partition (if applicable) by applying the provided
mapping func to every partition of the collection.
This is an intermediate operation.
Specified by: mapPartitions in interface HoodieData<T>
Type Parameters: O - output object type
Parameters:
func - serializable map function accepting an Iterator of a single partition's elements and returning a new Iterator mapping every element of the partition into a new one
preservesPartitioning - whether to preserve partitioning in the resulting collection
Returns: HoodieData holding mapped elements

public <O> HoodieData<O> flatMap(SerializableFunction<T,Iterator<O>> func)
Maps every element in the collection into a collection of new elements using the provided
mapping func, subsequently flattening the result (by concatenating) into a single collection.
This is an intermediate operation.
Specified by: flatMap in interface HoodieData<T>
Type Parameters: O - output object type
Parameters: func - serializable function mapping every element T into Iterator<O>
Returns: HoodieData holding mapped elements

public <K,V> HoodiePairData<K,V> flatMapToPair(SerializableFunction<T,Iterator<? extends Pair<K,V>>> func)
Maps every element in the collection into a collection of Pairs of new elements
using the provided mapping func, subsequently flattening the result (by concatenating) into
a single collection.
NOTE: This operation converts the container from HoodieData to HoodiePairData.
This is an intermediate operation.
Specified by: flatMapToPair in interface HoodieData<T>

public <K,V> HoodiePairData<K,V> mapToPair(SerializablePairFunction<T,K,V> func)
Maps every element in the collection using the provided mapping func into a Pair
of elements K and V.
This is an intermediate operation.
Specified by: mapToPair in interface HoodieData<T>
Type Parameters:
K - key type of the pair
V - value type of the pair
Parameters: func - serializable map function
Returns: HoodiePairData holding mapped elements

public HoodieData<T> distinct()
Returns a new HoodieData collection holding only distinct objects of the original one.
This is a stateful intermediate operation.
Specified by: distinct in interface HoodieData<T>

public HoodieData<T> distinct(int parallelism)
Returns a new HoodieData collection holding only distinct objects of the original one.
This is a stateful intermediate operation.
Specified by: distinct in interface HoodieData<T>

public <O> HoodieData<T> distinctWithKey(SerializableFunction<T,O> keyGetter, int parallelism)
Specified by: distinctWithKey in interface HoodieData<T>

public HoodieData<T> filter(SerializableFunction<T,Boolean> filterFunc)
Returns a new instance of HoodieData collection containing only elements matching the provided
filterFunc (i.e. ones it returns true on).
Specified by: filter in interface HoodieData<T>
Parameters: filterFunc - filtering func either accepting or rejecting the elements
Returns: HoodieData holding filtered elements

public HoodieData<T> union(HoodieData<T> other)
Unions HoodieData with another instance of HoodieData.
Note that it is only able to union collections backed by the same underlying implementation.
This is a stateful intermediate operation.
Specified by: union in interface HoodieData<T>
Parameters: other - HoodieData collection
Returns: HoodieData holding superset of elements of this and other collections

public HoodieData<T> repartition(int parallelism)
Re-partitions the underlying collection (if applicable), making sure the new HoodieData has
exactly parallelism partitions.
Specified by: repartition in interface HoodieData<T>
Parameters: parallelism - target number of partitions in the underlying collection
Returns: HoodieData holding re-partitioned collection

public boolean isEmpty()
Returns whether the collection is empty.
Specified by: isEmpty in interface HoodieData<T>
Overrides: isEmpty in class HoodieBaseListData<T>

public long count()
Returns the number of objects held in the collection.
NOTE: This is a terminal operation.
Specified by: count in interface HoodieData<T>
Overrides: count in class HoodieBaseListData<T>

public int getNumPartitions()
Specified by: getNumPartitions in interface HoodieData<T>

public List<T> collectAsList()
Collects results of the underlying collection into a List.
This is a terminal operation.
Specified by: collectAsList in interface HoodieData<T>
Overrides: collectAsList in class HoodieBaseListData<T>

Copyright © 2024 The Apache Software Foundation. All rights reserved.