public interface Source<S,D>

Type Parameters:
S - output schema type
D - output record type

An implementation of this interface should contain all the logic required to work with a specific data source. This usually includes work determination and partitioning, and details of the connection protocol to work with the data source.
| Modifier and Type | Method and Description |
|---|---|
| Extractor<S,D> | getExtractor(WorkUnitState state): Get an Extractor based on a given WorkUnitState. |
| List<WorkUnit> | getWorkunits(SourceState state): Get a list of WorkUnits, each of which is for extracting a portion of the data. |
| default boolean | isEarlyStopped() |
| void | shutdown(SourceState state): Shut down this Source instance. |

List<WorkUnit> getWorkunits(SourceState state)
Get a list of WorkUnits, each of which is for extracting a portion of the data.
Each WorkUnit will be used to instantiate a WorkUnitState that gets passed to the
getExtractor(org.apache.gobblin.configuration.WorkUnitState) method to get an Extractor for extracting schema
and data records from the source. The WorkUnit instance should have all the properties
needed for the Extractor to work.
Typically the list of WorkUnits for the current run is determined by taking into account
the list of WorkUnits from the previous run so data gets extracted incrementally. The
method SourceState.getPreviousWorkUnitStates() can be used to get the list of WorkUnits
from the previous run.
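
The incremental planning described above can be sketched as follows. This is a minimal illustration, not Gobblin's implementation: SketchWorkUnit, SketchSourceState, IncrementalPlanner, and the two property keys are hypothetical stand-ins so the example compiles on its own. It assumes the source exposes a numeric offset and that each previous work unit state recorded the exclusive upper offset it covered.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for WorkUnit: a bag of string properties.
class SketchWorkUnit {
    final Map<String, String> props = new HashMap<>();
}

// Hypothetical stand-in for SourceState. The list plays the role of
// what SourceState.getPreviousWorkUnitStates() would return.
class SketchSourceState {
    final List<Map<String, String>> previousWorkUnitStates = new ArrayList<>();
}

class IncrementalPlanner {
    static final String LOW = "sketch.offset.low";   // hypothetical key (inclusive)
    static final String HIGH = "sketch.offset.high"; // hypothetical key (exclusive)

    // Builds work units covering offsets [highest previous HIGH, latestOffset),
    // split into chunks of at most chunkSize offsets each.
    static List<SketchWorkUnit> getWorkunits(SketchSourceState state,
                                             long latestOffset,
                                             long chunkSize) {
        // Resume from the highest offset any previous work unit reached.
        long start = 0;
        for (Map<String, String> prev : state.previousWorkUnitStates) {
            String high = prev.get(HIGH);
            if (high != null) {
                start = Math.max(start, Long.parseLong(high));
            }
        }

        // Each work unit carries every property its extractor would need.
        List<SketchWorkUnit> units = new ArrayList<>();
        for (long low = start; low < latestOffset; low += chunkSize) {
            SketchWorkUnit unit = new SketchWorkUnit();
            unit.props.put(LOW, Long.toString(low));
            unit.props.put(HIGH, Long.toString(Math.min(low + chunkSize, latestOffset)));
            units.add(unit);
        }
        return units;
    }
}
```

If the previous run stopped at offset 100 and the source now holds 250 offsets, a chunk size of 100 yields two work units. When nothing new has arrived, the list is empty.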
Parameters: state - see SourceState
Returns: a list of WorkUnits

Extractor<S,D> getExtractor(WorkUnitState state) throws IOException
Get an Extractor based on a given WorkUnitState.
The Extractor returned can use WorkUnitState to store arbitrary key-value pairs
that will be persisted to the state store and loaded in the next scheduled job run.
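
As a rough illustration of an extractor recording key-value pairs in the state it was given, consider the sketch below. All class names and the property key are hypothetical stand-ins rather than Gobblin types. The stand-in state is an in-memory map, so the persistence to the state store described above is assumed, not shown.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for WorkUnitState: a bag of key-value pairs.
class SketchWorkUnitState {
    private final Map<String, String> props = new HashMap<>();

    void setProp(String key, String value) {
        props.put(key, value);
    }

    String getProp(String key, String fallback) {
        return props.getOrDefault(key, fallback);
    }
}

// Hypothetical extractor that records its progress in the state.
class CountingExtractor {
    static final String RECORDS_READ = "sketch.records.read"; // hypothetical key

    private final SketchWorkUnitState state;
    private final Iterator<String> records;
    private long read = 0;

    CountingExtractor(SketchWorkUnitState state, List<String> records) {
        this.state = state;
        this.records = records.iterator();
    }

    // Returns the next record, or null once the work unit is exhausted.
    String readRecord() {
        if (!records.hasNext()) {
            return null;
        }
        String record = records.next();
        read++;
        // Store progress so a later run could pick it up from the state.
        state.setProp(RECORDS_READ, Long.toString(read));
        return record;
    }
}

class ExtractorFactory {
    // Mirrors the shape of getExtractor(WorkUnitState): the state is handed
    // to the extractor, which keeps a reference to it. The records argument
    // is an assumption standing in for a real data source connection.
    static CountingExtractor getExtractor(SketchWorkUnitState state, List<String> records) {
        return new CountingExtractor(state, records);
    }
}
```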
Parameters: state - a WorkUnitState carrying properties needed by the returned Extractor
Returns: an Extractor used to extract schema and data records from the data source
Throws: IOException - if it fails to create an Extractor

void shutdown(SourceState state)
Shut down this Source instance.
This method is called once when the job completes. Properties (key-value pairs) added to the input
SourceState instance will be persisted and available to the next scheduled job run through
the method getWorkunits(SourceState). If there is no cleanup or reporting required for a
particular implementation of this interface, then it is acceptable to have a default implementation
of this method.
Parameters: state - see SourceState
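
The hand-off between shutdown and the next run's planning might look like the sketch below. SketchState, SketchSource, the nextRunNumber helper, and the property key are all hypothetical. Persistence between runs is simulated by reusing the same in-memory state object, which is an assumption made only to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for SourceState.
class SketchState {
    final Map<String, String> props = new HashMap<>();
}

class SketchSource {
    static final String LAST_RUN = "sketch.last.completed.run"; // hypothetical key

    private boolean connectionOpen = true;

    // Planning side: reads what the previous run's shutdown recorded.
    // Defaults to run 1 when no earlier run left a property behind.
    int nextRunNumber(SketchState state) {
        return Integer.parseInt(state.props.getOrDefault(LAST_RUN, "0")) + 1;
    }

    // Called once when the job completes: release resources and record
    // a property for the next scheduled run to read.
    void shutdown(SketchState state, int completedRun) {
        connectionOpen = false;
        state.props.put(LAST_RUN, Integer.toString(completedRun));
    }

    boolean isConnectionOpen() {
        return connectionOpen;
    }
}
```

A source with nothing to clean up or report could leave the body of shutdown empty.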