Package org.jesterj.ingest.model
Interface DocumentProcessor
-
- All Superinterfaces:
Configurable
- All Known Implementing Classes:
CopyField, DropFieldProcessor, FetchUrl, FieldTemplateProcessor, LogAndDrop, LogAndFail, NoOpProcessor, PreAnalyzeFields, RegexValueReplace, SendToSolrCloudProcessor, SetReadableFileSize, SetStaticValue, SimpleDateTimeReformatter, SplitFieldProcessor, StaxExtractingProcessor, TikaProcessor, TrimValues, UrlEncodeFieldProcessor, WrappingProcessor
public interface DocumentProcessor extends Configurable
-
Field Summary
-
Fields inherited from interface org.jesterj.ingest.model.Configurable
VALID_NAME
-
Method Summary
Modifier and Type    Method    Description
default boolean    isIdempotent()    Indicates if this processor can be executed multiple times, without cumulative external side effects.
default boolean    isPotent()    Indicates a processor for which repeated invocations have cumulative external side effects.
default boolean    isSafe()    Indicates if this processor can be re-executed multiple times safely.
Document[]    processDocument(Document document)    Mutate, validate or transmit a document.
-
Methods inherited from interface org.jesterj.ingest.model.Configurable
getName, isValidName
-
Method Detail
-
processDocument
Document[] processDocument(Document document)
Mutate, validate or transmit a document. Implementations must not throw any Throwable that is not a JVM Error and should be written expecting the possibility that the code might be interrupted at any point. Practically, this means document processors should perform no more than one persistent or externally visible action, and that action should be transactional. Large, complex processors that write to disk, a database, or elsewhere multiple times run the risk of partial completion. Similarly, since JesterJ is a long-running system, it will often cease operation due to unexpected outages (power cord, etc.), so it is not a good idea to hold resources that require an explicit release or "return". "Check then write" is of course a performance anti-pattern with respect to external networked or disk resources, since network and disk I/O are typically slow.
Processors should feel free to set the status of a document and add a status message via Document.setStatus(Status, String, java.io.Serializable...); however, the easiest way to communicate a failure (for which all further processing is in error) is to simply throw a runtime exception. The document processor has no need to add the document to the next step in the plan, as this will be handled by the infrastructure in StepImpl based on the status of the document, so long as the document is emitted via the return value of this method. If the document enters via the parameters and is not emitted for any reason, the processor MUST set an appropriate status before the end of this method, though it is preferable to just set the status and emit it.
- Parameters:
document - the item to process
- Returns:
- The documents that result from the processing in this step. Documents with status of Status.PROCESSING will be processed by subsequent steps, and documents with any other status will have their status recorded and will not be processed by subsequent steps.
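The contract above can be illustrated with a minimal sketch. It does not use the real JesterJ classes: SketchDocument, SketchStatus, SketchProcessor and TrimTitle are hypothetical stand-ins invented here so the example compiles on its own, and the two-argument setStatus is a simplification of the documented Document.setStatus(Status, String, java.io.Serializable...). The point is only the shape of the method: do the work in memory, set a status when the document should stop, and emit the document through the return value either way.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; the real JesterJ types
// (Document, Status, DocumentProcessor) are richer than these.
enum SketchStatus { PROCESSING, DROPPED, ERROR }

final class SketchDocument {
    final Map<String, String> fields = new LinkedHashMap<>();
    SketchStatus status = SketchStatus.PROCESSING;
    String statusMessage = "";

    // Simplified two-argument form (assumption); records status and message.
    void setStatus(SketchStatus newStatus, String message) {
        this.status = newStatus;
        this.statusMessage = message;
    }
}

interface SketchProcessor {
    SketchDocument[] processDocument(SketchDocument document);
}

// Trims a single field in memory. It performs no external writes, so an
// interruption at any point cannot leave partial external state behind.
final class TrimTitle implements SketchProcessor {
    @Override
    public SketchDocument[] processDocument(SketchDocument document) {
        String title = document.fields.get("title");
        if (title == null) {
            // Set a status and still emit the document, so the surrounding
            // infrastructure can record why processing stopped.
            document.setStatus(SketchStatus.DROPPED, "no title field");
        } else {
            document.fields.put("title", title.trim());
        }
        // No hand-off to the next step here; the document is simply returned.
        return new SketchDocument[] { document };
    }
}
```

In this sketch a document that keeps the PROCESSING status would continue to later steps, while one marked DROPPED would only have its status recorded.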
-
isSafe
default boolean isSafe()
Indicates if this processor can be re-executed multiple times safely. The concept is similar to "SAFE" HTTP requests. By default, this will return true unless explicitly overridden.
- Returns:
- true if the execution of this processor will have no externally persistent side effects.
-
isIdempotent
default boolean isIdempotent()
Indicates if this processor can be executed multiple times, without cumulative external side effects. This is similar to the "IDEMPOTENT" concept for HTTP methods. For example, "RecordDocumentSeen" would be idempotent if it set a flag on a database record to true, since any number of repeated invocations would result in the same external state. However, "DecrementBankBalance" would not be idempotent, because repeated invocations continue to change the external state. Be very careful when creating this type of processor that you do not rely on ordering that your JesterJ plan does not guarantee. "SetBankBalance" would also be idempotent, but if repeated, might undo the effect of an intervening "DecrementBankBalance".
- Returns:
- true if the repeated execution of this processor with the same inputs results in a constant external state
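The three named examples can be made concrete with a small sketch. ExternalState is a hypothetical in-memory stand-in for an external system such as a database; it is not part of JesterJ, and the starting balance of 100 is arbitrary.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for external, persistent state (illustration only).
final class ExternalState {
    final Map<String, Boolean> seen = new HashMap<>();
    long balance = 100;

    // Idempotent: any number of calls for the same id leaves the same state.
    void recordDocumentSeen(String id) {
        seen.put(id, true);
    }

    // Not idempotent: every repeated call changes the state again.
    void decrementBankBalance(long amount) {
        balance -= amount;
    }

    // Idempotent in isolation, but a late replay overwrites whatever
    // happened in between, including an intervening decrement.
    void setBankBalance(long value) {
        balance = value;
    }
}
```

Replaying recordDocumentSeen is harmless, replaying decrementBankBalance double-charges, and replaying setBankBalance after a decrement silently restores the old balance, which is the ordering hazard described above.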
-
isPotent
default boolean isPotent()
Indicates a processor for which repeated invocations have cumulative external side effects. "Potent" is a term coined for use in JesterJ, meant to be faster to type and easier to think about than "non-idempotent". Potent processors are the key processors that fault tolerance must avoid repeating, and thus adding potent processors increases the load on the internal Cassandra instance.
- Returns:
- true if the repeated execution of this processor with the same inputs results in cumulative or otherwise inconstant external state.
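As a rough sketch of how the three flags relate, the following uses a hypothetical FlagSketch interface rather than the real DocumentProcessor. Only the isSafe() default of true comes from the documentation above; the defaults shown for isIdempotent() and isPotent() are assumptions made for this example, and the real defaults may differ.

```java
// Hypothetical stand-in mirroring only the three flag methods.
interface FlagSketch {
    // Documented default: true unless overridden.
    default boolean isSafe() { return true; }

    // Assumed default for this sketch: a safe processor is also idempotent.
    default boolean isIdempotent() { return isSafe(); }

    // Assumed default for this sketch: potent means not idempotent.
    default boolean isPotent() { return !isIdempotent(); }
}

// Purely in-memory work: keeps every default.
final class InMemoryTrim implements FlagSketch { }

// Writes externally, but repeats converge on the same state.
final class SetSeenFlag implements FlagSketch {
    @Override public boolean isSafe() { return false; }
    @Override public boolean isIdempotent() { return true; }
}

// Each repeat changes external state further, so it is potent.
final class DecrementBalance implements FlagSketch {
    @Override public boolean isSafe() { return false; }
    @Override public boolean isIdempotent() { return false; }
    @Override public boolean isPotent() { return true; }
}
```

A processor like DecrementBalance is the kind that must not be repeated after a restart, which is why declaring it potent carries the extra tracking cost mentioned above.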
-