Package org.jesterj.ingest.model
Interface Scanner
-
- All Superinterfaces:
Active, java.util.concurrent.BlockingQueue<Document>, java.util.Collection<Document>, Configurable, DeferredBuilding, java.lang.Iterable<Document>, java.util.Queue<Document>, java.lang.Runnable, Step
- All Known Implementing Classes:
JdbcScanner, ScannerImpl, SimpleFileScanner
public interface Scanner extends Step
Monitors a document source for changes on a regular basis. When new files are found, they are submitted to the supplied queue. Note that Scanners do not normally support the methods from BlockingQueue, since they normally only output documents and never receive them. These methods may throw UnsupportedOperationException.
-
-
Field Summary
-
Fields inherited from interface org.jesterj.ingest.model.Configurable
VALID_NAME
-
Fields inherited from interface org.jesterj.ingest.model.Step
JJ_PLAN_NAME, JJ_PLAN_VERSION
-
-
Method Summary
java.util.Optional<Document> fetchById(java.lang.String id, java.lang.String origination)
Load a document based on the document's id.
-
default java.util.function.Consumer<Document> getDocumentTracker()
Get a procedure that takes a document and uses this information to persist a record that this document has been scanned.
-
default java.util.function.Function<java.lang.String,java.lang.String> getIdFunction()
A function that can be used to provide a custom transformation of the identifier generated by the scanning process.
-
long getInterval()
The interval for the scanner to fire.
-
java.lang.Runnable getScanOperation()
A callback that calls docFound() on the scanner when a document is found that needs to be indexed.
-
boolean isHashing()
Indicates if this scanner will consider a hash of the document contents and compare it with a previously recorded value when asking if it has already seen a document.
-
boolean isHeuristicallyDirty(Document doc)
Scanners that have a way of detecting dirty data that needs to be re-indexed can override this method to trigger re-indexing in cases where it would otherwise be skipped.
-
boolean isRemembering()
Indicates whether this scanner remembers documents it has already seen, so that it does not re-feed them.
-
boolean isScanning()
True if a new scan may be started.
-
java.lang.String keySpace(java.lang.String outputStep)
Calculates the keyspace for this scanner's FTI records.
-
Methods inherited from interface org.jesterj.ingest.model.Active
activate, deactivate, isActive
-
Methods inherited from interface java.util.concurrent.BlockingQueue
add, contains, drainTo, drainTo, offer, offer, poll, put, remainingCapacity, remove, take
-
Methods inherited from interface java.util.Collection
addAll, clear, containsAll, equals, hashCode, isEmpty, iterator, parallelStream, removeAll, removeIf, retainAll, size, spliterator, stream, toArray, toArray, toArray
-
Methods inherited from interface org.jesterj.ingest.model.Configurable
getName, isValidName
-
Methods inherited from interface org.jesterj.ingest.model.DeferredBuilding
addDeferred, executeDeferred
-
Methods inherited from interface org.jesterj.ingest.model.Step
addPredecessor, getBatchSize, getDownstreamOutputSteps, getEligibleNextSteps, getNextSteps, getNextSteps, getOutputDestinationNames, getPlan, getPriorSteps, getRouter, isActivePriorSteps, isOutputDestinationThisStep, isOutputStep, sendToNext
-
Method Detail
-
getIdFunction
default java.util.function.Function<java.lang.String,java.lang.String> getIdFunction()
A function that can be used to provide a custom transformation of the identifier generated by the scanning process. The default transformation is an identity transform.
- Returns:
the function to map ID to new ID
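As a rough illustration of the kind of function this method might return, the sketch below uses only java.util.function.Function. The class name IdFunctionSketch, the method names, and the "docs:" prefix are hypothetical and are not part of the JesterJ API.

```java
import java.util.Locale;
import java.util.function.Function;

// Hypothetical illustration only; not part of the JesterJ API.
final class IdFunctionSketch {

  // Mirrors the documented default: the ID is returned unchanged.
  static Function<String, String> defaultIdFunction() {
    return Function.identity();
  }

  // An assumed custom transform: lower-case the scanned ID and add a prefix.
  static Function<String, String> prefixedIdFunction(String prefix) {
    return id -> prefix + id.toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    System.out.println(defaultIdFunction().apply("/Data/A.txt"));
    System.out.println(prefixedIdFunction("docs:").apply("/Data/A.txt"));
  }
}
```

A scanner that overrides getIdFunction() could return something like prefixedIdFunction("docs:") to normalize IDs before they are used downstream.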
-
getDocumentTracker
default java.util.function.Consumer<Document> getDocumentTracker()
Get a procedure that takes a document and uses this information to persist a record that this document has been scanned. Typical implementations might be writing a status to Cassandra, updating a row in a database, or renaming the target file. By convention, the first element of the object array passed will be a String identifier, and the second argument will be the document object. Subsequent arguments are unrestricted. The default implementation is a no-op.
- Returns:
a Consumer that consumes data about a document and has the side effect of persisting a record that the document was scanned.
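The following is a minimal sketch of a tracker, assuming an in-memory set in place of real persistence. SketchDoc is a hypothetical stand-in for Document that exposes only getId(); TrackerSketch and its methods are likewise illustrative and not part of JesterJ.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical illustration only; SketchDoc stands in for JesterJ's Document.
final class TrackerSketch {

  // Minimal stand-in exposing only an ID.
  static final class SketchDoc {
    private final String id;
    SketchDoc(String id) { this.id = id; }
    String getId() { return id; }
  }

  // In-memory record of scanned IDs; a real tracker would persist this
  // somewhere durable (a database row, a renamed file, and so on).
  private final Set<String> scanned = ConcurrentHashMap.newKeySet();

  // Mirrors the documented default: do nothing.
  static Consumer<SketchDoc> noOpTracker() {
    return doc -> { };
  }

  // Records the document's ID as a side effect of consuming the document.
  Consumer<SketchDoc> recordingTracker() {
    return doc -> scanned.add(doc.getId());
  }

  boolean wasScanned(String id) {
    return scanned.contains(id);
  }
}
```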
-
isHeuristicallyDirty
boolean isHeuristicallyDirty(Document doc)
Scanners that have a way of detecting dirty data that needs to be re-indexed can override this method to trigger re-indexing in cases where it would otherwise be skipped.
- Parameters:
doc - the document to check
- Returns:
true if indexing is required, false otherwise.
-
getScanOperation
java.lang.Runnable getScanOperation()
A callback that calls docFound() on the scanner when a document is found that needs to be indexed. The callback should call ScannerImpl.scanStarted() when it starts doing work, and ScannerImpl.scanFinished() when it has completed any work for which concurrency might be relevant.
- Returns:
a Runnable object that locates documents.
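One way to picture the calling pattern described above is the sketch below. It is self-contained and does not use ScannerImpl: the scanStarted(), scanFinished() and docFound() methods here are simplified stand-ins with assumed signatures that only log an event, and the list of IDs is a hypothetical document source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; the three callbacks are simplified
// stand-ins and do not reproduce ScannerImpl's real signatures.
final class ScanOperationSketch {
  final List<String> events = new ArrayList<>();
  private final List<String> source;

  ScanOperationSketch(List<String> source) {
    this.source = source;
  }

  void scanStarted() { events.add("started"); }
  void scanFinished() { events.add("finished"); }
  void docFound(String id) { events.add("found:" + id); }

  // The scan operation brackets its work with scanStarted()/scanFinished()
  // and reports each located document through docFound().
  Runnable getScanOperation() {
    return () -> {
      scanStarted();
      try {
        for (String id : source) {
          docFound(id);
        }
      } finally {
        // Signal completion even if locating documents fails part way.
        scanFinished();
      }
    };
  }
}
```

Placing scanFinished() in a finally block is a design choice of this sketch, so that a failed scan does not leave the scanner looking permanently busy.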
-
getInterval
long getInterval()
The interval for the scanner to fire. Scanner implementations must not begin a new scan more frequently than this interval. There is no guarantee that the scan will begin this frequently, although implementations are encouraged to report as warnings any occasions on which scans start later than this interval would imply. An interval of less than zero indicates that the scanner should only run once.
- Returns:
the scan interval. Defaults to 30 minutes
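The interval contract can be sketched as two small checks. This is an illustration under stated assumptions, not JesterJ code: the class and method names are hypothetical, the interval is assumed to be in milliseconds (the unit is not stated here), and the tolerance parameter is invented for the example.

```java
// Hypothetical illustration of the interval contract; not JesterJ code.
final class IntervalSketch {

  // Returns true if a scan may begin now.
  // Assumption: all values are in milliseconds.
  static boolean mayStartScan(long intervalMillis, long lastStartMillis,
                              long nowMillis, boolean hasRunBefore) {
    if (!hasRunBefore) {
      return true; // the first scan is always allowed
    }
    if (intervalMillis < 0) {
      return false; // a negative interval means "run once only"
    }
    // Never start more frequently than the interval.
    return nowMillis - lastStartMillis >= intervalMillis;
  }

  // Returns true if this start is later than the interval implies, which an
  // implementation is encouraged to report as a warning.
  static boolean startedLate(long intervalMillis, long lastStartMillis,
                             long nowMillis, long toleranceMillis) {
    return intervalMillis >= 0
        && nowMillis - lastStartMillis > intervalMillis + toleranceMillis;
  }
}
```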
-
isScanning
boolean isScanning()
True if a new scan may be started. Implementations may choose not to start a new scan until the old one has completed. This value is independent of Active.isActive().
- Returns:
true if a new scan should be started
-
fetchById
java.util.Optional<Document> fetchById(java.lang.String id, java.lang.String origination)
Load a document based on the document's id.
- Parameters:
id - the id of the document, see also Document.getId()
origination - A constant indicating the source (scanner or fti) for debugging
- Returns:
An optional that contains the document if it is possible to retrieve the document by ID
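The Optional return type lets callers distinguish "retrievable" from "not retrievable" without null checks. The sketch below assumes a simple in-memory map as the backing source and represents a document as a String; FetchByIdSketch is hypothetical, and the origination values passed in the usage note are placeholders, not the real constants.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only; a String stands in for a Document.
final class FetchByIdSketch {
  private final Map<String, String> store = new HashMap<>();

  void put(String id, String content) {
    store.put(id, content);
  }

  // origination is accepted only for debugging, as in the documented
  // signature; this sketch does not otherwise use it.
  Optional<String> fetchById(String id, String origination) {
    // An empty Optional signals that the document cannot be retrieved by ID.
    return Optional.ofNullable(store.get(id));
  }
}
```

A caller might then write something like fetchById("a", "scanner").ifPresent(doc -> process(doc)), where process is whatever handling the caller needs.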
-
isRemembering
boolean isRemembering()
Indicates whether this scanner remembers documents it has already seen, so that it does not re-feed them. This behavior can be modified by the value for isHashing().
- Returns:
true if previously indexed documents will be skipped, false if every document will be freshly indexed on every scan.
-
isHashing
boolean isHashing()
Indicates if this scanner will consider a hash of the document contents and compare it with a previously recorded value when asking if it has already seen a document. The value of this property has no effect if isRemembering is false. Turning this off can speed up processing of individual docs significantly, if indexing a repository or data source that disallows updating existing documents.
- Returns:
True if a hash value should be calculated and compared to the previously stored value.
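How isRemembering(), isHashing() and isHeuristicallyDirty(Document) might combine into a single "feed or skip" decision can be sketched as follows. This is one plausible reading of the descriptions above, not the actual JesterJ logic; RememberingSketch, shouldFeed and markSeen are hypothetical names, and the hash is passed in as a precomputed String.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; one plausible reading of how the flags
// interact, not the actual JesterJ implementation.
final class RememberingSketch {
  private final boolean remembering;
  private final boolean hashing;
  // Document ID mapped to the content hash recorded when it was last seen.
  private final Map<String, String> seen = new HashMap<>();

  RememberingSketch(boolean remembering, boolean hashing) {
    this.remembering = remembering;
    this.hashing = hashing;
  }

  void markSeen(String id, String contentHash) {
    seen.put(id, contentHash);
  }

  boolean shouldFeed(String id, String contentHash, boolean heuristicallyDirty) {
    if (!remembering) {
      return true; // nothing is remembered, so every document is fed
    }
    if (heuristicallyDirty) {
      return true; // the scanner has its own reason to re-index
    }
    if (!seen.containsKey(id)) {
      return true; // never seen before
    }
    // Seen before: feed again only if hashing is on and the content changed.
    return hashing && !contentHash.equals(seen.get(id));
  }
}
```

With hashing off, a previously seen document is skipped without comparing contents, which is where the speed-up described above would come from in this reading.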
-
keySpace
java.lang.String keySpace(java.lang.String outputStep)
Calculates the keyspace for this scanner's FTI records.
- Parameters:
outputStep - the step for which we want to generate a keyspace
- Returns:
a keyspace name encoding the scanner name, plan name and plan version.
-