Package org.jesterj.ingest.model.impl
Class ScannerImpl
- java.lang.Object
-
- org.jesterj.ingest.model.impl.StepImpl
-
- org.jesterj.ingest.model.impl.ScannerImpl
-
- All Implemented Interfaces:
java.lang.Iterable<Document>,java.lang.Runnable,java.util.Collection<Document>,java.util.concurrent.BlockingQueue<Document>,java.util.Queue<Document>,Active,Configurable,DeferredBuilding,Scanner,Step
- Direct Known Subclasses:
JdbcScanner,SimpleFileScanner
public abstract class ScannerImpl extends StepImpl implements Scanner
A base implementation of a scanner that doesn't do anything. getScanOperation() and Scanner.getDocumentTracker() should be overridden for most implementations.
-
-
Nested Class Summary
Nested Classes
Modifier and Type  Class  Description
static class  ScannerImpl.Builder
class  ScannerImpl.ScanOp  The base, default scan operation.
-
Field Summary
Fields
Modifier and Type  Field  Description
protected java.util.concurrent.atomic.AtomicInteger  activeScans
static java.lang.String  CREATE_DOC_HASH
static java.lang.String  CREATE_FT_KEYSPACE
static java.lang.String  CREATE_FT_TABLE
static java.lang.String  CREATE_INDEX_STATUS
static int  DDL_TIMEOUT
static int  DEF_MAX_ERROR_RETRY
static java.lang.String  FTI_ORIGIN
static java.lang.String  NEW_CONTENT_FOUND_MSG
static java.lang.String  SCAN_ORIGIN
static int  TIMEOUT
-
Fields inherited from interface org.jesterj.ingest.model.Configurable
VALID_NAME
-
Fields inherited from interface org.jesterj.ingest.model.Step
JJ_PLAN_NAME, JJ_PLAN_VERSION
-
-
Constructor Summary
Constructors
Modifier  Constructor  Description
protected  ScannerImpl()
-
Method Summary
All Methods
Modifier and Type  Method  Description
void  activate()  Begin processing.
boolean  add(Document document)
boolean  addAll(java.util.Collection<? extends Document> c)
void  addPredecessor(StepImpl obj)  Register a step as a predecessor of this step (one that might send documents to this step).
void  clear()
boolean  contains(java.lang.Object o)
boolean  containsAll(java.util.Collection<?> c)
void  deactivate()  Stop processing.
boolean  docFound(Document doc)  What to do when a document has been recognized as required for indexing.
int  drainTo(java.util.Collection<? super Document> c)
int  drainTo(java.util.Collection<? super Document> c, int maxElements)
Document  element()
CassandraSupport  getCassandra()
long  getInterval()  The interval for the scanner to fire.
protected org.apache.logging.log4j.Logger  getLogger()
abstract ScannerImpl.ScanOp  getScanOperation()  The default scan operation is to check the cassandra database for records marked dirty or restart and process those records using the scanner's document fetching logic (empty by default)
boolean  isActivePriorSteps()  Determine if any upstream steps are still active.
boolean  isEmpty()
boolean  isHashing()  Indicates if this scanner will consider a hash of the document contents and compare it with a previously recorded value when asking if it has already seen a document.
boolean  isHeuristicallyDirty(Document doc)  Scanners that have a way of detecting dirty data that needs re-indexed can override this method to trigger re-indexing in cases where it would otherwise be skipped.
boolean  isRemembering()  Indicates if this scanner will re-feed documents it has already seen.
boolean  isScanActive()
java.util.Iterator<Document>  iterator()
java.lang.String  keySpace(java.lang.String outputStep)  Calculates the keyspace for this scanner's FTI records.
boolean  offer(Document document)
boolean  offer(Document document, long timeout, java.util.concurrent.TimeUnit unit)
Document  peek()
Document  poll()
Document  poll(long timeout, java.util.concurrent.TimeUnit unit)
protected void  processDirty()
protected void  processPendingDocs(FTIQueryContext ftiQueryContext, java.util.List<Status> statusesToProcess, boolean force)  Force processing of documents in the specified status (except Dirty which will receive normal hash and memory checks) Note: this method scales O(n) with the number of documents returned for each status processed.
void  put(Document document)  Attempt to send the document to this step blocking if the queue for this step is full.
int  remainingCapacity()
Document  remove()
boolean  remove(java.lang.Object o)
boolean  removeAll(java.util.Collection<?> c)
boolean  retainAll(java.util.Collection<?> c)
void  run()
void  scanFinished()  Decrement the active Scans.
void  scanStarted()
void  sendToNext(Document doc)  After processing is complete, send it on to any subsequent steps if appropriate.
void  setCassandra(CassandraSupport cassandra)
protected void  setInterval(long interval)
Document  take()
java.lang.Object[]  toArray()
<T> T[]  toArray(T[] a)
-
Methods inherited from class org.jesterj.ingest.model.impl.StepImpl
addDeferred, executeDeferred, forEach, getBatchSize, getDownstreamOutputSteps, getEligibleNextSteps, getName, getNextSteps, getNextSteps, getOutputDestinationNames, getPatternForStep, getPlan, getPriorSteps, getProcessor, getRouter, isActive, isOutputStep, parallelStream, removeIf, reportException, size, spliterator, stream, toString
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface java.util.Collection
equals, hashCode, parallelStream, removeIf, size, spliterator, stream, toArray
-
Methods inherited from interface org.jesterj.ingest.model.Configurable
getName, isValidName
-
Methods inherited from interface org.jesterj.ingest.model.DeferredBuilding
addDeferred, executeDeferred
-
Methods inherited from interface org.jesterj.ingest.model.Scanner
fetchById, getDocumentTracker, getIdFunction, isScanning
-
Methods inherited from interface org.jesterj.ingest.model.Step
getBatchSize, getDownstreamOutputSteps, getEligibleNextSteps, getNextSteps, getNextSteps, getOutputDestinationNames, getPlan, getPriorSteps, getRouter, isOutputDestinationThisStep, isOutputStep
-
-
-
-
Field Detail
-
SCAN_ORIGIN
public static final java.lang.String SCAN_ORIGIN
- See Also:
- Constant Field Values
-
FTI_ORIGIN
public static final java.lang.String FTI_ORIGIN
- See Also:
- Constant Field Values
-
DEF_MAX_ERROR_RETRY
public static final int DEF_MAX_ERROR_RETRY
-
TIMEOUT
public static final int TIMEOUT
- See Also:
- Constant Field Values
-
NEW_CONTENT_FOUND_MSG
public static final java.lang.String NEW_CONTENT_FOUND_MSG
- See Also:
- Constant Field Values
-
DDL_TIMEOUT
public static final int DDL_TIMEOUT
- See Also:
- Constant Field Values
-
activeScans
protected final java.util.concurrent.atomic.AtomicInteger activeScans
-
CREATE_FT_KEYSPACE
public static final java.lang.String CREATE_FT_KEYSPACE
- See Also:
- Constant Field Values
-
CREATE_FT_TABLE
public static final java.lang.String CREATE_FT_TABLE
- See Also:
- Constant Field Values
-
CREATE_INDEX_STATUS
public static final java.lang.String CREATE_INDEX_STATUS
- See Also:
- Constant Field Values
-
CREATE_DOC_HASH
public static final java.lang.String CREATE_DOC_HASH
- See Also:
- Constant Field Values
-
-
Method Detail
-
activate
public void activate()
Description copied from interface: Active
Begin processing. This is the on switch.
-
deactivate
public void deactivate()
Description copied from interface: Active
Stop processing. This is the stop switch.
- Specified by:
deactivate in interface Active
- Overrides:
deactivate in class StepImpl
-
run
public void run()
-
sendToNext
public void sendToNext(Document doc)
Description copied from interface: Step
After processing is complete, send it on to any subsequent steps if appropriate. This method may inspect the document status and, if the document is not dropped, errored, etc. and there are multiple possible destination steps, it should invoke the router to determine the appropriate destinations and conduct the submission of the results to the indicated steps.
- Specified by:
sendToNext in interface Step
- Overrides:
sendToNext in class StepImpl
- Parameters:
doc - The document for which processing is complete.
-
docFound
public boolean docFound(Document doc)
What to do when a document has been recognized as required for indexing.
- Parameters:
doc - The document to be processed
-
setInterval
protected void setInterval(long interval)
-
isHeuristicallyDirty
public boolean isHeuristicallyDirty(Document doc)
Description copied from interface: Scanner
Scanners that have a way of detecting dirty data that needs re-indexed can override this method to trigger re-indexing in cases where it would otherwise be skipped.
- Specified by:
isHeuristicallyDirty in interface Scanner
- Parameters:
doc - the document to check
- Returns:
- true if indexing is required, false otherwise.
-
getScanOperation
public abstract ScannerImpl.ScanOp getScanOperation()
The default scan operation is to check the Cassandra database for records marked dirty or restart and process those records using the scanner's document fetching logic (empty by default).
- Specified by:
getScanOperation in interface Scanner
- Returns:
- a Runnable object that locates documents.
-
processPendingDocs
protected void processPendingDocs(FTIQueryContext ftiQueryContext, java.util.List<Status> statusesToProcess, boolean force)
Force processing of documents in the specified statuses (except Dirty, which will receive normal hash and memory checks).
Note: this method scales O(n) with the number of documents returned for each status processed. In JesterJ all documents should eventually end up in a terminal status (INDEXED, DEAD, DROPPED, SEARCHABLE). It is very dangerous to pass in any terminal status, because then n is the size of the entire corpus, whereas the transient statuses relate only to "in flight" documents. Thus, so long as plans don't cause an accumulation of never-resolving transients, the FTI system will scale primarily with the number of in-flight documents, and secondarily as Cassandra scales with the number of events seen during the TTL period. Furthermore, that scaling relates only to the scanning for FTI documents; primary processing should be write-only and bound only by Cassandra's write behavior. That's the theory at least :)
- Parameters:
ftiQueryContext - An object providing some context for the FTI queries
statusesToProcess - The list of statuses that we want to reprocess.
force - determines if the document produced should set Document.setForceReprocess(boolean) to true
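One way to act on the warning about terminal statuses is to reject them before any FTI query is issued. The sketch below is illustrative only: PendingStatusGuard and its DocStatus enum are hypothetical stand-ins rather than JesterJ types, and PROCESSING and DIRTY are assumed names for transient states. Only the four terminal names come from the description above.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Hypothetical guard, not part of JesterJ: rejects terminal statuses before an FTI query. */
final class PendingStatusGuard {

  /** Stand-in for JesterJ's Status type; PROCESSING and DIRTY are assumed transient names. */
  enum DocStatus { PROCESSING, DIRTY, INDEXED, DEAD, DROPPED, SEARCHABLE }

  /** The terminal statuses named in the description above. */
  static final Set<DocStatus> TERMINAL =
      EnumSet.of(DocStatus.INDEXED, DocStatus.DEAD, DocStatus.DROPPED, DocStatus.SEARCHABLE);

  /** Throws if any requested status is terminal, since that query would cover the whole corpus. */
  static void requireTransient(List<DocStatus> statusesToProcess) {
    for (DocStatus status : statusesToProcess) {
      if (TERMINAL.contains(status)) {
        throw new IllegalArgumentException(
            status + " is terminal; querying it would touch the entire corpus");
      }
    }
  }
}
```

A subclass could run a check of this kind on statusesToProcess before delegating to processPendingDocs, so that a configuration mistake fails fast instead of triggering a corpus-sized query.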
-
getInterval
public long getInterval()
Description copied from interface: Scanner
The interval for the scanner to fire. Scanner implementations must not begin a new scan more frequently than this interval. There is no guarantee that the scan will begin this frequently, although implementations are encouraged to report, as warnings, any occasions on which scans are started later than this interval would imply. An interval of less than zero indicates that the scanner should only run once.
- Specified by:
getInterval in interface Scanner
- Returns:
- the scan interval. Defaults to 30 minutes
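The interval rule can be illustrated with a small decision function. This is a sketch under stated assumptions: ScanIntervalSketch, mayStartScan and the parameter names are hypothetical, and the use of milliseconds is assumed because the unit is not stated here. Only the two rules (never start sooner than the interval, a negative interval means run once) come from the description above.

```java
/** Hypothetical helper, not part of JesterJ: applies the interval rule described above. */
final class ScanIntervalSketch {

  /**
   * @param intervalMillis  the scan interval; below zero means "run once" (unit is an assumption)
   * @param lastStartMillis when the previous scan started, or -1 if no scan has started yet
   * @param nowMillis       the current time
   * @return true if a new scan may begin now
   */
  static boolean mayStartScan(long intervalMillis, long lastStartMillis, long nowMillis) {
    boolean neverScanned = lastStartMillis < 0;
    if (intervalMillis < 0) {
      // Negative interval: the scanner should only run once.
      return neverScanned;
    }
    // Otherwise never start sooner than one full interval after the previous start.
    return neverScanned || nowMillis - lastStartMillis >= intervalMillis;
  }
}
```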
-
isActivePriorSteps
public boolean isActivePriorSteps()
Description copied from interface: Step
Determine if any upstream steps are still active. A true result implies that documents may yet be received for processing, and it is not safe to shut down the processing thread for this step.
- Specified by:
isActivePriorSteps in interface Step
- Overrides:
isActivePriorSteps in class StepImpl
- Returns:
- true if any immediately prior steps are still active
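The check described above amounts to an "any of" test over the immediately prior steps. A minimal sketch, assuming each prior step can report its own activity through a BooleanSupplier; the class PriorStepsSketch and that representation are hypothetical and not how JesterJ models steps.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch, not the JesterJ implementation. */
final class PriorStepsSketch {

  /**
   * @param priorStepActivity one supplier per immediately prior step, reporting whether it is active
   * @return true if at least one prior step is still active
   */
  static boolean isActivePriorSteps(List<BooleanSupplier> priorStepActivity) {
    return priorStepActivity.stream().anyMatch(BooleanSupplier::getAsBoolean);
  }
}
```

With no prior steps the result is false, which matches the idea that nothing upstream can still deliver documents.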
-
addPredecessor
public void addPredecessor(StepImpl obj)
Description copied from interface: Step
Register a step as a predecessor of this step (one that might send documents to this step).
- Specified by:
addPredecessor in interface Step
- Overrides:
addPredecessor in class StepImpl
- Parameters:
obj - The step to register as a potential upstream source of documents.
-
add
public boolean add(Document document)
-
offer
public boolean offer(Document document)
-
remove
public Document remove()
-
poll
public Document poll()
-
element
public Document element()
-
peek
public Document peek()
-
put
public void put(Document document)
Description copied from class: StepImpl
Attempt to send the document to this step, blocking if the queue for this step is full. This method does NOT guarantee delivery, however, and will return immediately if the destination step is shutting down.
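The "block while full, but give up on shutdown" behavior can be sketched with a bounded queue and an activity flag. Everything here is an assumption for illustration: PutSketch, its active flag and the 10 ms polling period are hypothetical, and unlike the real put (which returns void) this version returns a boolean so the outcome is visible.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a blocking put that gives up when the step shuts down. */
final class PutSketch<T> {
  private final BlockingQueue<T> queue;
  private volatile boolean active = true;

  PutSketch(int capacity) {
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  void deactivate() {
    active = false;
  }

  int size() {
    return queue.size();
  }

  /** @return true if the item was queued, false if the step was shutting down or interrupted */
  boolean put(T item) {
    try {
      while (active) {
        // Wait briefly for space, then re-check the active flag.
        if (queue.offer(item, 10, TimeUnit.MILLISECONDS)) {
          return true;
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
    }
    return false; // no delivery guarantee once shutdown has begun
  }
}
```

Polling with a timed offer, rather than calling the queue's own put, is what lets the caller notice a shutdown while the queue stays full.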
-
offer
public boolean offer(Document document, long timeout, java.util.concurrent.TimeUnit unit)
-
take
public Document take()
-
poll
public Document poll(long timeout, java.util.concurrent.TimeUnit unit)
-
remainingCapacity
public int remainingCapacity()
- Specified by:
remainingCapacity in interface java.util.concurrent.BlockingQueue<Document>
- Overrides:
remainingCapacity in class StepImpl
-
remove
public boolean remove(java.lang.Object o)
-
containsAll
public boolean containsAll(java.util.Collection<?> c)
- Specified by:
containsAll in interface java.util.Collection<Document>
- Overrides:
containsAll in class StepImpl
-
addAll
public boolean addAll(java.util.Collection<? extends Document> c)
-
removeAll
public boolean removeAll(java.util.Collection<?> c)
-
retainAll
public boolean retainAll(java.util.Collection<?> c)
-
clear
public void clear()
-
contains
public boolean contains(java.lang.Object o)
-
iterator
public java.util.Iterator<Document> iterator()
-
toArray
public java.lang.Object[] toArray()
-
toArray
public <T> T[] toArray(T[] a)
-
drainTo
public int drainTo(java.util.Collection<? super Document> c)
-
drainTo
public int drainTo(java.util.Collection<? super Document> c, int maxElements)
-
isEmpty
public boolean isEmpty()
-
getLogger
protected org.apache.logging.log4j.Logger getLogger()
-
isScanActive
public boolean isScanActive()
-
scanStarted
public void scanStarted()
-
scanFinished
public void scanFinished()
Decrement the active scans. While it's possible to do more in an overridden version of this method, be very careful, since it runs in a finally block after the step has been deactivated.
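The pairing of scanStarted() and scanFinished() around a scan can be sketched as follows. ScanCounterSketch and runScan are hypothetical names and this is not the JesterJ implementation; the sketch only assumes that the activeScans counter is incremented at the start of a scan and decremented in a finally block, as the description above suggests.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the scan bookkeeping; not the JesterJ implementation. */
final class ScanCounterSketch {
  private final AtomicInteger activeScans = new AtomicInteger(0);

  void scanStarted() {
    activeScans.incrementAndGet();
  }

  void scanFinished() {
    activeScans.decrementAndGet();
  }

  boolean isScanActive() {
    return activeScans.get() > 0;
  }

  /** Runs one scan; the finally block keeps the count balanced even if the scan throws. */
  void runScan(Runnable scanOperation) {
    scanStarted();
    try {
      scanOperation.run();
    } finally {
      scanFinished();
    }
  }
}
```

Because the decrement runs in a finally block, any extra work added to an overriding scanFinished() also runs on failure paths, which is why the caution above applies.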
-
isRemembering
public boolean isRemembering()
Description copied from interface: Scanner
Indicates if this scanner will re-feed documents it has already seen. This behavior can be modified by the value for Scanner.isHashing().
- Specified by:
isRemembering in interface Scanner
- Returns:
- true if previously indexed documents will be skipped, false if every document will be freshly indexed on every scan.
-
isHashing
public boolean isHashing()
Description copied from interface: Scanner
Indicates if this scanner will consider a hash of the document contents and compare it with a previously recorded value when asking if it has already seen a document. The value of this property has no effect if isRemembering is false. Turning this off can speed up processing of individual docs significantly, if indexing a repository or data source that disallows updating existing documents.
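How isRemembering(), isHashing() and isHeuristicallyDirty(Document) might combine can be sketched as a single decision. This is one reading of the descriptions on this page, not the actual JesterJ logic: SeenCheckSketch, shouldIndex and the boolean inputs seenBefore and hashChanged are hypothetical, and the real checks may differ in order and detail.

```java
/** Hypothetical decision helper; one interpretation of the documented flags, not JesterJ code. */
final class SeenCheckSketch {

  static boolean shouldIndex(boolean remembering, boolean hashing,
                             boolean seenBefore, boolean hashChanged,
                             boolean heuristicallyDirty) {
    if (!remembering) {
      // Nothing is remembered, so every document is fed on every scan (hashing has no effect).
      return true;
    }
    if (!seenBefore || heuristicallyDirty) {
      // New documents, or ones a scanner-specific heuristic flags as dirty.
      return true;
    }
    // Seen before: only a content change detected by hashing triggers re-indexing.
    return hashing && hashChanged;
  }
}
```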
-
getCassandra
public CassandraSupport getCassandra()
-
setCassandra
public void setCassandra(CassandraSupport cassandra)
-
keySpace
public java.lang.String keySpace(java.lang.String outputStep)
Description copied from interface: Scanner
Calculates the keyspace for this scanner's FTI records.
-
processDirty
protected void processDirty()
-
-