Class ScannerImpl

    • Field Detail

      • DEF_MAX_ERROR_RETRY

        public static final int DEF_MAX_ERROR_RETRY
      • NEW_CONTENT_FOUND_MSG

        public static final java.lang.String NEW_CONTENT_FOUND_MSG
        See Also:
        Constant Field Values
      • activeScans

        protected final java.util.concurrent.atomic.AtomicInteger activeScans
      • CREATE_FT_KEYSPACE

        public static final java.lang.String CREATE_FT_KEYSPACE
        See Also:
        Constant Field Values
      • CREATE_INDEX_STATUS

        public static final java.lang.String CREATE_INDEX_STATUS
        See Also:
        Constant Field Values
    • Constructor Detail

      • ScannerImpl

        protected ScannerImpl()
    • Method Detail

      • activate

        public void activate()
        Description copied from interface: Active
        Begin processing. This is the on switch.
        Specified by:
        activate in interface Active
        Overrides:
        activate in class StepImpl
      • deactivate

        public void deactivate()
        Description copied from interface: Active
        Stop processing. This is the stop switch.
        Specified by:
        deactivate in interface Active
        Overrides:
        deactivate in class StepImpl
      • run

        public void run()
        Specified by:
        run in interface java.lang.Runnable
        Overrides:
        run in class StepImpl
      • sendToNext

        public void sendToNext(Document doc)
        Description copied from interface: Step
        After processing is complete, send the document on to any subsequent steps if appropriate. This method may inspect the document status. If the document has not been dropped, errored, etc., and there are multiple possible destination steps, it should invoke the router to determine the appropriate destinations and submit the results to the indicated steps.
        Specified by:
        sendToNext in interface Step
        Overrides:
        sendToNext in class StepImpl
        Parameters:
        doc - The document for which processing is complete.
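        The routing decision described above can be sketched roughly as follows. This is only an illustration under stated assumptions: DocStatus, the step names, and the router function are hypothetical stand-ins, not the JesterJ Document, Step, or Router types.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical stand-ins for illustration; not the JesterJ API.
class SendToNextSketch {
  enum DocStatus { PROCESSING, DROPPED, ERROR }

  /**
   * Returns the names of the steps that should receive the document.
   * The router is consulted only when there is more than one candidate.
   */
  static List<String> destinations(DocStatus status,
                                   List<String> nextSteps,
                                   UnaryOperator<List<String>> router) {
    if (status != DocStatus.PROCESSING || nextSteps.isEmpty()) {
      return List.of();                 // dropped or errored documents go nowhere
    }
    if (nextSteps.size() == 1) {
      return List.of(nextSteps.get(0)); // single destination, no routing needed
    }
    return router.apply(nextSteps);     // several candidates: the router decides
  }
}
```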
      • docFound

        public boolean docFound(Document doc)
        What to do when a document has been recognized as required for indexing.
        Parameters:
        doc - The document to be processed
      • setInterval

        protected void setInterval(long interval)
      • isHeuristicallyDirty

        public boolean isHeuristicallyDirty(Document doc)
        Description copied from interface: Scanner
        Scanners that have a way of detecting dirty data that needs to be re-indexed can override this method to trigger re-indexing in cases where it would otherwise be skipped.
        Specified by:
        isHeuristicallyDirty in interface Scanner
        Parameters:
        doc - the document to check
        Returns:
        true if indexing is required, false otherwise.
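        One way such an override might look is sketched below, assuming the source exposes a last-modified timestamp. The Doc class and the lastIndexed bookkeeping are hypothetical and exist only for this illustration; a real scanner would work with the JesterJ Document type.

```java
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; Doc and the timestamp map are illustrative only.
class HeuristicDirtySketch {
  static final class Doc {
    final String id;
    final Instant lastModified;
    Doc(String id, Instant lastModified) {
      this.id = id;
      this.lastModified = lastModified;
    }
  }

  // When each document id was last indexed (assumed bookkeeping).
  private final Map<String, Instant> lastIndexed = new ConcurrentHashMap<>();

  void recordIndexed(String id, Instant when) {
    lastIndexed.put(id, when);
  }

  /** True only when the source changed after the last recorded indexing. */
  boolean isHeuristicallyDirty(Doc doc) {
    Instant indexedAt = lastIndexed.get(doc.id);
    // Unknown documents are left to the normal checks, so report false here.
    return indexedAt != null && doc.lastModified.isAfter(indexedAt);
  }
}
```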
      • getScanOperation

        public abstract ScannerImpl.ScanOp getScanOperation()
        The default scan operation is to check the Cassandra database for records marked dirty or restart, and to process those records using the scanner's document fetching logic (empty by default).
        Specified by:
        getScanOperation in interface Scanner
        Returns:
        a Runnable object that locates documents.
      • processPendingDocs

        protected void processPendingDocs(FTIQueryContext ftiQueryContext,
                                          java.util.List<Status> statusesToProcess,
                                          boolean force)
        Force processing of documents in the specified statuses (except Dirty, which will receive the normal hash and memory checks).
        Note: this method scales O(n) with the number of documents returned for each status processed. In JesterJ all documents should eventually end up in terminal statuses (INDEXED, DEAD, DROPPED, SEARCHABLE). It is very dangerous to pass in any terminal status, because then n is the size of the entire corpus, whereas the transient statuses relate only to "in flight" documents. Thus, so long as plans don't cause an accumulation of never-resolving transients, the FTI system will scale primarily with the number of in-flight documents, and secondarily as Cassandra scales with the number of events seen during the TTL period. Furthermore, that scaling relates only to the scanning for FTI documents; primary processing should be write-only and bound only by Cassandra's write behavior. That's the theory at least :)
        Parameters:
        ftiQueryContext - An object providing some context for the FTI queries
        statusesToProcess - The list of statuses that we want to reprocess.
        force - determines if the document produced should set Document.setForceReprocess(boolean) to true
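        A caller that wants to guard against the terminal-status hazard described above could check the list before passing it in. The sketch below is an assumption-laden illustration: the Status enum here is a local stand-in that lists only the status names mentioned on this page, and requireTransient is a hypothetical helper, not part of JesterJ.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical guard; Status is a local stand-in, not the JesterJ enum.
class PendingStatusGuard {
  enum Status { DIRTY, RESTART, INDEXED, DEAD, DROPPED, SEARCHABLE }

  // The terminal statuses named in the method description.
  static final Set<Status> TERMINAL =
      EnumSet.of(Status.INDEXED, Status.DEAD, Status.DROPPED, Status.SEARCHABLE);

  /** Returns the list unchanged, or throws if it contains a terminal status. */
  static List<Status> requireTransient(List<Status> requested) {
    for (Status s : requested) {
      if (TERMINAL.contains(s)) {
        // A terminal status would make the query cover the whole corpus.
        throw new IllegalArgumentException("Terminal status not allowed: " + s);
      }
    }
    return requested;
  }
}
```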
      • getInterval

        public long getInterval()
        Description copied from interface: Scanner
        The interval at which the scanner fires. Scanner implementations must not begin a new scan more frequently than this interval. There is no guarantee that scans will begin this frequently, although implementations are encouraged to log a warning whenever a scan starts later than this interval would imply. An interval of less than zero indicates that the scanner should only run once.
        Specified by:
        getInterval in interface Scanner
        Returns:
        the scan interval. Defaults to 30 minutes
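        The interval contract can be illustrated with a small check like the one below. It is a sketch, not the actual scheduling code: the mayStartScan helper and its parameters are hypothetical, and the use of milliseconds is an assumption, since this page does not state the unit.

```java
// Hypothetical sketch of the interval contract; units assumed to be milliseconds.
class IntervalSketch {
  // 30 minutes expressed in the assumed unit.
  static final long DEFAULT_INTERVAL_MILLIS = 30L * 60L * 1000L;

  /** True if a new scan may begin at nowMillis. */
  static boolean mayStartScan(long intervalMillis,
                              long lastStartMillis,
                              boolean hasRunBefore,
                              long nowMillis) {
    if (!hasRunBefore) {
      return true;   // the first scan may always start
    }
    if (intervalMillis < 0) {
      return false;  // negative interval: run once only
    }
    // Never start sooner than one full interval after the previous start.
    return nowMillis - lastStartMillis >= intervalMillis;
  }
}
```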
      • isActivePriorSteps

        public boolean isActivePriorSteps()
        Description copied from interface: Step
        Determine if any upstream steps are still active. A true result implies that documents may yet be received for processing, and it is not safe to shut down the processing thread for this step.
        Specified by:
        isActivePriorSteps in interface Step
        Overrides:
        isActivePriorSteps in class StepImpl
        Returns:
        true if any immediately prior steps are still active
      • addPredecessor

        public void addPredecessor(StepImpl obj)
        Description copied from interface: Step
        Register a step as a predecessor of this step (one that might send documents to this step).
        Specified by:
        addPredecessor in interface Step
        Overrides:
        addPredecessor in class StepImpl
        Parameters:
        obj - The step to register as a potential upstream source of documents.
      • add

        public boolean add(Document document)
        Specified by:
        add in interface java.util.concurrent.BlockingQueue<Document>
        Specified by:
        add in interface java.util.Collection<Document>
        Specified by:
        add in interface java.util.Queue<Document>
        Overrides:
        add in class StepImpl
      • offer

        public boolean offer(Document document)
        Specified by:
        offer in interface java.util.concurrent.BlockingQueue<Document>
        Specified by:
        offer in interface java.util.Queue<Document>
        Overrides:
        offer in class StepImpl
      • put

        public void put(Document document)
        Description copied from class: StepImpl
        Attempt to send the document to this step, blocking if the queue for this step is full. This method does NOT guarantee delivery, however, and will return immediately if the destination step is shutting down.
        Specified by:
        put in interface java.util.concurrent.BlockingQueue<Document>
        Overrides:
        put in class StepImpl
        Parameters:
        document - the element to add
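        The "block while full, but give up on shutdown" behavior could be approximated as shown below. This is a sketch under assumptions, not the StepImpl implementation: the active flag, the 50 ms polling period, and the boolean return value (the documented method returns void) are all choices made for this illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not the StepImpl implementation.
class PutSketch<T> {
  private final BlockingQueue<T> queue;
  private volatile boolean active = true;

  PutSketch(int capacity) {
    this.queue = new ArrayBlockingQueue<>(capacity);
  }

  void deactivate() {
    active = false;
  }

  int size() {
    return queue.size();
  }

  /** Waits for space while active; returns false if the item was not enqueued. */
  boolean put(T item) {
    try {
      while (active) {
        // Poll with a short timeout so a shutdown is noticed promptly.
        if (queue.offer(item, 50, TimeUnit.MILLISECONDS)) {
          return true;
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for callers
    }
    return false; // shutting down or interrupted: delivery is not guaranteed
  }
}
```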
      • offer

        public boolean offer(Document document,
                             long timeout,
                             java.util.concurrent.TimeUnit unit)
        Specified by:
        offer in interface java.util.concurrent.BlockingQueue<Document>
        Overrides:
        offer in class StepImpl
      • take

        public Document take()
        Specified by:
        take in interface java.util.concurrent.BlockingQueue<Document>
        Overrides:
        take in class StepImpl
      • poll

        public Document poll(long timeout,
                             java.util.concurrent.TimeUnit unit)
        Specified by:
        poll in interface java.util.concurrent.BlockingQueue<Document>
        Overrides:
        poll in class StepImpl
      • remainingCapacity

        public int remainingCapacity()
        Specified by:
        remainingCapacity in interface java.util.concurrent.BlockingQueue<Document>
        Overrides:
        remainingCapacity in class StepImpl
      • remove

        public boolean remove(java.lang.Object o)
        Specified by:
        remove in interface java.util.concurrent.BlockingQueue<Document>
        Specified by:
        remove in interface java.util.Collection<Document>
        Overrides:
        remove in class StepImpl
      • containsAll

        public boolean containsAll(java.util.Collection<?> c)
        Specified by:
        containsAll in interface java.util.Collection<Document>
        Overrides:
        containsAll in class StepImpl
      • addAll

        public boolean addAll(java.util.Collection<? extends Document> c)
        Specified by:
        addAll in interface java.util.Collection<Document>
        Overrides:
        addAll in class StepImpl
      • removeAll

        public boolean removeAll(java.util.Collection<?> c)
        Specified by:
        removeAll in interface java.util.Collection<Document>
        Overrides:
        removeAll in class StepImpl
      • retainAll

        public boolean retainAll(java.util.Collection<?> c)
        Specified by:
        retainAll in interface java.util.Collection<Document>
        Overrides:
        retainAll in class StepImpl
      • clear

        public void clear()
        Specified by:
        clear in interface java.util.Collection<Document>
        Overrides:
        clear in class StepImpl
      • contains

        public boolean contains(java.lang.Object o)
        Specified by:
        contains in interface java.util.concurrent.BlockingQueue<Document>
        Specified by:
        contains in interface java.util.Collection<Document>
        Overrides:
        contains in class StepImpl
      • iterator

        public java.util.Iterator<Document> iterator()
        Specified by:
        iterator in interface java.util.Collection<Document>
        Specified by:
        iterator in interface java.lang.Iterable<Document>
        Overrides:
        iterator in class StepImpl
      • toArray

        public java.lang.Object[] toArray()
        Specified by:
        toArray in interface java.util.Collection<Document>
        Overrides:
        toArray in class StepImpl
      • toArray

        public <T> T[] toArray(T[] a)
        Specified by:
        toArray in interface java.util.Collection<Document>
        Overrides:
        toArray in class StepImpl
      • drainTo

        public int drainTo(java.util.Collection<? super Document> c)
        Specified by:
        drainTo in interface java.util.concurrent.BlockingQueue<Document>
        Overrides:
        drainTo in class StepImpl
      • drainTo

        public int drainTo(java.util.Collection<? super Document> c,
                           int maxElements)
        Specified by:
        drainTo in interface java.util.concurrent.BlockingQueue<Document>
        Overrides:
        drainTo in class StepImpl
      • isEmpty

        public boolean isEmpty()
        Specified by:
        isEmpty in interface java.util.Collection<Document>
        Overrides:
        isEmpty in class StepImpl
      • getLogger

        protected org.apache.logging.log4j.Logger getLogger()
        Overrides:
        getLogger in class StepImpl
      • isScanActive

        public boolean isScanActive()
      • scanStarted

        public void scanStarted()
      • scanFinished

        public void scanFinished()
        Decrement the count of active scans. While it's possible to do more in an overridden version of this method, be very careful, since it runs in a finally block after the step has been deactivated.
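        The counting pattern implied by scanStarted(), scanFinished(), and isScanActive() might look like the sketch below. The runScan wrapper and the "count greater than zero" rule for isScanActive are assumptions made for illustration; only the AtomicInteger field and the finally-block placement come from this page.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the active-scan counter; not the ScannerImpl source.
class ScanCounterSketch {
  private final AtomicInteger activeScans = new AtomicInteger();

  void scanStarted() {
    activeScans.incrementAndGet();
  }

  void scanFinished() {
    activeScans.decrementAndGet();
  }

  boolean isScanActive() {
    return activeScans.get() > 0; // assumed definition of "active"
  }

  /** Runs a scan so that the counter is decremented even if the scan throws. */
  void runScan(Runnable scan) {
    scanStarted();
    try {
      scan.run();
    } finally {
      scanFinished(); // keep this cheap: the step may already be deactivated
    }
  }
}
```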
      • isRemembering

        public boolean isRemembering()
        Description copied from interface: Scanner
        Indicates whether this scanner remembers documents it has already seen, so that it does not re-feed them. This behavior can be modified by the value of Scanner.isHashing().
        Specified by:
        isRemembering in interface Scanner
        Returns:
        true if previously indexed documents will be skipped, false if every document will be freshly indexed on every scan.
      • isHashing

        public boolean isHashing()
        Description copied from interface: Scanner
        Indicates if this scanner will consider a hash of the document contents and compare it with a previously recorded value when asking if it has already seen a document. The value of this property has no effect if isRemembering is false. Turning this off can speed up processing of individual docs significantly, if indexing a repository or data source that disallows updating existing documents.
        Specified by:
        isHashing in interface Scanner
        Returns:
        True if a hash value should be calculated and compared to the previously stored value.
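        How isRemembering() and isHashing() interact can be summarized in a single decision function. The sketch below is an interpretation of the two descriptions above, not the actual ScannerImpl logic; shouldIndex and its parameters are hypothetical names.

```java
import java.util.Objects;

// Hypothetical decision table for remembering and hashing; illustrative only.
class RememberingSketch {
  static boolean shouldIndex(boolean remembering,
                             boolean hashing,
                             boolean seenBefore,
                             String storedHash,
                             String currentHash) {
    if (!remembering) {
      return true;   // nothing is skipped; hashing has no effect
    }
    if (!seenBefore) {
      return true;   // new documents are always indexed
    }
    if (!hashing) {
      return false;  // seen before and contents are not compared: skip
    }
    // Seen before: re-index only when the contents have changed.
    return !Objects.equals(storedHash, currentHash);
  }
}
```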
      • keySpace

        public java.lang.String keySpace(java.lang.String outputStep)
        Description copied from interface: Scanner
        Calculates the keyspace for this scanner's FTI records.
        Specified by:
        keySpace in interface Scanner
        Parameters:
        outputStep - the step for which we want to generate a keyspace
        Returns:
        a keyspace name encoding the scanner name, plan name and plan version.
      • processDirty

        protected void processDirty()