Interface Scanner

  • All Superinterfaces:
    Active, java.util.concurrent.BlockingQueue<Document>, java.util.Collection<Document>, Configurable, DeferredBuilding, java.lang.Iterable<Document>, java.util.Queue<Document>, java.lang.Runnable, Step
    All Known Implementing Classes:
    JdbcScanner, ScannerImpl, SimpleFileScanner

    public interface Scanner
    extends Step
    Monitors a document source for changes on a regular basis. When new files are found, they are submitted to the supplied queue. Note that Scanners normally do not support the methods from BlockingQueue, since they normally only output documents and never receive them. These methods may throw UnsupportedOperationException.
    • Method Detail

      • getIdFunction

        default java.util.function.Function<java.lang.String, java.lang.String> getIdFunction()
        A function that can be used to provide a custom transformation of the identifier generated by the scanning process. The default transformation is an identity transform.
        Returns:
        the function to map ID to new ID
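As a rough illustration of what such a function could look like, the sketch below shows an identity transform (mirroring the documented default) next to one possible custom transform. The class `IdFunctionSketch`, both method names, and the prefixing scheme are hypothetical and are not part of the documented API.

```java
import java.util.function.Function;

// Hypothetical helper class, used only for illustration.
final class IdFunctionSketch {

    // Mirrors the documented default: the scanned identifier is returned unchanged.
    static Function<String, String> identityIdFunction() {
        return Function.identity();
    }

    // One possible custom transformation (an assumption of this sketch):
    // prefix every identifier with a label naming its source.
    static Function<String, String> prefixingIdFunction(String prefix) {
        return id -> prefix + id;
    }

    public static void main(String[] args) {
        System.out.println(identityIdFunction().apply("docs/a.txt"));
        System.out.println(prefixingIdFunction("file:").apply("docs/a.txt"));
    }
}
```

A scanner implementation could presumably return a function like `prefixingIdFunction("file:")` from getIdFunction() to keep identifiers from different sources distinct.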
      • getDocumentTracker

        default java.util.function.Consumer<Document> getDocumentTracker()
        Gets a procedure that takes a document and uses this information to persist a record that this document has been scanned. Typical implementations might write a status to Cassandra, update a row in a database, or rename the target file. By convention, the first element of the object array passed will be a String identifier, and the second argument will be the document object. Subsequent arguments are unrestricted. The default implementation is a no-op.
        Returns:
        a Consumer that consumes data about a document and has the side effect of persisting a record that the document was scanned.
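The following is a minimal sketch of a tracker, assuming an in-memory set stands in for the persistent store (Cassandra, a database row, or a renamed file). `DocLike` is a hypothetical stand-in for the real Document type, and only a `getId()` accessor is assumed; `DocumentTrackerSketch` and its methods are likewise invented for illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical helper class, used only for illustration.
final class DocumentTrackerSketch {

    // Stand-in for the real Document type; only getId() is assumed here.
    interface DocLike {
        String getId();
    }

    // In-memory substitute for a persistent record of scanned documents.
    private final Set<String> scannedIds = ConcurrentHashMap.newKeySet();

    // Matches the documented default behavior: do nothing.
    static Consumer<DocLike> noOpTracker() {
        return doc -> { };
    }

    // Records the id of each document it is handed.
    Consumer<DocLike> inMemoryTracker() {
        return doc -> scannedIds.add(doc.getId());
    }

    boolean wasScanned(String id) {
        return scannedIds.contains(id);
    }

    public static void main(String[] args) {
        DocumentTrackerSketch sketch = new DocumentTrackerSketch();
        sketch.inMemoryTracker().accept(() -> "doc-1");
        System.out.println(sketch.wasScanned("doc-1"));
    }
}
```

A real tracker would replace the set with a durable write, so that the record survives a restart of the scanner.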
      • isHeuristicallyDirty

        boolean isHeuristicallyDirty(Document doc)
        Scanners that have a way of detecting dirty data that needs to be re-indexed can override this method to trigger re-indexing in cases where it would otherwise be skipped.
        Parameters:
        doc - the document to check
        Returns:
        true if indexing is required, false otherwise.
      • getScanOperation

        java.lang.Runnable getScanOperation()
        A callback that calls docFound() on the scanner when a document is found that needs to be indexed. The callback should call ScannerImpl.scanStarted() when it starts doing work, and ScannerImpl.scanFinished() when it has completed any work for which concurrency might be relevant.

        Returns:
        a Runnable object that locates documents.
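The calling sequence described above can be sketched as follows. The `ScanCallbacks` interface is a hypothetical stand-in for the scanner's own scanStarted(), docFound() and scanFinished() methods, and docFound() is assumed here to take a plain String id rather than a Document. Wrapping the work in try/finally is a design choice of this sketch, not something the documentation requires.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, used only for illustration.
final class ScanOperationSketch {

    // Stand-in for the scanner methods named in the description above.
    interface ScanCallbacks {
        void scanStarted();
        void docFound(String docId);
        void scanFinished();
    }

    // Builds a Runnable that reports start, each found document, then finish.
    static Runnable scanOperation(List<String> candidateIds, ScanCallbacks callbacks) {
        return () -> {
            callbacks.scanStarted();
            try {
                for (String id : candidateIds) {
                    callbacks.docFound(id);
                }
            } finally {
                // Signal completion even if locating a document failed.
                callbacks.scanFinished();
            }
        };
    }

    // Records the order of callback invocations so it can be inspected.
    static final class RecordingCallbacks implements ScanCallbacks {
        final List<String> events = new ArrayList<>();
        public void scanStarted() { events.add("started"); }
        public void docFound(String docId) { events.add("found:" + docId); }
        public void scanFinished() { events.add("finished"); }
    }

    public static void main(String[] args) {
        RecordingCallbacks recorder = new RecordingCallbacks();
        scanOperation(List.of("a", "b"), recorder).run();
        System.out.println(recorder.events);
    }
}
```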
      • getInterval

        long getInterval()
        The interval at which the scanner fires. Scanner implementations must not begin a new scan more frequently than this interval. There is no guarantee that scans will begin this frequently, although implementations are encouraged to log a warning whenever a scan starts later than this interval would imply. An interval of less than zero indicates that the scanner should only run once.
        Returns:
        the scan interval. Defaults to 30 minutes
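The interval rules above could be checked along these lines. This is a sketch only: the time unit (milliseconds), the class `ScanIntervalSketch`, and both method signatures are assumptions, since the documentation does not state the unit or how implementations track the previous scan.

```java
// Hypothetical helper class, used only for illustration.
final class ScanIntervalSketch {

    // Decides whether a new scan may begin.
    // hasRunBefore: whether any scan has been started so far.
    // elapsedMillis: time since the previous scan started (ignored on the first scan).
    static boolean mayStartScan(long intervalMillis, boolean hasRunBefore, long elapsedMillis) {
        if (!hasRunBefore) {
            return true; // the first scan is always allowed
        }
        if (intervalMillis < 0) {
            return false; // negative interval: run once only
        }
        return elapsedMillis >= intervalMillis; // never more often than the interval
    }

    // True when a repeat scan starts later than the interval implies,
    // which is the situation implementations are encouraged to warn about.
    static boolean startedLate(long intervalMillis, long elapsedMillis, long toleranceMillis) {
        return intervalMillis >= 0 && elapsedMillis > intervalMillis + toleranceMillis;
    }

    public static void main(String[] args) {
        System.out.println(mayStartScan(1000, true, 999));
        System.out.println(mayStartScan(1000, true, 1000));
        System.out.println(mayStartScan(-1, true, 5000));
    }
}
```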
      • isScanning

        boolean isScanning()
        True if a new scan may be started. Implementations may choose not to start a new scan until the old one has completed. This value is independent of Active.isActive().
        Returns:
        true if a new scan should be started
      • fetchById

        java.util.Optional<Document> fetchById(java.lang.String id,
                                               java.lang.String origination)
        Load a document based on the document's id.
        Parameters:
        id - the id of the document, see also Document.getId()
        origination - A constant indicating the source (scanner or fti) for debugging
        Returns:
        An optional that contains the document if it is possible to retrieve the document by ID
      • isRemembering

        boolean isRemembering()
        Indicates whether this scanner remembers documents it has already seen and avoids re-feeding them. This behavior can be modified by the value of isHashing().
        Returns:
        true if previously indexed documents will be skipped, false if every document will be freshly indexed on every scan.
      • isHashing

        boolean isHashing()
        Indicates whether this scanner computes a hash of the document contents and compares it with a previously recorded value when deciding if it has already seen a document. The value of this property has no effect if isRemembering() is false. Turning this off can significantly speed up processing of individual documents when indexing a repository or data source that disallows updating existing documents.
        Returns:
        True if a hash value should be calculated and compared to the previously stored value.
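One way to read the interaction between isRemembering(), isHashing() and isHeuristicallyDirty() is sketched below. The class `RefeedDecisionSketch`, the method `shouldFeed`, and the order of the checks are assumptions made for illustration; actual implementations may combine these flags differently.

```java
// Hypothetical helper class, used only for illustration.
final class RefeedDecisionSketch {

    // Decides whether a document found during a scan should be fed for indexing.
    static boolean shouldFeed(boolean remembering,
                              boolean hashing,
                              boolean seenBefore,
                              boolean hashChanged,
                              boolean heuristicallyDirty) {
        if (!remembering) {
            return true; // nothing is remembered, so every document is indexed afresh
        }
        if (!seenBefore) {
            return true; // new documents are always fed
        }
        if (heuristicallyDirty) {
            return true; // dirty data forces re-indexing that would otherwise be skipped
        }
        // A previously seen document is re-fed only when hashing is on
        // and its content hash differs from the stored value.
        return hashing && hashChanged;
    }

    public static void main(String[] args) {
        System.out.println(shouldFeed(true, true, true, false, false));
        System.out.println(shouldFeed(true, true, true, true, false));
        System.out.println(shouldFeed(false, false, true, false, false));
    }
}
```

With hashing off, a remembered document is skipped without computing a hash at all, which is where the speed-up described above would come from.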
      • keySpace

        java.lang.String keySpace(java.lang.String outputStep)
        Calculates the keyspace for this scanner's FTI records.
        Parameters:
        outputStep - the step for which we want to generate a keyspace
        Returns:
        a keyspace name encoding the scanner name, plan name and plan version.