Class DeDuplicatingTokenFilter

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public class DeDuplicatingTokenFilter
    extends FilteringTokenFilter
    Inspects token streams for duplicate sequences of tokens. Token sequences must meet a minimum length; 6 is a good heuristic because it avoids filtering common idioms and phrases while still detecting the longer sections that are typical of cut-and-paste copies of text.

    Internally, each token is hashed and reduced modulo 256 to a single byte (so there are 256 possible values per token) and then recorded in a trie of seen byte sequences using a DuplicateByteSequenceSpotter. The trie is passed into the TokenFilter constructor, so a single spotter object can be reused across multiple documents.
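    The hashing-and-trie idea can be sketched in plain Java. This is a minimal illustration, not the real DuplicateByteSequenceSpotter: the class name, the window-based `record` method, and the nested-map trie are all simplifications introduced here for clarity.

    ```java
    import java.util.HashMap;
    import java.util.Map;

    // Minimal sketch of the spotting technique (not the real
    // DuplicateByteSequenceSpotter): each token is truncated to one byte,
    // and fixed-length byte windows are recorded in a trie so repeated
    // sequences can be counted across documents.
    public class SequenceSpotter {
        static final int MIN_SEQUENCE_LENGTH = 6; // heuristic from the docs above

        // Trie node: children keyed by the next byte, plus a visit count
        // for the sequence ending at this node.
        static class Node {
            final Map<Byte, Node> children = new HashMap<>();
            int sightings = 0;
        }

        private final Node root = new Node();

        // Truncate the token's hash to a single byte (256 possible values).
        static byte hashToken(String token) {
            return (byte) token.hashCode();
        }

        // Records one window of MIN_SEQUENCE_LENGTH token-hash bytes and
        // returns how many times this exact sequence had been seen before.
        int record(byte[] window) {
            Node node = root;
            for (byte b : window) {
                node = node.children.computeIfAbsent(b, k -> new Node());
            }
            return node.sightings++;
        }
    }
    ```

    Because the spotter holds all state in its own trie rather than in the filter, the same instance can be shared across the token streams of many documents, which is what makes cross-document duplicate detection possible.
    
    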

    The emitDuplicates setting controls whether duplicate tokens are filtered from the output or emitted. When emitDuplicates is true, the DuplicateSequenceAttribute can be used to inspect the number of prior sightings of each token's sequence.
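    The two modes can be illustrated with a small self-contained simulation. This is an assumed model of the behaviour, not the real filter: the `Token` record with its `priorSightings` count stands in for a token annotated via DuplicateSequenceAttribute.

    ```java
    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the emitDuplicates switch (assumed semantics, not the real
    // filter). Each token carries a prior-sightings count; in filter mode,
    // tokens from previously seen sequences are dropped, while in emit mode
    // they are kept so a consumer can inspect the count.
    public class EmitDuplicatesDemo {
        // Hypothetical stand-in for a token plus its DuplicateSequenceAttribute.
        public record Token(String text, int priorSightings) {}

        public static List<String> process(List<Token> tokens, boolean emitDuplicates) {
            List<String> out = new ArrayList<>();
            for (Token t : tokens) {
                // A token is a duplicate if its sequence was seen before.
                if (t.priorSightings() == 0 || emitDuplicates) {
                    out.add(t.text());
                }
            }
            return out;
        }
    }
    ```

    With emitDuplicates false the filter behaves like any other FilteringTokenFilter and silently drops duplicate tokens; with emitDuplicates true, every token passes through and downstream components decide what to do with the sighting counts.
    
    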