Class DeDuplicatingTokenFilter
- java.lang.Object
  - org.apache.lucene.util.AttributeSource
    - org.apache.lucene.analysis.TokenStream
      - org.apache.lucene.analysis.TokenFilter
        - org.apache.lucene.analysis.FilteringTokenFilter
          - org.apache.lucene.analysis.miscellaneous.DeDuplicatingTokenFilter
- All Implemented Interfaces:
Closeable, AutoCloseable
public class DeDuplicatingTokenFilter extends FilteringTokenFilter
Inspects token streams for duplicate sequences of tokens. Token sequences have a minimum length; 6 is a good heuristic because it avoids filtering common idioms and phrases while still detecting the longer sections that are typical of cut-and-paste copies of text.
Internally, each token is hashed and reduced modulo 256 to a single byte (so there are 256 possible values for each token), and the resulting byte sequences are recorded in a trie of seen byte sequences using a DuplicateByteSequenceSpotter. This trie is passed into the TokenFilter constructor so that a single object can be reused across multiple documents.
The emitDuplicates setting controls whether duplicate tokens are filtered from the results or emitted. When emitDuplicates is true, the DuplicateSequenceAttribute attribute can be used to inspect the number of prior sightings.
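The hashing and trie bookkeeping described above can be illustrated with a small, library-free sketch. This is not the DuplicateByteSequenceSpotter implementation: the class name DuplicateSequenceSketch, the use of String.hashCode() as the hash, and the map-based trie nodes are all assumptions made for illustration. It reduces each token to one byte, walks a trie for every window of 6 tokens, and reports how often that window has been seen before.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified illustration only; hypothetical names, not the Lucene classes. */
class DuplicateSequenceSketch {
    /** Minimum sequence length, following the heuristic of 6 mentioned above. */
    static final int MIN_SEQUENCE_LENGTH = 6;

    /** Trie node: one child per byte value, plus a count of completed sequences. */
    static final class Node {
        final Map<Byte, Node> children = new HashMap<>();
        int sightings;
    }

    private final Node root = new Node();

    /** Reduces a token to one of 256 values (assumed hash: String.hashCode()). */
    static byte hashToByte(String token) {
        return (byte) Math.floorMod(token.hashCode(), 256);
    }

    /**
     * Records every window of MIN_SEQUENCE_LENGTH tokens in the trie and
     * returns, for each window start, how many times that byte sequence
     * had been recorded before this window.
     */
    int[] recordDocument(List<String> tokens) {
        int windows = Math.max(0, tokens.size() - MIN_SEQUENCE_LENGTH + 1);
        int[] priorSightings = new int[windows];
        for (int start = 0; start < windows; start++) {
            Node node = root;
            for (int i = 0; i < MIN_SEQUENCE_LENGTH; i++) {
                byte b = hashToByte(tokens.get(start + i));
                node = node.children.computeIfAbsent(b, k -> new Node());
            }
            priorSightings[start] = node.sightings;
            node.sightings++;
        }
        return priorSightings;
    }

    public static void main(String[] args) {
        // One sketch instance is shared across documents, mirroring the reuse
        // of a single spotter object described above.
        DuplicateSequenceSketch sketch = new DuplicateSequenceSketch();
        List<String> doc =
            Arrays.asList("the quick brown fox jumps over the lazy dog".split(" "));
        System.out.println(Arrays.toString(sketch.recordDocument(doc)));
        System.out.println(Arrays.toString(sketch.recordDocument(doc)));
    }
}
```

Because each token is squeezed into a single byte, unrelated token sequences can collide on the same trie path, so a nonzero count in a scheme like this indicates a probable rather than a certain duplicate.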
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
AttributeSource.State
-
-
Field Summary
-
Fields inherited from class org.apache.lucene.analysis.TokenFilter
input
-
Fields inherited from class org.apache.lucene.analysis.TokenStream
DEFAULT_TOKEN_ATTRIBUTE_FACTORY
-
-
Constructor Summary
Constructors
Constructor    Description
DeDuplicatingTokenFilter(TokenStream in, DuplicateByteSequenceSpotter byteStreamDuplicateSpotter)
DeDuplicatingTokenFilter(TokenStream in, DuplicateByteSequenceSpotter byteStreamDuplicateSpotter, boolean emitDuplicates)
-
Method Summary
All Methods    Instance Methods    Concrete Methods
Modifier and Type    Method    Description
protected boolean    accept()    Override this method and return whether the current input token should be returned by FilteringTokenFilter.incrementToken().
-
Methods inherited from class org.apache.lucene.analysis.FilteringTokenFilter
end, incrementToken, reset
-
Methods inherited from class org.apache.lucene.analysis.TokenFilter
close
-
Methods inherited from class org.apache.lucene.util.AttributeSource
addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString
Constructor Detail
-
DeDuplicatingTokenFilter
public DeDuplicatingTokenFilter(TokenStream in, DuplicateByteSequenceSpotter byteStreamDuplicateSpotter)
-
DeDuplicatingTokenFilter
public DeDuplicatingTokenFilter(TokenStream in, DuplicateByteSequenceSpotter byteStreamDuplicateSpotter, boolean emitDuplicates)
- Parameters:
in - The input token stream
byteStreamDuplicateSpotter - object which retains trie of token sequences
emitDuplicates - true if duplicate tokens are to be emitted (use the DuplicateSequenceAttribute attribute to inspect the number of prior sightings of tokens as part of a sequence).
-
-
Method Detail
-
accept
protected boolean accept() throws IOException
Description copied from class: FilteringTokenFilter
Override this method and return whether the current input token should be returned by FilteringTokenFilter.incrementToken().
- Specified by:
accept in class FilteringTokenFilter
- Throws:
IOException