Class AbstractEntityMergeStrategy
- All Implemented Interfaces:
EntityMergeStrategy
- Direct Known Subclasses:
AbstractCollectionEntityMergeStrategy,AbstractSingleEntityMergeStrategy
- Author:
- bdferris
Field Summary
Fields (Modifier and Type, Field, Description):
- protected EDuplicateDetectionStrategy _duplicateDetectionStrategy: By default, we don't specify a default duplicate detection strategy, but instead attempt auto-detection of the best strategy.
- protected ELogDuplicatesStrategy _logDuplicatesStrategy: What should happen when we detect a duplicate entity?
- protected double _minElementDuplicateScoreForFuzzyMatch: This threshold is similar to _minElementsDuplicateScoreForAutoDetect, except that it is used only for auto-detecting fuzzy matches and only for producing a candidate set of fuzzy matches to score to determine if auto-detection should be used.
- protected double _minElementsDuplicateScoreForAutoDetect: When auto-detecting the best duplicate detection strategy to use, the different EDuplicateDetectionStrategy values will each produce a set of candidate duplicates, for which we score their overlap on a scale from 0.0 to 1.0, where 0.0 indicates that none of the entities seem to match and 1.0 indicates that they are exact duplicates.
- protected double _minElementsInCommonScoreForAutoDetect: When auto-detecting the best duplicate detection strategy to use, defines the scoring threshold to use when considering if two entity id sets have enough ids in common to consider using identifier-based duplicate detection.
-
Constructor Summary
Constructors
AbstractEntityMergeStrategy()
-
Method Summary
Methods (Modifier and Type, Method, Description):
- protected EDuplicateDetectionStrategy determineDuplicateDetectionStrategy: Determines the best EDuplicateDetectionStrategy to use for the current entity type and source feed.
- protected abstract String getDescription
- protected abstract EDuplicateDetectionStrategy pickBestDuplicateDetectionStrategy(GtfsMergeContext context): Determines the best EDuplicateDetectionStrategy to use for merging entities from the current source feed into the merged output feed.
- void setDuplicateDetectionStrategy(EDuplicateDetectionStrategy duplicateDetectionStrategy): Set a duplicate detection strategy.
- void setDuplicateRenamingStrategy(EDuplicateRenamingStrategy duplicateRenamingStrategy)
- void setLogDuplicatesStrategy(ELogDuplicatesStrategy logDuplicatesStrategy)
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.onebusaway.gtfs_merge.strategies.EntityMergeStrategy
getEntityTypes, merge
-
Field Details
-
_duplicateDetectionStrategy
By default, we don't specify a default duplicate detection strategy, but instead attempt auto-detection of the best strategy. When the auto-detected strategy is not appropriate, it can be manually overridden by setting this value. -
_minElementsInCommonScoreForAutoDetect
protected double _minElementsInCommonScoreForAutoDetect
When auto-detecting the best duplicate detection strategy to use, this defines the scoring threshold used when deciding whether two entity id sets have enough ids in common to consider identifier-based duplicate detection. Note that at this point we aren't yet comparing entities with the same id to see if they seem similar; we only look at the raw number of identifiers the two sets have in common. The intuition is that if two entity sets have very few identifiers in common, the odds are low that identity-based duplicate detection should be used.
An id overlap score will be between 0.0 and 1.0, where 0.0 indicates absolutely no overlap and 1.0 indicates that the two id sets are the same. If the score is below the specified threshold, identifier-based duplicate detection will not be considered.
See DuplicateScoringSupport.scoreElementOverlap(java.util.Collection, java.util.Collection) for an example scoring method.
-
_minElementsDuplicateScoreForAutoDetect
protected double _minElementsDuplicateScoreForAutoDetect
When auto-detecting the best duplicate detection strategy to use, the different EDuplicateDetectionStrategy values will each produce a set of candidate duplicates, for which we score their overlap on a scale from 0.0 to 1.0, where 0.0 indicates that none of the entities seem to match and 1.0 indicates that they are exact duplicates. We define a minimum overlap score threshold that must be met for a particular duplicate detection strategy to be applied to the source and target feeds at large.
-
_minElementDuplicateScoreForFuzzyMatch
protected double _minElementDuplicateScoreForFuzzyMatch
This threshold is similar to _minElementsDuplicateScoreForAutoDetect, except that it is used only for auto-detecting fuzzy matches and only for producing a candidate set of fuzzy matches to score to determine if auto-detection should be used.
TODO(bdferris): I'll admit that I'm having a hard time remembering why I wanted a separate threshold for determining the set of candidate fuzzy matches. It might make sense to remove this at some point. I think the idea might have been to be more lenient when determining if we should use fuzzy-duplicate-detection in the first place, but be more strict when it comes to actual duplicate detection.
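As a rough illustration of how the thresholds above could interact, here is a minimal sketch. It is not the library's implementation. The class AutoDetectThresholdSketch, its local Strategy enum, the Jaccard-style scoreOverlap method, the order in which candidates are tried, and the 0.5 threshold values are all assumptions made for the example; the real scoring lives in DuplicateScoringSupport and may differ. The separate fuzzy-match candidate threshold is left out for brevity.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

class AutoDetectThresholdSketch {

  // Hypothetical stand-in for EDuplicateDetectionStrategy; the values are assumed.
  enum Strategy { IDENTITY, FUZZY, NONE }

  // Jaccard-style overlap |A ∩ B| / |A ∪ B|, always within [0.0, 1.0].
  // This is an assumed scoring function, not scoreElementOverlap itself.
  static <T> double scoreOverlap(Collection<T> a, Collection<T> b) {
    Set<T> union = new HashSet<>(a);
    union.addAll(b);
    if (union.isEmpty()) {
      return 0.0; // assumption: two empty sets count as "no overlap"
    }
    Set<T> common = new HashSet<>(a);
    common.retainAll(b);
    return (double) common.size() / union.size();
  }

  // Gate each candidate strategy on the thresholds described above.
  static Strategy pick(double idOverlap, double identityDuplicateScore,
      double fuzzyDuplicateScore, double minInCommon, double minDuplicateScore) {
    // Identity-based detection is only considered when enough ids are shared.
    if (idOverlap >= minInCommon && identityDuplicateScore >= minDuplicateScore) {
      return Strategy.IDENTITY;
    }
    // Otherwise fall back to fuzzy matching if its candidates score well enough.
    if (fuzzyDuplicateScore >= minDuplicateScore) {
      return Strategy.FUZZY;
    }
    return Strategy.NONE;
  }

  public static void main(String[] args) {
    // Two of six distinct stop ids are shared between the feeds.
    double idOverlap = scoreOverlap(
        Arrays.asList("s1", "s2", "s3", "s4"),
        Arrays.asList("s3", "s4", "s5", "s6"));
    System.out.println(idOverlap);
    // The duplicate scores and thresholds are made-up inputs for the example.
    System.out.println(pick(idOverlap, 0.9, 0.8, 0.5, 0.5));
  }
}
```

In this example the id overlap falls below the assumed 0.5 threshold, so identity-based detection is skipped even though its duplicate score is high, and the fuzzy candidate is chosen instead.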
-
_logDuplicatesStrategy
What should happen when we detect a duplicate entity?
-
-
Constructor Details
-
AbstractEntityMergeStrategy
public AbstractEntityMergeStrategy()
-
-
Method Details
-
setDuplicateDetectionStrategy
Set a duplicate detection strategy. By default, we attempt to auto-detect an appropriate strategy.
- Parameters:
duplicateDetectionStrategy -
-
setLogDuplicatesStrategy
-
setDuplicateRenamingStrategy
-
getDuplicateRenamingStrategy
-
determineDuplicateDetectionStrategy
Determines the best EDuplicateDetectionStrategy to use for the current entity type and source feed. If a specific duplicate detection strategy has already been specified with setDuplicateDetectionStrategy(EDuplicateDetectionStrategy), it will always be returned. If not, we attempt to pick the best duplicate detection strategy given the current source feed and the data already in the merged output feed. Auto-detecting the best duplicate detection strategy may be an expensive operation, so we cache the result for each source feed.
- Parameters:
context -
- Returns:
- the duplicate detection strategy to use for the current source input feed
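The override-then-cache behaviour described above can be sketched as follows. This is a self-contained illustration, not the class's actual code. CachingStrategySketch, its nested Strategy enum and Context class, the String feed key, and the CountingSketch subclass are all hypothetical stand-ins for the real types (EDuplicateDetectionStrategy, GtfsMergeContext), and the real cache key may differ.

```java
import java.util.HashMap;
import java.util.Map;

abstract class CachingStrategySketch {

  // Hypothetical stand-in for EDuplicateDetectionStrategy; the values are assumed.
  enum Strategy { IDENTITY, FUZZY, NONE }

  // Hypothetical stand-in for the merge context; it only identifies the source feed.
  static final class Context {
    final String sourceFeed;

    Context(String sourceFeed) {
      this.sourceFeed = sourceFeed;
    }
  }

  // null means "no manual override, auto-detect instead".
  private Strategy override;

  // One cached auto-detection result per source feed.
  private final Map<String, Strategy> cache = new HashMap<>();

  void setOverride(Strategy strategy) {
    this.override = strategy;
  }

  Strategy determine(Context context) {
    // A manually specified strategy always wins.
    if (override != null) {
      return override;
    }
    // Auto-detection may be expensive, so run it at most once per feed.
    return cache.computeIfAbsent(context.sourceFeed, feed -> pickBest(context));
  }

  // Sub-classes supply the entity-specific auto-detection.
  protected abstract Strategy pickBest(Context context);
}

// Example subclass that counts how often auto-detection actually runs.
class CountingSketch extends CachingStrategySketch {
  int calls = 0;

  @Override
  protected Strategy pickBest(Context context) {
    calls++;
    return Strategy.FUZZY;
  }
}
```

Calling determine twice with the same feed on a CountingSketch leaves calls at 1, and after setOverride(Strategy.NONE) the override is returned without consulting the cache.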
-
pickBestDuplicateDetectionStrategy
protected abstract EDuplicateDetectionStrategy pickBestDuplicateDetectionStrategy(GtfsMergeContext context)
Determines the best EDuplicateDetectionStrategy to use for merging entities from the current source feed into the merged output feed. Sub-classes are required to provide the most appropriate strategy for merging their particular entity type.
- Parameters:
context -
- Returns:
-
getDescription
- Returns:
- a string description of the current entity merge strategy, typically identifying the entity-type to be merged
-