Class AbstractEntityMergeStrategy

java.lang.Object
org.onebusaway.gtfs_merge.strategies.AbstractEntityMergeStrategy
All Implemented Interfaces:
EntityMergeStrategy
Direct Known Subclasses:
AbstractCollectionEntityMergeStrategy, AbstractSingleEntityMergeStrategy

public abstract class AbstractEntityMergeStrategy extends Object implements EntityMergeStrategy
Abstract base class that defines methods and properties common to all entity merge strategies, regardless of entity type.
Author:
bdferris
  • Field Details

    • _duplicateDetectionStrategy

      protected EDuplicateDetectionStrategy _duplicateDetectionStrategy
      No duplicate detection strategy is specified by default; instead, we attempt to auto-detect the best strategy. When the auto-detected strategy is not appropriate, it can be overridden manually by setting this value.
    • _minElementsInCommonScoreForAutoDetect

      protected double _minElementsInCommonScoreForAutoDetect
      When auto-detecting the best duplicate detection strategy to use, defines the score threshold for deciding whether two entity id sets have enough ids in common to consider identifier-based duplicate detection. Note that at this point we aren't yet comparing entities with the same id to see if they seem similar; we only look at the raw number of identifiers the two sets have in common. The intuition is that if two entity sets have very few identifiers in common, the odds are low that identifier-based duplicate detection should be used.

      An id overlap score will be between 0.0 and 1.0, where 0.0 indicates absolutely no overlap and 1.0 indicates that the two id sets are the same. If the score is below the specified threshold, identifier-based duplicate detection will not be considered.

      See DuplicateScoringSupport.scoreElementOverlap(java.util.Collection, java.util.Collection) for an example scoring method.
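
The threshold check described above can be sketched as follows. This is a minimal illustration, not the library's code: the class `IdOverlapSketch`, its method names, and the choice of a Jaccard-style score (shared ids divided by all distinct ids) are assumptions made for the example, and `DuplicateScoringSupport.scoreElementOverlap` may compute its score differently.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of onebusaway-gtfs-merge.
class IdOverlapSketch {

  // Assumed scoring: number of shared ids divided by number of distinct ids.
  // Yields 0.0 for disjoint sets and 1.0 for identical sets.
  static <T> double scoreOverlap(Collection<T> a, Collection<T> b) {
    Set<T> union = new HashSet<>(a);
    union.addAll(b);
    if (union.isEmpty()) {
      return 0.0; // assumption: two empty id sets count as no overlap
    }
    Set<T> common = new HashSet<>(a);
    common.retainAll(new HashSet<>(b));
    return (double) common.size() / union.size();
  }

  // Identifier-based detection is only considered when the score
  // is not below the threshold.
  static <T> boolean considerIdentifierBased(Collection<T> sourceIds,
      Collection<T> targetIds, double minElementsInCommonScore) {
    return scoreOverlap(sourceIds, targetIds) >= minElementsInCommonScore;
  }
}
```

With this scoring, the id sets {a, b, c} and {b, c, d} share two of four distinct ids, so a threshold of 0.5 would still let identifier-based detection be considered, while a threshold of 0.6 would rule it out.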

    • _minElementsDuplicateScoreForAutoDetect

      protected double _minElementsDuplicateScoreForAutoDetect
      When auto-detecting the best duplicate detection strategy to use, each candidate EDuplicateDetectionStrategy produces a set of candidate duplicates, whose overlap we score on a scale from 0.0 to 1.0, where 0.0 indicates that none of the entities seem to match and 1.0 indicates that they are exact duplicates. This field defines the minimum overlap score that must be met for a particular duplicate detection strategy to be applied to the source and target feeds at large.
    • _minElementDuplicateScoreForFuzzyMatch

      protected double _minElementDuplicateScoreForFuzzyMatch
      This threshold is similar to _minElementsDuplicateScoreForAutoDetect, except that it is used only when auto-detecting fuzzy matches, and only to produce the candidate set of fuzzy matches that is then scored to decide whether fuzzy duplicate detection should be used.

      TODO(bdferris): I'll admit that I'm having a hard time remembering why I wanted a separate threshold for determining the set of candidate fuzzy matches. It might make sense to remove this at some point. I think the idea might have been to be more lenient when determining if we should use fuzzy-duplicate-detection in the first place, but be more strict when it comes to actual duplicate detection.

    • _logDuplicatesStrategy

      protected ELogDuplicatesStrategy _logDuplicatesStrategy
      What should happen when we detect a duplicate entity?
  • Constructor Details

    • AbstractEntityMergeStrategy

      public AbstractEntityMergeStrategy()
  • Method Details

    • setDuplicateDetectionStrategy

      public void setDuplicateDetectionStrategy(EDuplicateDetectionStrategy duplicateDetectionStrategy)
      Set a duplicate detection strategy. By default, we attempt to auto-detect an appropriate strategy.
      Parameters:
      duplicateDetectionStrategy -
    • setLogDuplicatesStrategy

      public void setLogDuplicatesStrategy(ELogDuplicatesStrategy logDuplicatesStrategy)
    • setDuplicateRenamingStrategy

      public void setDuplicateRenamingStrategy(EDuplicateRenamingStrategy duplicateRenamingStrategy)
    • getDuplicateRenamingStrategy

      public EDuplicateRenamingStrategy getDuplicateRenamingStrategy()
    • determineDuplicateDetectionStrategy

      protected EDuplicateDetectionStrategy determineDuplicateDetectionStrategy(GtfsMergeContext context)
      Determines the best EDuplicateDetectionStrategy to use for the current entity type and source feed. If a specific duplicate detection strategy has already been specified with setDuplicateDetectionStrategy(EDuplicateDetectionStrategy), it will always be returned. If not, we attempt to pick the best duplicate detection strategy given the current source feed and the data already in the merged output feed. Auto-detecting the best duplicate detection strategy may be an expensive operation, so we cache the result for each source feed.
      Parameters:
      context -
      Returns:
      the duplicate detection strategy to use for the current source input feed
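
The override-then-cache behavior described for determineDuplicateDetectionStrategy can be sketched as below. This is an illustration under stated assumptions, not the actual implementation: `StrategyCacheSketch`, the stand-in `Strategy` enum, the string feed key, and the `autoDetect` function are all hypothetical, and the real method works from a GtfsMergeContext rather than a string.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in, not part of onebusaway-gtfs-merge.
class StrategyCacheSketch {

  // Stand-in for EDuplicateDetectionStrategy; the values are assumed.
  enum Strategy { NONE, IDENTITY, FUZZY }

  private Strategy override; // null means "auto-detect"
  private final Map<String, Strategy> cacheBySourceFeed = new HashMap<>();
  private final Function<String, Strategy> autoDetect;
  int autoDetectCalls = 0; // exposed only to show the effect of caching

  StrategyCacheSketch(Function<String, Strategy> autoDetect) {
    this.autoDetect = autoDetect;
  }

  void setOverride(Strategy strategy) {
    this.override = strategy;
  }

  Strategy determine(String sourceFeedKey) {
    // An explicitly configured strategy always wins.
    if (override != null) {
      return override;
    }
    // Otherwise run the (possibly expensive) detection once per source feed.
    return cacheBySourceFeed.computeIfAbsent(sourceFeedKey, key -> {
      autoDetectCalls++;
      return autoDetect.apply(key);
    });
  }
}
```

Calling `determine` twice with the same feed key runs the detection function once; after `setOverride` is called, the detection function is not consulted at all.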
    • pickBestDuplicateDetectionStrategy

      protected abstract EDuplicateDetectionStrategy pickBestDuplicateDetectionStrategy(GtfsMergeContext context)
      Determines the best EDuplicateDetectionStrategy to use for merging entities from the current source feed into the merged output feed. Sub-classes are required to provide the most appropriate strategy for merging their particular entity type.
      Parameters:
      context -
      Returns:
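
One way a subclass might combine the thresholds documented above when picking a strategy is sketched here. The decision order (try identifier-based detection first, then fuzzy, then none), the `PickStrategySketch` class, the stand-in `Strategy` enum, and the precomputed score parameters are assumptions for illustration; concrete subclasses may decide differently.

```java
// Hypothetical decision sketch, not part of onebusaway-gtfs-merge.
class PickStrategySketch {

  // Stand-in for EDuplicateDetectionStrategy; the values are assumed.
  enum Strategy { NONE, IDENTITY, FUZZY }

  static Strategy pick(double idOverlapScore,
      double identityDuplicateScore,
      double fuzzyDuplicateScore,
      double minElementsInCommonScore,
      double minElementsDuplicateScore) {
    // Identifier-based detection needs enough shared ids AND enough of the
    // same-id entities actually looking like duplicates.
    if (idOverlapScore >= minElementsInCommonScore
        && identityDuplicateScore >= minElementsDuplicateScore) {
      return Strategy.IDENTITY;
    }
    // Otherwise fall back to fuzzy matching if its candidates score well.
    if (fuzzyDuplicateScore >= minElementsDuplicateScore) {
      return Strategy.FUZZY;
    }
    // Neither approach looks reliable for this pair of feeds.
    return Strategy.NONE;
  }
}
```

For example, with both thresholds set to 0.5, an id overlap of 0.9 combined with an identity duplicate score of 0.2 does not select identifier-based detection; the result then depends on the fuzzy score.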
    • getDescription

      protected abstract String getDescription()
      Returns:
      a string description of the current entity merge strategy, typically identifying the entity-type to be merged