Class CLARA<V>

  • Type Parameters:
    V - Data type
    All Implemented Interfaces:
    elki.Algorithm, ClusteringAlgorithm<Clustering<MedoidModel>>, KMedoidsClustering<V>

    @Reference(authors="L. Kaufman, P. J. Rousseeuw",title="Clustering Large Data Sets",booktitle="Pattern Recognition in Practice",url="https://doi.org/10.1016/B978-0-444-87877-9.50039-X",bibkey="doi:10.1016/B978-0-444-87877-9.50039-X")
    @Reference(authors="L. Kaufman, P. J. Rousseeuw",title="Clustering Large Applications (Program CLARA)",booktitle="Finding Groups in Data: An Introduction to Cluster Analysis",url="https://doi.org/10.1002/9780470316801.ch3",bibkey="doi:10.1002/9780470316801.ch3")
    public class CLARA<V>
    extends PAM<V>
    Clustering Large Applications (CLARA) is a clustering method for large data sets that applies partitioning around medoids (PAM) to random samples of the data rather than to the full data set.
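
The sampling scheme can be illustrated with a minimal, self-contained sketch. It does not use the ELKI API: the class ClaraSketch, its methods, the one-dimensional data, and the naive swap search standing in for PAM are all assumptions made for illustration. Each iteration draws a sample, finds medoids on the sample only, scores them against the full data set, and keeps the best set seen so far.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Illustrative sketch only; hypothetical class, not part of ELKI. */
final class ClaraSketch {
  /** Sum of distances from each listed point to its nearest medoid (medoids are indexes into points). */
  static double cost(double[] points, List<Integer> ids, int[] medoids) {
    double sum = 0;
    for (int i : ids) {
      double best = Double.POSITIVE_INFINITY;
      for (int m : medoids) {
        best = Math.min(best, Math.abs(points[i] - points[m]));
      }
      sum += best;
    }
    return sum;
  }

  /** Naive swap search restricted to the sample; a simple stand-in for PAM. Needs at least k sample ids. */
  static int[] medoidsOnSample(double[] points, List<Integer> sample, int k) {
    int[] medoids = new int[k];
    for (int j = 0; j < k; j++) {
      medoids[j] = sample.get(j); // start from the first k sample members
    }
    double best = cost(points, sample, medoids);
    boolean improved = true;
    while (improved) {
      improved = false;
      for (int j = 0; j < k; j++) {
        for (int cand : sample) {
          int old = medoids[j];
          medoids[j] = cand;
          double c = cost(points, sample, medoids);
          if (c < best) { // accept only strict improvements, so the loop terminates
            best = c;
            improved = true;
          } else {
            medoids[j] = old;
          }
        }
      }
    }
    return medoids;
  }

  /** Repeat: sample, cluster the sample, evaluate on all data, keep the best medoids. */
  static int[] clara(double[] points, int k, int numSamples, int sampleSize, boolean keepMedoids, Random rnd) {
    List<Integer> all = new ArrayList<>();
    for (int i = 0; i < points.length; i++) {
      all.add(i);
    }
    int[] bestMedoids = null;
    double bestCost = Double.POSITIVE_INFINITY;
    for (int s = 0; s < numSamples; s++) {
      List<Integer> shuffled = new ArrayList<>(all);
      Collections.shuffle(shuffled, rnd);
      List<Integer> sample = new ArrayList<>();
      if (keepMedoids && bestMedoids != null) {
        for (int m : bestMedoids) {
          sample.add(m); // previous best medoids are always part of the next sample
        }
      }
      for (int id : shuffled) {
        if (sample.size() >= sampleSize) {
          break;
        }
        if (!sample.contains(id)) {
          sample.add(id);
        }
      }
      int[] medoids = medoidsOnSample(points, sample, k);
      double c = cost(points, all, medoids); // quality is judged on the full data set
      if (c < bestCost) {
        bestCost = c;
        bestMedoids = medoids;
      }
    }
    return bestMedoids;
  }
}
```

Only the sample is searched for medoids, so the expensive step scales with the sample size; the full data set is touched once per iteration to score the result.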

    TODO: use a triangular distance matrix instead of a hash-map based cache, for slightly better performance and lower memory use.
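
For readers unfamiliar with the layout the TODO refers to, the following is a rough sketch of a lower-triangular distance matrix stored in a flat array. The class TriangularDistanceMatrix and the one-dimensional absolute-difference distance are hypothetical and chosen for brevity; this is not how ELKI implements its cache.

```java
/** Illustrative sketch only; hypothetical class, not part of ELKI. */
final class TriangularDistanceMatrix {
  /** Distances for all pairs i > j, stored row by row without the diagonal. */
  private final double[] data;

  TriangularDistanceMatrix(double[] points) {
    int n = points.length;
    data = new double[n * (n - 1) / 2];
    for (int i = 1; i < n; i++) {
      for (int j = 0; j < i; j++) {
        data[index(i, j)] = Math.abs(points[i] - points[j]);
      }
    }
  }

  /** Flat offset of the pair (i, j); requires i > j. */
  static int index(int i, int j) {
    return i * (i - 1) / 2 + j;
  }

  /** Symmetric lookup; the diagonal is implicitly zero. */
  double distance(int i, int j) {
    if (i == j) {
      return 0;
    }
    return i > j ? data[index(i, j)] : data[index(j, i)];
  }
}
```

Compared with a hash map keyed on pairs, the flat array needs no boxing or hashing and stores exactly n(n-1)/2 values, at the price of computing every pairwise distance of the sample up front.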

    Reference:

    L. Kaufman, P. J. Rousseeuw
    Clustering Large Data Sets
    Pattern Recognition in Practice

    L. Kaufman, P. J. Rousseeuw
    Clustering Large Applications (Program CLARA)
    Finding Groups in Data: An Introduction to Cluster Analysis

    Since:
    0.7.0
    Author:
    Erich Schubert
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      protected static class  CLARA.CachedDistanceQuery<V>
      Cached distance query.
      static class  CLARA.Par<V>
      Parameterization class.
      • Nested classes/interfaces inherited from class elki.clustering.kmedoids.PAM

        PAM.Instance
      • Nested classes/interfaces inherited from interface elki.Algorithm

        elki.Algorithm.Utils
    • Field Summary

      Fields 
      Modifier and Type Field Description
      (package private) boolean keepmed
      Keep the previous medoids in the sample (see page 145).
      private static elki.logging.Logging LOG
      Class logger.
      (package private) int numsamples
      Number of samples to draw (i.e. iterations).
      (package private) elki.utilities.random.RandomFactory random
      Random factory for initialization.
      (package private) double sampling
      Sampling rate.
    • Constructor Summary

      Constructors 
      Constructor Description
      CLARA​(elki.distance.Distance<? super V> distance, int k, int maxiter, KMedoidsInitialization<V> initializer, int numsamples, double sampling, boolean keepmed, elki.utilities.random.RandomFactory random)
      Constructor.
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      protected static double assignRemainingToNearestCluster​(elki.database.ids.ArrayDBIDs means, elki.database.ids.DBIDs ids, elki.database.ids.DBIDs rids, elki.database.datastore.WritableIntegerDataStore assignment, elki.database.query.distance.DistanceQuery<?> distQ)
      Assigns the objects outside the sample to their nearest cluster and returns the sum of distances.
      (package private) static elki.database.ids.DBIDs randomSample​(elki.database.ids.DBIDs ids, int samplesize, java.util.Random rnd, elki.database.ids.DBIDs previous)
      Draw a random sample of the desired size.
      Clustering<MedoidModel> run​(elki.database.relation.Relation<V> relation)
      Run k-medoids clustering.
      Clustering<MedoidModel> run​(elki.database.relation.Relation<V> relation, int k, elki.database.query.distance.DistanceQuery<? super V> distQ)
      Run k-medoids clustering with a given distance query.
      Not a very elegant API, but needed for some types of nested k-medoids.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • LOG

        private static final elki.logging.Logging LOG
        Class logger.
      • sampling

        double sampling
        Sampling rate. If less than 1, it is considered to be a relative value.
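
A small sketch of how such a parameter could be resolved to an absolute sample size. The helper SampleSizeSketch.sampleSize is hypothetical; the truncation of fractional sizes, the clamping to the data set size, and the treatment of a value of exactly 1 as an absolute count are assumptions based only on the wording above.

```java
/** Illustrative sketch only; hypothetical helper, not part of ELKI. */
final class SampleSizeSketch {
  /**
   * Resolve the sampling parameter to a number of objects.
   * Assumption: values below 1 are a fraction of the data set,
   * all other values are absolute counts; the result never exceeds total.
   */
  static int sampleSize(double sampling, int total) {
    double raw = sampling < 1 ? sampling * total : sampling;
    return (int) Math.min(total, Math.max(0, raw));
  }
}
```

For example, a rate of 0.25 on 1000 objects gives 250 objects, while a value of 40 is taken as 40 objects.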
      • numsamples

        int numsamples
        Number of samples to draw (i.e. iterations).
      • keepmed

        boolean keepmed
        Keep the previous medoids in the sample (see page 145).
      • random

        elki.utilities.random.RandomFactory random
        Random factory for initialization.
    • Constructor Detail

      • CLARA

        public CLARA​(elki.distance.Distance<? super V> distance,
                     int k,
                     int maxiter,
                     KMedoidsInitialization<V> initializer,
                     int numsamples,
                     double sampling,
                     boolean keepmed,
                     elki.utilities.random.RandomFactory random)
        Constructor.
        Parameters:
        distance - Distance function to use
        k - Number of clusters to produce
        maxiter - Maximum number of iterations
        initializer - Initialization function
        numsamples - Number of samples (sampling iterations)
        sampling - Sampling rate (absolute or relative)
        keepmed - Keep the previous medoids in the next sample
        random - Random factory
    • Method Detail

      • run

        public Clustering<MedoidModel> run​(elki.database.relation.Relation<V> relation,
                                           int k,
                                           elki.database.query.distance.DistanceQuery<? super V> distQ)
        Description copied from interface: KMedoidsClustering
        Run k-medoids clustering with a given distance query.
        Not a very elegant API, but needed for some types of nested k-medoids.
        Specified by:
        run in interface KMedoidsClustering<V>
        Overrides:
        run in class PAM<V>
        Parameters:
        relation - relation to use
        k - Number of clusters
        distQ - Distance query to use
        Returns:
        Clustering result
      • randomSample

        static elki.database.ids.DBIDs randomSample​(elki.database.ids.DBIDs ids,
                                                    int samplesize,
                                                    java.util.Random rnd,
                                                    elki.database.ids.DBIDs previous)
        Draw a random sample of the desired size.
        Parameters:
        ids - IDs to sample from
        samplesize - Sample size
        rnd - Random generator
        previous - Previous medoids to always include in the sample.
        Returns:
        Sample
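
The idea of a sample that always contains the previous medoids can be sketched without ELKI types. Below, plain integer ids replace DBIDs, and the class RandomSampleSketch and its shuffle-based strategy are assumptions for illustration, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Illustrative sketch only; hypothetical class, not part of ELKI. */
final class RandomSampleSketch {
  /**
   * Draw up to sampleSize distinct ids, always including the previous medoids.
   * If previous already holds sampleSize or more ids, no further ids are added.
   */
  static List<Integer> randomSample(List<Integer> ids, int sampleSize, Random rnd, List<Integer> previous) {
    Set<Integer> sample = new LinkedHashSet<>();
    if (previous != null) {
      sample.addAll(previous); // forced members come first
    }
    List<Integer> shuffled = new ArrayList<>(ids);
    Collections.shuffle(shuffled, rnd);
    for (Integer id : shuffled) {
      if (sample.size() >= sampleSize) {
        break;
      }
      sample.add(id); // the set ignores ids that are already present
    }
    return new ArrayList<>(sample);
  }
}
```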
      • assignRemainingToNearestCluster

        protected static double assignRemainingToNearestCluster​(elki.database.ids.ArrayDBIDs means,
                                                                elki.database.ids.DBIDs ids,
                                                                elki.database.ids.DBIDs rids,
                                                                elki.database.datastore.WritableIntegerDataStore assignment,
                                                                elki.database.query.distance.DistanceQuery<?> distQ)
        Assigns the remaining objects, that is, those not in the already-assigned sample, to the cluster whose center is nearest, and returns the sum of their distances.
        Parameters:
        means - Cluster centers (medoids)
        ids - Object ids
        rids - Sample that was already assigned
        assignment - cluster assignment
        distQ - distance query
        Returns:
        Sum of distances.
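
A minimal sketch of this assignment step, using arrays and one-dimensional points instead of ELKI's DBIDs, data store, and distance query. The class AssignRemainingSketch and its signature are hypothetical.

```java
import java.util.Set;

/** Illustrative sketch only; hypothetical class, not part of ELKI. */
final class AssignRemainingSketch {
  /**
   * Assign every point that is not in the sample to its nearest medoid.
   * medoids holds indexes into points; assignment[i] receives the cluster number.
   * Returns the sum of distances of the newly assigned points only.
   */
  static double assignRemaining(double[] points, int[] medoids, Set<Integer> sampled, int[] assignment) {
    double sum = 0;
    for (int i = 0; i < points.length; i++) {
      if (sampled.contains(i)) {
        continue; // sample members were already assigned
      }
      int best = -1;
      double bestDist = Double.POSITIVE_INFINITY;
      for (int c = 0; c < medoids.length; c++) {
        double d = Math.abs(points[i] - points[medoids[c]]);
        if (d < bestDist) {
          bestDist = d;
          best = c;
        }
      }
      assignment[i] = best;
      sum += bestDist;
    }
    return sum;
  }
}
```

Adding the returned value to the cost already accumulated on the sample gives the total cost of the medoids on the full data set.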