Class DOC

  • All Implemented Interfaces:
    elki.Algorithm, ClusteringAlgorithm<Clustering<SubspaceModel>>, SubspaceClusteringAlgorithm<SubspaceModel>
    Direct Known Subclasses:
    FastDOC

    @Title("DOC: Density-based Optimal projective Clustering")
    @Reference(authors="C. M. Procopiuc, M. Jones, P. K. Agarwal, T. M. Murali",
               title="A Monte Carlo algorithm for fast projective clustering",
               booktitle="Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD \'02)",
               url="https://doi.org/10.1145/564691.564739",
               bibkey="DBLP:conf/sigmod/ProcopiucJAM02")
    public class DOC
    extends java.lang.Object
    implements SubspaceClusteringAlgorithm<SubspaceModel>
    DOC is a sampling based subspace clustering algorithm.

    Reference:

    C. M. Procopiuc, M. Jones, P. K. Agarwal, T. M. Murali
    A Monte Carlo algorithm for fast projective clustering
    In: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD '02).

    Since:
    0.6.0
    Author:
    Florian Nuecke
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static class  DOC.Par
      Parameterization class.
      • Nested classes/interfaces inherited from interface elki.Algorithm

        elki.Algorithm.Utils
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected double alpha
      Relative density threshold parameter alpha.
      protected double beta
      Balancing parameter for importance of points vs. dimensions
      private static elki.logging.Logging LOG
      The logger for this class.
      protected elki.utilities.random.RandomFactory rnd
      Randomizer used internally for sampling points.
      protected double w
      Half width parameter.
    • Constructor Summary

      Constructors 
      Constructor Description
      DOC​(double alpha, double beta, double w, elki.utilities.random.RandomFactory random)
      Constructor.
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      protected double computeClusterQuality​(int clusterSize, int numRelevantDimensions)
      Computes the quality of a cluster based on its size and number of relevant attributes, as described via the μ-function from the paper.
      protected boolean dimensionIsRelevant​(int dimension, elki.database.relation.Relation<? extends elki.data.NumberVector> relation, elki.database.ids.DBIDs points)
      Utility method to test if a given dimension is relevant as determined via a set of reference points (i.e. if the variance along the attribute is lower than the threshold).
      protected elki.database.ids.DBIDs findNeighbors​(elki.database.ids.DBIDRef q, long[] nD, elki.database.ids.ArrayModifiableDBIDs S, elki.database.relation.Relation<? extends elki.data.NumberVector> relation)
      Find the neighbors of point q in the given subspace
      elki.data.type.TypeInformation[] getInputTypeRestriction()  
      protected Cluster<SubspaceModel> makeCluster​(elki.database.relation.Relation<? extends elki.data.NumberVector> relation, elki.database.ids.DBIDs C, long[] D)
      Utility method to create a subspace cluster from a list of DBIDs and the relevant attributes.
      Clustering<SubspaceModel> run​(elki.database.relation.Relation<? extends elki.data.NumberVector> relation)
      Performs the DOC or FastDOC (as configured) algorithm.
      protected Cluster<SubspaceModel> runDOC​(elki.database.relation.Relation<? extends elki.data.NumberVector> relation, elki.database.ids.ArrayModifiableDBIDs S, int d, int n, int m, int r, int minClusterSize)
      Performs a single run of DOC, finding a single cluster.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • LOG

        private static final elki.logging.Logging LOG
        The logger for this class.
      • alpha

        protected double alpha
        Relative density threshold parameter alpha.
      • beta

        protected double beta
        Balancing parameter for importance of points vs. dimensions
      • w

        protected double w
        Half width parameter.
      • rnd

        protected elki.utilities.random.RandomFactory rnd
        Randomizer used internally for sampling points.
    • Constructor Detail

      • DOC

        public DOC​(double alpha,
                   double beta,
                   double w,
                   elki.utilities.random.RandomFactory random)
        Constructor.
        Parameters:
        alpha - α relative density threshold.
        beta - β balancing parameter for size vs. dimensionality.
        w - half width parameter.
        random - Random factory
    • Method Detail

      • getInputTypeRestriction

        public elki.data.type.TypeInformation[] getInputTypeRestriction()
        Specified by:
        getInputTypeRestriction in interface elki.Algorithm
      • run

        public Clustering<SubspaceModel> run​(elki.database.relation.Relation<? extends elki.data.NumberVector> relation)
        Performs the DOC or FastDOC (as configured) algorithm.

        This will run exhaustively, i.e. run DOC until no clusters are found anymore / the database size has shrunk below the threshold for minimum cluster size.

        Parameters:
        relation - Data relation
      • runDOC

        protected Cluster<SubspaceModel> runDOC​(elki.database.relation.Relation<? extends elki.data.NumberVector> relation,
                                                elki.database.ids.ArrayModifiableDBIDs S,
                                                int d,
                                                int n,
                                                int m,
                                                int r,
                                                int minClusterSize)
        Performs a single run of DOC, finding a single cluster.
        Parameters:
        relation - used to get actual values for DBIDs.
        S - The set of points we're working on.
        d - Dimensionality of the data set we're currently working on.
        r - Size of random samples.
        m - Number of inner iterations (per seed point).
        n - Number of outer iterations (seed points).
        minClusterSize - Minimum size a cluster must have to be accepted.
        Returns:
        a cluster, if one is found, else null.
      • findNeighbors

        protected elki.database.ids.DBIDs findNeighbors​(elki.database.ids.DBIDRef q,
                                                        long[] nD,
                                                        elki.database.ids.ArrayModifiableDBIDs S,
                                                        elki.database.relation.Relation<? extends elki.data.NumberVector> relation)
        Find the neighbors of point q in the given subspace
        Parameters:
        q - Query point
        nD - Subspace mask
        S - Remaining data points
        relation - Data relation
        Returns:
        Neighbors
      • dimensionIsRelevant

        protected boolean dimensionIsRelevant​(int dimension,
                                              elki.database.relation.Relation<? extends elki.data.NumberVector> relation,
                                              elki.database.ids.DBIDs points)
        Utility method to test if a given dimension is relevant as determined via a set of reference points (i.e. if the variance along the attribute is lower than the threshold).
        Parameters:
        dimension - the dimension to test.
        relation - used to get actual values for DBIDs.
        points - the points to test.
        Returns:
        true if the dimension is relevant.
      • makeCluster

        protected Cluster<SubspaceModel> makeCluster​(elki.database.relation.Relation<? extends elki.data.NumberVector> relation,
                                                     elki.database.ids.DBIDs C,
                                                     long[] D)
        Utility method to create a subspace cluster from a list of DBIDs and the relevant attributes.
        Parameters:
        relation - to compute a centroid.
        C - the cluster points.
        D - the relevant dimensions.
        Returns:
        an object representing the subspace cluster.
      • computeClusterQuality

        protected double computeClusterQuality​(int clusterSize,
                                               int numRelevantDimensions)
        Computes the quality of a cluster based on its size and number of relevant attributes, as described via the μ-function from the paper.
        Parameters:
        clusterSize - the size of the cluster.
        numRelevantDimensions - the number of dimensions relevant to the cluster.
        Returns:
        a quality measure (only use this to compare the quality to that other clusters).