Class P3C

  • All Implemented Interfaces:
    elki.Algorithm, ClusteringAlgorithm<Clustering<SubspaceModel>>, SubspaceClusteringAlgorithm<SubspaceModel>

    @Title("P3C: A Robust Projected Clustering Algorithm.")
    @Reference(authors="Gabriela Moise, J\u00f6rg Sander, Martin Ester",
               title="P3C: A Robust Projected Clustering Algorithm",
               booktitle="Proc. Sixth International Conference on Data Mining (ICDM \'06)",
               url="https://doi.org/10.1109/ICDM.2006.123",
               bibkey="DBLP:conf/icdm/MoiseSE06")
    @Priority(190)
    public class P3C
    extends java.lang.Object
    implements SubspaceClusteringAlgorithm<SubspaceModel>
    P3C: A Robust Projected Clustering Algorithm.

    Reference:
    Gabriela Moise, Jörg Sander, Martin Ester
    P3C: A Robust Projected Clustering Algorithm
    In: Proc. Sixth International Conference on Data Mining (ICDM '06)

    This is not a complete implementation of P3C, but good enough for most users. Improvements are welcome. The most obviously missing step is section 3.5 of P3C, where the cluster subspaces are refined.

    Since:
    0.6.0
    Author:
    Florian Nuecke, Erich Schubert
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      private static class  P3C.ClusterCandidate
      This class is used to represent potential clusters.
      static class  P3C.Par
      Parameterization class.
      private static class  P3C.Signature
      P3C Cluster signature.
      • Nested classes/interfaces inherited from interface elki.Algorithm

        elki.Algorithm.Utils
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected double alpha
      Alpha threshold for testing.
      protected double emDelta
      Threshold when to stop EM iterations.
      private static elki.logging.Logging LOG
      The logger for this class.
      protected int maxEmIterations
      Maximum number of iterations for the EM step.
      protected int minClusterSize
      Minimum cluster size for noise flagging.
      protected double poissonThreshold
      Parameter for the Poisson test threshold.
    • Constructor Summary

      Constructors 
      Constructor Description
      P3C​(double alpha, double poissonThreshold, int maxEmIterations, double emDelta, int minClusterSize)
      Constructor.
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      private void assignUnassigned​(elki.database.relation.Relation<? extends elki.data.NumberVector> relation, elki.database.datastore.WritableDataStore<double[]> probClusterIGivenX, java.util.List<MultivariateGaussianModel> models, elki.database.ids.ModifiableDBIDs unassigned)
      Assign unassigned objects to best candidate based on shortest Mahalanobis distance.
      private int chiSquaredUniformTest​(elki.database.ids.SetDBIDs[] parts, long[] marked, int card)
      Performs a ChiSquared test to determine whether an attribute has a uniform distribution.
      private void computeFuzzyMembership​(elki.database.relation.Relation<? extends elki.data.NumberVector> relation, java.util.ArrayList<P3C.Signature> clusterCores, elki.database.ids.ModifiableDBIDs unassigned, elki.database.datastore.WritableDataStore<double[]> probClusterIGivenX, java.util.List<MultivariateGaussianModel> models, int dim)
      Computes a fuzzy membership with the weights based on which cluster cores each data point is part of.
      private java.util.ArrayList<P3C.Signature> constructOneSignatures​(elki.database.ids.SetDBIDs[][] partitions, long[][] markers)
      Construct the 1-signatures by merging adjacent dense bins.
      private void findOutliers​(elki.database.relation.Relation<? extends elki.data.NumberVector> relation, java.util.List<MultivariateGaussianModel> models, java.util.ArrayList<P3C.ClusterCandidate> clusterCandidates, elki.database.ids.ModifiableDBIDs noise)
      Performs outlier detection by testing the Mahalanobis distance of each point in a cluster against the critical value of the ChiSquared distribution with as many degrees of freedom as the cluster has relevant attributes.
      elki.data.type.TypeInformation[] getInputTypeRestriction()  
      private java.util.ArrayList<P3C.ClusterCandidate> hardClustering​(elki.database.datastore.WritableDataStore<double[]> probClusterIGivenX, java.util.List<P3C.Signature> clusterCores, elki.database.ids.DBIDs dbids)
      Creates a hard clustering from the specified soft membership matrix.
      private java.util.ArrayList<P3C.Signature> mergeClusterCores​(int binCount, java.util.ArrayList<P3C.Signature> signatures)
      Merge 1-signatures into p-signatures.
      protected P3C.Signature mergeSignatures​(P3C.Signature first, P3C.Signature second, int numBins)
      Generates a merged signature of this and another one, where the other signature must be a 1-signature.
      private elki.database.ids.SetDBIDs[][] partitionData​(elki.database.relation.Relation<? extends elki.data.NumberVector> relation, int bins)
      Partition the data set into bins bins in each dimension independently.
      private java.util.ArrayList<P3C.Signature> pruneRedundantClusterCores​(java.util.ArrayList<P3C.Signature> clusterCores)
      Prune redundant cluster cores
      Clustering<SubspaceModel> run​(elki.database.relation.Relation<? extends elki.data.NumberVector> relation)
      Performs the P3C algorithm on the given Database.
      protected elki.database.ids.HashSetModifiableDBIDs unionDBIDs​(elki.database.ids.DBIDs[] parts, int start, int end)
      Compute the union of multiple DBID sets.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • LOG

        private static final elki.logging.Logging LOG
        The logger for this class.
      • poissonThreshold

        protected double poissonThreshold
        Parameter for the Poisson test threshold.
      • maxEmIterations

        protected int maxEmIterations
        Maximum number of iterations for the EM step.
      • emDelta

        protected double emDelta
        Threshold when to stop EM iterations.
      • minClusterSize

        protected int minClusterSize
        Minimum cluster size for noise flagging. (Not existing in the original publication).
      • alpha

        protected double alpha
        Alpha threshold for testing.
    • Constructor Detail

      • P3C

        public P3C​(double alpha,
                   double poissonThreshold,
                   int maxEmIterations,
                   double emDelta,
                   int minClusterSize)
        Constructor.
        Parameters:
        alpha - ChiSquared test threshold
        poissonThreshold - Poisson test threshold
        maxEmIterations - Maximum number of EM iterations
        emDelta - EM stopping threshold
        minClusterSize - Minimum cluster size
    • Method Detail

      • getInputTypeRestriction

        public elki.data.type.TypeInformation[] getInputTypeRestriction()
        Specified by:
        getInputTypeRestriction in interface elki.Algorithm
      • run

        public Clustering<SubspaceModel> run​(elki.database.relation.Relation<? extends elki.data.NumberVector> relation)
        Performs the P3C algorithm on the given Database.
        Parameters:
        relation - Input data
      • constructOneSignatures

        private java.util.ArrayList<P3C.Signature> constructOneSignatures​(elki.database.ids.SetDBIDs[][] partitions,
                                                                          long[][] markers)
        Construct the 1-signatures by merging adjacent dense bins.
        Parameters:
        partitions - Initial partitions.
        markers - Markers for dense partitions.
        Returns:
        1-signatures
      • mergeClusterCores

        private java.util.ArrayList<P3C.Signature> mergeClusterCores​(int binCount,
                                                                     java.util.ArrayList<P3C.Signature> signatures)
        Merge 1-signatures into p-signatures.
        Parameters:
        binCount - Number of bins in each dimension.
        signatures - 1-signatures
        Returns:
        p-signatures
      • pruneRedundantClusterCores

        private java.util.ArrayList<P3C.Signature> pruneRedundantClusterCores​(java.util.ArrayList<P3C.Signature> clusterCores)
        Prune redundant cluster cores
        Parameters:
        clusterCores - Cluster cores
        Returns:
        Pruned cluster cores
      • partitionData

        private elki.database.ids.SetDBIDs[][] partitionData​(elki.database.relation.Relation<? extends elki.data.NumberVector> relation,
                                                             int bins)
        Partition the data set into bins bins in each dimension independently.

        This can be used to construct a grid approximation of the data using O(d n) memory.

        When a dimension is found to be constant, it will not be partitioned, but instead the corresponding array will be set to null.

        Parameters:
        relation - Data relation to partition
        bins - Number of bins
        Returns:
        Partitions of each dimension.
      • unionDBIDs

        protected elki.database.ids.HashSetModifiableDBIDs unionDBIDs​(elki.database.ids.DBIDs[] parts,
                                                                      int start,
                                                                      int end)
        Compute the union of multiple DBID sets.
        Parameters:
        parts - Parts array
        start - Array start index
        end - Array end index (exclusive)
        Returns:
        Union
      • chiSquaredUniformTest

        private int chiSquaredUniformTest​(elki.database.ids.SetDBIDs[] parts,
                                          long[] marked,
                                          int card)
        Performs a ChiSquared test to determine whether an attribute has a uniform distribution.
        Parameters:
        parts - Data partitions.
        marked - the marked bins that should be ignored.
        card - Cardinality
        Returns:
        Position of maximum, or -1 when uniform.
      • computeFuzzyMembership

        private void computeFuzzyMembership​(elki.database.relation.Relation<? extends elki.data.NumberVector> relation,
                                            java.util.ArrayList<P3C.Signature> clusterCores,
                                            elki.database.ids.ModifiableDBIDs unassigned,
                                            elki.database.datastore.WritableDataStore<double[]> probClusterIGivenX,
                                            java.util.List<MultivariateGaussianModel> models,
                                            int dim)
        Computes a fuzzy membership with the weights based on which cluster cores each data point is part of.
        Parameters:
        relation - Data relation
        clusterCores - the cluster cores.
        unassigned - set to which to add unassigned points.
        probClusterIGivenX - Membership probabilities.
        models - Cluster models.
        dim - Dimensionality
      • assignUnassigned

        private void assignUnassigned​(elki.database.relation.Relation<? extends elki.data.NumberVector> relation,
                                      elki.database.datastore.WritableDataStore<double[]> probClusterIGivenX,
                                      java.util.List<MultivariateGaussianModel> models,
                                      elki.database.ids.ModifiableDBIDs unassigned)
        Assign unassigned objects to best candidate based on shortest Mahalanobis distance.
        Parameters:
        relation - Data relation
        probClusterIGivenX - fuzzy membership matrix.
        models - Cluster models.
        unassigned - the list of points not yet assigned.
      • hardClustering

        private java.util.ArrayList<P3C.ClusterCandidate> hardClustering​(elki.database.datastore.WritableDataStore<double[]> probClusterIGivenX,
                                                                         java.util.List<P3C.Signature> clusterCores,
                                                                         elki.database.ids.DBIDs dbids)
        Creates a hard clustering from the specified soft membership matrix.
        Parameters:
        probClusterIGivenX - the membership matrix.
        dbids - mapping matrix row to DBID.
        Returns:
        a hard clustering based on the matrix.
      • findOutliers

        private void findOutliers​(elki.database.relation.Relation<? extends elki.data.NumberVector> relation,
                                  java.util.List<MultivariateGaussianModel> models,
                                  java.util.ArrayList<P3C.ClusterCandidate> clusterCandidates,
                                  elki.database.ids.ModifiableDBIDs noise)
        Performs outlier detection by testing the Mahalanobis distance of each point in a cluster against the critical value of the ChiSquared distribution with as many degrees of freedom as the cluster has relevant attributes.
        Parameters:
        relation - Data relation
        models - Cluster models
        clusterCandidates - the list of clusters to check.
        noise - the set to which to add points deemed outliers.
      • mergeSignatures

        protected P3C.Signature mergeSignatures​(P3C.Signature first,
                                                P3C.Signature second,
                                                int numBins)
        Generates a merged signature of this and another one, where the other signature must be a 1-signature.
        Parameters:
        first - First signature.
        second - Second signature, must be a 1-signature.
        numBins - Number of bins per dimension.
        Returns:
        the merged signature, or null if the merge failed.