Class BetulaGMM

  • All Implemented Interfaces:
    elki.Algorithm, ClusteringAlgorithm<Clustering<EMModel>>
    Direct Known Subclasses:
    BetulaGMMWeighted

    @Reference(authors="Andreas Lang and Erich Schubert",
               title="BETULA: Fast Clustering of Large Data with Improved BIRCH CF-Trees",
               booktitle="Information Systems",
               url="https://doi.org/10.1016/j.is.2021.101918",
               bibkey="DBLP:journals/is/LangS22")
    public class BetulaGMM
    extends java.lang.Object
    implements ClusteringAlgorithm<Clustering<EMModel>>
    Clustering by expectation maximization (EM-Algorithm), also known as Gaussian Mixture Modeling (GMM), with optional MAP regularization. This version uses the BIRCH cluster feature centers only for responsibility estimation; the CF variances are only used for computing the models.

    Reference:

    Andreas Lang and Erich Schubert
    BETULA: Fast Clustering of Large Data with Improved BIRCH CF-Trees
    Information Systems

    Since:
    0.8.0
    Author:
    Andreas Lang
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static class  BetulaGMM.Par
      Parameterizer
      • Nested classes/interfaces inherited from interface elki.Algorithm

        elki.Algorithm.Utils
    • Field Summary

      Fields 
      Modifier and Type Field Description
      (package private) CFTree.Factory<?> cffactory
      CFTree factory.
      private double delta
      Delta parameter
      (package private) BetulaClusterModelFactory<?> initializer
      Initialization method.
      (package private) int k
      Number of cluster centers to initialize.
      private static elki.logging.Logging LOG
      Class logger.
      (package private) int maxiter
      Maximum number of iterations.
      protected static double MIN_LOGLIKELIHOOD
      Minimum loglikelihood to avoid -infinity.
      private double prior
      Prior to enable MAP estimation (use 0 for MLE)
      private boolean soft
      Retain soft assignments.
      static elki.data.type.SimpleTypeInformation<double[]> SOFT_TYPE
      Soft assignment result type.
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      double assignProbabilitiesToInstances​(elki.database.relation.Relation<? extends elki.data.NumberVector> relation, java.util.List<? extends BetulaClusterModel> models, elki.database.datastore.WritableDataStore<double[]> probClusterIGivenX)
      Assigns the current probability values to the instances in the database and computes the expectation value of the current mixture of distributions.
      double assignProbabilitiesToInstances​(java.util.ArrayList<? extends ClusterFeature> cfs, java.util.List<? extends BetulaClusterModel> models, java.util.Map<ClusterFeature,​double[]> probClusterIGivenX)
      Assigns the current probability values to the cluster features and computes the expectation value of the current mixture of distributions.
      elki.data.type.TypeInformation[] getInputTypeRestriction()  
      private boolean isSoft()  
      void recomputeCovarianceMatrices​(java.util.ArrayList<? extends ClusterFeature> cfs, java.util.Map<ClusterFeature,​double[]> probClusterIGivenX, java.util.List<? extends BetulaClusterModel> models, double prior, int n)
      Recompute the covariance matrices.
      Clustering<EMModel> run​(elki.database.relation.Relation<elki.data.NumberVector> relation)
      Run the clustering algorithm.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • LOG

        private static final elki.logging.Logging LOG
        Class logger.
      • k

        int k
        Number of cluster centers to initialize.
      • delta

        private double delta
        Delta parameter
      • maxiter

        int maxiter
        Maximum number of iterations.
      • prior

        private double prior
        Prior to enable MAP estimation (use 0 for MLE)
      • soft

        private boolean soft
        Retain soft assignments.
      • MIN_LOGLIKELIHOOD

        protected static final double MIN_LOGLIKELIHOOD
        Minimum loglikelihood to avoid -infinity.
        See Also:
        Constant Field Values
      • SOFT_TYPE

        public static final elki.data.type.SimpleTypeInformation<double[]> SOFT_TYPE
        Soft assignment result type.
    • Constructor Detail

      • BetulaGMM

        public BetulaGMM​(CFTree.Factory<?> cffactory,
                         double delta,
                         int k,
                         int maxiter,
                         boolean soft,
                         BetulaClusterModelFactory<?> initialization,
                         double prior)
        Constructor.
        Parameters:
        cffactory - CFTree factory
        delta - Delta parameter
        k - Number of clusters
        maxiter - Maximum number of iterations
        soft - Return soft clustering results
        initialization - Initialization method
        prior - MAP prior
    • Method Detail

      • getInputTypeRestriction

        public elki.data.type.TypeInformation[] getInputTypeRestriction()
        Specified by:
        getInputTypeRestriction in interface elki.Algorithm
      • run

        public Clustering<EMModel> run​(elki.database.relation.Relation<elki.data.NumberVector> relation)
        Run the clustering algorithm.
        Parameters:
        relation - Input data
        Returns:
        Clustering
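        The interplay of delta and maxiter can be illustrated with a minimal, self-contained sketch. It assumes (this is not stated on this page) that delta acts as a convergence threshold on the change in log-likelihood between iterations. The class EmLoopSketch and its methods are hypothetical, work on raw 1-D points with two clusters, and do not use CF-trees or any ELKI API.

```java
public class EmLoopSketch {
  // Gaussian density with mean mu and variance var.
  static double pdf(double x, double mu, double var) {
    double d = x - mu;
    return Math.exp(-0.5 * d * d / var) / Math.sqrt(2 * Math.PI * var);
  }

  // Fit two 1-D Gaussians; stop after maxiter iterations, or earlier when the
  // log-likelihood changes by less than delta (assumed meaning of delta).
  static double[] fit(double[] x, double delta, int maxiter) {
    double lo = x[0], hi = x[0];
    for (double v : x) {
      lo = Math.min(lo, v);
      hi = Math.max(hi, v);
    }
    double[] mu = { lo, hi }, var = { 1, 1 }, w = { 0.5, 0.5 };
    double[][] r = new double[x.length][2];
    double prev = Double.NEGATIVE_INFINITY;
    for (int it = 0; it < maxiter; it++) {
      // E-step: responsibilities and total log-likelihood.
      double ll = 0;
      for (int i = 0; i < x.length; i++) {
        double sum = 0;
        for (int c = 0; c < 2; c++) {
          r[i][c] = w[c] * pdf(x[i], mu[c], var[c]);
          sum += r[i][c];
        }
        for (int c = 0; c < 2; c++) {
          r[i][c] /= sum;
        }
        ll += Math.log(sum);
      }
      // M-step: re-estimate weight, mean, and variance of each cluster.
      for (int c = 0; c < 2; c++) {
        double wsum = 0, m = 0, v = 0;
        for (int i = 0; i < x.length; i++) {
          wsum += r[i][c];
          m += r[i][c] * x[i];
        }
        mu[c] = m / wsum;
        for (int i = 0; i < x.length; i++) {
          double d = x[i] - mu[c];
          v += r[i][c] * d * d;
        }
        var[c] = Math.max(v / wsum, 1e-6); // guard against zero variance
        w[c] = wsum / x.length;
      }
      if (Math.abs(ll - prev) < delta) {
        break; // converged
      }
      prev = ll;
    }
    return mu;
  }

  public static void main(String[] args) {
    double[] x = { 0, 0.2, -0.2, 10, 10.2, 9.8 };
    double[] mu = fit(x, 1e-7, 100);
    System.out.println(mu[0] + " " + mu[1]);
  }
}
```

        For brevity this sketch works with plain densities; a robust implementation would compute in log space to avoid underflow.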
      • isSoft

        private boolean isSoft()
      • assignProbabilitiesToInstances

        public double assignProbabilitiesToInstances​(java.util.ArrayList<? extends ClusterFeature> cfs,
                                                     java.util.List<? extends BetulaClusterModel> models,
                                                     java.util.Map<ClusterFeature,​double[]> probClusterIGivenX)
        Assigns the current probability values to the cluster features and computes the expectation value of the current mixture of distributions.

        Computed as the sum of the logarithms of the prior probability of each instance.

        Parameters:
        cfs - the cluster features to evaluate
        models - Cluster models
        probClusterIGivenX - Output storage for cluster probabilities
        Returns:
        the expectation value of the current mixture of distributions
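        The normalization at the heart of such an assignment step can be sketched as follows. This is an illustration under stated assumptions, not the code of this class: ResponsibilitySketch and normalize are hypothetical names, the floor value given to MIN_LOGLIKELIHOOD here is an assumed placeholder, and the input is taken to be one array of per-cluster log densities (log cluster weight plus log pdf).

```java
import java.util.Arrays;

public class ResponsibilitySketch {
  // Floor for log-likelihoods so a zero density never yields -infinity.
  // The concrete value is an assumption made for this sketch.
  static final double MIN_LOGLIKELIHOOD = -100000;

  // Converts per-cluster log densities into responsibilities (in place)
  // using the log-sum-exp trick, and returns the log of the summed density.
  static double normalize(double[] logDensities) {
    double max = Double.NEGATIVE_INFINITY;
    for (double v : logDensities) {
      max = Math.max(max, v);
    }
    if (max < MIN_LOGLIKELIHOOD) {
      // Degenerate case: no cluster explains the item; fall back to uniform.
      Arrays.fill(logDensities, 1.0 / logDensities.length);
      return MIN_LOGLIKELIHOOD;
    }
    double sum = 0;
    for (int i = 0; i < logDensities.length; i++) {
      logDensities[i] = Math.exp(logDensities[i] - max);
      sum += logDensities[i];
    }
    for (int i = 0; i < logDensities.length; i++) {
      logDensities[i] /= sum;
    }
    return max + Math.log(sum);
  }

  public static void main(String[] args) {
    double[] p = { Math.log(0.1), Math.log(0.3) };
    double ll = normalize(p);
    // Responsibilities are roughly 0.25 and 0.75; ll is roughly log(0.4).
    System.out.println(Arrays.toString(p) + " " + ll);
  }
}
```

        Summing the returned values over all items gives a total log-likelihood that can be compared between iterations.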
      • assignProbabilitiesToInstances

        public double assignProbabilitiesToInstances​(elki.database.relation.Relation<? extends elki.data.NumberVector> relation,
                                                     java.util.List<? extends BetulaClusterModel> models,
                                                     elki.database.datastore.WritableDataStore<double[]> probClusterIGivenX)
        Assigns the current probability values to the instances in the database and computes the expectation value of the current mixture of distributions.

        Computed as the sum of the logarithms of the prior probability of each instance.

        Parameters:
        relation - the database used for assignment to instances
        models - Cluster models
        probClusterIGivenX - Output storage for cluster probabilities
        Returns:
        the expectation value of the current mixture of distributions
      • recomputeCovarianceMatrices

        public void recomputeCovarianceMatrices​(java.util.ArrayList<? extends ClusterFeature> cfs,
                                                java.util.Map<ClusterFeature,​double[]> probClusterIGivenX,
                                                java.util.List<? extends BetulaClusterModel> models,
                                                double prior,
                                                int n)
        Recompute the covariance matrices.
        Parameters:
        cfs - Cluster features to evaluate
        probClusterIGivenX - Object probabilities
        models - Cluster models to update
        prior - MAP prior (use 0 for MLE)
        n - data set size
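
        How a variance can be recovered from summarized data is shown in the following 1-D sketch. It is an illustration only: the Summary class (count, mean, sum of squared deviations) is a hypothetical stand-in for a cluster feature, the MAP prior is omitted (this corresponds to plain MLE), and the formula is the standard decomposition into within-summary and between-summary spread, not necessarily the exact computation performed by this method.

```java
public class CovarianceFromSummariesSketch {
  // Hypothetical 1-D summary of a group of points.
  static final class Summary {
    final int n;       // number of summarized points
    final double mean; // mean of the summarized points
    final double ssd;  // sum of squared deviations from that mean

    Summary(int n, double mean, double ssd) {
      this.n = n;
      this.mean = mean;
      this.ssd = ssd;
    }
  }

  // Responsibility-weighted mean of one cluster.
  static double mean(Summary[] cfs, double[] resp) {
    double wsum = 0, sum = 0;
    for (int i = 0; i < cfs.length; i++) {
      double wi = resp[i] * cfs[i].n;
      wsum += wi;
      sum += wi * cfs[i].mean;
    }
    return sum / wsum;
  }

  // Responsibility-weighted variance of one cluster (MLE, no prior).
  static double variance(Summary[] cfs, double[] resp) {
    double mu = mean(cfs, resp);
    double wsum = 0, sum = 0;
    for (int i = 0; i < cfs.length; i++) {
      double d = cfs[i].mean - mu;
      // Spread inside the summary plus spread of the summary around mu.
      sum += resp[i] * (cfs[i].ssd + cfs[i].n * d * d);
      wsum += resp[i] * cfs[i].n;
    }
    return sum / wsum;
  }

  public static void main(String[] args) {
    // Summaries of the points {0, 2} and {4, 6}.
    Summary[] cfs = { new Summary(2, 1, 2), new Summary(2, 5, 2) };
    double[] resp = { 1, 1 };
    System.out.println(mean(cfs, resp) + " " + variance(cfs, resp));
  }
}
```

        With both responsibilities set to 1, the result matches the mean and population variance of the four underlying points 0, 2, 4, 6.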