Class EM<O,​M extends MeanModel>

  • Type Parameters:
    O - object type to analyze
    M - model type to produce
    All Implemented Interfaces:
    elki.Algorithm, ClusteringAlgorithm<Clustering<M>>

    @Title("EM-Clustering: Clustering by Expectation Maximization")
    @Description("Cluster data via Gaussian mixture modeling and the EM algorithm")
    @Reference(authors="A. P. Dempster, N. M. Laird, D. B. Rubin",title="Maximum Likelihood from Incomplete Data via the EM algorithm",booktitle="Journal of the Royal Statistical Society, Series B, 39(1)",url="http://www.jstor.org/stable/2984875",bibkey="journals/jroyastatsocise2/DempsterLR77") @Reference(title="Bayesian Regularization for Normal Mixture Estimation and Model-Based Clustering",authors="C. Fraley, A. E. Raftery",booktitle="J. Classification 24(2)",url="https://doi.org/10.1007/s00357-007-0004-5",bibkey="DBLP:journals/classification/FraleyR07")
    @Priority(200)
    public class EM<O,​M extends MeanModel>
    extends java.lang.Object
    implements ClusteringAlgorithm<Clustering<M>>
    Clustering by expectation maximization (EM-Algorithm), also known as Gaussian Mixture Modeling (GMM), with optional MAP regularization.

    Reference:

    A. P. Dempster, N. M. Laird, D. B. Rubin:
    Maximum Likelihood from Incomplete Data via the EM algorithm.
    Journal of the Royal Statistical Society, Series B, 39(1), 1977, pp. 1-31

    The MAP estimation is derived from

    C. Fraley and A. E. Raftery
    Bayesian Regularization for Normal Mixture Estimation and Model-Based Clustering
    J. Classification 24(2)

    Since:
    0.1
    Author:
    Arthur Zimek, Erich Schubert
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static class  EM.Par<O,​M extends MeanModel>
      Parameterization class.
      • Nested classes/interfaces inherited from interface elki.Algorithm

        elki.Algorithm.Utils
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected double delta
      Delta parameter
      protected int k
      Number of clusters
      private static java.lang.String KEY
      Key for statistics logging.
      private static elki.logging.Logging LOG
      The logger for this class.
      protected int maxiter
      Maximum number of iterations to allow
      protected EMClusterModelFactory<? super O,​M> mfactory
      Factory for producing the initial cluster model.
      protected static double MIN_LOGLIKELIHOOD
      Minimum loglikelihood to avoid -infinity.
      protected int miniter
      Minimum number of iterations to do
      protected double prior
      Prior to enable MAP estimation (use 0 for MLE)
      protected boolean soft
      Retain soft assignments.
      static elki.data.type.SimpleTypeInformation<double[]> SOFT_TYPE
      Soft assignment result type.
    • Constructor Summary

      Constructors 
      Constructor Description
      EM​(int k, double delta, EMClusterModelFactory<? super O,​M> mfactory)
      Constructor.
      EM​(int k, double delta, EMClusterModelFactory<? super O,​M> mfactory, int maxiter, boolean soft)
      Constructor.
      EM​(int k, double delta, EMClusterModelFactory<? super O,​M> mfactory, int maxiter, double prior, boolean soft)
      Constructor.
      EM​(int k, double delta, EMClusterModelFactory<? super O,​M> mfactory, int miniter, int maxiter, double prior, boolean soft)
      Constructor.
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      static <O> double assignProbabilitiesToInstances​(elki.database.relation.Relation<? extends O> relation, java.util.List<? extends EMClusterModel<? super O,​?>> models, elki.database.datastore.WritableDataStore<double[]> probClusterIGivenX, elki.database.datastore.WritableDoubleDataStore loglikelihoods)
      Assigns the current probability values to the instances in the database and compute the expectation value of the current mixture of distributions.
      elki.data.type.TypeInformation[] getInputTypeRestriction()  
      static double logSumExp​(double[] x)
      Compute log(sum(exp(x_i)), with attention to numerical issues.
      protected static double logSumExp​(double a, double b)
      Compute log(exp(a)+exp(b)), with attention to numerical issues.
      static <O> void recomputeCovarianceMatrices​(elki.database.relation.Relation<? extends O> relation, elki.database.datastore.WritableDataStore<double[]> probClusterIGivenX, java.util.List<? extends EMClusterModel<? super O,​?>> models, double prior)
      Recompute the covariance matrixes.
      Clustering<M> run​(elki.database.relation.Relation<O> relation)
      Performs the EM clustering algorithm on the given database.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • LOG

        private static final elki.logging.Logging LOG
        The logger for this class.
      • KEY

        private static final java.lang.String KEY
        Key for statistics logging.
      • k

        protected int k
        Number of clusters
      • delta

        protected double delta
        Delta parameter
      • miniter

        protected int miniter
        Minimum number of iterations to do
      • maxiter

        protected int maxiter
        Maximum number of iterations to allow
      • prior

        protected double prior
        Prior to enable MAP estimation (use 0 for MLE)
      • soft

        protected boolean soft
        Retain soft assignments.
      • MIN_LOGLIKELIHOOD

        protected static final double MIN_LOGLIKELIHOOD
        Minimum loglikelihood to avoid -infinity.
        See Also:
        Constant Field Values
      • SOFT_TYPE

        public static final elki.data.type.SimpleTypeInformation<double[]> SOFT_TYPE
        Soft assignment result type.
    • Constructor Detail

      • EM

        public EM​(int k,
                  double delta,
                  EMClusterModelFactory<? super O,​M> mfactory)
        Constructor.
        Parameters:
        k - k parameter
        delta - delta parameter
        mfactory - EM cluster model factory
      • EM

        public EM​(int k,
                  double delta,
                  EMClusterModelFactory<? super O,​M> mfactory,
                  int maxiter,
                  boolean soft)
        Constructor.
        Parameters:
        k - k parameter
        delta - delta parameter
        mfactory - EM cluster model factory
        maxiter - Maximum number of iterations
        soft - Include soft assignments
      • EM

        public EM​(int k,
                  double delta,
                  EMClusterModelFactory<? super O,​M> mfactory,
                  int maxiter,
                  double prior,
                  boolean soft)
        Constructor.
        Parameters:
        k - k parameter
        delta - delta parameter
        mfactory - EM cluster model factory
        maxiter - Maximum number of iterations
        prior - MAP prior
        soft - Include soft assignments
      • EM

        public EM​(int k,
                  double delta,
                  EMClusterModelFactory<? super O,​M> mfactory,
                  int miniter,
                  int maxiter,
                  double prior,
                  boolean soft)
        Constructor.
        Parameters:
        k - k parameter
        delta - delta parameter
        mfactory - EM cluster model factory
        miniter - Minimum number of iterations
        maxiter - Maximum number of iterations
        prior - MAP prior
        soft - Include soft assignments
    • Method Detail

      • getInputTypeRestriction

        public elki.data.type.TypeInformation[] getInputTypeRestriction()
        Specified by:
        getInputTypeRestriction in interface elki.Algorithm
      • run

        public Clustering<M> run​(elki.database.relation.Relation<O> relation)
        Performs the EM clustering algorithm on the given database.

        Finally a hard clustering is provided where each clusters gets assigned the points exhibiting the highest probability to belong to this cluster. But still, the database objects hold associated the complete probability-vector for all models.

        Parameters:
        relation - Relation
        Returns:
        Clustering result
      • recomputeCovarianceMatrices

        public static <O> void recomputeCovarianceMatrices​(elki.database.relation.Relation<? extends O> relation,
                                                           elki.database.datastore.WritableDataStore<double[]> probClusterIGivenX,
                                                           java.util.List<? extends EMClusterModel<? super O,​?>> models,
                                                           double prior)
        Recompute the covariance matrixes.
        Type Parameters:
        O - Object type
        Parameters:
        relation - Vector data
        probClusterIGivenX - Object probabilities
        models - Cluster models to update
        prior - MAP prior (use 0 for MLE)
      • assignProbabilitiesToInstances

        public static <O> double assignProbabilitiesToInstances​(elki.database.relation.Relation<? extends O> relation,
                                                                java.util.List<? extends EMClusterModel<? super O,​?>> models,
                                                                elki.database.datastore.WritableDataStore<double[]> probClusterIGivenX,
                                                                elki.database.datastore.WritableDoubleDataStore loglikelihoods)
        Assigns the current probability values to the instances in the database and compute the expectation value of the current mixture of distributions.

        Computed as the sum of the logarithms of the prior probability of each instance.

        Type Parameters:
        O - Object type
        Parameters:
        relation - the database used for assignment to instances
        models - Cluster models
        probClusterIGivenX - Output storage for cluster probabilities
        loglikelihoods - Per-object log likelihood, for EM Outlier; may be null if not used
        Returns:
        the expectation value of the current mixture of distributions
      • logSumExp

        public static double logSumExp​(double[] x)
        Compute log(sum(exp(x_i)), with attention to numerical issues.
        Parameters:
        x - Input
        Returns:
        Result
      • logSumExp

        protected static double logSumExp​(double a,
                                          double b)
        Compute log(exp(a)+exp(b)), with attention to numerical issues.
        Parameters:
        a - Input 1
        b - Input 2
        Returns:
        Result