Class BetulaLloydKMeans

  • All Implemented Interfaces:
    elki.Algorithm, ClusteringAlgorithm<Clustering<KMeansModel>>, KMeans<elki.data.NumberVector,KMeansModel>

    @Reference(authors="Andreas Lang and Erich Schubert",
               title="BETULA: Fast Clustering of Large Data with Improved BIRCH CF-Trees",
               booktitle="Information Systems",
               url="https://doi.org/10.1016/j.is.2021.101918",
               bibkey="DBLP:journals/is/LangS22")
    public class BetulaLloydKMeans
    extends AbstractKMeans<elki.data.NumberVector,KMeansModel>
    BIRCH/BETULA-based clustering algorithm that runs Lloyd-style k-means on the leaves of the CFTree.

    References:

    Andreas Lang and Erich Schubert
    BETULA: Fast Clustering of Large Data with Improved BIRCH CF-Trees
    Information Systems

    Since:
    0.8.0
    Author:
    Erich Schubert
    • Field Detail

      • LOG

        private static final elki.logging.Logging LOG
        Class logger.
      • storeIds

        boolean storeIds
        Whether to store object IDs (avoids reassignment cost).
      • ignoreWeight

        boolean ignoreWeight
        Whether to ignore the leaf weights.
      • diststat

        long diststat
        Number of distance calculations
    • Constructor Detail

      • BetulaLloydKMeans

        public BetulaLloydKMeans(int k,
                                 int maxiter,
                                 CFTree.Factory<?> cffactory,
                                 AbstractCFKMeansInitialization initialization,
                                 boolean storeIds,
                                 boolean ignoreWeight)
        Constructor.
        Parameters:
        k - Number of clusters
        maxiter - Maximum number of iterations
        cffactory - CFTree factory
        initialization - Initialization method for k-means
        storeIds - Store IDs to avoid reassignment cost
        ignoreWeight - Ignore the leaf weights
    • Method Detail

      • run

        public Clustering<KMeansModel> run(elki.database.relation.Relation<elki.data.NumberVector> relation)
        Run the clustering algorithm.
        Parameters:
        relation - Input data
        Returns:
        Clustering
      • kmeans

        private double[][] kmeans(java.util.ArrayList<? extends ClusterFeature> cfs,
                                  int[] assignment,
                                  int[] weights,
                                  CFTree<?> tree)
        Perform k-means clustering.
        Parameters:
        cfs - Cluster features
        assignment - Cluster assignment of each CF
        weights - Cluster weight output
        tree - CF tree
        Returns:
        Cluster means
      • means

        private double[][] means(int[] assignment,
                                 double[][] means,
                                 java.util.ArrayList<? extends ClusterFeature> cfs,
                                 int[] weights)
        Calculate means of clusters.
        Parameters:
        assignment - Cluster assignment
        means - Means of clusters
        cfs - Clustering features
        weights - Cluster weights
        Returns:
        Means of clusters.
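
As a rough illustration of this update step, the sketch below recomputes each cluster mean as the weighted average of the centroids of the clustering features assigned to it. It is not ELKI's implementation: the class `WeightedMeansSketch`, the method `weightedMeans`, and the plain-array representation of a CF (one centroid array plus one integer weight per CF) are assumptions made to keep the example self-contained.

```java
import java.util.Arrays;

public class WeightedMeansSketch {
  /**
   * Hypothetical helper: recompute cluster means from CF centroids.
   * centroids[i] is the centroid of CF i, cfWeights[i] its number of points,
   * assignment[i] the cluster index of CF i, and k the number of clusters.
   */
  public static double[][] weightedMeans(double[][] centroids, int[] cfWeights, int[] assignment, int k) {
    int dim = centroids[0].length;
    double[][] means = new double[k][dim];
    long[] total = new long[k];
    // Accumulate weight * centroid per cluster.
    for (int i = 0; i < centroids.length; i++) {
      int c = assignment[i];
      total[c] += cfWeights[i];
      for (int d = 0; d < dim; d++) {
        means[c][d] += cfWeights[i] * centroids[i][d];
      }
    }
    // Normalize by the total weight of each cluster.
    for (int c = 0; c < k; c++) {
      if (total[c] == 0) {
        continue; // empty cluster: left at the origin in this sketch
      }
      for (int d = 0; d < dim; d++) {
        means[c][d] /= total[c];
      }
    }
    return means;
  }

  public static void main(String[] args) {
    double[][] centroids = { { 0, 0 }, { 2, 2 }, { 10, 10 } };
    int[] cfWeights = { 1, 3, 2 };
    int[] assignment = { 0, 0, 1 };
    double[][] means = weightedMeans(centroids, cfWeights, assignment, 2);
    System.out.println(Arrays.deepToString(means)); // prints [[1.5, 1.5], [10.0, 10.0]]
  }
}
```

Weighting by the CF size makes a leaf that summarizes three points pull the mean three times as strongly as a leaf that summarizes one.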
      • assignToNearestCluster

        private int assignToNearestCluster(int[] assignment,
                                           double[][] means,
                                           java.util.ArrayList<? extends ClusterFeature> cfs,
                                           int[] weights)
        Assign each element to its nearest cluster.
        Parameters:
        assignment - Current cluster assignment
        means - k-means cluster means
        cfs - Cluster features
        weights - Cluster weights (output)
        Returns:
        Number of reassigned elements
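
The assignment step can be sketched as follows, under stated assumptions: each CF is represented only by its centroid and weight, the nearest mean is chosen by squared Euclidean distance, and an initial assignment of -1 marks a CF as not yet assigned. The names `AssignSketch` and `assignToNearest` are hypothetical and do not come from ELKI.

```java
import java.util.Arrays;

public class AssignSketch {
  /**
   * Hypothetical helper: assign each CF centroid to its nearest mean.
   * Updates assignment in place, rewrites weights (output) with the total
   * CF weight per cluster, and returns the number of changed assignments.
   */
  public static int assignToNearest(double[][] centroids, int[] cfWeights, double[][] means, int[] assignment, int[] weights) {
    Arrays.fill(weights, 0);
    int changed = 0;
    for (int i = 0; i < centroids.length; i++) {
      int best = 0;
      double bestDist = Double.POSITIVE_INFINITY;
      for (int c = 0; c < means.length; c++) {
        double dist = 0;
        for (int d = 0; d < means[c].length; d++) {
          double diff = centroids[i][d] - means[c][d];
          dist += diff * diff;
        }
        if (dist < bestDist) {
          bestDist = dist;
          best = c;
        }
      }
      if (assignment[i] != best) {
        assignment[i] = best;
        changed++;
      }
      weights[best] += cfWeights[i];
    }
    return changed;
  }

  public static void main(String[] args) {
    double[][] centroids = { { 0 }, { 1 }, { 9 }, { 10 } };
    int[] cfWeights = { 2, 1, 1, 4 };
    double[][] means = { { 0 }, { 10 } };
    int[] assignment = { -1, -1, -1, -1 }; // -1: not yet assigned (assumption of this sketch)
    int[] weights = new int[2];
    int changed = assignToNearest(centroids, cfWeights, means, assignment, weights);
    System.out.println(changed); // prints 4
    System.out.println(Arrays.toString(assignment)); // prints [0, 0, 1, 1]
    System.out.println(Arrays.toString(weights)); // prints [3, 5]
  }
}
```

A return value of 0 means no CF changed cluster, which is the usual convergence test for a Lloyd-style iteration.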
      • calculateVariances

        protected double[] calculateVariances(int[] assignment,
                                              double[][] means,
                                              java.util.ArrayList<? extends ClusterFeature> cfs,
                                              int[] weights)
        Calculate variance of clusters based on clustering features.

        The result is only correct after updating the means!

        Parameters:
        assignment - Cluster assignment of CFs
        means - Cluster means
        cfs - CF leaves
        weights - Cluster weights
        Returns:
        Per-cluster variances
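
One way to see why the means must be current is the decomposition sketched below. For a CF with centroid c, weight n, and within-CF sum of squared deviations ssd, the sum of squared deviations of its points from any reference point m is ssd + n * ||c - m||^2. Summing this over a cluster gives its variance only if m is the up-to-date cluster mean. This is a minimal sketch, not ELKI's code: the class `CfVarianceSketch`, the method `clusterVariances`, the array-based CF representation, and the division by the cluster weight are all assumptions.

```java
import java.util.Arrays;

public class CfVarianceSketch {
  /**
   * Hypothetical helper: per-cluster variance from CF summaries.
   * centroids[i], ssd[i] and cfWeights[i] describe CF i; ssd[i] is the sum of
   * squared deviations of the CF's points from its own centroid.
   */
  public static double[] clusterVariances(double[][] centroids, double[] ssd, int[] cfWeights, int[] assignment, double[][] means) {
    int k = means.length;
    double[] sum = new double[k];
    long[] total = new long[k];
    for (int i = 0; i < centroids.length; i++) {
      int c = assignment[i];
      double dist = 0;
      for (int d = 0; d < means[c].length; d++) {
        double diff = centroids[i][d] - means[c][d];
        dist += diff * diff;
      }
      // Within-CF deviations plus the shift of the CF centroid to the cluster mean.
      sum[c] += ssd[i] + cfWeights[i] * dist;
      total[c] += cfWeights[i];
    }
    for (int c = 0; c < k; c++) {
      // Normalizing by the cluster weight is an assumption of this sketch.
      sum[c] = total[c] > 0 ? sum[c] / total[c] : 0.;
    }
    return sum;
  }

  public static void main(String[] args) {
    // CF 0 summarizes the points {0, 2}, CF 1 summarizes {4, 6}; both belong to cluster 0.
    double[][] centroids = { { 1 }, { 5 } };
    double[] ssd = { 2, 2 };
    int[] cfWeights = { 2, 2 };
    int[] assignment = { 0, 0 };
    double[][] means = { { 3 } }; // the true mean of {0, 2, 4, 6}
    System.out.println(Arrays.toString(clusterVariances(centroids, ssd, cfWeights, assignment, means))); // prints [5.0]
  }
}
```

The result matches the direct computation on the raw points: (9 + 1 + 1 + 9) / 4 = 5.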
      • distance

        private double distance(elki.data.NumberVector x,
                                double[] y)
        Increments the distance computation counter and computes the squared Euclidean distance between x and y.

        Note: specializing this rather than calling SquaredEuclideanDistance was much faster, as we can avoid wrapping the array.

        Parameters:
        x - Point x
        y - Point y
        Returns:
        distance
      • distance

        private double distance(double[] x,
                                double[] y)
        Increments the distance computation counter and computes the squared Euclidean distance between x and y.
        Parameters:
        x - Point x
        y - Point y
        Returns:
        distance
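
The idea behind both distance overloads, counting each call and working directly on raw arrays instead of wrapping them in vector objects, might look like the sketch below. That the distance is squared Euclidean is inferred from the note on SquaredEuclideanDistance above; the class `CountingDistanceSketch` and the accessor `getDistanceCount` are hypothetical names, not part of ELKI.

```java
public class CountingDistanceSketch {
  /** Number of distance calculations performed so far. */
  private long diststat = 0;

  /** Squared Euclidean distance on raw arrays; also increments the counter. */
  public double distance(double[] x, double[] y) {
    diststat++;
    double sum = 0;
    for (int d = 0; d < x.length; d++) {
      double diff = x[d] - y[d];
      sum += diff * diff;
    }
    return sum;
  }

  /** Hypothetical accessor for the counter. */
  public long getDistanceCount() {
    return diststat;
  }

  public static void main(String[] args) {
    CountingDistanceSketch s = new CountingDistanceSketch();
    System.out.println(s.distance(new double[] { 0, 0 }, new double[] { 3, 4 })); // prints 25.0
    System.out.println(s.getDistanceCount()); // prints 1
  }
}
```

Because the loop reads the arrays in place, no temporary objects are allocated per call, which matters when the method sits in the innermost loop of the assignment step.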
      • getLogger

        protected elki.logging.Logging getLogger()
        Description copied from class: AbstractKMeans
        Get the (STATIC) logger for this class.
        Specified by:
        getLogger in class AbstractKMeans<elki.data.NumberVector,KMeansModel>
        Returns:
        the static logger