Class CFTree<L extends ClusterFeature>


  • @Reference(authors="T. Zhang, R. Ramakrishnan, M. Livny",title="BIRCH: An Efficient Data Clustering Method for Very Large Databases",booktitle="Proc. 1996 ACM SIGMOD International Conference on Management of Data",url="https://doi.org/10.1145/233269.233324",bibkey="DBLP:conf/sigmod/ZhangRL96") @Reference(authors="T. Zhang, R. Ramakrishnan, M. Livny",title="BIRCH: A New Data Clustering Algorithm and Its Applications",booktitle="Data Min. Knowl. Discovery",url="https://doi.org/10.1023/A:1009783824328",bibkey="DBLP:journals/datamine/ZhangRL97") @Reference(authors="Andreas Lang and Erich Schubert",title="BETULA: Numerically Stable CF-Trees for BIRCH Clustering",booktitle="Int. Conf on Similarity Search and Applications",url="https://doi.org/10.1007/978-3-030-60936-8_22",bibkey="DBLP:conf/sisap/LangS20") @Reference(authors="Andreas Lang and Erich Schubert",title="BETULA: Fast Clustering of Large Data with Improved BIRCH CF-Trees",booktitle="Information Systems",url="https://doi.org/10.1016/j.is.2021.101918",bibkey="DBLP:journals/is/LangS22")
    public class CFTree<L extends ClusterFeature>
    extends java.lang.Object
    Partial implementation of the CFTree as used by BIRCH and BETULA.

    Important differences:

    1. Leaf nodes and directory nodes have the same capacity
    2. Condensing and memory limits are not implemented
    3. Merging refinement (merge-resplit) is not implemented
    Because we want to be able to track the cluster assignments of all data points easily, we need to store the point IDs, and it is not possible to implement the originally proposed page size management at the same time.

    Condensing and merging refinement are possible, and improvements to this code are welcome - please send a pull request!

    References:

    T. Zhang, R. Ramakrishnan, M. Livny
    BIRCH: An Efficient Data Clustering Method for Very Large Databases Proc. 1996 ACM SIGMOD International Conference on Management of Data

    T. Zhang, R. Ramakrishnan, M. Livny
    BIRCH: A New Data Clustering Algorithm and Its Applications
    Data Min. Knowl. Discovery

    Andreas Lang and Erich Schubert
    BETULA: Numerically Stable CF-Trees for BIRCH Clustering
    Int. Conf on Similarity Search and Applications 2020

    Andreas Lang and Erich Schubert
    BETULA: Fast Clustering of Large Data with Improved BIRCH CF-Trees
    Information Systems

    Since:
    0.8.0
    Author:
    Erich Schubert, Andreas Lang
    • Field Summary

      Fields 
      Modifier and Type Field Description
      (package private) CFDistance abs
      Absorption criterion
      (package private) long absstat
      Number ob absorption calculations
      (package private) int capacity
      Capacity of a node.
      (package private) CFDistance dist
      Cluster distance
      (package private) long diststat
      Number of distance calculations
      (package private) ClusterFeature.Factory<L> factory
      Cluster feature factory
      (package private) java.util.Map<ClusterFeature,​elki.database.ids.ArrayModifiableDBIDs> idmap
      Stored leaf entry to dbid relation
      (package private) int leaves
      Leaf node counter.
      static elki.logging.Logging LOG
      Class logger.
      (package private) int maxleaves
      Maximum number of leaves allowed
      (package private) int rebuildstat
      Number of tree rebuilds
      (package private) CFNode<L> root
      Current root node.
      (package private) CFTree.Threshold tCriterium
      Threshold heuristic strategy.
      (package private) double thresholdsq
      Squared maximum radius threshold of a clustering feature.
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      private void estimateThreshold​(CFNode<L> current, java.util.ArrayList<L> cfs, double[] thresholds)  
      L findLeaf​(elki.data.NumberVector nv)
      Find the leaf of a cluster, to get the final cluster assignment.
      private L findLeaf​(CFNode<L> node, elki.data.NumberVector nv)
      Find the leaf of a cluster, to get the final cluster assignment.
      int getCapacity()
      Get the tree capacity
      elki.database.ids.DBIDs getDBIDs​(ClusterFeature cf)
      Get the DBIDs of a cluster feature (if stored).
      java.util.ArrayList<L> getLeaves()
      Extract the leaves of the tree.
      CFNode<L> getRoot()
      Get the trees root node.
      void insert​(elki.data.NumberVector nv, elki.database.ids.DBIDRef dbid)
      Insert a data point into the tree.
      private CFNode<L> insert​(CFNode<L> node, elki.data.NumberVector nv, elki.database.ids.DBIDRef dbid)
      Recursive insertion.
      private CFNode<L> insert​(CFNode<L> node, AsClusterFeature nleaf)
      Recursive insertion.
      CFTree.LeafIterator<L> leafIterator()
      Get an iterator over the leaf nodes.
      int numLeaves()
      Get the number of leaves in the tree.
      protected static java.lang.StringBuilder printDebug​(java.lang.StringBuilder buf, ClusterFeature n, int d)
      Utility function for debugging.
      protected void rebuildTree()
      Rebuild the CFTree to condense it to approximately half the size.
      private CFNode<L> split​(CFNode<L> node, AsClusterFeature newchild)
      Split an overfull node.
      private double sqabsorption​(elki.data.NumberVector nv, ClusterFeature cf)
      Updates statistics and calculates distance between a Number Vector and a Cluster Feature based on selected criteria.
      private double sqabsorption​(ClusterFeature cf1, ClusterFeature cf2)
      Updates statistics and calculates distance between two Cluster Features based on selected criteria.
      private double sqdistance​(elki.data.NumberVector nv, ClusterFeature cf)
      Updates statistics and calculates distance between a Number Vector and a Cluster Feature based on selected criteria.
      private double sqdistance​(ClusterFeature cf1, ClusterFeature cf2)
      Updates statistics and calculates distance between two Cluster Features based on selected criteria.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • LOG

        public static final elki.logging.Logging LOG
        Class logger.
      • thresholdsq

        double thresholdsq
        Squared maximum radius threshold of a clustering feature.
      • capacity

        int capacity
        Capacity of a node.
      • leaves

        int leaves
        Leaf node counter.
      • maxleaves

        int maxleaves
        Maximum number of leaves allowed
      • rebuildstat

        int rebuildstat
        Number of tree rebuilds
      • diststat

        long diststat
        Number of distance calculations
      • absstat

        long absstat
        Number ob absorption calculations
      • idmap

        java.util.Map<ClusterFeature,​elki.database.ids.ArrayModifiableDBIDs> idmap
        Stored leaf entry to dbid relation
    • Constructor Detail

      • CFTree

        public CFTree​(ClusterFeature.Factory<L> factory,
                      CFDistance dist,
                      CFDistance abs,
                      double threshold,
                      int capacity,
                      CFTree.Threshold tCriterium,
                      int maxleaves,
                      boolean storeIds)
        Constructor.
        Parameters:
        factory - Cluster feature factory
        dist - Distance function to choose nearest
        abs - Absorption criterion
        threshold - Distance threshold
        capacity - Maximum number of leaves
        tCriterium - threshold adjustment rule
        maxleaves - Maximum number of leaves
        storeIds - Store object ids
    • Method Detail

      • insert

        public void insert​(elki.data.NumberVector nv,
                           elki.database.ids.DBIDRef dbid)
        Insert a data point into the tree.
        Parameters:
        nv - Object data
        dbid - Object id
      • rebuildTree

        protected void rebuildTree()
        Rebuild the CFTree to condense it to approximately half the size.
      • estimateThreshold

        private void estimateThreshold​(CFNode<L> current,
                                       java.util.ArrayList<L> cfs,
                                       double[] thresholds)
      • insert

        private CFNode<L> insert​(CFNode<L> node,
                                 elki.data.NumberVector nv,
                                 elki.database.ids.DBIDRef dbid)
        Recursive insertion.
        Parameters:
        node - Current node
        nv - Object data
        dbid - Object id
        Returns:
        New sibling, if the node was split.
      • findLeaf

        public L findLeaf​(elki.data.NumberVector nv)
        Find the leaf of a cluster, to get the final cluster assignment.

        In contrast to insert(elki.data.NumberVector, elki.database.ids.DBIDRef), this does not modify the tree.

        Parameters:
        nv - Object data
        Returns:
        Leaf this vector should be assigned to
      • findLeaf

        private L findLeaf​(CFNode<L> node,
                           elki.data.NumberVector nv)
        Find the leaf of a cluster, to get the final cluster assignment.

        In contrast to insert(elki.data.NumberVector, elki.database.ids.DBIDRef), this does not modify the tree.

        TODO: allow "outliers"?

        Parameters:
        node - Current node
        nv - Object data
        Returns:
        Leaf this vector should be assigned to
      • split

        private CFNode<L> split​(CFNode<L> node,
                                AsClusterFeature newchild)
        Split an overfull node.
        Parameters:
        node - Node to split
        newchild - Additional child
        Returns:
        New sibling of node
      • insert

        private CFNode<L> insert​(CFNode<L> node,
                                 AsClusterFeature nleaf)
        Recursive insertion.
        Parameters:
        node - Current node
        nleaf - Leaf entry to add.
        Returns:
        New sibling, if the node was split.
      • leafIterator

        public CFTree.LeafIterator<L> leafIterator()
        Get an iterator over the leaf nodes.
        Returns:
        Leaf node iterator.
      • getLeaves

        public java.util.ArrayList<L> getLeaves()
        Extract the leaves of the tree.
        Returns:
        Leaves
      • printDebug

        protected static java.lang.StringBuilder printDebug​(java.lang.StringBuilder buf,
                                                            ClusterFeature n,
                                                            int d)
        Utility function for debugging.
        Parameters:
        buf - Output buffer
        n - Current node
        d - Depth
        Returns:
        Output buffer
      • sqdistance

        private double sqdistance​(ClusterFeature cf1,
                                  ClusterFeature cf2)
        Updates statistics and calculates distance between two Cluster Features based on selected criteria.
        Parameters:
        cf1 - First Cluster Feature
        cf2 - Second Cluster Feature
        Returns:
        Distance
      • sqdistance

        private double sqdistance​(elki.data.NumberVector nv,
                                  ClusterFeature cf)
        Updates statistics and calculates distance between a Number Vector and a Cluster Feature based on selected criteria.
        Parameters:
        nv - Number Vector
        cf - Cluster Feature
        Returns:
        Distance
      • sqabsorption

        private double sqabsorption​(ClusterFeature cf1,
                                    ClusterFeature cf2)
        Updates statistics and calculates distance between two Cluster Features based on selected criteria.
        Parameters:
        cf1 - First Cluster Feature
        cf2 - Second Cluster Feature
        Returns:
        Distance
      • sqabsorption

        private double sqabsorption​(elki.data.NumberVector nv,
                                    ClusterFeature cf)
        Updates statistics and calculates distance between a Number Vector and a Cluster Feature based on selected criteria.
        Parameters:
        nv - Number Vector
        cf - Cluster Feature
        Returns:
        Distance
      • numLeaves

        public int numLeaves()
        Get the number of leaves in the tree.
        Returns:
        number of leaves
      • getRoot

        public CFNode<L> getRoot()
        Get the trees root node.
        Returns:
        root node
      • getCapacity

        public int getCapacity()
        Get the tree capacity
        Returns:
        tree capacity
      • getDBIDs

        public elki.database.ids.DBIDs getDBIDs​(ClusterFeature cf)
        Get the DBIDs of a cluster feature (if stored).
        Parameters:
        cf - Cluster feature
        Returns:
        Object ids