Class TDigest

  • All Implemented Interfaces:
    Serializable
    Direct Known Subclasses:
    AbstractTDigest

    public abstract class TDigest
    extends Object
    implements Serializable
    Adaptive histogram based on something like streaming k-means crossed with Q-digest. The special characteristics of this algorithm are: - smaller summaries than Q-digest - works on doubles as well as integers. - provides part per million accuracy for extreme quantiles and typically <1000 ppm accuracy for middle quantiles - fast - simple - test coverage roughly at 90% - easy to adapt for use with map-reduce
    See Also:
    Serialized Form
    • Constructor Detail

      • TDigest

        public TDigest()
    • Method Detail

      • createMergingDigest

        public static TDigest createMergingDigest​(double compression)
        Creates an MergingDigest. This is generally the best known implementation right now.
        Parameters:
        compression - The compression parameter. 100 is a common value for normal uses. 1000 is extremely large. The number of centroids retained will be a smallish (usually less than 10) multiple of this number.
        Returns:
        the MergingDigest
      • createAvlTreeDigest

        public static TDigest createAvlTreeDigest​(double compression)
        Creates an AVLTreeDigest. AVLTreeDigest is nearly the best known implementation right now.
        Parameters:
        compression - The compression parameter. 100 is a common value for normal uses. 1000 is extremely large. The number of centroids retained will be a smallish (usually less than 10) multiple of this number.
        Returns:
        the AvlTreeDigest
      • createDigest

        public static TDigest createDigest​(double compression)
        Creates a TDigest of whichever type is the currently recommended type. MergingDigest is generally the best known implementation right now.
        Parameters:
        compression - The compression parameter. 100 is a common value for normal uses. 1000 is extremely large. The number of centroids retained will be a smallish (usually less than 10) multiple of this number.
        Returns:
        the TDigest
      • add

        public abstract void add​(double x,
                                 int w)
        Adds a sample to a histogram.
        Parameters:
        x - The value to add.
        w - The weight of this point.
      • add

        public abstract void add​(List<? extends TDigest> others)
      • compress

        public abstract void compress()
        Re-examines a t-digest to determine whether some centroids are redundant. If your data are perversely ordered, this may be a good idea. Even if not, this may save 20% or so in space. The cost is roughly the same as adding as many data points as there are centroids. This is typically < 10 * compression, but could be as high as 100 * compression. This is a destructive operation that is not thread-safe.
      • size

        public abstract long size()
        Returns the number of points that have been added to this TDigest.
        Returns:
        The sum of the weights on all centroids.
      • cdf

        public abstract double cdf​(double x)
        Returns the fraction of all points added which are <= x.
      • quantile

        public abstract double quantile​(double q)
        Returns an estimate of the cutoff such that a specified fraction of the data added to this TDigest would be less than or equal to the cutoff.
        Parameters:
        q - The desired fraction
        Returns:
        The value x such that cdf(x) == q
      • centroids

        public abstract Collection<Centroid> centroids()
        A Collection that lets you go through the centroids in ascending order by mean. Centroids returned will not be re-used, but may or may not share storage with this TDigest.
        Returns:
        The centroids in the form of a Collection.
      • compression

        public abstract double compression()
        Returns the current compression factor.
        Returns:
        The compression factor originally used to set up the TDigest.
      • byteSize

        public abstract int byteSize()
        Returns the number of bytes required to encode this TDigest using #asBytes().
        Returns:
        The number of bytes required.
      • smallByteSize

        public abstract int smallByteSize()
        Returns the number of bytes required to encode this TDigest using #asSmallBytes(). Note that this is just as expensive as actually compressing the digest. If you don't care about time, but want to never over-allocate, this is fine. If you care about compression and speed, you pretty much just have to overallocate by using allocating #byteSize() bytes.
        Returns:
        The number of bytes required.
      • asBytes

        public abstract void asBytes​(ByteBuffer buf)
        Serialize this TDigest into a byte buffer. Note that the serialization used is very straightforward and is considerably larger than strictly necessary.
        Parameters:
        buf - The byte buffer into which the TDigest should be serialized.
      • asSmallBytes

        public abstract void asSmallBytes​(ByteBuffer buf)
        Serialize this TDigest into a byte buffer. Some simple compression is used such as using variable byte representation to store the centroid weights and using delta-encoding on the centroid means so that floats can be reasonably used to store the centroid means.
        Parameters:
        buf - The byte buffer into which the TDigest should be serialized.
      • recordAllData

        public abstract TDigest recordAllData()
        Tell this TDigest to record the original data as much as possible for test purposes.
        Returns:
        This TDigest so that configurations can be done in fluent style.
      • isRecording

        public abstract boolean isRecording()
      • add

        public abstract void add​(double x)
        Add a sample to this TDigest.
        Parameters:
        x - The data value to add
      • add

        public abstract void add​(TDigest other)
        Add all of the centroids of another TDigest to this one.
        Parameters:
        other - The other TDigest
      • centroidCount

        public abstract int centroidCount()
      • getMin

        public double getMin()
      • getMax

        public double getMax()