Package com.tdunning.math.stats
Class MergingDigest
- java.lang.Object
  - com.tdunning.math.stats.TDigest
    - com.tdunning.math.stats.AbstractTDigest
      - com.tdunning.math.stats.MergingDigest
All Implemented Interfaces:
Serializable
public class MergingDigest extends AbstractTDigest
Maintains a t-digest by collecting new points in a buffer that is occasionally sorted and merged into a sorted array of previously computed centroids. This can be very fast because the cost of sorting and merging is amortized over several insertions. If we keep N centroids in total and the input buffer is k long, then the amortized cost per insertion is something like N/k + log k. These costs even out when N/k = log k. Balancing costs is often a good place to start in optimizing an algorithm. For different values of the compression factor, the following table shows estimated asymptotic values of N and suggested values of k:
Compression   N     k
50            78    25
100           157   42
200           314   73
The virtues of this kind of t-digest implementation include:
- No allocation is required after initialization
- The data structure automatically compresses existing centroids when possible
- No Java object overhead is incurred for centroids since data is kept in primitive arrays
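The suggested buffer sizes in the table above can be reproduced, at least approximately, by solving N/k = log k numerically. The following is a minimal sketch, not code from the library. It assumes that the logarithm is the natural log and that N is roughly pi/2 times the compression factor; both assumptions are inferred from the table values rather than stated in this documentation. The class and method names are hypothetical.

```java
public class BufferSizeSketch {
    /** Amortized cost per insertion for n centroids and a buffer of k points (natural log assumed). */
    public static double amortizedCost(double n, double k) {
        return n / k + Math.log(k);
    }

    /**
     * Solves n / k = ln(k) for k by bisection, for n > 0.
     * f(k) = k * ln(k) - n is increasing for k >= 1, with f(1) < 0 and f(n + 2) > 0.
     */
    public static double balancedBufferSize(double n) {
        double lo = 1.0;
        double hi = n + 2.0;
        for (int i = 0; i < 100; i++) {
            double mid = 0.5 * (lo + hi);
            if (mid * Math.log(mid) < n) {
                lo = mid;
            } else {
                hi = mid;
            }
        }
        return 0.5 * (lo + hi);
    }

    public static void main(String[] args) {
        for (double compression : new double[] {50, 100, 200}) {
            // Assumption: N is about pi/2 * compression (inferred from the table).
            double n = Math.PI / 2 * compression;
            long k = Math.round(balancedBufferSize(n));
            System.out.printf("%.0f %d %d%n", compression, (long) Math.floor(n), k);
        }
    }
}
```

Under these assumptions the computed values of k round to 25, 42 and 73, which agrees with the table.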
See Also:
Serialized Form
-
Nested Class Summary
Nested Classes
Modifier and Type     Class                     Description
static class          MergingDigest.Encoding
-
Constructor Summary
Constructors
Constructor                                                  Description
MergingDigest(double compression)                            Allocates a buffer merging t-digest.
MergingDigest(double compression, int bufferSize)            If you know the size of the temporary buffer for incoming points, you can use this entry point.
MergingDigest(double compression, int bufferSize, int size)  Fully specified constructor.
-
Method Summary
Modifier and Type     Method                               Description
void                  add(double x, int w)                 Adds a sample to a histogram.
void                  add(List<? extends TDigest> others)
void                  asBytes(ByteBuffer buf)              Serialize this TDigest into a byte buffer.
void                  asSmallBytes(ByteBuffer buf)         Serialize this TDigest into a byte buffer.
int                   byteSize()                           Returns the number of bytes required to encode this TDigest using #asBytes().
double                cdf(double x)                        Returns the fraction of all points added which are <= x.
int                   centroidCount()
Collection<Centroid>  centroids()                          A Collection that lets you go through the centroids in ascending order by mean.
void                  compress()                           Re-examines a t-digest to determine whether some centroids are redundant.
double                compression()                        Returns the current compression factor.
static MergingDigest  fromBytes(ByteBuffer buf)
double                quantile(double q)                   Returns an estimate of the cutoff such that a specified fraction of the data added to this TDigest would be less than or equal to the cutoff.
TDigest               recordAllData()                      Turns on internal data recording.
long                  size()                               Returns the number of points that have been added to this TDigest.
int                   smallByteSize()                      Returns the number of bytes required to encode this TDigest using #asSmallBytes().
-
Methods inherited from class com.tdunning.math.stats.AbstractTDigest
add, add, createCentroid, isRecording
-
Methods inherited from class com.tdunning.math.stats.TDigest
createAvlTreeDigest, createDigest, createMergingDigest, getMax, getMin
-
Constructor Detail
-
MergingDigest
public MergingDigest(double compression)
Allocates a buffer merging t-digest. This is the normally used constructor that allocates default sized internal arrays. Other versions are available, but should only be used for special cases.
Parameters:
compression - The compression factor
-
MergingDigest
public MergingDigest(double compression, int bufferSize)
If you know the size of the temporary buffer for incoming points, you can use this entry point.
Parameters:
compression - Compression factor for t-digest. Same as 1/\delta in the paper.
bufferSize - How many samples to retain before merging.
-
MergingDigest
public MergingDigest(double compression, int bufferSize, int size)
Fully specified constructor. Normally only used for deserializing a buffer t-digest.
Parameters:
compression - Compression factor
bufferSize - Number of temporary centroids
size - Size of main buffer
-
Method Detail
-
recordAllData
public TDigest recordAllData()
Turns on internal data recording.
Overrides:
recordAllData in class AbstractTDigest
Returns:
This TDigest so that configurations can be done in fluent style.
-
add
public void add(double x, int w)
Description copied from class: TDigest
Adds a sample to a histogram.
-
compress
public void compress()
Description copied from class: TDigest
Re-examines a t-digest to determine whether some centroids are redundant. If your data are perversely ordered, this may be a good idea. Even if not, this may save 20% or so in space. The cost is roughly the same as adding as many data points as there are centroids. This is typically < 10 * compression, but could be as high as 100 * compression. This is a destructive operation that is not thread-safe.
-
size
public long size()
Description copied from class: TDigest
Returns the number of points that have been added to this TDigest.
-
cdf
public double cdf(double x)
Description copied from class: TDigest
Returns the fraction of all points added which are <= x.
-
quantile
public double quantile(double q)
Description copied from class: TDigest
Returns an estimate of the cutoff such that a specified fraction of the data added to this TDigest would be less than or equal to the cutoff.
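To make the definitions of cdf and quantile concrete, here is a small sketch that computes the exact quantities on a plain sorted array. It does not use the t-digest classes; the digest returns estimates of these values from its centroids, and its results may differ from the exact ones, for example through interpolation. The class and method names below are hypothetical.

```java
import java.util.Arrays;

public class ExactQuantileSketch {
    /** Exact fraction of samples that are <= x. */
    public static double cdf(double[] samples, double x) {
        int count = 0;
        for (double v : samples) {
            if (v <= x) {
                count++;
            }
        }
        return (double) count / samples.length;
    }

    /** Smallest sample v such that at least a fraction q of the samples are <= v. Expects sorted input. */
    public static double quantile(double[] sorted, double q) {
        // Number of samples that must be <= the returned cutoff.
        int rank = (int) Math.ceil(q * sorted.length);
        int index = Math.max(rank, 1) - 1;
        return sorted[Math.min(index, sorted.length - 1)];
    }

    public static void main(String[] args) {
        double[] data = {5, 1, 4, 2, 3};
        Arrays.sort(data);
        System.out.println("cdf(3) = " + cdf(data, 3));
        System.out.println("quantile(0.5) = " + quantile(data, 0.5));
    }
}
```

In this sense the two methods are approximate inverses of each other: quantile(q) returns a cutoff whose cdf is at least q.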
-
centroidCount
public int centroidCount()
Specified by:
centroidCount in class TDigest
-
centroids
public Collection<Centroid> centroids()
Description copied from class: TDigest
A Collection that lets you go through the centroids in ascending order by mean. Centroids returned will not be re-used, but may or may not share storage with this TDigest.
-
compression
public double compression()
Description copied from class: TDigest
Returns the current compression factor.
Specified by:
compression in class TDigest
Returns:
The compression factor originally used to set up the TDigest.
-
byteSize
public int byteSize()
Description copied from class: TDigest
Returns the number of bytes required to encode this TDigest using #asBytes().
-
smallByteSize
public int smallByteSize()
Description copied from class: TDigest
Returns the number of bytes required to encode this TDigest using #asSmallBytes(). Note that this is just as expensive as actually compressing the digest. If you don't care about time, but want to never over-allocate, this is fine. If you care about compression and speed, you pretty much just have to over-allocate by allocating #byteSize() bytes.
Specified by:
smallByteSize in class TDigest
Returns:
The number of bytes required.
-
asBytes
public void asBytes(ByteBuffer buf)
Description copied from class: TDigest
Serialize this TDigest into a byte buffer. Note that the serialization used is very straightforward and is considerably larger than strictly necessary.
-
asSmallBytes
public void asSmallBytes(ByteBuffer buf)
Description copied from class: TDigest
Serialize this TDigest into a byte buffer. Some simple compression is used, such as a variable-byte representation for the centroid weights and delta-encoding of the centroid means so that floats can reasonably be used to store them.
Specified by:
asSmallBytes in class TDigest
Parameters:
buf - The byte buffer into which the TDigest should be serialized.
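The two compression ideas mentioned above, variable-byte integers and delta-encoded means stored as floats, can be illustrated with a short sketch. This is not the library's wire format: the layout (a count, then float deltas, then variable-byte weights) and all names below are assumptions made purely for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class SmallEncodingSketch {
    /** Writes a non-negative int using 7 bits per byte; a set high bit means more bytes follow. */
    public static void putVarInt(ByteBuffer buf, int n) {
        while ((n & ~0x7F) != 0) {
            buf.put((byte) ((n & 0x7F) | 0x80));
            n >>>= 7;
        }
        buf.put((byte) n);
    }

    /** Reads an int written by putVarInt. */
    public static int getVarInt(ByteBuffer buf) {
        int result = 0;
        int shift = 0;
        byte b;
        do {
            b = buf.get();
            result |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }

    /** Hypothetical layout: count, then ascending means as float deltas, then weights as varints. */
    public static void encode(ByteBuffer buf, double[] means, int[] weights) {
        buf.putInt(means.length);
        double previous = 0;
        for (double mean : means) {
            // Differences between neighboring means are small, so float precision loses less.
            buf.putFloat((float) (mean - previous));
            previous = mean;
        }
        for (int w : weights) {
            putVarInt(buf, w);
        }
    }

    /** Reads the count and rebuilds the means by summing the stored deltas. */
    public static double[] decodeMeans(ByteBuffer buf) {
        int n = buf.getInt();
        double[] means = new double[n];
        double previous = 0;
        for (int i = 0; i < n; i++) {
            previous += buf.getFloat();
            means[i] = previous;
        }
        return means;
    }

    /** Reads n weights; call after decodeMeans on the same buffer. */
    public static int[] decodeWeights(ByteBuffer buf, int n) {
        int[] weights = new int[n];
        for (int i = 0; i < n; i++) {
            weights[i] = getVarInt(buf);
        }
        return weights;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        encode(buf, new double[] {1.0, 1.5, 4.0}, new int[] {1, 200, 70000});
        int used = buf.position();
        buf.flip();
        double[] means = decodeMeans(buf);
        int[] weights = decodeWeights(buf, means.length);
        System.out.println(used + " bytes: " + Arrays.toString(means) + " " + Arrays.toString(weights));
    }
}
```

Small weights take a single byte in this scheme, which is where most of the saving comes from when many centroids carry low weights. Because the means pass through float, values that are not exactly representable as floats come back only approximately.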
-
fromBytes
public static MergingDigest fromBytes(ByteBuffer buf)
-