public interface DDSketches
DDSketch.
DDSketch works by mapping floating-point input values to bins and counting the number of values for each bin.
The mapping to bins is handled by an implementation of IndexMapping, while the underlying structure that
keeps track of bin counts is Store.
LogarithmicMapping is a simple mapping that maps values to their logarithm to a properly chosen base that
ensures the relative accuracy. It is also the one that offers the smallest memory footprint under the relative
accuracy guarantee of the sketch. However, because the logarithm may be costly to compute, other mappings can be
favored, such as the CubicallyInterpolatedMapping, which computes indexes at a faster rate but requires
slightly more memory to ensure the relative accuracy (about 1% more than the LogarithmicMapping). See IndexMapping for more details and more mappings.
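To make the idea of a logarithmic mapping concrete, here is a minimal sketch of how a value could be mapped to a bin index and back. It is an illustration only: the class `LogMappingSketch` and its methods are hypothetical names, and the code is not the library's LogarithmicMapping, whose indexing conventions may differ. It assumes positive input values and the base \(\gamma = \frac{1+\alpha}{1-\alpha}\).

```java
/** Hypothetical illustration of a logarithmic index mapping; not the library's LogarithmicMapping. */
final class LogMappingSketch {
    private final double gamma;
    private final double logGamma;

    LogMappingSketch(double relativeAccuracy) {
        if (!(relativeAccuracy > 0 && relativeAccuracy < 1)) {
            throw new IllegalArgumentException("relativeAccuracy must be in (0, 1)");
        }
        // gamma = (1 + alpha) / (1 - alpha) is the base of the logarithm.
        this.gamma = (1 + relativeAccuracy) / (1 - relativeAccuracy);
        this.logGamma = Math.log(gamma);
    }

    /** Index of the bin holding a positive value: bin i covers (gamma^(i-1), gamma^i]. */
    int index(double value) {
        return (int) Math.ceil(Math.log(value) / logGamma);
    }

    /** Representative value of bin i, within the relative accuracy of every value in that bin. */
    double value(int index) {
        return 2 * Math.pow(gamma, index) / (gamma + 1);
    }
}
```

With this choice of representative, mapping a value to its index and back yields an estimate whose relative error is at most \(\alpha\), up to floating-point rounding.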
Bin counts are tracked by instances of Store (one for positive values and another one for negative values).
They are essentially objects that map int indexes to double counters. Multiple implementations can be
used with different behaviors and properties. Implementations of DenseStore are backed by an array and offer
constant-time sketch insertion, but they may waste memory if input values are sparse as they keep track of contiguous
bins. SparseStore only keeps track of non-empty bins, hence a better memory efficiency, but its insertion
speed is logarithmic in the number of non-empty bins.
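The trade-off between the two kinds of store can be illustrated with a small, self-contained sketch. `StoreSketch` and its methods are hypothetical and do not reproduce the library's DenseStore or SparseStore; the sorted map is only one structure that happens to have insertion cost logarithmic in the number of non-empty bins.

```java
import java.util.TreeMap;

/** Hypothetical comparison of dense and sparse bookkeeping; not the library's Store classes. */
final class StoreSketch {
    /** Slots a contiguous array needs to cover every index between the smallest and largest one seen. */
    static int denseSlots(int[] indexes) {
        if (indexes.length == 0) {
            return 0;
        }
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i : indexes) {
            min = Math.min(min, i);
            max = Math.max(max, i);
        }
        return max - min + 1;
    }

    /** Sorted map from bin index to count; only non-empty bins get an entry. */
    static TreeMap<Integer, Double> sparseCounts(int[] indexes) {
        TreeMap<Integer, Double> counts = new TreeMap<>();
        for (int i : indexes) {
            counts.merge(i, 1.0, Double::sum);
        }
        return counts;
    }
}
```

For the indexes -1000, 3, 3 and 1000, the contiguous array needs 2001 slots while the sorted map holds 3 entries, which is the situation where a sparse store pays off.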
As an order of magnitude, when using UnboundedSizeDenseStore (e.g., unboundedDense(double) and logarithmicUnboundedDense(double)), the size of the sketch depends on the logarithmic range that is covered by input
values. If \(\alpha\) is the relative accuracy of the sketch, the number of bins that are needed to cover positive
values between \(a\) and \(b\) is \(\frac{\log b - \log a}{\log \gamma}\), where \(\gamma = \frac{1+\alpha}{1-\alpha}\).
Given that bin counters are tracked using an array of double, each bin takes 8
bytes of memory. If the sketch contains negative values, the same method gives the additional memory size required to
track them. To that, a constant memory size needs to be added for other data that the sketch maintains.
As an example, if working with durations using unboundedDense(double) or logarithmicUnboundedDense(double), with a
relative accuracy of 2%, about 2 kB (275 bins) are needed to cover values between 1 millisecond and 1 minute, and about
6 kB (802 bins) to cover values between 1 nanosecond and 1 day.
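The figures above can be checked by evaluating the formula directly. The following is a minimal sketch; `BinCountEstimate` and its method names are hypothetical helpers, not part of the library, and the byte estimate ignores the constant overhead mentioned earlier.

```java
/** Hypothetical helper that evaluates the bin-count formula; not part of the library. */
final class BinCountEstimate {
    /** (log b - log a) / log gamma, with gamma = (1 + alpha) / (1 - alpha). */
    static double numBins(double relativeAccuracy, double min, double max) {
        double gamma = (1 + relativeAccuracy) / (1 - relativeAccuracy);
        return (Math.log(max) - Math.log(min)) / Math.log(gamma);
    }

    /** Approximate size of the bin counters alone, at 8 bytes per double counter. */
    static long approxBytes(double relativeAccuracy, double min, double max) {
        return 8 * Math.round(numBins(relativeAccuracy, min, max));
    }

    public static void main(String[] args) {
        // Durations expressed in seconds, with a relative accuracy of 2%.
        System.out.println(Math.round(numBins(0.02, 1e-3, 60)));     // 1 ms to 1 min
        System.out.println(Math.round(numBins(0.02, 1e-9, 86_400))); // 1 ns to 1 day
    }
}
```

Rounded to the nearest integer, the two estimates are 275 and 802 bins, that is, roughly 2.2 kB and 6.4 kB of counters.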
Bounded dense stores (e.g., collapsingLowestDense(double, int), collapsingHighestDense(double, int), logarithmicCollapsingLowestDense(double, int) and logarithmicCollapsingHighestDense(double, int)) limit the size of the sketch to
approximately 8 * maxNumBins bytes by collapsing the lowest or highest bins, which therefore causes the lowest or highest
quantiles to be inaccurate. Collapsing happens only when necessary, that is, when the logarithmic range of input
values is too large to be fully tracked with the allowed number of bins, which can be determined using the formula
above. As shown in the DDSketch paper, the likelihood
of a store collapsing when using the default bound is vanishingly small for most datasets.
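As a rough illustration of what collapsing the lowest bins means, the toy model below keeps a window of at most maxNumBins consecutive indexes ending at the highest index seen, and folds the counts of any lower bin into the lowest bin of that window. `CollapsingLowestSketch` is a hypothetical name, and the sorted map is used only for brevity; the library's collapsing stores are array-backed and their exact behavior may differ.

```java
import java.util.TreeMap;

/** Hypothetical toy model of lowest-bin collapsing; not the library's collapsing store. */
final class CollapsingLowestSketch {
    /** Counts one value in the bin at the given index, then collapses bins below the allowed window. */
    static void add(TreeMap<Integer, Double> bins, int index, int maxNumBins) {
        if (maxNumBins < 1) {
            throw new IllegalArgumentException("maxNumBins must be at least 1");
        }
        bins.merge(index, 1.0, Double::sum);
        // The window covers maxNumBins consecutive indexes ending at the highest one seen.
        int lowestKept = bins.lastKey() - maxNumBins + 1;
        // Fold every bin below the window into the lowest bin that is kept.
        while (bins.firstKey() < lowestKept) {
            double folded = bins.pollFirstEntry().getValue();
            bins.merge(lowestKept, folded, Double::sum);
        }
    }
}
```

The total count is preserved, but values that ended up in the folded bin are no longer distinguishable from one another, which is why the accuracy guarantee is lost on the lowest quantiles only.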
| Modifier and Type | Method and Description |
|---|---|
| static DDSketch | collapsingHighestDense(double relativeAccuracy, int maxNumBins)<br>Constructs an instance of DDSketch that offers constant-time insertion and whose size grows until the maximum number of bins is reached, at which point bins with highest indices are collapsed, which causes the relative accuracy guarantee to be lost on highest quantiles. |
| static DDSketch | collapsingLowestDense(double relativeAccuracy, int maxNumBins)<br>Constructs an instance of DDSketch that offers constant-time insertion and whose size grows until the maximum number of bins is reached, at which point bins with lowest indices are collapsed, which causes the relative accuracy guarantee to be lost on lowest quantiles. |
| static DDSketch | logarithmicCollapsingHighestDense(double relativeAccuracy, int maxNumBins)<br>Constructs an instance of DDSketch that offers constant-time insertion and whose size grows until the maximum number of bins is reached, at which point bins with highest indices are collapsed, which causes the relative accuracy guarantee to be lost on highest quantiles. |
| static DDSketch | logarithmicCollapsingLowestDense(double relativeAccuracy, int maxNumBins)<br>Constructs an instance of DDSketch that offers constant-time insertion and whose size grows until the maximum number of bins is reached, at which point bins with lowest indices are collapsed, which causes the relative accuracy guarantee to be lost on lowest quantiles. |
| static DDSketch | logarithmicUnboundedDense(double relativeAccuracy)<br>Constructs an instance of DDSketch that offers constant-time insertion and whose size grows indefinitely to accommodate the range of input values. |
| static DDSketch | sparse(double relativeAccuracy)<br>Constructs an instance of DDSketch that offers insertion time that is logarithmic in the number of non-empty bins that the sketch contains and whose size grows indefinitely to accommodate the range of input values. |
| static DDSketch | unboundedDense(double relativeAccuracy)<br>Constructs an instance of DDSketch that offers constant-time insertion and whose size grows indefinitely to accommodate the range of input values. |
static DDSketch unboundedDense(double relativeAccuracy)
Constructs an instance of DDSketch that offers constant-time insertion and whose size grows indefinitely
to accommodate the range of input values.
Parameters:
relativeAccuracy - the relative accuracy guaranteed by the sketch
Returns:
an instance of DDSketch

static DDSketch collapsingLowestDense(double relativeAccuracy, int maxNumBins)
Constructs an instance of DDSketch that offers constant-time insertion and whose size grows until the
maximum number of bins is reached, at which point bins with lowest indices are collapsed, which causes the
relative accuracy guarantee to be lost on lowest quantiles.
Parameters:
relativeAccuracy - the relative accuracy guaranteed by the sketch, for non-collapsed bins
maxNumBins - the maximum number of bins to be tracked
Returns:
an instance of DDSketch

static DDSketch collapsingHighestDense(double relativeAccuracy, int maxNumBins)
Constructs an instance of DDSketch that offers constant-time insertion and whose size grows until the
maximum number of bins is reached, at which point bins with highest indices are collapsed, which causes the
relative accuracy guarantee to be lost on highest quantiles.
Parameters:
relativeAccuracy - the relative accuracy guaranteed by the sketch, for non-collapsed bins
maxNumBins - the maximum number of bins to be tracked
Returns:
an instance of DDSketch

static DDSketch sparse(double relativeAccuracy)
Constructs an instance of DDSketch that offers insertion time that is logarithmic in the number of
non-empty bins that the sketch contains and whose size grows indefinitely to accommodate the range of input
values. As opposed to unboundedDense(double), this sketch only tracks non-empty bins, hence its smaller memory
footprint, especially if input values are sparse.
Parameters:
relativeAccuracy - the relative accuracy guaranteed by the sketch
Returns:
an instance of DDSketch

static DDSketch logarithmicUnboundedDense(double relativeAccuracy)
Constructs an instance of DDSketch that offers constant-time insertion and whose size grows indefinitely
to accommodate the range of input values.
As opposed to unboundedDense(double), it uses an exactly logarithmic mapping, which is more costly.
Parameters:
relativeAccuracy - the relative accuracy guaranteed by the sketch
Returns:
an instance of DDSketch

static DDSketch logarithmicCollapsingLowestDense(double relativeAccuracy, int maxNumBins)
Constructs an instance of DDSketch that offers constant-time insertion and whose size grows until the
maximum number of bins is reached, at which point bins with lowest indices are collapsed, which causes the
relative accuracy guarantee to be lost on lowest quantiles.
As opposed to collapsingLowestDense(double, int), it uses an exactly logarithmic mapping, which is more costly.
Parameters:
relativeAccuracy - the relative accuracy guaranteed by the sketch, for non-collapsed bins
maxNumBins - the maximum number of bins to be tracked
Returns:
an instance of DDSketch

static DDSketch logarithmicCollapsingHighestDense(double relativeAccuracy, int maxNumBins)
Constructs an instance of DDSketch that offers constant-time insertion and whose size grows until the
maximum number of bins is reached, at which point bins with highest indices are collapsed, which causes the
relative accuracy guarantee to be lost on highest quantiles.
As opposed to collapsingHighestDense(double, int), it uses an exactly logarithmic mapping, which is more costly.
Parameters:
relativeAccuracy - the relative accuracy guaranteed by the sketch, for non-collapsed bins
maxNumBins - the maximum number of bins to be tracked
Returns:
an instance of DDSketch