public interface DDSketches
Static factory methods that construct preset instances of DDSketch.
DDSketch works by mapping floating-point input values to bins and counting the number
of values for each bin. The mapping to bins is handled by an implementation of IndexMapping, while the underlying structure that keeps track of bin counts is Store.
LogarithmicMapping is a simple mapping that maps values to their logarithm to a
properly chosen base that ensures the relative accuracy. It is also the one that offers the
smallest memory footprint under the relative accuracy guarantee of the sketch. However, because
the logarithm may be costly to compute, other mappings can be favored, such as the CubicallyInterpolatedMapping, which computes indexes at a faster rate but requires slightly more
memory to ensure the relative accuracy (about 1% more than the LogarithmicMapping). See
IndexMapping for more details and more mappings.
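The idea behind a logarithmic mapping can be sketched in a few lines. The class below is a hypothetical illustration, not the library's IndexMapping implementation: the name LogMappingSketch, its methods, and the choice of bin boundaries (bin i covers values in (γ^(i-1), γ^i]) are assumptions made for the example.

```java
/** Illustrative only: a hypothetical class, not the library's IndexMapping. */
final class LogMappingSketch {
    private final double gamma;
    private final double logGamma;

    LogMappingSketch(double relativeAccuracy) {
        if (!(relativeAccuracy > 0 && relativeAccuracy < 1)) {
            throw new IllegalArgumentException("relativeAccuracy must be in (0, 1)");
        }
        // gamma = (1 + alpha) / (1 - alpha) is the base of the logarithm.
        this.gamma = (1 + relativeAccuracy) / (1 - relativeAccuracy);
        this.logGamma = Math.log(gamma);
    }

    /** Bin index of a positive value; bin i is assumed to cover (gamma^(i-1), gamma^i]. */
    int index(double value) {
        return (int) Math.ceil(Math.log(value) / logGamma);
    }

    /** Representative value of a bin, within relative error alpha of every value in the bin. */
    double value(int index) {
        return 2 * Math.pow(gamma, index) / (1 + gamma);
    }

    public static void main(String[] args) {
        LogMappingSketch mapping = new LogMappingSketch(0.02);
        double input = 37.5;
        int bin = mapping.index(input);
        System.out.println("bin " + bin + " represents " + mapping.value(bin));
    }
}
```

Any value is replaced by the representative of its bin, so the returned estimate stays within the relative accuracy of the input under these assumptions.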
Bin counts are tracked by instances of Store (one for positive values and another one
for negative values). They are essentially objects that map int indexes to double
counters. Multiple implementations can be used with different behaviors and properties.
Implementations of DenseStore are backed by an array and offer constant-time sketch
insertion, but they may waste memory if input values are sparse as they keep track of contiguous
bins. SparseStore only keeps track of non-empty bins, which makes it more memory-efficient,
but its insertion time is logarithmic in the number of non-empty bins.
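The trade-off between the two kinds of stores can be illustrated with a toy model. The classes below are hypothetical and much simpler than the library's DenseStore and SparseStore (for instance, this dense variant reallocates on every growth, which a real implementation would avoid); they only show why a contiguous array wastes memory on sparse indexes while a sorted map does not.

```java
import java.util.Arrays;
import java.util.TreeMap;

/** Illustrative only: hypothetical toy stores, not the library's Store implementations. */
final class StoreSketch {

    /** Array-backed: one slot for every index between the lowest and highest seen. */
    static final class Dense {
        private double[] counts = new double[0];
        private int offset; // index held by counts[0]

        void add(int index) {
            if (counts.length == 0) {
                counts = new double[1];
                offset = index;
            } else if (index < offset) {
                // Grow towards lower indexes, shifting existing counts to the right.
                double[] grown = new double[counts.length + (offset - index)];
                System.arraycopy(counts, 0, grown, offset - index, counts.length);
                counts = grown;
                offset = index;
            } else if (index >= offset + counts.length) {
                // Grow towards higher indexes.
                counts = Arrays.copyOf(counts, index - offset + 1);
            }
            counts[index - offset]++; // direct array access once the slot exists
        }

        int allocatedBins() {
            return counts.length;
        }
    }

    /** Sorted-map-backed: only non-empty bins are stored; insertion is O(log n). */
    static final class Sparse {
        private final TreeMap<Integer, Double> counts = new TreeMap<>();

        void add(int index) {
            counts.merge(index, 1.0, Double::sum);
        }

        int allocatedBins() {
            return counts.size();
        }
    }

    public static void main(String[] args) {
        Dense dense = new Dense();
        Sparse sparse = new Sparse();
        for (int index : new int[] {0, 1000}) {
            dense.add(index);
            sparse.add(index);
        }
        // Two distant indexes: the dense store allocates the whole range in between.
        System.out.println(dense.allocatedBins() + " vs " + sparse.allocatedBins());
    }
}
```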
As an order of magnitude, when using UnboundedSizeDenseStore (e.g., unboundedDense(double) and logarithmicUnboundedDense(double)), the size of the sketch depends on the
logarithmic range that is covered by input values. If \(\alpha\) is the relative accuracy of the
sketch, the number of bins that are needed to cover positive values between \(a\) and \(b\) is
\(\frac{\log b - \log a}{\log \gamma}\) where \(\gamma = \frac{1+\alpha}{1-\alpha}\). Given that
bin counters are tracked using an array of double, each bin takes 8 bytes of memory. If
the sketch contains negative values, the same method gives the additional memory size required to
track them. To that, a constant memory size needs to be added for other data that the sketch
maintains.
As an example, if working with durations using unboundedDense(double) or logarithmicUnboundedDense(double), with a relative accuracy of 2%, about 2kB (275 bins) is needed to
cover values between 1 millisecond and 1 minute, and about 6kB (802 bins) to cover values between
1 nanosecond and 1 day.
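The figures above follow directly from the bin-count formula. The helper below is a hypothetical sketch (BinCountEstimate and its methods are not part of the library) that evaluates the formula and rounds to the nearest bin; it ignores the constant overhead of the sketch and any bins needed for negative values.

```java
/** Illustrative only: a hypothetical helper that evaluates the bin-count formula. */
final class BinCountEstimate {

    /** gamma = (1 + alpha) / (1 - alpha), where alpha is the relative accuracy. */
    static double gamma(double alpha) {
        return (1 + alpha) / (1 - alpha);
    }

    /** Approximate number of bins needed to cover positive values between a and b. */
    static long binsNeeded(double alpha, double a, double b) {
        return Math.round((Math.log(b) - Math.log(a)) / Math.log(gamma(alpha)));
    }

    /** Approximate array size in bytes, at 8 bytes per double counter. */
    static long approxBytes(double alpha, double a, double b) {
        return 8 * binsNeeded(alpha, a, b);
    }

    public static void main(String[] args) {
        double alpha = 0.02;
        // Durations expressed in seconds: 1 millisecond to 1 minute.
        System.out.println(binsNeeded(alpha, 1e-3, 60.0) + " bins, "
                + approxBytes(alpha, 1e-3, 60.0) + " bytes");
        // 1 nanosecond to 1 day.
        System.out.println(binsNeeded(alpha, 1e-9, 86_400.0) + " bins, "
                + approxBytes(alpha, 1e-9, 86_400.0) + " bytes");
    }
}
```

Only the ratio between the two bounds matters, so the unit chosen for the durations does not change the result.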
Bounded dense stores (e.g., collapsingLowestDense(double, int), collapsingHighestDense(double, int),
logarithmicCollapsingLowestDense(double, int) and logarithmicCollapsingHighestDense(double, int)) limit
the size of the sketch to approximately 8 * maxNumBins bytes by collapsing the lowest or highest
bins, which therefore causes the lowest or highest quantiles to be inaccurate. Collapsing happens only
when necessary, that is, when the logarithmic range of input values is too large to be fully
tracked with the allowed number of bins, which can be determined using the formula above. As
shown in the DDSketch paper, the
likelihood of a store collapsing when using the default bound is vanishingly small for most
datasets.
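To make the collapsing behavior concrete, here is a minimal sketch of a store that collapses its lowest bins. It is a hypothetical illustration, not the library's implementation: it uses a sorted map rather than an array for brevity, and it assumes that the bound applies to the range of tracked indexes, so that counts below highestIndex - maxNumBins + 1 are folded into that lowest allowed bin.

```java
import java.util.TreeMap;

/** Illustrative only: a hypothetical store that collapses its lowest bins. */
final class CollapsingLowestSketch {
    private final int maxNumBins;
    private final TreeMap<Integer, Double> counts = new TreeMap<>();

    CollapsingLowestSketch(int maxNumBins) {
        if (maxNumBins < 1) {
            throw new IllegalArgumentException("maxNumBins must be positive");
        }
        this.maxNumBins = maxNumBins;
    }

    void add(int index) {
        int highest = counts.isEmpty() ? index : Math.max(index, counts.lastKey());
        int lowestAllowed = highest - maxNumBins + 1;
        // A value below the tracked range is counted in the lowest allowed bin.
        counts.merge(Math.max(index, lowestAllowed), 1.0, Double::sum);
        // If the range moved up, fold the bins that fell out of it into the lowest allowed bin.
        while (counts.firstKey() < lowestAllowed) {
            double folded = counts.pollFirstEntry().getValue();
            counts.merge(lowestAllowed, folded, Double::sum);
        }
    }

    int lowestIndex() {
        return counts.firstKey();
    }

    double count(int index) {
        return counts.getOrDefault(index, 0.0);
    }

    double totalCount() {
        return counts.values().stream().mapToDouble(Double::doubleValue).sum();
    }

    public static void main(String[] args) {
        CollapsingLowestSketch store = new CollapsingLowestSketch(3);
        for (int index = 0; index <= 4; index++) {
            store.add(index);
        }
        // Indexes 0 and 1 no longer fit, so their counts now sit in the lowest tracked bin.
        System.out.println("lowest bin " + store.lowestIndex()
                + " holds " + store.count(store.lowestIndex()));
    }
}
```

No count is ever dropped: the total stays exact, but values folded into the lowest bin lose their relative accuracy guarantee, which is why only the lowest quantiles are affected.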
| Modifier and Type | Method and Description |
|---|---|
| static DDSketch | collapsingHighestDense(double relativeAccuracy, int maxNumBins): Constructs an instance of DDSketch that offers constant-time insertion and whose size grows until the maximum number of bins is reached, at which point bins with highest indices are collapsed, which causes the relative accuracy guarantee to be lost on highest quantiles. |
| static DDSketch | collapsingLowestDense(double relativeAccuracy, int maxNumBins): Constructs an instance of DDSketch that offers constant-time insertion and whose size grows until the maximum number of bins is reached, at which point bins with lowest indices are collapsed, which causes the relative accuracy guarantee to be lost on lowest quantiles. |
| static DDSketch | logarithmicCollapsingHighestDense(double relativeAccuracy, int maxNumBins): Constructs an instance of DDSketch that offers constant-time insertion and whose size grows until the maximum number of bins is reached, at which point bins with highest indices are collapsed, which causes the relative accuracy guarantee to be lost on highest quantiles. |
| static DDSketch | logarithmicCollapsingLowestDense(double relativeAccuracy, int maxNumBins): Constructs an instance of DDSketch that offers constant-time insertion and whose size grows until the maximum number of bins is reached, at which point bins with lowest indices are collapsed, which causes the relative accuracy guarantee to be lost on lowest quantiles. |
| static DDSketch | logarithmicUnboundedDense(double relativeAccuracy): Constructs an instance of DDSketch that offers constant-time insertion and whose size grows indefinitely to accommodate the range of input values. |
| static DDSketch | sparse(double relativeAccuracy): Constructs an instance of DDSketch that offers insertion time that is logarithmic in the number of non-empty bins that the sketch contains and whose size grows indefinitely to accommodate the range of input values. |
| static DDSketch | unboundedDense(double relativeAccuracy): Constructs an instance of DDSketch that offers constant-time insertion and whose size grows indefinitely to accommodate the range of input values. |
static DDSketch unboundedDense(double relativeAccuracy)
Constructs an instance of DDSketch that offers constant-time insertion and whose size grows indefinitely to accommodate the range of input values.
Parameters:
relativeAccuracy - the relative accuracy guaranteed by the sketch
Returns:
an instance of DDSketch

static DDSketch collapsingLowestDense(double relativeAccuracy, int maxNumBins)
Constructs an instance of DDSketch that offers constant-time insertion and whose size grows until the maximum number of bins is reached, at which point bins with lowest indices are collapsed, which causes the relative accuracy guarantee to be lost on lowest quantiles.
Parameters:
relativeAccuracy - the relative accuracy guaranteed by the sketch, for non-collapsed bins
maxNumBins - the maximum number of bins to be tracked
Returns:
an instance of DDSketch

static DDSketch collapsingHighestDense(double relativeAccuracy, int maxNumBins)
Constructs an instance of DDSketch that offers constant-time insertion and whose size grows until the maximum number of bins is reached, at which point bins with highest indices are collapsed, which causes the relative accuracy guarantee to be lost on highest quantiles.
Parameters:
relativeAccuracy - the relative accuracy guaranteed by the sketch, for non-collapsed bins
maxNumBins - the maximum number of bins to be tracked
Returns:
an instance of DDSketch

static DDSketch sparse(double relativeAccuracy)
Constructs an instance of DDSketch that offers insertion time that is logarithmic in the number of non-empty bins that the sketch contains and whose size grows indefinitely to accommodate the range of input values. As opposed to unboundedDense(double), this sketch only tracks non-empty bins, hence its smaller memory footprint, especially if input values are sparse.
Parameters:
relativeAccuracy - the relative accuracy guaranteed by the sketch
Returns:
an instance of DDSketch

static DDSketch logarithmicUnboundedDense(double relativeAccuracy)
Constructs an instance of DDSketch that offers constant-time insertion and whose size grows indefinitely to accommodate the range of input values. As opposed to unboundedDense(double), it uses an exactly logarithmic mapping, which is more costly.
Parameters:
relativeAccuracy - the relative accuracy guaranteed by the sketch
Returns:
an instance of DDSketch

static DDSketch logarithmicCollapsingLowestDense(double relativeAccuracy, int maxNumBins)
Constructs an instance of DDSketch that offers constant-time insertion and whose size grows until the maximum number of bins is reached, at which point bins with lowest indices are collapsed, which causes the relative accuracy guarantee to be lost on lowest quantiles. As opposed to collapsingLowestDense(double, int), it uses an exactly logarithmic mapping, which is more costly.
Parameters:
relativeAccuracy - the relative accuracy guaranteed by the sketch, for non-collapsed bins
maxNumBins - the maximum number of bins to be tracked
Returns:
an instance of DDSketch

static DDSketch logarithmicCollapsingHighestDense(double relativeAccuracy, int maxNumBins)
Constructs an instance of DDSketch that offers constant-time insertion and whose size grows until the maximum number of bins is reached, at which point bins with highest indices are collapsed, which causes the relative accuracy guarantee to be lost on highest quantiles. As opposed to collapsingHighestDense(double, int), it uses an exactly logarithmic mapping, which is more costly.
Parameters:
relativeAccuracy - the relative accuracy guaranteed by the sketch, for non-collapsed bins
maxNumBins - the maximum number of bins to be tracked
Returns:
an instance of DDSketch