public class CubicallyInterpolatedMapping
extends java.lang.Object
IndexMapping that approximates the memory-optimal one (namely LogarithmicMapping) by extracting the floor value of the base-2 logarithm from the binary representation of floating-point values and cubically interpolating the logarithm in between.
Calculating the bucket index with this mapping is much faster than computing the logarithm of
the value (by a factor of 6 according to some benchmarks, although it depends on various
factors), and this mapping incurs a memory usage overhead of only 1% compared to the
memory-optimal LogarithmicMapping, under the relative accuracy condition. In comparison,
the overheads for LinearlyInterpolatedMapping and QuadraticallyInterpolatedMapping are respectively 44% and 8%.
Here are a few words about how to calculate the optimal polynomial coefficients.
The idea is that the exponent of the floating-point representation gives the floor value of the logarithm to base \(2\) of the input value for free. However, we want the logarithm to base \(\gamma = \frac{1+\alpha}{1-\alpha}\), where \(\alpha\) is the relative accuracy of the sketch. We can deduce it from the logarithm to base \(2\), but that requires more than the floor value: we need to actually approximate the logarithm between successive powers of \(2\). A relatively cheap way to do that is to use the significand and operations that are inexpensive for the CPU, such as additions and multiplications. Therefore, writing \(x = 2^e(1+s)\), where \(e\) is an integer and \(0 \leq s \lt 1\), we compute the index as (the floor value of) \(I_{\alpha} = m\frac{\log 2}{\log\gamma}(e+P(s))\), where \(P\) is a polynomial (of degree 3 here) and \(m\) is a multiplier (\(\geq 1\)) that is large enough to ensure the \(\alpha\)-accuracy of the sketch.
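As an illustration, here is a minimal sketch of that index computation, assuming the cubic coefficients \(A = 6/35\), \(B = -3/5\), \(C = 10/7\) derived further below and the multiplier \(m = \frac{7}{10\log 2}\) mentioned at the end of this page. The class and method names are hypothetical and this is not the library's actual implementation; it handles normal (non-subnormal) positive inputs only.

```java
// Illustrative sketch of the index computation: decompose value = 2^e * (1 + s)
// via the IEEE 754 bit layout, then return floor(m * log(2)/log(gamma) * (e + P(s))).
public final class CubicIndexSketch {
    // Coefficients of P(s) = A s^3 + B s^2 + C s, as derived on this page.
    private static final double A = 6.0 / 35.0;
    private static final double B = -3.0 / 5.0;
    private static final double C = 10.0 / 7.0;

    private final double multiplier; // m * log(2) / log(gamma), folded together

    public CubicIndexSketch(double relativeAccuracy) {
        double gamma = (1 + relativeAccuracy) / (1 - relativeAccuracy);
        double m = 7.0 / (10.0 * Math.log(2)); // smallest multiplier preserving accuracy
        this.multiplier = m * Math.log(2) / Math.log(gamma);
    }

    public int index(double value) {
        long bits = Double.doubleToRawLongBits(value);
        int e = (int) ((bits >>> 52) & 0x7FF) - 1023;            // unbiased exponent
        double s = Double.longBitsToDouble(
                (bits & 0x000FFFFFFFFFFFFFL) | 0x3FF0000000000000L) - 1.0; // s in [0, 1)
        double p = ((A * s + B) * s + C) * s;                    // Horner form of P(s)
        return (int) Math.floor(multiplier * (e + p));
    }
}
```

Note that the accuracy condition \(I_{\alpha}(\gamma x) - I_{\alpha}(x) \geq 1\) discussed below holds for this sketch by construction of \(m\).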
We want that multiplier \(m\) to be as low as possible, because the higher \(m\) is, the smaller the buckets and the more buckets we need to cover the same range of values (hence the larger the sketch memory size). But we still need the buckets to be small enough that values that differ by a multiplicative factor of \(\gamma\) do not end up in the same bucket (otherwise, the sketch cannot be \(\alpha\)-accurate). That is, we want \(I_{\alpha}(\gamma x) - I_{\alpha}(x) \geq 1\) for any \(\alpha\) and its corresponding \(\gamma\) (\(\leq -1\) would work as well). Writing \(f(x) = e + P(s)\), we can show that this condition amounts to \(f\) being increasing and \(m \log 2 \, (f \circ \exp)' \geq 1\) wherever \(f\) is differentiable (which is not necessarily the case at powers of \(2\)). Therefore, to achieve the best sketch memory efficiency, we need to maximize the infimum of \((f \circ \exp)'\).
Given that \(f(2x) = f(x) + 1\), we know that \((f \circ \exp)'(y + \log 2) = (f \circ \exp)'(y)\), and it is enough to study \(f \circ \exp\) between \(0\) and \(\log 2\), that is, with \(\exp y = x = 2^e(1+s)\), for \(e\) equal to \(0\) and \(s\) between \(0\) and \(1\). In other words, we want to find \(P\) that maximizes \(\inf_{y \in [0,\log 2[}(P \circ \exp)'(y)\), which is equal to \(\inf_{s \in [0,1[}P'(s)(1 + s)\).
\(f\) is increasing and does not have discontinuity points (those would make the mapping less efficient), therefore we can require \(P(0) = 0\) and \(P(1) = 1\). Hence, we can write \(P(s) = s+s(1-s)(u+vs)\), and we end up with only two coefficients \((u,v)\) to optimize. To find the coefficients that maximize the infimum, we can study the variations of \(Q(s) = P'(s)(1+s)\), which is a polynomial of degree \(3\) depending on the values of \(u\) and \(v\). We can show that the infimum is maximized if \(u\) and \(v\) are such that the infimum is equal to both \(Q(0)\) and \(Q(r)\), where \(r\) is one of the critical points (a local minimum) of \(Q\), distinct from \(0\). That gives a quadratic equation in the two variables \(u\) and \(v\), and given that \(Q(0) = 1+u\), we take the solution that maximizes \(u\). Finally, we get \(u = 3/7\) and \(v = -6/35\), or alternatively, \(A\), \(B\) and \(C\) as in the code if we write \(P(s) = As^3+Bs^2+Cs\).
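The derivation above can be checked numerically. The following illustrative code (not part of the library; class and method names are made up) expands \(P(s) = s + s(1-s)(u+vs)\) with \(u = 3/7\) and \(v = -6/35\) into \(As^3+Bs^2+Cs\), giving \(A = 6/35\), \(B = -3/5\), \(C = 10/7\), and samples \(Q(s) = P'(s)(1+s)\) to confirm that its infimum on \([0,1[\) equals \(Q(0) = 1+u = 10/7\):

```java
// Numeric check of the coefficient derivation: expand the factored form of P,
// then confirm P(0) = 0, P(1) = 1, and inf Q = Q(0) = 10/7 on [0, 1).
public final class CoefficientCheck {
    static final double U = 3.0 / 7.0, V = -6.0 / 35.0;
    // Expanding s + s(1-s)(U + V s) = A s^3 + B s^2 + C s gives:
    static final double A = -V;     //  6/35, coefficient of s^3
    static final double B = V - U;  // -3/5,  coefficient of s^2
    static final double C = 1 + U;  // 10/7,  coefficient of s

    static double p(double s) { return s + s * (1 - s) * (U + V * s); }

    static double q(double s) { // Q(s) = P'(s) (1 + s)
        double pPrime = 3 * A * s * s + 2 * B * s + C;
        return pPrime * (1 + s);
    }

    static double infQ() { // sample the infimum of Q on [0, 1)
        double inf = Double.POSITIVE_INFINITY;
        for (double s = 0; s < 1; s += 1e-5) {
            inf = Math.min(inf, q(s));
        }
        return inf;
    }
}
```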
With those values, we can choose \(m\) as small as \(\frac{7}{10\log 2}\), which is about
\(1.01\), hence the memory usage overhead of \(1\%\). For the reverse mapping (getting the value
back from the index), implemented as the value(int) method, we need to solve a cubic
equation, which we can do using Cardano's formula.
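To make that last step concrete, here is a sketch of solving the cubic \(As^3 + Bs^2 + Cs = t\) for \(s \in [0,1[\) with Cardano's formula, using the coefficients derived above. This is a hypothetical helper, not the library's actual value(int) code:

```java
// Invert P on [0, 1): find the root s of A s^3 + B s^2 + C s - t = 0
// using Cardano's formula, as the reverse mapping must do.
public final class CubicInverse {
    static final double A = 6.0 / 35.0, B = -3.0 / 5.0, C = 10.0 / 7.0;

    static double solve(double t) {
        // Standard Cardano quantities for A s^3 + B s^2 + C s + D = 0 with D = -t.
        double d0 = B * B - 3 * A * C;
        double d1 = 2 * B * B * B - 9 * A * B * C - 27 * A * A * t;
        // With these coefficients d0 < 0, so d1^2 - 4 d0^3 > 0 and the root below is real.
        double w = Math.cbrt((d1 + Math.sqrt(d1 * d1 - 4 * d0 * d0 * d0)) / 2);
        return -(B + w + d0 / w) / (3 * A);
    }
}
```

Since \(d_1^2 - 4d_0^3 > 0\) here, the cubic has a single real root near the interval of interest, and the branch above returns the \(s \in [0,1[\) that the forward mapping maps to \(t\).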
| Constructor and Description |
|---|
| CubicallyInterpolatedMapping(double relativeAccuracy) |
| Modifier and Type | Method and Description |
|---|---|
| void | encode(Output output) |
| boolean | equals(java.lang.Object o) |
| int | hashCode() |
| int | index(double value) |
| double | lowerBound(int index) |
| double | maxIndexableValue() |
| double | minIndexableValue() |
| double | relativeAccuracy() |
| void | serialize(Serializer serializer) |
| int | serializedSize() |
| double | upperBound(int index) |
| double | value(int index) |
Methods inherited from class java.lang.Object:
clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface IndexMapping:
decode

Constructor Detail

public CubicallyInterpolatedMapping(double relativeAccuracy)

Method Detail

public final int index(double value)
Specified by: index in interface IndexMapping

public final double value(int index)
Specified by: value in interface IndexMapping

public double lowerBound(int index)
Specified by: lowerBound in interface IndexMapping

public double upperBound(int index)
Specified by: upperBound in interface IndexMapping

public final double relativeAccuracy()
Specified by: relativeAccuracy in interface IndexMapping

public double minIndexableValue()
Specified by: minIndexableValue in interface IndexMapping

public double maxIndexableValue()
Specified by: maxIndexableValue in interface IndexMapping

public boolean equals(java.lang.Object o)
Overrides: equals in class java.lang.Object

public int hashCode()
Overrides: hashCode in class java.lang.Object

public void encode(Output output) throws java.io.IOException
Specified by: encode in interface IndexMapping
Throws: java.io.IOException

public int serializedSize()
Specified by: serializedSize in interface IndexMapping

public void serialize(Serializer serializer)
Specified by: serialize in interface IndexMapping