Definition of rank (CDF) #103
Comments
Thank you for spotting this. I believe that this was largely bad documentation in the javadoc. This also came up in a recent (not friendly) evaluation in the Apache DataSketches documentation. The intention was to use a mid-point rule definition: the cdf at x is the fraction of data points < x plus half the fraction == x. In the latest code, there is a version of Dist.cdf that lets you set the weight given to points equal to x anywhere from 0 to 1. I am closing this now in anticipation of the 3.3 release, but will re-open if you think there is still a problem. |
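To make the mid-point rule concrete, here is a minimal sketch computed directly on raw data (an illustration of the definition only, not t-digest's interpolated implementation; midpointCdf is a hypothetical helper name):

// Mid-point rule: the rank of x is the fraction of points strictly below x
// plus half the fraction of points exactly equal to x.
static double midpointCdf(double x, double[] data) {
    int below = 0;
    int equal = 0;
    for (double v : data) {
        if (v < x) {
            below++;
        } else if (v == x) {
            equal++;
        }
    }
    return (below + 0.5 * equal) / data.length;
}

For the data {1, 2, 3}, this gives midpointCdf(1, data) = 0.5 / 3, matching the singleton behavior tested later in this thread.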
@tdunning Thanks, |
Lee, Glad to hear from you. See https://datasketches.apache.org/docs/QuantilesStudies/KllSketchVsTDigest.html
On that page, there are a number of inaccuracies that could have been repaired with a bit of dialog or even just a bit of googling. For instance, at the beginning there is this:
In fact, as a quick search shows, the first published jar was from 2014 and there were a number of releases between that and 2017. Next, there is the comment:
In fact, there is extensive documentation of the interpolation scheme used.

Next, when accuracy is examined, it appears that only the median is considered. This makes sense from the point of view of the KLL sketch, since its performance is so abysmal near the tails (by design, KLL guarantees absolute, not relative, error). But even so, the comparison appears to be subject to errors that can be mitigated by dithering the number of samples. Even if the comparison were updated to use the REQ sketch, the relative performance would still be poor on average and the size would be unbounded.

The accuracy comparison also exhibits clear problems. For a value of δ = 200, the t-digest will always retain at least 100 samples and thus will show zero error for small sample counts.

As I mentioned, the comparison only looks at the median. If you shift to a more extreme quantile, however, the t-digest performance improves even more. For the 0.001 quantile (or the equivalent 0.999 quantile) there is zero error all the way up to 10,000 samples and really tiny errors beyond that.

So, yeah, that comparison page was pretty poorly done and I would have been happy to help make it better. |
This issue was opened (more than three years ago!) precisely to have that bit of dialog. I am glad that finally it is happening. We are happy to correct our approach. |
On Wed, Apr 7, 2021 at 3:55 PM Alexander Saydakov ***@***.***> wrote:
a number of inaccuracies that could have been repaired with a bit of dialog
This issue was opened (more than two years ago!) precisely to have that
bit of dialog. I am glad that finally it is happening. We are happy to
correct our approach.
Based on our testing, assuming the mid-point rule did not seem to yield the
expected results, as described and shown in the first graph.
Assuming that somebody in the world would see a random JIRA on a project
that they don't know about isn't much of a way to encourage dialog. My
email is well known and available everywhere, and back in that time frame I
was very well known in Apache. You could have reached out very easily.
As is very clear, your testing was buggy, as shown by the results that I
posted earlier today.
See com.tdunning.tdigest.quality.CompareKllTest#testMedianErrorVersusScale in
the latest head version of t-digest for the code that produced these graphs.
|
I am not sure I understand. What random JIRA? What way could be more direct than opening an issue in your personal repo with the code in question? We will take a look at the details you posted. Thanks. |
OK. Sorry about the confusion there. I didn't understand that you were referring to this issue (even though that is exactly what you said ... my bad). I had thought that you were referring to some issue raised in the DataSketches project. But I have to say that the following questions were not raised here:
|
I would add that, while I didn't respond well to this issue, the test singleSingleRange now covers exactly the behavior in question:

@Test
public void singleSingleRange() {
    TDigest digest = factory(100).create();
    digest.add(1);
    digest.add(2);
    digest.add(3);

    // verify the cdf is a step between singletons: at each retained point,
    // half of that point's weight is counted (the mid-point rule)
    assertEquals(0.5 / 3.0, digest.cdf(1), 0);
    assertEquals(1 / 3.0, digest.cdf(1 + 1e-10), 0);
    assertEquals(1 / 3.0, digest.cdf(2 - 1e-10), 0);
    assertEquals(1.5 / 3.0, digest.cdf(2), 0);
    assertEquals(2 / 3.0, digest.cdf(2 + 1e-10), 0);
    assertEquals(2 / 3.0, digest.cdf(3 - 1e-10), 0);
    assertEquals(2.5 / 3.0, digest.cdf(3), 0);
    assertEquals(1.0, digest.cdf(3 + 1e-10), 0);
}

Again, to give this trouble report the credit it deserves, I only added this test substantially after this issue was first raised, and t-digest would have been better had I (or anybody) dug into the issue sooner. I (we) could have done significantly better by responding in a more timely and attentive fashion. |
This is not so. We consider all ranks. The testing method is as follows:
The code for these measurements is published. For t-digest we measured using all 3 rules (min, mid and max), so just ignore the min and max results if the mid-point rule is the intended definition. We are open to criticism of this method. We measured the latest code at that time. |
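For reference, the three rank conventions can be sketched directly on raw data as follows (an illustration of the min/mid/max rules themselves, not the published measurement code; ranks is a hypothetical helper):

// Min, mid and max rank rules for the same value x.
static double[] ranks(double x, double[] data) {
    int below = 0;
    int equal = 0;
    for (double v : data) {
        if (v < x) {
            below++;
        } else if (v == x) {
            equal++;
        }
    }
    double n = data.length;
    double min = below / n;                 // count only points strictly below x
    double mid = (below + 0.5 * equal) / n; // mid-point rule: half of the equal points
    double max = (below + equal) / n;       // count all points <= x
    return new double[] {min, mid, max};
}

A measured rank that consistently tracks the mid value (rather than min or max) is what the mid-point rule predicts.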
This is not so. We consider all ranks.
Since the error is largest at the median for the t-digest, reporting the maximum error over all ranks is equivalent to considering only the median.

Where an algorithm specifically targets lower error at certain quantiles, stratifying by quantile is a more direct comparison. For instance, in my previous response, I gave separate graphs for q=0.5 and for q=0.001.

This is very important in practice because it is very common for a distribution sketch to be used to estimate tails. This prioritization of tail accuracy makes a difference of 100x in the number of samples for which perfect accuracy is retained (for the t-digest), and even where perfect accuracy is lost, the errors are reduced by many orders of magnitude.
|
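As a concrete illustration of stratifying by quantile, here is a minimal sketch (assuming the com.tdunning.math.stats API with createMergingDigest, add and quantile; the uniform data, seed and compression are arbitrary choices, and the exact errors will vary):

import com.tdunning.math.stats.TDigest;
import java.util.Arrays;
import java.util.Random;

public class QuantileErrorByRank {
    public static void main(String[] args) {
        Random rand = new Random(42);
        int n = 1_000_000;
        double[] data = new double[n];
        TDigest digest = TDigest.createMergingDigest(200);
        for (int i = 0; i < n; i++) {
            data[i] = rand.nextDouble();
            digest.add(data[i]);
        }
        Arrays.sort(data);
        // compare estimated and empirical quantiles at the median and in the tail
        for (double q : new double[] {0.5, 0.001}) {
            double estimated = digest.quantile(q);
            double empirical = data[(int) Math.floor(q * (n - 1))];
            System.out.printf("q=%.3f  estimated=%.6f  empirical=%.6f  error=%.2e%n",
                    q, estimated, empirical, Math.abs(estimated - empirical));
        }
    }
}

Reporting error separately at q=0.5 and q=0.001, rather than a single maximum over all ranks, is the stratification being argued for here.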
At what input size? We measure like this:
|
I think I see your objection. Perhaps we ought to reconsider our method with respect to accuracy measurements. |
I think that your method (max absolute error) is absolutely fine for evaluating the KLL by itself. But if you compare against a different system (such as REQ or t-digest), then that different system probably has different design goals, and thus a different evaluation is probably necessary to highlight what the differences actually are and to determine whether they are important in any given application. It is also probably reasonable to commit to updating comparisons at least as often as your own performance changes.

Specifically, KLL optimizes for worst-case absolute error. That's fine. The t-digest optimizes for a bounded and relatively small size and no dynamic allocation, while also prioritizing tail accuracy. REQ gives up bounded size and freedom from dynamic allocation, but focuses on guaranteed relative error, which benefits tail accuracy.

It is fine to talk about absolute error. But if you are comparing these three systems, it is much better to also examine how errors look near the tails. It would be good to look at memory consumption and allocation. It would also be good to examine progressively more perverse data sources to highlight how the heuristic interpolation of t-digest begins to fail for data with bizarre characteristics, while REQ is able to maintain its guaranteed accuracy. The user can then decide whether memory size, tail accuracy, or crazy data distributions are important to them.

The recent paper by Cormode and others, for instance, used a data distribution which might come up if you had a system with latencies that range from several times shorter than the time light takes to transit a proton all the way out to many orders of magnitude longer than the life of the universe so far. If your data involves that level of skew, then the t-digest would be a bad choice. On the other hand, if you want small memory size, only care about 10 orders of magnitude of skew, and value a long track record of use, you might make the opposite choice. |
So this is a classic sort of benchmarking approach which is unable to disentangle the benchmark from the thing being benchmarked. See here for more thoughts on this: https://medium.com/97-things/benchmarking-is-hard-jmh-helps-476a68a51167 |
I don't see any basis for this statement that our benchmarking is flawed. We carefully construct our measurements. We can see fine details of behavior of particular algorithms. For instance, we see spikes at some expected transition points. We see effects of some improvements. Our measurements in Java are quite consistent with similar measurements in C++. |
JMH is useful for performing benchmarks on relatively small chunks of code, but it is slow and cumbersome on larger, more complex code. The scores JMH reports are not useful for estimating actual speed performance at a system level; they are really only useful for comparison.

Also, many of our sketches are stochastic in behavior, and to reduce the noise produced by the inherent randomness we often have to run many millions of trials, which can take a long time. Some of our studies can run for hours (and sometimes days!). Because JMH is so slow, doing this kind of analysis with it is totally impractical.

Most of our tests are what we call characterization tests, which are much more than just benchmarking. We are looking to understand many different aspects of behavior, not just a single speed score. From our data and graphs, we can identify important state transitions of the sketch that JMH would never reveal, and we can examine speed, accuracy and space consumed as a function of stream size and other dimensions.

In other words, we are not doing "classic benchmarking". Because we are running millions of trials over many different stream lengths, warm-up and HotSpot compilation are not an issue: we are taking measurements long after those processes have completed. We only perform measurements on a single thread, so multithreading is not an issue. When necessary, we perform our measurements on a quiet machine (no internet interrupts, etc.) and also do profiling to make sure that GCs are not distorting the measurements. As a result, many of the characterization tests we have performed over the past few years have been very consistent and repeatable, with very small variance.

I agree with the article that benchmarking is hard -- and JMH helps in certain types of comparisons. But JMH is by no means the only way to do speed analysis. And JMH is not at all useful for accuracy, error distributions, space, state transitions and other types of characterizations.
Cheers,
Lee.
|
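For readers unfamiliar with the tool being discussed, a JMH micro-benchmark of a single t-digest update might look roughly like this (a minimal sketch, assuming the standard JMH annotations and TDigest.createMergingDigest; it yields one average-time score and, as argued above, says nothing about state transitions or accuracy):

import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import com.tdunning.math.stats.TDigest;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class TDigestAddBenchmark {
    private TDigest digest;
    private Random rand;

    @Setup
    public void setup() {
        digest = TDigest.createMergingDigest(200);
        rand = new Random(42);
    }

    @Benchmark
    public double add() {
        double x = rand.nextDouble();
        digest.add(x);
        return x; // return a value so the JIT cannot eliminate the work
    }
}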
That is much closer to what I saw in my measurements. |
Much closer than what? This is quite consistent with the report on our web site from 3 years ago. It was a bit higher then - around 120 ns - but my laptop is faster now, Java has made some progress, and perhaps there are other factors. |
And your code must have changed too. I don't see the problem with rank that we saw 3 years ago. Also, if you look closer at these update time plots, the transition to a more expensive regime used to happen at around 1000 updates, but the latest code makes this transition at 2000 updates. |
@tdunning is this change of the transition point from 1000 to 2000 expected? |
It isn't surprising that this changed. I did a fair bit of work adjusting the buffering strategy over the last few years. In particular, the strategy was changed to use a single unified buffer in the merge step, which roughly doubled the scale of the transition and seems to correlate with what you saw.
|
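A characterization-style measurement of the sort being discussed (average update cost as a function of stream length, which is where the 1000-versus-2000-update transition shows up) might be sketched like this; this is only an illustrative loop, not the published characterization code, and timings from it are indicative at best:

import com.tdunning.math.stats.TDigest;
import java.util.Random;

public class UpdateTimeVsStreamLength {
    public static void main(String[] args) {
        Random rand = new Random(1);

        // crude warm-up so that JIT compilation does not dominate the measurements
        TDigest warm = TDigest.createMergingDigest(200);
        for (int i = 0; i < 1_000_000; i++) {
            warm.add(rand.nextDouble());
        }

        // measure average update time at several stream lengths to look for regime transitions
        for (int n : new int[] {500, 1000, 2000, 4000, 8000, 16000}) {
            TDigest digest = TDigest.createMergingDigest(200);
            long start = System.nanoTime();
            for (int i = 0; i < n; i++) {
                digest.add(rand.nextDouble());
            }
            long elapsed = System.nanoTime() - start;
            System.out.printf("n=%6d  average update = %6.1f ns%n", n, (double) elapsed / n);
        }
    }
}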
The cdf calculation for TDigest has changed (and is clarified) in version 3.3 of t-digest. See tdunning/t-digest#103
In the TDigest class I see that cdf(x) is described as follows:
returns the fraction of all points added which are <= x.
So if one value is added, its rank must be 1.
I am using TestNG:
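The original test is not preserved in this thread; a minimal TestNG sketch of the kind of check being described (a hypothetical reconstruction, assuming TDigest.createMergingDigest) would be:

import static org.testng.Assert.assertEquals;

import com.tdunning.math.stats.TDigest;
import org.testng.annotations.Test;

public class SingleValueRankTest {
    @Test
    public void rankOfSingleValue() {
        TDigest digest = TDigest.createMergingDigest(100);
        digest.add(10);
        // the javadoc quoted above suggests the fraction of points <= 10 is 1.0,
        // but under the mid-point rule adopted in 3.3 the value at the point itself is 0.5
        assertEquals(digest.cdf(10), 0.5, 1e-9);
    }
}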