Definition of rank (CDF) #103

Open
AlexanderSaydakov opened this issue Feb 27, 2018 · 22 comments

@AlexanderSaydakov

AlexanderSaydakov commented Feb 27, 2018

In the TDigest class I see that cdf(x) is described as follows:
returns the fraction of all points added which are <= x.

So if one value is added, its rank must be 1.
I am using TestNG:

  @Test
  public void oneValue() {
    TDigest t = TDigest.createDigest(100);
    t.add(0);
    Assert.assertEquals(t.cdf(0), 1.0); // java.lang.AssertionError: expected [1.0] but found [0.5]
  }

  @Test
  public void twoValues() {
    TDigest t = TDigest.createDigest(100);
    t.add(0);
    t.add(1);
    Assert.assertEquals(t.cdf(0), 0.5); // java.lang.AssertionError: expected [0.5] but found [0.0]
    Assert.assertEquals(t.cdf(1), 1.0); // this is correct
  }
@tdunning
Owner

tdunning commented Apr 2, 2021

Thank you for spotting this.

I believe that this was largely bad documentation in the javadoc. This also came up in a recent (not friendly) evaluation in the Apache DataSketches documentation.

The intention was to use a mid-point rule. Under that definition, cdf(x) is the fraction of data points strictly less than x plus half the fraction equal to x. In the latest code, there is a version of Dist.cdf that allows you to set this value anywhere from 0 to 1.
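
For reference, a minimal sketch of that mid-point rule computed directly over raw data (illustrative only; this is not the t-digest implementation itself and the class/method names are made up):

  public class MidpointCdf {
      // fraction of points strictly below x plus half the fraction equal to x
      static double cdf(double[] data, double x) {
          int below = 0, equal = 0;
          for (double v : data) {
              if (v < x) below++;
              else if (v == x) equal++;
          }
          return (below + 0.5 * equal) / data.length;
      }

      public static void main(String[] args) {
          // with a single point at 0, cdf(0) is 0.5, matching the value
          // reported in the first test at the top of this issue
          System.out.println(cdf(new double[]{0}, 0));    // 0.5
          System.out.println(cdf(new double[]{0, 1}, 1)); // 0.75
      }
  }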

I am closing this now in anticipation of the 3.3 release, but will re-open if you think there is still a problem.

@tdunning tdunning closed this as completed Apr 2, 2021
@leerho

leerho commented Apr 7, 2021

@tdunning
We would like to see this (not friendly) evaluation of the Apache DataSketches documentation. Criticisms are welcome; we always want to improve our documentation.

Thanks,
Lee.

@tdunning
Owner

tdunning commented Apr 7, 2021

Lee,

Glad to hear from you.

See https://datasketches.apache.org/docs/QuantilesStudies/KllSketchVsTDigest.html

On that page, there are a number of inaccuracies that could have been repaired with a bit of dialog or even just a bit of googling. For instance, at the beginning there is this:

t-digest code as of Sep 14, 2017 (there is no published jar)

In fact, as a quick search shows, the first published jar was from 2014 and there are a number of releases between that and 2017.

Next, there is the comment:

It is not clear what rule t-digest uses.

In fact, there is extensive documentation of the interpolation scheme used. See, for instance,
this diagram which illustrates how non-singleton centroids are handled. Or this figure which illustrates how singletons and end-points are handled (and which clearly implies a mid-point rule). In addition, the test cases use Dist.quantile() and Dist.cdf() as a reference and both of these make clear that a mid-point rule is used.

Next, when accuracy is examined, it appears that only the median is considered. This makes sense from the point of view of the KLL sketch since performance is so abysmal near the tails (by design, the KLL guarantees absolute, not relative error). But even so, the comparison appears to be subject to errors that can be mitigated by dithering the number of samples. Even if the comparison were updated to use the REQ sketch, the relative performance is still poor on average and the size is unbounded.

The accuracy comparison also exhibits clear problems. For a value of δ = 200, the t-digest will always retain at least 100 samples and thus will show zero error for cdf() and quantile(). The test case on the comparison page is clearly buggy. Here is a corrected version of those accuracy slides (just t-digest) that shows the error in estimating the quantile of a uniform distribution as measured over 50 runs. As expected, error is zero until over a hundred samples. The error is large (but much smaller than the comparison page shows) at small counts because of the transition from keeping all of the samples versus keeping a centroid of two samples. That transition makes it very hard to estimate quantiles accurately. Once the centroids in question have roughly 100 samples, the interpolation becomes very, very accurate and median estimation becomes more accurate.

As I mentioned, the comparison only looks at the median. If you shift to a more extreme quantile, however, the t-digest performance improves even more. For the 0.001 quantile (or the equivalent 0.999 quantile) there is zero error all the way up to 10,000 samples and really tiny errors beyond that.

So, yeah, that comparison page was pretty poorly done and I would have been happy to help make it better.

@tdunning tdunning reopened this Apr 7, 2021
@AlexanderSaydakov
Author

AlexanderSaydakov commented Apr 7, 2021

a number of inaccuracies that could have been repaired with a bit of dialog

This issue was opened (more than three years ago!) precisely to have that bit of dialog. I am glad that it is finally happening. We are happy to correct our approach.
Based on our testing, assuming the mid-point rule did not seem to yield the expected results, as described and shown in the first graph.

@tdunning
Owner

tdunning commented Apr 8, 2021 via email

@AlexanderSaydakov
Author

Assuming that somebody in the world would see a random JIRA on a project
that they don't know about isn't much of a way to encourage dialog.

I am not sure I understand. What random JIRA? What way could be more direct than opening an issue in your personal repo with the code in question?

We will take a look at the details you posted. Thanks.

@tdunning
Owner

tdunning commented Apr 8, 2021

OK. Sorry about the confusion there. I didn't understand that you were referring to this issue (even though that is exactly what you said ... my bad). I had thought that you were referring to some issue raised in the DataSketches project.

But I have to say that the following questions were not raised here:

  • is there a published jar? (answer would have been yes)
  • are these results reasonable? (answer would have been a test case like the one yesterday)
  • do you have speed numbers? (answer would have been benchmarks that show about 50 ns amortized insertion times)

@tdunning
Owner

tdunning commented Apr 8, 2021

I would add that, while I didn't respond well to this issue, the test com.tdunning.math.stats.TDigestTest#singleSingleRange addresses this same issue.

    @Test
    public void singleSingleRange() {
        TDigest digest = factory(100).create();
        digest.add(1);
        digest.add(2);
        digest.add(3);

        // verify the cdf is a step between singletons
        assertEquals(0.5 / 3.0, digest.cdf(1), 0);
        assertEquals(1 / 3.0, digest.cdf(1 + 1e-10), 0);
        assertEquals(1 / 3.0, digest.cdf(2 - 1e-10), 0);
        assertEquals(1.5 / 3.0, digest.cdf(2), 0);
        assertEquals(2 / 3.0, digest.cdf(2 + 1e-10), 0);
        assertEquals(2 / 3.0, digest.cdf(3 - 1e-10), 0);
        assertEquals(2.5 / 3.0, digest.cdf(3), 0);
        assertEquals(1.0, digest.cdf(3 + 1e-10), 0);
    }

Again, to give this trouble report the credit it deserves: I added this test only well after this issue was first raised, and t-digest would have been better had I (or anybody) dug into the issue sooner. I (we) could have done significantly better by responding in a more timely and attentive fashion.

@AlexanderSaydakov
Author

Next, when accuracy is examined, it appears that only the median is considered

This is not so. We consider all ranks. The testing method is as follows:

  1. for each trial, generate a set of input values
  2. feed the input values to the sketch
  3. sort the input values to compute true ranks
  4. query the sketch with each value to get its estimated rank
  5. keep the max rank error per trial
  6. at the end of all trials, take the 99th percentile of the max errors
  7. plot that 99th-percentile rank error (y axis) vs input size (x axis)

The code for these measurements is published. For t-digest we measured using all 3 rules (min, mid and max), so just ignore min and max if the mid-point rule is the right one. We are open to criticism of this method. We measured the latest code at that time.
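
A minimal sketch of this measurement loop for one input size, assuming uniform random input and the mid-point rule for true ranks (illustrative only; the class name, trial counts, and sizes are made up, and the published characterization code remains the reference):

  import java.util.Arrays;
  import java.util.Random;
  import com.tdunning.math.stats.TDigest;

  public class RankErrorCheck {
      public static void main(String[] args) {
          int trials = 100;
          int n = 10_000;                 // one input size; the study steps through many
          Random rnd = new Random(42);
          double[] maxErrors = new double[trials];

          for (int t = 0; t < trials; t++) {
              double[] values = new double[n];
              TDigest digest = TDigest.createDigest(200);
              for (int i = 0; i < n; i++) {
                  values[i] = rnd.nextDouble();
                  digest.add(values[i]);  // feed the input values to the sketch
              }
              Arrays.sort(values);        // sort to compute true ranks
              double maxErr = 0;
              for (int i = 0; i < n; i++) {
                  double trueRank = (i + 0.5) / n;  // mid-point rule, distinct values
                  maxErr = Math.max(maxErr, Math.abs(digest.cdf(values[i]) - trueRank));
              }
              maxErrors[t] = maxErr;      // keep max rank error per trial
          }

          Arrays.sort(maxErrors);
          // 99th percentile of the per-trial max errors
          System.out.println(maxErrors[(int) Math.ceil(0.99 * trials) - 1]);
      }
  }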

@tdunning
Owner

tdunning commented Apr 9, 2021 via email

@AlexanderSaydakov
Author

AlexanderSaydakov commented Apr 9, 2021

benchmarks that show about 50 ns amortized insertion times

At what input size?
I am not sure if I understand your point here. Do you mean that you disagree with our speed measurements?

We measure like this:

  1. step through input sizes AKA stream length (say, from 1 to 10M)
  2. run many trials at each input size (usually decreasing - more trials at small input size)
  3. update the sketch with all input values, measure the whole time
  4. keep total update time across all trials
  5. divide by input size and number of trials to get an average time of one update at this input size
  6. plot this average time of one update (y axis) vs input size (x axis)

The plot was published on that page you linked for t-digest 100 and 200. The characterization code was published as well, so that anyone could reproduce the measurements and review and criticize the method.
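
A minimal sketch of that kind of timing loop for one stream length (illustrative only; the class name, trial count, and stream length are made up, and the published characterization code remains the reference):

  import java.util.Random;
  import com.tdunning.math.stats.TDigest;

  public class UpdateTimeCheck {
      public static void main(String[] args) {
          int trials = 20;
          int streamLength = 1_000_000;      // one point on the x axis
          Random rnd = new Random(42);

          long totalNanos = 0;
          for (int t = 0; t < trials; t++) {
              TDigest digest = TDigest.createDigest(200);
              long start = System.nanoTime();
              for (int i = 0; i < streamLength; i++) {
                  digest.add(rnd.nextDouble());  // time the whole stream, not each call
              }
              totalNanos += System.nanoTime() - start;
          }

          // average time of one update at this stream length
          double nsPerUpdate = (double) totalNanos / ((double) trials * streamLength);
          System.out.printf("%.1f ns per update at n=%d%n", nsPerUpdate, streamLength);
      }
  }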

@AlexanderSaydakov
Author

Since the error is largest at the median for the t-digest, this is equivalent to considering only the median.

I think I see your objection. Perhaps we ought to reconsider our method with respect to accuracy measurements.

@tdunning
Owner

tdunning commented Apr 12, 2021

Perhaps we ought to reconsider our method with respect to accuracy measurements.

I think that your method (max absolute error) is absolutely fine for evaluating the KLL by itself. But if you compare against a different system (such as REQ or t-digest), then that different system probably has different design goals and thus a different evaluation is probably necessary to highlight what the differences actually are and to determine whether they are important in any given application.

It is also probably reasonable to commit to updating comparisons at least as often as your own performance changes.

Specifically, KLL optimizes for worst-case absolute error. That's fine. The t-digest optimizes for bounded, relatively small size and no dynamic allocation, while also optimizing tail accuracy. REQ gives up bounded size and freedom from dynamic allocation but focuses on guaranteed accuracy at the tails.

It is fine to talk about absolute error. But if you are comparing these three systems, it is much better to also examine how errors look near the tails. It would be good to look at memory consumption and allocation. It would also be good to examine progressively more perverse data sources to highlight how the heuristic interpolation of t-digest begins to fail for data with bizarre characteristics, but REQ is able to maintain its guaranteed accuracy.

The user can then decide whether memory size or tail accuracy or crazy data distributions are important to them.

The recent paper by Cormode and others, for instance, used a data distribution which might come up if you had a system with latencies that range from several times shorter than the time light takes to transit a proton all the way out to many orders of magnitude longer than the life of the universe so far. If your data involves that level of skew, then the t-digest would be a bad choice. On the other hand, if you want small memory size, only care about 10 orders of magnitude of skew, and value a long track record of use, you might make the opposite choice.

@tdunning
Owner

benchmarks that show about 50 ns amortized insertion times

At what input size?
I am not sure if I understand your point here. Do you mean that you disagree with our speed measurements?

We measure like this:

...
2. run many trials at each input size (usually decreasing - more trials at small input size)
...
4. keep total update time across all trials
5. divide by input size and number of trials to get an average time of one update at this input size

So this is a classic sort of benchmarking approach which is unable to disentangle the benchmark from the thing being benchmarked.

See here for more thoughts on this:

https://medium.com/97-things/benchmarking-is-hard-jmh-helps-476a68a51167
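
For comparison, a minimal JMH-style sketch of the same measurement (a sketch only: it assumes the standard JMH annotations, and the class name and compression value are illustrative; the t-digest calls are the createDigest/add API shown earlier in this thread):

  import java.util.Random;
  import com.tdunning.math.stats.TDigest;
  import org.openjdk.jmh.annotations.Benchmark;
  import org.openjdk.jmh.annotations.Scope;
  import org.openjdk.jmh.annotations.Setup;
  import org.openjdk.jmh.annotations.State;

  @State(Scope.Thread)
  public class TDigestAddBench {
      TDigest digest;
      Random rnd;

      @Setup
      public void setup() {
          digest = TDigest.createDigest(200);
          rnd = new Random(42);
      }

      @Benchmark
      public void add() {
          // JMH handles warmup, forking and measurement overhead,
          // which keeps the harness from being entangled with the code under test
          digest.add(rnd.nextDouble());
      }
  }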

@AlexanderSaydakov
Author

I don't see any basis for the statement that our benchmarking is flawed. We construct our measurements carefully. We can see fine details of the behavior of particular algorithms: for instance, we see spikes at some expected transition points, and we see the effects of some improvements. Our measurements in Java are quite consistent with similar measurements in C++.

@leerho

leerho commented Apr 12, 2021 via email

@AlexanderSaydakov
Author

AlexanderSaydakov commented Apr 15, 2021

[plot: tdigest 200 update time]

Here is a fresh measurement of the update time of t-digest with compression=200.
Uniform random input (Random.nextFloat).

@tdunning
Owner

That is much closer to what I saw in my measurements.

@AlexanderSaydakov
Author

Much closer than what? This is quite consistent with the report on our web site from 3 years ago. It was a bit higher then - around 120 ns - but my laptop is faster now, Java has made some progress, and perhaps there are other factors.

@AlexanderSaydakov
Author

And your code must have changed too. I don't see the problem with rank that we saw 3 years ago. Also, if you look closer at these update-time plots, the transition to a more expensive regime used to happen around 1000 updates, but the latest code makes this transition at 2000 updates.

@AlexanderSaydakov
Author

@tdunning is this change of the transition point from 1000 to 2000 expected?

@tdunning
Owner

tdunning commented Apr 19, 2021 via email

averbraeck added a commit to averbraeck/djutils that referenced this issue Aug 24, 2024
The cdf calculation for TDigest has changed (and is clarified) in
version 3.3 of t-digest. See
tdunning/t-digest#103