Determining quality #205
Can you say more about what your test does?
There were some changes in 3.3 that tended to make the number of retained
centroids smaller (i.e. closer to the specified number). This has the
effect of degrading accuracy for a constant compression factor, although
accuracy relative to the size of the digest typically improves.
This affected another package where the compression factor for the tests
was a relatively small value of 100. In their case, upping the default value
to 200 had the desired effect (same memory use, improved accuracy).
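A minimal sketch of that kind of before/after comparison, assuming the library's MergingDigest class and its add/quantile/centroidCount methods; the data, seed, and probe quantiles here are purely illustrative:

import com.tdunning.math.stats.MergingDigest;
import java.util.Arrays;
import java.util.Random;

public class CompressionComparison {
    public static void main(String[] args) {
        Random rng = new Random(42);
        double[] data = new double[10_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = rng.nextDouble();
        }
        double[] sorted = data.clone();
        Arrays.sort(sorted);

        // Same data, two compression factors: expect roughly similar memory
        // (centroid count) near the compression value, but better accuracy at 200.
        for (double compression : new double[] {100, 200}) {
            MergingDigest digest = new MergingDigest(compression);
            for (double x : data) {
                digest.add(x);
            }
            for (double q : new double[] {0.75, 0.99, 0.999}) {
                double exact = sorted[(int) (q * (sorted.length - 1))];
                double est = digest.quantile(q);
                System.out.printf("compression=%.0f q=%.3f exact=%.5f est=%.5f centroids=%d%n",
                        compression, q, exact, est, digest.centroidCount());
            }
        }
    }
}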
The other possible change regards previously incorrect handling of
some cases with small amounts of data. This can interact with poor test
design. The basic problem stems from the fact that while the cumulative
distribution is a function, its inverse (the quantile function) is not.
T-digest is supposed to treat unit centroids as impulses in the
empirical PDF located at the sample value. This causes the cumulative distribution
to take a half step at the discontinuity at each distinct sample value.
For the inverse cumulative distribution, there is no well-defined value for
values of q that fall within such a discontinuity. This can make it hard to
write good tests for the desired behavior.
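To make the half-step behavior concrete, here is a small sketch assuming the MergingDigest implementation; the expected values in the comments follow the impulse interpretation described above, and exact behavior at these boundary cases is precisely the kind of thing that can change between releases:

import com.tdunning.math.stats.MergingDigest;

public class HalfStepDemo {
    public static void main(String[] args) {
        MergingDigest digest = new MergingDigest(100);
        digest.add(1.0);
        digest.add(2.0);

        // With two distinct samples, the true empirical CDF jumps from
        // 0 to 1/2 at x=1 and from 1/2 to 1 at x=2. Treating each unit
        // centroid as an impulse, the digest should report the half step
        // at each sample value:
        System.out.println(digest.cdf(1.0)); // expected 0.25 (middle of the first step)
        System.out.println(digest.cdf(2.0)); // expected 0.75 (middle of the second step)

        // The inverse is ill-defined inside a step: both probes below fall
        // within the first sample's mass, so under the impulse interpretation
        // both should map to the same sample value. A test that pins down
        // distinct answers here is testing an arbitrary convention.
        System.out.println(digest.quantile(0.1)); // expected 1.0
        System.out.println(digest.quantile(0.2)); // expected 1.0
    }
}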
To figure out which of these phenomena you are seeing, or whether this is a
novel issue, I need to know more about what you are doing in your test.
The test that passes in 3.2 and fails in 3.3 is essentially a (seeded) uniform random distribution of 10000 doubles in the range [-10000.0, 10000.0]. We check abs((p_x-t_x)/p_x) < 0.005 for x in [75, 99, 99.9] (p_x is the exact percentile, t_x is the t-digest estimate with compression 100). (The test also does the same checks on 4 non-overlapping subsets whose union equals the original distribution.)

Upon seeing these fail in 3.3, I added a single root-mean-square error calculation that accumulates the errors across this distribution, as well as re-seeding the test w/ 1000 distinct seeds. A bit hand-wavy, but in the end the calculation produces:

3.2 RMSE = 0.00090
3.3 RMSE = 0.00127

I suspect I might find that "upping the default value to 200 had the desired effect (same memory use, improved accuracy)", so I will look into that.

I think another factor is I don't have a good idea of what "compression" is, or how to think about how developers may need to tweak compression from release to release. The javadocs say "100 is a common value for normal uses". I would have expected a given compression X to be "equivalent" from release to release in one of these two dimensions:

1. compression X achieves the same memory usage (and hopefully improved "quality")
2. compression X achieves the same "quality" (and hopefully improved memory usage)

But it seems like compression is a more nuanced value? (i.e., based on your statements and my findings, X=100 uses less memory, but also gives less "quality" on 3.3, so it doesn't follow either of the two dimensions above.)

I wonder if it would be useful to have user-level abstractions that could be exposed as the two dimensions above? I.e.,

public static double qualityToCompression(double quality) {
    return quality * QUALITY_TO_COMPRESS_FACTOR; // maybe it's not a constant factor, but a more complex function
}

public static double memoryToCompression(double memory) {
    return memory * MEMORY_TO_COMPRESS_FACTOR; // maybe it's not a constant factor, but a more complex function
}

Then user code would be able to lock in on the mode it is trying to optimize for:

TDigest tDigestForTesting = createDigest(qualityToCompression(100.0));
...
TDigest tDigestForProduction = createDigest(memoryToCompression(...));
...

This is a bit long-winded... but I very much appreciate your response, and will get back after I analyze the memory usage a bit more.
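For concreteness, a sketch of the kind of relative-error/RMSE check described above, assuming the library's MergingDigest API; the seed, probe percentiles, and output format are illustrative rather than the exact test in question:

import com.tdunning.math.stats.MergingDigest;
import java.util.Arrays;
import java.util.Random;

public class QuantileErrorCheck {
    public static void main(String[] args) {
        Random rng = new Random(12345L); // one fixed seed; re-run with many seeds in practice
        int n = 10_000;
        double[] data = new double[n];
        for (int i = 0; i < n; i++) {
            data[i] = -10_000.0 + 20_000.0 * rng.nextDouble();
        }
        MergingDigest digest = new MergingDigest(100);
        for (double x : data) {
            digest.add(x);
        }
        double[] sorted = data.clone();
        Arrays.sort(sorted);

        double sumSq = 0;
        double[] percentiles = {75, 99, 99.9};
        for (double p : percentiles) {
            double q = p / 100.0;
            // Exact empirical percentile from the sorted data vs. the digest's estimate.
            double exact = sorted[(int) Math.min(n - 1, Math.round(q * (n - 1)))];
            double est = digest.quantile(q);
            double relErr = Math.abs((exact - est) / exact);
            sumSq += relErr * relErr;
            System.out.printf("p=%.1f exact=%.3f est=%.3f relErr=%.5f%n", p, exact, est, relErr);
        }
        System.out.printf("RMSE over probe points = %.5f%n", Math.sqrt(sumSq / percentiles.length));
    }
}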
Sorry to be slow.
The compression factor is an upper bound on the number of centroids.
Half the compression factor is a lower bound on the number of centroids for
the K_1 scale function (and practically true for K_2 and K_3).
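A quick empirical check of those bounds, as a sketch assuming the MergingDigest implementation and whatever scale function it defaults to:

import com.tdunning.math.stats.MergingDigest;
import java.util.Random;

public class CentroidBounds {
    public static void main(String[] args) {
        double compression = 100;
        MergingDigest digest = new MergingDigest(compression);
        Random rng = new Random(1);
        for (int i = 0; i < 1_000_000; i++) {
            digest.add(rng.nextGaussian());
        }
        // Once the digest has seen many more samples than the compression
        // factor, expect roughly compression/2 <= centroidCount <= compression.
        System.out.printf("centroids=%d (bounds: %.0f..%.0f)%n",
                digest.centroidCount(), compression / 2, compression);
    }
}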
After updating from 3.2 to 3.3, some of our tests with hardcoded "quality" checks were failing. I captured my notes and speculated a bit in deephaven/deephaven-core#3204. I'm sure there is a lot of nuance to determining "accuracy and quality", but I'm wondering if there is any general guidance on how one should think about / measure quality wrt t-digests.

Maybe related to the note on the README:

describe accuracy using the quality suite

Potentially related to #85.

It may be that our test is not indicative of "real life" usage. I'm sure it's a balancing act; trying to optimize t-digest in general may make some specific datasets perform worse.