Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

more mathematically grounded heuristic for recluster triggering #14

Open
erikerlandson opened this issue Jan 17, 2019 · 0 comments
Open

Comments

@erikerlandson
Copy link
Member

the main triggering scenario for re-clustering is if the distribution of the incoming data begins to move. In the "null-hypothesis" case of "the distribution is stationary" then the expected number of new clusters (as a function of N, the number of samples seen so far) is 2log(N) - if the actual number of clusters is growing substantially faster than that, then it is most likely because new extrema are appearing faster than they "should be" due to new extrema being encountered.

The idea would be to work out the details of how to turn this basic idea into a more statistically grounded test for when to trigger a re-clustering.

Related: if the sketch needs a re-clustering due to the above scenario, then it implies that only the tails of the distribution need to be re-clustered. That is, a bunch of singleton-clusters that are accumulating at one or both ends. Figuring out a good way to re-cluster only these tails would be faster and probably more numerically stable. See for example this. Possibly there is a way to re-insert these tail clusters while using some kind of discounted measure of the quantile, so the insert logic is persuaded to make larger clusters closer in to the distribution body.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant