more mathematically grounded heuristic for recluster triggering #14

erikerlandson · 2019-01-17T22:02:28Z

the main triggering scenario for re-clustering is if the distribution of the incoming data begins to move. In the "null-hypothesis" case of "the distribution is stationary" then the expected number of new clusters (as a function of N, the number of samples seen so far) is 2log(N) - if the actual number of clusters is growing substantially faster than that, then it is most likely because new extrema are appearing faster than they "should be" due to new extrema being encountered.

The idea would be to work out the details of how to turn this basic idea into a more statistically grounded test for when to trigger a re-clustering.

Related: if the sketch needs a re-clustering due to the above scenario, then it implies that only the tails of the distribution need to be re-clustered. That is, a bunch of singleton-clusters that are accumulating at one or both ends. Figuring out a good way to re-cluster only these tails would be faster and probably more numerically stable. See for example this. Possibly there is a way to re-insert these tail clusters while using some kind of discounted measure of the quantile, so the insert logic is persuaded to make larger clusters closer in to the distribution body.

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

more mathematically grounded heuristic for recluster triggering #14

more mathematically grounded heuristic for recluster triggering #14

erikerlandson commented Jan 17, 2019

more mathematically grounded heuristic for recluster triggering #14

more mathematically grounded heuristic for recluster triggering #14

Comments

erikerlandson commented Jan 17, 2019