The main triggering scenario for re-clustering is when the distribution of the incoming data begins to drift. Under the null hypothesis that the distribution is stationary, the expected number of new clusters (as a function of N, the number of samples seen so far) is about 2 ln(N): the i-th sample is a new maximum with probability 1/i (and likewise a new minimum), so each tail contributes roughly ln(N) new extrema on average. If the actual number of clusters is growing substantially faster than that, it is most likely because new extrema are being encountered faster than they "should" be under a stationary distribution.
The idea would be to work out how to turn this observation into a statistically grounded test for when to trigger a re-clustering; a rough sketch of one possible test follows.
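As a minimal sketch of such a trigger, assuming the sketch keeps a running count of clusters created from new extrema (the function name, the `z_threshold` default, and the normal approximation are all assumptions for illustration, not anything the library currently provides): under the stationary null the count of new extrema is a sum of nearly independent Bernoulli(1/i) indicators per tail, so its mean and variance are both roughly 2 ln(N), and a simple z-score cut-off gives a first-pass test.

```python
import math


def should_recluster(new_extrema_count: int, n_samples: int,
                     z_threshold: float = 3.0) -> bool:
    """Hypothetical trigger test -- a sketch, not an existing API.

    Under the stationary null hypothesis the i-th sample is a new
    maximum with probability 1/i (and likewise a new minimum), so the
    count of new extrema after n samples has mean ~ 2*ln(n) and, being
    a sum of nearly independent Bernoulli(1/i) indicators, variance of
    roughly the same size.  Trigger a re-clustering when the observed
    count sits more than z_threshold standard deviations above that
    expectation.
    """
    if n_samples < 2:
        return False
    expected = 2.0 * math.log(n_samples)
    z = (new_extrema_count - expected) / math.sqrt(expected)
    return z > z_threshold
```

For example, after 10,000 samples the null expectation is about 18 new extrema, so seeing 40 of them is roughly five standard deviations out and would trigger a re-clustering; for small N a Poisson-style tail bound would probably behave better than the normal approximation used here.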
Related: if the sketch needs a re-clustering because of the above scenario, then it implies that only the tails of the distribution need to be re-clustered; that is, the new clusters are a bunch of singletons accumulating at one or both ends. Figuring out a good way to re-cluster only these tails would be faster and probably more numerically stable. See for example this. Possibly there is a way to re-insert these tail clusters using some kind of discounted measure of the quantile, so that the insert logic is persuaded to form larger clusters closer in toward the body of the distribution.
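A very rough sketch of the tail-only idea, using a toy representation of centroids as (mean, weight) pairs rather than the real digest data structure (the 4·N·q·(1−q)/compression size bound and the `discount` parameter are assumptions for illustration): peel off the trailing run of singleton clusters and greedily re-merge them, but compute the size bound at a quantile pulled part-way toward the median, so the merge is willing to form larger clusters nearer the body.

```python
from typing import List, Tuple

Centroid = Tuple[float, float]  # (mean, weight), kept sorted by mean


def recluster_upper_tail(centroids: List[Centroid], compression: float,
                         discount: float = 0.5) -> List[Centroid]:
    """Toy sketch of a tail-only re-cluster -- not the t-digest API.

    Peels off the trailing run of singleton clusters and greedily
    re-merges them under a q*(1-q) size bound, with the quantile pulled
    part-way toward the median by `discount` so the merge is allowed to
    form larger clusters than a plain re-insert would.
    """
    total = sum(w for _, w in centroids)

    # Find the trailing run of weight-1 (singleton) clusters.
    start = len(centroids)
    while start > 0 and centroids[start - 1][1] == 1:
        start -= 1
    body, tail = centroids[:start], centroids[start:]
    if len(tail) < 2:
        return centroids  # nothing worth re-clustering

    merged: List[Centroid] = []
    cum = sum(w for _, w in body)  # weight strictly below the tail
    cur_mean, cur_w = tail[0]
    for mean, w in tail[1:]:
        # Quantile of the growing cluster, discounted toward the median
        # so that q*(1-q) -- and hence the size limit -- is larger.
        q = (cum + cur_w / 2.0) / total
        q = 0.5 + discount * (q - 0.5)
        limit = 4.0 * total * q * (1.0 - q) / compression
        if cur_w + w <= limit:
            cur_mean = (cur_mean * cur_w + mean * w) / (cur_w + w)
            cur_w += w
        else:
            merged.append((cur_mean, cur_w))
            cum += cur_w
            cur_mean, cur_w = mean, w
    merged.append((cur_mean, cur_w))
    return body + merged
```

A mirror-image pass would handle the lower tail, and the discount factor would be the knob that decides how aggressively the tail gets compacted toward the body.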