Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NEW FEATURE] Adding min-max linkage to Agglomerative Hierarchical Clustering #93

Open
NimaSarajpoor opened this issue Mar 21, 2021 · 0 comments

Comments

@NimaSarajpoor
Copy link

I mentioned it on scikit-learn, and someone suggested mentioning it here on extra library as it seems a more appropriate place to add this new feature.

I think many people (who are on the application side of ML) use scikit-learn for their projects and are not aware of many algorithms that are available. Thus, implementing a new algorithm can let them know such a method exists. Please let me know if it is a good idea to add this method to the Agglomerative Clustering. If so, I can work on it (as I already implemented it myself) and add it to the package.

======================================

Challenge: The linkages in agglemorative clustering that can work with the distance matrix are complete, linkage, and average. However, the resulted clusters don't have a clear centroid. So, a method that can work with a distance matrix and provide a centroid-based structured clusters can provide a lot of flexibility. It can be used for any distance measure and since it utilizes the distance matrix only, it can be performed very fast for large-size data. It also provides centroids which can be used as the representatives of the clusters.

Solution: A min-max linkage proposed by Jacob Bien & Robert Tibshirani (2011) is a good solution. It merges two clusters if a ball that is required to cover the points of the two clusters together is the smallest compare to others. A ball for each cluster can be identified as follows: First, identify the furthest neighbor of each point and calculate its distance. Then, choose the point (M) that has the shortest distance to its furthest neighbor. That point M is the centroid and that shortest distance is the radius of the ball. I recently used it for my paper where I had to use DTW distance, but I also need a fast algorithm that provides a reasonable centroid rather than computing DBA at each step of the process.

Alternative solution: N/A

Additional Info:
Paper:
Bien, Jacob, and Robert Tibshirani. "Hierarchical clustering with prototypes via minimax linkage." Journal of the American Statistical Association 106.495 (2011): 1075-1084.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant