Clustering in 2D - is that the best choice? #2
Comments
Thanks @robinlabs, that's definitely a great question. The short answer is that you're right: 2D is not necessarily optimal. It's clearly nice for data visualization, although 3D would probably be even cooler... In any case, I haven't yet tried full HDBSCAN clustering in 300D, so that could be interesting to try out. It would also be interesting to consider whether there's some way to estimate a "maximum likelihood" value for D that balances preservation of information against suppression of noise in the derived 300D vectors. t-SNE helps reduce the noise that naturally emerges when averaging 25 completely different vectors for keywords found online... Definitely hope to improve this, and any ideas / contributions are welcome!
HDBSCAN may very well break in 300D, but 5-10D may be reasonable while
On Tue, Jul 26, 2016 at 1:52 PM, Lance Legel wrote:
Definitely! I suspect at some point 3D HDBSCAN is going to be awesome to set up (probably when we're hooking up an internal dashboard for overlap.ai) and around that time I'll do a check on all this and report back.
Question to Y-hat folks: why cluster in 2D? Granted, clustering in 300D is hard :) Still, the 2D projection must add significant metric distortion. Why not a middle ground, say 5-10D? Have you tried that?
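
To put a rough number on the metric-distortion concern, here is a minimal sketch in pure Python. It is not the project's pipeline: it swaps t-SNE for a plain Gaussian random projection (a linear map, so it only illustrates how much distance information a given target dimension can hold), and it uses synthetic random 300D points rather than real word vectors. The function names, the sample size, and the seed are all hypothetical choices made for illustration.

```python
import math
import random


def random_points(n, dim, rng):
    """Draw n synthetic points with independent standard-normal coordinates."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


def random_projection(points, target_dim, rng):
    """Project points with a Gaussian random matrix.

    Entries are scaled by 1/sqrt(target_dim) so that squared distances are
    preserved in expectation. This is a stand-in for t-SNE, not t-SNE itself.
    """
    source_dim = len(points[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(source_dim)]
              for _ in range(target_dim)]
    return [[sum(w * x for w, x in zip(row, p)) for row in matrix]
            for p in points]


def mean_distance_distortion(original, projected):
    """Average relative error |d_proj - d_orig| / d_orig over all point pairs."""
    total, pairs = 0.0, 0
    for i in range(len(original)):
        for j in range(i + 1, len(original)):
            d_orig = math.dist(original[i], original[j])
            d_proj = math.dist(projected[i], projected[j])
            total += abs(d_proj - d_orig) / d_orig
            pairs += 1
    return total / pairs


def distortion_by_dimension(dims=(2, 5, 10), n=40, source_dim=300, seed=0):
    """Return {target_dim: mean relative distance distortion} for synthetic data."""
    rng = random.Random(seed)
    points = random_points(n, source_dim, rng)
    return {d: mean_distance_distortion(points, random_projection(points, d, rng))
            for d in dims}


if __name__ == "__main__":
    for dim, err in distortion_by_dimension().items():
        print(f"{dim}D: mean relative distance error = {err:.3f}")
```

On synthetic data like this, the 2D projection should show noticeably larger pairwise-distance error than the 10D one. Whether that carries over to t-SNE embeddings of real keyword vectors, and whether HDBSCAN actually clusters better in 5-10D, would still need to be checked on the real data.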