Clustering in 2D - is that the best choice? #2

Open
robinlabs opened this issue Jul 26, 2016 · 3 comments
Comments

@robinlabs

Question to Y-hat folks: why cluster in 2D? Granted, clustering in 300D is hard :) Still, the 2D projection must add significant metric distortion. Why not a middle ground, say 5-10D? Have you tried that?

@legel
Owner

legel commented Jul 26, 2016

Thanks @robinlabs, that's definitely a great question.

The short answer is you're right, 2D is not necessarily an optimum. It's clearly nice for data visualization, although 3D would probably be even cooler...

In any case, I haven't yet tried full HDBSCAN clustering in 300D, so that could be interesting to try out. It would also be interesting to consider if there's some way to measure a "maximum likelihood" value for D that balances preservation of information with suppression of noise in the derived 300D vectors. t-SNE helps to reduce the noise that naturally emerges when averaging 25 completely different vectors for keywords found online...

Definitely hope to improve this, and any ideas / contributions are welcome!

@robinlabs
Author

HDBSCAN may very well break in 300D, but 5-10D may be reasonable: it forces less metric distortion while still giving quite a bit of noise suppression. If you do try that, it would be interesting to know the results!
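To make the distortion point concrete, here is a rough sketch of how one might compare target dimensions. It is not the project's pipeline: it swaps in a Gaussian random projection as a stand-in for t-SNE (which is nonlinear and does not try to preserve global distances), and the data is synthetic 300D Gaussian noise rather than real keyword vectors. All function names are made up for illustration.

```python
import math
import random


def random_projection(points, target_dim, rng):
    """Project points with a Gaussian random matrix scaled by 1/sqrt(target_dim)."""
    source_dim = len(points[0])
    scale = 1.0 / math.sqrt(target_dim)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(source_dim)]
              for _ in range(target_dim)]
    return [[sum(w * x for w, x in zip(row, p)) for row in matrix]
            for p in points]


def pairwise_distances(points):
    """Euclidean distances for every unordered pair of points."""
    return [math.dist(points[i], points[j])
            for i in range(len(points))
            for j in range(i + 1, len(points))]


def mean_relative_distortion(original, projected):
    """Average of |d_projected - d_original| / d_original over all pairs."""
    d_orig = pairwise_distances(original)
    d_proj = pairwise_distances(projected)
    return sum(abs(b - a) / a for a, b in zip(d_orig, d_proj)) / len(d_orig)


def distortion_by_dim(points, dims, trials=5, seed=0):
    """Average the distortion over several random projections per dimension."""
    rng = random.Random(seed)
    result = {}
    for dim in dims:
        scores = [
            mean_relative_distortion(points, random_projection(points, dim, rng))
            for _ in range(trials)
        ]
        result[dim] = sum(scores) / trials
    return result


# Synthetic stand-in for the 300D vectors (assumption: not real embeddings).
data_rng = random.Random(42)
points = [[data_rng.gauss(0.0, 1.0) for _ in range(300)] for _ in range(40)]

distortion = distortion_by_dim(points, dims=(2, 5, 10))
for dim, score in distortion.items():
    print(f"{dim}D: mean relative distance distortion = {score:.3f}")
```

On data like this, the 2D score should come out clearly worse than the 10D one. Whether the same holds after t-SNE on the real vectors, and whether HDBSCAN actually benefits, would need to be checked on the actual data.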

@legel
Owner

legel commented Jul 26, 2016

Definitely!

I suspect at some point 3D HDBSCAN is going to be awesome to set up (probably when we're hooking up an internal dashboard for overlap.ai) and around that time I'll do a check on all this and report back.
