Clustering in 2D - is that the best choice? #2

robinlabs · 2016-07-26T20:33:17Z

Question to Y-hat folks: why cluster in 2D? Granted, clustering in 300D is hard :) Still, the 2D projection must add significant metric distortion. Why not a middle ground, say 5-10D? Have you tried that?

legel · 2016-07-26T20:51:07Z

Thanks @robinlabs, that's definitely a great question.

The short answer is that you're right: 2D is not necessarily optimal. It's clearly nice for data visualization, although 3D would probably be even cooler...

In any case, I haven't yet tried full HDBSCAN clustering in 300D, so that could be interesting to try out. It would also be interesting to consider whether there's some way to estimate a "maximum likelihood" value for D, the target dimensionality, that balances preservation of information against suppression of noise in the derived 300D vectors. t-SNE helps to reduce the noise that naturally emerges when averaging 25 completely different vectors for keywords found online...

Definitely hope to improve this, and any ideas / contributions are welcome!
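
For anyone who wants to experiment with picking D, here is a minimal sketch. It is not the method used in this repo, and it swaps the "maximum likelihood" idea for a simpler proxy: the cumulative variance retained by the top-D principal components, computed with plain PCA via NumPy's SVD. The synthetic `vectors` array, the 0.9 threshold, and the helper names are all assumptions for illustration.

```python
import numpy as np

def cumulative_explained_variance(vectors):
    """Fraction of total variance kept by the top-k principal components, for each k."""
    centered = vectors - vectors.mean(axis=0)
    # Squared singular values of the centered data are proportional to
    # the variance along each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    return np.cumsum(variance) / variance.sum()

def smallest_dimension(vectors, keep=0.9):
    """Smallest D whose top-D components retain at least `keep` of the variance."""
    curve = cumulative_explained_variance(vectors)
    # searchsorted returns the first index whose cumulative value is >= keep.
    return int(np.searchsorted(curve, keep) + 1)

# Synthetic stand-in for the 300D keyword vectors (assumption):
# 8 strong latent directions mixed into 300 dimensions, plus unit noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 8)) * 5.0
mixing = rng.normal(size=(8, 300))
vectors = latent @ mixing + rng.normal(size=(500, 300))

curve = cumulative_explained_variance(vectors)
d = smallest_dimension(vectors, keep=0.9)
print(f"smallest D keeping 90% of the variance: {d}")
```

On real keyword vectors the curve may not show a clean elbow, so the threshold is a judgment call. The point is only that D can be read off the data rather than fixed at 2.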

robinlabs · 2016-07-26T21:03:55Z

HDBSCAN may very well break in 300D, but 5-10D may be reasonable: it forces less metric distortion while still suppressing quite a bit of noise. If you do try that, it would be interesting to know the results!

--
Ilya Eckstein, PhD
cofounder / CEO @ Robin Labs
650-223-5797
www.robinlabs.com
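
The metric distortion concern raised above can be checked directly. Below is a minimal sketch, assuming PCA as a stand-in for the reduction step (the thread is about t-SNE, which behaves differently): it projects synthetic 300D vectors to 2, 5, and 10 dimensions and correlates pairwise distances before and after. The synthetic data and all function names are hypothetical.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distances between all distinct pairs of rows, as a flat array."""
    sq = (points ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    upper = np.triu_indices(len(points), k=1)
    # Clip tiny negative values caused by floating-point rounding.
    return np.sqrt(np.clip(d2[upper], 0.0, None))

def pca_project(vectors, n_components):
    """Project onto the top principal components (stand-in for t-SNE, an assumption)."""
    centered = vectors - vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def distance_preservation(original, reduced):
    """Pearson correlation between pairwise distances in the two spaces."""
    a = pairwise_distances(original)
    b = pairwise_distances(reduced)
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic stand-in for 300D keyword vectors (assumption):
# 20 latent directions mixed into 300 dimensions, plus mild noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 20))
mixing = rng.normal(size=(20, 300))
vectors = latent @ mixing + 0.5 * rng.normal(size=(300, 300))

scores = {d: distance_preservation(vectors, pca_project(vectors, d)) for d in (2, 5, 10)}
for d, score in scores.items():
    print(f"{d:>2}D: distance correlation with 300D = {score:.3f}")
```

A higher correlation means pairwise distances survive the projection better. The same `distance_preservation` helper could be pointed at t-SNE output for real vectors to compare 2D against a 5-10D embedding before clustering.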

legel · 2016-07-26T21:11:19Z

Definitely!

I suspect at some point 3D HDBSCAN is going to be awesome to set up (probably when we're hooking up an internal dashboard for overlap.ai) and around that time I'll do a check on all this and report back.
