
Distributed-K-means

We implement the k-Means|| and Mini-batch k-Means|| algorithms in Spark's distributed framework.

We evaluate the algorithms' performance on a synthetic Gaussian-mixture dataset and on the kddcup99 dataset available through scikit-learn.

1. Clustering - k-Means||

The k-Means method is widely used for unsupervised clustering tasks. A common technique for center initialization in k-Means is k-Means++, but it is an inherently sequential algorithm.
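To make the sequential nature concrete, here is a minimal single-machine sketch of k-Means++ seeding (D² sampling). The function name and NumPy-based formulation are our own illustration, not code from this repository; note how each new center depends on all the previously chosen ones, which is what prevents straightforward parallelization.

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    """Sketch of k-Means++ seeding: the first center is drawn uniformly,
    then each subsequent center is drawn with probability proportional to
    its squared distance to the nearest center chosen so far."""
    centers = [X[rng.integers(len(X))]]  # first center: uniform draw
    for _ in range(k - 1):
        C = np.asarray(centers)
        # Squared distance from each point to its nearest chosen center.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1).min(axis=1)
        # D^2 sampling: pick the next center proportionally to d2.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centers)
```

Each of the k − 1 rounds must finish before the next can start, so k-Means++ needs k passes over the data.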

A scalable variant, k-Means|| (parallel k-Means), was introduced and is described in detail in the paper available at: https://arxiv.org/abs/1203.6402.

At the heart of k-Means|| lies its initialization procedure, shown in the following pseudo-code from the paper:

Algorithm 2: k-means|| (k, ℓ) initialization

1: C ← sample a point uniformly at random from X
2: ψ ← ϕ_X(C)
3: for O(log ψ) times do
4:   C′ ← sample each point x ∈ X independently with probability pₓ = ℓ · d²(x, C) / ϕ_X(C)
5:   C ← C ∪ C′
6: end for
7: For x ∈ C, set wₓ to be the number of points in X closer to x than to any other point in C
8: Recluster the weighted points in C into k clusters
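The steps above can be sketched in plain NumPy as follows. This is our own single-machine illustration under stated assumptions, not the repository's Spark code; in Spark, the per-point sampling in step 4 and the cost in step 2 would be computed as map/reduce operations over partitions.

```python
import numpy as np

def cost(X, C):
    """phi_X(C): squared distance of each point to its nearest center, and the sum."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1).min(axis=1)
    return d2, d2.sum()

def kmeans_parallel_init(X, l, rng):
    """Sketch of the k-Means|| oversampling phase (Algorithm 2, steps 1-7)."""
    C = X[rng.integers(len(X))][None, :]              # step 1: one uniform center
    d2, psi = cost(X, C)                              # step 2: initial cost
    for _ in range(int(np.ceil(np.log(psi)))):        # step 3: O(log psi) rounds
        # step 4: sample each point independently with prob l * d2 / phi.
        probs = np.minimum(l * d2 / d2.sum(), 1.0)
        C = np.vstack([C, X[rng.random(len(X)) < probs]])  # step 5: union
        d2, _ = cost(X, C)
    # step 7: weight each candidate by the size of its Voronoi cell.
    assign = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)
    w = np.bincount(assign, minlength=len(C))
    return C, w
```

Step 8 then reclusters the small weighted candidate set (C, w) into k clusters on a single machine, e.g. with a weighted k-Means++. Since each round samples roughly ℓ points in expectation, only O(log ψ) passes over the data are needed, versus k passes for k-Means++.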

2. Clustering - Mini-batch k-Means||

An alternative to k-Means++ and k-Means|| is the Mini-batch k-Means algorithm, detailed in this paper: https://sci-hub.se/10.1145/1772690.1772862.

This approach optimizes k-Means clustering using small (mini) batches rather than a single large-batch optimization procedure. It has been shown to converge faster, and it scales k-Means to large datasets at low computational cost.
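A minimal sketch of the mini-batch update, in the style of the paper's Algorithm 1 (our own illustration; the batch size, iteration count, and function name are assumptions, not the repository's settings). The key idea is a per-center learning rate of 1/nⱼ, where nⱼ counts how many samples center j has absorbed so far:

```python
import numpy as np

def minibatch_kmeans(X, k, batch_size=64, n_iter=100, seed=0):
    """Sketch of Mini-batch k-Means: each iteration draws a small random
    batch and moves each center toward the batch points assigned to it,
    with a per-center step size 1/count that decays over time."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()  # random initial centers
    counts = np.zeros(k)
    for _ in range(n_iter):
        batch = X[rng.choice(len(X), size=batch_size, replace=False)]
        # Assign each batch point to its nearest center.
        assign = ((batch[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)
        for x, j in zip(batch, assign):
            counts[j] += 1
            C[j] += (x - C[j]) / counts[j]  # gradient step with rate 1/n_j
    return C
```

Because each iteration touches only `batch_size` points, the per-step cost is independent of the dataset size, which is what makes the method attractive at scale.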


We implement and benchmark the above-mentioned algorithms using Spark's distributed framework.
