We implement the k-Means|| and Mini-batch k-Means|| algorithms in Spark's distributed framework.
We evaluate the algorithms' performance on a synthetic Gaussian-mixture dataset and on the kddcup99 dataset available through Scikit-learn.
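As a hedged illustration of what the synthetic evaluation data could look like, the sketch below draws points from an isotropic Gaussian mixture with NumPy. The function name, the component means, and the noise scale are all assumptions for illustration, not the exact generator used in the experiments.

```python
import numpy as np

def make_gaussian_mixture(n_samples, means, noise_scale=1.0, seed=0):
    """Draw points from an equal-weight isotropic Gaussian mixture.

    `means` is a (k, d) array of component means; each sample picks a
    component uniformly at random and adds N(0, noise_scale^2 I) noise.
    (Illustrative stand-in for the synthetic dataset, not the exact one.)
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    comp = rng.integers(0, len(means), size=n_samples)
    noise = noise_scale * rng.standard_normal((n_samples, means.shape[1]))
    return means[comp] + noise, comp

# Three well-separated 2-D components, 1000 points total.
X, labels = make_gaussian_mixture(1000, means=[[0, 0], [10, 10], [-10, 10]])
```

With well-separated means like these, the clustering quality of the different initializations is easy to compare visually and via inertia.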
The k-Means method is widely used for unsupervised clustering tasks. One common technique for center initialization in k-Means is k-Means++, but it is an inherently sequential algorithm.
A scalable variant, referred to as k-Means|| (parallel k-Means), was introduced and is described in detail in the paper available at: https://arxiv.org/abs/1203.6402.
At the heart of k-Means|| lies the initialization procedure, which is depicted in the following pseudo-code as presented in the aforementioned paper:
Algorithm 2: k-means|| (k, ℓ) initialization.
1: C ← sample a point uniformly at random from X
2: ψ ← ϕ_X(C)
3: for O(log ψ) times do
4: C′ ← sample each point x ∈ X independently with probability
pₓ = ℓ ⋅ d²(x, C) / ϕ_X(C)
5: C ← C ∪ C′
6: end for
7: For x ∈ C, set wₓ to be the number of points in X closer to x than any other point in C
8: Recluster the weighted points in C into k clusters
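The initialization above can be sketched locally with NumPy, assuming Euclidean distance; in the Spark implementation the distance and sampling steps would instead be distributed jobs over a partitioned dataset. The function name and the division of steps into helpers are illustrative choices, not the paper's reference code.

```python
import numpy as np

def kmeans_parallel_init(X, l, seed=0):
    """Sketch of the k-means|| oversampling initialization (steps 1-7).

    Returns the candidate centers C and their weights w_x; step 8
    (reclustering the weighted candidates into k centers, e.g. with
    weighted k-means++) is left out of this sketch.
    """
    rng = np.random.default_rng(seed)

    def sq_dists_to(points, centers):
        # Squared distance from every point to its nearest center.
        return ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)

    # Step 1: one center sampled uniformly at random from X.
    C = X[rng.integers(len(X))][None, :]
    # Step 2: psi = phi_X(C), the total cost of X w.r.t. C.
    d2 = sq_dists_to(X, C)
    psi = d2.sum()
    # Steps 3-6: O(log psi) oversampling rounds (the paper notes that
    # around 5 rounds usually suffice in practice).
    for _ in range(max(1, int(np.ceil(np.log(psi))))):
        cost = d2.sum()
        if cost == 0:  # every point already coincides with a center
            break
        # Step 4: sample each x independently with p_x = l * d^2(x, C) / phi_X(C).
        probs = np.minimum(1.0, l * d2 / cost)
        new = X[rng.random(len(X)) < probs]
        if len(new):
            C = np.vstack([C, new])          # step 5: C <- C union C'
            d2 = np.minimum(d2, sq_dists_to(X, new))
    # Step 7: weight each candidate by how many points it is closest to.
    assign = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
    w = np.bincount(assign, minlength=len(C))
    return C, w
```

The weights from step 7 matter because the oversampled candidate set is much larger than k; the final reclustering in step 8 treats each candidate as a point with multiplicity wₓ.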
An alternative to both k-Means++ and k-Means|| is the Mini-batch k-Means algorithm, detailed in this paper: https://sci-hub.se/10.1145/1772690.1772862.
This approach optimizes k-Means clustering over small (mini) batches rather than over the full dataset in a single large-batch procedure. It has been shown to converge faster and scales k-Means to large datasets at low computational cost.
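The core update in Sculley's mini-batch scheme can be sketched as follows, again in local NumPy rather than Spark. The batch size, iteration count, and function name are illustrative assumptions; each center is moved toward its assigned batch points with a per-center learning rate of 1/count, which is what makes per-iteration cost independent of the dataset size.

```python
import numpy as np

def minibatch_kmeans(X, k, batch_size=100, n_iters=100, seed=0):
    """Sketch of mini-batch k-means (Sculley, WWW 2010).

    Each iteration samples a small batch, assigns it to the nearest
    centers, and nudges each center toward its batch points using a
    per-center learning rate that decays as 1 / (points seen so far).
    """
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct random points (k-means++ or
    # k-means|| seeding could be plugged in here instead).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)
    for _ in range(n_iters):
        batch = X[rng.integers(0, len(X), size=batch_size)]
        # Nearest-center assignment for the batch only.
        d2 = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for i, x in zip(assign, batch):
            counts[i] += 1
            eta = 1.0 / counts[i]  # per-center decaying learning rate
            centers[i] = (1.0 - eta) * centers[i] + eta * x
    return centers
```

Because only `batch_size` points are touched per iteration, the method trades a small amount of solution quality for a large reduction in wall-clock time on big datasets.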
We implement and benchmark the above-mentioned algorithms using Spark's distributed framework.