We implement the k-Means|| and Mini-batch k-Means|| algorithms in Spark's distributed framework.
We evaluate the algorithms' performance on a synthetic Gaussian-mixture dataset and on the kddcup99 dataset available through Scikit-learn.
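As a hedged illustration of what the synthetic evaluation data could look like, the sketch below draws points from an isotropic Gaussian mixture with NumPy. The function name, the component means, and the noise scale are all assumptions for illustration, not the exact generator used in the experiments.

```python
import numpy as np

def make_gaussian_mixture(n_samples, means, noise_scale=1.0, seed=0):
    """Draw points from an equal-weight isotropic Gaussian mixture.

    `means` is a (k, d) array of component means; each sample picks a
    component uniformly at random and adds N(0, noise_scale^2 I) noise.
    (Illustrative stand-in for the synthetic dataset, not the exact one.)
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    comp = rng.integers(0, len(means), size=n_samples)
    noise = noise_scale * rng.standard_normal((n_samples, means.shape[1]))
    return means[comp] + noise, comp

# Three well-separated 2-D components, 1000 points total.
X, labels = make_gaussian_mixture(1000, means=[[0, 0], [10, 10], [-10, 10]])
```

With well-separated means like these, the clustering quality of the different initializations is easy to compare visually and via inertia.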
The k-Means method is widely used for unsupervised clustering tasks. One common technique for center initialization in k-Means is k-Means++, but it is an inherently sequential algorithm.
A scalable variant, referred to as k-Means|| (parallel k-Means), was introduced and is described in detail in the paper available at: https://arxiv.org/abs/1203.6402.
At the heart of k-Means|| lies the initialization procedure, which is depicted in the following pseudo-code as presented in the aforementioned paper:
Algorithm 2: k-means|| (k, ℓ) initialization.
1: C ← sample a point uniformly at random from X
2: ψ ← ϕ_X(C)
3: for O(log ψ) times do
4: C′ ← sample each point x ∈ X independently with probability
pₓ = ℓ ⋅ d²(x, C) / ϕ_X(C)
5: C ← C ∪ C′
6: end for
7: For x ∈ C, set wₓ to be the number of points in X closer to x than any other point in C
8: Recluster the weighted points in C into k clusters
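The initialization above can be sketched locally with NumPy, assuming Euclidean distance; in the Spark implementation the distance and sampling steps would instead be distributed jobs over a partitioned dataset. The function name and the division of steps into helpers are illustrative choices, not the paper's reference code.

```python
import numpy as np

def kmeans_parallel_init(X, l, seed=0):
    """Sketch of the k-means|| oversampling initialization (steps 1-7).

    Returns the candidate centers C and their weights w_x; step 8
    (reclustering the weighted candidates into k centers, e.g. with
    weighted k-means++) is left out of this sketch.
    """
    rng = np.random.default_rng(seed)

    def sq_dists_to(points, centers):
        # Squared distance from every point to its nearest center.
        return ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)

    # Step 1: one center sampled uniformly at random from X.
    C = X[rng.integers(len(X))][None, :]
    # Step 2: psi = phi_X(C), the total cost of X w.r.t. C.
    d2 = sq_dists_to(X, C)
    psi = d2.sum()
    # Steps 3-6: O(log psi) oversampling rounds (the paper notes that
    # around 5 rounds usually suffice in practice).
    for _ in range(max(1, int(np.ceil(np.log(psi))))):
        cost = d2.sum()
        if cost == 0:  # every point already coincides with a center
            break
        # Step 4: sample each x independently with p_x = l * d^2(x, C) / phi_X(C).
        probs = np.minimum(1.0, l * d2 / cost)
        new = X[rng.random(len(X)) < probs]
        if len(new):
            C = np.vstack([C, new])          # step 5: C <- C union C'
            d2 = np.minimum(d2, sq_dists_to(X, new))
    # Step 7: weight each candidate by how many points it is closest to.
    assign = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
    w = np.bincount(assign, minlength=len(C))
    return C, w
```

The weights from step 7 matter because the oversampled candidate set is much larger than k; the final reclustering in step 8 treats each candidate as a point with multiplicity wₓ.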
An alternative to both k-Means++ and k-Means|| is the Mini-batch k-Means algorithm, detailed in this paper: https://sci-hub.se/10.1145/1772690.1772862.
This approach optimizes k-Means clustering over small (mini) batches rather than over the full dataset in a single large-batch procedure. It has been shown to converge faster and scales k-Means to large datasets at low computational cost.
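The core update in Sculley's mini-batch scheme can be sketched as follows, again in local NumPy rather than Spark. The batch size, iteration count, and function name are illustrative assumptions; each center is moved toward its assigned batch points with a per-center learning rate of 1/count, which is what makes per-iteration cost independent of the dataset size.

```python
import numpy as np

def minibatch_kmeans(X, k, batch_size=100, n_iters=100, seed=0):
    """Sketch of mini-batch k-means (Sculley, WWW 2010).

    Each iteration samples a small batch, assigns it to the nearest
    centers, and nudges each center toward its batch points using a
    per-center learning rate that decays as 1 / (points seen so far).
    """
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct random points (k-means++ or
    # k-means|| seeding could be plugged in here instead).
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)
    for _ in range(n_iters):
        batch = X[rng.integers(0, len(X), size=batch_size)]
        # Nearest-center assignment for the batch only.
        d2 = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for i, x in zip(assign, batch):
            counts[i] += 1
            eta = 1.0 / counts[i]  # per-center decaying learning rate
            centers[i] = (1.0 - eta) * centers[i] + eta * x
    return centers
```

Because only `batch_size` points are touched per iteration, the method trades a small amount of solution quality for a large reduction in wall-clock time on big datasets.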
We implement and benchmark the above-mentioned algorithms using Spark's distributed framework.