A scalable and concurrent programming implementation of Affinity Propagation clustering.
Affinity Propagation is a clustering algorithm based on passing messages between data-points.
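To make the message-passing idea concrete, here is a minimal, in-memory NumPy sketch of the update rules from Frey and Dueck's paper. It is not the code of Concurrent_AP: the function name `affinity_propagation_sketch` is hypothetical, the whole similarity matrix is assumed to fit in memory, and convergence is taken to mean that the set of exemplars stays unchanged for `convergence_iter` consecutive iterations.

```python
import numpy as np

def affinity_propagation_sketch(S, damping=0.5, max_iter=200, convergence_iter=15):
    """Dense message passing on a square similarity matrix S.

    The diagonal of S must already hold the preferences.
    Returns (exemplar_indices, labels); labels are -1 if no exemplar emerged.
    """
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    rows = np.arange(n)
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)
    exemplars = np.array([], dtype=int)
    previous, stable = None, 0

    for _ in range(max_iter):
        # r(i,k) <- s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        AS = A + S
        best = np.argmax(AS, axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1.0 - damping) * R_new

        # a(i,k) <- min(0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)))
        # a(k,k) <- sum over i' != k of max(0, r(i',k))
        Rp = np.maximum(R, 0.0)
        Rp[rows, rows] = R[rows, rows]
        A_new = Rp.sum(axis=0)[None, :] - Rp
        self_avail = A_new[rows, rows].copy()
        A_new = np.minimum(A_new, 0.0)
        A_new[rows, rows] = self_avail
        A = damping * A + (1.0 - damping) * A_new

        # Sample k is an exemplar when a(k,k) + r(k,k) > 0.
        exemplars = np.flatnonzero((A + R)[rows, rows] > 0)
        if previous is not None and np.array_equal(exemplars, previous):
            stable += 1
        else:
            stable = 0
        previous = exemplars
        if exemplars.size > 0 and stable >= convergence_iter:
            break

    if exemplars.size == 0:
        return exemplars, np.full(n, -1, dtype=int)
    # Assign every sample to its most similar exemplar.
    labels = np.argmax(S[:, exemplars], axis=1)
    labels[exemplars] = np.arange(exemplars.size)
    return exemplars, labels
```

Each iteration touches three N x N matrices, which is what makes the dense approach expensive for large N.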
Storing and updating matrices of 'availabilities', 'responsibilities' and 'similarities' between samples can be memory-intensive. We address this issue through the use of an HDF5 data structure, allowing Affinity Propagation clustering of arbitrarily large data-sets, where other Python implementations would raise a MemoryError on most machines.
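As an illustration of the on-disk approach, the following sketch writes a similarity matrix to an HDF5 file block by block with PyTables, so that only a few rows are held in memory at a time. The group name 'aff_prop_group' comes from the description of the `--file` option; the function name `store_similarities`, the array name 'similarities', the block size and the choice of negative squared Euclidean distances are assumptions made for this example and may differ from what Concurrent_AP does internally.

```python
import numpy as np
import tables

def store_similarities(data, hdf5_path, rows_per_block=1000):
    """Write negative squared Euclidean distances to an HDF5 file,
    computing and storing rows_per_block rows at a time."""
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    sq_norms = (data ** 2).sum(axis=1)
    with tables.open_file(hdf5_path, mode='w') as fileh:
        group = fileh.create_group(fileh.root, 'aff_prop_group')
        # Hypothetical array name; a chunked array lives on disk, not in RAM.
        sims = fileh.create_carray(group, 'similarities',
                                   tables.Float32Atom(), shape=(n, n))
        for start in range(0, n, rows_per_block):
            stop = min(start + rows_per_block, n)
            # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, for one block of rows
            block = (sq_norms[start:stop, None] + sq_norms[None, :]
                     - 2.0 * data[start:stop].dot(data.T))
            sims[start:stop, :] = -np.maximum(block, 0.0)
    return n
```

Reading a slice such as `fileh.root.aff_prop_group.similarities[0:10, :]` later loads only those rows from disk.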
We also significantly speed up the computations by splitting them up across subprocesses, thereby taking full advantage of the resources of multi-core processors and bypassing the Global Interpreter Lock of the standard Python interpreter, CPython.
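The general pattern of splitting row-wise work across processes can be sketched with the standard `multiprocessing` module. This is only an outline of the technique, not the scheduling used by Concurrent_AP: the helper names `chunk_ranges`, `similarity_block` and `parallel_similarities` are hypothetical, and for simplicity each task receives a full copy of the data and the result is gathered in memory.

```python
import multiprocessing
import numpy as np

def chunk_ranges(n_rows, n_chunks):
    """Split range(n_rows) into at most n_chunks contiguous (start, stop) pairs."""
    bounds = np.linspace(0, n_rows, min(n_chunks, n_rows) + 1).astype(int)
    return [(int(a), int(b)) for a, b in zip(bounds[:-1], bounds[1:])]

def similarity_block(args):
    """Worker: negative squared Euclidean distances for rows start..stop-1."""
    data, start, stop = args
    diff = data[start:stop, None, :] - data[None, :, :]
    return start, -(diff ** 2).sum(axis=2)

def parallel_similarities(data, n_procs=None):
    """Compute the full similarity matrix, one block of rows per task."""
    data = np.asarray(data, dtype=float)
    n_procs = n_procs or multiprocessing.cpu_count()
    tasks = [(data, a, b) for a, b in chunk_ranges(len(data), n_procs)]
    S = np.empty((len(data), len(data)))
    pool = multiprocessing.Pool(n_procs)
    try:
        # Each worker is a separate process, so the GIL is not shared.
        for start, block in pool.imap_unordered(similarity_block, tasks):
            S[start:start + len(block)] = block
    finally:
        pool.close()
        pool.join()
    return S
```

When calling `parallel_similarities` from a script, place the call under an `if __name__ == '__main__':` guard so that worker processes can import the module safely.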
Concurrent_AP requires Python 2.7 along with the following packages and a few modules from the Python Standard Library:
- NumPy >= 1.9
- psutil
- PyTables
- scikit-learn
- setuptools
It is suggested that you check that the required dependencies are installed, although the pip command below should install them automatically. The most convenient way to obtain Concurrent_AP is from the official Python Package Index (PyPI), as follows:
- open a terminal window;
- type in the command:
pip install Concurrent_AP
The code has been tested on Fedora, OS X and Ubuntu and should work on any other Unix-like operating system.
See the docstrings associated with each function of the Concurrent_AP module for more information and an understanding of how different tasks are organized and shared among subprocesses.
Usage: Concurrent_AP [options] file_name, where file_name denotes the path to the file holding the data to be processed by Affinity Propagation clustering. Each row corresponds to a sample and the features must be tab-separated (*.tsv file). The file is assumed to have no header row.
- -c or --convergence: specify the number of iterations without change in the number of clusters that signals convergence (defaults to 15);
- -d or --damping: the damping parameter of Affinity Propagation (defaults to 0.5);
- -f or --file: option to specify the file name or file handle of the hierarchical data format where the matrices involved in Affinity Propagation clustering will be stored, as part of a group named 'aff_prop_group' at the root of the HDF5 (defaults to a temporary file); the dataset itself won't be accessed there and needn't be stored there either;
- -i or --iterations: maximum number of message-passing iterations (defaults to 200);
- -m or --multiprocessing: the number of processes to use;
- -p or --preference: the preference parameter of Affinity Propagation (if not specified, will be determined as the median of the distribution of pairwise L2 Euclidean distances between samples);
- -s or --similarities: determine if a similarity matrix has been pre-computed and stored in the HDF5 data structure accessible at the location specified through the command line option -f or --file (see above);
- -v or --verbose: whether to be verbose.
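The role of the preference parameter can be illustrated with a short NumPy sketch: the preference is written on the diagonal of the similarity matrix, and when none is given a median of the pairwise similarities is used. The function name `set_preference` is hypothetical, and taking the median over off-diagonal entries only is an assumption of this example; the exact default computed by Concurrent_AP may differ.

```python
import numpy as np

def set_preference(S, preference=None):
    """Return a copy of S with the preference on its diagonal,
    together with the preference value that was used."""
    S = np.array(S, dtype=float)  # work on a copy
    if preference is None:
        # Assumption: median over all pairs of distinct samples.
        off_diagonal = ~np.eye(S.shape[0], dtype=bool)
        preference = np.median(S[off_diagonal])
    S[np.diag_indices_from(S)] = preference
    return S, float(preference)
```

A lower (more negative) preference tends to yield fewer clusters, while a higher one tends to yield more.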
The following few lines illustrate the use of Concurrent_AP on the 'Iris data-set' from the UCI Machine Learning Repository. While the number of samples is far too small here for the benefits of the present multi-tasking implementation and the use of an HDF5 data structure to come fully into play, this data-set has the advantage of being well-known and lends itself to a quick comparison with scikit-learn's version of Affinity Propagation clustering.
- In a Python interpreter console, enter the following few lines, whose purpose is to create a file containing the Iris data-set that will be later subjected to Affinity Propagation clustering via Concurrent_AP:
>>> import numpy as np
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> data = iris.data
>>> with open('./iris_data.tsv', 'w') as f:
...     np.savetxt(f, data, fmt = '%.4f', delimiter = '\t')
- Open a terminal window.
- Type in:
Concurrent_AP --preference -5.47 -v iris_data.tsv
or simply
Concurrent_AP iris_data.tsv
The latter will automatically compute a preference parameter from the data-set.
When the rounds of message-passing among data-points have completed, a folder is created in your current working directory. It contains a file of cluster labels and a file of cluster center indices, both in tab-separated format.
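For the comparison with scikit-learn mentioned earlier, a sketch along the following lines can be run on the same data. The parameter values mirror the defaults and the preference quoted above; whether both implementations yield identical labels is not claimed here.

```python
from sklearn import datasets
from sklearn.cluster import AffinityPropagation

iris = datasets.load_iris()

# Same damping, iteration limits and preference as in the command above.
model = AffinityPropagation(damping=0.5, max_iter=200,
                            convergence_iter=15, preference=-5.47)
labels = model.fit_predict(iris.data)
centers = model.cluster_centers_indices_

print('%d clusters found by scikit-learn' % len(centers))
```

The `labels` and `centers` arrays can then be compared against the two files written by Concurrent_AP.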
Brendan J. Frey and Delbert Dueck, "Clustering by Passing Messages Between Data Points". In: Science, Vol. 315, no. 5814, pp. 972-976. 2007.