dphmix

Unsupervised and Semi-supervised Dirichlet Process Heterogeneous Mixtures

A Python package that implements Dirichlet Process Heterogeneous Mixtures (DPHMs) of exponential family distributions for clustering heterogeneous data without choosing the number of clusters. Inference can be performed with Gibbs sampling [1] or coordinate ascent mean-field variational inference (MFVI) [2]. For semi-supervised learning, Gibbs sampling supports must-link and can't-link constraints [3]. A novel variational inference algorithm was derived to handle must-link constraints.

It currently supports the following distributions:

Normal: continuous variables are inferred to be Gaussian
Bernoulli: binary {0,1} variables are inferred to be Bernoullian
Poisson: non-negative discrete variables are inferred to be Poissonian

Dependencies

joblib
numpy
pandas
scipy

Tested on Python 3.6.4

Installation

pip install dphmix

Tutorial

In this tutorial, we cluster genomic regions bound by the Nkx2-5 transcription factor based on biological features.

Download the Nkx2-5 dataset from https://github.com/tahmidmehdi/dphmix/tree/master/data.
In Python, import the package and store the data in a dataframe. The rows represent observations and columns represent features:

from dphmix.VariationalDPHM import *
X = pd.read_csv('nkxData.csv', index_col=0, header=0)
# drop cluster and class assignment columns. These are just results from our paper.
X.drop(['Cluster', 'Class'] , axis=1, inplace=True)

Set hyperparameters for the model. Descriptions for hyperparameters are shown in the next section.

# prior parameters for 100 Gaussian features
nu = [0]*100
rho = [1]*100
a = [1]*100
b = [1]*100
# prior parameters for 5 Bernoulli features
gamma = [1]*5
delta = [1]*5
# prior parameters for 9 Poisson features
zeta = 1/np.std(X[X.columns[105:]])
eps = zeta*np.mean(X[X.columns[105:]])
# dictionary of hyperparameters
h = dict(nu=nu, rho=rho, a=a, b=b, gamma=gamma, delta=delta, eps=eps, zeta=zeta)

Initialize a 'VariationalDPHM' model and fit it to X. Information about parameters and outputs is provided in the next section.

model = VariationalDPHM(alpha=1, iterations=1000 , max_clusters=100 , tol=10, random_state=42)
clusters = model.fit_predict(X, hyperparameters=h)

Features with 'float' data types are Gaussian. Features with 'int' data types that only have values of 0 or 1 are Bernoulli. Features with 'int' that only have non-negative values with at least one integer greater than 1 are Poisson. If your feature has negative integer values, you can add a constant to them to shift them to the non-negative integer space.

Classes

VariationalDPHM

VariationalDPHM(alpha, iterations, max_clusters, tol=1e-3, n_jobs=1, random_state=None)

Implements MFVI with conjugate priors using the Stick-breaking Process for Dirichlet Process Mixture Models where the independent variates follow Normal, Bernoulli & Poisson distributions. It can also handle must-link constraints.

Parameter	Data type	Description
alpha	float>0	required. The alpha parameter for the Dirichlet Process. Determines how precisely the model should look for clusters. Higher values will create more clusters.
iterations	int>=1	required. The maximum number of iterations.
max_clusters	int>=2	required. The maximum number of clusters the model can create.
tol	float>0	optional (default: 1e-3). The algorithm stops when the difference between evidence lower bounds (ELBOs) in 2 consecutive iterations is less than tol.
n_jobs	int>=1	optional (default: 1). The number of cores to use.
random_state	int>=0 or None	optional (default: None). Determines the initial clusters and ensures reproducibility.

Method	Description
fit_predict(X[, ml, hyperparameters])	Fits a DPHM model & predicts cluster assignments for each observation in X. ml can be passed to force certain observations to cluster together.
predict(X)	Predicts cluster assignments for each observation in X along with probabilities of belonging to each cluster using the fitted model.
get_params()	Gets arguments for the model.

fit_predict(X, ml=[], hyperparameters=None)

Returns a MFVICluster object.

Parameter	Data type	Description
X	pandas.DataFrame	required. The data to cluster. Rows are observations and columns are variables.
ml	list	optional. A list of must-link constraints where a must-link constraint is a list of indices of observations that must cluster together.
hyperparameters	dict or None	optional (there are built-in default hyperparameters). If X has G, B & P continuous, binary & ordinal features, respectively, then hyperparameters are contained in dictionaries where keys=['nu','rho','a','b','gamma','delta','eps','zeta']. The values of 'nu','rho','a' & 'b' must be lists of numbers with length G (all positive except 'nu'), the values of 'gamma' & 'delta' must be lists of positive numbers with length B, the values of 'eps' & 'zeta' must be lists of positive numbers with length P. 'nu', 'rho', 'a' & 'b' are the mean, precision, shape and rate parameters, respectively, of the NormalGamma distributions that generate means and precisions of Gaussian features. 'gamma' & 'delta' are the shape parameters of the Beta distributions that generate probability parameters for Bernoullian features. 'eps' & 'zeta' are the shape and rate parameters, respectively, of the Gamma distributions that generate average rate parameters for Poissonian features. In every list, the values must be in the same order that their corresponding features appear in X.

predict(X)

Returns a vector of cluster assignments for observations in X and a probability matrix where rows correspond to observations & columns correspond to clusters.

Parameter	Data type	Description
X	pandas.DataFrame	required. The data to cluster.

get_params()

Returns a dictionary of arguments for the model

GibbsDPHM

GibbsDPHM(alpha, iterations, max_clusters, n_jobs=1)

Implements a Gibbs sampler with conjugate priors using the Chinese Restaurant Process for Dirichlet Process Mixture Models where the independent variates follow Normal, Bernoulli & Poisson distributions. It can handle must-link & can't-link constraints.

Parameter	Data type	Description
alpha	float>0	required. The alpha parameter for the Dirichlet Process. Determines how precisely the model should look for clusters. Higher values will create more clusters.
iterations	int>=1	required. The maximum number of iterations.
max_clusters	int>=2	required. The maximum number of clusters the model can create.
n_jobs	int>=1	optional (default: 1). The number of cores to use.

Method	Description
fit_predict(X[, ml, cl, hyperparameters])	Fits a model & predicts cluster assignments for each observation in X. ml can be passed to force certain observations to cluster together. cl can be passed to force certain observations to be in different clusters.
predict(X)	Predicts cluster assignments for each observation in X along with probabilities of belonging to each cluster using the fitted model.
get_params()	Gets arguments for the model.

fit_predict(X, ml=[], cl=[], hyperparameters=None)

Returns a GibbsCluster object.

Parameter	Data type	Description
X	pandas.DataFrame	required. The data to cluster. Rows are observations and columns are variables.
ml	list	optional. A list of must-link constraints.
cl	list	optional. A list of can't-link constraints where a can't-link constraint is a list of 2 indices corresponding to observations that cannot cluster together.
hyperparameters	dict or None	optional (there are built-in default hyperparameters). Same as the VariationalDPHM.fit_predict hyperparameters

predict(X)

Returns a vector of cluster assignments for observations in X and a probability matrix where rows correspond to observations & columns correspond to clusters.

Parameter	Data type	Description
X	pandas.DataFrame	required. The data to cluster.

get_params()

Returns a dictionary of arguments for the model

MFVICluster

Attribute	Data type	Description
c	numpy.array	Cluster assignments of the observations in the data in the order they appear in the data.
variational_parameters	list	List of parameters where the ith element is a dict of variational parameters for the ith cluster.
moments	list	List of moments where the ith element is a dict of expectations of distribution parameters for the ith cluster.
Ez	numpy.ndarray	matrix where the element in row i, column j is the probability of observation i being in cluster j.
cluster_map	dict	mapping of cluster indices in the model (0-max_clusters) to indices in c.
n_clusters	int>=1	The number of clusters found
ELBO	float<=0	The final ELBO

GibbsCluster

Attribute	Data type	Description
c	numpy.array	Cluster assignments of the observations in the data in the order they appear in the data.
phi	list	List of parameters where the ith element is a dict of parameters for the ith cluster.
n_clusters	int>=1	The number of clusters found

Citation

Tahmid F Mehdi, Gurdeep Singh, Jennifer A Mitchell, Alan M Moses, Variational infinite heterogeneous mixture model for semi-supervised clustering of heart enhancers, Bioinformatics, Volume 35, Issue 18, 15 September 2019, Pages 3232�3239, https://doi.org/10.1093/bioinformatics/btz064

References:

[1] RM Neal. (2000). Markov Chain Sampling Methods for Dirichlet Process Mixture Models. Journal of Computational and Graphical Statistics, 9(2), 249-265.

[2] DM Blei & MI Jordan. (2006). Variational Inference for Dirichlet Process Mixtures. Bayesian Analysis, 1(1), 121-144.

[3] A Vlachos, A Korhonen & Z Ghahramani. (2009). Unsupervised and Constrained Dirichlet Process Mixture Models for Verb Clustering. Proceedings of the EACL 2009 Workshop on GEMS: GEometical Models of Natural Language Semantics, 74-82.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
data		data
dphmix		dphmix
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

dphmix

Unsupervised and Semi-supervised Dirichlet Process Heterogeneous Mixtures

Dependencies

Installation

Tutorial

Classes

VariationalDPHM

GibbsDPHM

MFVICluster

GibbsCluster

Citation

References:

About

Releases 2

Packages

Contributors 2

Languages

License

tahmidmehdi/dphmix

Folders and files

Latest commit

History

Repository files navigation

dphmix

Unsupervised and Semi-supervised Dirichlet Process Heterogeneous Mixtures

Dependencies

Installation

Tutorial

Classes

VariationalDPHM

GibbsDPHM

MFVICluster

GibbsCluster

Citation

References:

About

Resources

License

Stars

Watchers

Forks

Releases 2

Packages 0

Contributors 2

Languages

Packages