-
The following examples demonstrate how the selection of various parameters can significantly affect the resulting embeddings and their quality. The dataset used here is from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM5238385. As you can see, the clustering results differ substantially depending on the parameters chosen.
-
What would be your proposed alternative? It's a striking result.
-
Can I just select features like the following, which is just like what SnapATAC did?
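For example, something like this (a rough sketch of SnapATAC's log-coverage filter; the 1.65 z-score cutoff and the in-memory AnnData `adata` are illustrative assumptions):

```python
# Illustrative SnapATAC-style bin filtering: keep bins whose log10 coverage
# falls inside a z-score band, dropping the most and least covered extremes.
import numpy as np

coverage = np.asarray((adata.X > 0).sum(axis=0)).ravel()  # cells covering each bin
logcov = np.log10(coverage + 1)
z = (logcov - logcov.mean()) / logcov.std()
adata.var["selected"] = (z > -1.65) & (z < 1.65)
```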
-
@kaizhang I imagine that over-clustering with a high resolution such as 4.0, and then applying cluster-merging algorithms like Cytocipher or CHOIR (adapted for scATAC-seq) that merge clusters which are not significantly different, could enhance the results here; see the sketch below. This would be an alternative to simply selecting a resolution of 1.0.
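A rough sketch of that idea, assuming an in-memory AnnData with a spectral embedding already computed; the correlation-based merge below is only a stand-in for the significance tests used by Cytocipher/CHOIR, and the 0.95 cutoff is arbitrary:

```python
import numpy as np
import snapatac2 as snap

# Deliberately over-cluster at high resolution.
snap.pp.knn(adata)
snap.tl.leiden(adata, resolution=4.0, key_added="overclustered")

labels = adata.obs["overclustered"].astype(str).to_numpy()
clusters = sorted(set(labels))

# Pseudobulk accessibility profile per cluster (summed counts across cells).
profiles = np.vstack([
    np.asarray(adata.X[labels == c].sum(axis=0)).ravel() for c in clusters
])

# Merge clusters whose pseudobulk profiles are nearly identical (union-find).
parent = list(range(len(clusters)))
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

corr = np.corrcoef(np.log1p(profiles))
for i in range(len(clusters)):
    for j in range(i + 1, len(clusters)):
        if corr[i, j] > 0.95:  # proxy for "not significantly different"
            parent[find(i)] = find(j)

merged = {c: str(find(i)) for i, c in enumerate(clusters)}
adata.obs["merged_clusters"] = [merged[l] for l in labels]
```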
-
@kaizhang I just read the note here and am a bit confused. Am I doing something wrong? Is there a different way to reduce the number of accessible regions for downstream processing?
-
Feature selection plays a critical role in dimension reduction analysis. Unfortunately, there is no consensus on the best feature selection method in scATAC-seq analysis. As the scATAC-seq count matrix is sparse, computing the variability of features is difficult, so dispersion-based methods used in scRNA-seq cannot be applied here.
A simple strategy is to select features based on their total accessibility across all cells: the top N most accessible features are kept for dimension reduction. The choice of N has a large effect on the quality and nature of the resulting embeddings, because it determines the level of detail and specificity captured in the representation. It is therefore worth evaluating several values of N to ensure that the embeddings align with the intended objectives and provide valuable insights for the given application. As a rule of thumb, large datasets with more complex structures benefit from larger Ns, while small datasets with fewer cell types or less prominent cluster structures should use smaller Ns.
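A minimal sketch of this strategy, assuming an in-memory AnnData with a sparse cell-by-feature count matrix (N = 250,000 is just an example value):

```python
import numpy as np

# Total accessibility per feature, summed over all cells.
total = np.asarray(adata.X.sum(axis=0)).ravel()
n = 250_000  # example value of N; tune per dataset
selected = np.zeros(adata.shape[1], dtype=bool)
selected[np.argsort(total)[::-1][:n]] = True  # keep the N most accessible features
adata.var["selected"] = selected
```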
Another strategy is to use multiple rounds of feature selection, as implemented in ArchR. First, an initial feature set is selected using the strategy above. Dimension reduction and clustering are then performed on this feature set to obtain initial clusters. Cells are then grouped and aggregated by cluster label, and variable features are identified at the cluster level. One can continue this process for further rounds, or stop and use these variable features in downstream analysis.
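A rough sketch of the iterative procedure, using SnapATAC2 for the embedding and clustering steps; the pseudobulk aggregation and variance-based ranking are illustrative stand-ins rather than ArchR's exact method:

```python
import numpy as np
import snapatac2 as snap

# Round 1: start from the most accessible features.
snap.pp.select_features(adata, n_features=250_000)
snap.tl.spectral(adata)  # dimension reduction on adata.var["selected"]
snap.pp.knn(adata)
snap.tl.leiden(adata)

# Round 2: aggregate cells by cluster, then rank features by
# cross-cluster variability.
labels = adata.obs["leiden"].to_numpy()
pseudobulk = np.vstack([
    np.asarray(adata.X[labels == c].sum(axis=0)).ravel()
    for c in np.unique(labels)
])
cpm = pseudobulk / pseudobulk.sum(axis=1, keepdims=True) * 1e6  # depth-normalize
rank = np.argsort(np.log1p(cpm).var(axis=0))[::-1]

selected = np.zeros(adata.shape[1], dtype=bool)
selected[rank[:250_000]] = True
adata.var["selected"] = selected
snap.tl.spectral(adata)  # redo the embedding on the refined feature set
```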
Both strategies have been implemented in SnapATAC2's `pp.select_features` function. Users are advised to experiment with the `n_features` parameter on different datasets and visualize the differences. Iterative feature selection can be turned on by setting `max_iter >= 2`.
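For reference, the calls look like this (the `n_features` value is only an example):

```python
import snapatac2 as snap

# Single-round selection of the most accessible features.
snap.pp.select_features(adata, n_features=250_000)

# Iterative (ArchR-style) selection; see the caveat below.
snap.pp.select_features(adata, n_features=250_000, max_iter=2)
```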
However, I'm a bit skeptical about the iterative method, as it is likely to propagate clustering errors or noise into subsequent rounds of feature selection and produce artificial clusters (despite being visually pleasing). In a nutshell, iterative feature selection uses some arbitrary parameters to obtain clusters, and then uses that information to select features (i.e., train the model) so that those cluster structures become more prominent and visually pleasing.

This issue is meant to collect feedback on different algorithms and suggestions for improvements. I hope the discussion here will lead to a more sophisticated feature selection method in the future :-)