-
The following examples demonstrate how the selection of various parameters can significantly affect the resulting embeddings and their quality. The dataset used here is from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM5238385. As you can see, the clustering results differ substantially depending on the parameters chosen.
-
What would be your proposed alternative? It's a striking result.
-
Can I just select features like the following, which is just like what SnapATAC did?
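For example, something like this (a rough sketch of SnapATAC's log-coverage filter; the 1.65 z-score cutoff and the in-memory AnnData `adata` are illustrative assumptions):

```python
# Illustrative SnapATAC-style bin filtering: keep bins whose log10 coverage
# falls inside a z-score band, dropping the most and least covered extremes.
import numpy as np

coverage = np.asarray((adata.X > 0).sum(axis=0)).ravel()  # cells covering each bin
logcov = np.log10(coverage + 1)
z = (logcov - logcov.mean()) / logcov.std()
adata.var["selected"] = (z > -1.65) & (z < 1.65)
```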
-
@kaizhang I imagine that over-clustering with a high resolution such as 4.0, and then applying cluster-merging algorithms like Cytocipher or CHOIR (adapted for scATAC-seq) that merge clusters which are not significantly different, could enhance the results here; see the sketch below. This would be an alternative to simply selecting a resolution of 1.0.
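A rough sketch of that idea, assuming an in-memory AnnData with a spectral embedding already computed; the correlation-based merge below is only a stand-in for the significance tests used by Cytocipher/CHOIR, and the 0.95 cutoff is arbitrary:

```python
import numpy as np
import snapatac2 as snap

# Deliberately over-cluster at high resolution.
snap.pp.knn(adata)
snap.tl.leiden(adata, resolution=4.0, key_added="overclustered")

labels = adata.obs["overclustered"].astype(str).to_numpy()
clusters = sorted(set(labels))

# Pseudobulk accessibility profile per cluster (summed counts across cells).
profiles = np.vstack([
    np.asarray(adata.X[labels == c].sum(axis=0)).ravel() for c in clusters
])

# Merge clusters whose pseudobulk profiles are nearly identical (union-find).
parent = list(range(len(clusters)))
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

corr = np.corrcoef(np.log1p(profiles))
for i in range(len(clusters)):
    for j in range(i + 1, len(clusters)):
        if corr[i, j] > 0.95:  # proxy for "not significantly different"
            parent[find(i)] = find(j)

merged = {c: str(find(i)) for i, c in enumerate(clusters)}
adata.obs["merged_clusters"] = [merged[l] for l in labels]
```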
-
@kaizhang I just read the note here and am a bit confused. Am I doing something wrong? Is there a different way to reduce the number of accessible regions for downstream processing?
-
Feature selection plays a critical role in dimension reduction analysis. Unfortunately, there is no consensus on the best feature selection method in scATAC-seq analysis. As the scATAC-seq count matrix is sparse, computing the variability of features is difficult, so dispersion-based methods used in scRNA-seq cannot be applied here.
A simple strategy is to select features based on their total accessibility across all cells: the top N most accessible features are kept for dimension reduction. The choice of N has a large effect on the quality and nature of the resulting embeddings, because it determines the level of detail and specificity captured in the representation. It is therefore worth evaluating several values of N to ensure that the embeddings align with the intended objectives and provide valuable insights for the given application. As a rule of thumb, large datasets with more complex structures benefit from larger Ns, while small datasets with fewer cell types or less prominent cluster structures should use smaller Ns.
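A minimal sketch of this strategy, assuming an in-memory AnnData with a sparse cell-by-feature count matrix (N = 250,000 is just an example value):

```python
import numpy as np

# Total accessibility per feature, summed over all cells.
total = np.asarray(adata.X.sum(axis=0)).ravel()
n = 250_000  # example value of N; tune per dataset
selected = np.zeros(adata.shape[1], dtype=bool)
selected[np.argsort(total)[::-1][:n]] = True  # keep the N most accessible features
adata.var["selected"] = selected
```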
Another strategy is to use multiple rounds of feature selection, as implemented in ArchR. First, an initial feature set is selected using the strategy above. Dimension reduction and clustering are then performed on this feature set to obtain initial clusters. Cells are then grouped and aggregated by cluster label, and variable features are identified at the cluster level. One can continue this process for further rounds, or stop and use these variable features in downstream analysis.
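A rough sketch of the iterative procedure, using SnapATAC2 for the embedding and clustering steps; the pseudobulk aggregation and variance-based ranking are illustrative stand-ins rather than ArchR's exact method:

```python
import numpy as np
import snapatac2 as snap

# Round 1: start from the most accessible features.
snap.pp.select_features(adata, n_features=250_000)
snap.tl.spectral(adata)  # dimension reduction on adata.var["selected"]
snap.pp.knn(adata)
snap.tl.leiden(adata)

# Round 2: aggregate cells by cluster, then rank features by
# cross-cluster variability.
labels = adata.obs["leiden"].to_numpy()
pseudobulk = np.vstack([
    np.asarray(adata.X[labels == c].sum(axis=0)).ravel()
    for c in np.unique(labels)
])
cpm = pseudobulk / pseudobulk.sum(axis=1, keepdims=True) * 1e6  # depth-normalize
rank = np.argsort(np.log1p(cpm).var(axis=0))[::-1]

selected = np.zeros(adata.shape[1], dtype=bool)
selected[rank[:250_000]] = True
adata.var["selected"] = selected
snap.tl.spectral(adata)  # redo the embedding on the refined feature set
```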
Both strategies have been implemented in SnapATAC2's `pp.select_features` function. Users are advised to experiment with the `n_features` parameter on different datasets and visualize the differences. Iterative feature selection can be turned on by setting `max_iter >= 2`.
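For reference, the calls look like this (the `n_features` value is only an example):

```python
import snapatac2 as snap

# Single-round selection of the most accessible features.
snap.pp.select_features(adata, n_features=250_000)

# Iterative (ArchR-style) selection; see the caveat below.
snap.pp.select_features(adata, n_features=250_000, max_iter=2)
```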
However, I'm a bit skeptical about the iterative method, as it is likely to propagate clustering errors or noise into subsequent rounds of feature selection and produce artificial clusters (despite being visually pleasing). In a nutshell, iterative feature selection uses some arbitrary parameters to obtain clusters, and then uses that information to select features (i.e., train the model) so that those cluster structures become more prominent and visually pleasing.

This issue is meant to collect feedback on different algorithms and suggestions for improvements. I hope the discussion here will lead to a more sophisticated feature selection method in the future :-)