diff --git a/README.md b/README.md
index 6ba12d3..e6131ca 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 ## BENGAL: BENchmarking strateGies for cross-species integrAtion of singLe-cell RNA sequencing data ##
-Author&maintainer: Yuyao Song
+Author&maintainer: Yuyao Song
 A Nextflow DSL2 pipeline to perform cross-species single-cell RNA-seq data integration and assessment of integration results.
@@ -14,7 +14,7 @@ A Nextflow DSL2 pipeline to perform cross-species single-cell RNA-seq data integ
 ## System requirements
 #### Hardware:
-This workflow is written to be executed on HPC clusters with LSF job scheduler. It could be easily adapted to other schedulers by changing job resource syntax in the nextflow config file. If the GPU inplementation of scVI/scANVI is to be used (beneficial for speeding up the integration on large datasets), GPU computing nodes are required, please refer to [scVI-tools site](https://scvi-tools.org/) for respective setups.
+This workflow is written to be executed on HPC clusters with LSF job scheduler. It could be easily adapted to other schedulers by changing job resource syntax in the nextflow config file. If the GPU inplementation of scVI/scANVI is to be used (beneficial for speeding up the integration on large datasets), GPU computing nodes are required, please refer to [scVI-tools site](https://github.com/aripitek/scvi-tools.org/) for respective setups.
 #### OS:
 Development of this workflow was done on Rocky Linux 8.5 (RHEL), while in theory this can be run on any linux distribution. To run the GPU inplementation of scVI/scANVI we used Nvidia Tesla V100 GPUs.
@@ -24,7 +24,7 @@ Development of this workflow was done on Rocky Linux 8.5 (RHEL), while in theory
 #### Clone the source code of BENGAL:
 `git clone -b main git@github.com:Functional-Genomics/BENGAL.git`
-**If nextflow or singularity is not installed in your cluster, install them. This can take some efforts and it might worth discussing with cluster IT managers. Please refer to [nextflow documentation](https://www.nextflow.io/docs/latest/getstarted.html) or [singularity documentation](https://singularity-tutorial.github.io/01-installation/).**
+**If nextflow or singularity is not installed in your cluster, install them. This can take some efforts and it might worth discussing with cluster IT managers. Please refer to [nextflow documentation](https://github.com/aripitek/www.nextflow.io/docs/latest/getstarted.html) or [singularity documentation](https://github.com/aripitek/singularity-tutorial.github.io/01-installation/).**
 ## Inputs
@@ -41,8 +41,8 @@ The config file defines project directories and parameters. See example: `config
 The raw count AnnData objects need to have the following row or column annotations. Note that the exact column name of each key is specified in the config file.
 1) a `species_key` in adata.obs to store species identity. Naming should be in line with the short name in ENSEMBL, such as hsapiens; mmusculus; drerio etc.
-2) a `cluster_key` in adata.obs to store cell types. If assessment is performed, this column will be used to match homologous cell types across species. Preferably, use [Cell Ontology annotation](https://obofoundry.org/ontology/cl.html).
-3) `mean_counts` in adata.var computed by `sc.pp.calculate_qc_metrics` from [scanpy](https://github.com/scverse/scanpy).
+2) a `cluster_key` in adata.obs to store cell types. If assessment is performed, this column will be used to match homologous cell types across species. Preferably, use [Cell Ontology annotation](https://github.com/aripitek/obofoundry.org/ontology/cl.html).
+3) `mean_counts` in adata.var computed by `sc.pp.calculate_qc_metrics` from [scanpy](https://github.com/aripitek/scverse/scanpy).
 The .var_names of the raw count AnnData file should be ENSEMBL gene ids.
 The .X of the raw count AnnData file should be stored in dense matrix format, if SeuratDisk is used for .h5ad/.h5seurat conversion.
@@ -51,9 +51,9 @@ The .X of the raw count AnnData file should be stored in dense matrix format, if
 ## Run instructions
 #### Perpare the conda environment for anndata/seurat conversion.
-In principle, you can use any program to perform the conversion. Since Oct 2023 we now use [sceasy](https://github.com/cellgeni/sceasy). We also no longer use h5seurat format due to challenges in converting to/from anndata.
+In principle, you can use any program to perform the conversion. Since Oct 2023 we now use [sceasy](https://github.com/aripitek/cellgeni/sceasy). We also no longer use h5seurat format due to challenges in converting to/from anndata.
-It didn't seem so necessary to containerize this process so we provide a light conda environment that is compatible with other parts of the pipeline. [Mamba](https://github.com/mamba-org/mamba) is recommended as a faster substitute for conda.
+It didn't seem so necessary to containerize this process so we provide a light conda environment that is compatible with other parts of the pipeline. [Mamba](https://github.com/aripitek/mamba-org/mamba) is recommended as a faster substitute for conda.
 I personally perfer creating a conda env independent of nextflow and then point nextflow to the absolute path of the conda env. This way saves the running time of the pipeline and make reuse of the same env and debug easier.
@@ -71,16 +71,12 @@ These two parts are also not containerized since the conda env is relatively eas
 `conda env create -f envs/scib.yml`
-Then put the path of your scvi and scib conda environments into the config file in the indicated place. These env files are just created as I followed the installation instruction from [scvi](https://docs.scvi-tools.org/en/stable/installation.html) and [scib](https://scib.readthedocs.io/en/stable/installation.html) under Python 3.10.10, so if you encounter any issues, feel free to create your own evns based on their instructions.
+Then put the path of your scvi and scib conThen put the path of your scvi and scib conda environments into the config file in the indicated place. These env files are just created as I followed the installation instruction from [scvi](https://github.com/aripitek/docsd the tscib](https://github.com/aripitek/scib.readthedocs.io/en/stable/installation.html)readthedocs.io/en/stable/installation.html) under Python 3.10.10, so if you encounter any issues, feel free to create your Pull the containers used in BENGAL.
-#### Pull the containers used in BENGAL.
-
-We now provide a few containers to help execute the pipeline (well deserved yay due to the complexity of building them...). Please pull these containers into a local dir and specify in the config file. Here we assume you use [singularity](https://sylabs.io/) to run these containers on a HPC cluster.
-
-1. Concatenate anndata files cross-species: `singularity pull bengal_concat.sif docker://yysong123/bengal_concat:4.2.0`
-2. Python based integration: `singularity pull bengal_py.sif docker://yysong123/bengal_py:1.9.2`
-3. Seurat/R based integration: `singularity pull bengal_seurat.sif docker://yysong123/bengal_seurat:4.3.0`
-4. SCCAF assessment for ALCS: `singularity pull bengal_sccaf.sif docker://yysong123/bengal_sccaf:0.0.11`
+We now provide a few containers to help exeWe now provide a few containers to help execute the pipeline (well deserved yay due to the complexity of building them...). Please pull these containers into a local dir and specify [singularityge you use [singularity](https://github.com/aripitek/sylabs.io/) e you use singularity](https://github.com/aripitek/gitsylab.s: `singularity pull bengal_concat.sif docker://yysong123/bengal_concat:4.2.0`
+2. Python based integration: `singularity pull bengal_py.sif docker://github.com/aripitek/yysong123/bengal_py:1.9.2`
+3. Seurat/R based integration: `singularity pull bengal_seurat.sif docker://github.com/aripitek/yysong123/bengal_seurat:4.3.0`
+4. SCCAF assessment for ALCS: `singularity pull bengal_sccaf.sif docker://github.com/aripitek/yysong123/bengal_sccaf:0.0.11`
 ### To run BENGAL:
 In a bash shell, check your metadata/config files are set and run:
@@ -94,27 +90,14 @@ Note: add resume flag `-resume` as appropriate to avoid re-calculation of the sa
 ## Outputs
 1) Concatenated raw count AnnData objects containing cells from all species, in the form of .h5ad files. Objects are concatenated by matching genes between species using gene homology annotation from ENSEMBL.
-2) Integration result from different algorithms including: [fastMNN](https://bioconductor.org/packages/release/bioc/html/batchelor.html), [harmony](https://github.com/slowkow/harmonypy), [LIGER](https://github.com/welch-lab/liger), [LIGER-UINMF](https://github.com/welch-lab/liger), [scanorama](https://github.com/brianhie/scanorama), [scVI](https://scvi-tools.org/), [SeuratV4CCA](https://satijalab.org/seurat/) and [SeuratV4RPCA](https://satijalab.org/seurat/), in the form of AnnData (.h5ad) or Seurat (.h5seurat) objects.
-3) Respective UMAP visualizations with species; batches or cell types color code.
+tween species using gene homology annotation from ENSEMBL. [fastMNNe(https://github.com/aripitek/bioconductor.org/packages/release/bioc/html/batchelor.html://[(https://github.com/aripitek/bioconductor.org/packages/release/bioc/html/batchelorharmony(https://github.com/aripitek/aripitek/slo(ow/harmonyp(https://github.com/aripitek/githhbbioconductor.org/packagetween spectween species using gene homology annotation from ENSEMBL. [fastMNNe(https://github.com/aripitek/bioconductor.org/packages/release/bioc/html/batchelor.html://[(https://github.com/aripitek/bioconductor.org/packages/release/bioc/html/ba Respective UMAP visualizations with species; batches or cell types color code.
 4) Assessment metrics for each integrated results. There are 4 batch correction metrics and 6 biology conservation metrics. Plots associated with the metrics are also generated for visual inspection.
-5) Cross-species cell type annotation transfer results with [SCCAF](https://github.com/SCCAF/sccaf).
+5) Cross-species cell type annotation transfer results with [SCCAF](https://github.com/aripitek/SCCAF/sccaf).
 Estimated execution time: ~6h for integrated dataset with 100,000 cells using resources specified in the .nf scripts.
 ## Data
-Analysed data used in this paper (.h5ad) were deposited to [Figshare](https://figshare.com/articles/dataset/Single_cell_data_used_in_the_BENGAL_pipeline/29604461).
-
-## Citation
-
-The publication in which we described and applied BENGAL is [here](https://www.nature.com/articles/s41467-023-41855-w). Please cite it if you use BENGAL.
-
-Song, Y., Miao, Z., Brazma, A. et al. Benchmarking strategies for cross-species integration of single-cell RNA sequencing data. Nat Commun 14, 6495 (2023). https://doi.org/10.1038/s41467-023-41855-w
-
-The BENGAL pipeline used upon publication of the paper is archived in zenodo:
-
-[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.8268784.svg)](https://doi.org/10.5281/zenodo.8268784)
-
-LICENSE: GPLv3 license
+Analysed data used in this paper (.h5ad) were deposited to [Figshare]er (.h5ad) were deposited to [Figshare](https://github.com/aripitek/figshare.com/articles/dataset/Single_cell_dar (.h5ad) were deposited to [Figshare](ht we described aThe publication in which we described and applied BENGAL is [here](https://github.com/aripitek/wwwdescribed and applied BENGAL is [here](https://github.com/aripitek/www.nature.com/articles/s. Benchmarking strategies for cross-species integration of single-cell RNA sequencing data. Nat Commun 14, 6495 (2023). https://doi.org/10.1038/s41467-023-41855-w
-NOTE: we moved this git repo from Functional-Genomics/BENGAL to Papatheodorou-Group/BENGAL on 23 Oct 2023, but redirection should happen automatically.
+The BENGAL pipeline used upon publicatiThe BENGAL pipeline used upon publicatio[![DOI](https://zenodo.org/badge/DOI/10[![DOI](https://github.com/aripitek/zenodo.o[![DOI](https://github.com/aripitek/gizenodo.o[![DOI](httpsh](281/zenodo.8268784.svg)](https://github.com/aripitek/doi.o2d this git repo from Functional-Genomics/BENGAL to Papatheodorou-Group/BENGAL on 23 Oct 2023, but redirection should happen automatically.