Commit 6b81272

committed
update database preparation
1 parent c791048 commit 6b81272

1 file changed

Lines changed: 16 additions & 12 deletions

docs/1_pipeline_setup/2_database.md

@@ -13,6 +13,13 @@ For this pipeline, each reference genome assembly has its own dedicated database
1313

1414
![Picture](../figures/database_preparation.png)
1515

16+
Below, we will use hg38 as an example to go through this process step by step. To start, change to the folder of your conda environment:
17+
18+
``` bash
19+
# change to your conda env folder; adjust the path accordingly
20+
cd /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025
21+
```
22+
1623
1. **Data collection**
1724

1825
There are **FOUR** files required for database preparation. Three of them can be directly downloaded from online resources:
@@ -24,9 +31,6 @@ For this pipeline, each reference genome assembly has its own dedicated database
2431
- ***<u>genome.fa</u>***: Genome sequence file in [FASTA](https://www.ncbi.nlm.nih.gov/genbank/fastaformat/) format
2532

2633
```bash
27-
# locate to your conda env, change the path accordingly
28-
cd /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025
29-
3034
# create and change to the database folder
3135
mkdir -p pipeline/databases/hg38/gencode.release48 # for annotation release 48 for hg38
3236
cd pipeline/databases/hg38/gencode.release48
@@ -71,7 +75,7 @@ For this pipeline, each reference genome assembly has its own dedicated database
7175
- **`annotation.gene2transcript.txt`** & **`annotation.transcript2gene.txt`**: These files provide mappings between transcripts and genes, which are necessary for gene-level quantification.
7276
- **`annotation.geneAnnotation.txt`** & **`annotation.transcriptAnnotation.txt`**: These files contain detailed annotations for genes and transcripts, and are used in generating the final gene expression matrix.
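
To illustrate what a transcript-to-gene mapping looks like, here is a minimal sketch that pulls `transcript_id` and `gene_id` pairs out of GTF records with `awk`. It is not the implementation of `parseAnnotation.pl`; the function name `gtf_transcript2gene` and the two-column, tab-separated output are assumptions made for illustration only, and the pipeline's actual file layout may differ.

```bash
# Hypothetical helper (not part of the pipeline): read a GTF stream on stdin and
# print "transcript_id<TAB>gene_id" for every record whose feature type is "transcript".
gtf_transcript2gene() {
  awk -F'\t' '
    $0 !~ /^#/ && $3 == "transcript" {
      gene = ""; tx = ""
      # Column 9 holds attributes as: key "value"; key "value"; ...
      # Assumes attribute values do not themselves contain ";".
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++) {
        a = attrs[i]
        sub(/^ +/, "", a)
        if (a ~ /^gene_id "/) { gsub(/^gene_id "|"$/, "", a); gene = a }
        else if (a ~ /^transcript_id "/) { gsub(/^transcript_id "|"$/, "", a); tx = a }
      }
      if (gene != "" && tx != "") print tx "\t" gene
    }'
}

# Toy input with made-up IDs: one gene record (skipped) and one transcript record.
printf 'chr1\ttoy\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_name "toy";\nchr1\ttoy\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_name "toy";\n' \
  | gtf_transcript2gene
```

On a real annotation, the same function could be fed the GTF file via `gtf_transcript2gene < annotation.gtf`, assuming a standard nine-column GTF.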
7377

74-
To simplify this process, we have provided a script, **`parseAnnotation.pl`**, which allows you to easily generate all four filest:
78+
To simplify this process, we have provided a script, [**`parseAnnotation.pl`**](https://github.com/jyyulab/bulkRNAseq_quantification_pipeline/blob/main/scripts/setup/parseAnnotation.pl), which allows you to easily generate all four files:
7579

7680
``` bash
7781
## parse the gene annotation file
@@ -82,28 +86,28 @@ For this pipeline, each reference genome assembly has its own dedicated database
8286

8387
3. **Creating gene body bins**
8488

85-
In this step, we will create a bin list for the longest transcript of each gene, with 100 bins per transcript by default. This list is required in genebody coverage statistics - **an important QC metrics that indicates the extent of RNA degradation**.
89+
In this step, we will create a bin list for the longest transcript of each gene, with 100 bins per transcript by default. This list is required for gene body coverage analysis - **a key quality control metric that indicates the extent of RNA degradation**.
8690

8791

8892

8993
Two files will be generated:
9094

91-
- ***<u>./bulkRNAseq/genebodyBins/genebodyBins_all.txt</u>***: bins of the longest transcripts of ***all genes***. This one is the most reliable solution since it calculates the gene body coverage across all genes (N = 46,402 for human). However, it's much slower.
95+
- ***<u>./bulkRNAseq/genebodyBins/genebodyBins_all.txt</u>***: This file contains bins of the longest transcripts of ***all genes*** (N = 46,402 for human). This approach provides the most comprehensive gene body coverage assessment but is computationally slower.
9296

93-
- ***<u>./bulkRNAseq/genebodyBins/genebodyBins_housekeeping.txt</u>***: bins of the longest transcripts of ***precurated housekeeping genes***. Though it only considers the housekeeping genes (N = 3,515 in human), based on our tests across 30+ datasets, no significant difference of gene coverate statistics was observed compared to the all-transcript version. And it's way faster. This is widely-used in many pipelines, including the [RseQC](https://rseqc.sourceforge.net/#genebody-coverage-py). ***<u>So, we set it as the default in this pipeline</u>***.
97+
- ***<u>./bulkRNAseq/genebodyBins/genebodyBins_housekeeping.txt</u>***: This file contains bins of the longest transcripts of ***pre-curated housekeeping genes*** (N = 3,515 for human). Based on our tests across 30+ datasets, gene body coverage statistics from this file are comparable to those from the all-genes version, but it is significantly faster to compute. This approach is widely used in many pipelines, including [RSeQC](https://rseqc.sourceforge.net/#genebody-coverage-py). ***<u>Therefore, we set this as the default option in our pipeline</u>***.
9498

95-
You can easily generate these two files with the command below:
99+
We have also provided a script, [**`createBins.pl`**](https://github.com/jyyulab/bulkRNAseq_quantification_pipeline/blob/main/scripts/setup/createBins.pl), to generate these two files:
96100

97101
``` bash
98102
## create the gene body bins
99103
## This command will generate the two files containing the bin list of the longest transcript of all genes and housekeeping genes.
100-
## Three arguments are needed: transcriptome sequence file in FASTA format, a txt file containiing housekeeping genes in the first column, and a directory to save the output files.
104+
## Three arguments are needed: 1) transcriptome sequence file in FASTA format; 2) a txt file containing housekeeping genes in the first column; and 3) a directory to save the output files.
101105
perl /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/scripts/setup/createBins.pl transcripts.fa housekeepingGenes_human.txt ./bulkRNAseq/genebodyBins
102106
```
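
As a rough illustration of the binning idea described above, the sketch below splits a transcript of a given length into a fixed number of near-equal, contiguous bins (100 by default). It is only a sketch under stated assumptions: the function name `make_bins`, the 1-based inclusive coordinates, and the three-column output are hypothetical and are not taken from `createBins.pl`, whose actual output format may differ.

```bash
# Hypothetical helper (not part of the pipeline): print "bin<TAB>start<TAB>end"
# for a transcript of length $1 split into $2 bins (default: 100).
# Coordinates are 1-based and inclusive; bins are contiguous and cover the whole transcript.
make_bins() {
  awk -v len="$1" -v n="${2:-100}" 'BEGIN {
    # A transcript shorter than the bin count cannot give every bin at least one base.
    if (len < n) exit 1
    for (i = 1; i <= n; i++) {
      start = int((i - 1) * len / n) + 1
      end   = int(i * len / n)
      print i "\t" start "\t" end
    }
  }'
}

# Example: a 1,000-nt transcript yields 100 bins of 10 nt each.
make_bins 1000 | sed -n '1p;100p'
```

In this sketch, transcripts shorter than the number of bins are rejected with a non-zero exit status rather than producing empty bins; how the pipeline treats such transcripts is not stated here.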
103107

104108
4. **Create genome index files for RSEM**
105109

106-
The command below to build references is from [RSEM's tutorial](https://github.com/bli25/RSEM_tutorial?tab=readme-ov-file#-build-references):
110+
The following command for building references is adapted from [RSEM's tutorial](https://github.com/bli25/RSEM_tutorial?tab=readme-ov-file#-build-references):
107111
108112
``` bash
109113
#BSUB -P buildIndex
@@ -137,7 +141,7 @@ For this pipeline, each reference genome assembly has its own dedicated database
137141
138142
5. **Create genome index files for Salmon**
139143
140-
The command below to build references is from [this tutorial](https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/):
144+
The command below to build references for Salmon is from [this tutorial](https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/):
141145
142146
``` bash
143147
#BSUB -P salmonIndex
@@ -165,7 +169,7 @@ For this pipeline, each reference genome assembly has its own dedicated database
165169
166170
6. **Create genome index files for STAR**
167171
168-
The command below to build references is from [this tutorial](https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf#page=5.42):
172+
The command below to build references for STAR is from [its manual](https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf#page=5.42):
169173
170174
``` bash
171175
#BSUB -P STAR_Index
