docs/1_pipeline_setup/2_database.md (16 additions & 12 deletions)
@@ -13,6 +13,13 @@ For this pipeline, each reference genome assembly has its own dedicated database
 
 
 
+Below, we will use hg38 as an example to go through this process step-by-step. To start, navigate to the folder of your conda environment:
+
+```bash
+# locate to your conda env, change the path accordingly
+cd /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025
+```
+
 1. **Data collection**
 
 There are **FOUR** files required for database preparation. Three of them can be directly downloaded from online resources:
@@ -24,9 +31,6 @@ For this pipeline, each reference genome assembly has its own dedicated database
 - ***<u>genome.fa</u>***: Genome sequence file in [FASTA](https://www.ncbi.nlm.nih.gov/genbank/fastaformat/) format
 
 ```bash
-# locate to your conda env, change the path accordingly
-cd /research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025
-
 # create and change to the database folder
 mkdir -p pipeline/databases/hg38/gencode.release48 # for annotation release 48 for hg38
 cd pipeline/databases/hg38/gencode.release48
@@ -71,7 +75,7 @@ For this pipeline, each reference genome assembly has its own dedicated database
 - **`annotation.gene2transcript.txt`** & **`annotation.transcript2gene.txt`**: These files provide mappings between transcripts and genes, which are necessary for gene-level quantification.
 - **`annotation.geneAnnotation.txt`** & **`annotation.transcriptAnnotation.txt`**: These files contain detailed annotations for genes and transcripts, and are used in generating the final gene expression matrix.
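As a rough illustration of what the transcript-to-gene mapping described above involves, the sketch below pulls `transcript_id`/`gene_id` pairs out of a GENCODE-style GTF with `awk`. It is not the implementation of `parseAnnotation.pl`: the toy records, the `toy_db` folder, and the two-column output layout are assumptions made for this example, and the files written by the pipeline script may be laid out differently.

```shell
#!/bin/sh
set -eu

# Hypothetical working folder and toy GTF; the IDs are made up for illustration.
mkdir -p toy_db
{
  printf 'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "A";\n'
  printf 'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
  printf 'chr1\tHAVANA\ttranscript\t150\t800\t.\t+\t.\tgene_id "G1"; transcript_id "T2";\n'
  printf 'chr2\tHAVANA\ttranscript\t10\t500\t.\t-\t.\tgene_id "G2"; transcript_id "T3";\n'
} > toy_db/toy.gtf

# Keep only "transcript" records and extract the two IDs from the attribute column.
awk -F'\t' '$3 == "transcript" {
    match($9, /gene_id "[^"]+"/)
    gene = substr($9, RSTART + 9, RLENGTH - 10)    # strip the key and both quotes
    match($9, /transcript_id "[^"]+"/)
    tx = substr($9, RSTART + 15, RLENGTH - 16)
    print tx "\t" gene
}' toy_db/toy.gtf > toy_db/transcript2gene.txt

# The reverse mapping is the same table with the columns swapped (assumed layout).
awk -F'\t' '{ print $2 "\t" $1 }' toy_db/transcript2gene.txt | sort > toy_db/gene2transcript.txt

cat toy_db/transcript2gene.txt
```

On a real annotation, the same `awk` command could be pointed at the downloaded GTF file instead of the toy one.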
 
-To simplify this process, we have provided a script, **`parseAnnotation.pl`**, which allows you to easily generate all four files:
+To simplify this process, we have provided a script, [**`parseAnnotation.pl`**](https://github.com/jyyulab/bulkRNAseq_quantification_pipeline/blob/main/scripts/setup/parseAnnotation.pl), which allows you to easily generate all four files:
 
 ``` bash
 ## parse the gene annotation file
@@ -82,28 +86,28 @@ For this pipeline, each reference genome assembly has its own dedicated database
 
 3. **Creating gene body bins**
 
-In this step, we will create a bin list for the longest transcript of each gene, with 100 bins per transcript by default. This list is required in genebody coverage statistics - **an important QC metrics that indicates the extent of RNA degradation**.
+In this step, we will create a bin list for the longest transcript of each gene, with 100 bins per transcript by default. This list is required for gene body coverage analysis - **a key quality control metric that indicates the extent of RNA degradation**.
 
 
 
 Two files will be generated:
 
-- ***<u>./bulkRNAseq/genebodyBins/genebodyBins_all.txt</u>***: bins of the longest transcripts of ***all genes***. This one is the most reliable solution since it calculates the gene body coverage across all genes (N = 46,402 for human). However, it's much slower.
+- ***<u>./bulkRNAseq/genebodyBins/genebodyBins_all.txt</u>***: This file contains bins of the longest transcripts of ***all genes*** (N = 46,402 for human). This approach provides the most comprehensive gene body coverage assessment but is computationally slower.
 
-- ***<u>./bulkRNAseq/genebodyBins/genebodyBins_housekeeping.txt</u>***: bins of the longest transcripts of ***precurated housekeeping genes***. Though it only considers the housekeeping genes (N = 3,515 in human), based on our tests across 30+ datasets, no significant difference of gene coverage statistics was observed compared to the all-transcript version. And it's way faster. This is widely-used in many pipelines, including the [RSeQC](https://rseqc.sourceforge.net/#genebody-coverage-py). ***<u>So, we set it as the default in this pipeline</u>***.
+- ***<u>./bulkRNAseq/genebodyBins/genebodyBins_housekeeping.txt</u>***: This file contains bins of the longest transcripts of ***pre-curated housekeeping genes*** (N = 3,515 for human). Based on our tests across 30+ datasets, gene body coverage statistics from this file are comparable to those from the all-genes version, but it is significantly faster. This method is widely used in many pipelines, including [RSeQC](https://rseqc.sourceforge.net/#genebody-coverage-py). ***<u>Therefore, we set this as the default option in our pipeline</u>***.
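To make the binning idea above concrete, here is a minimal sketch that picks the longest transcript per gene and splits it into equal-width bins in transcript coordinates. It is an illustration only, not the logic of the pipeline's own script: the input table (gene ID, transcript ID, transcript length), the `toy_bins` folder, the output columns, and the rule of skipping transcripts shorter than the number of bins are all assumptions made for this example.

```shell
#!/bin/sh
set -eu

nbins=100   # bins per transcript, matching the default mentioned above

# Hypothetical input: gene ID, transcript ID, transcript length (tab-separated).
mkdir -p toy_bins
printf 'G1\tT1\t600\nG1\tT2\t1000\nG2\tT3\t250\n' > toy_bins/transcript_lengths.txt

awk -F'\t' -v n="$nbins" '
    # Remember the longest transcript seen so far for each gene.
    !($1 in len) || $3 + 0 > len[$1] { len[$1] = $3 + 0; tx[$1] = $2 }
    END {
        for (g in len) {
            if (len[g] < n) continue   # assumed rule: skip transcripts too short for n bins
            for (i = 1; i <= n; i++) {
                start = int((i - 1) * len[g] / n) + 1   # 1-based, inclusive
                stop  = int(i * len[g] / n)
                print g "\t" tx[g] "\t" i "\t" start "\t" stop
            }
        }
    }' toy_bins/transcript_lengths.txt | sort -k1,1 -k3,3n > toy_bins/genebodyBins_toy.txt

# Show the first few bins of the first gene.
head -n 3 toy_bins/genebodyBins_toy.txt
```

Because consecutive bins share a boundary computed from the same expression, the bins tile the transcript without gaps or overlaps even when its length is not a multiple of the bin count.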
 
-You can easily generate these two files with the command below:
+We have also provided a script, [**`createBins.pl`**](https://github.com/jyyulab/bulkRNAseq_quantification_pipeline/blob/main/scripts/setup/createBins.pl), to generate these two files:
 
 ``` bash
 ## create the gene body bins
 ## This command will generate the two files containing the bin list of the longest transcript of all genes and housekeeping genes.
-## Three arguments are needed: transcriptome sequence file in FASTA format, a txt file containing housekeeping genes in the first column, and a directory to save the output files.
+## Three arguments are needed: 1) transcriptome sequence file in FASTA format; 2) a txt file containing housekeeping genes in the first column; and 3) a directory to save the output files.
-The command below to build references is from [RSEM's tutorial](https://github.com/bli25/RSEM_tutorial?tab=readme-ov-file#-build-references):
+The following command for building references is adapted from [RSEM's tutorial](https://github.com/bli25/RSEM_tutorial?tab=readme-ov-file#-build-references):
 
 ``` bash
 #BSUB -P buildIndex
@@ -137,7 +141,7 @@ For this pipeline, each reference genome assembly has its own dedicated database
 
 5. **Create genome index files for Salmon**
 
-The command below to build references is from [this tutorial](https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/):
+The command below to build references for Salmon is from [this tutorial](https://combine-lab.github.io/alevin-tutorial/2019/selective-alignment/):
 
 ``` bash
 #BSUB -P salmonIndex
@@ -165,7 +169,7 @@ For this pipeline, each reference genome assembly has its own dedicated database
 
 6. **Create genome index files for STAR**
 
-The command below to build references is from [this tutorial](https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf#page=5.42):
+The command below to build references for STAR is from [its manual](https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf#page=5.42):