Skip to content

Commit 56cdaaf

Browse files
committed
update full tutorial
1 parent 01f3abf commit 56cdaaf

5 files changed

Lines changed: 114 additions & 19 deletions

File tree

.DS_Store

0 Bytes
Binary file not shown.

docs/3_full_tutorial/3_summarization.md

Lines changed: 114 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -5,31 +5,126 @@ nav_order: 3
55
parent: Full Tutorial
66
---
77

8+
# III. Summarization
89

9-
## Quantification Summary
10+
---
11+
12+
## Why is this necessary?
13+
14+
After you generate the quantification results using Salmon, RSEM or other tools, you probably ask how reliable the quantification results are. This is saying, we need to assess the quality of the quantification analysis. And, if the quality control results look good, you probably want to generate a general-format file summarizing all samples involved in the quantification analysis, so that most of the tools for down-stream analysis (e.g. differential analysis, clustering analysis) can directly read it. So here come the two main objectives in this analysis:
15+
16+
- Perform a comprehensive quality assessment.
17+
18+
- Generate a universal gene expression matrix containing all samples
19+
20+
21+
22+
## 1. Gene body coverage statistics
23+
24+
A major concern for RNA-seq data quality is RNA degradation. RNA molecules are quite fragile, since RNases are everywhere. A exposure to the RNase for a couple of minutes can cause severe degradation of RNA modecules, especially the mRNA. So, we introduced the gene body coverage statistics to help us tell if the input samples are degraded or not.
25+
26+
**Gene body coverage** measures how evenly sequencing reads are distributed along the length of a gene's transcript, from the 5' end to the 3' end. **RNA degradation** typically starts from the ends of RNA molecules, particularly at the 5' end. When RNA is degraded, this results in a coverage bias, typically a noticeable "drop-off" at one or both ends of the transcript.
27+
28+
In this pipeline, we pre-bined the longest transcripts of housekeeping genes (default, all-gene version is also available) into 100 fragments of same length (`genebodyBins_housekeeping.txt`). Then we count the reads mapped to each of these fragments from the transcriptome alignments (`quant.transcript.sorted.bam`) using the command below:
29+
30+
```bash
31+
## 1. gene body coverage statistics
32+
# Housekeeping genes
33+
bedtools multicov \
34+
-bams /path-to-save-outputs/quantRSEM_STAR/quant.transcript.sorted.bam \
35+
-bed /path-to-database/bulkRNAseq/genebodyBins/genebodyBins_housekeeping.txt > /path-to-save-outputs/quantRSEM_STAR/genebodyCoverage.txt
36+
37+
# All genes
38+
bedtools multicov \
39+
-bams /path-to-save-outputs/quantRSEM_STAR/quant.transcript.sorted.bam \
40+
-bed /path-to-database/bulkRNAseq/genebodyBins/genebodyBins_all.txt > /path-to-save-outputs/quantRSEM_STAR/genebodyCoverage.txt
41+
```
42+
43+
The only output in this step is the `genebodyCoverage.txt` file. It contains the counts of reads mapped to each of the pre-generated bins of selected transcripts. This file will be used to generate the HTML quality control report.
44+
45+
## 2. Individual sample QC report
46+
47+
The QC report for individual samples (see below) summarizes key statistics and quality control metrics from the quantification analysis, including:
48+
49+
- **Alignment statistics**: Key statistics of transcriptome alignment.
50+
- **Quantification statistics**: Numbers of genes and transcripts identified by **Salmon** and **RSEM_STAR**, as well as their overlaps.
51+
- **Biotype distribution**: Compositon of gene types at both gene and transcript levels.
52+
- **Quantification accuracy**: Correlation of abundance estimates by **Salmon** and **RSEM_STAR** at both gene and transcript levels.
53+
- **Genebody coverage statistics**: **Visualization** and statistics of gene body coverage, including **Mean of Coverage**, **Coefficient of Skewness**.
54+
55+
![image-20230901163554962](../figures/qc_report_individual.png)
56+
57+
To generate this report, we created a R markdown script, **`summarizationIndividual.Rmd`**, which collects information from the files list below:
58+
59+
* `/path-to-save-outputs/preProcessing/adapterTrimming.json`: for **Alignment statistics**.
60+
* `/path-to-save-outputs/quantRSEM/quant.stat/quant.cnt`: for **Quantification statistics**.
61+
* `/path-to-save-outputs/quantRSEM/genebodyCoverage.txt`: for **Genebody coverage statistics**.
62+
* `/path-to-save-outputs/quantRSEM/quant.isoforms.results`: for **Biotype distribution** and **Quantification accuracy** at the transcript-level.
63+
* `/path-to-save-outputs/quantRSEM/quant.genes.results`: for **Biotype distribution** and **Quantification accuracy** at the gene-level.
64+
* `/path-to-save-outputs/quantSalmon/quant.sf`: for **Quantification accuracy** at the transcript-level.
65+
* `/path-to-save-outputs/quantSalmon/quant.genes.sf`: for **Quantification accuracy** at the gene-level.
1066

11-
For each single sample, we will generate a QC report from the quantification results by Salmon and RSEM. There are five QC metrics in each report:
67+
You can generate this report using the following commands:
1268

13-
* Alignment statistics: showing read counts and mapping rates et. al.;
14-
* Quantification statistics: showing the number of genes/transcripts identified by two methods and the correlations of quantification results of them;
15-
* Biotype distribution: showiing the composition of types of identified transcripts and genes;
16-
* Quantification accuracy: correlations of gene expression measurements by Salmon and RSEM;
17-
* Genebody coverage statistics: showing if the RNA samples were degraded or not.
69+
``` bash
70+
## 2. generate QC reports for individual samples
71+
# For one sample
72+
Rscript -e "rmarkdown::render(input = '/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/scripts/run/summarizationIndividual.Rmd', clean = TRUE, quiet = F, output_format = 'html_document', output_file = 'summarization.html', output_dir = '/path-to-save-ouputs/sampleID/summarization', params = list(sampleName = 'sample1', dir_quant = '/path-to-save-ouputs/sampleID', dir_anno = './bulkRNAseq_2025/pipeline/databases/hg38/gencode.release48'))"
1873

19-
![image-20230901163554962](/Users/qpan/Library/Application Support/typora-user-images/image-20230901163554962.png)
74+
# For multiple samples involved in sampleTable.txt
75+
/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/scripts/run/summarizationIndividual.pl sampleTable.txt
76+
```
77+
78+
This command will:
79+
80+
- Generate the script: **`/path-to-save-outputs/sampleID/summarization/summarization.sh`** and submit it to HPC queues.
81+
82+
Typically, this step takes **~5 mins** to complete (for 150M PE-100 reads). The stardard outputs include:
83+
84+
- **`summarization.html`**: an HTML-format file containg the quality control metrics above (e.g., [example for sample1](https://github.com/jyyulab/bulkRNAseq_quantification_pipeline/blob/main/testdata/summarization_individual.html)).
85+
86+
- **`quant.genes.txt`**: **gene-level** quantification resluts by both **Salmon** and **RSEM_STAR**.
87+
88+
- **`quant.transcripts.txt`**: **transcript-level** quantification resluts by both **Salmon** and **RSEM_STAR**.
89+
90+
- Some other files/folders
91+
92+
## 3. Combined QC report
93+
94+
The QC report for multiple samples **differs slightly** from the single-sample version and includes the following:
2095

21-
```sh
22-
Rscript -e "rmarkdown::render(input = '/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2023/bin/summaryIndividual.Rmd', clean = TRUE, quiet = F, output_format = 'html_document', output_file = 'quantSummary.html', output_dir = '/your_path/quantSummary', params = list(sampleName = 'sample1', quant_dir = '/your_path'))"
96+
- **Paths to the gene expression matries summarizing all samples**: Contains paths to matrices of **raw counts**, **TPM** and **FPKM** values at both gene and transcript levels, quantified by **Salmon** and **RSEM_STAR**.
97+
- **Alignment statistics**: Summarize key transcriptome alignment statistics of all samples.
98+
- **Quantification statistics**: Reports the number of genes and transcripts identified by **Salmon** and **RSEM_STAR**, their overlaps, and correlations, with all samples combined.
99+
- **Biotype distribution**: Shows the compositon of gene types at both gene and transcript levels, aggregated across all samples.
100+
- **Genebody coverage statistics**: Includes visualizations and summary statistics (such as Mean of Coverage, Coefficient of Skewness) for gene body coverage, with all samples combined.
101+
102+
![image-20230901163554962](../figures/qc_report_multiple.png)
103+
104+
You can generate this QC report using the command below:
105+
106+
``` bash
107+
## 3. generate QC reports for multiple samples
108+
/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/scripts/run/summarizationMultiple.pl sampleTable.txt /absolute-path-to-save-outputs
23109
```
24110

25-
As shown above, we built a R markdown script, **summaryIndividual.Rmd**, to generate this report from the outputs of quantification pipelines. The only argument you need to pay attention to is the **quant_dir**. The script will call these files to generate the QC report:
111+
This command will first **count the number of reference genome assemblies** (Column #4) present in your **`sampleTable.txt`**, and then:
112+
113+
- **Create a folder for each reference genome assembly** within the output directory (**`absolute-path-to-save-outputs`**). Each folder will be named using the reference genome assembly string, with any "/" characters replaced by "_". You may rename these folders after the analysis is complete.
114+
115+
- Splite the **`sampleTable.txt`** by reference genome assembly into seperate tables and save them in the corresponding folders created above (also named **`sampleTable.txt`**).
116+
117+
- Generate a script (**`summarizationMultiple.sh`**) in each folder and submit them to HPC queues.
118+
119+
``` bash
120+
Rscript -e "rmarkdown::render(input = '/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/scripts/run/summarizationMultiple.Rmd', clean = TRUE, quiet = FALSE, output_format = 'html_document', output_file = 'summarizationMultiple.html', output_dir = '/absolute-path-to-save-outputs', params = list(sampleTable = '/absolute-path-to-save-outputs/_research_jude_rgs01_jude_grou ps_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48/sampleTable.txt', dir_anno = '/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/databases/hg 38/gencode.release48', dir_output = '/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/testdata/Quantification/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAse q_2025_pipeline_databases_hg38_gencode.release48'))"
121+
```
122+
123+
Typically, this step takes **~10 mins** to complete (for 150M PE-100 reads). The stardard outputs include:
124+
125+
- **`summarizationMultiple.html`**: an HTML-format file containing the quality control metrics above (e.g., [example for hg38](https://github.com/jyyulab/bulkRNAseq_quantification_pipeline/blob/main/testdata/summarization_multiple.html)).
126+
127+
- **`01_expressMatrix.*.txt`**: gene expression matries of **raw counts**, **TPM** and **FPKM** values at both gene and transcript levels, quantified by both **Salmon** and **RSEM_STAR**
26128

27-
* quant_dir/preProcessing/adapterTrimming.json
28-
* quant_dir/quantRSEM/quant.isoforms.results
29-
* quant_dir/quantRSEM/quant.genes.results
30-
* quant_dir/quantRSEM/quant.stat/quant.cnt
31-
* quant_dir/quantRSEM/geneCoverage.txt
32-
* quant_dir/quantSalmon/quant.sf
33-
* quant_dir/quantSalmon/quant.genes.sf
129+
- Some other files/folders
34130

35-
As for the outputs, the script will generate a .html file named quantSummary.html which summarizes all five QC metrics. And a few .txt files and .pdf files will be generated as well to provide the source data of the html QC report. These files can be also used to generate the multi-sample QC report if multiple samples were sequenced.
1.96 MB
Loading
2.14 MB
Loading

0 commit comments

Comments
 (0)