After generating quantification results with Salmon, RSEM, or other tools, the next question is how reliable those results are; that is, we need to assess the quality of the quantification analysis. If the quality control results look good, you will probably also want a general-format file that summarizes all samples in the analysis, so that most downstream tools (e.g., differential analysis, clustering analysis) can read it directly. This analysis therefore has two main objectives:

- Perform a comprehensive quality assessment.

- Generate a universal gene expression matrix containing all samples.

## 1. Gene body coverage statistics

A major concern for RNA-seq data quality is RNA degradation. RNA molecules are fragile because RNases are everywhere: exposure to RNase for even a couple of minutes can cause severe degradation of RNA molecules, especially mRNA. We therefore use gene body coverage statistics to tell whether the input samples are degraded.

**Gene body coverage** measures how evenly sequencing reads are distributed along the length of a gene's transcript, from the 5' end to the 3' end. **RNA degradation** typically starts from the ends of RNA molecules, particularly the 5' end. Degradation therefore produces a coverage bias, typically a noticeable "drop-off" at one or both ends of the transcript.

In this pipeline, we pre-binned the longest transcripts of housekeeping genes (the default; an all-gene version is also available) into 100 fragments of equal length (`genebodyBins_housekeeping.txt`). We then count the reads mapped to each of these fragments from the transcriptome alignments (`quant.transcript.sorted.bam`) using the command below:

The only output in this step is the `genebodyCoverage.txt` file. It contains the counts of reads mapped to each of the pre-generated bins of selected transcripts. This file will be used to generate the HTML quality control report.
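
The binning-and-counting idea can be illustrated in a few lines of Python. This is only a minimal sketch and not the pipeline's own command: it assumes read positions are 0-based start coordinates on a single transcript, and the function names `make_bin_edges` and `count_reads_per_bin` are hypothetical.

```python
from bisect import bisect_right


def make_bin_edges(transcript_length, n_bins=100):
    """Return n_bins + 1 edges splitting [0, transcript_length) into near-equal bins."""
    return [i * transcript_length // n_bins for i in range(n_bins + 1)]


def count_reads_per_bin(read_starts, transcript_length, n_bins=100):
    """Count reads per bin; reads starting outside the transcript are ignored."""
    edges = make_bin_edges(transcript_length, n_bins)
    counts = [0] * n_bins
    for pos in read_starts:
        if 0 <= pos < transcript_length:
            # bisect_right finds the first edge greater than pos,
            # so the bin containing pos is one position to the left.
            counts[bisect_right(edges, pos) - 1] += 1
    return counts


# Hypothetical example: a 1000-nt transcript and five read start positions.
counts = count_reads_per_bin([0, 5, 10, 999, 1000], transcript_length=1000)
print(counts[0], counts[1], counts[99])
```

Using the same edge list for both bin definition and read assignment keeps the two consistent even when the transcript length is not a multiple of the number of bins.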
## 2. Individual sample QC report

The QC report for individual samples (see below) summarizes key statistics and quality control metrics from the quantification analysis, including:

- **Alignment statistics**: Key statistics of the transcriptome alignment.
- **Quantification statistics**: Numbers of genes and transcripts identified by **Salmon** and **RSEM_STAR**, as well as their overlaps.
- **Biotype distribution**: Composition of gene types at both gene and transcript levels.
- **Quantification accuracy**: Correlation of abundance estimates by **Salmon** and **RSEM_STAR** at both gene and transcript levels.
- **Genebody coverage statistics**: **Visualization** and statistics of gene body coverage, including **Mean of Coverage** and **Coefficient of Skewness**.
- Generate the script **`/path-to-save-outputs/sampleID/summarization/summarization.sh`** and submit it to HPC queues.

Typically, this step takes **~5 mins** to complete (for 150M PE-100 reads). The standard outputs include:

- **`summarization.html`**: an HTML file containing the quality control metrics above (e.g., [example for sample1](https://github.com/jyyulab/bulkRNAseq_quantification_pipeline/blob/main/testdata/summarization_individual.html)).

- **`quant.genes.txt`**: **gene-level** quantification results by both **Salmon** and **RSEM_STAR**.

- **`quant.transcripts.txt`**: **transcript-level** quantification results by both **Salmon** and **RSEM_STAR**.

- Some other files/folders
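
Since the quantification tables hold estimates from both quantifiers, the overlap and quantification accuracy metrics can be pictured as a comparison over shared features. The sketch below is an illustration under stated assumptions, not the report's actual method: it assumes a Pearson correlation on log2(TPM + 1) values, takes plain dictionaries rather than parsing the output files (whose column layout is not shown here), and uses hypothetical names and values throughout.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def compare_quantifiers(salmon_tpm, rsem_tpm):
    """Return (correlation, number of shared features) for two id -> TPM mappings."""
    shared = sorted(set(salmon_tpm) & set(rsem_tpm))
    x = [math.log2(salmon_tpm[g] + 1) for g in shared]
    y = [math.log2(rsem_tpm[g] + 1) for g in shared]
    return pearson(x, y), len(shared)


# Hypothetical TPM values keyed by hypothetical gene IDs.
salmon = {"g1": 10.0, "g2": 0.0, "g3": 100.0, "g4": 5.0}
rsem = {"g1": 12.0, "g2": 0.5, "g3": 90.0, "g5": 1.0}
r, n_shared = compare_quantifiers(salmon, rsem)
print(f"shared features: {n_shared}, correlation: {r:.3f}")
```

The log transform is a common choice because abundance estimates span several orders of magnitude; a rank-based correlation would be another reasonable option.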
## 3. Combined QC report

The QC report for multiple samples **differs slightly** from the single-sample version and includes the following:

- **Paths to the gene expression matrices summarizing all samples**: Contains paths to matrices of **raw counts**, **TPM**, and **FPKM** values at both gene and transcript levels, quantified by **Salmon** and **RSEM_STAR**.
- **Alignment statistics**: Summarizes key transcriptome alignment statistics of all samples.
- **Quantification statistics**: Reports the number of genes and transcripts identified by **Salmon** and **RSEM_STAR**, their overlaps, and correlations, with all samples combined.
- **Biotype distribution**: Shows the composition of gene types at both gene and transcript levels, aggregated across all samples.
- **Genebody coverage statistics**: Includes visualizations and summary statistics (such as Mean of Coverage and Coefficient of Skewness) for gene body coverage, with all samples combined.
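
The two gene body coverage summary statistics named above can be computed from per-bin read counts. The sketch below is illustrative only: it assumes the skewness is Pearson's second coefficient, 3 × (mean − median) / standard deviation, which may differ from the definition used in the report, and the function name `coverage_summary` and the example profiles are hypothetical.

```python
import statistics


def coverage_summary(bin_counts):
    """Return (mean coverage, Pearson's second skewness coefficient) for per-bin counts."""
    mean = statistics.mean(bin_counts)
    median = statistics.median(bin_counts)
    sd = statistics.pstdev(bin_counts)
    # A perfectly flat profile has zero spread; report zero skewness in that case.
    skewness = 0.0 if sd == 0 else 3 * (mean - median) / sd
    return mean, skewness


# Hypothetical profiles over 100 bins: an even one and a strongly one-sided one.
even_profile = [10] * 100
biased_profile = [0] * 70 + [100] * 30

print(coverage_summary(even_profile))
print(coverage_summary(biased_profile))
```

Under this assumed definition, an evenly covered profile gives a skewness of zero, while a profile with reads concentrated in a minority of bins moves the coefficient away from zero.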
As shown above, we built an R Markdown script, **summaryIndividual.Rmd**, to generate this report from the outputs of the quantification pipelines. The only argument you need to pay attention to is **quant_dir**, from which the script reads the files it needs to generate the QC report.

This command will first **count the number of reference genome assemblies** (Column #4) present in your **`sampleTable.txt`**, and then:

- **Create a folder for each reference genome assembly** within the output directory (**`absolute-path-to-save-outputs`**). Each folder is named using the reference genome assembly string, with any "/" characters replaced by "_". You may rename these folders after the analysis is complete.

- Split the **`sampleTable.txt`** by reference genome assembly into separate tables and save them in the corresponding folders created above (each also named **`sampleTable.txt`**).

- Generate a script (**`summarizationMultiple.sh`**) in each folder and submit them to HPC queues.
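
The folder creation and table splitting described above can be sketched as follows. This is a minimal illustration rather than the pipeline's implementation: it assumes `sampleTable.txt` is tab-delimited with no header row, and the function name `split_sample_table` is hypothetical. Only the use of Column #4 and the "/" to "_" replacement come from the description above.

```python
import csv
from collections import defaultdict
from pathlib import Path


def split_sample_table(table_path, out_dir, assembly_col=3):
    """Split a tab-delimited sample table by the assembly column (Column #4, 0-based index 3).

    Returns a dict mapping each assembly string to the path of its per-assembly table.
    """
    groups = defaultdict(list)
    with open(table_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row:
                continue  # skip blank lines
            if len(row) <= assembly_col:
                raise ValueError(f"row has fewer than {assembly_col + 1} columns: {row}")
            groups[row[assembly_col]].append(row)

    written = {}
    for assembly, rows in groups.items():
        # Folder name is the assembly string with "/" replaced by "_".
        folder = Path(out_dir) / assembly.replace("/", "_")
        folder.mkdir(parents=True, exist_ok=True)
        target = folder / "sampleTable.txt"
        with open(target, "w", newline="") as handle:
            csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)
        written[assembly] = target
    return written
```

Calling `split_sample_table("sampleTable.txt", "absolute-path-to-save-outputs")` would then produce one folder per assembly, each holding its own `sampleTable.txt`.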

Typically, this step takes **~10 mins** to complete (for 150M PE-100 reads). The standard outputs include:

- **`summarizationMultiple.html`**: an HTML file containing the quality control metrics above (e.g., [example for hg38](https://github.com/jyyulab/bulkRNAseq_quantification_pipeline/blob/main/testdata/summarization_multiple.html)).

- **`01_expressMatrix.*.txt`**: gene expression matrices of **raw counts**, **TPM**, and **FPKM** values at both gene and transcript levels, quantified by both **Salmon** and **RSEM_STAR**.

- Some other files/folders

Input files read from **quant_dir** to generate the QC report:

* quant_dir/preProcessing/adapterTrimming.json
* quant_dir/quantRSEM/quant.isoforms.results
* quant_dir/quantRSEM/quant.genes.results
* quant_dir/quantRSEM/quant.stat/quant.cnt
* quant_dir/quantRSEM/geneCoverage.txt
* quant_dir/quantSalmon/quant.sf
* quant_dir/quantSalmon/quant.genes.sf

As for the outputs, the script generates an HTML file named `quantSummary.html` that summarizes all five QC metrics, along with a few .txt and .pdf files that provide the source data for the HTML QC report. These files can also be used to generate the multi-sample QC report if multiple samples were sequenced.
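
Combining per-sample results into a single expression matrix amounts to aligning every sample on a shared set of feature IDs. The sketch below shows one way this could look; it is an assumption-laden illustration and not the pipeline's code. The function name `build_expression_matrix`, the sample and gene IDs, and the choice to fill features missing from a sample with 0.0 are all hypothetical.

```python
def build_expression_matrix(per_sample, fill=0.0):
    """Merge {sample_id: {feature_id: value}} into (header, rows) with one row per feature.

    Features missing from a sample receive `fill` (an assumption made for this sketch).
    """
    samples = sorted(per_sample)
    features = sorted(set().union(*(per_sample[s] for s in samples)))
    header = ["feature_id"] + samples
    rows = [[f] + [per_sample[s].get(f, fill) for s in samples] for f in features]
    return header, rows


# Hypothetical per-sample TPM values.
per_sample = {
    "sample2": {"geneA": 1.0},
    "sample1": {"geneA": 2.0, "geneB": 3.0},
}
header, rows = build_expression_matrix(per_sample)
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```

Sorting both samples and features keeps the output order stable between runs, which makes matrices from different runs easier to compare.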