jyyulab
diff --git a/.DS_Store b/.DS_Store
0 Bytes
diff --git a/docs/3_full_tutorial/2_quantSalmon.md b/docs/3_full_tutorial/2_quantification.md
docs/3_full_tutorial/2_quantSalmon.md renamed to docs/3_full_tutorial/2_quantification.md
diff --git a/docs/3_full_tutorial/3_summarization.md b/docs/3_full_tutorial/3_summarization.md
Lines changed: 114 additions & 19 deletions
diff --git a/docs/figures/qc_report_individual.png b/docs/figures/qc_report_individual.png
1.96 MB
diff --git a/docs/figures/qc_report_multiple.png b/docs/figures/qc_report_multiple.png
2.14 MB
@@ -5,31 +5,126 @@ nav_order: 3
 parent: Full Tutorial
 ---
 
+# III. Summarization
 
-## Quantification Summary
+---
+
+## Why is this necessary?
+
+After generating quantification results with Salmon, RSEM or other tools, the first question is how reliable those results are, so the quality of the quantification needs to be assessed. If the quality control results look good, the next step is to generate a general-format file summarizing all samples in the analysis, so that most downstream tools (e.g. differential analysis, clustering analysis) can read it directly. This step therefore has two main objectives:
+
+- Perform a comprehensive quality assessment.
+
+- Generate a universal gene expression matrix containing all samples.
+
+  
+
+## 1. Gene body coverage statistics
+
+A major concern for RNA-seq data quality is RNA degradation. RNA molecules are fragile because RNases are ubiquitous: even a few minutes of exposure to RNases can severely degrade RNA molecules, especially mRNA. We therefore use gene body coverage statistics to tell whether the input samples are degraded.
+
+**Gene body coverage** measures how evenly sequencing reads are distributed along the length of a gene's transcript, from the 5' end to the 3' end. **RNA degradation** typically starts from the ends of RNA molecules, particularly at the 5' end. When RNA is degraded, this results in a coverage bias, typically a noticeable "drop-off" at one or both ends of the transcript.
+
+In this pipeline, we pre-binned the longest transcripts of housekeeping genes (the default; an all-gene version is also available) into 100 bins of equal length (`genebodyBins_housekeeping.txt`). We then count the reads mapped to each bin from the transcriptome alignments (`quant.transcript.sorted.bam`) using the commands below:
+
+```bash
+## 1. gene body coverage statistics
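+# Housekeeping genes (default). Run either this command or the all-genes one below: both write to the same genebodyCoverage.txt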
+bedtools multicov \
+	-bams /path-to-save-outputs/quantRSEM_STAR/quant.transcript.sorted.bam \
+	-bed /path-to-database/bulkRNAseq/genebodyBins/genebodyBins_housekeeping.txt > /path-to-save-outputs/quantRSEM_STAR/genebodyCoverage.txt
+
+# All genes
+bedtools multicov \
+	-bams /path-to-save-outputs/quantRSEM_STAR/quant.transcript.sorted.bam \
+	-bed /path-to-database/bulkRNAseq/genebodyBins/genebodyBins_all.txt > /path-to-save-outputs/quantRSEM_STAR/genebodyCoverage.txt
+```
+
+The only output in this step is the `genebodyCoverage.txt` file. It contains the counts of reads mapped to each of the pre-generated bins of selected transcripts. This file will be used to generate the HTML quality control report.
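As a rough illustration of how the per-bin counts might be inspected by hand, the sketch below sums read counts per bin position with `awk`. `bedtools multicov` appends one count column per BAM file after the original columns of the `-bed` file, so the count is read from the last column. The layout of the bin file itself is an assumption here: the toy data uses transcript, start, end and bin index columns, which may not match `genebodyBins_housekeeping.txt`.

```shell
# Toy stand-in for genebodyCoverage.txt.
# Assumed columns (hypothetical layout): transcript, start, end, bin index, read count.
coverage_file=$(mktemp)
printf 'tx1\t0\t10\t1\t5\ntx1\t10\t20\t2\t15\ntx2\t0\t30\t1\t10\ntx2\t30\t60\t2\t20\n' > "$coverage_file"

# Sum the last column (the count appended by bedtools multicov)
# per bin index (column 4 in this assumed layout), then sort by bin.
bin_totals=$(awk -F'\t' '{ total[$4] += $NF }
  END { for (b in total) print b "\t" total[b] }' "$coverage_file" | sort -n)

echo "$bin_totals"
```

A flat profile across bins suggests even coverage, while totals that fall off towards one end point to the coverage bias described above.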
+
+## 2. Individual sample QC report
+
+The QC report for individual samples (see below) summarizes key statistics and quality control metrics from the quantification analysis, including:
+
+- **Alignment statistics**: Key statistics of transcriptome alignment.
+- **Quantification statistics**: Numbers of genes and transcripts identified by **Salmon** and **RSEM_STAR**, as well as their overlaps.
+- **Biotype distribution**: Composition of gene types at both gene and transcript levels.
+- **Quantification accuracy**: Correlation of abundance estimates by **Salmon** and **RSEM_STAR** at both gene and transcript levels.
+- **Genebody coverage statistics**: **Visualization** and statistics of gene body coverage, including **Mean of Coverage** and **Coefficient of Skewness**.
+
+![image-20230901163554962](../figures/qc_report_individual.png)
+
+To generate this report, we created an R Markdown script, **`summarizationIndividual.Rmd`**, which collects information from the files listed below:
+
+* `/path-to-save-outputs/preProcessing/adapterTrimming.json`: for **Alignment statistics**.
+* `/path-to-save-outputs/quantRSEM/quant.stat/quant.cnt`: for **Quantification statistics**.
+* `/path-to-save-outputs/quantRSEM/genebodyCoverage.txt`: for **Genebody coverage statistics**.
+* `/path-to-save-outputs/quantRSEM/quant.isoforms.results`: for **Biotype distribution** and **Quantification accuracy** at the transcript-level.
+* `/path-to-save-outputs/quantRSEM/quant.genes.results`:  for **Biotype distribution** and **Quantification accuracy** at the gene-level.
+* `/path-to-save-outputs/quantSalmon/quant.sf`: for **Quantification accuracy** at the transcript-level.
+* `/path-to-save-outputs/quantSalmon/quant.genes.sf`: for **Quantification accuracy** at the gene-level.
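Because the report depends on all of the files listed above, it can help to confirm they exist before rendering. The following is a minimal sketch, not part of the pipeline: `missing_inputs` is a hypothetical helper, and the demo runs against a throwaway directory rather than real outputs.

```shell
# Hypothetical helper: print each expected input that is absent
# under a sample's output directory (relative paths taken from the list above).
missing_inputs() {
  sample_dir=$1
  for rel in \
    preProcessing/adapterTrimming.json \
    quantRSEM/quant.stat/quant.cnt \
    quantRSEM/genebodyCoverage.txt \
    quantRSEM/quant.isoforms.results \
    quantRSEM/quant.genes.results \
    quantSalmon/quant.sf \
    quantSalmon/quant.genes.sf
  do
    [ -f "$sample_dir/$rel" ] || echo "$rel"
  done
}

# Demo: a temporary directory in which only the two Salmon outputs exist.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/quantSalmon"
touch "$demo_dir/quantSalmon/quant.sf" "$demo_dir/quantSalmon/quant.genes.sf"

missing=$(missing_inputs "$demo_dir")
echo "$missing"
```

In real use, `missing_inputs /path-to-save-outputs` would print nothing when every input is in place.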
 
-For each single sample, we will generate a QC report from the quantification results by Salmon and RSEM. There are five QC metrics in each report:
+You can generate this report using the following commands:
 
-* Alignment statistics: showing read counts and mapping rates et. al.;
-* Quantification statistics: showing the number of genes/transcripts identified by two methods and the correlations of quantification results of them;
-* Biotype distribution: showiing the composition of types of identified transcripts and genes;
-* Quantification accuracy: correlations of gene expression measurements by Salmon and RSEM;
-* Genebody coverage statistics: showing if the RNA samples were degraded or not.
+``` bash
+## 2. generate QC reports for individual samples
+# For one sample
+Rscript -e "rmarkdown::render(input = '/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/scripts/run/summarizationIndividual.Rmd', clean = TRUE, quiet = FALSE, output_format = 'html_document', output_file = 'summarization.html', output_dir = '/path-to-save-outputs/sampleID/summarization', params = list(sampleName = 'sample1', dir_quant = '/path-to-save-outputs/sampleID', dir_anno = './bulkRNAseq_2025/pipeline/databases/hg38/gencode.release48'))"
 
-![image-20230901163554962](/Users/qpan/Library/Application Support/typora-user-images/image-20230901163554962.png)
+# For multiple samples involved in sampleTable.txt
+/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/scripts/run/summarizationIndividual.pl sampleTable.txt
+```
+
+This command will:
+
+- Generate the script: **`/path-to-save-outputs/sampleID/summarization/summarization.sh`** and submit it to HPC queues.
+
+Typically, this step takes **~5 mins** to complete (for 150M PE-100 reads). The standard outputs include:
+
+- **`summarization.html`**: an HTML-format file containing the quality control metrics above (e.g., [example for sample1](https://github.com/jyyulab/bulkRNAseq_quantification_pipeline/blob/main/testdata/summarization_individual.html)).
+
+- **`quant.genes.txt`**: **gene-level** quantification results by both **Salmon** and **RSEM_STAR**.
+
+- **`quant.transcripts.txt`**: **transcript-level** quantification results by both **Salmon** and **RSEM_STAR**.
+
+- Some other files/folders
+
+## 3. Combined QC report
+
+The QC report for multiple samples **differs slightly** from the single-sample version and includes the following:
 
-```sh
-Rscript -e "rmarkdown::render(input = '/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2023/bin/summaryIndividual.Rmd', clean = TRUE, quiet = F, output_format = 'html_document', output_file = 'quantSummary.html', output_dir = '/your_path/quantSummary', params = list(sampleName = 'sample1', quant_dir = '/your_path'))"
+- **Paths to the gene expression matrices summarizing all samples**: Contains paths to matrices of **raw counts**, **TPM** and **FPKM** values at both gene and transcript levels, quantified by **Salmon** and **RSEM_STAR**.
+- **Alignment statistics**: Summarizes key transcriptome alignment statistics of all samples.
+- **Quantification statistics**: Reports the number of genes and transcripts identified by **Salmon** and **RSEM_STAR**, their overlaps, and correlations, with all samples combined.
+- **Biotype distribution**: Shows the composition of gene types at both gene and transcript levels, aggregated across all samples.
+- **Genebody coverage statistics**: Includes visualizations and summary statistics (such as Mean of Coverage, Coefficient of Skewness) for gene body coverage, with all samples combined.
+
+![image-20230901163554962](../figures/qc_report_multiple.png)
+
+You can generate this QC report using the command below:
+
+``` bash
+## 3. generate QC reports for multiple samples
+/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/scripts/run/summarizationMultiple.pl sampleTable.txt /absolute-path-to-save-outputs
 ```
 
-As shown above, we built a R markdown script, **summaryIndividual.Rmd**, to generate this report from the outputs of quantification pipelines. The only argument you need to pay attention to is the **quant_dir**. The script will call these files to generate the QC report:
+This command will first **count the number of reference genome assemblies** (Column #4) present in your **`sampleTable.txt`**, and then:
+
+- **Create a folder for each reference genome assembly** within the output directory (**`absolute-path-to-save-outputs`**). Each folder will be named using the reference genome assembly string, with any "/" characters replaced by "_". You may rename these folders after the analysis is complete.
+
+- Split the **`sampleTable.txt`** by reference genome assembly into separate tables and save them in the corresponding folders created above (also named **`sampleTable.txt`**).
+
+- Generate a script (**`summarizationMultiple.sh`**) in each folder and submit them to HPC queues.
+
+  ``` bash
+  Rscript -e "rmarkdown::render(input = '/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/scripts/run/summarizationMultiple.Rmd', clean = TRUE, quiet = FALSE, output_format = 'html_document', output_file = 'summarizationMultiple.html', output_dir = '/absolute-path-to-save-outputs', params = list(sampleTable = '/absolute-path-to-save-outputs/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48/sampleTable.txt', dir_anno = '/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/databases/hg38/gencode.release48', dir_output = '/research_jude/rgs01_jude/groups/yu3grp/projects/software_JY/yu3grp/conda_env/bulkRNAseq_2025/pipeline/testdata/Quantification/_research_jude_rgs01_jude_groups_yu3grp_projects_software_JY_yu3grp_conda_env_bulkRNAseq_2025_pipeline_databases_hg38_gencode.release48'))"
+  ```
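The folder-per-assembly behaviour described above can be sketched with `awk`. This is only an illustration of the idea, not the code in `summarizationMultiple.pl`: it assumes a tab-delimited table without a header row, and the sample rows and assembly paths are made up.

```shell
# Toy sample table: column 4 holds the reference genome assembly (made-up paths).
out_dir=$(mktemp -d)
sample_table="$out_dir/all_samples.txt"
printf 's1\tf1\tf2\t/db/hg38/v48\ns2\tf1\tf2\t/db/mm39/v37\ns3\tf1\tf2\t/db/hg38/v48\n' > "$sample_table"

# For each row: derive the folder name from column 4 ("/" replaced by "_"),
# create the folder on first sight, and append the row to its sampleTable.txt.
awk -F'\t' -v out="$out_dir" '{
  folder = $4
  gsub("/", "_", folder)
  dir = out "/" folder
  if (!(dir in seen)) { system("mkdir -p \"" dir "\""); seen[dir] = 1 }
  print > (dir "/sampleTable.txt")
}' "$sample_table"

ls "$out_dir"
```

With this toy input, the two hg38 samples end up together in `_db_hg38_v48/sampleTable.txt` and the mm39 sample in `_db_mm39_v37/sampleTable.txt`.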
+
+Typically, this step takes **~10 mins** to complete (for 150M PE-100 reads). The standard outputs include:
+
+- **`summarizationMultiple.html`**: an HTML-format file containing the quality control metrics above (e.g., [example for hg38](https://github.com/jyyulab/bulkRNAseq_quantification_pipeline/blob/main/testdata/summarization_multiple.html)).
+
+- **`01_expressMatrix.*.txt`**: gene expression matrices of **raw counts**, **TPM** and **FPKM** values at both gene and transcript levels, quantified by both **Salmon** and **RSEM_STAR**.
 
-* quant_dir/preProcessing/adapterTrimming.json
-* quant_dir/quantRSEM/quant.isoforms.results
-* quant_dir/quantRSEM/quant.genes.results
-* quant_dir/quantRSEM/quant.stat/quant.cnt
-* quant_dir/quantRSEM/geneCoverage.txt
-* quant_dir/quantSalmon/quant.sf
-* quant_dir/quantSalmon/quant.genes.sf
+- Some other files/folders
 
-As for the outputs, the script will generate a .html file named quantSummary.html which summarizes all five QC metrics. And a few .txt files and .pdf files will be generated as well to provide the source data of the html QC report. These files can be also used to generate the multi-sample QC report if multiple samples were sequenced.