-
Notifications
You must be signed in to change notification settings - Fork 21
/
Copy pathbulk-rnaseq-analysis-guide.Rmd
executable file
·1016 lines (784 loc) · 66.7 KB
/
bulk-rnaseq-analysis-guide.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Bulk RNA-sequencing pipeline and differential gene expression analysis"
author: Erick Lu
date: "March 1, 2020 \n\n[GitHub Repo](https://github.com/erilu/bulk-rnaseq-analysis)"
output:
html_document:
toc: true
toc_float: true
df_print: paged
---
## Introduction
RNA-sequencing is a powerful technique that can assess differences in global gene expression between groups of samples. For example, it can be used to:
* Identify differences between knockout and control samples
* Understand the effects of treating cells/animals with therapeutics
* Observe the gene expression changes that occur across development
However, processing raw sequencing data and performing differential gene expression analysis can be daunting for researchers with limited experience in programming. My goal is to help researchers learn how to analyze bulk RNA-sequencing data by providing code, explanations, and the expected output for each step of the process. Here, I present an example of a complete bulk RNA-sequencing pipeline which includes:
1. Finding and downloading raw data from GEO using NCBI SRA tools and Python
2. Mapping FASTQ files using STAR
3. Differential gene expression analysis using DESeq2
4. Visualizations for bulk RNA-seq results
If you are performing your own RNA-seq experiment and have obtained raw data from your sequencing facility, you can start this guide from the section: [Mapping reads using STAR].
## Obtaining raw data from GEO
The dataset that we will be working with comes from [Guo et al. Nature Communications 2019](https://www.ncbi.nlm.nih.gov/pubmed/30655535). To find the raw sequencing data, we can navigate through the Gene Expression Omnibus (GEO) using the accession number provided in the publication: GSE106305. The GEO page corresponding to this accession number is https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE106305. On this webpage, there is information about where the sequencing data for each sample in the study is stored.
Our goal is to find differentially expressed genes in response to hypoxia for the LNCaP and PC3 cell lines. Therefore, we will select the control samples for both cell lines (Empty_Vector for LNCaP and siCtrl for PC3) in conditions of either normoxia or hypoxia. The specific samples we need to download are outlined in the table below:
Sample Name | GSM Identifier | SRA Identifier (SRX) | SRA Runs (SRR, download these)
------------------ | ----------------- | -------------------------- | --------------------------------------------------------
LNCaP_RNA-Seq_Empty_Vector_Normoxia_rep1 | GSM3145509 | SRX4096735 | SRR7179504, SRR7179505, SRR7179506, and SRR7179507
LNCaP_RNA-Seq_Empty_Vector_Normoxia_rep2 | GSM3145510 | SRX4096736 | SRR7179508, SRR7179509, SRR7179510, and SRR7179511
LNCaP_RNA-Seq_Empty_Vector_Hypoxia_rep1 | GSM3145513 | SRX4096739 | SRR7179520, SRR7179521, SRR7179522, and SRR7179523
LNCaP_RNA-Seq_Empty_Vector_Hypoxia_rep2 | GSM3145514 | SRX4096740 | SRR7179524, SRR7179525, SRR7179526, and SRR7179527
PC3_RNA-Seq_siCtrl_Normoxia_rep1 | GSM3145517 | SRX4096743 | SRR7179536
PC3_RNA-Seq_siCtrl_Normoxia_rep2 | GSM3145518 | SRX4096744 | SRR7179537
PC3_RNA-Seq_siCtrl_Hypoxia_rep1 | GSM3145521 | SRX4096747 | SRR7179540
PC3_RNA-Seq_siCtrl_Hypoxia_rep2 | GSM3145522 | SRX4096748 | SRR7179541
The NCBI stores sequencing data in the form of Sequence Read Archive (SRA) files. Each of the samples is associated with a set of SRA accession numbers, indicated above. First, we need to download the SRA runs for each sample. Then, we will use the SRA files to generate FASTQ files. The FASTQ file is the data format required for bulk RNA-sequencing analysis.
### Download FASTQ files using SRA tools
To download the SRA files onto our machine, we will use the NCBI's SRA toolkit. If you are using linux, you can type: `sudo apt install sra-toolkit` in your command line to install the toolkit. You can read more about SRA toolkit here: https://www.ncbi.nlm.nih.gov/books/NBK242621/.
The toolkit works by first using the `prefetch` command to download the SRA file associated with the specified SRA ID. The SRA file contains a set of "instructions" for downloading the sequencing data associated with the SRA ID from NCBI. For example, to download the first SRA run for LNCaP_RNA-Seq_Empty_Vector_Normoxia_rep1, we would write:
```bash
prefetch SRR7179504
```
```{}
2020-02-14T22:15:36 prefetch.2.8.2: 1) Downloading 'SRR7179504'...
2020-02-14T22:15:36 prefetch.2.8.2: Downloading via https...
2020-02-14T22:18:39 prefetch.2.8.2: 1) 'SRR7179504' was downloaded successfully
2020-02-14T22:18:39 prefetch.2.8.2: 'SRR7179504' has 0 unresolved dependencies
```
This will download the file `SRR7179504.sra` into our home directory at ~/ncbi/public/sra/. We can now use `fastq-dump` to extract the contents of it into a FASTQ file. The Edwards lab at SDSU provides a nice tutorial for how to use fastq-dump here: https://edwards.sdsu.edu/research/fastq-dump/. To create the FASTQ file for SRR7179504, we would write:
```bash
fastq-dump --outdir fastq --gzip --skip-technical --readids --read-filter pass --dumpbase --split-3 --clip ~/ncbi/public/sra/SRR7179504.sra
```
```{}
Rejected 13548432 READS because of filtering out non-biological READS
Read 13548432 spots for /home/ericklu/ncbi/public/sra/SRR7179504.sra
Written 13548432 spots for /home/ericklu/ncbi/public/sra/SRR7179504.sra
```
This will create a file called `SRR7179504_pass.fastq.gz` in a subdirectory called `fastq` inside the directory where fastq-dump was called. The ".gz" just means that the file is compressed.
Since there are many SRA runs associated with our samples, it would take a long time to manually run `prefetch` and `fastq-dump` for each run. To automate all the downloads, I wrote a small script in Python to call `fastq-dump` on each of the SRA runs we want. The code is shown below and also provided in this repo as `fastq_download.py`:
```py
import subprocess
sra_numbers = [
"SRR7179504", "SRR7179505", "SRR7179506", "SRR7179507",
"SRR7179508", "SRR7179509", "SRR7179510", "SRR7179511",
"SRR7179520", "SRR7179521", "SRR7179522", "SRR7179523",
"SRR7179524", "SRR7179525", "SRR7179526", "SRR7179527",
"SRR7179536", "SRR7179537", "SRR7179540","SRR7179541"
]
# this will download the .sra files to ~/ncbi/public/sra/ (will create directory if not present)
for sra_id in sra_numbers:
print ("Currently downloading: " + sra_id)
prefetch = "prefetch " + sra_id
print ("The command used was: " + prefetch)
subprocess.call(prefetch, shell=True)
# this will extract the .sra files from above into a folder named 'fastq'
for sra_id in sra_numbers:
print ("Generating fastq for: " + sra_id)
fastq_dump = "fastq-dump --outdir fastq --gzip --skip-technical --readids --read-filter pass --dumpbase --split-3 --clip ~/ncbi/public/sra/" + sra_id + ".sra"
print ("The command used was: " + fastq_dump)
subprocess.call(fastq_dump, shell=True)
```
We can run the Python script by simply navigating to the folder on your machine where you want to store the FASTQ files (via the command line), then running `python fastq_download.py`. After running the Python script, all the FASTQ files should be sitting in a directory called 'fastq'. If you would like more information about navigating through GEO and downloading SRA Runs, you can check out my other [repo](https://erilu.github.io/python-fastq-downloader/).
### Concatenating FASTQ files
Each of the LNCaP samples was associated with four SRA runs, which means that we obtained four resulting FASTQ files for each sample after running `fastq_download.py`. For each sample, we should concatenate the four files into a single FASTQ file by using the command `cat`. Below, I perform the concatenation for each of the LNCaP samples:
```bash
cat SRR7179504_pass.fastq.gz SRR7179505_pass.fastq.gz SRR7179506_pass.fastq.gz SRR7179507_pass.fastq.gz > LNCAP_Normoxia_S1.fastq.gz
cat SRR7179508_pass.fastq.gz SRR7179509_pass.fastq.gz SRR7179510_pass.fastq.gz SRR7179511_pass.fastq.gz > LNCAP_Normoxia_S2.fastq.gz
cat SRR7179520_pass.fastq.gz SRR7179521_pass.fastq.gz SRR7179522_pass.fastq.gz SRR7179523_pass.fastq.gz > LNCAP_Hypoxia_S1.fastq.gz
cat SRR7179524_pass.fastq.gz SRR7179525_pass.fastq.gz SRR7179526_pass.fastq.gz SRR7179527_pass.fastq.gz > LNCAP_Hypoxia_S2.fastq.gz
```
In contrast, there is only one FASTQ file for each of the PC3 samples. We can just rename them from their SRR identifiers to their real sample names using `mv`:
```bash
mv SRR7179536_pass.fastq.gz PC3_Normoxia_S1.fastq.gz
mv SRR7179537_pass.fastq.gz PC3_Normoxia_S2.fastq.gz
mv SRR7179540_pass.fastq.gz PC3_Hypoxia_S1.fastq.gz
mv SRR7179541_pass.fastq.gz PC3_Hypoxia_S2.fastq.gz
```
We won't need the individual SRA runs anymore, so we can remove them all using the command `rm SRR*`, which removes all the files in the folder that begin with "SRR". Now, the folder should contain a total of 8 FASTQ files: 4 for LNCaP and 4 for PC3. We are ready to begin aligning these FASTQ files to the reference genome!
## Mapping reads using STAR
Each read in your FASTQ file must be "mapped" to a reference genome. "Mapping" can be described as finding the place in the reference genome that best matches the fragment of the mRNA transcript captured in the read. In this section, I will describe the concepts behind mapping reads to a reference genome, and how to do it using the tool STAR (Spliced Transcripts Alignment to a Reference, https://github.com/alexdobin/STAR) and some Python scripting.
### Concepts behind mapping reads
In order to perform bulk RNA-sequencing, a "library" must first be prepared from the RNA of your cells of interest. This library is sequenced to create the FASTQ files. To create a library, mRNA from the cells is first captured and reverse transcribed into cDNA. This is then fragmented into small pieces, and sequencing adaptors are attached to the ends of each fragment to create the library. A sequencer will "read" these fragments and store the sequences in a FASTQ file. We can observe what the sequences look like by printing out one of the "reads", which are stored in 4-line chunks inside the FASTQ file.
```ba
zcat LNCAP_Hypoxia_S1.fastq.gz | head -4
```
```{}
@SRR7179520.1.1 1 length=76
GTGAANATAGGCCTTAGAGCACTTGANGTGNTAGNGCANGTNGNNCCGGAACGNNNNNNNNAGGTNGNNNGNGTTG
+SRR7179520.1.1 1 length=76
AAAAA#EEEEEEEEEEEEEEEEEEEE#EEE#EEE#EEE#EE#E##EEEEEEEE########EEEE#E###E#EAEA
```
Each line of the read contains a different set of information:
* Line 1 contains a sequence identifier with an optional description. This line must start with "@".
* Line 2 contains the actual sequence of the read.
* Line 3 must start with "+", and usually contains the same sequence identifier as line 1. Having the identifier is not required, and sometimes this line consists of just a single "+".
* Line 4 contains quality values for each base designation from ine 2. For example, the first base "G" has a quality value of "A". This is basically an assessment of how confident the machine is that it assigned the right base.
Using the sequence for each fragment (line 3), we can use software to figure out which part of the genome that fragment originally came from, based on sequence similarity. This tells us which gene was being transcribed to create the original mRNA that was captured to make the library. Counting the number of times a read mapped to a specific gene gives us information about how "high" or "low" a gene was being expressed.
### Building the genome index
A great tool for mapping reads is STAR. In order to install STAR, follow the installation instructions provided in its GitHub [repo](https://github.com/alexdobin/STAR). I will be using version 2.7.0e. Once STAR is installed, we need to first build a genome index. Since we are analyzing RNA-seq data from human cells, we will build the human reference genome.
To build the genome index, we have to run STAR in `--genomeGenerate` mode and supply it with the human reference genome sequence (FASTA) and annotation (GTF) files. We can download the FASTA and GTF files from the Ensembl website (https://uswest.ensembl.org/Homo_sapiens/Info/Index) using the command line:
```bash
wget ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
wget ftp://ftp.ensembl.org/pub/release-99/gtf/homo_sapiens/Homo_sapiens.GRCh38.99.gtf.gz
gunzip *.gz
```
Once these are downloaded and unzipped, we can run the command below to make the genome index file. For the `--genomeDir` parameter, you should specify the directory where you are keeping the FASTA and GTF files.
```bash
STAR --runThreadN 64 --runMode genomeGenerate --genomeDir /data/genomes/h38/STAR/ --genomeFastaFiles /data/genomes/h38/STAR/Homo_sapiens.GRCh38.dna.primary_assembly.fa --sjdbGTFfile /data/genomes/h38/STAR/Homo_sapiens.GRCh38.99.gtf
```
The output from successfully running the command looks like:
```{}
Feb 15 20:06:34 ..... started STAR run
Feb 15 20:06:34 ... starting to generate Genome files
Feb 15 20:07:28 ... starting to sort Suffix Array. This may take a long time...
Feb 15 20:07:41 ... sorting Suffix Array chunks and saving them to disk...
Feb 15 20:20:11 ... loading chunks from disk, packing SA...
Feb 15 20:21:15 ... finished generating suffix array
Feb 15 20:21:15 ... generating Suffix Array index
Feb 15 20:25:16 ... completed Suffix Array index
Feb 15 20:25:16 ..... processing annotations GTF
Feb 15 20:25:26 ..... inserting junctions into the genome indices
Feb 15 20:27:58 ... writing Genome to disk ...
Feb 15 20:28:01 ... writing Suffix Array to disk ...
Feb 15 20:28:18 ... writing SAindex to disk
Feb 15 20:28:20 ..... finished successfully
```
After running the command, you should see a bunch of genome index files in the directory specified by `--genomeDir`. We can now start mapping reads from the FASTQ files to this newly-created reference genome, which I will explain in the next section.
### Mapping reads
Here is an example STAR command used to map a single sample, LNCAP_Normoxia_S1_R1_001.fastq.gz, to the reference genome:
```bash
STAR --runThreadN 64 --genomeDir /data/genomes/h38/STAR --outFilterType BySJout --outFilterMismatchNoverLmax 0.04 --outFilterMismatchNmax 999 --alignSJDBoverhangMin 1 --alignSJoverhangMin 8 --outFilterMultimapNmax 20 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --readFilesIn /fastq/LNCAP_Normoxia_S1_R1_001.fastq.gz --clip3pAdapterSeq GATCGGAAGAGCACACGTCTGAACTCCAGTCAC --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --outFileNamePrefix LNCAP_Normoxia_S1
```
The parameters that you should pay attention to are:
* `--genomeDir`, which indicates the directory that contains the genome index files
* `--readFilesIn`, which specifies the FASTQ file location for sample LNCAP_Normoxia_S1_R1_001.fastq.gz
* `--outFileNamePrefix`, which specifies the name we want to give the STAR output
The rest of the parameters are default settings recommended by STAR, which are usually sufficient. Since we have 8 samples to run, and each sample will take a long time to run, we can use a Python script to run the command for each FASTQ in the directory:
```py
import subprocess
import os
output_directory = "STAR_output/"
genome_directory = "/data/genomes/h38/STAR/"
fastq_directory = "/data/analysis/hypoxia/fastq/"
os.mkdir(output_directory)
for fastq in os.listdir(fastq_directory):
# only process files that end in fastq.gz
if fastq.endswith('.fastq.gz'):
prefix=fastq.strip(".fastq.gz") + "_output"
# make an output folder for the current fastq file
os.mkdir(output_directory + prefix)
print ("Currently mapping: " + fastq)
# run STAR on the current fastq file
subprocess.call("STAR --runThreadN 64 --genomeDir " + genome_directory + " --readFilesCommand zcat --outFilterType BySJout --outFilterMismatchNoverLmax 0.04 --outFilterMismatchNmax 999 --alignSJDBoverhangMin 1 --alignSJoverhangMin 8 --outFilterMultimapNmax 20 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --readFilesIn "+ fastq_directory + fastq + " --clip3pAdapterSeq GATCGGAAGAGCACACGTCTGAACTCCAGTCAC --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --outFileNamePrefix " + output_directory + prefix + "/", shell=True)
```
We can save the script as `align_STAR.py`, and run the script by simply typing `python align_STAR.py` in the command line. The output of a successful STAR alignment looks like this, for one iteration of the loop:
```{}
Currently mapping: PC3_Normoxia_S1.fastq.gz
Feb 16 23:28:42 ..... started STAR run
Feb 16 23:28:42 ..... loading genome
Feb 16 23:28:59 ..... started mapping
Feb 16 23:31:07 ..... finished mapping
Feb 16 23:31:10 ..... started sorting BAM
Feb 16 23:32:59 ..... finished successfully
```
After the script has finished running, the output folders containing the alignment files for each sample should be found in the directory `STAR_output/`.
### Understanding ReadsPerGene.out.tab
In the `STAR_output/` directory, you will find a separate folder of results for each of the FASTQ files you mapped. Inside each folder, you should see the following files:
```{}
Aligned.sortedByCoord.out.bam
Log.final.out
Log.out
Log.progress.out
ReadsPerGene.out.tab
SJ.out.tab
```
The file that contains the data you need is `ReadsPerGene.out.tab`. This file contains the number of reads that were mapped to each gene in the transcriptome. A quick glance at the file using the command below shows the structure of the file. There are some QC metrics at the beginning of the file (N_unmapped, etc.) followed by each gene in the transcriptome.
```ba
cat ReadsPerGene.out.tab | head -n 10
```
```{}
N_unmapped 2461926 2461926 2461926
N_multimapping 4255262 4255262 4255262
N_noFeature 4048236 41792820 4475290
N_ambiguous 3081395 50727 1691846
ENSG00000223972 0 0 0
ENSG00000227232 3 0 3
ENSG00000278267 13 0 13
ENSG00000243485 0 0 0
ENSG00000284332 0 0 0
ENSG00000237613 0 0 0
```
The first column corresponds to the Ensembl gene ID. Columns 2, 3, and 4 correspond to the number of reads mapped to each gene. You might be wondering why there are 3 columns of counts. Each column corresponds to the "strandedness" of the read mapping. An RNA-seq library is composed of DNA, which has two strands: a "sense" strand and an "anti-sense" strand. RNA-seq libraries can be prepared as either "unstranded" or "stranded". The column you choose for downstream analysis is typically dictated by the strandedness of the library kit that was used to prepare the samples.
* If an "unstranded" library prep kit was used, column 2 should be selected. For unstranded libraries, we don't know which strand the original mRNA was produced from, and the reads are mapped to both the sense and anti-sense strand.
* If a "stranded" library prep kit was used, we must choose from column 3 or 4. The information from which strand the mRNA originated from is retained. Column 3 corresponds to the first read strand aligned, and column 4 corresponds to the second read strand aligned.
To determine the strandedness for our samples, the we need to find out the exact library preparation kit that was used. The sample information at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3145522 states that the TruSeq® Stranded mRNA Library Prep Kit (RS-122-2101, Illumina) was used to prepare these samples. Since this is a stranded library prep kit, we must select from column 3 or 4. Looking at the output above, we observe that column 4 contains the reads whereas column 3 does not. We will proceed with using column 4 from each of the samples for downstream analysis.
### Selecting read counts by sample
For each of our samples, we need to select column 4 from `ReadsPerGene.out.tab` and place them into one data file for downstream analysis. Rather than doing this manually, we can write a Python script to loop through each folder within `STAR_output/`, obtain column 4 of the `ReadsPerGene.out.tab`, and write the data to a single output file. The script `combine_STAR_output.py` does the job by storing the the counts per gene for each sample in a dictionary of dictionaries, and writes the aggregated data to a file called `raw_counts.csv`.
```py
import os
import csv
# modify desired_column based on strandedness of library kit (index 3 is column 4)
desired_column = 3
output_directory = "STAR_output/"
sample_dict = {}
sample_names = []
# loop through each folder and store data from ReadsPerGene.out.tab in sample_dict
for folder in os.listdir(output_directory):
if folder.endswith("_output"):
file_name=folder.strip('_output')
sample_names.append(file_name)
gene_dict = {}
with open(output_directory + folder + "/ReadsPerGene.out.tab") as tabfile:
print("Column " + str(desired_column) + " of " + file_name + " stored")
reader = csv.reader(tabfile, delimiter = "\t")
for row in reader:
gene_dict[row[0]]=row[desired_column]
# store gene_dict for the current file in sample_dict
sample_dict[file_name] = gene_dict
# write sample_dict to output files
# the qc metrics are stored in qc.csv, and the gene counts are stored in raw_counts.csv
with open("raw_counts.csv", "wt") as counts_file, open("qc.csv", "wt") as qc_file:
counts_writer = csv.writer(counts_file)
qc_writer = csv.writer(qc_file)
counts_writer.writerow( ['ensembl_id']+ sample_names )
qc_writer.writerow( ['qc_metric']+ sample_names )
sorted_genes = sorted(list(sample_dict[sample_names[0]].keys()))
for gene in sorted_genes:
output=[gene]
for sample in sample_names:
output.append(sample_dict[sample][gene])
if gene.startswith("N_"):
qc_writer.writerow(output)
else:
counts_writer.writerow(output)
```
The first column of the output file `raw_counts.csv` is the Ensembl ID of each of the genes in the transcriptome. The rest of the columns correspond to the counts per gene for the 8 samples that we mapped. Another file named `qc.csv` is created which contains the qc metrics that were at the beginning of the `ReadsPerGene.out.tab` file for each sample.
## Differential expression analysis
In order to perform differential gene expression analysis, we will be using the R package `DESeq2`. This package provides a set of data normalization and processing tools designed specifically for bulk RNA-seq data and differential gene expression analysis. A great tutorial that goes into the nitty-gritty details of DESeq2 can be found here: https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html. In the sections below, I will use DESeq2 to determine the genes upregulated by hypoxia in both LNCaP and PC3 cells. I also provide custom functions and visualizations to streamline the analysis process and gain further insight into the data.
We can install `DESeq2` from Bioconductor using the following commands:
```{r eval=F}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DESeq2")
```
Once installed, we will load the packages that are required for our analysis. I will be using version 1.22.2 of `DESeq2`. The `tidyverse` will be used to wrangle gene expression tables and create visualizations.
```{r message=F}
library("DESeq2")
library("tidyverse")
```
We can now read in `raw_counts.csv` into R, and take a look at the structure of the data:
```{r}
data <- read.csv("raw_counts.csv", header = T, row.names = "ensembl_id")
data <- data[,sort(colnames(data))]
head(data)
```
We see that there are 60676 rows and 8 columns to the data, in which each row is a gene and each column corresponds to the reads per gene for each sample. We can quantify the total reads per sample by taking the column sums:
```{r}
colSums(data)
```
We observe that each sample has been sequenced at a different "depth", in which there are varying numbers of total reads for each sample. For example, PC3_Hypoxia_S2 has only 12 million reads, which is less than 1/3rd of the reads for PC3_Hypoxia_S1. This means that we cannot directly compare the count values between samples without normalizing each column based on its total read count. After the data have been processed using `DESeq2`, we will have normalized data that we can use to make comparisons.
### Creating the DESeq2 object
To create the DESeq2 object, we can use the function `DESeqDataSetFromMatrix()`, which requires three items:
1. `countData` - The data in the form of an integer matrix, in which the rownames correspond to the gene IDs. This is the `data` object that we created from `raw_counts.csv`.
2. `colData` - A table that provides the sample groups to be compared. It should also assign each sample to a group (i.e. biological replicates).
3. `design` - A formula that dictates how the counts for each gene depend on the variables in `colData`. We will use the groups defined in colData.
To construct the `colData`, we first define the groups and how many samples each group has:
```{r}
# identify biological replicates
condition <- c(rep("LNCAP_Hypoxia", 2), rep("LNCAP_Normoxia", 2), rep("PC3_Hypoxia", 2), rep("PC3_Normoxia", 2))
```
Then, we assign individual sample names (corresponding to the column names of `raw_counts.csv`) to the grouping categories by binding them within a data frame:
```{r}
# assign replicates to each sample name to construct colData
my_colData <- as.data.frame(condition)
rownames(my_colData) <- colnames(data)
my_colData
```
The vector above named `condition` can be used for the `design` argument. Now that we have all the required inputs, we will create a DESeq2 object (usually named `dds`) using `DESeqDataSetFromMatrix()`:
```{r, message=F, warning=F}
dds <- DESeqDataSetFromMatrix(countData = data,
colData = my_colData,
design = ~condition)
```
This can then be processed using the function `DESeq()`, which will perform differential expression analysis based on a negative binomial distribution. From the [documentation](https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html), this function will perform:
1. estimation of size factors: estimateSizeFactors
2. estimation of dispersion: estimateDispersions
3. Negative Binomial GLM fitting and Wald statistics: nbinomWaldTest
```{r message=F, warning=F}
dds <- DESeq(dds)
```
```{}
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
```
We can observe some basic information about the `dds` object just by calling its name:
```{r}
dds
```
We can also sift through the `dds` object using `@` and `$`. For example, we can find the raw counts data this way:
```{r eval=F}
head(dds@assays@data$counts)
```
```{r echo=F}
head(as.data.frame(dds@assays@data$counts))
```
Another way to get the raw counts from the `dds` object is by using the function `counts(dds, normalized = F)`. If you want the normalized counts instead, we set `normalized = T`. For example:
```{r}
normalized_counts <- counts(dds, normalized = T)
```
```{r eval=F}
head(normalized_counts)
```
```{r echo=F}
head(as.data.frame(normalized_counts))
```
As you can see, the normalized counts are different in value than the raw counts, as expected. These normalized counts can be used to compare gene expression values between samples.
### Making an annotation file from BioMart
Right now, the genes are labeled using their Ensembl ID (e.g., ENSG00000111640). It would be nice to have each Ensembl ID translated into its gene name, so that we can search the data for our gene of interest without having to look up the Ensembl ID each time. We will also need to know the gene names for each Ensembl ID when generating visualizations and differential gene lists.
In order to map Ensembl IDs to their gene names, we have to build a genome annotation file using the BioMart at ensembl.org. Here's how to do so:
1. Go to the Ensembl BioMart at http://uswest.ensembl.org/biomart/martview/
2. Choose the database to use, in this case "Ensembl Genes 99".
3. Choose an annotation to use, in this case "Human Genes (GRCh38.p13)".
4. On the lefthand sidebar, click "Attributes", which will allow you to select the items you want to map. For this analysis, you want:
* "Gene stable ID", which is the Ensembl Id in our raw_counts.csv file (e.g. ENSG00000111640)
* "Gene name", which is the most commonly used gene name (e.g. GAPDH)
* "Gene type", which specifies which genes are protein-coding. If we want to perform Gene Set Enrichment Analysis (GSEA), we will need this category.
* You can un-check "Gene stable ID version", "Transcript stable ID", and "Transcript stable ID version".
5. Click on the results button, check "unique results only", select "export as csv", and hit the "Go" button!
The annotation should download to your computer as "mart_export.txt". We can rename it to "GRCh38.p13_annotation.csv", and read it into R:
```{r}
annotation <- read.csv("GRCh38.p13_annotation.csv", header = T, stringsAsFactors = F)
head(annotation)
```
We observe that there are three columns corresponding to the annotations that we selected. To map the gene names to another dataset, we can join two tables based on the Ensembl IDs using a join function from the tidyverse. For example, I can map the annotation to the normalized counts table using `right_join()`:
```{r}
normalized_counts <- rownames_to_column(as.data.frame(normalized_counts), var = "ensembl_id")
annotated_data <- right_join(annotation, normalized_counts, by = c("Gene.stable.ID" = "ensembl_id"))
head(annotated_data)
write.csv(annotated_data, file = "gene_annotated_normalized_counts.csv")
```
Now, the gene names are now placed directly next to the Ensembl IDs. We can export this normalized counts file and provide it to collaborating scientists, who will be able to quickly search for their genes of interest! For examples of how to plot the normalized counts for a given gene across samples, see the section: [Visualizations for DE results].
## Visualizations for sample variability
At this point, we can start to assess sample-to-sample variability. This is important for determining whether you have outliers in your dataset, as well as whether or not your data makes sense. For example, the gene expression patterns of biological/technical replicates should look fairly similar to one another. In this section, we will cover some visualizations that can be used for this purpose. In order to use these functions, we first perform a variance stabilizing transformation on the data using `vst()`:
```{r}
vsd <- vst(dds, blind = TRUE)
```
### 1. Distance Plot
The distance plot calculates the Euclidean distance between the expression values for each individual sample. This is a way to identify which samples are most closely related in terms of their gene expression patterns.
```{r distance-plot}
plotDists = function (vsd.obj) {
sampleDists <- dist(t(assay(vsd.obj)))
sampleDistMatrix <- as.matrix( sampleDists )
rownames(sampleDistMatrix) <- paste( vsd.obj$condition )
colors <- colorRampPalette( rev(RColorBrewer::brewer.pal(9, "Blues")) )(255)
pheatmap::pheatmap(sampleDistMatrix,
clustering_distance_rows = sampleDists,
clustering_distance_cols = sampleDists,
col = colors)
}
plotDists(vsd)
```
We observe that all the LNCaP samples cluster together and all the PC3 samples cluster together. Within each cell type, the hypoxia samples cluster together and the normoxia samples cluster together. These clustering patterns make sense, suggesting that the experiment was successful.
### 2. Variable Genes Heatmap
We can identify the genes that are driving the clustering between samples and display them in a heatmap. To do so, we first obtain the top most variable genes across samples by selecting the genes with the highest `rowVars()` value. Then, we plug the normalized counts for these genes into a heatmap using `pheatmap::pheatmap()`. The heatmap will cluster the samples based on expression similarity as well as display the genes that are associated with each cluster on the right-hand side. Note that the genes are simply chosen based on variability across samples. No statistical tests are performed to determine whether they are actually significantly different between groups.
```{r variable-gene-heatmap-all-samples}
variable_gene_heatmap <- function (vsd.obj, num_genes = 500, annotation, title = "") {
brewer_palette <- "RdBu"
# Ramp the color in order to get the scale.
ramp <- colorRampPalette( RColorBrewer::brewer.pal(11, brewer_palette))
mr <- ramp(256)[256:1]
# get the stabilized counts from the vsd object
stabilized_counts <- assay(vsd.obj)
# calculate the variances by row(gene) to find out which genes are the most variable across the samples.
row_variances <- rowVars(stabilized_counts)
# get the top most variable genes
top_variable_genes <- stabilized_counts[order(row_variances, decreasing=T)[1:num_genes],]
# subtract out the means from each row, leaving the variances for each gene
top_variable_genes <- top_variable_genes - rowMeans(top_variable_genes, na.rm=T)
# replace the ensembl ids with the gene names
gene_names <- annotation$Gene.name[match(rownames(top_variable_genes), annotation$Gene.stable.ID)]
rownames(top_variable_genes) <- gene_names
# reconstruct colData without sizeFactors for heatmap labeling
coldata <- as.data.frame(vsd.obj@colData)
coldata$sizeFactor <- NULL
# draw heatmap using pheatmap
pheatmap::pheatmap(top_variable_genes, color = mr, annotation_col = coldata, fontsize_col = 8, fontsize_row = 250/num_genes, border_color = NA, main = title)
}
variable_gene_heatmap(vsd, num_genes = 40, annotation = annotation)
```
We see that the clustering is very similar to what we observed in the distance plot. There are a distinct set of genes that are specific to either LNCaP or PC3 cells. Given that we are analyzing data from cell lines, it is not surprising that the clustering is so clearly defined. If you are working with highly heterogenous samples (e.g. tissues), there may be significant variability across samples. Knowing the genes that are defining the differences between samples is useful for assessing whether there were sample preparation issues such as poor cell purity or tissue contamination.
### 3. PCA Plot
A principal components plot is another way to observe how diverse the samples are. As always, the samples which are most similar with each other in terms of their gene expression values will be closer to each other on the plot. This plot uses a Cartesian coordinate system, and the axes that are displayed correspond to the top two principal components that explain the majority of the variability in the data.
```{r pca-plot, fig.width=8, fig.height=4}
plot_PCA = function (vsd.obj) {
pcaData <- plotPCA(vsd.obj, intgroup = c("condition"), returnData = T)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color=condition)) +
geom_point(size=3) +
labs(x = paste0("PC1: ",percentVar[1],"% variance"),
y = paste0("PC2: ",percentVar[2],"% variance"),
title = "PCA Plot colored by condition") +
ggrepel::geom_text_repel(aes(label = name), color = "black")
}
plot_PCA(vsd)
```
We observe that The LNCaP samples are furthest separated from the PC3 samples on the x-axis (PC1), which explains a whopping 99% of the variance in the dataset. Within each cell type, the normoxia samples are separated from the hypoxia samples. The amount of variability introduced by hypoxia is small compared to the inherent difference between the LNCaP and PC3 cell lines. If we are interested in the effects of hypoxia on gene expression, we should separate the LNCaP samples from the PC3 samples first, create separate DESeq2 objects, and then determine the effects of hypoxia on each cell line individually. I will cover how to do this in the next section.
## Making subsetted DESeq2 objects
Because a lot of the variability in the data is due to the difference in the background of the cell lines, we must create separate DESeq2 objects for the LNcaP and PC3 samples. The function `generate_DESeq_object()` below will make a DESeq2 object using your raw counts data and the two groups that you specify to compare. It will select the columns out from the raw data that match the group designations using `grep()`, build the `colData` matrix based on your group designations, and generate the DESeq2 object using `DESeqDataSetFromMatrix()` and `DESeq()`.
```{r}
generate_DESeq_object <- function (my_data, groups) {
data_subset1 <- my_data[,grep(str_c("^", groups[1]), colnames(my_data))]
data_subset2 <- my_data[,grep(str_c("^", groups[2]), colnames(my_data))]
my_countData <- cbind(data_subset1, data_subset2)
condition <- c(rep(groups[1],ncol(data_subset1)), rep(groups[2],ncol(data_subset2)))
my_colData <- as.data.frame(condition)
rownames(my_colData) <- colnames(my_countData)
print(my_colData)
dds <- DESeqDataSetFromMatrix(countData = my_countData,
colData = my_colData,
design = ~ condition)
dds <- DESeq(dds, quiet = T)
return(dds)
}
```
The function should print the `colData` that was used to generate each DESeq2 object, so we can double check to make sure that the appropriate sample columns were selected and that the grouping was correct.
```{r, warning=F}
lncap <- generate_DESeq_object(data, c("LNCAP_Hypoxia", "LNCAP_Normoxia"))
pc3 <- generate_DESeq_object(data, c("PC3_Hypoxia", "PC3_Normoxia"))
```
Now that we have separated the LNCaP and PC3 samples, the `variable_gene_heatmap()` should display genes that are enriched in either the hypoxia vs normoxia conditions.
```{r fig.show="hide"}
lncap_vsd <- vst(lncap, blind = T)
pc3_vsd <- vst(pc3, blind = T)
a <- variable_gene_heatmap(lncap_vsd, 30, annotation = annotation, title = "LNCaP variable genes")
b <- variable_gene_heatmap(pc3_vsd, 30, annotation = annotation, title = "PC3 variable genes")
```
```{r variable-gene-heatmap-separate-samples, fig.width=12, fig.height=5}
gridExtra::grid.arrange(a[[4]],b[[4]], nrow = 1)
```
This is a vastly different set of genes than what was observed when we had samples from both cell lines combined. Our differential gene expression analysis should now yield a greater depth of genes for the comparison we care about: hypoxia vs normoxia.
## Extracting DE results
Now that we have assessed sample-to-sample variability and gained confidence that our experiment was performed correctly, we are ready to start answering the most pressing question: *what genes are differentially upregulated or downregulated by hypoxia in each cell line?* In this section, I will explain how to generate differential expression gene lists from DESeq2 objects.
To extract the differentially expressed genes from the DESeq2 object, we will use the `results()` function:
```{r eval=F}
results(lncap, contrast = c("condition", "LNCAP_Hypoxia", "LNCAP_Normoxia"))
```
```{}
## log2 fold change (MLE): condition LNCAP_Hypoxia vs LNCAP_Normoxia
## Wald test p-value: condition LNCAP_Hypoxia vs LNCAP_Normoxia
## DataFrame with 60676 rows and 6 columns
```
```{r echo=F}
head(as.data.frame(results(lncap, contrast = c("condition", "LNCAP_Hypoxia", "LNCAP_Normoxia"))))
```
We observe that the hypoxia condition was compared to the normoxia condition, and that there are columns corresponding to statistics calculated for each gene. The columns that you should pay the most attention to are `baseMean`, `log2FoldChange`, and `padj`.
* The `baseMean` is a measure of the overall expression of the gene. We can assess whether is it a highly-expressed gene or if it is only barely above baseline.
* The `log2FoldChange` measures how much higher the gene is in the hypoxia condition compared to the normoxia condition. Negative values mean the expression is higher in the normoxia group.
* The `padj` column is a measure of how likely the gene is to be significantly different between the two conditions being compared.
We will use these metrics to filter, organize, and export lists of differentially expressed genes. I chose to use the following criteria:
1. Select genes that are signifcantly different based on `padj`, in which the `padj` is lower than a specified cutoff value.
2. Sort by `log2FoldChange` to identify the genes that are most highly up or down.
3. (optional) Filter out genes that are very lowly-expressed (low `baseMean` values) which are not as likely to be biologically relevant. To do so, we can either use `baseMean` or we can calculate another metric called "counts per million" (cpm). The counts per million value is more intuitive because we can use a similar cutoff value across different sequencing experiments.
4. Generate a ranked gene list file (`.rnk`), which is a list of all the genes in the dataset ordered by `log2FoldChange`. This file is required for Gene Set Enrichment Analysis, which I will explain later in the [GSEA section](#GSEA).
Since there are many variations of sorting / filtering the data, I wrote the function `generate_DE_results()` to perform multiple filtering steps and provide output in the form of csv files. I export these files so they can be easily shared and interpreted by other bench scientists without having to read the data into R.
```{r}
generate_DE_results <- function (dds, comparisons, padjcutoff = 0.001, log2cutoff = 0.5, cpmcutoff = 2) {
# generate average counts per million metric from raw count data
raw_counts <- counts(dds, normalized = F)
cpms <- enframe(rowMeans(edgeR::cpm(raw_counts)))
colnames(cpms) <- c("ensembl_id", "avg_cpm")
# extract DESeq results between the comparisons indicated
res <- results(dds, contrast = c("condition", comparisons[1], comparisons[2]))[,-c(3,4)]
# annotate the data with gene name and average counts per million value
res <- as_tibble(res, rownames = "ensembl_id")
# read in the annotation and append it to the data
my_annotation <- read.csv("GRCh38.p13_annotation.csv", header = T, stringsAsFactors = F)
res <- left_join(res, my_annotation, by = c("ensembl_id" = "Gene.stable.ID"))
# append the average cpm value to the results data
res <- left_join(res, cpms, by = c("ensembl_id" = "ensembl_id"))
# combine normalized counts with entire DE list
normalized_counts <- round(counts(dds, normalized = TRUE),3)
pattern <- str_c(comparisons[1], "|", comparisons[2])
combined_data <- as_tibble(cbind(res, normalized_counts[,grep(pattern, colnames(normalized_counts))] ))
combined_data <- combined_data[order(combined_data$log2FoldChange, decreasing = T),]
# make ordered rank file for GSEA, selecting only protein coding genes
res_prot <- res[which(res$Gene.type == "protein_coding"),]
res_prot_ranked <- res_prot[order(res_prot$log2FoldChange, decreasing = T),c("Gene.name", "log2FoldChange")]
res_prot_ranked <- na.omit(res_prot_ranked)
res_prot_ranked$Gene.name <- str_to_upper(res_prot_ranked$Gene.name)
# generate sorted lists with the indicated cutoff values
res <- res[order(res$log2FoldChange, decreasing=TRUE ),]
de_genes_padj <- res[which(res$padj < padjcutoff),]
de_genes_log2f <- res[which(abs(res$log2FoldChange) > log2cutoff & res$padj < padjcutoff),]
de_genes_cpm <- res[which(res$avg_cpm > cpmcutoff & res$padj < padjcutoff),]
# write output to files
write.csv (de_genes_padj, file = paste0(comparisons[1], "_vs_", comparisons[2], "_padj_cutoff.csv"), row.names =F)
write.csv (de_genes_log2f, file = paste0(comparisons[1], "_vs_", comparisons[2], "_log2f_cutoff.csv"), row.names =F)
write.csv (de_genes_cpm, file = paste0(comparisons[1], "_vs_", comparisons[2], "_cpm_cutoff.csv"), row.names =F)
write.csv (combined_data, file = paste0(comparisons[1], "_vs_", comparisons[2], "_allgenes.csv"), row.names =F)
write.table (res_prot_ranked, file = paste0(comparisons[1], "_vs_", comparisons[2], "_rank.rnk"), sep = "\t", row.names = F, quote = F)
writeLines( paste0("For the comparison: ", comparisons[1], "_vs_", comparisons[2], ", out of ", nrow(combined_data), " genes, there were: \n",
nrow(de_genes_padj), " genes below padj ", padjcutoff, "\n",
nrow(de_genes_log2f), " genes below padj ", padjcutoff, " and above a log2FoldChange of ", log2cutoff, "\n",
nrow(de_genes_cpm), " genes below padj ", padjcutoff, " and above an avg cpm of ", cpmcutoff, "\n",
"Gene lists ordered by log2fchange with the cutoffs above have been generated.") )
gene_count <- tibble (cutoff_parameter = c("padj", "log2fc", "avg_cpm" ),
cutoff_value = c(padjcutoff, log2cutoff, cpmcutoff),
signif_genes = c(nrow(de_genes_padj), nrow(de_genes_log2f), nrow(de_genes_cpm)))
invisible(gene_count)
}
```
After running the function, the console should display the number of genes that passed each filtering critera:
```{r}
lncap_output <- generate_DE_results (lncap, c("LNCAP_Hypoxia", "LNCAP_Normoxia"))
pc3_output <- generate_DE_results(pc3, c("PC3_Hypoxia", "PC3_Normoxia"))
```
We should also see a set of csv files show up in the same folder as the R script. We can read in the output files with `read.csv()` and observe what the file structure is like:
```{r}
res <- read.csv("LNCAP_Hypoxia_vs_LNCAP_Normoxia_allgenes.csv", header = T)
head(res)
```
The `_allgenes.csv` file will be used in the next section for visualizations. It is an aggregate of all the data we have curated so far: the `results()` output, the gene annotations, and the normalized counts for every gene. We can further organize/filter the data for the visualizations based on user preferences. The other csv files, such as `_padj_cutoff.csv`, can be distributed to researchers who are only interested in the gene lists.
## Visualizations for DE results
Now that we have extracted the results from our DESeq2 object, we can start visualizing the differentially expressed genes. In this section, I will showcase some plotting functions that I wrote to answer questions such as:
1. How is a given gene of interest expressed across groups?
2. What is the sample-to-sample variability for the top differentially expressed genes?
3. Are there more upregulated or downregulated genes?
4. If we are performing multiple comparisons, which differentially expressed genes are shared?
5. What gene pathways are enriched in one group vs another?
### 1. PlotCounts (upgraded)
We are often interested in plotting the normalized counts for certain genes of interest, in order to get a sense of the spread of the data and how genes are expressed across each of the groups. This is especially of interest for genes that show up as differentially expressed in our DESeq2 results. A way to plot the normalized counts for each sample is by using the built-in `DESeq2::plotCounts()` function. However, this function leaves *a lot* to be desired. As an example, I plot the gene IGFBP1 (a hypoxia-inducible gene) using its Ensembl ID below.
```{r default-plot-counts-baseR, fig.width=6, fig.height=4}
plotCounts(dds, gene="ENSG00000146678", intgroup="condition")
```
As you can see, the sample names aren't even all displayed if their names are too long. We can instead use ggplot2 to graph the data, by extracting the normalized counts using `returnData = TRUE`:
```{r default-plot-counts-ggplot2, fig.width=6, fig.height=4}
d <- plotCounts(dds, gene="ENSG00000146678", intgroup="condition", returnData=TRUE)
d
ggplot(d, aes(x=condition, y=count)) +
geom_point(position=position_jitter(w=0.1,h=0), aes (color = condition)) +
scale_y_log10(breaks=c(25,100,400))
```
However, this is still very frustrating to work with, since we have to input the Ensembl ID instead of a gene name. Researchers typically want to search by gene name, and this function forces us to look up the Ensembl ID each time. To solve this problem, I wrote my own function `plot_counts()` to accept either the gene name, Ensembl ID, or an index such as `which.min(res$padj)`. I also provide two ways to normalize the data: by counts per million (cpm) or using the built-in DESeq2 normalized counts.
```{r upgraded-plot-counts, fig.width=6, fig.height=3}
plot_counts <- function (dds, gene, normalization = "DESeq2"){
# read in the annotation file
annotation <- read.csv("GRCh38.p13_annotation.csv", header = T, stringsAsFactors = F)
# obtain normalized data
if (normalization == "cpm") {
normalized_data <- cpm(counts(dds, normalized = F)) # normalize the raw data by counts per million
} else if (normalization == "DESeq2")
normalized_data <- counts(dds, normalized = T) # use DESeq2 normalized counts
# get sample groups from colData
condition <- dds@colData$condition
# get the gene name from the ensembl id
if (is.numeric(gene)) { # check if an index is supplied or if ensembl_id is supplied
if (gene%%1==0 )
ensembl_id <- rownames(normalized_data)[gene]
else
stop("Invalid index supplied.")
} else if (gene %in% annotation$Gene.name){ # check if a gene name is supplied
ensembl_id <- annotation$Gene.stable.ID[which(annotation$Gene.name == gene)]
} else if (gene %in% annotation$Gene.stable.ID){
ensembl_id <- gene
} else {
stop("Gene not found. Check spelling.")
}
expression <- normalized_data[ensembl_id,]
gene_name <- annotation$Gene.name[which(annotation$Gene.stable.ID == ensembl_id)]
# construct a tibble with the grouping and expression
gene_tib <- tibble(condition = condition, expression = expression)
ggplot(gene_tib, aes(x = condition, y = expression))+
geom_boxplot(outlier.size = NULL)+
geom_point()+
labs (title = paste0("Expression of ", gene_name, " - ", ensembl_id), x = "group", y = paste0("Normalized expression (", normalization , ")"))+
theme(axis.text.x = element_text(size = 11), axis.text.y = element_text(size = 11))
}
plot_counts(dds, "IGFBP1")
```
We observe that IGFBP1 is elevated in the the hypoxia conditions relative to the respective normoxia controls. Interestingly, the upregulation of IGFBP1 is much higher in PC3 cells than in LNCaP cells. This is a nice example of how this plot is useful to determine relative expression levels between samples.
### 2. Differential gene heatmap
The most common heatmap used to display RNA-seq results is the differential gene heatmap. This is a heatmap where the normalized values for each sample are plotted for the top most upregulated genes which are significantly different between the comparison groups. Starting from the results object, we first filter the results by `padj` and then take the genes which have the greatest `log2FoldChange` value. The normalized counts for each gene is scaled by row (across samples) in order to emphasize the difference between the comparison groups (hypoxia vs normoxia). The purpose of this visualization is to give a sense of how variable the data are from sample to sample, as well as to quickly examine the top differential genes. I wrote the function `DE_gene_heatmap()` to generate this plot, using `pheatmap::pheatmap()` with the `scale = "row"` argument.
```{r differential-gene-heatmap}
res <- read.csv ("LNCAP_Hypoxia_vs_LNCAP_Normoxia_allgenes.csv", header = T)
DE_gene_heatmap <- function(res, padj_cutoff = 0.0001, ngenes = 20) {
# generate the color palette
brewer_palette <- "RdBu"
ramp <- colorRampPalette(RColorBrewer::brewer.pal(11, brewer_palette))
mr <- ramp(256)[256:1]
# obtain the significant genes and order by log2FoldChange
significant_genes <- res %>% filter(padj < padj_cutoff) %>% arrange (desc(log2FoldChange)) %>% head (ngenes)
heatmap_values <- as.matrix(significant_genes[,-c(1:8)])
rownames(heatmap_values) <- significant_genes$Gene.name
# plot the heatmap using pheatmap
pheatmap::pheatmap(heatmap_values, color = mr, scale = "row", fontsize_col = 10, fontsize_row = 200/ngenes, fontsize = 5, border_color = NA)
}
DE_gene_heatmap(res, 0.001, 30)
```
We observe that expression of the differential genes is very consistent within groups. We are also able to identify that LNCAP_Hypoxia_S1 has higher expression of the top ~10 genes than LNCAP_Hypoxia_S2.
### 3. Volcano Plot
Another common visualization for bulk RNA-seq is the volcano plot, which is a scatterplot that displays the `log2FoldChange` value vs the `-log(padj)` value for each gene in the comparison. We can use this plot to quickly identify genes that are highly upregulated or downregulated, or identify genes that have highly significant p-values. Below is an example volcano plot showing the most significant genes in the hypoxia vs normoxia comparison for LNCaP cells.
```{r volcano-plot}
res <- read.csv ("LNCAP_Hypoxia_vs_LNCAP_Normoxia_allgenes.csv", header = T)
plot_volcano <- function (res, padj_cutoff, nlabel = 10, label.by = "padj"){
# assign significance to results based on padj
res <- mutate(res, significance=ifelse(res$padj<padj_cutoff, paste0("padj < ", padj_cutoff), paste0("padj > ", padj_cutoff)))
res = res[!is.na(res$significance),]
significant_genes <- res %>% filter(significance == paste0("padj < ", padj_cutoff))
# get labels for the highest or lowest genes according to either padj or log2FoldChange
if (label.by == "padj") {
top_genes <- significant_genes %>% arrange(padj) %>% head(nlabel)
bottom_genes <- significant_genes %>% filter (log2FoldChange < 0) %>% arrange(padj) %>% head (nlabel)
} else if (label.by == "log2FoldChange") {
top_genes <- head(arrange(significant_genes, desc(log2FoldChange)),nlabel)
bottom_genes <- head(arrange(significant_genes, log2FoldChange),nlabel)
} else
stop ("Invalid label.by argument. Choose either padj or log2FoldChange.")
ggplot(res, aes(log2FoldChange, -log(padj))) +
geom_point(aes(col=significance)) +
scale_color_manual(values=c("red", "black")) +
ggrepel::geom_text_repel(data=top_genes, aes(label=head(Gene.name,nlabel)), size = 3)+
ggrepel::geom_text_repel(data=bottom_genes, aes(label=head(Gene.name,nlabel)), color = "#619CFF", size = 3)+
labs ( x = "Log2FoldChange", y = "-(Log normalized p-value)")+
geom_vline(xintercept = 0, linetype = "dotted")+
theme_minimal()
}
plot_volcano(res, 0.0005, nlabel = 15, label.by = "padj")
```
The downregulated genes are labeled in a different color compared to the upregulated genes. We observe that there are a lot more significantly upregulated genes than there are downregulated genes. Given that we are looking for gene expression changes in cells grown in hypoxic conditions compared to normoxic conditions, this suggests that cells may be ramping up transcription of response pathways.
### 4. LogFoldChange comparison plot
Given that we have performed differential gene expression analysis on two different cell lines, we would like to compare the two lines in terms of how their gene expression responds to hypoxia. We want to answer questions such as:
1. Are the genes that are induced by hypoxia similar in both cell lines?
2. Which significant genes are shared by both cell lines and which genes are unique to each cell line?
To do so, we can take the significant genes from the hypoxia vs normoxia comparison for both lines and generate a scatterplot of their `log2FoldChange` values using the custom function below, `compare_significant_genes()`. This function will display the overlap in the gene lists by color coding significant genes that are shared by both cell lines, as well as those that are unique to each individual cell line.
```{r logfoldchange-comparison-plot}
res1 <- read.csv ("LNCAP_Hypoxia_vs_LNCAP_Normoxia_allgenes.csv", header = T)
res2 <- read.csv ("PC3_Hypoxia_vs_PC3_Normoxia_allgenes.csv", header = T)
compare_significant_genes <- function (res1, res2, padj_cutoff=0.0001, ngenes=250, nlabel=10, samplenames=c("comparison1", "comparison2"), title = "" ) {
# get list of most upregulated or downregulated genes for each results table
genes1 <- rbind(head(res1[which(res1$padj < padj_cutoff),], ngenes), tail(res1[which(res1$padj < padj_cutoff),], ngenes))
genes2 <- rbind(head(res2[which(res2$padj < padj_cutoff),], ngenes), tail(res2[which(res2$padj < padj_cutoff),], ngenes))
# combine the data from both tables
de_union <- union(genes1$ensembl_id,genes2$ensembl_id)
res1_union <- res1[match(de_union, res1$ensembl_id),][c("ensembl_id", "log2FoldChange", "Gene.name")]
res2_union <- res2[match(de_union, res2$ensembl_id),][c("ensembl_id", "log2FoldChange", "Gene.name")]
combined <- left_join(res1_union, res2_union, by = "ensembl_id", suffix = samplenames )
# identify overlap between genes in both tables
combined$de_condition <- 1 # makes a placeholder column
combined$de_condition[which(combined$ensembl_id %in% intersect(genes1$ensembl_id,genes2$ensembl_id))] <- "Significant in Both"
combined$de_condition[which(combined$ensembl_id %in% setdiff(genes1$ensembl_id,genes2$ensembl_id))] <- paste0("Significant in ", samplenames[1])
combined$de_condition[which(combined$ensembl_id %in% setdiff(genes2$ensembl_id,genes1$ensembl_id))] <- paste0("Significant in ", samplenames[2])
combined[is.na(combined)] <- 0
# find the top most genes within each condition to label on the graph
label1 <- rbind(head(combined[which(combined$de_condition==paste0("Significant in ", samplenames[1])),],nlabel),
tail(combined[which(combined$de_condition==paste0("Significant in ", samplenames[1])),],nlabel))
label2 <- rbind(head(combined[which(combined$de_condition==paste0("Significant in ", samplenames[2])),],nlabel),
tail(combined[which(combined$de_condition==paste0("Significant in ", samplenames[2])),],nlabel))
label3 <- rbind(head(combined[which(combined$de_condition=="Significant in Both"),],nlabel),
tail(combined[which(combined$de_condition=="Significant in Both"),],nlabel))
combined_labels <- rbind(label1,label2,label3)
# plot the genes based on log2FoldChange, color coded by significance
ggplot(combined, aes_string(x = paste0("log2FoldChange", samplenames[1]), y = paste0("log2FoldChange", samplenames[2]) )) +
geom_point(aes(color = de_condition), size = 0.7)+
scale_color_manual(values= c("#00BA38", "#619CFF", "#F8766D"))+
ggrepel::geom_text_repel(data= combined_labels, aes_string(label=paste0("Gene.name", samplenames[1]), color = "de_condition"), show.legend = F, size=3)+
geom_vline(xintercept = c(0,0), size = 0.3, linetype = 2)+
geom_hline(yintercept = c(0,0), size = 0.3, linetype = 2)+
labs(title = title,x = paste0("log2FoldChange in ", samplenames[1]), y = paste0("log2FoldChange in ", samplenames[2]))+
theme_minimal()+
theme(legend.title = element_blank())
}
compare_significant_genes(res1,res2, samplenames = c("LNCaP", "PC3"), title = "Hypoxia-induced gene expression differences in LNCaP vs PC3 cells")
```
We observe that there are indeed a small subset of genes that are differentially expressed in both the LNCaP as well as PC3 comparisons (green labels). It is satisfying to see that the majority of these shared genes fall along the x = y diagonal. For example, genes such as CA9, IGFBP5, STC1, and PPFIA4 are greatly induced by hypoxia in both lines. These genes are more likely to be real indicators of the hypoxia response. In contrast, the genes that are significant in only one cell line mostly lie on the vertical or horizontal axis. These genes are likely associated with responses specific to the background of each cell line, and are unlikely to be universal indicators of hypoxia.
### 5. Gene set enrichment analysis {#GSEA}
Gene Set Enrichment Analysis (GSEA) can be used to test whether pathways consisting of multiple genes are enriched in one group compared to another. For example, there is a "gene set" of 200 genes that composes the hypoxia pathway. We would like to determine whether the genes associated with the hypoxia pathway are collectively upregulated in the hypoxia condition, as a way to verify that the experiment was performed correctly. We also want to determine which other pathways hypoxia is affecting inside the cell.
To perform GSEA, we have to rank every gene available in terms of how much higher or lower it is in group A vs group B. This is called a ranked list, and we have already generated one in the above section using the function `generate_DE_results()`. We can plug this ranked list into either the GSEA application from the Broad Institute (https://www.gsea-msigdb.org/gsea/index.jsp), or into the R package `fgsea` from Bioconductor (https://bioconductor.org/packages/release/bioc/html/fgsea.html). If you choose to use the GSEA application rather than the R package, the ranked lists have to be in a very specific format. The genes must be all upper-case, and file should contain one column for the gene name and another column with the corresponding logfoldchange. Here, we will be using the R package `fgsea` to perform GSEA.
The R package `fgsea` can be installed from Bioconductor using the commands:
```{r eval=F}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("fgsea")
```
We also need to download the pathway files from the GSEA MSigDB webpage at: https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp and load them in using `fgsea::gmtPathways()`. I will be working with the HALLMARK pathway set.
```{r message=F, warning=F}
library(fgsea)
# read in file containing lists of genes for each pathway
hallmark_pathway <- gmtPathways("h.all.v7.0.symbols.gmt.txt")
head(names(hallmark_pathway))
```
The pathway file is loaded as a list of gene sets. We can access the genes for each pathway by selecting the pathway name using `$`:
```{r}
head(hallmark_pathway$HALLMARK_HYPOXIA, 20)
```
After loading the pathways, we have to turn our ranked list into a vector, in which the `log2FoldChange` value is named with the gene name. We also need to get rid of any NA values or duplicate gene entries. The custom function `prepare_ranked_list()` will perform these operations.
```{r}
# load the ranked list
lncap_ranked_list <- read.table("LNCAP_Hypoxia_vs_LNCAP_Normoxia_rank.rnk", header = T, stringsAsFactors = F)
head(lncap_ranked_list)
# formats the ranked list for the fgsea() function
prepare_ranked_list <- function(ranked_list) {
# if duplicate gene names present, average the values
if( sum(duplicated(ranked_list$Gene.name)) > 0) {
ranked_list <- aggregate(.~Gene.name, FUN = mean, data = ranked_list)
ranked_list <- ranked_list[order(ranked_list$log2FoldChange, decreasing = T),]
}
# omit rows with NA values
ranked_list <- na.omit(ranked_list)
# turn the dataframe into a named vector
ranked_list <- tibble::deframe(ranked_list)
ranked_list
}
lncap_ranked_list <- prepare_ranked_list(lncap_ranked_list)
head(lncap_ranked_list)
```
Now that we have the named vector, we can plug it into the `fgsea()` function along with the hallmark pathways object. This will generate a table of results containing the enrichment scores associated with each pathway.
```{r warning=F}
# generate GSEA results table using fgsea() by inputting the pathway list and ranked list
fgsea_results <- fgsea(pathways = hallmark_pathway,
stats = lncap_ranked_list,
minSize = 15,
maxSize = 500,
nperm= 1000)
fgsea_results %>% arrange (desc(NES)) %>% select (pathway, padj, NES) %>% head()
```
The normalized enrichment scores (NES) tell us how much more enriched the pathway is in the hypoxia samples compared to the normoxia samples. As expected, we observe that the most enriched pathway in response to hypoxia is the HALLMARK_HYPOXIA pathway.
We can visualize the statistics for each pathway using a "waterfall" plot, which is a sideways bar plot of normalized enrichment scores for each of the pathways, color-coded by significance. This plot is great for quickly identifying the significantly enriched pathways.
```{r gsea-waterfall-plot}
waterfall_plot <- function (fsgea_results, graph_title) {
fgsea_results %>%
mutate(short_name = str_split_fixed(pathway, "_",2)[,2])%>% # removes 'HALLMARK_' from the pathway title
ggplot( aes(reorder(short_name,NES), NES)) +
geom_bar(stat= "identity", aes(fill = padj<0.05))+
coord_flip()+
labs(x = "Hallmark Pathway", y = "Normalized Enrichment Score", title = graph_title)+
theme(axis.text.y = element_text(size = 7),
plot.title = element_text(hjust = 1))
}
waterfall_plot(fgsea_results, "Hallmark pathways altered by hypoxia in LNCaP cells")
```
In addition to the hypoxia pathway, the androgen response and MTORC1 signaling pathways also appear to be significantly enriched in hypoxic conditions. Some interesting metabolic changes are also occuring. Similar to the Warburg effect, we observe that the glycolysis pathway is greatly enriched in the hypoxia condition, and that oxidative phosphorylation is negatively enriched. Since tumors become progressively hypoxic during cancer progression, this data may suggest that the hypoxic conditions are a driver of the metabolic switch from oxidative phosphorylation to glycolysis as an energy source.
Interestingly, the interferon response pathways are both negatively enriched in hypoxic conditions, suggesting that hypoxic cells may be less able to respond to interferons. Given that the cells were cultured in vitro (an environment where interferons are probably not present), I find this result unexpected.
We can also narrow in on specific pathways of interest by plotting "enrichment curves" using the function `fgsea::plotEnrichment()`. In these plots, the black ticks on the x-axis indicate a gene in the pathway, and the green curve is a measure of how enriched the genes are for either the hypoxia group (left side) or the normoxia group (right side). Example curves for pathways highly enriched in either hypoxia or normoxia are shown below.
```{r gsea-enrichment-plot, fig.width=5, fig.height=3}
# wrapper for fgsea::plotEnrichment()
plot_enrichment <- function (geneset, pathway, ranked_list) {
plotEnrichment(geneset[[pathway]], ranked_list)+labs (title = pathway)
}
# example of positively enriched pathway (up in Hypoxia)
plot_enrichment(hallmark_pathway, "HALLMARK_GLYCOLYSIS" , lncap_ranked_list)
# example of negatively enriched pathway (down in Hypoxia)
plot_enrichment(hallmark_pathway, "HALLMARK_OXIDATIVE_PHOSPHORYLATION" , lncap_ranked_list)
```