-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathMakeSummaries.Rmd
268 lines (220 loc) · 7.93 KB
/
MakeSummaries.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
---
title: "Make GO Summaries"
author: "Steve Pederson"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
warning = FALSE, message = FALSE)
```
## Introduction
```{r loadPackages}
library(GO.db)
library(graph)
library(dnet)
library(magrittr)
library(tidyverse)
library(pander)
library(scales)
library(plotly)
library(limma)
```
The aim of this page is to keep an up-to-date summary of the Gene Ontology database, as represented in the R package `GO.db`.
The information contained for each GO term is:
- Which Ontology the GO term belongs to
- The *shortest path* back to the root node of the relevant ontology
- The *longest path* back to the root node of the relevant ontology
- Whether the GO term is a *terminal node* (i.e. it has no children)
The main object created is available for download [here](https://uofabioinformaticshub.github.io/summaries2GO/data/goSummaries.RDS) and this will load as a `tibble` after using the function `read_rds()` (from the package `readr`).
Alternatively, you can load this directly into your workspace using:
```{r, eval=FALSE}
goSummaries <- url("https://uofabioinformaticshub.github.io/summaries2GO/data/goSummaries.RDS") %>%
readRDS()
```
## Create the Basic Graphs
The package `AnnotationDbi` contains the function `makeGOGraph` which returns a `graphNEL` graph.
In the following, we will:
1. Create a graph for each ontology and remove the node `"all"` as this is essentially redundant
2. Reverse the direction of the DAG for compatibility with tools in the package [`dnet`](https://cran.r-project.org/web/packages/dnet/index.html). Whilst hosted on CRAN, this package Depends on the bioconductor package [`supraHex`](https://www.bioconductor.org/packages/release/bioc/html/supraHex.html) and should be installed using `BiocManager::install("dnet")`
Note that the following code can take several minutes to run as these are large graphs.
```{r makeGraphs, cache=TRUE}
graphs <- c(BP = "bp", CC = "cc", MF = "mf") %>%
lapply(makeGOGraph) %>%
lapply(function(x){removeNode("all", x)}) %>%
lapply(dDAGreverse)
```
```{r summariseGraphs, echo=FALSE}
tibble(
Ontology = names(graphs),
Nodes = vapply(graphs, numNodes, integer(1)),
Edges = vapply(graphs, function(x){
length(unlist(edgeL(x)))
}, integer(1))
) %>%
pander(
caption = "Summary of graph sizes for each ontology",
big.mark = ","
)
```
## Find the Key Information
```{r goSummaries, cache=TRUE}
goSummaries <- lapply(graphs, function(x){
lng <- dDAGlevel(x, "longest_path") - 1
shrt <- dDAGlevel(x, "shortest_path") - 1
tips <- dDAGtip(x)
tibble(
id = unique(c(names(lng), names(shrt))),
shortest_path = shrt,
longest_path = lng,
terminal_node = id %in% tips
)
}) %>%
bind_rows() %>%
mutate(ontology = Ontology(id))
```
```{r summarisePaths, echo=FALSE, fig.cap = "Path lengths for each ontology, based on whether a term is a terminal node or not."}
goSummaries %>%
gather(key = "type", value = "path", contains("path")) %>%
ggplot(aes(terminal_node, path, fill = type)) +
geom_boxplot() +
facet_grid(type~ontology, scales = "free") +
labs(y = "Path Length") +
theme_bw()
```
```{r cumulativeTerms, echo=FALSE, fig.cap = "Cumulative number of GO Terms with paths $\\geq$ x."}
goSummaries %>%
gather(key = "type", value = "path", contains("path")) %>%
arrange(path) %>%
group_by(ontology, type, path) %>%
tally() %>%
mutate(n = cumsum(n)) %>%
ggplot(aes(path, n, colour = ontology, linetype = type)) +
geom_line() +
labs(x = "Path Length",
y = "Cumulative Number of GO Terms") +
scale_y_continuous(labels = comma) +
theme_bw()
```
## Examples
In reality, we can just add this table to our GO analysis table from tools like `goana()` and use it to filter results before adjusting p-values.
As an example of how to use this to assist our decision making, if we chose to remove GO terms with a shortest path $\leq$ 4, we can see how many terms we would keep and retain.
```{r}
ggplotly(goSummaries %>%
mutate(keep = shortest_path > 4) %>%
ggplot(aes(keep, fill = terminal_node)) +
geom_bar() +
facet_wrap(~ontology, nrow = 1) +
scale_y_continuous(labels = comma) +
labs(x = "Term Retained", y = "Total") +
theme_bw())
```
```{r}
goSummaries %>%
mutate(keep = shortest_path > 4) %>%
group_by(ontology, terminal_node, keep) %>%
tally() %>%
spread(key = keep, value = n) %>%
rename(Discard = `FALSE`,
Retain = `TRUE`) %>%
bind_rows(
tibble(ontology = "**Total**",
Discard = sum(.$Discard),
Retain = sum(.$Retain))) %>%
pander(big.mark = ",",
justify = "llrr")
```
Alternatively, we could remove GO terms with a longest path back to the root node is $\leq 5$ steps.
```{r}
ggplotly(goSummaries %>%
mutate(keep = longest_path > 5) %>%
ggplot(aes(keep, fill = terminal_node)) +
geom_bar() +
facet_wrap(~ontology, nrow = 1) +
scale_y_continuous(labels = comma) +
labs(x = "Term Retained", y = "Total") +
theme_bw())
```
```{r}
goSummaries %>%
mutate(keep = longest_path > 5) %>%
group_by(ontology, terminal_node, keep) %>%
tally() %>%
spread(key = keep, value = n) %>%
rename(Discard = `FALSE`,
Retain = `TRUE`) %>%
bind_rows(
tibble(ontology = "**Total**",
Discard = sum(.$Discard),
Retain = sum(.$Retain))) %>%
pander(big.mark = ",",
justify = "llrr")
```
```{r, echo=FALSE}
write_rds(goSummaries, "data/goSummaries.RDS")
```
The summaries obtained above can be downloaded from [here](https://uofabioinformaticshub.github.io/summaries2GO/data/goSummaries.RDS).
Place this in the appropriate folder and the object can then be imported using `read_rds("path/to/goSummaries.RDS")`
## Use with `goana`
As an artificial example, let's use a subset of the genes associated with `GO:0005795` (Golgi Stack) as a pretend set of DE genes.
```{r}
library(org.Hs.eg.db)
set.seed(65)
myDE <- get("GO:0005795", org.Hs.egGO2ALLEGS) %>% sample(50)
goResults <- goana(myDE, species = "Hs")
```
If we leave these as is, there will be a few high-level GO terms which are relatively meaningless, as well as GO terms not present in our DE genes.
These can easily be seen at the top of [the Ancestor Chart for this term](https://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0005795)
```{r}
goResults %>%
rownames_to_column("id") %>%
as_tibble() %>%
arrange(P.DE) %>%
mutate(adjP = p.adjust(P.DE, "bonferroni"),
FDR = p.adjust(P.DE, "fdr")) %>%
slice(1:10) %>%
pander(caption = "Unfiltered results for GO analysis",
justify = "lllrrrrr",
split.tables = Inf)
```
I usually like to rearrange mine as a `tibble` and do some filtering as well.
In the following we'll:
- Arrange by P value
- Add the information from goSummaries
- Remove GO terms not represented in our DE genes
- Restrict our results to a specific subset based on distance from the root node
- Adjust p-values
```{r filteredGO}
filteredGO <- goResults %>%
rownames_to_column("id") %>%
as_tibble() %>%
arrange(P.DE) %>%
left_join(goSummaries) %>%
filter(DE > 0) %>%
filter(shortest_path > 4) %>%
mutate(adjP = p.adjust(P.DE, "bonferroni"),
FDR = p.adjust(P.DE, "fdr"))
```
Now we have removed some of the redundant terms, we can just look at the top 10.
```{r, echo=FALSE}
filteredGO %>%
dplyr::select(id, Term, Ont, N, DE, P.DE, adjP, FDR) %>%
slice(1:10) %>%
pander(caption = "Top 10 terms for GO analysis after filtering",
justify = "lllrrrrr",
split.tables = Inf)
```
Or we could find which terminal nodes are significant, as these may be especially informative.
```{r}
filteredGO %>%
filter(adjP < 0.05, terminal_node) %>%
dplyr::select(id, Term, Ont, N, DE, P.DE, adjP, FDR) %>%
pander(caption = "Significant GO terms with no Child Terms",
justify = "lllrrrrr",
split.tables = Inf)
```
Note that we just randomly grabbed 50 genes from `GO:0005795`, so this may also highlight issues with the general GO analytic approach and being cautious about low-level terms with only a small number of genes.
## Session Info
```{r sessionInfo, echo=FALSE}
pander(sessionInfo())
```