-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtidy-intro-talk.qmd
729 lines (538 loc) · 15 KB
/
tidy-intro-talk.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
---
title: "Tidyomics: Enabling Tidy Data Analysis for Complex Biological Data"
author:
- name: "Michael Love"
affiliation: "Biostatistics & Genetics, UNC-Chapel Hill"
format:
revealjs:
theme: simple
slideNumber: true
self-contained: false
highlight-style: arrow
include-in-header:
- text: |
<style>
#title-slide .title {
font-size: 1.5em
}
.cell-output {
margin-top: 20px;
}
</style>
---
## Tidyomics project
An open, open-source project spanning multiple R packages, and
developers from around the world. Operates within the Bioconductor
Project, with separate funding and organization. Coordination of
development via GitHub Projects.
- <https://github.com/tidyomics>
- [Announcement
paper](https://www.biorxiv.org/content/10.1101/2023.09.10.557072v2)
- `#tidiness_in_bioc` channel in Bioconductor Slack
## What is "tidy data"
- One row per observation, one column per variable.
- Genomic regions (BED) already in this format.
- Matrices and annotated matrices are not.
## A grammar of data manipulation
```{=html}
<style>
.small-text ul {
font-size: 0.8em;
}
</style>
```
In the R package *dplyr*:
::: small-text
- `mutate()` adds new variables that are functions of existing
variables.
- `select()` picks variables based on their names.
- `filter()` picks cases based on their values.
- `slice()` picks cases based on their position.
- `summarize()` reduces multiple values down to a single summary.
- `arrange()` changes the ordering of the rows.
- `group_by()` perform any operation by group.
:::
<https://dplyr.tidyverse.org/>
## The pipe
``` bash
command | command | command > output.txt
```
> "Pipes rank alongside the hierarchical file system and regular
> expressions as one of the most powerful yet elegant features of
> Unix-like operating systems."
<http://www.linfo.org/pipe.html>
In R we use `%>%` or `|>` instead of `|` to chain operations.
## Tidyomics = *dplyr* verbs applied to omics
- Genomic regions (called "ranges")
- Matrix data, e.g. gene expression over samples or cells
- Genomic interactions, e.g. DNA loops
- And more...
## Diagram of tidyomics workflows
![](figures/figure2.png){fig-align="center"
fig-alt="Diagram of how packages share a similar grammar="}
## International development team
![](figures/tidyomics_community.png){fig-align="center" width="300"
fig-alt="Diagram of tidyomics community="}
## Keep data organized in *objects*
- We typically have more information than just a matrix
- Row and column information (on features and samples)
- Annotated matrix data, i.e. python's `xarray`, `AnnData`
- Metadata: organism, genome build, annotation release
- 2002: *ExpressionSet*; 2011: *SummarizedExperiment*
- *Endomorphic functions*: `x |> f -> x`
## SummarizedExperiment, "SE"
A *SummarizedExperiment* is annotated matrix data.
Imagine a matrix of counts:
```{r}
#| message: false
library(SummarizedExperiment)
set.seed(123) # some random count data
```
```{r}
#| echo: true
counts <- matrix(
rpois(16, lambda=100), ncol=4,
dimnames=list(c("g1","g2","g3","g4"),
c("s1","s2","s3","s4"))
)
counts
```
## Row data and column data
metadata about genes (rows)
```{r}
genes <- DataFrame(
id = c("g1","g2","g3","g4"),
symbol = c("ABC","DEF","GHI","JKL")
)
genes
```
metadata about samples (columns)
```{r}
samples <- DataFrame(
sample = c("s1","s2","s3","s4"),
condition = c("x","y","x","y"),
sex = c("m","m","f","f")
)
samples
```
## Assembled object
```{r}
#| echo: true
se <- SummarizedExperiment(
assays = list(counts = counts),
rowData = genes,
colData = samples,
metadata = list(organism="Hsapiens")
)
se
```
## Deals with bookkeeping issues
Reordering columns propagates to `assay` and `colData`:
```{r}
#| echo: true
se2 <- se[,c(1,3,2,4)]
assay(se2, "counts")
colData(se2)
```
## Deals with bookkeeping issues
Assignment with conflicting metadata gives an error:
```{r}
#| echo: true
#| eval: false
assay(se2) <- counts
# Error:
# please use 'assay(x, withDimnames=FALSE)) <- value' or
# 'assays(x, withDimnames=FALSE)) <- value'
# when the rownames or colnames of the supplied assay(s) are not
# identical to those of the receiving SummarizedExperiment object 'x'
```
- Other such validity checks include comparison across *different
genome builds*.
- Errors triggered by metadata better than silent errors!
## Can be hard for new users
```{r}
#| echo: true
slotNames(se)
methods(class = class(se)) |> head()
```
- An advanced R/Bioconductor user should eventually learn these
methods, class/method inheritance.
- Not needed to explore or visualize data, or make basic data
summaries.
## Beginners know *dplyr* & *ggplot2*
```{r}
#| echo: true
#| message: false
library(dplyr)
# filter to samples in condition 'x'
samples |>
as_tibble() |>
filter(condition == "x")
```
## Enabling *dplyr* verbs for omics
- *tidySummarizedExperiment* package from Mangiola *et al.*
- Printing says: "*SummarizedExperiment-tibble abstraction*"
```{r}
#| echo: true
#| message: false
library(tidySummarizedExperiment)
se
```
## Tidyomics is an API
Interact with native objects using standard R/Bioc methods.
![](figures/counter.png){fig-alt="Counter with a menu and a bell"
fig-align="center"}
## Still a standard Bioc object
```{r}
#| echo: true
class(se)
dim(se)
```
## We can use familiar dplyr verbs
```{r}
#| echo: true
se |>
filter(condition == "x")
```
## We can use familiar dplyr verbs
```{r}
#| echo: true
se |>
filter(condition == "x") |>
colData()
```
## Facilitates quick exploration
```{r}
#| message: false
#| echo: false
library(ggplot2)
theme_set(theme_grey(base_size = 16))
```
```{r ggplot}
#| message: false
#| echo: true
#| fig-align: center
library(ggplot2)
se |>
filter(.feature %in% c("g1","g2")) |> # `.feature` a special name
ggplot(aes(condition, counts, color=sex, group=sex)) +
geom_point(size=2) + geom_line() + facet_wrap(~.feature)
```
## Seurat and SingleCellExperiment
*SingleCellExperiment* has slots for e.g. reduced dimensions.
```{r}
#| eval: false
#| echo: false
library(tidySingleCellExperiment)
library(scales)
sce <- readRDS("data/tidyomicsWorkshopSCE.rds")
```
```{r umap}
#| eval: false
#| echo: true
sce |>
filter(Phase == "G1") |>
ggplot(aes(UMAP_1, UMAP_2, color=nCount_RNA)) +
geom_point() + scale_color_gradient2(trans="log10")
```
![](figures/umap-1.png){fig-align="center"}
## Efficient operation on SE with `plyxp`
- Justin Landis (UNC BCB) noticed some opportunities for more
efficient operations.
- Also, reduce ambiguity, allow multiple ways to access data across
context. Leveraging data masks from *rlang*.
- Developed in Summer 2024: `plyxp`, stand-alone but also as a
*tidySummarizedExperiment* engine.
![](figures/plyxp.png){fig-align="center"}
## Efficient operation on SE with `plyxp`
![](figures/plyxp-bindings.png){fig-align="center"}
## Suppose the following SE object
```{r}
#| echo: false
#| message: false
library(SummarizedExperiment)
simple <- SummarizedExperiment(
list(counts = matrix(1:12,nrow=3)),
colData = data.frame(type = factor(c(1,1,2,2)),
row.names = letters[1:4]),
rowData = data.frame(length = c(10,20,30),
row.names = LETTERS[1:3])
)
```
```{r}
#| echo: true
assay(simple)
colData(simple) |> as.data.frame()
rowData(simple) |> as.data.frame()
```
## Trigger methods with `new_plyxp()`
```{r}
#| echo: true
#| message: false
library(plyxp)
xp <- simple |>
new_plyxp()
xp
```
## Mutate with `plyxp`
```{r}
#| echo: true
xp |> # assay is default context for mutate()
mutate(norm_counts = counts / .rows$length)
```
## What was that `.rows$` doing?
![](figures/plxyp-pronouns.png){fig-align="center"}
## Grammar for group and summarize
```{r}
#| echo: true
summed <- xp |>
group_by(cols(type)) |>
summarize(sum = rowSums(counts))
summed
assay(summed)
```
## Multiple implementations
```{r}
#| echo: true
xp |> # .assays_asis gives direct access to the matrix
mutate(rows(ave_counts = rowMeans(.assays_asis$counts)))
```
- See vignette for more: <https://jtlandis.github.io/plyxp/>
## The `plyxp` class
A simple wrapper of SE:
```{r}
#| echo: true
class(xp)
colData(xp)
```
## Unwrapping `plyxp`
```{r}
#| echo: true
xp |> se()
```
## Computing on genomic ranges
- Translate biological questions about the genome into code.
- Leverage familiar concepts of filters, joins, grouping, mutation,
or summarization.
![](figures/fillingthegaps.png){fig-alt="T2T Consoritium, Science."
fig-align="center"}
## Genomic overlap as a join
![](figures/woodjoin.png){fig-alt="Interwoven stacks of wood"
fig-align="center"}
```{r}
#| echo: false
#| message: false
library(plyranges)
set.seed(5)
n <- 40
x <- data.frame(seqnames=1, start=round(runif(n, 101, 996)),
width=2, score=rnorm(n, mean=5)) |>
as_granges() |>
sort()
seqlengths(x) <- 1000
y <- data.frame(seqnames=1, start=c(101, 451, 801),
width=200, id = factor(c("a","b","c"))) |>
as_granges()
```
## Genomic overlap as a join
```{r}
#| echo: true
#| message: false
library(plyranges)
x
```
## Genomic overlap as a join
```{r}
#| echo: true
y
```
## Computing overlap (here inner join)
```{r}
#| echo: true
x |> join_overlap_inner(y)
```
## Chaining operations
```{r}
#| echo: true
x |>
filter(score > 3.5) |>
join_overlap_inner(y) |>
group_by(id) |>
summarize(ave_score = mean(score), n = n())
```
Options: `directed`, `within`, `maxgap`, `minoverlap`, etc.
## Pipe to plot
```{r ranges-plot}
#| echo: true
#| fig-align: center
x |>
filter(score > 3.5) |>
join_overlap_inner(y) |>
as_tibble() |>
ggplot(aes(x = id, y = score)) +
geom_violin() + geom_jitter(width=.1)
```
## Convenience functions
```{r}
#| echo: true
y |>
anchor_5p() |> # 5', 3', start, end, center
mutate(width=1) |>
join_nearest(x, distance=TRUE)
```
## Assess significance using a tidy grammar
- Asking about the enrichment of variants near genes, or peaks in
TADs, often requires a lot of custom R code, lots of loops and
control code.
- Hard to read, hard to maintain, hard at submission/publication.
- Instead: use familiar verbs, stacked genomic range sets.
## Defining null genomic range sets
- We developed a package,
[nullranges](https://nullranges.github.io/nullranges), a modular
tool to assist with making genomic comparisons.
- Only provides null genomic range sets for investigating various
hypotheses. That's it!
- Doesn't do enrichment analysis *per se*, but can be combined with
*plyranges*, *ggplot2*, etc.
![](figures/nullranges.png){fig-align="center"}
## Bootstrapping ranges
Statistical papers from the ENCODE project noted that *block
bootstrapping* genomic data preserves important spatial patterns
(Bickel *et al.* 2010).
![](figures/boot.png){fig-alt="Diagram of block bootstrapping"
fig-align="center"}
## bootRanges
```{r}
#| echo: true
#| message: false
library(nullranges)
boot <- bootRanges(x, blockLength=10, R=20)
# keep track of bootstrap iteration, gives boot dist'n
boot |>
join_overlap_inner(y) |>
group_by(iter, id) |> # bootstrap iter, range id
summarize(n_overlaps = n())
```
## bootRanges
```{r boot-plot}
#| echo: true
#| fig-align: center
boot |>
join_overlap_inner(y) |>
group_by(iter, id) |> # bootstrap iter, range id
summarize(n_overlaps = n()) |>
as_tibble() |>
ggplot(aes(x = id, y = n_overlaps)) +
geom_boxplot()
```
## Matching ranges
Matching on covariates from a large pool allows for more focused
hypothesis testing.
![](figures/match.png){fig-alt="Diagram of matching genomic ranges"
fig-align="center"}
## matchRanges
```{r}
#| echo: true
xprime <- x |>
filter(score > 5) |> # xprime are hi-score
mutate(score = rnorm(n(), mean = score, sd = .5)) # jitter data
```
## matchRanges
```{r match-covar-plot}
#| echo: true
#| fig-align: center
m <- matchRanges(focal = xprime, pool = x, covar = ~score)
plotCovariate(m)
```
## Bind ranges
```{r}
#| echo: true
combined <- bind_ranges(
focal = xprime,
matched = matched(m),
pool = x,
.id = "origin"
)
```
This is now "tidy data" with the two group concatenated and a new
metadata column, `origin`.
## Group and summarize
```{r}
#| echo: true
combined |>
join_overlap_inner(y) |>
group_by(id, origin) |>
summarize(ave_score = mean(score))
```
## Pipe to plot
```{r match-overlap-plot}
#| echo: true
#| fig-align: center
combined |>
join_overlap_inner(y) |>
group_by(id, origin) |>
summarize(ave_score = mean(score), sd = sd(score)) |>
as_tibble() |>
ggplot(aes(origin, ave_score,
ymin=ave_score-2*sd, ymax=ave_score+2*sd)) +
geom_pointrange() +
facet_wrap(~id, labeller = label_both)
```
## bootRanges and matchRanges methods
Published as Application Notes in 2023:
- [bootRanges](https://doi.org/10.1093/bioinformatics/btad190)
- [matchRanges](https://doi.org/10.1093/bioinformatics/btad197)
## Current state and future directions
- Writing up `plyxp` as an Application Note.
- Have shown it locally and to an industry team.
- Working with Stefano Mangiola's team on consistent printing,
messaging.
- Working on tutorials, workshops, documentation.
## What has been implemented
- Matrix-shaped objects (SE, SCE) - Mangiola *et al.*
- Ranges - Lee *et al.*
- Interactions - Serizay *et al.*
- Cytometry - Keyes *et al.*
- Spatial - Hutchison *et al.*
- more to come...
## Ongoing work
- Package code and non-standard evaluation (IMO I don't recommend)
- Optimized code
- Conflicts
- Documentation!
## How to contribute
- <https://github.com/tidyomics> open challenges
- For project motivation, read the
[paper](https://www.biorxiv.org/content/10.1101/2023.09.10.557072v2)
- `#tidiness_in_bioc` channel in Bioconductor Slack
- More complicated use cases: [Tidy ranges
tutorial](https://tidyomics.github.io/tidy-ranges-tutorial)
## Omics data analysis requires looking at:
- main contributions to variance (e.g. PCA, see `plotPCA` for bulk
and [OSCA](https://bioconductor.org/books/release/OSCA/) for sc),
also `variancePartition`
- column and row densities (`geom_density` of select rows/columns,
or `geom_violin`)
- known positive features, feature-level plots (`filter` to feature,
pipe to `geom_point` etc.)
## Acknowledgments
```{=html}
<style>
.small-text p {
font-size: 0.6em;
}
</style>
```
- Stefano Mangiola (co-lead, tidySE, tidySCE, spatial)
- Justin Landis (*plyxp*)
- Eric Davis, Wancen Mu, Doug Phanstiel (*nullranges*)
- Stuart Lee, M. Lawrence, D. Cook (*plyranges* developers)
::: small-text
**tidyomics developers**: William Hutchison, Timothy Keyes, Helena
Crowell, Jacques Serizay, Charlotte Soneson, Eric Davis, Noriaki Sato,
Lambda Moses, Boyd Tarlinton, Abdullah Nahid, Miha Kosmac, Quentin
Clayssen, Victor Yuan, Wancen Mu, Ji-Eun Park, Izabela Mamede, Min
Hyung Ryu, Pierre-Paul Axisa, Paulina Paiz, Chi-Lam Poon, Ming Tang
*tidyomics* project funded by an Essential Open Source Software award
from CZI and Wellcome Trust
:::