tidy-intro-talk.qmd

---
title: "Tidyomics: Enabling Tidy Data Analysis for Complex Biological Data"
author:
  - name: "Michael Love"
    affiliation: "Dept of Genetics, Dept of Biostatistics, UNC-Chapel Hill"
format: 
  revealjs:
    theme: simple
    slideNumber: true
    self-contained: false
    highlight-style: arrow
include-in-header:
  - text: |
      <style>
      #title-slide .title {
        font-size: 1.5em
      }
      .cell-output {
        margin-top: 20px;
      }
      </style>
---

## Tidyomics project

An open, open-source project spanning multiple R packages, and
developers from around the world. Operates within the Bioconductor
Project, with separate funding and organization. Coordination of
development via GitHub Projects.

-   <https://github.com/tidyomics>

-   [Announcement
    paper](https://www.biorxiv.org/content/10.1101/2023.09.10.557072v2)

-   `#tidiness_in_bioc` channel in Zulip

## What is "tidy data"

-   One row per observation, one column per variable.

-   Genomic regions (BED) already in this format.

-   Matrices and annotated matrices are not.

## A grammar of data manipulation

```{=html}
<style>
  .small-text ul {
    font-size: 0.8em;
  }
</style>
```

In the R package *dplyr*:

::: small-text
-   `mutate()` adds new variables that are functions of existing
    variables.
-   `select()` picks variables based on their names.
-   `filter()` picks cases based on their values.
-   `slice()` picks cases based on their position.
-   `summarize()` reduces multiple values down to a single summary.
-   `arrange()` changes the ordering of the rows.
-   `group_by()` perform any operation by group.
:::

<https://dplyr.tidyverse.org/>

## The pipe

``` bash
command | command | command > output.txt
```

> "Pipes rank alongside the hierarchical file system and regular
> expressions as one of the most powerful yet elegant features of
> Unix-like operating systems."

<http://www.linfo.org/pipe.html>

In R we use `%>%` or `|>` instead of `|` to chain operations.

## Tidyomics = *dplyr* verbs applied to omics

-   Genomic regions (called "ranges")

-   Matrix data, e.g. gene expression over samples or cells

-   Genomic interactions, e.g. DNA loops

-   And more...

## Diagram of tidyomics workflows

![](figures/figure2.png){fig-align="center"
fig-alt="Diagram of how packages share a similar grammar="}

## International development team

![](figures/tidyomics_community.png){fig-align="center" width="300"
fig-alt="Diagram of tidyomics community="}

## Keep data organized in *objects*

-   We typically have more information than just a matrix

-   Row and column information (on features and samples)

-   Annotated matrix data, i.e. python's `xarray`, `AnnData`

-   Metadata: organism, genome build, annotation release

-   2002: *ExpressionSet*; 2011: *SummarizedExperiment*

-   *Endomorphic functions*: `x |> f -> x`

## SummarizedExperiment, "SE"

A *SummarizedExperiment* is annotated matrix data.

Imagine a matrix of counts:

```{r}
#| message: false
library(SummarizedExperiment)
set.seed(123) # some random count data
```

```{r}
#| echo: true
counts <- matrix(
  rpois(16, lambda=100), ncol=4,
  dimnames=list(c("g1","g2","g3","g4"),
                c("s1","s2","s3","s4"))
)
counts
```

## Row data and column data

metadata about genes (rows)

```{r}
genes <- DataFrame(
  id = c("g1","g2","g3","g4"), 
  symbol = c("ABC","DEF","GHI","JKL")
)
genes
```

metadata about samples (columns)

```{r}
samples <- DataFrame(
  sample = c("s1","s2","s3","s4"),
  condition = c("x","y","x","y"),
  sex = c("m","m","f","f")
)
samples
```

## Assembled object

```{r}
#| echo: true
se <- SummarizedExperiment(
  assays = list(counts = counts),
  rowData = genes, 
  colData = samples,
  metadata = list(organism="Hsapiens")
)
se
```

## Deals with bookkeeping issues

Reordering columns propagates to `assay` and `colData`:

```{r}
#| echo: true
se2 <- se[,c(1,3,2,4)]
assay(se2, "counts")
colData(se2)
```

## Deals with bookkeeping issues

Assignment with conflicting metadata gives an error:

```{r}
#| echo: true
#| eval: false
assay(se2) <- counts
# Error:
# please use 'assay(x, withDimnames=FALSE)) <- value' or 
# 'assays(x, withDimnames=FALSE)) <- value'
# when the rownames or colnames of the supplied assay(s) are not 
# identical to those of the receiving  SummarizedExperiment object 'x'
```

-   Other such validity checks include comparison across *different
    genome builds*.

-   Errors triggered by metadata better than silent errors!

## Can be hard for new users

```{r}
#| echo: true
slotNames(se)
methods(class = class(se)) |> head()
```

-   An advanced R/Bioconductor user should eventually learn these
    methods, class/method inheritance.

-   Not needed to explore or visualize data, or make basic data
    summaries.

## Beginners know *dplyr* & *ggplot2*

```{r}
#| echo: true
#| message: false
library(dplyr)
# filter to samples in condition 'x'
samples |> 
  as_tibble() |> 
  filter(condition == "x")
```

## Enabling *dplyr* verbs for omics

-   *tidySummarizedExperiment* package from Mangiola *et al.*

-   Printing says: "*SummarizedExperiment-tibble abstraction*"

```{r}
#| echo: true
#| message: false
library(tidySummarizedExperiment)
se
```

## Tidyomics is an API

Interact with native objects using standard R/Bioc methods.

![](figures/counter.png){fig-alt="Counter with a menu and a bell"
fig-align="center"}

## Still a standard Bioc object

```{r}
#| echo: true
class(se)
dim(se)
```

## We can use familiar dplyr verbs

```{r}
#| echo: true
se |> 
  filter(condition == "x")
```

## We can use familiar dplyr verbs

```{r}
#| echo: true
se |> 
  filter(condition == "x") |>
  colData()
```

## Facilitates quick exploration

```{r}
#| message: false
#| echo: false
library(ggplot2)
theme_set(theme_grey(base_size = 16))
```

```{r ggplot}
#| message: false
#| echo: true
#| fig-align: center
library(ggplot2)
se |> 
  filter(.feature %in% c("g1","g2")) |> # `.feature` a special name
  ggplot(aes(condition, counts, color=sex, group=sex)) + 
  geom_point(size=2) + geom_line() + facet_wrap(~.feature)
```

## Seurat and SingleCellExperiment

*SingleCellExperiment* has slots for e.g. reduced dimensions.

```{r}
#| eval: false
#| echo: false
library(tidySingleCellExperiment)
library(scales)
sce <- readRDS("data/tidyomicsWorkshopSCE.rds")
```

```{r umap}
#| eval: false
#| echo: true
sce |>
  filter(Phase == "G1") |>
  ggplot(aes(UMAP_1, UMAP_2, color=nCount_RNA)) +
  geom_point() + scale_color_gradient2(trans="log10")
```

![](figures/umap-1.png){fig-align="center"}

## Efficient operation on SE with `plyxp`

-   Justin Landis (UNC BCB) noticed some opportunities for more
    efficient operations.

-   Also, reduce ambiguity, allow multiple ways to access data across
    context. Leveraging data masks from *rlang*.

-   Developed in Summer 2024: `plyxp`, stand-alone but also as a
    *tidySummarizedExperiment* engine.

![](figures/plyxp.png){fig-align="center"}

## Efficient operation on SE with `plyxp`

![](figures/plyxp-bindings.png){fig-align="center"}

## Suppose the following SE object

```{r}
#| echo: false
#| message: false
library(SummarizedExperiment)
simple <- SummarizedExperiment(
  list(counts = matrix(1:12,nrow=3)),
  colData = data.frame(type = factor(c(1,1,2,2)), 
                       row.names = letters[1:4]),
  rowData = data.frame(length = c(10,20,30),
                       row.names = LETTERS[1:3])
)
```

```{r}
#| echo: true
assay(simple)
colData(simple) |> as.data.frame()
rowData(simple) |> as.data.frame()
```

## Trigger methods with `new_plyxp()`

```{r}
#| echo: true
#| message: false
library(plyxp)
xp <- simple |> 
  new_plyxp()
xp
```

## Mutate with `plyxp`

```{r}
#| echo: true
xp |> # assay is default context for mutate()
  mutate(norm_counts = counts / .rows$length)
```

## What was that `.rows$` doing?

![](figures/plxyp-pronouns.png){fig-align="center"}

## Grammar for group and summarize

```{r}
#| echo: true
summed <- xp |>
  group_by(cols(type)) |>
  summarize(sum = rowSums(counts))
summed
assay(summed)
```

## Multiple implementations

```{r}
#| echo: true
xp |> # .assays_asis gives direct access to the matrix
  mutate(rows(ave_counts = rowMeans(.assays_asis$counts)))
```

-   See vignette for more: <https://jtlandis.github.io/plyxp/>

## The `plyxp` class

A simple wrapper of SE:

```{r}
#| echo: true
class(xp)
colData(xp)
```

## Unwrapping `plyxp`

```{r}
#| echo: true
xp |> se()
```

## tidyomics in long read RNA-seq analysis

-   Filtering, exploration, QC

-   Data thinning (count splitting)

-   Grouping isoforms for DTU analysis, keeping track of aggregated
    transcript sets

-   Projecting splicing features from transcript-level quantification
    and analysis to events like skipped exons, retained introns,
    alternative 5' / 3' UTR, etc.

## Computing on genomic ranges

-   Translate biological questions about the genome into code.

-   Leverage familiar concepts of filters, joins, grouping, mutation,
    or summarization.

![](figures/fillingthegaps.png){fig-alt="T2T Consoritium, Science."
fig-align="center"}

## Genomic overlap as a join

![](figures/woodjoin.png){fig-alt="Interwoven stacks of wood"
fig-align="center"}

```{r}
#| echo: false
#| message: false
library(plyranges)
set.seed(5)
n <- 40
x <- data.frame(seqnames=1, start=round(runif(n, 101, 996)), 
                width=2, score=rnorm(n, mean=5)) |>
  as_granges() |>
  sort()
seqlengths(x) <- 1000
y <- data.frame(seqnames=1, start=c(101, 451, 801), 
                width=200, id = factor(c("a","b","c"))) |>
  as_granges()
```

## Genomic overlap as a join

```{r}
#| echo: true
#| message: false
library(plyranges)
x
```

## Genomic overlap as a join

```{r}
#| echo: true
y
```

## Computing overlap (here inner join)

```{r}
#| echo: true
x |> join_overlap_inner(y)
```

## Chaining operations

```{r}
#| echo: true
x |>
  filter(score > 3.5) |>
  join_overlap_inner(y) |>
  group_by(id) |>
  summarize(ave_score = mean(score), n = n())
```

Options: `directed`, `within`, `maxgap`, `minoverlap`, etc.

## Pipe to plot

```{r ranges-plot}
#| echo: true
#| fig-align: center
x |>
  filter(score > 3.5) |>
  join_overlap_inner(y) |>
  as_tibble() |>
  ggplot(aes(x = id, y = score)) + 
  geom_violin() + geom_jitter(width=.1)
```

## Convenience functions

```{r}
#| echo: true
y |> 
  anchor_5p() |> # 5', 3', start, end, center
  mutate(width=1) |>
  join_nearest(x, distance=TRUE)
```

## Manipulating transcripts for long read RNA-seq

- Bea Campillo in the lab (UNC BCB) was interested in developing a visualization tool after differential transcript usage (DTU)
- We run e.g. [oarfish](https://github.com/COMBINE-lab/oarfish), import into R with tximeta, then run [satuRn](https://www.bioconductor.org/packages/satuRn) for DTU testing
- Bea developed *SPLain* for visualizing per gene DTU results, also detecting skipped exons and performing sequence analysis using *plyranges*. Coming soon!

## SPLain: visualize splicing changes

![](figures/splain1a.png){fig-alt="SPLain app" fig-align="center"}

## SPLain: visualize splicing changes

![](figures/splain1b.png){fig-alt="SPLain app" fig-align="center"}

## SPLain: sequence analysis

![](figures/splain2.png){fig-alt="SPLain app" fig-align="center"}

## SPLain: sequence analysis

```{r}
#| eval: false
#| echo: true
pos_exons  <- exons_flat |> filter(sign ==  1)
neg_exons <- exons_flat |> filter(sign ==  -1)

candidates <- pos_exons |>
  filter_by_non_overlaps_directed(neg_exons) |>

# ...

upstr_downreg_exons <- downreg_exons |>
  flank_upstream(width = width_upstream) 

seq_downreg_exons <-  Hsapiens |>
  getSeq(upstr_downreg_exons) |>
  RNAStringSet()
```

## Assess significance using a tidy grammar

-   Asking about the enrichment of variants near genes, or peaks in
    TADs, often requires a lot of custom R code, lots of loops and
    control code.

-   Hard to read, hard to maintain, hard at submission/publication.

-   Instead: use familiar verbs, stacked genomic range sets.

## Defining null genomic range sets

-   We developed a package,
    [nullranges](https://nullranges.github.io/nullranges), a modular
    tool to assist with making genomic comparisons.

-   Only provides null genomic range sets for investigating various
    hypotheses. That's it!

-   Doesn't do enrichment analysis *per se*, but can be combined with
    *plyranges*, *ggplot2*, etc.

![](figures/nullranges.png){fig-align="center"}

## Bootstrapping ranges

Statistical papers from the ENCODE project noted that *block
bootstrapping* genomic data preserves important spatial patterns
(Bickel *et al.* 2010).

![](figures/boot.png){fig-alt="Diagram of block bootstrapping"
fig-align="center"}

## bootRanges

```{r}
#| echo: true
#| message: false
library(nullranges)
boot <- bootRanges(x, blockLength=10, R=20)
# keep track of bootstrap iteration, gives boot dist'n
boot |>
  join_overlap_inner(y) |>
  group_by(iter, id) |> # bootstrap iter, range id
  summarize(n_overlaps = n())
```

## bootRanges

```{r boot-plot}
#| echo: true
#| fig-align: center
boot |>
  join_overlap_inner(y) |>
  group_by(iter, id) |> # bootstrap iter, range id
  summarize(n_overlaps = n()) |>
  as_tibble() |>
  ggplot(aes(x = id, y = n_overlaps)) + 
  geom_boxplot()
```

## Matching ranges

Matching on covariates from a large pool allows for more focused
hypothesis testing.

![](figures/match.png){fig-alt="Diagram of matching genomic ranges"
fig-align="center"}

## matchRanges

```{r}
#| echo: true
xprime <- x |>
  filter(score > 5) |> # xprime are hi-score
  mutate(score = rnorm(n(), mean = score, sd = .5)) # jitter data
```

## matchRanges

```{r match-covar-plot}
#| echo: true
#| fig-align: center
m <- matchRanges(focal = xprime, pool = x, covar = ~score)
plotCovariate(m)
```

## Bind ranges

```{r}
#| echo: true
combined <- bind_ranges(
  focal = xprime,
  matched = matched(m),
  pool = x,
  .id = "origin"
)
```

This is now "tidy data" with the two group concatenated and a new
metadata column, `origin`.

## Group and summarize

```{r}
#| echo: true
combined |> 
  join_overlap_inner(y) |>
  group_by(id, origin) |>
  summarize(ave_score = mean(score))
```

## Pipe to plot

```{r match-overlap-plot}
#| echo: true
#| fig-align: center
combined |> 
  join_overlap_inner(y) |>
  group_by(id, origin) |>
  summarize(ave_score = mean(score), sd = sd(score)) |>
  as_tibble() |>
  ggplot(aes(origin, ave_score, 
             ymin=ave_score-2*sd, ymax=ave_score+2*sd)) + 
  geom_pointrange() +
  facet_wrap(~id, labeller = label_both)
```

## bootRanges and matchRanges methods

Published as Application Notes in 2023:

-   [bootRanges](https://doi.org/10.1093/bioinformatics/btad190)

-   [matchRanges](https://doi.org/10.1093/bioinformatics/btad197)

## Current state and future directions

-   Writing up `plyxp` as an Application Note.

-   Have shown it locally and to an industry team.

-   Working with Stefano Mangiola's team on consistent printing,
    messaging.

-   Working on tutorials, workshops, documentation.

## What has been implemented

-   Matrix-shaped objects (SE, SCE) - Mangiola *et al.*

-   Ranges - Lee *et al.*

-   Interactions - Serizay *et al.*

-   Cytometry - Keyes *et al.*

-   Spatial - Hutchison *et al.*

-   more to come...

## Ongoing work

-   Package code and non-standard evaluation (IMO I don't recommend)
-   Optimized code
-   Conflicts
-   Documentation!

## How to contribute

-   <https://github.com/tidyomics> open challenges

-   For project motivation, read the
    [paper](https://www.biorxiv.org/content/10.1101/2023.09.10.557072v2)

-   `#tidiness_in_bioc` channel in Bioconductor Slack

-   More complicated use cases: [Tidy ranges
    tutorial](https://tidyomics.github.io/tidy-ranges-tutorial)

## Omics data analysis requires looking at:

-   main contributions to variance (e.g. PCA, see `plotPCA` for bulk
    and [OSCA](https://bioconductor.org/books/release/OSCA/) for sc),
    also `variancePartition`
-   column and row densities (`geom_density` of select rows/columns,
    or `geom_violin`)
-   known positive features, feature-level plots (`filter` to feature,
    pipe to `geom_point` etc.)

## Acknowledgments

```{=html}
<style>
  .small-text p {
    font-size: 0.6em;
  }
</style>
```

-   Stefano Mangiola (co-lead, tidySE, tidySCE, spatial)

-   Justin Landis (*plyxp*) and Beatriz Campillo Minano (*SPLain*)

-   Eric Davis, Wancen Mu, Doug Phanstiel (*nullranges*)

-   Stuart Lee, M. Lawrence, D. Cook (*plyranges* developers)

::: small-text
**tidyomics developers**: William Hutchison, Timothy Keyes, Helena
Crowell, Jacques Serizay, Charlotte Soneson, Eric Davis, Noriaki Sato,
Lambda Moses, Boyd Tarlinton, Abdullah Nahid, Miha Kosmac, Quentin
Clayssen, Victor Yuan, Wancen Mu, Ji-Eun Park, Izabela Mamede, Min
Hyung Ryu, Pierre-Paul Axisa, Paulina Paiz, Chi-Lam Poon, Ming Tang

*tidyomics* project funded by an Essential Open Source Software award
from CZI and Wellcome Trust
:::