+ "markdown": "# Correlate signals over space and time\n\n\nThe `epiprocess` package provides some simple functionality for computing lagged\ncorrelations between two signals, over space or time (or other variables), via\n`epi_cor()`. This function is really just a convenience wrapper over some basic\ncommands: it first performs specified time shifts, then computes correlations,\ngrouped in a specified way. In this vignette, we'll examine correlations between\nstate-level COVID-19 case and death rates, smoothed using 7-day trailing\naverages.\n\n\n\n\n::: {.cell layout-align=\"center\" hash='correlations_cache/html/unnamed-chunk-2_7c89b5aee8f9d474319d212978503566'}\n\n```{.r .cell-code}\nx <- covid_case_death_rates\n```\n:::\n\n\n## Correlations grouped by time\n\nThe `epi_cor()` function operates on an `epi_df` object, and it requires further\nspecification of the variables to correlate, in its next two arguments (`var1`\nand `var2`).\n\nIn general, we can specify any grouping variable (or combination of variables)\nfor the correlation computations in a call to `epi_cor()`, via the `cor_by`\nargument. This potentially leads to many ways to compute correlations. There are\nalways at least two ways to compute correlations in an `epi_df`: grouping by\ntime value, and by geo value. The former is obtained via `cor_by = time_value`.\n\n\n::: {.cell layout-align=\"center\" hash='correlations_cache/html/unnamed-chunk-3_f66ac47a93717e593807c2fa24766523'}\n\n```{.r .cell-code}\nz1 <- epi_cor(x, case_rate, death_rate, cor_by = \"time_value\")\n```\n:::\n\n::: {.cell layout-align=\"center\" hash='correlations_cache/html/unnamed-chunk-4_84e1aa4831754809c364c0c6200c23eb'}\n\n```{.r .cell-code code-fold=\"true\"}\nggplot(z1, aes(x = time_value, y = cor)) +\n geom_hline(yintercept = 0) +\n geom_line(color = 4) +\n scale_x_date(minor_breaks = \"month\", date_labels = \"%b %Y\") +\n labs(x = \"Date\", y = \"Correlation\")\n```\n\n::: {.cell-output-display}\n{fig-align='center' width=90%}\n:::\n:::\n\n\nThe above plot addresses the question: \"on any given day, are case and death\nrates linearly associated, across the U.S. states?\". We might be interested in\nbroadening this question, instead asking: \"on any given day, do higher case\nrates tend to associate with higher death rates?\", removing the dependence on a\nlinear relationship. The latter can be addressed using Spearman correlation,\naccomplished by setting `method = \"spearman\"` in the call to `epi_cor()`.\nSpearman correlation is highly robust and invariant to monotone transformations.\n\n## Lagged correlations\n\nWe might also be interested in how case rates associate with death rates in the\n*future*. Using the `dt1` parameter in `epi_cor()`, we can lag case rates back\nany number of days we want, before calculating correlations. Below, we set `dt1\n= -10`. This means that `var1 = case_rate` will be lagged by 10 days, so that\ncase rates on June 1st will be correlated with death rates on June 11th. (It\nmight also help to think of it this way: death rates on a certain day will be\ncorrelated with case rates at an offset of -10 days.)\n\n\n::: {.cell layout-align=\"center\" hash='correlations_cache/html/unnamed-chunk-5_a31d30ca4ff11cc4173e87d7b3e17516'}\n\n```{.r .cell-code}\nz2 <- epi_cor(x, case_rate, death_rate, cor_by = time_value, dt1 = -10)\n```\n:::\n\n::: {.cell layout-align=\"center\" hash='correlations_cache/html/unnamed-chunk-6_17b2540d7afa4303c809aa7836eb3729'}\n\n```{.r .cell-code code-fold=\"true\"}\nz <- rbind(\n z1 %>% mutate(lag = 0),\n z2 %>% mutate(lag = 10)\n) %>%\n mutate(lag = as.factor(lag))\n\nggplot(z, aes(x = time_value, y = cor)) +\n geom_hline(yintercept = 0) +\n geom_line(aes(color = lag)) +\n scale_color_brewer(palette = \"Set1\") +\n scale_x_date(minor_breaks = \"month\", date_labels = \"%b %Y\") +\n labs(x = \"Date\", y = \"Correlation\", col = \"Lag\")\n```\n\n::: {.cell-output-display}\n{fig-align='center' width=90%}\n:::\n:::\n\n\nNote that `epi_cor()` takes an argument `shift_by` that determines the grouping\nto use for the time shifts. The default is `geo_value`, which makes sense in our\nproblem at hand (but in another setting, we may want to group by geo value and\nanother variable---say, age---before time shifting).\n\nWe can see that, generally, lagging the case rates back by 10 days improves the\ncorrelations, confirming case rates are better correlated with death rates 10\ndays from now.\n\n## Correlations grouped by state\n\nThe second option we have is to group by geo value, obtained by setting `cor_by\n= geo_value`. We'll again look at correlations for both 0- and 10-day lagged \ncase rates.\n\n\n::: {.cell layout-align=\"center\" hash='correlations_cache/html/unnamed-chunk-7_caf710aef16ca5c027a450b5e8a09798'}\n\n```{.r .cell-code}\nz1 <- epi_cor(x, case_rate, death_rate, cor_by = geo_value)\nz2 <- epi_cor(x, case_rate, death_rate, cor_by = geo_value, dt1 = -10)\n```\n:::\n\n::: {.cell layout-align=\"center\" hash='correlations_cache/html/unnamed-chunk-8_410fa1f85cdea4abca883f8cec2a5879'}\n\n```{.r .cell-code code-fold=\"true\"}\nz <- rbind(\n z1 %>% mutate(lag = 0),\n z2 %>% mutate(lag = 10)\n) %>%\n mutate(lag = as.factor(lag))\n\nggplot(z, aes(cor)) +\n geom_density(aes(fill = lag, col = lag), alpha = 0.5, bounds = c(-1, 1)) +\n scale_fill_brewer(palette = \"Set1\") +\n scale_color_brewer(palette = \"Set1\") +\n labs(x = \"Correlation\", y = \"Density\", fill = \"Lag\", col = \"Lag\")\n```\n\n::: {.cell-output-display}\n{fig-align='center' width=90%}\n:::\n:::\n\n\nWe can again see that, generally speaking, lagging the case rates back by 10 \ndays improves the correlations.\n\n## More systematic lag analysis\n\nNext we perform a more systematic investigation of the correlations over a broad \nrange of lag values. \n\n\n::: {.cell layout-align=\"center\" hash='correlations_cache/html/unnamed-chunk-9_505146d4a43638cd8f42f1d37a772a50'}\n\n```{.r .cell-code}\nlags <- 0:35\n\nz <- map(\n .x = lags,\n ~ epi_cor(x, case_rate, death_rate, cor_by = geo_value, dt1 = -.x) %>%\n mutate(lag = .x)\n) %>% list_rbind()\n```\n:::\n\n::: {.cell layout-align=\"center\" hash='correlations_cache/html/unnamed-chunk-10_e5d3435d433ff118e91d8001cb930845'}\n\n```{.r .cell-code code-fold=\"true\"}\nz %>%\n group_by(lag) %>%\n summarize(mean = mean(cor, na.rm = TRUE)) %>%\n ggplot(aes(x = lag, y = mean)) +\n geom_line(color = 4) +\n geom_point(color = 4) +\n labs(x = \"Lag\", y = \"Mean correlation\")\n```\n\n::: {.cell-output-display}\n{fig-align='center' width=90%}\n:::\n:::\n\n\nWe can see pronounced curvature in the average correlation between \ncase and death rates (where the correlations come from grouping by `geo_value`) as\na function of lag. The maximum occurs at a lag of somewhere around 17 days.\n",
0 commit comments