Commit ab28e6a

Merge pull request #58 from cmu-delphi/rach-minor-edits
Minor edits to Day 1 morning + Day 1 afternoon (tidyverse only)
2 parents 65676a0 + eae5035 commit ab28e6a

2 files changed: +35 -59 lines changed

slides/day1-afternoon.qmd

+30 -46
@@ -74,7 +74,6 @@ Our focus will be on basic operations like selecting and filtering data.
7474
<small>[Source](https://towardsdatascience.com/data-manipulation-in-r-with-dplyr-3095e0867f75)</small>
7575
</div>
7676

77-
7877
## Downloading JHU CSSE COVID-19 case data
7978

8079
* Let's start with something familiar... Here's a task for you:
@@ -102,8 +101,9 @@ cases_df_api <- pub_covidcast(
102101
Now we only really need a few columns here...
103102
```{r head-jhu-dplyr-demo-data}
104103
#| echo: true
105-
cases_df <- cases_df_api |>
106-
select(geo_value, time_value, raw_cases = value) # We'll talk more about this soon :)
104+
# Base R way for now...
105+
cases_df <- cases_df_api[,c("geo_value", "time_value", "value")]
106+
names(cases_df)[names(cases_df) == "value"] <- "raw_cases"
107107
```
108108

109109
## Ways to inspect the dataset
@@ -120,15 +120,15 @@ and tail to view the last six
120120
tail(cases_df) # Last 6 rows
121121
```
122122

123-
## Ways to inspect the dataset
124-
Now, for our first foray into the `tidyverse`...
123+
<!-- ## Ways to inspect the dataset
125124
126125
Use `glimpse()` to get a compact overview of the dataset.
127126
128127
```{r glimpse}
129128
#| echo: true
130129
glimpse(cases_df)
131130
```
131+
-->
132132

133133
<!-- ## Creating tibbles
134134
@@ -149,7 +149,7 @@ The `select()` function is used to pick specific columns from your dataset.
149149

150150
```{r select-columns}
151151
#| echo: true
152-
select(cases_df, geo_value, time_value) # Select the 'geo_value' and 'time_value' columns
152+
select(cases_df, time_value, raw_cases) # Select the 'time_value' and 'raw_cases' columns
153153
```
154154

155155
## Selecting columns with `select()`
@@ -158,7 +158,7 @@ You can exclude columns by prefixing the column names with a minus sign `-`.
158158

159159
```{r select-columns-exclude}
160160
#| echo: true
161-
select(cases_df, -raw_cases) # Exclude the 'raw_cases' column from the dataset
161+
select(cases_df, -geo_value) # Exclude the 'geo_value' column from the dataset
162162
```
163163

164164
<!-- ## Extracting columns with `pull()`
@@ -199,7 +199,7 @@ filter(cases_df, geo_value == "nc", raw_cases > 500) # Filter for NC with raw d
199199

200200
```{r select-filter-combine}
201201
#| echo: true
202-
select(filter(cases_df, geo_value == "nc", raw_cases > 1000), time_value, raw_cases) |>
202+
select(filter(cases_df, geo_value == "nc", raw_cases > 500), time_value, raw_cases) |>
203203
head()
204204
```
205205

@@ -213,21 +213,11 @@ select(filter(cases_df, geo_value == "nc", raw_cases > 1000), time_value, raw_ca
213213
#| echo: true
214214
# This code reads more like poetry!
215215
cases_df |>
216-
filter(geo_value == "nc", raw_cases > 1000) |>
216+
filter(geo_value == "nc", raw_cases > 500) |>
217217
select(time_value, raw_cases) |>
218218
head()
219219
```
220220

221-
## Key practices in `dplyr`
222-
223-
* Use [**tibbles**]{.primary} for easier data handling.
224-
* Use `head()`, `tail()`, and `glimpse()` for quick data inspection.
225-
* Use `select()` and `filter()` for data manipulation.
226-
* Chain functions with `|>` for cleaner code.
227-
228-
<!-- * Use `pull()` to extract columns as vectors. -->
229-
230-
231221
## Grouping data with `group_by()`
232222

233223
* Use `group_by()` to group data by one or more columns.
@@ -317,7 +307,7 @@ cases_df |>
317307
summarise(median_cases = median(raw_cases))
318308
```
319309

320-
## Using `count()` to aggregate data
310+
<!-- ## Using `count()` to aggregate data
321311
`count()` is a shortcut for grouping and summarizing the data.
322312
323313
For example, if we want to get the total number of complete rows for each state, then
@@ -340,28 +330,38 @@ cases_count <- cases_df |>
340330
cases_count # Let's see what the counts are.
341331
```
342332
343-
## Key practices in `dplyr`: Round 2
333+
-->
334+
335+
## Key practices learned
344336

337+
* Use `head()` and `tail()` for quick data inspection.
338+
* Use `select()` and `filter()` for data manipulation.
339+
* Chain functions with `|>`.
345340
* Use `group_by()` to group data by one or more variables before applying functions.
346341
* Use `mutate` to create new columns or modify existing ones by applying functions to existing data.
347342
* Use `summarise` to reduce data to summary statistics (e.g., mean, median).
348-
* `count()` is a convenient shortcut for counting rows by group without needing `group_by()` and `summarise()`.
349343

350-
## Tidy data and Tolstoy
344+
## 1 to 2 word summaries of the `dplyr` functions
351345

352-
> "Happy families are all alike; every unhappy family is unhappy in its own way." — Leo Tolstoy
346+
Here are 1 to 2 word summaries of the key `dplyr` functions:
347+
348+
1. `select()`: Choose columns
353349

354-
* [**Tidy datasets**]{.primary} are like happy families: consistent, standardized, and easy to work with.
355-
* [**Messy datasets**]{.primary} are like unhappy families: each one messy in its own unique way.
356-
In this section:
357-
* We'll define what makes data *tidy* and how to transform between the tidy and messy formats.
350+
1. `filter()`: Subset rows
351+
352+
1. `mutate()`: Create columns
353+
354+
1. `group_by()`: Group by
355+
356+
1. `summarise()`: Numerical summary
358357

359358
## Tidy data and Tolstoy
360359

361-
![](gfx/tidy_messy_data.jpg){style="width: 60%;"}
360+
> "Happy families are all alike; every unhappy family is unhappy in its own way." — Leo Tolstoy
362361
363-
<small>[Artwork by @allison_horst](https://x.com/allison_horst)</small>
362+
![](gfx/tidy_messy_data.jpg){style="width: 40%;"}
364363

364+
<small>[Artwork by @allison_horst](https://x.com/allison_horst)</small>
365365

366366
## What is tidy data?
367367

@@ -385,8 +385,6 @@ head(cases_df)
385385
* To convert data from long format to wide/messy format use `pivot_wider()`.
386386
* For example, let's try creating a column for each time value in `cases_df`:
387387

388-
<!-- Example. Spreadsheet from hell -->
389-
390388
```{r pivot-wider-ex}
391389
#| echo: true
392390
messy_cases_df <- cases_df |>
@@ -416,20 +414,6 @@ tidy_cases_df <- messy_cases_df |>
416414
head(tidy_cases_df, n = 3) # Notice the class of time_value here
417415
```
418416

419-
## Tidying messy data with `pivot_longer()`
420-
421-
* When we used `pivot_longer()`, the `time_value` column is converted to a character class because the column names are treated as strings.
422-
* So, to truly get the original `cases_df` we need to convert `time_value` back to the `Date` class.
423-
* Then, we can use `identical()` to check if the two data frames are exactly the same.
424-
```{r check-identical}
425-
#| echo: true
426-
tidy_cases_df = tidy_cases_df |> mutate(time_value = as.Date(time_value))
427-
428-
identical(tidy_cases_df |> arrange(time_value), cases_df)
429-
```
430-
431-
Great. That was a success!
432-
433417
## Introduction to joins in `dplyr`
434418
* Joining datasets is a powerful tool for combining info. from multiple sources.
435419
* In R, `dplyr` provides several functions to perform different types of joins.
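
Note on the `head-jhu-dplyr-demo-data` and `select-filter-combine` changes above: the sketch below checks, on a toy data frame, that the new base R subsetting and renaming gives the same result as the removed `select(..., raw_cases = value)` call, and that the nested and piped forms of `filter()` plus `select()` agree at the new threshold of 500. The toy `cases_df_api` and its values are assumptions made for illustration only; the real object is returned by `pub_covidcast()` and has more columns.

```r
library(dplyr)

# Hypothetical stand-in for cases_df_api (illustrative values only;
# the real object is returned by pub_covidcast()).
cases_df_api <- data.frame(
  geo_value  = c("ca", "nc", "nc"),
  source     = "jhu-csse",
  time_value = as.Date(c("2022-03-01", "2022-03-01", "2022-03-02")),
  value      = c(1200, 450, 800)
)

# Base R version, as in the added lines of the diff:
# keep three columns, then rename `value` to `raw_cases`.
cases_base <- cases_df_api[, c("geo_value", "time_value", "value")]
names(cases_base)[names(cases_base) == "value"] <- "raw_cases"

# dplyr version, as in the removed lines of the diff:
# select() can pick and rename columns in one step.
cases_dplyr <- cases_df_api |>
  select(geo_value, time_value, raw_cases = value)

# Both approaches should give the same columns and values.
stopifnot(isTRUE(all.equal(cases_base, cases_dplyr)))

# Nested call versus pipe chain with the updated threshold.
nested <- select(
  filter(cases_base, geo_value == "nc", raw_cases > 500),
  time_value, raw_cases
)
piped <- cases_base |>
  filter(geo_value == "nc", raw_cases > 500) |>
  select(time_value, raw_cases)

stopifnot(isTRUE(all.equal(nested, piped)))
```

The native pipe `|>` needs R 4.1 or later, which the slides already assume.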

slides/day1-morning.qmd

+5 -13
@@ -228,13 +228,11 @@ verify_setup()
228228

229229
* In table form, panel data is a time index + one or more locations/keys.
230230

231-
* **Ex**: The % of outpatient doctor visits that are COVID-related in WA from Dec. 2021 to Feb. 2022 ([docs](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/doctor-visits.html)):
232-
```{r panel-wa-ex}
233-
head(epix_as_of(dv_wa, max(dv_wa$DT$version)))
231+
* **Ex**: The % of outpatient doctor visits that are COVID-related in CA from June 2020 to Dec. 2021 ([docs](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/doctor-visits.html)):
232+
```{r panel-ca-ex}
233+
dv_versioned_panel_final |> filter(geo_value == "ca") |> select(-version)
234234
```
235235

236-
<!-- Example: The estimated % of outpatient doctor visits that are COVID-related in WA from Dec. 2021 to Feb. 2022 -->
237-
238236
## Examples of panel data - COVID-19 cases
239237

240238
[[**JHU CSSE COVID cases per 100k **]{.primary}](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html) estimates the daily number of new confirmed COVID-19 cases per 100,000 population, averaged over the past 7 days.
@@ -427,7 +425,7 @@ ggplot(ca |>
427425

428426
* This is because individual-level data has delayed availability:
429427

430-
[Person comes to ER → Has some tests → Admitted → Tests come back → Entered into the system → ...]{.fourth-colour}
428+
[Person comes to ER → Admitted → Has some tests → Tests come back → Entered into the system → ...]{.fourth-colour}
431429

432430
* So, a "Hospital admission" may not attributable to a particular condition
433431
until a few days have passed (the patient may even have been released)
@@ -526,10 +524,6 @@ and subject to [**revision**]{.primary} <!--over time (ex. consider Dec. 1's `pe
526524
head(x_dt_with_diff) |> as_tibble()
527525
```
528526

529-
<!-- min_lag: the minimum time to any value min(as.integer(version) - as.integer(time_value) -->
530-
<!-- max_lag: the amount of time until the final (new) version -->
531-
<!-- revision_summary computes some basic statistics about the revision behavior of an archive, returning a tibble summarizing the revisions per time_value+epi_key features. -->
532-
533527
## Revision triangle, Outpatient visits in WA 2022
534528

535529
* 7-day trailing average to smooth day-of-week effects
@@ -1027,7 +1021,7 @@ print(res['result'], res['message'], len(res['epidata']))
10271021
* Anonymous API access is subject to some restrictions:
10281022
<small>public datasets only; 60 requests per hour; only two parameters may have multiple selections</small>
10291023

1030-
* API key grants priviledged access; can be obtained by [registering with us](https://api.delphi.cmu.edu/epidata/admin/registration_form)
1024+
* API key grants privileged access; can be obtained by [registering with us](https://api.delphi.cmu.edu/epidata/admin/registration_form)
10311025

10321026
* Privileges of registration: no rate limit; no limit on multiple selections
10331027

@@ -1172,8 +1166,6 @@ Change `geo_type` and `geo_values` in the previous example
11721166
```{r state-jhu-pub-covidcast}
11731167
#| echo: true
11741168
#| eval: false
1175-
# Obtain the most up-to-date version of the smoothed covid-like illness (CLI)
1176-
# signal from the COVID-19 Trends and Impact survey for all states
11771169
jhu_state_cases <- pub_covidcast(
11781170
source = "jhu-csse",
11791171
signals = "confirmed_7dav_incidence_prop",
