Merge pull request #58 from cmu-delphi/rach-minor-edits

dajmcdon · web-flow · commit ab28e6a57623 · 2024-12-10T23:19:18.000-06:00
Minor edits to Day 1 morning + Day 1 afternoon (tidyverse only)
diff --git a/slides/day1-afternoon.qmd b/slides/day1-afternoon.qmd
@@ -74,7 +74,6 @@ Our focus will be on basic operations like selecting and filtering data.
 <small>[Source](https://towardsdatascience.com/data-manipulation-in-r-with-dplyr-3095e0867f75)</small>
 </div>
 
-
 ## Downloading JHU CSSE COVID-19 case data
 
 * Let's start with something familiar... Here's a task for you:
@@ -102,8 +101,9 @@ cases_df_api <- pub_covidcast(
 Now we only really need a few columns here...
 ```{r head-jhu-dplyr-demo-data}
 #| echo: true
-cases_df <- cases_df_api |>
-  select(geo_value, time_value, raw_cases = value) # We'll talk more about this soon :)
+# Base R way for now...
+cases_df <- cases_df_api[,c("geo_value", "time_value", "value")] 
+names(cases_df)[names(cases_df) == "value"] <- "raw_cases"
 ```
 
 ## Ways to inspect the dataset
@@ -120,15 +120,15 @@ and tail to view the last six
 tail(cases_df)  # Last 6 rows
 ```
 
-## Ways to inspect the dataset
-Now, for our first foray into the `tidyverse`...
+<!-- ## Ways to inspect the dataset
 
 Use `glimpse()` to get a compact overview of the dataset.
 
 ```{r glimpse}
 #| echo: true
 glimpse(cases_df)
 ```
+-->
 
 <!-- ## Creating tibbles
 
@@ -149,7 +149,7 @@ The `select()` function is used to pick specific columns from your dataset.
 
 ```{r select-columns}
 #| echo: true
-select(cases_df, geo_value, time_value)  # Select the 'geo_value' and 'time_value' columns
+select(cases_df, time_value, raw_cases)  # Select the 'time_value' and 'raw_cases' columns
 ```
 
 ## Selecting columns with `select()`
@@ -158,7 +158,7 @@ You can exclude columns by prefixing the column names with a minus sign `-`.
 
 ```{r select-columns-exclude}
 #| echo: true
-select(cases_df, -raw_cases)  # Exclude the 'raw_cases' column from the dataset
+select(cases_df, -geo_value)  # Exclude the 'geo_value' column from the dataset
 ```
 
 <!-- ## Extracting columns with `pull()`
@@ -199,7 +199,7 @@ filter(cases_df, geo_value == "nc", raw_cases > 500)  # Filter for NC with raw d
 
 ```{r select-filter-combine}
 #| echo: true
-select(filter(cases_df, geo_value == "nc", raw_cases > 1000), time_value, raw_cases) |> 
+select(filter(cases_df, geo_value == "nc", raw_cases > 500), time_value, raw_cases) |> 
   head()
 ```
 
@@ -213,21 +213,11 @@ select(filter(cases_df, geo_value == "nc", raw_cases > 1000), time_value, raw_ca
 #| echo: true
 # This code reads more like poetry!
 cases_df |> 
-  filter(geo_value == "nc", raw_cases > 1000) |> 
+  filter(geo_value == "nc", raw_cases > 500) |> 
   select(time_value, raw_cases) |> 
   head()
 ```
 
-## Key practices in `dplyr`
-
-* Use [**tibbles**]{.primary} for easier data handling.
-* Use `head()`, `tail()`, and `glimpse()` for quick data inspection.
-* Use `select()` and `filter()` for data manipulation.
-* Chain functions with `|>` for cleaner code.
-
-<!-- * Use `pull()` to extract columns as vectors. --> 
-
-
 ## Grouping data with `group_by()`
 
 * Use `group_by()` to group data by one or more columns.
@@ -317,7 +307,7 @@ cases_df |>
   summarise(median_cases = median(raw_cases))
 ```
 
-## Using `count()` to aggregate data
+<!-- ## Using `count()` to aggregate data
 `count()` is a shortcut for grouping and summarizing the data.
 
 For example, if we want to get the total number of complete rows for each state, then
@@ -340,28 +330,38 @@ cases_count <- cases_df |>
 cases_count # Let's see what the counts are.
 ```
 
-## Key practices in `dplyr`: Round 2
+-->
+
+## Key practices learned
 
+* Use `head()` and `tail()` for quick data inspection.
+* Use `select()` and `filter()` for data manipulation.
+* Chain functions with `|>`.
 * Use `group_by()` to group data by one or more variables before applying functions.
 * Use `mutate()` to create new columns or modify existing ones by applying functions to existing data.
 * Use `summarise()` to reduce data to summary statistics (e.g., mean, median).
-* `count()` is a convenient shortcut for counting rows by group without needing `group_by()` and `summarise()`.
 
-## Tidy data and Tolstoy
+## 1 to 2 word summaries of the `dplyr` functions
 
-> "Happy families are all alike; every unhappy family is unhappy in its own way." — Leo Tolstoy  
+Here are 1 to 2 word summaries of the key `dplyr` functions:
+
+1. `select()`: Choose columns
 
-* [**Tidy datasets**]{.primary} are like happy families: consistent, standardized, and easy to work with.  
-* [**Messy datasets**]{.primary} are like unhappy families: each one messy in its own unique way.  
-In this section:
-* We'll define what makes data *tidy* and how to transform between the tidy and messy formats.
+1. `filter()`: Subset rows
+
+1. `mutate()`: Create columns
+
+1. `group_by()`: Group by
+
+1. `summarise()`: Numerical summary
 
 ## Tidy data and Tolstoy
 
-![](gfx/tidy_messy_data.jpg){style="width: 60%;"}
+> "Happy families are all alike; every unhappy family is unhappy in its own way." — Leo Tolstoy  
 
-<small>[Artwork by @allison_horst](https://x.com/allison_horst)</small>
+![](gfx/tidy_messy_data.jpg){style="width: 40%;"}
 
+<small>[Artwork by @allison_horst](https://x.com/allison_horst)</small>
 
 ## What is tidy data?
 
@@ -385,8 +385,6 @@ head(cases_df)
 * To convert data from long format to wide/messy format use `pivot_wider()`.
 * For example, let's try creating a column for each time value in `cases_df`:
 
-<!-- Example. Spreadsheet from hell -->
-
 ```{r pivot-wider-ex}
 #| echo: true
 messy_cases_df <- cases_df |>
@@ -416,20 +414,6 @@ tidy_cases_df <- messy_cases_df |>
 head(tidy_cases_df, n = 3) # Notice the class of time_value here
 ```
 
-##  Tidying messy data with `pivot_longer()`
-
-* When we used `pivot_longer()`, the `time_value` column is converted to a character class because the column names are treated as strings.
-* So, to truly get the original `cases_df` we need to convert `time_value` back to the `Date` class.
-* Then, we can use `identical()` to check if the two data frames are exactly the same.
-```{r check-identical}
-#| echo: true
-tidy_cases_df = tidy_cases_df |> mutate(time_value = as.Date(time_value))
-
-identical(tidy_cases_df |> arrange(time_value), cases_df)
-```
-
-Great. That was a success!
-
 ## Introduction to joins in `dplyr`
 * Joining datasets is a powerful tool for combining info. from multiple sources.
 * In R, `dplyr` provides several functions to perform different types of joins.
diff --git a/slides/day1-morning.qmd b/slides/day1-morning.qmd
@@ -228,13 +228,11 @@ verify_setup()
 
 * In table form, panel data is a time index + one or more locations/keys.
 
-* **Ex**: The % of outpatient doctor visits that are COVID-related in WA from Dec. 2021 to Feb. 2022 ([docs](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/doctor-visits.html)):
-```{r panel-wa-ex}
-head(epix_as_of(dv_wa, max(dv_wa$DT$version)))
+* **Ex**: The % of outpatient doctor visits that are COVID-related in CA from June 2020 to Dec. 2021 ([docs](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/doctor-visits.html)):
+```{r panel-ca-ex}
+dv_versioned_panel_final |> filter(geo_value == "ca") |> select(-version)
 ```
 
-<!-- Example: The estimated % of outpatient doctor visits that are COVID-related in WA from Dec. 2021 to Feb. 2022 -->
-
 ## Examples of panel data - COVID-19 cases
 
 [[**JHU CSSE COVID cases per 100k**]{.primary}](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html) estimates the daily number of new confirmed COVID-19 cases per 100,000 population, averaged over the past 7 days.
@@ -427,7 +425,7 @@ ggplot(ca |>
 
 * This is because individual-level data has delayed availability:
     
-[Person comes to ER → Has some tests → Admitted → Tests come back → Entered into the system → ...]{.fourth-colour}
+[Person comes to ER → Admitted → Has some tests → Tests come back → Entered into the system → ...]{.fourth-colour}
 
 * So, a "Hospital admission" may not be attributable to a particular condition
 until a few days have passed (the patient may even have been released)
@@ -526,10 +524,6 @@ and subject to [**revision**]{.primary} <!--over time (ex. consider Dec. 1's `pe
 head(x_dt_with_diff) |> as_tibble()
 ```
 
-<!-- min_lag: the minimum time to any value min(as.integer(version) - as.integer(time_value)  -->
-<!-- max_lag: the amount of time until the final (new) version -->
-<!-- revision_summary computes some basic statistics about the revision behavior of an archive, returning a tibble summarizing the revisions per time_value+epi_key features. -->
-
 ## Revision triangle, Outpatient visits in WA 2022 
 
 * 7-day trailing average to smooth day-of-week effects
@@ -1027,7 +1021,7 @@ print(res['result'], res['message'], len(res['epidata']))
 * Anonymous API access is subject to some restrictions:
   <small>public datasets only; 60 requests per hour; only two parameters may have multiple selections</small>
 
-* API key grants priviledged access; can be obtained by [registering with us](https://api.delphi.cmu.edu/epidata/admin/registration_form) 
+* API key grants privileged access; can be obtained by [registering with us](https://api.delphi.cmu.edu/epidata/admin/registration_form) 
 
 * Privileges of registration: no rate limit; no limit on multiple selections
 
@@ -1172,8 +1166,6 @@ Change `geo_type` and `geo_values` in the previous example
 ```{r state-jhu-pub-covidcast}
 #| echo: true
 #| eval: false
-# Obtain the most up-to-date version of the smoothed covid-like illness (CLI)
-# signal from the COVID-19 Trends and Impact survey for all states
 jhu_state_cases <- pub_covidcast(
   source = "jhu-csse",
   signals = "confirmed_7dav_incidence_prop",