Merge pull request #339 from cmu-delphi/docs/monthly-rollups

capnrefsmmat · web-flow · commit 34ee996d61ee · 2020-12-17T12:21:16.000-05:00
Document monthly rollup CSVs
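
For reviewers who do not use the R helper `get_survey_df`, the selection rule this change documents for daily files (for each `for` date, keep only the file with the most recent `recorded` date) can be sketched in Python. This is a minimal illustration under stated assumptions, not part of the change: the function name `latest_daily_files` and the regular expression are hypothetical, and the filename pattern is copied from the documentation text in the diff.

```python
import re
from pathlib import Path

# Filename pattern as documented: cvid_responses_{for}_recordedby_{recorded}.csv.gz,
# with both dates written as YYYY_mm_dd. The regex itself is an assumption.
DAILY_NAME = re.compile(
    r"^cvid_responses_(\d{4}_\d{2}_\d{2})_recordedby_(\d{4}_\d{2}_\d{2})\.csv\.gz$"
)


def latest_daily_files(filenames):
    """Return {for_date: filename}, keeping the newest `recorded` file per `for` date."""
    latest = {}  # for_date -> (recorded_date, filename)
    for name in filenames:
        match = DAILY_NAME.match(Path(name).name)
        if match is None:
            continue  # skip monthly rollups and anything else that does not match
        for_date, recorded = match.groups()
        # Zero-padded YYYY_mm_dd strings sort chronologically, so plain
        # string comparison is enough to find the newest reissue.
        if for_date not in latest or recorded > latest[for_date][0]:
            latest[for_date] = (recorded, name)
    return {for_date: name for for_date, (_, name) in sorted(latest.items())}
```

Passing the result's values to a CSV reader that understands gzip would then load each day exactly once, which is the same duplicate-avoidance the monthly rollup files are described as providing.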
diff --git a/docs/symptom-survey/survey-files.md b/docs/symptom-survey/survey-files.md
@@ -26,49 +26,77 @@ where the data is hosted.
 1. TOC
 {:toc}
 
-## Naming Conventions
+## Available Data Files
 
-Cumulative files:
+We provide two types of data files, daily and monthly. Users who need the most
+up-to-date data should use the daily files, while those who want to conduct
+retrospective analyses using many months of data may find the monthly files more
+convenient.
 
-	{YYYY_mm}.tar
+### Daily Files
 
-Incremental files:
+Each day, we write CSV files with names following this pattern:
 
 	cvid_responses_{for}_recordedby_{recorded}.csv.gz
 
 Dates in incremental filenames are of the form `YYYY_mm_dd`. `for` refers to the
 day the survey response was started, in the Pacific time zone (UTC -
 7). `recorded` refers to the day survey data was retrieved; see the [lag
-policy](#lag-policy) for more details.
+policy](#lag-policy) for more details. Each file is compressed with gzip, and
+the standard `gunzip` command on macOS or Linux can decompress it.
 
 Every day, we write response files for all recent days of data, with today's
-`recorded` date. You need only load the most recent set of `recorded` files to
-obtain all survey responses; the older versions are available to track any
-changes in file formats or slight changes from late-arriving responses, as
-described in the [lag policy below](#lag-policy).
-
-## Loading Data Files
-
-As described above, one day of data may be reissued several times, if responses
-arrive late, file formats are changed, or errors in data processing are
-corrected. You need only load the latest version of each file.
+`recorded` date. For each `for` date, you need only load the most recent
+`recorded` file to obtain all survey responses; the older versions are available
+to track any changes in file formats or slight changes from late-arriving
+responses, as described in the [lag policy below](#lag-policy).
 
 For data users who use R to load and process data, we provide a [`get_survey_df`
 function](survey-utils.R) to read a directory of CSV files (such as those
 provided on the SFTP server), select the correct files, and read them into a
 single data frame for use.
 
+### Monthly Files
+
+Several days after the end of each month, we produce "rollup" files containing
+all survey responses from that month. These files come in two forms.
+
+First, the monthly CSV files have filenames in the form
+
+	{YYYY}-{mm}.csv.gz
+
+and contain all valid responses for that month. They are produced from the
+daily files by taking the data with the most recent `recorded` date for each
+day of the month. Because these files are large (typically over 300 MB), they
+are compressed with gzip; the standard `gunzip` command on macOS or Linux can
+decompress them. (macOS can also decompress these files automatically through
+Finder; on Windows, free programs like [7-zip](https://www.7-zip.org/) can
+decompress gzip files.) Users doing historical analyses of the survey data
+should start with these files, since they provide the easiest way to get all the
+necessary data without accidentally including duplicate responses.
+
+Second, we produce monthly tarballs containing the daily `.csv.gz` files for
+that month, with names in the form
+
+	{YYYY}-{mm}.tar
+
+Like the monthly CSV files, these tarballs contain only the files with the most
+recent `recorded` date for each day. The archives can be unpacked using the
+standard `tar` command. The unpacked files are described in [Daily
+Files](#daily-files) above.
+
 ## Conditions Responses are Recorded
 
-The survey was configured to record responses under two sets of circumstances:
+The survey is configured to record responses under two sets of circumstances:
 
-1. The user taking the survey clicked submit, or
+1. The user taking the survey reached the end of the survey, or
 2. The user taking the survey left the survey unattended for 4 hours.
 
 An abandoned survey as in (2) is automatically closed and recorded, and the user
 is not permitted to reopen it.
 
 Responses qualify for inclusion in these files if they meet the following conditions:
+
 * answered "yes" to age consent
 * answered a minimum of 2 additional questions, where to “answer” a numeric
   open-ended question (A2, A2b, B2b, Q40, C10_1_1, C10_2_1, C10_3_1, C10_4_1,
@@ -113,6 +141,7 @@ observed was four days.
 
 Once a weight has been generated for a unique identifier, CMU continues to
 retrieve survey responses for that identifier in case one comes in with an
-earlier starting time. If it does, CMU will replace the individual response data
-in the appropriate file with the response that was started earlier. We expect
-all response files to stabilize after four days.
+earlier starting time. If it does, CMU will write a new daily CSV file
+containing the response with the earlier starting time (and not the later
+response), as [described above](#daily-files). We expect all response files to
+stabilize after four days.
diff --git a/docs/symptom-survey/survey-utils.R b/docs/symptom-survey/survey-utils.R
@@ -5,14 +5,14 @@ library(dplyr)
 #' Fetch all survey data in a chosen directory.
 #'
 #' There can be multiple data files for a single day of survey responses, for
-#' example if the data is reissued when late-arriving surveys are recorded.
-#' Each file contains *all* data recorded for that date, so only the later files
-#' are needed.
+#' example if the data is reissued when late-arriving surveys are recorded. Each
+#' file contains *all* data recorded for that date, so only the most recently
+#' updated file for each date is needed.
 #'
-#' This function extracts the date from each file, determines which files are
-#' reissued data, and produces a single data frame representing the most recent
-#' data available for each day. It can read gzip-compressed CSV files, such as
-#' those on the SFTP site, using `readr::read_csv`.
+#' This function extracts the date from each file, determines which files
+#' contain reissued data, and produces a single data frame representing the most
+#' recent data available for each day. It can read gzip-compressed CSV files,
+#' such as those on the SFTP site, using `readr::read_csv`.
 #'
 #' This function handles column types correctly for surveys up to Wave 4.
 #'
@@ -62,10 +62,27 @@ get_survey_df <- function(directory, pattern = "*.csv.gz$") {
                  C13 = col_character(),
                  C13a = col_character(),
                  D1_4_TEXT = col_character(),
+                 E3 = col_character(),
                  fips = col_character(),
                  UserLanguage = col_character(),
                  StartDatetime = col_character(),
                  EndDatetime = col_character(),
+                 Q65 = col_integer(),
+                 Q66 = col_integer(),
+                 Q67 = col_integer(),
+                 Q68 = col_integer(),
+                 Q69 = col_integer(),
+                 Q70 = col_integer(),
+                 Q71 = col_integer(),
+                 Q72 = col_integer(),
+                 Q73 = col_integer(),
+                 Q74 = col_integer(),
+                 Q75 = col_integer(),
+                 Q76 = col_integer(),
+                 Q77 = col_integer(),
+                 Q78 = col_integer(),
+                 Q79 = col_integer(),
+                 Q80 = col_integer(),
                  .default = col_number()))
     }
   )