Commit 34ee996

Merge pull request #339 from cmu-delphi/docs/monthly-rollups
Document monthly rollup CSVs
2 parents c2370fa + c1f4d4e commit 34ee996

2 files changed

+73
-27
lines changed

docs/symptom-survey/survey-files.md

Lines changed: 49 additions & 20 deletions
Original file line number | Diff line number | Diff line change
@@ -26,49 +26,77 @@ where the data is hosted.
2626
1. TOC
2727
{:toc}
2828

29-
## Naming Conventions
29+
## Available Data Files
3030

31-
Cumulative files:
31+
We provide two types of data files, daily and monthly. Users who need the most
32+
up-to-date data should use the daily files, while those who want to conduct
33+
retrospective analyses using many months of data may find the monthly files more
34+
convenient.
3235

33-
{YYYY_mm}.tar
36+
### Daily Files
3437

35-
Incremental files:
38+
Each day, we write CSV files with names following this pattern:
3639

3740
cvid_responses_{for}_recordedby_{recorded}.csv.gz
3841

3942
Dates in incremental filenames are of the form `YYYY_mm_dd`. `for` refers to the
4043
day the survey response was started, in the Pacific time zone (UTC -
4144
7). `recorded` refers to the day survey data was retrieved; see the [lag
42-
policy](#lag-policy) for more details.
45+
policy](#lag-policy) for more details. Each file is compressed with gzip, and
46+
the standard `gunzip` command on Linux or Mac can decompress it.
4347

4448
Every day, we write response files for all recent days of data, with today's
45-
`recorded` date. You need only load the most recent set of `recorded` files to
46-
obtain all survey responses; the older versions are available to track any
47-
changes in file formats or slight changes from late-arriving responses, as
48-
described in the [lag policy below](#lag-policy).
49-
50-
## Loading Data Files
51-
52-
As described above, one day of data may be reissued several times, if responses
53-
arrive late, file formats are changed, or errors in data processing are
54-
corrected. You need only load the latest version of each file.
49+
`recorded` date. For each `for` date, you need only load the most recent
50+
`recorded` file to obtain all survey responses; the older versions are available
51+
to track any changes in file formats or slight changes from late-arriving
52+
responses, as described in the [lag policy below](#lag-policy).
5553

5654
For data users who use R to load and process data, we provide a [`get_survey_df`
5755
function](survey-utils.R) to read a directory of CSV files (such as those
5856
provided on the SFTP server), select the correct files, and read them into a
5957
single data frame for use.
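
The "most recent `recorded` file per `for` date" rule can be sketched in base R as follows. This is only an illustration that assumes the daily filename pattern shown above; `latest_recorded_files` is a hypothetical helper, not the implementation of `get_survey_df`.

```r
# Hypothetical helper (not part of survey-utils.R): given a vector of daily
# file names, keep only the newest `recorded` file for each `for` date.
latest_recorded_files <- function(filenames) {
  # Assumes names of the form
  # cvid_responses_{for}_recordedby_{recorded}.csv.gz with YYYY_mm_dd dates.
  pattern <- paste0(
    "^cvid_responses_([0-9]{4}_[0-9]{2}_[0-9]{2})",
    "_recordedby_([0-9]{4}_[0-9]{2}_[0-9]{2})\\.csv\\.gz$"
  )

  # Ignore anything that does not look like a daily response file.
  base <- basename(filenames)
  keep <- grepl(pattern, base)
  filenames <- filenames[keep]
  base <- base[keep]

  for_date <- as.Date(sub(pattern, "\\1", base), format = "%Y_%m_%d")
  recorded <- as.Date(sub(pattern, "\\2", base), format = "%Y_%m_%d")

  # Sort by `for` date, newest `recorded` first, then keep the first file
  # seen for each `for` date.
  ord <- order(for_date, -as.numeric(recorded))
  filenames <- filenames[ord]
  filenames[!duplicated(for_date[ord])]
}
```

The returned paths could then be read one by one and row-bound into a single data frame, which is roughly what the text above says `get_survey_df` provides.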
6058

59+
### Monthly Files
60+
61+
Several days after the end of each month, we produce "rollup" files containing
62+
all survey responses from that month. These are in two forms.
63+
64+
First, the monthly CSV files have filenames in the form
65+
66+
{YYYY}-{mm}.csv.gz
67+
68+
and contain all valid responses for that month. These are produced from the
69+
daily files, by taking the data with the most recent `recorded` date for each
70+
day of the month. Because these files are large (typically over 300 MB), they
71+
are compressed with gzip; the standard `gunzip` command on macOS or Linux can
72+
decompress them. (macOS can also decompress these files through Finder
73+
automatically; on Windows, free programs like [7-zip](https://www.7-zip.org/)
74+
can decompress gzip files.) Users doing historical analyses of the survey data
75+
should start with these files, since they provide the easiest way to get all the
76+
necessary data, without accidentally including duplicate results.
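
For R users, a monthly CSV can also be read without a separate `gunzip` step. The sketch below uses only base R; `read_monthly_rollup` is a hypothetical helper, and it assumes the monthly file has a `fips` column like the daily files, which should stay as text so leading zeros are preserved.

```r
# Hypothetical helper: read one gzip-compressed monthly rollup into a
# data frame. gzfile() decompresses transparently while reading.
read_monthly_rollup <- function(path) {
  read.csv(
    gzfile(path),
    # Assumption: a `fips` column exists; keep it as character so codes
    # such as "01001" do not lose their leading zero.
    colClasses = c(fips = "character"),
    stringsAsFactors = FALSE
  )
}
```

Other columns are left to `read.csv`'s type guessing here; the `get_survey_df` function in `survey-utils.R` specifies column types explicitly and is the safer choice when types matter.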
77+
78+
Second, we produce monthly tarballs containing the daily `.csv.gz` files for
79+
that month, with names in the form
80+
81+
{YYYY}-{mm}.tar
82+
83+
Similar to the monthly CSV files, they contain only the files with the most
84+
recent `recorded` date for each day. These archives can be unpacked using the
85+
standard `tar` command. The unpacked files are described in [Daily
86+
Files](#daily-files) above.
87+
6188
## Conditions Responses are Recorded
6289

63-
The survey was configured to record responses under two sets of circumstances:
90+
The survey is configured to record responses under two sets of circumstances:
6491

65-
1. The user taking the survey clicked submit, or
92+
1. The user taking the survey reached the end of the survey, or
6693
2. The user taking the survey left the survey unattended for 4 hours.
6794

6895
An abandoned survey as in (2) is automatically closed and recorded, and the user
6996
is not permitted to reopen it.
7097

7198
Responses qualify for inclusion in these files if they meet the following conditions:
99+
72100
* answered "yes" to age consent
73101
* answered a minimum of 2 additional questions, where to “answer” a numeric
74102
open-ended question (A2, A2b, B2b, Q40, C10_1_1, C10_2_1, C10_3_1, C10_4_1,
@@ -113,6 +141,7 @@ observed was four days.
113141

114142
Once a weight has been generated for a unique identifier, CMU continues to
115143
retrieve survey responses for that identifier in case one comes in with an
116-
earlier starting time. If it does, CMU will replace the individual response data
117-
in the appropriate file with the response that was started earlier. We expect
118-
all response files to stabilize after four days.
144+
earlier starting time. If it does, CMU will write a new daily CSV file
145+
containing the response with the earlier starting time (and not the later
146+
response), as [described above](#daily-files). We expect all response files to
147+
stabilize after four days.

docs/symptom-survey/survey-utils.R

Lines changed: 24 additions & 7 deletions
Original file line number | Diff line number | Diff line change
@@ -5,14 +5,14 @@ library(dplyr)
55
#' Fetch all survey data in a chosen directory.
66
#'
77
#' There can be multiple data files for a single day of survey responses, for
8-
#' example if the data is reissued when late-arriving surveys are recorded.
9-
#' Each file contains *all* data recorded for that date, so only the later files
10-
#' are needed.
8+
#' example if the data is reissued when late-arriving surveys are recorded. Each
9+
#' file contains *all* data recorded for that date, so only the most recently
10+
#' updated file for each date is needed.
1111
#'
12-
#' This function extracts the date from each file, determines which files are
13-
#' reissued data, and produces a single data frame representing the most recent
14-
#' data available for each day. It can read gzip-compressed CSV files, such as
15-
#' those on the SFTP site, using `readr::read_csv`.
12+
#' This function extracts the date from each file, determines which files
13+
#' contain reissued data, and produces a single data frame representing the most
14+
#' recent data available for each day. It can read gzip-compressed CSV files,
15+
#' such as those on the SFTP site, using `readr::read_csv`.
1616
#'
1717
#' This function handles column types correctly for surveys up to Wave 4.
1818
#'
@@ -62,10 +62,27 @@ get_survey_df <- function(directory, pattern = "*.csv.gz$") {
6262
C13 = col_character(),
6363
C13a = col_character(),
6464
D1_4_TEXT = col_character(),
65+
E3 = col_character(),
6566
fips = col_character(),
6667
UserLanguage = col_character(),
6768
StartDatetime = col_character(),
6869
EndDatetime = col_character(),
70+
Q65 = col_integer(),
71+
Q66 = col_integer(),
72+
Q67 = col_integer(),
73+
Q68 = col_integer(),
74+
Q69 = col_integer(),
75+
Q70 = col_integer(),
76+
Q71 = col_integer(),
77+
Q72 = col_integer(),
78+
Q73 = col_integer(),
79+
Q74 = col_integer(),
80+
Q75 = col_integer(),
81+
Q76 = col_integer(),
82+
Q77 = col_integer(),
83+
Q78 = col_integer(),
84+
Q79 = col_integer(),
85+
Q80 = col_integer(),
6986
.default = col_number()))
7087
}
7188
)
