Commit 726c5c8

Document forthcoming monthly rollup CSVs
1 parent f887bed commit 726c5c8

1 file changed, with 41 additions and 19 deletions.

docs/symptom-survey/survey-files.md

@@ -26,13 +26,16 @@ where the data is hosted.
 1. TOC
 {:toc}
 
-## Naming Conventions
+## Available Data Files
 
-Cumulative files:
+We provide two types of data files, daily and monthly. Users who need the most
+up-to-date data should use the daily files, while those who want to conduct
+retrospective analyses using many months of data may find the monthly files more
+convenient.
 
-{YYYY_mm}.tar
+### Daily Files
 
-Incremental files:
+Each day, we write CSV files with names following this pattern:
 
 cvid_responses_{for}_recordedby_{recorded}.csv.gz
 
@@ -42,33 +45,51 @@ day the survey response was started, in the Pacific time zone (UTC -
 policy](#lag-policy) for more details.
 
 Every day, we write response files for all recent days of data, with today's
-`recorded` date. You need only load the most recent set of `recorded` files to
-obtain all survey responses; the older versions are available to track any
-changes in file formats or slight changes from late-arriving responses, as
-described in the [lag policy below](#lag-policy).
-
-## Loading Data Files
-
-As described above, one day of data may be reissued several times, if responses
-arrive late, file formats are changed, or errors in data processing are
-corrected. You need only load the latest version of each file.
+`recorded` date. For each `for` date, you need only load the most recent
+`recorded` file to obtain all survey responses; the older versions are available
+to track any changes in file formats or slight changes from late-arriving
+responses, as described in the [lag policy below](#lag-policy).
 
 For data users who use R to load and process data, we provide a [`get_survey_df`
 function](survey-utils.R) to read a directory of CSV files (such as those
 provided on the SFTP server), select the correct files, and read them into a
 single data frame for use.
 
+### Monthly Files
+
+Several days after the end of each month, we produce "rollup" files containing
+all survey responses from that month. These are in two forms.
+
+First, the monthly CSV files have filenames in the form
+
+{YYYY}-{mm}.csv
+
+and contain all valid responses for that month. These are produced from the
+daily files, by taking the data with the most recent `recordedby` date for each
+day of the month. Users doing historical analyses of the survey data should
+start with these files, since they provide the easiest way to get all the
+necessary data, without accidentally including duplicate results.
+
+Second, we produce monthly tarballs containing all the daily `.csv.gz` files for
+that month, with names in the form
+
+{YYYY}-{mm}.tar
+
+These archives can be unpacked using the standard `tar` command. The unpacked
+files are described in [Daily Files](#daily-files) above.
+
 ## Conditions Responses are Recorded
 
-The survey was configured to record responses under two sets of circumstances:
+The survey is configured to record responses under two sets of circumstances:
 
-1. The user taking the survey clicked submit, or
+1. The user taking the survey reached the end of the survey, or
 2. The user taking the survey left the survey unattended for 4 hours.
 
 An abandoned survey as in (2) is automatically closed and recorded, and the user
 is not permitted to reopen it.
 
 Responses qualify for inclusion in these files if they meet the following conditions:
+
 * answered "yes" to age consent
 * answered a minimum of 2 additional questions, where to “answer” a numeric
 open-ended question (A2, A2b, B2b, Q40, C10_1_1, C10_2_1, C10_3_1, C10_4_1,
@@ -113,6 +134,7 @@ observed was four days.
 
 Once a weight has been generated for a unique identifier, CMU continues to
 retrieve survey responses for that identifier in case one comes in with an
-earlier starting time. If it does, CMU will replace the individual response data
-in the appropriate file with the response that was started earlier. We expect
-all response files to stabilize after four days.
+earlier starting time. If it does, CMU will write a new daily CSV file
+containing the response with the earlier starting time (and not the later
+response), as [described above](#daily-files). We expect all response files to
+stabilize after four days.
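
The selection rule documented in this change (for each `for` date, load only the file with the most recent `recorded` date) can be sketched in Python for readers who do not use the R helper. This is a minimal sketch under stated assumptions, not the project's implementation: the function name `latest_files` is hypothetical, and the code assumes that both dates in each filename are written in a zero-padded, year-first format, so that comparing them as strings matches chronological order.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Matches the daily-file pattern cvid_responses_{for}_recordedby_{recorded}.csv.gz.
# The exact date format inside {for} and {recorded} is an assumption (see above).
DAILY_FILE = re.compile(
    r"^cvid_responses_(?P<for_date>.+?)_recordedby_(?P<recorded>.+)\.csv\.gz$"
)


def latest_files(filenames: Iterable[str]) -> List[str]:
    """Return one filename per `for` date: the one with the latest `recorded` date."""
    latest: Dict[str, Tuple[str, str]] = {}
    for name in filenames:
        match = DAILY_FILE.match(name)
        if match is None:
            continue  # skip anything that is not a daily response file
        for_date = match.group("for_date")
        recorded = match.group("recorded")
        # String comparison is only valid under the zero-padded, year-first assumption.
        if for_date not in latest or recorded > latest[for_date][0]:
            latest[for_date] = (recorded, name)
    # Sort by `for` date so the result is deterministic.
    return [latest[for_date][1] for for_date in sorted(latest)]
```

Passing the result of `os.listdir` on a directory of daily files to `latest_files` would give the set of files to read. Older reissues of the same `for` date are ignored, which avoids counting a response twice.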
