@@ -26,49 +26,77 @@ where the data is hosted.
1. TOC
{:toc}

- ## Naming Conventions
+ ## Available Data Files

- Cumulative files:
+ We provide two types of data files, daily and monthly. Users who need the most
+ up-to-date data should use the daily files, while those who want to conduct
+ retrospective analyses using many months of data may find the monthly files more
+ convenient.

-     {YYYY_mm}.tar
+ ### Daily Files

- Incremental files:
+ Each day, we write CSV files with names following this pattern:

    cvid_responses_{for}_recordedby_{recorded}.csv.gz

Dates in incremental filenames are of the form `YYYY_mm_dd`. `for` refers to the
day the survey response was started, in the Pacific time zone (UTC -
7). `recorded` refers to the day survey data was retrieved; see the [lag
- policy](#lag-policy) for more details.
+ policy](#lag-policy) for more details. Each file is compressed with gzip, and
+ the standard `gunzip` command on Linux or Mac can decompress it.

Every day, we write response files for all recent days of data, with today's
- `recorded` date. You need only load the most recent set of `recorded` files to
- obtain all survey responses; the older versions are available to track any
- changes in file formats or slight changes from late-arriving responses, as
- described in the [lag policy below](#lag-policy).
-
- ## Loading Data Files
-
- As described above, one day of data may be reissued several times, if responses
- arrive late, file formats are changed, or errors in data processing are
- corrected. You need only load the latest version of each file.
+ `recorded` date. For each `for` date, you need only load the most recent
+ `recorded` file to obtain all survey responses; the older versions are available
+ to track any changes in file formats or slight changes from late-arriving
+ responses, as described in the [lag policy below](#lag-policy).

For data users who use R to load and process data, we provide a [`get_survey_df`
function](survey-utils.R) to read a directory of CSV files (such as those
provided on the SFTP server), select the correct files, and read them into a
single data frame for use.
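
For readers not working in R, the selection rule described above (keep only the most recent `recorded` file for each `for` date) might be sketched in Python as follows. This is an illustrative sketch, not the behavior of `get_survey_df`: the function name `latest_daily_files` is hypothetical, and the regular expression assumes both dates in the filename are zero-padded `YYYY_mm_dd` strings.

```python
import re
from pathlib import Path

# Filename pattern from the documentation; the zero-padded YYYY_mm_dd date
# format for both fields is an assumption of this sketch.
PATTERN = re.compile(
    r"^cvid_responses_(\d{4}_\d{2}_\d{2})_recordedby_(\d{4}_\d{2}_\d{2})\.csv\.gz$"
)


def latest_daily_files(directory):
    """Return {for_date: Path}, keeping the newest `recorded` file per `for` date.

    Hypothetical helper for illustration only.
    """
    latest = {}
    for path in Path(directory).iterdir():
        match = PATTERN.match(path.name)
        if match is None:
            continue  # skip anything that is not a daily response file
        for_date, recorded = match.groups()
        # Zero-padded YYYY_mm_dd strings sort chronologically as plain strings.
        if for_date not in latest or recorded > latest[for_date][0]:
            latest[for_date] = (recorded, path)
    return {for_date: path for for_date, (_, path) in sorted(latest.items())}
```

The returned paths can then be read one by one and concatenated; older `recorded` versions of the same `for` date are skipped, which avoids loading duplicate responses.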

+ ### Monthly Files
+
+ Several days after the end of each month, we produce "rollup" files containing
+ all survey responses from that month. These are in two forms.
+
+ First, the monthly CSV files have filenames in the form
+
+     {YYYY}-{mm}.csv.gz
+
+ and contain all valid responses for that month. These are produced from the
+ daily files, by taking the data with the most recent `recorded` date for each
+ day of the month. Because these files are large (typically over 300 MB), they
+ are compressed with gzip; the standard `gunzip` command on macOS or Linux can
+ decompress them. (macOS can also decompress these files through Finder
+ automatically; on Windows, free programs like [7-zip](https://www.7-zip.org/)
+ can decompress gzip files.) Users doing historical analyses of the survey data
+ should start with these files, since they provide the easiest way to get all the
+ necessary data, without accidentally including duplicate results.
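
A monthly rollup can also be read without decompressing it to disk first. The following is a minimal sketch using only the Python standard library; the helper name `read_monthly_rollup` is hypothetical, and UTF-8 encoding is an assumption not stated in the documentation.

```python
import csv
import gzip


def read_monthly_rollup(path):
    """Yield each response row of a gzip-compressed CSV as a dict.

    Hypothetical helper; UTF-8 encoding is assumed.
    """
    # "rt" opens the gzip stream in text mode; newline="" is what the csv
    # module expects so that it can handle line endings itself.
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle)
```

Because the rows are yielded one at a time, a file of several hundred megabytes does not need to fit in memory all at once, for example `sum(1 for _ in read_monthly_rollup(path))` counts responses in a streaming fashion.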
+
+ Second, we produce monthly tarballs containing the daily `.csv.gz` files for
+ that month, with names in the form
+
+     {YYYY}-{mm}.tar
+
+ Similar to the monthly CSV files, they contain only the files with the most
+ recent `recorded` date for each day. These archives can be unpacked using the
+ standard `tar` command. The unpacked files are described in [Daily
+ Files](#daily-files) above.
+
## Conditions Responses are Recorded

- The survey was configured to record responses under two sets of circumstances:
+ The survey is configured to record responses under two sets of circumstances:

- 1. The user taking the survey clicked submit, or
+ 1. The user taking the survey reached the end of the survey, or
2. The user taking the survey left the survey unattended for 4 hours.

An abandoned survey as in (2) is automatically closed and recorded, and the user
is not permitted to reopen it.

Responses qualify for inclusion in these files if they meet the following conditions:
+
* answered "yes" to age consent
* answered a minimum of 2 additional questions, where to “answer” a numeric
  open-ended question (A2, A2b, B2b, Q40, C10_1_1, C10_2_1, C10_3_1, C10_4_1,
@@ -113,6 +141,7 @@ observed was four days.

Once a weight has been generated for a unique identifier, CMU continues to
retrieve survey responses for that identifier in case one comes in with an
- earlier starting time. If it does, CMU will replace the individual response data
- in the appropriate file with the response that was started earlier. We expect
- all response files to stabilize after four days.
+ earlier starting time. If it does, CMU will write a new daily CSV file
+ containing the response with the earlier starting time (and not the later
+ response), as [described above](#daily-files). We expect all response files to
+ stabilize after four days.