Commit d790345

link to fb-survey for calculation info; add detail
1 parent 6e416bf commit d790345

1 file changed

+21
-85
lines changed

docs/api/covidcast-signals/youtube-survey.md

@@ -4,6 +4,8 @@ parent: Inactive Signals
44
grand_parent: COVIDcast Main Endpoint
55
---
66

7+
[//]: # (code at https://github.com/cmu-delphi/covid-19/tree/deeb4dc1e9a30622b415361ef6b99198e77d2a94/youtube)
8+
79
# Youtube Survey
810
{: .no_toc}
911

@@ -28,7 +30,8 @@ shared back to Youtube.
2830
This survey was a pared-down version of the
2931
[COVID-19 Trends and Impact Survey (CTIS)](../../symptom-survey/),
3032
collecting data only about COVID-19 symptoms. CTIS is much longer-running
31-
and more detailed, also collecting belief and behavior data. See our
33+
and more detailed, also collecting belief and behavior data. CTIS also reports
34+
demographic-corrected versions of some metrics. See our
3235
[surveys page](https://delphi.cmu.edu/covid19/ctis/) for more detail
3336
about how CTIS works.
3437

@@ -97,91 +100,31 @@ can be asymptomatic. Instead, we expect these indicators to be useful for
97100
comparison across the United States and across time, to determine where symptoms
98101
appear to be increasing.
99102

100-
**Smoothing.** The signals beginning with `smoothed` estimate the same quantities as their
101-
`raw` partners, but are smoothed in time to reduce day-to-day sampling noise;
102-
see [details below](#smoothing). Crucially, because the smoothed signals combine
103-
information across multiple days, they have larger sample sizes and hence are
104-
available for more locations than the raw signals.
105-
106-
107-
### Defining Household ILI and CLI
108-
109-
[TODO check]
110-
111-
For a single survey, we are interested in the quantities:
112-
113-
- $$X =$$ the number of people in the household with ILI;
114-
- $$Y =$$ the number of people in the household with CLI;
115-
- $$N =$$ the number of people in the household.
116-
117-
Note that $$N$$ comes directly from the answer to Q3, but neither $$X$$ nor
118-
$$Y$$ can be computed directly (because Q2 does not give an answer to the
119-
precise symptomatic profile of all individuals in the household, it only asks
120-
how many individuals have fever and at least one other symptom from the list).
121-
122-
We hence estimate $$X$$ and $$Y$$ with the following simple strategy. Consider
123-
ILI, without a loss of generality (we apply the same strategy to CLI). Let $$Z$$
124-
be the answer to Q2.
125-
126-
- If the answer to Q1 does not meet the ILI definition, then we report $$X=0$$.
127-
- If the answer to Q1 does meet the ILI definition, then we report $$X = Z$$.
128-
129-
This can only "over count" (result in too large estimates of) the true $$X$$ and
130-
$$Y$$. For example, this happens when some members of the household experience
131-
ILI that does not also qualify as CLI, while others experience CLI that does not
132-
also qualify as ILI. In this case, for both $$X$$ and $$Y$$, our simple strategy
133-
would return the sum of both types of cases. However, given the extreme degree
134-
of overlap between the definitions of ILI and CLI, it is reasonable to believe
135-
that, if symptoms across all household members qualified as both ILI and CLI,
136-
each individual would have both, or neither---with neither being more common.
137-
Therefore we do not consider this "over counting" phenomenon practically
138-
problematic.
139103

104+
## Estimation
140105

141106
### Estimating Percent ILI and CLI
142107

143-
[TODO check]
144-
145-
Let $$x$$ and $$y$$ be the number of people with ILI and CLI, respectively, over
146-
a given time period, and in a given location (for example, the time period being
147-
a particular day, and a location being a particular state). Let $$n$$ be the
148-
total number of people in this location. We are interested in estimating the
149-
true ILI and CLI percentages, which we denote by $$p$$ and $$q$$, respectively:
150-
151-
$$
152-
p = 100 \cdot \frac{x}{n}
153-
\quad\text{and}\quad
154-
q = 100 \cdot \frac{y}{n}.
155-
$$
156-
157-
In a given aggregation unit (for example, daily-state), let $$X_i$$ and $$Y_i$$
158-
denote number of ILI and CLI cases in the household, respectively (computed
159-
according to the simple strategy [described
160-
above](#defining-household-ili-and-cli)), and let $$N_i$$ denote the total
161-
number of people in the household, in survey $$i$$, out of $$m$$ surveys we
162-
collected. Then our unweighted estimates of $$p$$ and $$q$$ are:
163-
164-
$$
165-
\hat{p} = 100 \cdot \frac{1}{m}\sum_{i=1}^m \frac{X_i}{N_i}
166-
\quad\text{and}\quad
167-
\hat{q} = 100 \cdot \frac{1}{m}\sum_{i=1}^m \frac{Y_i}{N_i}.
168-
$$
169-
108+
Estimates are calculated using the
109+
[same method as CTIS](./fb-survey#estimating-percent-ili-and-cli).
110+
However, the Youtube survey does not do weighting.
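
As a rough illustration of what "no weighting" means here, the text removed by this commit gave the unweighted estimate as $$\hat{p} = 100 \cdot \frac{1}{m}\sum_{i=1}^m \frac{X_i}{N_i}$$, where $$X_i$$ is the estimated number of ILI cases and $$N_i$$ the household size in survey $$i$$. The sketch below computes that average. The function name and the sample inputs are hypothetical and are not taken from the Delphi pipeline.

```python
from typing import Sequence


def unweighted_percent(cases: Sequence[float], household_sizes: Sequence[float]) -> float:
    """Return 100 * mean(X_i / N_i) over all surveys, with every survey counted equally.

    Hypothetical helper for illustration only; not the pipeline's actual code.
    """
    if len(cases) != len(household_sizes):
        raise ValueError("cases and household_sizes must have the same length")
    if len(cases) == 0:
        raise ValueError("at least one survey is required")
    # Per-survey fraction of household members counted as cases.
    fractions = [x / n for x, n in zip(cases, household_sizes)]
    return 100 * sum(fractions) / len(fractions)


# Made-up responses from three households: (X_i, N_i) = (0, 2), (1, 4), (2, 4).
example_percent = unweighted_percent([0, 1, 2], [2, 4, 4])
```

Each survey contributes equally to the average regardless of household size, which is the sense in which no weights are applied.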
170111

171112
### Smoothing
172113

173114
The smoothed versions of all `youtube-survey` signals (with `smoothed` prefix) are
174115
calculated using seven-day pooling. For example, the estimate reported for June
175116
7 in a specific geographical area is formed by
176117
collecting all surveys completed between June 1 and 7 (inclusive) and using that
177-
data in the estimation procedures described above.
178-
118+
data in the estimation procedures described above. Because the smoothed signals combine
119+
information across multiple days, they have larger sample sizes and hence are
120+
available for more locations than the raw signals.
179121

180122
## Lag and Backfill
181123

182-
Lag is 1 day. Backfill continues for a couple days.
183-
184-
[TODO more detail]
124+
This indicator has a lag of 2 days. Reported values can be revised for one
125+
day (corresponding to a lag of 3 days), due to how we receive survey
126+
responses. However, these revisions tend to be associated with minimal changes in
127+
value.
185128

186129

187130
## Limitations
@@ -205,19 +148,6 @@ limitations of this survey data.
205148
to answer that they *do*. This survey is anonymous and online, meaning we
206149
expect the social desirability effect to be smaller, but it may still be
207150
present.
208-
* **False responses.** As with anything on the Internet, a small percentage of
209-
users give deliberately incorrect responses. [TODO check if true] We discard a small number of
210-
responses that are obviously false, but do **not** perform extensive
211-
filtering. However, the large size of the study, and [TODO check if true] our procedure for
212-
ensuring that each respondent can only be counted once when they are invited
213-
to take the survey, prevents individual respondents from having a large effect
214-
on results.
215-
* **Repeat invitations.** [TODO check] Individual respondents can be invited by Youtube to
216-
take the survey several times. Usually Youtube only re-invites a respondent
217-
after one month. Hence estimates of values on a single day are calculated
218-
using independent survey responses from unique respondents (or, at least,
219-
unique Youtube accounts), whereas estimates from different months may involve
220-
the same respondents.
221151

222152
Whenever possible, you should compare this data to other independent sources. We
223153
believe that while these biases may affect point estimates -- that is, they may
@@ -237,3 +167,9 @@ This affects some items more than others. It affects some geographic areas
237167
more than others, particularly areas with smaller populations. This affect is
238168
less pronounced with smoothed signals, since responses are pooled across a
239169
longer time period.
170+
171+
172+
## Source and Licensing
173+
174+
This indicator aggregates responses from a Delphi-run survey that is hosted on the Youtube platform.
175+
The data is licensed as [CC BY-NC](../covidcast_licensing.md#creative-commons-attribution-noncommercial).
