Commit 41a7b1e

Reorganize and improve technical estimator detail docs
1 parent 0b03330 commit 41a7b1e

1 file changed

+111
-90
lines changed

docs/api/covidcast-signals/fb-survey.md

Lines changed: 111 additions & 90 deletions
@@ -246,9 +246,9 @@ data in the estimation procedures described above.
246246

247247
## Behavior Indicators
248248

249-
Signals beginning `smoothed_w` are [adjusted using survey weights
250-
to be demographically representative](#survey-weighting) as described below.
251-
Weighted signals have 1-2 days of lag, so if low latency is paramount,
249+
Signals beginning `smoothed_w` are [adjusted using survey weights to be
250+
demographically representative](#survey-weighting-and-estimation) as described
251+
below. Weighted signals have 1-2 days of lag, so if low latency is paramount,
252252
unweighted signals are also available. These begin `smoothed_`, such as
253253
`smoothed_wearing_mask` instead of `smoothed_wwearing_mask`.
254254

@@ -291,9 +291,9 @@ unweighted signals are also available. These begin `smoothed_`, such as
291291

292292
## Testing Indicators
293293

294-
Signals beginning `smoothed_w` are [adjusted using survey weights
295-
to be demographically representative](#survey-weighting) as described below.
296-
Weighted signals have 1-2 days of lag, so if low latency is paramount,
294+
Signals beginning `smoothed_w` are [adjusted using survey weights to be
295+
demographically representative](#survey-weighting-and-estimation) as described
296+
below. Weighted signals have 1-2 days of lag, so if low latency is paramount,
297297
unweighted signals are also available. These begin `smoothed_`, such as
298298
`smoothed_tested_14d` instead of `smoothed_wtested_14d`.
299299

@@ -311,9 +311,9 @@ September 8, 2020.
311311

312312
## Vaccination Indicators
313313

314-
Signals beginning `smoothed_w` are [adjusted using survey weights
315-
to be demographically representative](#survey-weighting) as described below.
316-
Weighted signals have 1-2 days of lag, so if low latency is paramount,
314+
Signals beginning `smoothed_w` are [adjusted using survey weights to be
315+
demographically representative](#survey-weighting-and-estimation) as described
316+
below. Weighted signals have 1-2 days of lag, so if low latency is paramount,
317317
unweighted signals are also available. These begin `smoothed_`, such as
318318
`smoothed_covid_vaccinated` instead of `smoothed_wcovid_vaccinated`.
319319

@@ -436,9 +436,9 @@ V1 beginning January 6, 2021.
436436

437437
## Mental Health Indicators
438438

439-
Signals beginning `smoothed_w` are [adjusted using survey weights
440-
to be demographically representative](#survey-weighting) as described below.
441-
Weighted signals have 1-2 days of lag, so if low latency is paramount,
439+
Signals beginning `smoothed_w` are [adjusted using survey weights to be
440+
demographically representative](#survey-weighting-and-estimation) as described
441+
below. Weighted signals have 1-2 days of lag, so if low latency is paramount,
442442
unweighted signals are also available. These begin `smoothed_`, such as
443443
`smoothed_anxious_5d` instead of `smoothed_wanxious_5d`.
444444

@@ -463,9 +463,9 @@ include respondents to Wave 4 and later waves, beginning September 8, 2020.
463463
## Belief, Experience, and Information Indicators
464464

465465
Signals beginning `smoothed_w` are [adjusted using survey weights to be
466-
demographically representative](#survey-weighting) as described below. Weighted
467-
signals have 1-2 days of lag, so if low latency is paramount, unweighted signals
468-
are also available. These begin `smoothed_`, such as
466+
demographically representative](#survey-weighting-and-estimation) as described
467+
below. Weighted signals have 1-2 days of lag, so if low latency is paramount,
468+
unweighted signals are also available. These begin `smoothed_`, such as
469469
`smoothed_belief_children_immune` instead of `smoothed_wbelief_children_immune`.
470470

471471
### Beliefs About COVID-19
@@ -536,18 +536,19 @@ When interpreting the signals above, it is important to keep in mind several
536536
limitations of this survey data.
537537

538538
* **Survey population.** People are eligible to participate in the survey if
539-
they are age 18 or older, they are currently located in the USA, and they are an active user of Facebook. The survey
540-
data does not report on children under age 18, and the Facebook adult user
541-
population may differ from the United States population generally in important
542-
ways. We use our [survey weighting](#survey-weighting) to adjust the estimates
543-
to match age and gender demographics by state, but this process doesn't adjust
544-
for other demographic biases we may not be aware of.
539+
they are age 18 or older, they are currently located in the USA, and they are
540+
an active user of Facebook. The survey data does not report on children under
541+
age 18, and the Facebook adult user population may differ from the United
542+
States population generally in important ways. We use our [survey
543+
weighting](#survey-weighting-and-estimation) to adjust the estimates to match
544+
age and gender demographics by state, but this process doesn't adjust for
545+
other demographic biases we may not be aware of.
545546
* **Non-response bias.** The survey is voluntary, and people who accept the
546547
invitation when it is presented to them on Facebook may be different from
547-
those who do not. The [survey weights provided by Facebook](#survey-weighting)
548-
attempt to model the probability of response for each user and hence adjust
549-
for this, but it is difficult to tell if these weights account for all
550-
possible non-response bias.
548+
those who do not. The [survey weights provided by
549+
Facebook](#survey-weighting-and-estimation) attempt to model the probability
550+
of response for each user and hence adjust for this, but it is difficult to
551+
tell if these weights account for all possible non-response bias.
551552
* **Social desirability.** Previous survey research has shown that people's
552553
responses to surveys are often biased by what responses they believe are
553554
socially desirable or acceptable. For example, if there is widespread
@@ -557,10 +558,11 @@ limitations of this survey data.
557558
present.
558559
* **False responses.** As with anything on the Internet, a small percentage of
559560
users give deliberately incorrect responses. We discard a small number of
560-
responses that are obviously false, but do not perform extensive filtering.
561-
However, the large size of the study, and our procedure for ensuring that each
562-
respondent can only be counted once when they are invited to take the survey,
563-
prevents individual respondents from having a large effect on results.
561+
responses that are obviously false, but do **not** perform extensive
562+
filtering. However, the large size of the study, and our procedure for
563+
ensuring that each respondent can only be counted once when they are invited
564+
to take the survey, prevent individual respondents from having a large effect
565+
on results.
564566
* **Repeat invitations.** Individual respondents can be invited by Facebook to
565567
take the survey several times. Usually Facebook only re-invites a respondent
566568
after one month. Hence estimates of values on a single day are calculated
@@ -575,14 +577,30 @@ strongly over time. This means that *changes* in signals, such as increases or
575577
decreases, are likely to represent true changes in the underlying population,
576578
even if point estimates are biased.
577579

580+
### Privacy Restrictions
581+
582+
To protect respondent privacy, we discard any estimate (whether at a county,
583+
MSA, HRR, or state level) that is based on fewer than 100 survey responses. For
584+
signals reported using a 7-day average (those beginning with `smoothed_`), this
585+
means a geographic area must have at least 100 responses in 7 days to be
586+
reported.
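
As a minimal sketch of the threshold described above, the filter below drops any estimate backed by fewer than 100 responses. The record layout (`geo`, `value`, `sample_size`) and the function name are hypothetical, chosen only for illustration; they are not taken from the actual pipeline.

```python
MIN_RESPONSES = 100  # privacy threshold stated in the text above


def suppress_small_samples(estimates, min_responses=MIN_RESPONSES):
    """Keep only estimates based on at least `min_responses` responses.

    `estimates` is a hypothetical list of dicts with the keys
    "geo", "value", and "sample_size".
    """
    return [e for e in estimates if e["sample_size"] >= min_responses]


rows = [
    {"geo": "county_a", "value": 12.5, "sample_size": 340},
    {"geo": "county_b", "value": 9.1, "sample_size": 99},   # below the limit
    {"geo": "county_c", "value": 15.0, "sample_size": 100},  # exactly at the limit
]
kept = suppress_small_samples(rows)
```

Here `county_b` is discarded, while `county_c` is kept because it has exactly 100 responses, which is not "fewer than 100".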
587+
588+
This affects some items more than others. For instance, items about vaccine
589+
hesitancy reasons are only asked of respondents who are unvaccinated and
590+
hesitant, not of all survey respondents. It also affects some geographic areas
591+
more than others, particularly rural areas with low population densities. When
592+
doing analysis of county-level data, one should be aware that missing counties
593+
are typically more rural and less populous than those present in the data, which
594+
may introduce bias into the analysis.
595+
578596
### Declining Response Rate
579597

580598
We have noted a steady decrease in the number of daily survey responses,
581599
beginning no later than January 2021. As the number of survey responses
582600
declines, some indicators will become unavailable once they no longer meet the
583-
[privacy limit for sample size](../../symptom-survey/coding.md#privacy-restrictions).
584-
This affects some signals, such as those based on a subset of responses, more
585-
than others, with finer geographic resolutions becoming unavailable first.
601+
privacy limit for sample size. This affects some signals, such as those based on
602+
a subset of responses, more than others, with finer geographic resolutions
603+
becoming unavailable first.
586604

587605
### Target Region
588606

@@ -595,21 +613,15 @@ live in Puerto Rico or another US territory, we do not include their response
595613
in the aggregations.
596614

597615

598-
## Survey Weighting
599-
600-
Notice that the estimates defined in the previous sections are calculated with
601-
respect to the population of US Facebook users. (To be precise, the ILI and CLI
602-
indicators reflect the population of US Facebook users *and* their household
603-
members). In reality, our estimates are even further skewed by the varying
604-
propensity of people in the population of US Facebook users to take our survey
605-
in the first place.
616+
## Survey Weighting and Estimation
606617

607618
When Facebook sends a user to our survey, it generates a random ID number and
608619
sends this to us as well. Once the user completes the survey, we pass this ID
609-
number back to Facebook to confirm completion, and in return receive a
610-
weight---call it $$w_i$$ for user $$i$$. (The random ID number is completely
611-
meaningless for any other purpose than receiving this weight, and does not allow
612-
us to access any information about the user's Facebook profile.)
620+
number back to Facebook to confirm completion, and in return receive a weight.
621+
(The random ID number is completely meaningless for any other purpose than
622+
receiving this weight, and does not allow us to access any information about the
623+
user's Facebook profile. Nor does it provide Facebook any information about the
624+
survey responses.)
613625

614626
We can use these weights to adjust our estimates so that they are representative
615627
of the US population---adjusting both for the differences between the US
@@ -626,34 +638,40 @@ $$
626638
where $$\pi_i$$ is an estimated probability (produced by Facebook) that an
627639
individual with the same state-by-age-gender profile as user $$i$$ would be a
628640
Facebook user and take our survey. The adjustment we make follows a standard
629-
inverse probability weighting strategy (this being a special case of importance
630-
sampling).
641+
inverse probability weighting strategy.
642+
643+
Detailed documentation on how Facebook calculates these weights is available in
644+
our [survey weight documentation](../../symptom-survey/weights.md).
645+
646+
For unweighted survey signals, we set $$w^\text{part}_i = 1$$ for all
647+
respondents.
648+
649+
### Geographic Weighting and Mixing
631650

632-
Detailed documentation on how Facebook calculates these weights is available on
633-
our [survey weight documentation page](../../symptom-survey/weights.md).
651+
Besides the participation weight $$w^\text{part}_i$$, each survey response
652+
receives a geographical-division weight $$w^{\text{geodiv}}_i$$ describing how
653+
much a participant's ZIP code "belongs" in the spatial unit of interest. For
654+
example, a ZIP code may overlap with multiple counties, so the weight describes
655+
what proportion of the ZIP code's population is in each county.
656+
657+
Each survey's weight is hence $$w^{\text{init}}_i = w^{\text{part}}_i
658+
w^{\text{geodiv}}_i$$. When a ZIP code spans multiple counties or states, a
659+
single survey may have different weights when used to calculate different
660+
geographic aggregates.
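
The product $$w^{\text{init}}_i = w^{\text{part}}_i w^{\text{geodiv}}_i$$ can be illustrated with a small sketch. The function name, the county labels, and the 75%/25% population split are assumptions made up for this example, not values from the real ZIP-to-county mapping.

```python
def initial_weights(w_part, zip_shares):
    """Return w_init = w_part * w_geodiv for one response, per county.

    `zip_shares` is a hypothetical mapping from county to the proportion
    of the respondent's ZIP code population living in that county.
    """
    return {county: w_part * share for county, share in zip_shares.items()}


# One response with participation weight 2.0, from a ZIP code assumed to
# have 75% of its population in county A and 25% in county B.
weights = initial_weights(2.0, {"county_a": 0.75, "county_b": 0.25})
```

The same survey then contributes a weight of 1.5 to the county A aggregate and 0.5 to the county B aggregate, which is why a single response can carry different weights in different geographic aggregates.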
634661

635662
### Adjusting Household ILI and CLI
636663

637-
As before, for a given aggregation unit (for example, daily-county), let $$X_i$$
638-
and $$Y_i$$ denote the numbers of ILI and CLI cases in household $$i$$,
639-
respectively (computed according to the simple strategy above), and let $$N_i$$
640-
denote the total number of people in the household. Let $$i = 1, \dots, m$$
641-
denote the surveys started during the time period of interest and reported in a
642-
ZIP code intersecting the spatial unit of interest.
643-
644-
Each of these surveys is assigned two weights: the participation weight
645-
$$w^{\text{part}}_i$$, and a geographical-division weight
646-
$$w^{\text{geodiv}}_i$$ describing how much a participant's ZIP code "belongs"
647-
in the spatial unit of interest. (For example, a ZIP code may overlap with
648-
multiple counties, so the weight describes what proportion of the ZIP code's
649-
population is in each county.)
650-
651-
Let $$w^{\text{init}}_i=w^{\text{part}}_i w^{\text{geodiv}}_i$$ denote the
652-
initial weight assigned to this survey. First, we adjust these initial weights
653-
to reduce sensitivity to any individual survey by "mixing" them with a uniform
654-
weighting across all relevant surveys. This prevents specific survey respondents
655-
with high survey weights having disproportionate influence on the weighted
656-
estimates.
664+
For a given aggregation unit (for example, daily-county), let $$X_i$$ and
665+
$$Y_i$$ denote the numbers of ILI and CLI cases in household $$i$$, respectively
666+
(computed according to the simple strategy above), and let $$N_i$$ denote the
667+
total number of people in the household. Let $$i = 1, \dots, m$$ denote the
668+
surveys started during the time period of interest and reported in a ZIP code
669+
intersecting the spatial unit of interest.
670+
671+
First, we adjust the initial weights $$w^\text{init}$$ to reduce sensitivity to
672+
any individual survey by "mixing" them with a uniform weighting across all
673+
relevant surveys. This prevents specific survey respondents with high survey
674+
weights from having a disproportionate influence on the weighted estimates.
657675

658676
Specifically, we select the smallest value of $$a \in [0.05, 1]$$ such that
659677

@@ -702,8 +720,15 @@ $$
702720

703721
which are the delta method estimates of variance associated with self-normalized
704722
importance sampling estimators above, after combining with a pseudo-observation
705-
of 1/2 with weight $$\frac{1}{n_e}$$, assigned to appear like a single effective
706-
observation according to importance sampling diagnostics.
723+
of 1/2 with weight $$1/n_e$$, assigned to appear like a single effective
724+
observation. The use of the pseudo-observation prevents standard error estimates
725+
of zero, and in simulations improves the quality of the standard error
726+
estimates. See the [Appendix](#appendix) for further motivation for these
727+
estimators.
728+
729+
The pseudo-observation is not used in $$\hat{p}$$ and $$\hat{q}$$ themselves, to
730+
avoid potentially large amounts of estimation bias, as $$p$$ and $$q$$ are
731+
expected to be small.
707732

708733
The sample size reported is calculated by rounding down $$\sum_{i=1}^{m}
709734
w^{\text{geodiv}}_i$$ before adding the pseudo-observations. When ZIP codes do
@@ -725,38 +750,34 @@ knowing someone in their community who is sick. In this subsection we will
725750
describe how survey weights are used to construct weighted estimates for these
726751
indicators, using community CLI as an example.
727752

728-
As before, in a given aggregation unit (for example, daily-county), let $$U_i$$
729-
and $$V_i$$ denote the indicators that the survey respondent knows someone in
730-
their community with CLI, including and not including their household,
731-
respectively, for survey $$i$$, out of $$m$$ surveys collected. Also let
732-
$$w_i$$ be the self-normalized weight that accompanies survey $$i$$, as
733-
above. Then our initial weighted estimates of $$a$$ and $$b$$ are:
753+
In a given aggregation unit (for example, daily-county), let $$U_i$$ be the
754+
indicator that the survey respondent knows someone in their community with CLI,
755+
including their household, for survey $$i$$, out of $$m$$ surveys collected.
756+
Also let $$w_i$$ be the weight that accompanies survey $$i$$, normalized to sum
757+
to 1 as above. Then our initial weighted estimate of the population proportion
758+
$$a$$ is:
734759

735760
$$
736-
\begin{aligned}
737-
\hat{a}_{w, init} &= 100 \cdot \sum_{i=1}^m w_i U_i \\
738-
\hat{b}_{w, init} &= 100 \cdot \sum_{i=1}^m w_i V_i.
739-
\end{aligned}
761+
\hat{a}_{w, \text{init}} = 100 \cdot \sum_{i=1}^m w_i U_i
740762
$$
741763

742-
After combining with a pseudo-observation, defined as before,
764+
To prevent observations and standard errors from being zero, we add a
765+
pseudo-observation of 1/2 with weight $$1/n_e$$. (This pseudo-observation can be
766+
thought of as equivalent to using a Bayesian estimate of the proportion, with a
767+
Jeffreys prior.) The estimate is hence:
743768

744769
$$
745-
\begin{aligned}
746-
\hat{a}_w &= 100 \cdot \frac{n_e \frac{\hat{a}_{w, init}}{100} + \frac12}{1 + n_e} \\
747-
\hat{b}_w &= 100 \cdot \frac{n_e \frac{\hat{b}_{w, init}}{100} + \frac12}{1 + n_e}.
748-
\end{aligned}
770+
\hat{a}_w = 100 \cdot \frac{n_e \frac{\hat{a}_{w, \text{init}}}{100} + \frac12}{1 + n_e},
749771
$$
750772

751-
with estimated standard errors:
773+
with estimated standard error:
752774

753775
$$
754-
\begin{aligned}
755-
\widehat{\mathrm{se}}(\hat{a}_w) &= 100 \cdot \sqrt{\frac{\frac{\hat{a}_w}{100}(1-\frac{\hat{a}_w}{100})}{1 + n_e}} \\
756-
\widehat{\mathrm{se}}(\hat{b}_w) &= 100 \cdot \sqrt{\frac{\frac{\hat{b}_w}{100}(1-\frac{\hat{b}_w}{100})}{1 + n_e}}.
757-
\end{aligned}
776+
\widehat{\mathrm{se}}(\hat{a}_w) = 100 \cdot \sqrt{\frac{\frac{\hat{a}_w}{100}(1-\frac{\hat{a}_w}{100})}{1 + n_e}}
758777
$$
759778

779+
which is the plug-in estimate of the standard error of the binomial proportion.
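
The three formulas above can be sketched in a few lines of Python. This is an illustrative sketch rather than the production code: the function name is hypothetical, and the effective sample size $$n_e$$, defined earlier in this document, is simply passed in as an argument.

```python
import math


def weighted_proportion(weights, indicators, n_e):
    """Return (a_init, a_w, se), all expressed as percentages.

    weights    -- survey weights, normalized here to sum to 1
    indicators -- 0/1 values U_i, one per survey
    n_e        -- effective sample size, supplied by the caller
    """
    total = sum(weights)
    w = [x / total for x in weights]  # normalize weights to sum to 1

    # Initial weighted estimate: 100 * sum_i w_i U_i
    a_init = 100 * sum(wi * ui for wi, ui in zip(w, indicators))

    # Combine with a pseudo-observation of 1/2 carrying weight 1/n_e
    a_w = 100 * (n_e * a_init / 100 + 0.5) / (1 + n_e)

    # Plug-in binomial standard error with 1 + n_e observations
    p = a_w / 100
    se = 100 * math.sqrt(p * (1 - p) / (1 + n_e))
    return a_init, a_w, se


# Four equally weighted surveys, one positive response, n_e assumed to be 4.
a_init, a_w, se = weighted_proportion([1.0, 1.0, 1.0, 1.0], [1, 0, 0, 0], n_e=4.0)

# With no positive responses the initial estimate is zero, but the adjusted
# estimate and its standard error are not.
zero_case = weighted_proportion([1.0, 1.0, 1.0, 1.0], [0, 0, 0, 0], n_e=4.0)
```

In the first call the initial estimate of 25 is pulled toward 50, giving an adjusted estimate of 30. The second call shows the purpose of the pseudo-observation: neither the estimate nor the standard error collapses to zero.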
780+
760781

761782
## Appendix
762783
