@@ -246,9 +246,9 @@ data in the estimation procedures described above.
246
246
247
247
## Behavior Indicators
248
248
249
- Signals beginning ` smoothed_w ` are [ adjusted using survey weights
250
- to be demographically representative] ( #survey-weighting ) as described below.
251
- Weighted signals have 1-2 days of lag, so if low latency is paramount,
249
+ Signals beginning ` smoothed_w ` are [ adjusted using survey weights to be
250
+ demographically representative] ( #survey-weighting-and-estimation ) as described
251
+ below. Weighted signals have 1-2 days of lag, so if low latency is paramount,
252
252
unweighted signals are also available. These begin ` smoothed_ ` , such as
253
253
` smoothed_wearing_mask ` instead of ` smoothed_wwearing_mask ` .
254
254
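
The naming convention described above can be sketched in a few lines. This is only an illustration: the helper name `weighted_name` is hypothetical and not part of any API. It builds the weighted name from the unweighted one, since an unweighted name such as `smoothed_wearing_mask` also happens to begin with `smoothed_w`, so the prefix alone does not reliably identify weighted signals.

```python
UNWEIGHTED_PREFIX = "smoothed_"


def weighted_name(unweighted: str) -> str:
    """Return the weighted counterpart of an unweighted signal name.

    Hypothetical helper: inserts "w" right after the "smoothed_" prefix.
    """
    if not unweighted.startswith(UNWEIGHTED_PREFIX):
        raise ValueError(f"not a smoothed signal name: {unweighted!r}")
    return UNWEIGHTED_PREFIX + "w" + unweighted[len(UNWEIGHTED_PREFIX):]


mask = weighted_name("smoothed_wearing_mask")
tested = weighted_name("smoothed_tested_14d")
```

Going in the other direction (weighted to unweighted) requires knowing in advance that the name is a weighted one, for the reason given above.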
@@ -291,9 +291,9 @@ unweighted signals are also available. These begin `smoothed_`, such as
291
291
292
292
## Testing Indicators
293
293
294
- Signals beginning ` smoothed_w ` are [ adjusted using survey weights
295
- to be demographically representative] ( #survey-weighting ) as described below.
296
- Weighted signals have 1-2 days of lag, so if low latency is paramount,
294
+ Signals beginning ` smoothed_w ` are [ adjusted using survey weights to be
295
+ demographically representative] ( #survey-weighting-and-estimation ) as described
296
+ below. Weighted signals have 1-2 days of lag, so if low latency is paramount,
297
297
unweighted signals are also available. These begin ` smoothed_ ` , such as
298
298
` smoothed_tested_14d ` instead of ` smoothed_wtested_14d ` .
299
299
@@ -311,9 +311,9 @@ September 8, 2020.
311
311
312
312
## Vaccination Indicators
313
313
314
- Signals beginning ` smoothed_w ` are [ adjusted using survey weights
315
- to be demographically representative] ( #survey-weighting ) as described below.
316
- Weighted signals have 1-2 days of lag, so if low latency is paramount,
314
+ Signals beginning ` smoothed_w ` are [ adjusted using survey weights to be
315
+ demographically representative] ( #survey-weighting-and-estimation ) as described
316
+ below. Weighted signals have 1-2 days of lag, so if low latency is paramount,
317
317
unweighted signals are also available. These begin ` smoothed_ ` , such as
318
318
` smoothed_covid_vaccinated ` instead of ` smoothed_wcovid_vaccinated ` .
319
319
@@ -436,9 +436,9 @@ V1 beginning January 6, 2021.
436
436
437
437
## Mental Health Indicators
438
438
439
- Signals beginning ` smoothed_w ` are [ adjusted using survey weights
440
- to be demographically representative] ( #survey-weighting ) as described below.
441
- Weighted signals have 1-2 days of lag, so if low latency is paramount,
439
+ Signals beginning ` smoothed_w ` are [ adjusted using survey weights to be
440
+ demographically representative] ( #survey-weighting-and-estimation ) as described
441
+ below. Weighted signals have 1-2 days of lag, so if low latency is paramount,
442
442
unweighted signals are also available. These begin ` smoothed_ ` , such as
443
443
` smoothed_anxious_5d ` instead of ` smoothed_wanxious_5d ` .
444
444
@@ -463,9 +463,9 @@ include respondents to Wave 4 and later waves, beginning September 8, 2020.
463
463
## Belief, Experience, and Information Indicators
464
464
465
465
Signals beginning ` smoothed_w ` are [ adjusted using survey weights to be
466
- demographically representative] ( #survey-weighting ) as described below. Weighted
467
- signals have 1-2 days of lag, so if low latency is paramount, unweighted signals
468
- are also available. These begin ` smoothed_ ` , such as
466
+ demographically representative] ( #survey-weighting-and-estimation ) as described
467
+ below. Weighted signals have 1-2 days of lag, so if low latency is paramount,
468
+ unweighted signals are also available. These begin ` smoothed_ ` , such as
469
469
` smoothed_belief_children_immune ` instead of ` smoothed_wbelief_children_immune ` .
470
470
471
471
### Beliefs About COVID-19
@@ -536,18 +536,19 @@ When interpreting the signals above, it is important to keep in mind several
536
536
limitations of this survey data.
537
537
538
538
* ** Survey population.** People are eligible to participate in the survey if
539
- they are age 18 or older, they are currently located in the USA, and they are an active user of Facebook. The survey
540
- data does not report on children under age 18, and the Facebook adult user
541
- population may differ from the United States population generally in important
542
- ways. We use our [ survey weighting] ( #survey-weighting ) to adjust the estimates
543
- to match age and gender demographics by state, but this process doesn't adjust
544
- for other demographic biases we may not be aware of.
539
+ they are age 18 or older, they are currently located in the USA, and they are
540
+ an active user of Facebook. The survey data does not report on children under
541
+ age 18, and the Facebook adult user population may differ from the United
542
+ States population generally in important ways. We use our [ survey
543
+ weighting] ( #survey-weighting-and-estimation ) to adjust the estimates to match
544
+ age and gender demographics by state, but this process doesn't adjust for
545
+ other demographic biases we may not be aware of.
545
546
* ** Non-response bias.** The survey is voluntary, and people who accept the
546
547
invitation when it is presented to them on Facebook may be different from
547
- those who do not. The [ survey weights provided by Facebook ] ( #survey-weighting )
548
- attempt to model the probability of response for each user and hence adjust
549
- for this, but it is difficult to tell if these weights account for all
550
- possible non-response bias.
548
+ those who do not. The [ survey weights provided by
549
+ Facebook ] ( #survey-weighting-and-estimation ) attempt to model the probability
550
+ of response for each user and hence adjust for this, but it is difficult to
551
+ tell if these weights account for all possible non-response bias.
551
552
* ** Social desirability.** Previous survey research has shown that people's
552
553
responses to surveys are often biased by what responses they believe are
553
554
socially desirable or acceptable. For example, if there is widespread
@@ -557,10 +558,11 @@ limitations of this survey data.
557
558
present.
558
559
* ** False responses.** As with anything on the Internet, a small percentage of
559
560
users give deliberately incorrect responses. We discard a small number of
560
- responses that are obviously false, but do not perform extensive filtering.
561
- However, the large size of the study, and our procedure for ensuring that each
562
- respondent can only be counted once when they are invited to take the survey,
563
- prevents individual respondents from having a large effect on results.
561
+ responses that are obviously false, but do ** not** perform extensive
562
+ filtering. However, the large size of the study, and our procedure for
563
+ ensuring that each respondent can only be counted once when they are invited
564
+ to take the survey, prevent individual respondents from having a large effect
565
+ on results.
564
566
* ** Repeat invitations.** Individual respondents can be invited by Facebook to
565
567
take the survey several times. Usually Facebook only re-invites a respondent
566
568
after one month. Hence estimates of values on a single day are calculated
@@ -575,14 +577,30 @@ strongly over time. This means that *changes* in signals, such as increases or
575
577
decreases, are likely to represent true changes in the underlying population,
576
578
even if point estimates are biased.
577
579
580
+ ### Privacy Restrictions
581
+
582
+ To protect respondent privacy, we discard any estimate (whether at a county,
583
+ MSA, HRR, or state level) that is based on fewer than 100 survey responses. For
584
+ signals reported using a 7-day average (those beginning with ` smoothed_ ` ), this
585
+ means a geographic area must have at least 100 responses in 7 days to be
586
+ reported.
587
+
588
+ This affects some items more than others. For instance, items about vaccine
589
+ hesitancy reasons are only asked of respondents who are unvaccinated and
590
+ hesitant, not of all survey respondents. It also affects some geographic areas
591
+ more than others, particularly rural areas with low population densities. When
592
+ doing analysis of county-level data, one should be aware that missing counties
593
+ are typically more rural and less populous than those present in the data, which
594
+ may introduce bias into the analysis.
595
+
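
As a rough sketch of the privacy rule described above, the filter below drops any estimate based on fewer than 100 responses. The record layout (`geo`, `value`, `sample_size`) and the toy values are assumptions made for illustration, not the pipeline's actual data format.

```python
MIN_RESPONSES = 100  # threshold stated in the text


def apply_privacy_filter(estimates):
    """Keep only estimates backed by at least MIN_RESPONSES responses."""
    return [e for e in estimates if e["sample_size"] >= MIN_RESPONSES]


# Toy records with hypothetical field names and made-up values.
estimates = [
    {"geo": "county_a", "value": 12.3, "sample_size": 250},
    {"geo": "county_b", "value": 40.0, "sample_size": 37},
    {"geo": "county_c", "value": 8.1, "sample_size": 100},
]
kept = apply_privacy_filter(estimates)
kept_geos = [e["geo"] for e in kept]
```

For a 7-day averaged signal, `sample_size` would be the number of responses pooled over the 7 days.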
578
596
### Declining Response Rate
579
597
580
598
We have noted a steady decrease in the number of daily survey responses,
581
599
beginning no later than January 2021. As the number of survey responses
582
600
declines, some indicators will become unavailable once they no longer meet the
583
- [ privacy limit for sample size] ( ../../symptom-survey/coding.md#privacy-restrictions ) .
584
- This affects some signals, such as those based on a subset of responses, more
585
- than others, with finer geographic resolutions becoming unavailable first.
601
+ privacy limit for sample size. This affects some signals, such as those based on
602
+ a subset of responses, more than others, with finer geographic resolutions
603
+ becoming unavailable first.
586
604
587
605
### Target Region
588
606
@@ -595,21 +613,15 @@ live in Puerto Rico or another US territory, we do not include their response
595
613
in the aggregations.
596
614
597
615
598
- ## Survey Weighting
599
-
600
- Notice that the estimates defined in the previous sections are calculated with
601
- respect to the population of US Facebook users. (To be precise, the ILI and CLI
602
- indicators reflect the population of US Facebook users * and* their household
603
- members). In reality, our estimates are even further skewed by the varying
604
- propensity of people in the population of US Facebook users to take our survey
605
- in the first place.
616
+ ## Survey Weighting and Estimation
606
617
607
618
When Facebook sends a user to our survey, it generates a random ID number and
608
619
sends this to us as well. Once the user completes the survey, we pass this ID
609
- number back to Facebook to confirm completion, and in return receive a
610
- weight---call it $$ w_i $$ for user $$ i $$ . (The random ID number is completely
611
- meaningless for any other purpose than receiving this weight, and does not allow
612
- us to access any information about the user's Facebook profile.)
620
+ number back to Facebook to confirm completion, and in return receive a weight.
621
+ (The random ID number is completely meaningless for any other purpose than
622
+ receiving this weight, and does not allow us to access any information about the
623
+ user's Facebook profile. Nor does it provide Facebook any information about the
624
+ survey responses.)
613
625
614
626
We can use these weights to adjust our estimates so that they are representative
615
627
of the US population---adjusting both for the differences between the US
626
638
where $$ \pi_i $$ is an estimated probability (produced by Facebook) that an
627
639
individual with the same state-by-age-gender profile as user $$ i $$ would be a
628
640
Facebook user and take our survey. The adjustment we make follows a standard
629
- inverse probability weighting strategy (this being a special case of importance
630
- sampling).
641
+ inverse probability weighting strategy.
642
+
643
+ Detailed documentation on how Facebook calculates these weights is available in
644
+ our [ survey weight documentation] ( ../../symptom-survey/weights.md ) .
645
+
646
+ For unweighted survey signals, we set $$ w^{\text{part}}_i = 1 $$ for all
647
+ respondents.
648
+
649
+ ### Geographic Weighting and Mixing
631
650
632
- Detailed documentation on how Facebook calculates these weights is available on
633
- our [ survey weight documentation page] ( ../../symptom-survey/weights.md ) .
651
+ Besides the participation weight $$ w^{\text{part}}_i $$ , each survey response
652
+ receives a geographical-division weight $$ w^{\text{geodiv}}_i $$ describing how
653
+ much a participant's ZIP code "belongs" in the spatial unit of interest. For
654
+ example, a ZIP code may overlap with multiple counties, so the weight describes
655
+ what proportion of the ZIP code's population is in each county.
656
+
657
+ Each survey's weight is hence $$w^{\text{init}}_ i = w^{\text{part}}_ i
658
+ w^{\text{geodiv}}_ i$$. When a ZIP code spans multiple counties or states, a
659
+ single survey may have different weights when used to calculate different
660
+ geographic aggregates.
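
To make the product of the two weights concrete, here is a minimal sketch. The function name, the county labels, and the 75/25 population split are illustrative assumptions, not values from the survey.

```python
def initial_weights(w_part, geodiv_by_area):
    """Compute w_init = w_part * w_geodiv for each area a ZIP code overlaps.

    geodiv_by_area maps an area label to the share of the ZIP code's
    population that lies in that area.
    """
    return {area: w_part * share for area, share in geodiv_by_area.items()}


# One response with participation weight 2.0, from a ZIP code assumed to
# have 75% of its population in county_a and 25% in county_b.
w_init = initial_weights(2.0, {"county_a": 0.75, "county_b": 0.25})
```

The same response thus carries a different initial weight in each county's aggregate, as the text describes.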
634
661
635
662
### Adjusting Household ILI and CLI
636
663
637
- As before, for a given aggregation unit (for example, daily-county), let $$ X_i $$
638
- and $$ Y_i $$ denote the numbers of ILI and CLI cases in household $$ i $$ ,
639
- respectively (computed according to the simple strategy above), and let $$ N_i $$
640
- denote the total number of people in the household. Let $$ i = 1, \dots, m $$
641
- denote the surveys started during the time period of interest and reported in a
642
- ZIP code intersecting the spatial unit of interest.
643
-
644
- Each of these surveys is assigned two weights: the participation weight
645
- $$ w^{\text{part}}_i $$ , and a geographical-division weight
646
- $$ w^{\text{geodiv}}_i $$ describing how much a participant's ZIP code "belongs"
647
- in the spatial unit of interest. (For example, a ZIP code may overlap with
648
- multiple counties, so the weight describes what proportion of the ZIP code's
649
- population is in each county.)
650
-
651
- Let $$ w^{\text{init}}_i=w^{\text{part}}_i w^{\text{geodiv}}_i $$ denote the
652
- initial weight assigned to this survey. First, we adjust these initial weights
653
- to reduce sensitivity to any individual survey by "mixing" them with a uniform
654
- weighting across all relevant surveys. This prevents specific survey respondents
655
- with high survey weights having disproportionate influence on the weighted
656
- estimates.
664
+ For a given aggregation unit (for example, daily-county), let $$ X_i $$ and
665
+ $$ Y_i $$ denote the numbers of ILI and CLI cases in household $$ i $$ , respectively
666
+ (computed according to the simple strategy above), and let $$ N_i $$ denote the
667
+ total number of people in the household. Let $$ i = 1, \dots, m $$ denote the
668
+ surveys started during the time period of interest and reported in a ZIP code
669
+ intersecting the spatial unit of interest.
670
+
671
+ First, we adjust the initial weights $$ w^{\text{init}}_i $$ to reduce sensitivity to
672
+ any individual survey by "mixing" them with a uniform weighting across all
673
+ relevant surveys. This prevents specific survey respondents with high survey
674
+ weights having disproportionate influence on the weighted estimates.
657
675
658
676
Specifically, we select the smallest value of $$ a \in [0.05, 1] $$ such that
659
677
702
720
703
721
which are the delta method estimates of variance associated with self-normalized
704
722
importance sampling estimators above, after combining with a pseudo-observation
705
- of 1/2 with weight $$ \frac{1}{n_e} $$ , assigned to appear like a single effective
706
- observation according to importance sampling diagnostics.
723
+ of 1/2 with weight $$ 1/n_e $$ , assigned to appear like a single effective
724
+ observation. The use of the pseudo-observation prevents standard error estimates
725
+ of zero, and in simulations improves the quality of the standard error
726
+ estimates. See the [ Appendix] ( #appendix ) for further motivation for these
727
+ estimators.
728
+
729
+ The pseudo-observation is not used in $$ \hat{p} $$ and $$ \hat{q} $$ themselves, to
730
+ avoid potentially large amounts of estimation bias, as $$ p $$ and $$ q $$ are
731
+ expected to be small.
707
732
708
733
The sample size reported is calculated by rounding down $$\sum_ {i=1}^{m}
709
734
w^{\text{geodiv}}_ i$$ before adding the pseudo-observations. When ZIP codes do
@@ -725,38 +750,34 @@ knowing someone in their community who is sick. In this subsection we will
725
750
describe how survey weights are used to construct weighted estimates for these
726
751
indicators, using community CLI as an example.
727
752
728
- As before, in a given aggregation unit (for example, daily-county), let $$ U_i $$
729
- and $$ V_i $$ denote the indicators that the survey respondent knows someone in
730
- their community with CLI, including and not including their household,
731
- respectively, for survey $$ i $$ , out of $$ m $$ surveys collected. Also let
732
- $$ w_i $$ be the self-normalized weight that accompanies survey $$ i $$ , as
733
- above. Then our initial weighted estimates of $$ a $$ and $$ b $$ are :
753
+ In a given aggregation unit (for example, daily-county), let $$ U_i $$ be the
754
+ indicator that the survey respondent knows someone in their community with CLI,
755
+ including their household, for survey $$ i $$ , out of $$ m $$ surveys collected.
756
+ Also let $$ w_i $$ be the weight that accompanies survey $$ i $$ , normalized to sum
757
+ to 1 as above. Then our initial weighted estimate of the population proportion
758
+ $$ a $$ is :
734
759
735
760
$$
736
- \begin{aligned}
737
- \hat{a}_{w, init} &= 100 \cdot \sum_{i=1}^m w_i U_i \\
738
- \hat{b}_{w, init} &= 100 \cdot \sum_{i=1}^m w_i V_i.
739
- \end{aligned}
761
+ \hat{a}_{w, \text{init}} = 100 \cdot \sum_{i=1}^m w_i U_i
740
762
$$
741
763
742
- After combining with a pseudo-observation, defined as before,
764
+ To prevent observations and standard errors from being zero, we add a
765
+ pseudo-observation of 1/2 with weight $$ 1/n_e $$ . (This pseudo-observation can be
766
+ thought of as equivalent to using a Bayesian estimate of the proportion, with a
767
+ Jeffreys prior.) The estimate is hence:
743
768
744
769
$$
745
- \begin{aligned}
746
- \hat{a}_w &= 100 \cdot \frac{n_e \frac{\hat{a}_{w, init}}{100} + \frac12}{1 + n_e} \\
747
- \hat{b}_w &= 100 \cdot \frac{n_e \frac{\hat{b}_{w, init}}{100} + \frac12}{1 + n_e}.
748
- \end{aligned}
770
+ \hat{a}_w = 100 \cdot \frac{n_e \frac{\hat{a}_{w, \text{init}}}{100} + \frac12}{1 + n_e},
749
771
$$
750
772
751
- with estimated standard errors :
773
+ with estimated standard error :
752
774
753
775
$$
754
- \begin{aligned}
755
- \widehat{\mathrm{se}}(\hat{a}_w) &= 100 \cdot \sqrt{\frac{\frac{\hat{a}_w}{100}(1-\frac{\hat{a}_w}{100})}{1 + n_e}} \\
756
- \widehat{\mathrm{se}}(\hat{b}_w) &= 100 \cdot \sqrt{\frac{\frac{\hat{b}_w}{100}(1-\frac{\hat{b}_w}{100})}{1 + n_e}}.
757
- \end{aligned}
776
+ \widehat{\mathrm{se}}(\hat{a}_w) = 100 \cdot \sqrt{\frac{\frac{\hat{a}_w}{100}(1-\frac{\hat{a}_w}{100})}{1 + n_e}}
758
777
$$
759
778
779
+ which is the plug-in estimate of the standard error of the binomial proportion.
780
+
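
The three formulas above can be traced numerically with a short sketch. It assumes the effective sample size $$ n_e $$ is supplied by the caller, since its definition is not part of this hunk. The function name and the toy inputs are hypothetical.

```python
import math


def community_estimate(weights, indicators, n_e):
    """Weighted percentage with a pseudo-observation of 1/2, plus its
    plug-in binomial standard error. n_e is taken as given."""
    total = sum(weights)
    normalized = [w / total for w in weights]  # weights sum to 1
    a_init = 100 * sum(w * u for w, u in zip(normalized, indicators))
    a_w = 100 * (n_e * a_init / 100 + 0.5) / (1 + n_e)
    p = a_w / 100
    se = 100 * math.sqrt(p * (1 - p) / (1 + n_e))
    return a_w, se


# Four equally weighted respondents, none reporting a sick contact.
a_w, se = community_estimate([1.0] * 4, [0, 0, 0, 0], n_e=4)
```

With every indicator equal to zero, the initial estimate is 0, but the pseudo-observation moves the reported estimate to 10 and keeps the standard error strictly positive, which is the behavior the text motivates.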
760
781
761
782
## Appendix
762
783