```{r include = FALSE}
library(sf)
library(dplyr)      # %>% and mutate(), used in the validation chunks below
library(ggplot2)
library(patchwork)  # plot_layout(), used to combine the map/histogram figures
library(broom)
library(here)
library(knitr)
knitr::opts_chunk$set(echo = FALSE)
options(scipen = 999)
sf_precincts <- st_read(dsn = here("data", "sf_precincts"), stringsAsFactors = FALSE)
colnames(sf_precincts) <- c('precinct', 'over_count', 'no_over_count', 'under_count',
'no_under_count', 'PREC_2017', 'population', 'female',
'pop_18_24', 'pop_25_44', 'pop_45_64', 'pop_65_plus',
'hispanic', 'white', 'black', 'native', 'asian', 'pac_islander',
'other_race', 'no_hs', 'college', 'poverty', 'no_english',
'weight', 'overvote_rate', 'undervote_rate', 'turnout',
'no_turnout', 'turnout_rate', 'geometry')
# this just undoes the var name shortening in the shapefile format
linear_turnout <- lm(log(turnout_rate) ~ pop_18_24
+ pop_45_64
+ no_hs
+ college
+ white,
data = sf_precincts)
linear_over <- lm(overvote_rate ~ black,
data = sf_precincts)
linear_under <- lm(undervote_rate ~ black
+ pop_18_24
+ pop_25_44
+ pop_45_64
+ college
+ pac_islander
+ white
+ no_hs,
data = sf_precincts)
logit_turnout <- glm(cbind(turnout, no_turnout) ~ college
+ no_hs
+ pop_18_24
+ pop_45_64
+ poverty
+ asian
+ pac_islander
+ pop_65_plus
+ female
+ native
+ no_english
+ hispanic
+ white
+ black
+ pop_25_44
+ other_race,
data = sf_precincts, family = 'binomial')
logit_over <- glm(cbind(over_count, no_over_count) ~ black,
data = sf_precincts, family = 'binomial')
logit_under <- glm(cbind(under_count, no_under_count) ~ black
+ college
+ pop_18_24
+ pop_25_44
+ pop_45_64
+ white
+ asian
+ hispanic
+ no_hs
+ pac_islander,
data = sf_precincts, family = 'binomial')
```
# Demographic analysis results {#demo-results}
For the following models, I use variables^[These variables exhibit some collinearity, but were included anyway in an attempt to "cover all our bases" with regard to the demographic variables. Further research could better refine the choice of variables.] taken from the US Census 2018 Planning Database^[All descriptions are taken from the Planning Database as well.], estimated for the election precincts through the areal interpolation method^[We removed from the dataset one outlier with a turnout rate of 124%, Precinct 7024. We believe this is a particularly egregious inaccuracy in the interpolation process for estimating population. The gaps in the maps in this chapter are Golden Gate Park (to the northwest), Crocker-Amazon Playground and John McLaren Park (to the south), and the removed Precinct 7024 (to the southeast).]. All data are self-reported through the Census process^[Regarding the `female` variable: the Census specifically asks about binary sex; there are currently no questions about gender identity. We thus refer to this as "percentage female" later on to avoid drawing undue conclusions from this distinction.]. A description of these variables is in the [Appendix](#appendix).
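As a rough illustration of that interpolation step, the sketch below shows how area-weighted interpolation can be done with the `sf` package. The tract-level object `census_tracts` and its column are hypothetical placeholders and the chunk is not evaluated, so this is only a sketch of the technique, not the exact code used to build the dataset.
```{r eval = FALSE}
# Area-weighted areal interpolation: counts from a (hypothetical) tract-level
# layer are apportioned to precincts in proportion to the overlapping area.
# extensive = TRUE treats the variable as a count rather than a density.
precinct_pop <- st_interpolate_aw(census_tracts["population"],
                                  sf_precincts, extensive = TRUE)
```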
## Voter turnout
The turnout across the city is displayed in Figure \@ref(fig:turnout-map). This election saw low turnout (below 50% in most precincts), and the distribution is right-skewed - that is, a handful of precincts had notably high turnout rates. There is no discernible spatial pattern to the turnout rate, other than perhaps slightly higher turnout in the central and east-central parts of the city.
```{r turnout-map, fig.cap='Observed turnout rate by precinct', out.width = '6in', message=FALSE}
options(scipen=999)
turnout_map <- ggplot(sf_precincts) + geom_sf(aes(fill = turnout_rate), lwd = 0) +
theme_void() + labs(fill = 'Turnout rate')
turnout_hist <- ggplot(sf_precincts, aes(x = turnout_rate)) + geom_histogram() +
theme_bw() + labs(x = 'Turnout rate', y = 'Count')
turnout_map + turnout_hist + plot_layout(ncol = 2, widths = c(4,2))
```
### Linear
With a BIC value of $`r round(BIC(linear_turnout), 1)`$, the best linear model (Table \@ref(tab:linear-turnout-model)) we found for predicting turnout^[This model actually predicts the natural log of turnout, because a model of the raw turnout rate showed heteroskedastic residuals, indicating that a plain linear model was not an accurate descriptor of this phenomenon.] included variables for age, education, and race. Having more young people (ages 18-24) and more people without high school degrees was negatively correlated with turnout, while having more middle-aged, college-educated, and White residents was positively correlated with turnout. This is consistent with the general literature on voter turnout - racial and ethnic minorities vote less, while older and better-educated people vote more.
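The variable search itself is not shown here. As a hedged sketch, one way such a BIC-guided specification could be found is backward selection with `step()` using the BIC penalty; this illustrates the idea of choosing a model by BIC, and is not necessarily the exact procedure used.
```{r eval = FALSE}
# Backward selection from a model with all candidate demographic predictors,
# using k = log(n) so that step() minimizes BIC rather than AIC.
full_turnout <- lm(log(turnout_rate) ~ female + pop_18_24 + pop_25_44 +
                     pop_45_64 + pop_65_plus + hispanic + white + black +
                     native + asian + pac_islander + other_race + no_hs +
                     college + poverty + no_english,
                   data = sf_precincts)
bic_turnout <- step(full_turnout, direction = 'backward',
                    k = log(nrow(sf_precincts)), trace = 0)
```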
```{r linear-turnout-model, fig.cap="Linear turnout model validation", message=FALSE, out.width='6in'}
kable(tidy(linear_turnout) %>% mutate(term = stringr::str_remove_all(term, '\\(|\\)')),
caption = "Linear model for RCV voter turnout (log)",
caption.short = "Linear turnout model",
booktabs = TRUE)
a <- ggplot(linear_turnout, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
theme_bw() +
xlab("Fitted values") +
ylab("Residuals")
b <- ggplot(linear_turnout, aes(x = .resid)) +
geom_histogram() +
theme_bw() +
xlab("Residuals")
# Residuals are roughly normal with a slight left skew and one very low outlier
a + b + plot_layout(ncol = 2)
```
Figure \@ref(fig:linear-turnout-model) details the residuals from the model. These are normally distributed, with no systematic change in variance. There is one particularly low outlier, but on the whole the residuals are distributed evenly. Unfortunately, the residuals here are quite large - since we're predicting turnout rates, a "valid" residual^[That is, a residual where the predicted value is indeed a rate between 0 and 1.] should only range between -1 and 1. This indicates that this model is not particularly well fitted to the data.
### Logistic
With a BIC value of $`r round(BIC(logit_turnout), 1)`$, the best logistic model (Table \@ref(tab:logit-turnout-model)) we found for predicting turnout^[Now, we are predicting the raw, un-logged turnout scores again.] included all of our variables. Education and age were positively correlated with turnout, while poverty, percentage female, and non-English-speaking population were negatively correlated. All of the racial groups were correlated negatively with turnout, and there is no distinct pattern among minority groups versus White voters.
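Since this is a binomial GLM, the coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to read. This is a small illustrative step, not part of the fitted model itself.
```{r eval = FALSE}
# Odds ratios for the logistic turnout model: values above 1 mean higher odds
# of turning out as the predictor increases, values below 1 mean lower odds.
exp(coef(logit_turnout))
```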
```{r logit-turnout-model, fig.cap="Logistic turnout model validation", message=FALSE, out.width='6in'}
kable(tidy(logit_turnout) %>% mutate(term = stringr::str_remove_all(term, '\\(|\\)')),
caption = "Logistic model for RCV voter turnout",
caption.short = "Logit turnout model",
booktabs = TRUE)
turnout_df <- data.frame(fitted = predict(logit_turnout, sf_precincts, type = 'response'),
actual = sf_precincts$turnout_rate) %>%
mutate(resid = fitted - actual)
a <- ggplot(turnout_df, aes(x = fitted, y = resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
theme_bw() +
xlab("Predicted turnout rates") +
ylab("Residuals")
b <- ggplot(turnout_df, aes(x = resid)) +
geom_histogram() +
theme_bw() +
xlab("Residuals") +
ylab("Count")
a + b + plot_layout(ncol = 2)
```
Figure \@ref(fig:logit-turnout-model) details the residuals of our predictions^[This is something of an abuse of the linear-model validation plots. Since we're using the logit link function to predict a turnout rate (rather than an individual's binary turnout measure), this was the best visualization I could come up with to express the "accuracy" of the logistic method. These plots do not address equivalent assumptions the way they do in the linear case.]. These are normally distributed (with a slight left skew), and are notably much smaller overall than those from the linear method^[They are bound to $[-1,1]$ because the logistic method outputs a probability, but are still much smaller aside from this limitation.]. The middle 95% of residuals for the linear model lie between `r round(quantile(linear_turnout$residuals, .025), 3)` and `r round(quantile(linear_turnout$residuals, .975), 3)`, significantly wider than the `r round(quantile(turnout_df$resid, .025), 3)` to `r round(quantile(turnout_df$resid, .975), 3)` from the logistic method. We see some heteroskedasticity, as the residuals are much smaller at the high and low ends of the predicted values. While a better version of this model may exist, the smaller residuals alone indicate that this is a much better explanation of turnout than the linear model.
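As an additional informal check (not part of the original analysis), the two models can also be compared by the error of their predicted rates on the response scale; the linear model's predictions are exponentiated back from the log scale before comparison.
```{r eval = FALSE}
# Root mean squared error of predicted turnout rates for each model.
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2, na.rm = TRUE))
rmse(exp(predict(linear_turnout, sf_precincts)), sf_precincts$turnout_rate)
rmse(predict(logit_turnout, sf_precincts, type = 'response'),
     sf_precincts$turnout_rate)
```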
## Overvoting
The rate of overvoting across the city is displayed in Figure \@ref(fig:overvote-map). Overvote rates are usually very small fractions, and here we see that no precinct had an overvote rate higher than 2%. Many, in fact, had no overvotes at all. There is no discernible spatial pattern to the overvote rate.
```{r overvote-map, fig.cap='Observed overvote rate by precinct', out.width = '6in', message=FALSE}
overvote_map <- ggplot(sf_precincts) + geom_sf(aes(fill = overvote_rate), lwd = 0) +
theme_void() + labs(fill = 'Overvote rate')
overvote_hist <- ggplot(sf_precincts, aes(x = overvote_rate)) + geom_histogram() +
theme_bw() + labs(x = 'Overvote rate', y = 'Count')
overvote_map + overvote_hist + plot_layout(ncol = 2, widths = c(4,2))
```
### Linear
With a BIC value of $`r round(BIC(linear_over), 1)`$, the best linear model (Table \@ref(tab:linear-overvote-model)) we found for predicting overvote rates included only the variable `black`. Having more African-Americans in a precinct was positively correlated with overvoting.
```{r linear-overvote-model, fig.cap="Linear overvote model validation", message=FALSE, out.width='6in'}
kable(tidy(linear_over) %>% mutate(term = stringr::str_remove_all(term, '\\(|\\)')),
caption = "Linear model for overvote rates",
caption.short = "Linear overvote model",
booktabs = TRUE)
a <- ggplot(linear_over, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
theme_bw() +
xlab("Fitted values") +
ylab("Residuals")
b <- ggplot(linear_over, aes(x = .resid)) +
geom_histogram() +
theme_bw() +
xlab("Residuals")
# Residuals here are strongly non-normal: the many zero-overvote precincts
# produce spikes at the negative end of the histogram
a + b + plot_layout(ncol = 2)
```
Figure \@ref(fig:linear-overvote-model) details the residuals from the model. Here we see that the linear model is not a great fit for this data. The combination of low overvoting rates and a lower bound at 0 means that we see significant heteroskedasticity in the residuals. They are skewed in an unusual way due to the prevalence of so many 0s: the "normal mass" of residuals shows a right skew, but the zero-overvote precincts cause large spikes on the negative end. Interestingly, there also seems to be a line of residuals in the upper section of the scatter plot, indicating that there is a set of precincts with roughly the same high overvote rate that were predicted to have much lower rates. All of this indicates that the model is poorly fitted to the data.
### Logistic
With a BIC value of $`r round(BIC(logit_over), 1)`$, the best logistic model (Table \@ref(tab:logit-overvote-model)) we found for predicting overvote rates also included only the variable `black` (again positively correlated with overvoting).
```{r logit-overvote-model, fig.cap="Logistic overvote model validation", message=FALSE, out.width='6in'}
kable(tidy(logit_over) %>% mutate(term = stringr::str_remove_all(term, '\\(|\\)')),
caption = "Logistic model for overvote rates",
caption.short = "Logit overvote model",
booktabs = TRUE)
over_df <- data.frame(fitted = predict(logit_over, sf_precincts, type = 'response'),
actual = sf_precincts$overvote_rate) %>%
mutate(resid = fitted - actual)
a <- ggplot(over_df, aes(x = fitted, y = resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
theme_bw() +
xlab("Predicted overvote rates") +
ylab("Residuals")
b <- ggplot(over_df, aes(x = resid)) +
geom_histogram() +
theme_bw() +
xlab("Residuals") +
ylab("Count")
a + b + plot_layout(ncol = 2)
```
Figure \@ref(fig:logit-overvote-model) details the residuals of our predictions. We see a similarly strange pattern in the residual scatterplot as in the linear overvote model, but this time the lower line is curved and the upper line is straight. The residual histogram has no clear single mode, only a roughly bell-shaped rise moving up from the negative end. The middle 95% of residuals spans a similar range under the two methods ([`r paste(round(quantile(linear_over$residuals, c(.025, .975)), 3), collapse = ",")`] for the linear model and [`r paste(round(quantile(over_df$resid, c(.025, .975)), 3), collapse = ",")`] for the logistic), but with different signs - the logistic model has more negative values, the linear more positive. Neither the linear nor the logistic model appears to be a good fit for overvote rates here. Perhaps a zero-inflated method would perform better with this unusual distribution.
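A hedged sketch of what that zero-inflated alternative might look like is shown below, modeling the overvote count with the ballot total as an exposure offset via the `pscl` package (assumed to be installed). This model was not fit as part of this analysis.
```{r eval = FALSE}
library(pscl)
# Zero-inflated Poisson: a count model for overvotes (with the log of the
# ballot total as an offset) plus an intercept-only model for structural zeros.
zip_over <- zeroinfl(over_count ~ black | 1,
                     offset = log(over_count + no_over_count),
                     data = sf_precincts, dist = "poisson")
summary(zip_over)
```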
## Undervoting
The rate of undervoting across the city is displayed in Figure \@ref(fig:undervote-map). These are somewhat normally distributed, and show no clear spatial pattern. Most precincts had between 20% and 40% of voters undervote.
```{r undervote-map, fig.cap='Observed undervote rate by precinct', out.width = '6in', message=FALSE}
undervote_map <- ggplot(sf_precincts) + geom_sf(aes(fill = undervote_rate), lwd = 0) +
theme_void() + labs(fill = 'Undervote rate')
undervote_hist <- ggplot(sf_precincts, aes(x = undervote_rate)) + geom_histogram() +
theme_bw() + labs(x = 'Undervote rate', y = 'Count')
undervote_map + undervote_hist + plot_layout(ncol = 2, widths = c(4,2))
```
### Linear
With a BIC value of $`r round(BIC(linear_under), 1)`$, the best linear model (Table \@ref(tab:linear-undervote-model)) we found for predicting undervote rates included predictors for age, education, and race. The three youngest age categories were negatively correlated with undervoting, both education variables were positively correlated, and three racial or ethnic groups (`black`, `pac_islander`, `white`) were positively correlated as well. While this doesn't agree with previous literature on voting habits, neither does it provide a clear pattern on which to base a disagreement. In particular, we would not expect the percentage of people with a college degree *and* the percentage of people with no high school diploma to trend in the same direction.
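Given the collinearity flagged earlier (here, for example, between the two education variables), a quick diagnostic such as variance inflation factors could be run with the `car` package; this is an illustrative check, not part of the original model selection.
```{r eval = FALSE}
library(car)
# Variance inflation factors for the linear undervote model: values well above
# roughly 5-10 suggest a predictor is largely explained by the others.
vif(linear_under)
```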
```{r linear-undervote-model, fig.cap="Linear undervote model validation", message=FALSE, out.width='6in'}
kable(tidy(linear_under) %>% mutate(term = stringr::str_remove_all(term, '\\(|\\)')),
caption = "Linear model for undervote rates",
caption.short = "Linear undervote model",
booktabs = TRUE)
a <- ggplot(linear_under, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
theme_bw() +
xlab("Fitted values") +
ylab("Residuals")
b <- ggplot(linear_under, aes(x = .resid)) +
geom_histogram() +
theme_bw() +
xlab("Residuals")
# Residuals here are roughly normal, with no obvious variance pattern
a + b + plot_layout(ncol = 2)
```
Figure \@ref(fig:linear-undervote-model) details the residuals from the model. The linear model seems to be a decent fit for this data, with relatively normal residuals and no clear variance pattern. The scale of the residuals could be improved, but on the whole this is not a poorly fitting model.
### Logistic
With a BIC value of $`r round(BIC(logit_under), 1)`$, the best logistic model (Table \@ref(tab:logit-undervote-model)) we found for predicting undervote rates also included variables in age, education, and race. We see this strange pattern in variable coefficients here as well - both educational groups and all of these ethnic groups are positively associated with undervoting (no clear minority vs. White distinction), and the three younger age categories are negatively associated with it.
```{r logit-undervote-model, fig.cap="Logistic undervote model validation", message=FALSE, out.width='6in'}
kable(tidy(logit_under) %>% mutate(term = stringr::str_remove_all(term, '\\(|\\)')),
caption = "Logistic model for undervote rates",
caption.short = "Logit undervote model",
booktabs = TRUE)
under_df <- data.frame(fitted = predict(logit_under, sf_precincts, type = 'response'),
actual = sf_precincts$undervote_rate) %>%
mutate(resid = fitted - actual)
a <- ggplot(under_df, aes(x = fitted, y = resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
theme_bw() +
xlab("Predicted overvote rates") +
ylab("Residuals")
b <- ggplot(under_df, aes(x = resid)) +
geom_histogram() +
theme_bw() +
xlab("Residuals") +
ylab("Count")
a + b + plot_layout(ncol = 2)
```
Figure \@ref(fig:logit-undervote-model) details the residuals of our predictions. These look very similar to those of the linear model: approximately normally distributed, with no clear heteroskedasticity, and even the same cluster shape in the residual scatter plot. They also have similar spreads ([`r paste(round(quantile(linear_under$residuals, c(.025, .975)), 3), collapse = ",")`] for the middle 95% of the linear model's residuals and [`r paste(round(quantile(under_df$resid, c(.025, .975)), 3), collapse = ",")`] for the logistic). Under all these metrics, the logistic model performs well. Despite these models seeming to fit appropriately, it's hard to draw a useful conclusion from them due to the aforementioned issues with variable meaning.
## Summary
Table: (\#tab:model-comp) Comparison of models fit
| Outcome | Model | BIC | Residual spread (middle 95%) | Residual shape |
|---------|-------|-----|------------------------------|----------------|
| Turnout | Linear | $`r round(BIC(linear_turnout), 1)`$ | $`r round(diff(quantile(linear_turnout[['residuals']], c(.025, .975))), 3)`$ | Normal |
| | Logistic | $`r round(BIC(logit_turnout), 1)`$ | $`r round(diff(quantile(turnout_df[['resid']], c(.025, .975))), 3)`$ | Normal |
| Overvoting | Linear | $`r round(BIC(linear_over), 1)`$ | $`r round(diff(quantile(linear_over[['residuals']], c(.025, .975))), 3)`$ | Non-normal |
| | Logistic | $`r round(BIC(logit_over), 1)`$ | $`r round(diff(quantile(over_df[['resid']], c(.025, .975))), 3)`$ | Non-normal |
| Undervoting | Linear | $`r round(BIC(linear_under), 1)`$ | $`r round(diff(quantile(linear_under[['residuals']], c(.025, .975))), 3)`$ | Normal |
| | Logistic | $`r round(BIC(logit_under), 1)`$ | $`r round(diff(quantile(under_df[['resid']], c(.025, .975))), 3)`$ | Normal |
In Table \@ref(tab:model-comp) we compare the models fit earlier. While the BIC is not entirely comparable between the linear and logistic models, we can still draw some insights here. A lower BIC is better, so on the whole the models for overvoting and undervoting were much better fits than the models for turnout. The spread of the residuals points in the same direction. We did see, however, that
1. The overvoting models had some significantly non-normal residual patterns, and
2. The undervoting models showed strange relationships between the coefficients.
Overall, however, this analysis shows that the rates of overvoting and undervoting are associated with some of the same demographic factors that affect turnout in general: age, race, and education.