---
nocite: "@walker-2021"
---
# U.S. Census Bureau
```{r message=FALSE, warning=FALSE}
library(tidyverse)
library(tidycensus)
```
## Census Bureau basics
The U.S. Census Bureau is a fantastic resource for data related to the U.S. population. As you saw in the [API Basics](http://dcl-wrangle.stanford.edu/api-basics.html) chapter, the Census Bureau makes a wide variety of APIs available. In this chapter, we'll focus on three: the decennial census, the American Community Survey, and the population estimates.
As we mentioned previously, many R packages wrap commonly used APIs, making it easier for you to obtain data. In this chapter, we'll introduce one such package: [tidycensus](https://walkerke.github.io/tidycensus/). We'll show you how to use tidycensus to obtain data from the decennial census and American Community Survey. Then, we'll go into more detail about working directly with the Census Bureau APIs.
First, we'll give a bit of background about three U.S. Census data sources: the decennial census, the American Community Survey, and the population estimates.
### Decennial census
When most people think of the U.S. Census, they're thinking about the decennial census. The Census Bureau conducts the decennial census every ten years (starting in 1790), with the goal of determining the number of people living in the United States. Because many features of the U.S. government, including the number of representatives awarded to each state, depend on accurate population counts, the decennial census is required by the Constitution.
For the decennial census, the Census Bureau tries to survey every household in the U.S. in an attempt to count every U.S. resident. The population estimates that come from the decennial census are therefore the most definitive that you can find. However, decennial census data come out only every ten years, so they can be out of date. The decennial census survey also asks only a few questions, primarily about household size, race, and ethnicity.^[U.S. Census Bureau. Questionnaires. https://www.census.gov/history/www/through_the_decades/questionnaires/] The American Community Survey provides more detailed and up-to-date data.
### ACS
Between 1790 and 2000, decennial censuses included both a short form and a long form. Every household filled out the short form, but a sample also filled out the long form, which included additional questions. After 2000, the Census Bureau turned the long form into the American Community Survey (ACS), and began administering the ACS every year.
The ACS, unlike the decennial census, is a sample. Every year, the ACS surveys a representative sample consisting of 3.5 million households. The Census Bureau then uses this sample to provide estimates for the entire U.S. population.^[U.S. Census Bureau. American Community Survey: Information Guide. https://www.census.gov/content/dam/Census/programs-surveys/acs/about/ACS_Information_Guide.pdf]
The ACS calculates these estimates over two time periods: 1 year and 5 years. The 1-year estimates are the most current, but have larger margins of error due to their smaller sample size. Most of the time, you'll use the 5-year estimates. Their larger sample sizes give them greater accuracy, particularly for smaller geographic units.
### Population estimates
The third data source we'll discuss comes from the Census Bureau's [Population Estimates Program (PEP)](https://www.census.gov/programs-surveys/popest.html). The decennial census publishes the definitive population of the United States every ten years. However, if you want to know the population of a U.S. geographic area between decennial census years, you'll need to use the [Population Estimates APIs](https://www.census.gov/data/developers/data-sets/popest-popproj/popest.html).
The ACS also includes population estimates, but the estimation techniques used for the Population Estimates Program are more accurate.
### Choosing data
The first step to working with U.S. Census data is to decide which data source to use. Here's a quick guide:
__ACS__
Most of the time, you'll want the ACS. The ACS includes many different variables on social, economic, housing, and demographic aspects.
You'll typically want the 5-year ACS, unless you're looking at a large or rapidly changing geographic area or need yearly data.
__Decennial census__
Use the decennial census if you want definitive population data, don't need that many variables, and don't mind that the data is only available every 10 years.
__Population estimates__
Use data from the Population Estimates Program if you want accurate population data for a non-decennial census year (i.e., a year not divisible by 10).
## tidycensus
The [tidycensus](https://walkerke.github.io/tidycensus/) package wraps several U.S. Census Bureau APIs, allowing you to access decennial census and ACS data through R functions.
Before you use tidycensus for the first time, you'll need to obtain a Census Bureau API key. You can request one [here](https://api.census.gov/data/key_signup.html). You'll receive an email with your key. Copy your key to the clipboard, then navigate back to RStudio. Run the following line to open your .Renviron file:
```{r eval=FALSE}
usethis::edit_r_environ()
```
Then, add the following line, replacing `YOUR_API_KEY` with the key sent to you by the Census Bureau.
```{r eval=FALSE}
CENSUS_API_KEY=YOUR_API_KEY
```
Save and close the file, and then restart R (_Ctrl/Cmd_ + _Shift_ + _F10_) for the changes to take effect.
From now on, you won't need to worry about a key. Your key will stay in your .Renviron across R sessions.
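To confirm that R can see the key after restarting, a quick check along these lines may help. This is a minimal sketch that assumes only the variable name `CENSUS_API_KEY` used above; `has_census_key` is a name chosen here for illustration.

```{r}
# Sys.getenv() returns "" when the variable is not set,
# so nzchar() is TRUE only if a non-empty key was found.
has_census_key <- nzchar(Sys.getenv("CENSUS_API_KEY"))
has_census_key
```

If this returns `FALSE`, check that the line in your .Renviron has no typos and that you restarted R after saving the file.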
### Specify a dataset
We'll show you how to use the tidycensus package to access two Census APIs: the decennial census and the ACS. Before you start using tidycensus, you'll need to decide which dataset and year to use.
__Dataset__
See our discussion in [Choosing data](http://dcl-wrangle.stanford.edu/census.html#choosing-data) for the trade-off between the decennial census, ACS, and the different ACS estimates.
For many tidycensus functions, you specify the different surveys in the following way:
* `"acs5"`: 5-year ACS
* `"acs1"`: 1-year ACS
* `"sf1"`: Decennial census
_sf_ stands for _Summary File_. Summary File 1 (`"sf1"`) corresponds to the short form described earlier, while Summary File 3 (`"sf3"`) corresponds to the long form. As explained earlier, the ACS took the place of the long form after the 2000 census, so `"sf3"` is only available for censuses from 2000 or earlier.
__Year__
You'll also need to decide on a year. For the decennial censuses, `year` will just be the year of the decennial census. Remember that the decennial census occurs in years ending in 0. The tidycensus package can access the 1990, 2000, and 2010 decennial censuses.
For ACS data, the `year` argument of tidycensus functions refers to the end-year of the sample period. For example, if you want to use a 5-year ACS that ended in 2019, set `year = 2019`. As of July 2021, tidycensus supports 5-year ACS end-years 2009 through 2019, and 1-year ACS end-years 2005 through 2019.
### Find variables
```{r echo=FALSE}
all_vars_decennial <-
  load_variables(year = 2010, dataset = "sf1")

all_vars_acs5 <-
  load_variables(year = 2019, dataset = "acs5")
```
Data from both the ACS and decennial census capture many variables. The 2010 decennial census includes `r format(nrow(all_vars_decennial), big.mark = ",")` variables, while the 2019 5-year ACS includes `r format(nrow(all_vars_acs5), big.mark = ",")`!
A code, like `H001001` or `P011014`, identifies each of these variables. To use tidycensus, you'll need to determine the codes of your variables of interest. We'll use the function `tidycensus::load_variables()` to find ACS or decennial census variables and their accompanying codes.
`load_variables()` returns a tibble of all variable codes from a given dataset, alongside brief descriptions. We'll load the variables from the 2019 5-year ACS as an example.
```{r}
all_vars_acs5 <-
  load_variables(year = 2019, dataset = "acs5")

all_vars_acs5
```
`load_variables()` returns a tibble with three variables:
* `name`: The variable code.
* `label`: A description of the variable.
* `concept`: A broader categorization.
Let's take a closer look at just one concept: sex by age.
```{r}
all_vars_acs5 %>%
  filter(concept == "SEX BY AGE")
```
`r nrow(all_vars_acs5 %>% filter(concept == "SEX BY AGE"))` variables belong to the "SEX BY AGE" concept. Each row refers to a variable under that concept.
This way of thinking about variables can be a bit confusing at first. The `Estimate!!Total` variable captures the number of people for whom sex by age data is available. `Estimate!!Total!!Male` captures the total number of males, while `Estimate!!Total!!Male!!Under 5 years` captures the total number of males under 5 years old.
This is a bit more intuitive for concepts like `"SEX BY AGE (ASIAN ALONE)"`.
```{r}
all_vars_acs5 %>%
  filter(concept == "SEX BY AGE (ASIAN ALONE)")
```
Here, `Estimate!!Total` represents the total number of Asian U.S. residents for whom sex/age data is relevant.
```{r echo=FALSE, out.height='50%'}
knitr::include_graphics("images/census/acs-variables.png", dpi = 50)
```
Most concepts have an `Estimate!!Total` variable, or something similar. If you want to calculate a proportion, such as the proportion of males, use the relevant `Estimate!!Total` as the denominator.
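As a rough illustration of using the total as a denominator, the sketch below computes the proportion of males for two made-up areas. The counts are hypothetical; `B01001_001` and `B01001_002` are the codes for the "SEX BY AGE" total and the male total in ACS table B01001.

```{r}
library(tidyverse)

# Hypothetical counts for two made-up areas, in the long format tidycensus uses
toy_sex_by_age <-
  tribble(
    ~GEOID, ~variable,    ~estimate,
    "01",   "B01001_001", 1000,
    "01",   "B01001_002", 480,
    "02",   "B01001_001", 2000,
    "02",   "B01001_002", 1010
  )

# One row per area, then divide the male count by the total
toy_prop_male <-
  toy_sex_by_age %>%
  pivot_wider(names_from = variable, values_from = estimate) %>%
  mutate(prop_male = B01001_002 / B01001_001)

toy_prop_male
```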
To find the variables you want, pipe the result of `load_variables()` into `view()`.
```{r eval=FALSE}
all_vars_acs5 %>%
  view()
```
You can use the search bar to search for variables with a given keyword (e.g., "income"). You can also click on the _Filter_ button to get a search bar for each variable. Once you've found the variables you want, copy their codes (the `name` variable), and store them in a named vector.
```{r}
vars_acs5 <-
  c(
    median_income = "B06011_001",
    median_rent = "B25064_001"
  )
```
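If you prefer to search in code rather than in the viewer, one option is to filter the variables tibble with `str_detect()`. The sketch below uses a small stand-in tibble (`toy_vars`, a hypothetical name) so that it runs without an API call. The codes are real ACS codes, but the labels and concepts are shortened for illustration.

```{r}
library(tidyverse)

# A small stand-in for the tibble returned by load_variables()
toy_vars <-
  tribble(
    ~name,        ~label,                        ~concept,
    "B01001_001", "Estimate!!Total",             "SEX BY AGE",
    "B06011_001", "Estimate!!Median income",     "MEDIAN INCOME IN THE PAST 12 MONTHS",
    "B25064_001", "Estimate!!Median gross rent", "MEDIAN GROSS RENT"
  )

# Keep variables whose concept mentions "income", ignoring case
income_vars <-
  toy_vars %>%
  filter(str_detect(concept, regex("income", ignore_case = TRUE)))

income_vars
```

The same `filter()` call should work on the full result of `load_variables()`, since it has the same three columns.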
If we wanted to get variables from, say, the 2010 decennial census, we'd use:
```{r eval=FALSE}
load_variables(year = 2010, dataset = "sf1") %>%
  view()
```
The [documentation](https://www.census.gov/programs-surveys/acs/technical-documentation/code-lists.html) for the ACS is helpful if you need additional information about ACS variables.
### Get data
tidycensus provides the functions `get_acs()` and `get_decennial()` to get Census Bureau ACS and decennial data. At minimum, you should supply these functions with three arguments:
* `geography`
* `variables`
* `year`
`geography` controls the geographic level of the data returned. The tidycensus website includes a helpful [table of all available geographies](https://walkerke.github.io/tidycensus/articles/basic-usage.html#geography-in-tidycensus). Common values are "state" and "county".
Supply `variables` with a vector of variable codes. Earlier, we stored the codes of some variables in the vector `vars_acs5`.
```{r}
vars_acs5
```
For `get_acs()`, `year` indicates the _end-year_ for the ACS estimates. If you want ACS estimates from 2015-2019, set `year = 2019`. By default, `get_acs()` uses the 5-year estimates. You can use other estimates by specifying `survey`.
```{r}
df_acs <-
  get_acs(
    geography = "state",
    variables = vars_acs5,
    year = 2019
  )

df_acs
```
`get_acs()` will return the estimate and margin of error (`moe`) for each variable. Because the ACS values are estimates, the Census Bureau calculates a margin of error for most variables.
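One common use of `moe` is to turn each estimate into a range. The sketch below does this on hypothetical values shaped like the output of `get_acs()`. By default, tidycensus reports margins of error at a 90 percent confidence level (the `moe_level` argument of `get_acs()`), so the resulting ranges are 90 percent confidence intervals under that default.

```{r}
library(tidyverse)

# Hypothetical values shaped like the output of get_acs()
toy_acs <-
  tribble(
    ~GEOID, ~NAME,     ~variable,       ~estimate, ~moe,
    "01",   "State A", "median_income", 30000,     400,
    "02",   "State B", "median_income", 35000,     1200
  )

# The interval is the estimate plus or minus the margin of error
toy_acs_bounds <-
  toy_acs %>%
  mutate(
    lower = estimate - moe,
    upper = estimate + moe
  )

toy_acs_bounds
```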
To pivot the data into a wider format, we can use `pivot_wider()`.
```{r}
df_acs %>%
  pivot_wider(
    names_from = variable,
    values_from = c(estimate, moe)
  )
```
`get_decennial()` works similarly. First, we'll find the variables with `load_variables()` and store several in a vector.
```{r eval=FALSE}
load_variables(year = 2010, dataset = "sf1") %>%
  view()
```
```{r}
# H-prefixed codes come from housing tables, so these count housing units
vars_decennial <-
  c(
    housing_urban = "H002002",
    housing_rural = "H002005"
  )
```
Then, we'll use `get_decennial()` to access the data. This time, we'll get the data at the county level. Here, `year` is the year of the decennial census.
```{r}
df_decennial <-
  get_decennial(
    geography = "county",
    variables = vars_decennial,
    year = 2010
  )

df_decennial
```
Again, we can use `pivot_wider()`.
```{r}
df_decennial %>%
  pivot_wider(names_from = variable, values_from = value)
```
Now, you can use your data however you wish. Note that, if you're interested in geospatial aspects of ACS or decennial census data, we recommend using our [ussf package](https://github.com/dcl-docs/ussf) for boundaries. You can install the package with the following command.
```{r eval=FALSE}
remotes::install_github("dcl-docs/ussf")
```
You can join your ACS or decennial census data with the result of `ussf::boundaries()`, using `GEOID` as a unique identifier.
```{r}
df_acs %>%
  left_join(
    ussf::boundaries(geography = "state") %>% select(GEOID),
    by = "GEOID"
  )
```
You can make maps with this data using ggplot2 and `geom_sf()`.
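Here is a minimal sketch of that workflow. So that it runs without an API key or the ussf package, it uses two hypothetical square "states" (`toy_boundaries`) in place of `ussf::boundaries()` and made-up estimates (`toy_estimates`) in place of ACS data. It also assumes the sf package is installed.

```{r}
library(tidyverse)
library(sf)

# Two adjacent unit squares standing in for real boundaries
toy_boundaries <-
  st_sf(
    GEOID = c("01", "02"),
    geometry =
      st_sfc(
        st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))),
        st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 1), c(1, 1), c(1, 0))))
      )
  )

# Made-up estimates keyed by GEOID
toy_estimates <- tibble(GEOID = c("01", "02"), estimate = c(30000, 40000))

# Joining onto a tibble returns a tibble, so convert back to an sf object
toy_joined <-
  toy_estimates %>%
  left_join(toy_boundaries, by = "GEOID") %>%
  st_as_sf()

toy_joined %>%
  ggplot() +
  geom_sf(aes(fill = estimate))
```

The call to `st_as_sf()` is there because the left side of the join is a plain tibble. The joined result keeps the geometry column but is not an sf object until you convert it.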
## Population estimates
For population estimates outside decennial census years, you'll need to work directly with the [Population Estimates APIs](https://www.census.gov/data/developers/data-sets/popest-popproj/popest.html).
In the [API Basics](http://dcl-wrangle.stanford.edu/api-basics.html#find-your-api) chapter, we discussed a workflow for working directly with APIs, and walked you through an example that used the Population Estimates. In this section, we'll dive into more detail about the Population Estimates APIs.
### Choose data
There are two types of population estimate APIs: the _vintages_ and the _intercensals_. Each vintage contains data on all years since the last decennial census. For example, the 2018 Vintage contains data for each year between 2010 and 2018. You should use the most recent vintage available, since the Census Bureau updates all previous years' estimates.
The intercensals contain data between previous decennial censuses. For example, the 2000-2010 intercensal contains yearly data for 2000 through 2010. If you want data from, for example, 2000 to 2018, you'll need to use both a vintage and an intercensal, which will involve two API queries.
As of February 2020, the Vintage 2018 estimates are the most recent, fully available estimates. Vintage 2019 estimates are available for some geographic units, but won't be available at the county level until March 2020.
Once you've decided on a vintage or an intercensal, click on the corresponding tab on the Population Estimates APIs page. You'll see the various available APIs for those estimates.
The links under the API name provide you with more information about the API. If you want population data, you'll probably want the _Population Estimates_ API, which provides yearly population estimates at various geographic levels.
```{r echo=FALSE}
knitr::include_graphics(
  "images/census/vintage-2018-pop-est.png", dpi = image_dpi
)
```
### Craft your request
In the API Basics chapter, we laid out the steps for working with an API. After finding your API, the next step is to craft the request.
First, you'll need to find the __base request__, which will be listed next to _API Call_.
```{r echo=FALSE}
knitr::include_graphics(
  "images/census/vintage-2018-api-call.png", dpi = image_dpi
)
```
Next, add __parameters__ to the base request to specify exactly what data you want. For the Population Estimates APIs, there are two important parameters: `get` and `for`. (You'll also see `key` in the examples on the Census website, but an API key isn't actually necessary.)
```{r echo=FALSE}
knitr::include_graphics("images/census/api-request.png", dpi = image_dpi)
```
`get` controls which variables (i.e., what will eventually become your tibble columns) the request returns. `for` controls the geographic level. The parameters come after a `?` and are separated by a `&`. We'll go over both in more detail next.
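To make the structure concrete, the sketch below takes one request apart at the `?` and the `&`. The request follows the pattern of the Census examples. The object names (`request_parts`, `base_request`, `parameters`) are chosen here for illustration.

```{r}
library(tidyverse)

request <-
  "https://api.census.gov/data/2018/pep/population?get=GEONAME,POP&for=us"

# Everything before the `?` is the base request; everything after is parameters
request_parts <- str_split_fixed(request, fixed("?"), n = 2)
base_request <- request_parts[1, 1]

# Parameters are separated by `&`
parameters <- str_split(request_parts[1, 2], fixed("&"))[[1]]

base_request
parameters
```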
__Variables__
To see all possible variables, click on the _Variables_ link under your chosen API.
```{r echo=FALSE}
knitr::include_graphics("images/census/vintage-2018-variables.png", dpi = image_dpi)
```
This will lead you to a table of variables.
```{r echo=FALSE}
knitr::include_graphics(
  "images/census/vintage-2018-variables-table.png",
  dpi = image_dpi
)
```
The variables in all caps, like `POP` and `GEONAME`, are the names of variables returned by the API. The variables in lowercase, like `for` and `in`, are actually API request parameters. For now, just pay attention to the uppercase variables. You'll specify the names of your desired variables after the API parameter `get`.
Often, you'll want to get population over time. To get data for each year, you'll need the variables `DATE_DESC` and `DATE_CODE`. `DATE_CODE` is a code associated with a date (e.g., `1` or `2`), and `DATE_DESC` describes the date (e.g., `"7/1/2018 population estimate"`). Without `DATE_DESC`, you won't know what each `DATE_CODE` refers to, and without `DATE_CODE`, the API only returns data for one year.
__Geographies__
Next, you'll need to specify the geographic level of the data. For each Population Estimates API, check which geographies are available by clicking on the _Examples and Supported Geographies_ link.
```{r echo=FALSE}
knitr::include_graphics(
  "images/census/vintage-2018-examples-geographies.png",
  dpi = image_dpi
)
```
This link leads to a table with a description of the API, and links to examples, geographies, and other information. Click on the _geographies_ link. Then, you'll see a table with different geographies.
```{r echo=FALSE}
knitr::include_graphics(
  "images/census/geographies.png",
  dpi = image_dpi
)
```
Use these geographies to determine the data returned. For example:
* `for=us` will return data for the entire U.S.
* `for=state:30` will return data for Montana.
* `for=state:*` will return data by state, for all states.
* `for=county:*` will return data by county, for all counties in the U.S.
* `for=county:*&in=state:30` will return data by county, just for Montana.
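If you build many requests, it may be convenient to assemble them in code. The function below, `pep_request()`, is a hypothetical helper written for this example (it is not part of any package). It assumes the Vintage 2018 Population Estimates base request and simply pastes the `get`, `for`, and optional `in` parameters together.

```{r}
library(tidyverse)

# Hypothetical helper: builds a Vintage 2018 Population Estimates request
pep_request <- function(variables, geography, within = NULL) {
  base_request <- "https://api.census.gov/data/2018/pep/population"

  request <-
    str_c(
      base_request,
      "?get=", str_c(variables, collapse = ","),
      "&for=", geography
    )

  # Add the optional `in` parameter, e.g., counties within one state
  if (!is.null(within)) {
    request <- str_c(request, "&in=", within)
  }

  request
}

# Population by county, just for Montana
pep_request(c("GEONAME", "POP"), geography = "county:*", within = "state:30")
```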
#### Examples
Each API has an examples page. From the Population Estimates homepage, navigate to _Examples and Supported Geographies_ > _Examples_. [This page](https://api.census.gov/data/2018/pep/population/examples.html) lists examples for the 2018 Vintage Population Estimates API.
Here are some more examples:
__Entire U.S. population data by year__
[https://api.census.gov/data/2018/pep/population?get=GEONAME,DATE_CODE,DATE_DESC,POP&for=us](https://api.census.gov/data/2018/pep/population?get=GEONAME,DATE_CODE,DATE_DESC,POP&for=us)
__Population data for only Montana, by year__
[https://api.census.gov/data/2018/pep/population?get=GEONAME,DATE_CODE,DATE_DESC,POP&for=state:30](https://api.census.gov/data/2018/pep/population?get=GEONAME,DATE_CODE,DATE_DESC,POP&for=state:30)
__Population data for all counties, by year__
[https://api.census.gov/data/2018/pep/population?get=GEONAME,DATE_CODE,DATE_DESC,POP&for=county:*](https://api.census.gov/data/2018/pep/population?get=GEONAME,DATE_CODE,DATE_DESC,POP&for=county:*)
### Read the data into R
The next step is to read your data into R. We'll use the same process introduced in [API Basics](http://dcl-wrangle.stanford.edu/api-basics.html#read-the-data-into-r). Let's use a request that gets population data for all states, by year.
```{r, warning=FALSE}
request <-
  "https://api.census.gov/data/2018/pep/population?get=GEONAME,DATE_CODE,DATE_DESC,POP&for=state:*"

response <-
  request %>%
  jsonlite::fromJSON() %>%
  as_tibble() %>%
  janitor::row_to_names(row_number = 1)

response
```
You'll still need to do a bit of light cleaning. Also, notice that there are three different estimates for 2010 (the decennial census year).
```{r}
response %>%
  distinct(DATE_DESC) %>%
  pull(DATE_DESC)
```
You'll generally want to use all the July 1st estimates so that all estimates are a year apart.
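One possible version of that cleaning step is sketched below on a few hypothetical rows (`toy_response`) shaped like `response`. The population values and the first `DATE_DESC` string are made up for illustration, so check them against what the API actually returns. The sketch keeps only the July 1st estimates, pulls the year out of `DATE_DESC`, and converts the population from character to numeric.

```{r}
library(tidyverse)

# Hypothetical rows shaped like `response`; every column arrives as character
toy_response <-
  tribble(
    ~GEONAME,  ~DATE_CODE, ~DATE_DESC,                     ~POP,      ~state,
    "Montana", "1",        "4/1/2010 Census population",   "989415",  "30",
    "Montana", "3",        "7/1/2010 population estimate", "990722",  "30",
    "Montana", "11",       "7/1/2018 population estimate", "1062305", "30"
  )

toy_population <-
  toy_response %>%
  # Keep only the July 1st estimates
  filter(str_detect(DATE_DESC, "^7/1/")) %>%
  transmute(
    name = GEONAME,
    fips = state,
    # The first four-digit run in the description is the year
    year = as.integer(str_extract(DATE_DESC, "\\d{4}")),
    population = as.double(POP)
  )

toy_population
```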
## To learn more
A good source to go deeper into working with census data is Walker, [Analyzing US Census Data: Methods, Maps, and Models in R](https://walker-data.com/census-r/).