introduction-to-datascience/index.Rmd at master · eoas-a448/introduction-to-datascience · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
---
title: "Data Science: A First Introduction"
author:
- Tiffany-Anne Timbers
- Trevor Campbell
- Melissa Lee
date: "`r Sys.Date()`"
site: bookdown::bookdown_site
documentclass: book
bibliography: [references.bib]
biblio-style: apalike
link-citations: yes
description: "This is a textbook for teaching a first introduction to data science."
output:
  bookdown::gitbook:
    css: style.css
    config:
      toc:
        before: |
          <li><a href="./">Data Science: A First Introduction</a></li>
        after: |
          <li><a href="https://github.com/rstudio/bookdown" target="blank">Published with bookdown</a></li>
      edit: https://github.com/rstudio/bookdown-demo/edit/master/%s
      download: ["pdf", "epub"]
  bookdown::pdf_book:
    includes:
      in_header: preamble.tex
    latex_engine: xelatex
    citation_package: natbib
    keep_tex: yes
  bookdown::epub_book: default
  always_allow_html: true
---

```{r setup, include=FALSE}
library(forcats)

# read canlang data from GitHub and place it in the data directory
can_lang <- readr::read_csv("https://github.com/ttimbers/canlang/raw/master/inst/extdata/can_lang.csv") %>%
  readr::write_csv("data/can_lang.csv")
```

# R, Jupyter, and the tidyverse

This is an open source textbook aimed at introducing undergraduate students to data science. It was originally written for the University of British Columbia's [DSCI 100 - Introduction to Data Science](https://ubc-dsci.github.io/dsci-100/) course. In this book, we define data science as the study and development of reproducible, auditable processes to obtain value (i.e., insight) from data.

The book is structured so that learners spend the first four chapters learning how to use the R programming language and Jupyter notebooks to load, wrangle/clean, and visualize data, while answering descriptive and exploratory data analysis questions. The remaining chapters illustrate how to solve four common problems in data science, which are useful for answering predictive and inferential data analysis questions:

1. Predicting a class/category for a new observation/measurement (e.g., cancerous or benign tumour)
2. Predicting a value for a new observation/measurement (e.g., 10 km race time for 20 year old females with a BMI of 25).
3. Finding previously unknown/unlabelled subgroups in your data (e.g., products commonly bought together on Amazon)
4. Estimating an average or a proportion from a representative sample (group of people or units) and using that estimate to generalize to the broader population  (e.g., the proportion of undergraduate students that own an iphone)

For each of these problems, we map them to the type of data analysis question being asked and discuss what kinds of data are needed to answer such questions [@leek2015question; @peng2015art]. More advanced (e.g., causal or mechanistic) data analysis questions are beyond the scope of this text.

**Types of data analysis questions**

| Question type | Description | Example |
|---------------|-------------|---------|
| Descriptive | A question which asks about summarized characteristics of a data set without interpretation (i.e., report a fact). | How many people live in each province or territory in Canada? |
| Exploratory | A question asks if there are patterns, trends, or relationships within a single data set. Often used to propose hypotheses for future study. | Does political party voting change with indicators of wealth in a set of data collected on 2,000 people living in Canada? |
| Inferential | A question that looks for patterns, trends, or relationships in a single data set **and** also asks for quantification of how applicable these findings are to the wider population. | Does political party voting change with indicators of wealth for all people living in Canada? |
| Predictive | A question that asks about predicting measurements or labels for individuals (people or things). The focus is on what things predict some outcome, but not what causes the outcome. | What political party will someone vote for in the next Canadian election? |
| Causal | A question that asks about whether changing one factor will lead to a change in another factor, on average, in the wider population. | Does wealth lead to voting for a certain political party in Canadian elections? |
| Mechanistic | A question that asks about the underlying mechanism of the observed patterns, trends, or relationship (i.e., how does it happen?) | How does wealth lead to voting for a certain political party in Canadian elections? |

Source: [What is the question?](https://science.sciencemag.org/content/347/6228/1314) by Jeffery T. Leek, Roger D. Peng & [The Art of Data Science](https://leanpub.com/artofdatascience) by Roger Peng & Elizabeth Matsui

## Chapter learning objectives

By the end of the chapter, students will be able to:

- use a Jupyter notebook to execute provided R code
- edit code and markdown cells in a Jupyter notebook
- create new code and markdown cells in a Jupyter notebook
- load the `tidyverse` package into R
- create new variables and objects in R using the assignment symbol
- use the help and documentation tools in R
- match the names of the following functions from the `tidyverse` package to their documentation descriptions:
    - `read_csv`
    - `select`
    - `filter`
    - `mutate`
    - `ggplot`
    - `aes`

## Jupyter notebooks

Jupyter notebooks are documents that contain a mix of computer code (and its output) and formattable text. Given that they are able to combine these two in a single document---code is not separate from the output or written report---notebooks are one of the leading tools to create *reproducible data analyses*. A reproducible data analysis is one where you can reliably and easily recreate the same results when analyzing the same data. Although this sounds like something that should always be true of any data analysis, in reality this is not often the case; one needs to make a conscious effort to perform data analysis in a reproducible manner.

The name Jupyter came from combining the names of the three programming language that it was initially targeted for (Julia, Python, and R), and now many other languages can be used with Jupyter notebooks.

A notebook looks like this:

```{r img-jupyter, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Jupyter Notebook", fig.retina = 2}
knitr::include_graphics("img/jupyter.png")
```

We have included a short demo video here to help you get started and to introduce you to R and Jupyter.
However, the best way to learn how to write and run code and formattable text in a Jupyter notebook is to do it yourself! [Here is a worksheet](https://github.com/UBC-DSCI/dsci-100-assets/blob/master/2019-fall/materials/worksheet_01/worksheet_01.ipynb) that provides a step-by-step guide through the basics.

<iframe width="840" height="473" src="https://www.youtube.com/embed/iel3ZvS-Ryc" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

## Loading a spreadsheet-like dataset

Often, the first thing we need to do in data analysis is to load a dataset into R. When we bring spreadsheet-like (think Microsoft Excel tables) data, generally shaped like a rectangle, into R it is represented as what we call a *data frame* object. It is very similar to a spreadsheet where the rows are the collected observations and the columns are the variables.


```{r img-spreadsheet-vs-dataframe, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "A spreadsheet versus a data frame in R", out.width="850", fig.retina = 2}
knitr::include_graphics("img/spreadsheet_vs_dataframe.PNG")
```

The first kind of data we will learn how to load into R (as a data frame) is the
spreadsheet-like *comma-separated values* format (`.csv` for short).
These files have names ending in `.csv`, and can be opened open and saved from common spreadsheet programs like Microsoft Excel and Google Sheets.
For example, a `.csv` file named `can_lang.csv` [is included with the code for this book](https://github.com/UBC-DSCI/introduction-to-datascience/blob/master/data/can_lang.csv).
This file--- originally from [{canlang} R data package](https://ttimbers.github.io/canlang/)---has language data collected in the 2016 Canadian census [@cancensus2016].
If we were to open this data in a plain text editor, we would see each row on its own line, and each entry in the table separated by a comma:

```{bash, echo=FALSE, comment=NA}
head -n 10 data/can_lang.csv
```

To load this data into R, and then to do anything else with it afterwards, we will need to use something called a *function.*
A function is a special word in R that takes in instructions (we call these *arguments*) and does something. The function we will
use to read a `.csv` file into R is called `read_csv`.

In its most basic use-case, `read_csv` expects that the data file:

- has column names (or *headers*),
- uses a comma (`,`) to separate the columns, and
- does not have row names.

Below you'll see the code used to load the data into R using the `read_csv` function. But there is one extra step we need to do first. Since `read_csv` is not included in the base installation of R,
to be able to use it we have to load it from somewhere else: a collection of useful functions known as a *package*. The `read_csv` function in particular
is in the `tidyverse` package (more on this later), which we load using the `library` function.

Next, we call the `read_csv` function and pass it a single argument: the name of the file, `"can_lang.csv"`. We have to put quotes around filenames and other letters and words that we
use in our code to distinguish it from the special words that make up R programming language.  This is the only argument we need to provide for this file, because our file satifies everthing else
the `read_csv` function expects in the default use-case (which we just discussed). Later in the course, we'll learn more about how to deal with more complicated files where the default arguments are not
appropriate. For example, files that use spaces or tabs to separate the columns, or with no column names.

```{r load_state_property_data, warning=FALSE, message=FALSE}
library(tidyverse)
read_csv("data/can_lang.csv")
```
Above you can also see something neat that Jupyter does to help us understand our code: it colours text depending on its meaning in R. For example,
you'll note that functions get bold green text, while letters and words surrounded by quotations like filenames get blue text.

> **In case you want to know more (optional):**
> We use the `read_csv` function from the `tidyverse` instead of the base R function `read.csv` because it's faster and it creates a nicer variant of the base R data frame called a *tibble*.
> This has several benefits that we'll discuss in further detail later in the course.

## Assigning value to a data frame

When we loaded the language data collected in the 2016 Canadian census in R above using `read_csv`, we did not give this data frame a name, so it was
just printed to the screen and we cannot do anything else with it. That isn't very useful; what we would like to do is give a name to the data frame that `read_csv` outputs
so that we can use it later for analysis and visualization.

To assign name to something in R, there are two possible ways---using either the assignment symbol (`<-`) or the equals symbol (`=`). From a style perspective,
the assignment symbol is preferred and is what we will use in this course. When we name something in R using the assignment symbol, `<-`, we do not need to surround
it with quotes like the filename. This is because we are formally telling R about this word and giving it a value. Only characters and words that act as values need
to be surrounded by quotes.

Let's now use the assignment symbol to give the name `can_lang` to the language data collected in the 2016 Canadian census data frame that we get from `read_csv`.

```{r load_data_with_name, message=FALSE}
can_lang <- read_csv("data/can_lang.csv")
```

Wait a minute! Nothing happened this time! Or at least it looks like that. But actually, something did happen: the data was read in and now has the name `can_lang` associated with it.
And we can use that name to access the data frame and do things with it. First we will type the name of the data frame to print it to the screen.


```{r print}
can_lang
```

## Creating subsets of data frames with `select` & `filter`

Now, we are going to learn how to obtain subsets of data from a data frame in R using two other `tidyverse` functions: `select` and `filter`.
The `select` function allows you to create a subset of the columns of a data frame, while the `filter` function allows you to obtain a subset of the rows with specific values.

Before we start using `select` and `filter`, let's take a look at the language data collected in the 2016 Canadian census again to familiarize ourselves with it.
We will do this by printing the data we loaded earlier in the chapter to the screen.

```{r print_data_again}
can_lang
```


In this data frame there are 214 rows rows (corresponding to the 214 languages recorded on the 2016 Canadian census)
and 6 columns:

1. `category`: Higher level language category (describing whether the language is an Official Canadian language, an Aboriginal language, or a Non-Official and Non-Aboriginal language).
2. `language`: Language queried about on the Canadian Census.
3. `mother_tongue`: Total count of Canadians from the Census who reported the language as their mother tongue. Mother tongue is generally defined as the language someone was exposed to since birth.
4. `most_at_home`: Total count of Canadians from the Census who reported the language as spoken most often at home.
5. `most_at_work`: Total count of Canadians from the Census who reported the language as used most often at work for the population.
6. `lang_known`: Total count of Canadians from the Census who reported knowledge of language for the population in private households.

Now let's use `select` to extract the language column from this data frame. To do this, we need to provide the `select` function with two arguments. The first argument is the
name of the data frame object, which in this example is `can_lang`. The second argument is the column name that we want to select, here `language`. After passing these two arguments,
the  `select` function returns a single column (the `language` column that we asked for) as a data frame.


```{r}
language_column <- select(can_lang, language)
language_column
```


### Using `select` to extract multiple columns

We can also use `select` to obtain a subset of the data frame with multiple columns. Again, the first argument is the name of the data frame.
Then we list all the columns we want as arguments separated by commas. Here we create a subset of three columns: language, mother tongue, and language spoken most often at home.

```{r}
three_columns <- select(can_lang, language, mother_tongue, most_at_home)
three_columns
```


### Using `select` to extract a range of columns

We can also use `select` to obtain a subset of the data frame constructed from a range of columns. To do this we use the colon (`:`) operator to denote the range.
For example, to get all the columns in the data frame from `language` to `most_at_home` we pass `language:most_at_home` as the second argument to the `select` function.

```{r}
column_range <- select(can_lang, language:most_at_home)
column_range
```

### Using `filter` to extract a single row

We can use the `filter` function to obtain the subset of rows with desired values from a data frame. Again, our first argument is the name of the data frame object, `can_lang`.
The second argument is a logical statement to use when filtering the rows. Here, for example, we'll say that we are interested in rows where the language is Mandarin. To make
this comparison, we use the *equivalency operator* `==` to compare the values of the `language` column with the value `"Mandarin"`. Similar to when we loaded the data file and put quotes around the filename,
here we need to put quotes around `"Mandarin"` to tell R that this is a character value and not one of the special words that make up R programming language, nor one of the names
we have given to data frames in the code we have already written.

With these arguments, `filter` returns a data frame that has all the columns of the input data frame but only the rows we asked for in our logical filter statement.

```{r}
mandarin <- filter(can_lang, language == "Mandarin")
mandarin
```

### Using `filter` to extract rows with values above a threshold


If we are interested in finding information about the languages who have a higher number of people who primarily speak it at home compared to Mandarin---which is reported to have 462890 people speaking it as the primary language they speak in their home---then we can create a filter
to obtain rows where the value of `most_at_home` is greater than 462890.
In this case, we see that `filter` returns a data frame with 2 rows; this indicates that there are two languages that are spoken more often at
home compared to Mandarin.

```{r}
spoke_often_at_home <- filter(can_lang, most_at_home > 462890)
spoke_often_at_home
```


## Exploring data with visualizations

Creating effective data visualizations is an essential piece to any data analysis. For the remainder of Chapter 1, we will learn how to use
functions from the `tidyverse` to make visualizations that let us explore relationships in data. In particular, we'll develop a visualization
of the language data collected in the 2016 Canadian census we've been working with that will help us understand two potential relationships in the data:
first, the relationship between the number of people who speak a language as their mother tongue and the number of people who speak that language as their primary spoken language at home, and second, whether there is a pattern in the strength of this relationship in the higher level language categories (Official languages, Aboriginal languages, or non-official and non-Aboriginal languages). This is an example of an exploratory data analysis
question: we are looking for relationships and patterns within the data set we have, but are not trying to generalize what we find beyond this data set.

### Using `ggplot` to create a scatter plot

Taking another look at our data set below, we can immediately see that the three columns (or variables) we are interested in visualizing---mother tongue, language spoken most at home, and higher level language category---are all in separate columns. In addition, there is a single row (or observation) for each language.
The data are therefore in what we call a *tidy data* format.
Tidy data is particularly important concept and will be a major focus in the remainder of this course: many of the functions from `tidyverse` require tidy data,
including the `ggplot` function that we will use shortly for our visualization. We will formally introduce this concept in chapter 3.

```{r}
can_lang
```

We will begin with a scatter plot of the `mother_tongue` and `most_at_home` columns from our data frame.
To create a scatter plot of these two variables using the `ggplot` function, we do the following:

1. call the `ggplot` function
2. provide the name of the data frame as the first argument
3. call the aesthetic function, `aes`, to specify which column will correspond to the x-axis and which will correspond to the y-axis
4. add a `+` symbol at the end of the `ggplot` call to add a layer to the plot
5. call the `geom_point` function to tell R that we want to represent the data points as dots/points to create a scatter plot.

```{r mother-tongue-vs-most-at-home, fig.width=5.75, fig.height=5, warning=FALSE, fig.cap = "Scatter plot of number of Canadians reporting a language as their mother tongue vs the primary language at home"}
ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) +
  geom_point()
```

> **In case you have used R before and are curious:**
> There are a small number of situations in which you can have a single R expression span multiple lines.
> Here, the `+` symbol at the end of the first line tells R that the expression isn't done yet and to
> continue reading on the next line. While not strictly necessary, this sort of pattern will appear a
> lot when using `ggplot` as it keeps things more readable.


### Formatting ggplot objects

It is motivating and exciting that we have already been able to visualize our
data to help answer our question, but we are not done yet! There is more we can
(and should) do to improve the interpretability of the data visualization that
we created. For example, by default, R uses the column names as the axis labels,
however, usually these column names do not have enough information about
the variable in the column. We really should replace this default with a more
informative label. For the example above, the column name `mother_tongue` is
used as the label for the y-axis, but most people will not know what that is.
And even if they did, they will not know how we are measuring mother tongue, nor
which group of people the measurements were taken. An axis label that
read "Mother tongue (number of Canadian residents)" would be much more
informative.

Adding additional layers to our visualizations that we create in `ggplot` is one
common and easy way to improve and refine our data visualizations. New layers
are added to `ggplot` objects using the `+` symbol. For example, we can use the
`xlab` and `ylab` functions to add layers where we specify meaningful and
informative labels for the x and y axes. Again, since we are specifying words
(e.g. `"Mother tongue (number of Canadian residents)"`) as arguments to `xlab`
and `ylab`, we surround them with double quotes. There are many more layers we
can add to format the plot further, and we will explore these in later chapters.

```{r mother-tongue-vs-most-at-home-labs, fig.width=5.75, fig.height=5, warning=FALSE, fig.cap = "Scatter plot of number of Canadians reporting a language as their mother tongue vs the primary language at home with x and y labels"}
ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) +
  geom_point() +
  xlab("Language spoken most at home (number of Canadian residents)") +
  ylab("Mother tongue (number of Canadian residents)")
```

Most of the data points from the 214 observations in this data set are bunched up in the lower left-handside of this visualization. This is because many many more people in Canada speak the two official languages (English and French). Thus to answer our question, we will need to adjust the scale of the x and y axes so that they are on a log scale. We can again add additional layers to the plot object using the `+` symbol to do this:

```{r mother-tongue-vs-most-at-home-scale, fig.width=5.75, fig.height=5, warning=FALSE, fig.cap = "Scatter plot of number of Canadians reporting a language as their mother tongue vs the primary language at home with log adjusted x and y axes"}
ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) +
  geom_point() +
  xlab("Language spoken most at home (number of Canadian residents)") +
  ylab("Mother tongue (number of Canadian residents)") +
  scale_x_log10(labels = scales::comma) +
  scale_y_log10(labels = scales::comma)
```

From this visualization we see that for the 214 languages in this data set, as the number of people who have a language as their mother tongue increases, so does the number of people who speak that language at home. When we see
two variables do this, we call this a *positive relationship*. Because the points are fairly close together, we can say that the relationship is strong. Because drawing a straight line through these
points would fit the pattern we observe quite well, we say that it's linear.

Learning how to describe data visualizations is a very useful skill. We will provide descriptions for you in this course (as we did above) until we get to Chapter 4,
which focuses on data visualization. Then, we will explicitly teach you how to do this yourself, and how to not over-state or over-interpret the results
from a visualization.

### Changing the units

```{r changing_the_units, include = FALSE}
english_mother_tongue <- can_lang %>%
  filter(language == "English") %>%
  pull(mother_tongue)
census_popn <- 35151728
```

What does it mean that `r format(english_mother_tongue, scientific = FALSE, big.mark = ",")`
people reported that their mother tongue was English in the 2016 Canadian
census? To really understand this number, we need context. In particular, how
many people were in Canada when this data was collected? From the 2016 Canadian
census profile, we can see that the population was reported to be
`r format(census_popn, scientific = FALSE, big.mark = ",")` people. The count of
the number of people who report that English is their mother tongue is much more
meaningful when we report it in this context. We can even go a step further and
transform this count to a relative frequency, or proportion, so that we can
represent this as a single meaningful number in our data visualizations. We can
do this by dividing the number of people reporting a given language as their
mother tongue by the number of people who live in Canada. For example, the
proportion of people who reported that their mother tongue was English in the
2016 Canadian census was
`r format(round(english_mother_tongue/census_popn, 2), scientific = FALSE, big.mark = ",")`.

We can use the `mutate` function in R to do this for all of the languages in the 2016 Canadian census data set. `mutate` is useful for creating new columns in a data frame, as well as transforming existing columns. It's general syntax is: `mutate(dataframe, column_to_create/transform = value/how_to_transform)`. Below we use mutate to calculate the proportion of people reporting a given language as their mother tongue for all the languages in the `can_lang` data set:

```{r}
can_lang <- mutate(can_lang, mother_tongue = mother_tongue / 35151728)
can_lang
```

Let's also do this for the counts of the number of people who report that they speak a given language most often at home:

```{r}
can_lang <- mutate(can_lang, most_at_home = most_at_home / 35151728)
can_lang
```

Finally, let's visualize the data now that we have represented it as proportions (and change our axis labels to reflect this change in units!):

```{r mother-tongue-vs-most-at-home-scale-props, fig.width=5.75, fig.height=5, warning=FALSE, fig.cap = "Scatter plot of proportion of Canadians reporting a language as their mother tongue vs the primary language at home"}
ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) +
  geom_point() +
  xlab("Language spoken most at home (proportion of Canadian residents)") +
  ylab("Mother tongue (proportion of Canadian residents)") +
  scale_x_log10(labels = scales::comma) +
  scale_y_log10(labels = scales::comma)
```

From the visualization above, we can now clearly see that not just a lot, but that the majority of Canadians reported English as their mother tongue and as the language they speak most often at home. Changing the units to include this context increases our understanding and allows us to interpret the numbers in our data set better.

### Coloring points by group
Now we'll move onto the second part of our exploratory data analysis question: when considering the relationship between the number of people who have a language as their mother tongue and the number of people who speak that language at home, is there a pattern in the strength of this relationship in the higher-level language categories (Official languages, Aboriginal languages, or non-official and non-Aboriginal languages)? One common way to explore this is to colour the data points on the
scatter plot we have already created by group/category. For example,
given that we have the higher level language category for each language recorded in the 2016 Canadian census, we can colour the points in our previous
scatter plot to represent each language's higher-level language category.

To do this, we modify our scatter plot code above. Specifically, we will add an argument to the `aes` function, specifying that the points should be coloured by the `category` column:

```{r scatter-colour-by-category, fig.width=7.75, fig.height=4, warning=FALSE,  fig.cap = "Scatter plot of proportion of Canadians reporting a language as their mother tongue vs the primary language at home coloured by language category"}
ggplot(can_lang, aes(x = most_at_home, y = mother_tongue, color = category)) +
  geom_point() +
  xlab("Language spoken most at home (proportion of Canadian residents)") +
  ylab("Mother tongue (proportion of Canadian residents)") +
  scale_x_log10(labels = scales::comma) +
  scale_y_log10(labels = scales::comma)
```

What do we see when considering the second part of our exploratory question? Do we see a difference in the pattern of the relationship between the number of people who speak a language as their mother tongue and the number of people who speak a language as their primary spoken language at home between higher-level language categories? Probably not!

For each higher-level language category there appears to be a positive relationship between the number of people who speak a language as their mother tongue and the number of people who speak a language as their primary spoken language at home. This relationship looks similar, regardless of the category.

Does this mean that this relationship is positive for all languages in the world? Can we use this data visualization on its own to predict how many people have a given language as their mother tongue if we know how many people speak it as their primary language at home?

The answer to both these questions is "no." However, with this exploratory data analysis, we can create new hypotheses, ideas,
and questions (like the ones at the beginning of this paragraph). Answering those questions would likely involve gathering additional data and doing more complex analyses, which we will
see more of later in this course.

### Putting it all together

Below, we put everything from this chapter together in one code chunk. We have
added a few more layers to make the data visualization even more effective.
Specifically we used have improved the visualizations accessibility by choosing
colours that are easier to distinguish and also mapped category to
shape, we handled the problem of overlapping data points by making them slightly
transparent, and we changed the background from grey to white to improve the
contrast. This demonstrates the power of R: in relatively few lines of code, we
are able to create an entire data science workflow with a highly effective
data visualization.

```{r nachos-to-cheesecake, fig.width=7.75, fig.height=4, warning=FALSE, message=FALSE, fig.cap = "Putting it all together: Scatter plot of proportion of Canadians reporting a language as their mother tongue vs the primary language at home coloured by language category"}
library(tidyverse)

can_lang <- read_csv("data/can_lang.csv")

can_lang <- mutate(can_lang, mother_tongue = mother_tongue / 35151728)
can_lang <- mutate(can_lang, most_at_home = most_at_home / 35151728)

ggplot(can_lang, aes(
  x = most_at_home,
  y = mother_tongue,
  colour = category,
  shape = category
)) + # map categories to different shapes
  geom_point(alpha = 0.6) + # set the transparency of the points
  scale_color_manual(values = c("deepskyblue2", "firebrick1", "black")) + # choose point colours
  xlab("Language spoken most at home (proportion of Canadian residents)") +
  ylab("Mother tongue (proportion of Canadian residents)") +
  scale_x_log10(labels = scales::comma) +
  scale_y_log10(labels = scales::comma) +
  theme_bw() # use a theme to have a white background
```