# Word Embeddings {#embeddings}
```{r, include=FALSE}
hook_output = knit_hooks$get('output')
knit_hooks$set(output = function(x, options) {
# this hook is used only when the linewidth option is not NULL
if (!is.null(n <- options$linewidth)) {
x = knitr:::split_lines(x)
# any lines wider than n should be wrapped
x = strwrap(x, width = n)
x = paste(x, collapse = '\n')
}
hook_output(x, options)
})
```
> You shall know a word by the company it keeps.
> `r tufte::quote_footer('--- [John Rupert Firth](https://en.wikiquote.org/wiki/John_Rupert_Firth)')`
So far in our discussion of natural language features, we have discussed \index{preprocessing}preprocessing steps such as tokenization, removing stop words, and stemming in detail. We implement these types of preprocessing steps to be able to represent our text data in some data structure that is a good fit for modeling.
## Motivating embeddings for sparse, high-dimensional data {#motivatingsparse}
What kind of data structure might work well for typical text data? Perhaps, if we wanted to analyze or build a model for consumer complaints to the [United States Consumer Financial Protection Bureau (CFPB)](https://www.consumerfinance.gov/data-research/consumer-complaints/), described in Section \@ref(cfpb-complaints), we would start with straightforward word counts. Let's create a sparse matrix\index{matrix!sparse}, where the matrix elements are the counts of words in each document.
```{r complaints, R.options=list(quanteda_print_dfm_max_ndoc = 0, quanteda_print_dfm_max_nfeat = 0), linewidth = 60}
library(tidyverse)
library(tidytext)
library(SnowballC)
complaints <- read_csv("data/complaints.csv.gz")
complaints %>%
  unnest_tokens(word, consumer_complaint_narrative) %>%
  anti_join(get_stopwords(), by = "word") %>%
  mutate(stem = wordStem(word)) %>%
  count(complaint_id, stem) %>%
  cast_dfm(complaint_id, stem, n)
```
\index{matrix!sparse}
```{block, type = "rmdwarning"}
A _sparse matrix_ is a matrix where most of the elements are zero. When working with text data, we say our data is "sparse" because most documents do not contain most words, resulting in a representation of mostly zeroes. There are special data structures and algorithms for dealing with sparse data that can take advantage of their structure. For example, an array can more efficiently store the locations and values of only the non-zero elements instead of all elements.
```
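To make the idea of storing only the non-zero elements concrete, here is a minimal base R sketch. The counts are made up for illustration, and the `triplets` data frame is a hypothetical stand-in for what a real sparse matrix class does internally.

```r
# Toy document-term counts: 3 documents, 6 terms (made-up values)
dense <- matrix(
  c(2, 0, 0, 1, 0, 0,
    0, 0, 3, 0, 0, 0,
    0, 1, 0, 0, 0, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), paste0("term", 1:6))
)

# Sparsity: the share of entries that are zero
sparsity <- mean(dense == 0)
sparsity

# Triplet form: keep only the row, column, and value of non-zero entries
nonzero <- which(dense != 0, arr.ind = TRUE)
triplets <- data.frame(
  row = unname(nonzero[, "row"]),
  col = unname(nonzero[, "col"]),
  value = dense[nonzero]
)
triplets
```

Only five of the eighteen entries need to be stored here; the savings grow as the share of zeroes grows.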
The data set of consumer complaints used in this book has been filtered to those submitted to the CFPB since January 1, 2019 that include a consumer complaint narrative (i.e., some submitted text).
Another way to represent our text data is to create a sparse matrix where the elements are weighted, rather than straightforward counts only. The *term frequency*\index{term frequency} of a word is how frequently a word occurs in a document, and the *inverse document frequency*\index{inverse document frequency} of a word decreases the weight for commonly-used words and increases the weight for words that are not used often in a collection of documents. It is typically defined as:
$$idf(\text{term}) = \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)}$$
\index{term frequency-inverse document frequency|see {tf-idf}}
These two quantities can be combined to calculate a term's \index{tf-idf}tf-idf (the two quantities multiplied together). This statistic measures the frequency of a term adjusted for how rarely it is used, and it is an example of a weighting scheme that can often work better than counts for predictive modeling with text features.
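To see the arithmetic at work, here is a small base R sketch on three made-up, already tokenized documents. It follows the definition above, with term frequency taken as the share of a document's tokens; the object names are only for illustration.

```r
# Three tiny "documents", already tokenized (made-up text)
docs <- list(
  d1 = c("late", "fee", "charged", "fee"),
  d2 = c("late", "payment", "reported"),
  d3 = c("account", "closed", "late")
)

vocab <- sort(unique(unlist(docs)))

# Term frequency: share of a document's tokens that are a given term
tf <- t(sapply(docs, function(tokens) {
  table(factor(tokens, levels = vocab)) / length(tokens)
}))

# Inverse document frequency: ln(n documents / n documents containing term)
n_containing <- colSums(tf > 0)
idf <- log(length(docs) / n_containing)

# tf-idf: multiply each column of tf by that term's idf
tf_idf <- sweep(tf, 2, idf, `*`)
round(tf_idf, 3)
```

The word "late" appears in every document, so its idf (and therefore its tf-idf) is zero, while "fee" is weighted up because it appears in only one document.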
```{r dependson="complaints", R.options=list(quanteda_print_dfm_max_ndoc = 0, quanteda_print_dfm_max_nfeat = 0), linewidth = 60}
complaints %>%
  unnest_tokens(word, consumer_complaint_narrative) %>%
  anti_join(get_stopwords(), by = "word") %>%
  mutate(stem = wordStem(word)) %>%
  count(complaint_id, stem) %>%
  bind_tf_idf(stem, complaint_id, n) %>%
  cast_dfm(complaint_id, stem, tf_idf)
```
Notice that, in either case, our final data structure is incredibly sparse and of high dimensionality with a huge number of features. Some modeling algorithms and the libraries that implement them can take advantage of the memory characteristics of sparse matrices for better performance; an example of this is regularized regression implemented in **glmnet** [@Friedman2010]. Other modeling algorithms, including tree-based algorithms, do not perform better with sparse input, and some libraries are not built to take advantage of sparse data structures even when doing so would improve performance for the algorithms they implement. We have some computational tools to take advantage of sparsity, but they don't always solve all the problems that come along with big text data sets.
\index{matrix!sparse}
\index{corpus}
As the size of a corpus increases in terms of words or other tokens, both the sparsity and RAM required to hold the corpus in memory increase. Figure \@ref(fig:sparsityram) shows how this works out; as the corpus grows, there are more words used just a few times included in the corpus. The sparsity increases and approaches 100%, but even more notably, the memory required to store the corpus increases with the square of the number of tokens.
```{r sparsityram, dependson="complaints", fig.cap="Memory requirements and sparsity increase with corpus size"}
get_dfm <- function(frac) {
  complaints %>%
    sample_frac(frac) %>%
    unnest_tokens(word, consumer_complaint_narrative) %>%
    anti_join(get_stopwords(), by = "word") %>%
    mutate(stem = wordStem(word)) %>%
    count(complaint_id, stem) %>%
    cast_dfm(complaint_id, stem, n)
}

set.seed(123)
tibble(frac = 2 ^ seq(-16, -6, 2)) %>%
  mutate(dfm = map(frac, get_dfm),
         words = map_dbl(dfm, quanteda::nfeat),
         sparsity = map_dbl(dfm, quanteda::sparsity),
         `RAM (in bytes)` = map_dbl(dfm, lobstr::obj_size)) %>%
  pivot_longer(sparsity:`RAM (in bytes)`, names_to = "measure") %>%
  ggplot(aes(words, value, color = measure)) +
  geom_line(size = 1.5, alpha = 0.5) +
  geom_point(size = 2) +
  facet_wrap(~measure, scales = "free_y") +
  scale_x_log10(labels = scales::label_comma()) +
  scale_y_continuous(labels = scales::label_comma()) +
  theme(legend.position = "none") +
  labs(x = "Number of unique words in corpus (log scale)",
       y = NULL)
```
Linguists have long worked on vector models for language that can reduce the number of dimensions representing text data based on how people use language; the quote that opened this chapter dates to 1957. These kinds of dense word vectors are often called _word embeddings_.
## Understand word embeddings by finding them yourself
Word embeddings are a way to represent text data as \index{vector}vectors of numbers based on a huge \index{corpus}corpus of text, capturing semantic meaning from words' context.
```{block, type = "rmdnote"}
Modern word embeddings are based on a statistical approach to modeling language, rather than a linguistics or rules-based approach.
```
We can determine these vectors for a corpus of text using word counts and matrix factorization, as outlined by @Moody2017. This approach is valuable because it allows practitioners to find word vectors for their own collections of text (with no need to rely on pre-trained vectors) using familiar techniques that are not difficult to understand. Let's walk through how to do this using tidy data principles and sparse matrices\index{matrix!sparse}, on the data set of CFPB complaints. First, let's filter out words that are used only rarely in this data set and create a nested dataframe, with one row per complaint.
```{r nestedwords, dependson="complaints"}
tidy_complaints <- complaints %>%
  select(complaint_id, consumer_complaint_narrative) %>%
  unnest_tokens(word, consumer_complaint_narrative) %>%
  add_count(word) %>%
  filter(n >= 50) %>%
  select(-n)

nested_words <- tidy_complaints %>%
  nest(words = c(word))
nested_words
```
Next, let's create a `slide_windows()` function, using the `slide()` function from the **slider** package [@Vaughan2020], which implements fast sliding window computations written in C. Our new function identifies skipgram windows\index{skipgram windows} in order to calculate the skipgram probabilities, that is, how often we find each word near each other word. We do this by defining a fixed-size moving window that starts at each word in turn and covers that word plus the words that follow it. Do we see `word1` and `word2` together within this window? We can calculate probabilities based on when we do or do not.
One of the arguments to this function is the `window_size`, which determines the size of the sliding window that moves through the text, counting up words that we find within the window. The best choice for this window size depends on your analytical question because it determines what kind of semantic meaning the embeddings capture. A smaller window size, like three or four, focuses on how the word is used and learns what other words are functionally similar. A larger window size, like 10, captures more information about the domain or topic of each word, not constrained by how functionally similar the words are [@Levy2014]. A smaller window size is also faster to compute.
```{r slidewindows}
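slide_windows <- function(tbl, window_size) {
  # each window holds a word plus the (window_size - 1) words after it;
  # incomplete windows at the end of the text come back as NULL
  skipgrams <- slider::slide(
    tbl,
    ~.x,
    .after = window_size - 1,
    .step = 1,
    .complete = TRUE
  )

  # mutate() fails on the NULL windows; safely() captures those errors
  safe_mutate <- safely(mutate)

  out <- map2(skipgrams,
              seq_along(skipgrams),
              ~ safe_mutate(.x, window_id = .y))

  # keep only the successful results and stack them into one data frame
  out %>%
    transpose() %>%
    pluck("result") %>%
    compact() %>%
    bind_rows()
}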
```
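To see what these windows look like, here is a minimal base R sketch that mimics the same windowing logic (`.after = window_size - 1`, complete windows only) on a made-up token sequence. It does not use **slider**; the `tokens` vector and the `windows` data frame are assumptions for illustration.

```r
# A toy token sequence standing in for one complaint (made-up text)
tokens <- c("my", "account", "was", "charged", "a", "late", "fee")
window_size <- 4L

# Start positions of every complete window of `window_size` tokens
starts <- seq_len(max(0L, length(tokens) - window_size + 1L))

# One row per (window, word) pair
windows <- do.call(rbind, lapply(starts, function(i) {
  data.frame(
    word = tokens[i:(i + window_size - 1L)],
    window_id = i,
    stringsAsFactors = FALSE
  )
}))

# Each complete window holds exactly window_size tokens
table(windows$window_id)
```

Seven tokens and a window size of four give four complete windows, so most words end up in several overlapping windows.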
Now that we can find all the skipgram windows\index{skipgram windows}, we can calculate how often words occur on their own, and how often words occur together with other words. We do this using the point-wise mutual information (PMI)\index{PMI}\index{point-wise mutual information|see {PMI}}, a measure of association that measures exactly what we described in the previous sentence; it's the logarithm of the probability of finding two words together, normalized for the probability of finding each of the words alone. We use PMI to measure which words occur together more often than expected based on how often they occurred on their own.
For this example, let's use a window size of *four*.
```{block2, type = "rmdpackage"}
This next step is the computationally expensive part of finding word embeddings with this method, and can take a while to run. Fortunately, we can use the **furrr** package [@Vaughan2018] to take advantage of parallel processing because identifying skipgram windows in one document is independent from all the other documents.
```
```{r eval=FALSE}
library(widyr)
library(furrr)
plan(multisession) ## for parallel processing
tidy_pmi <- nested_words %>%
  mutate(words = future_map(words, slide_windows, 4L)) %>%
  unnest(words) %>%
  unite(window_id, complaint_id, window_id) %>%
  pairwise_pmi(word, window_id)
tidy_pmi
```
```{r tidypmi, echo=FALSE}
library(widyr)
load("data/tidy_pmi.rda")
tidy_pmi
```
When PMI is high, the two words are associated with each other, i.e., likely to occur together. When PMI is low, the two words are not associated with each other, i.e., unlikely to occur together.
```{block, type = "rmdwarning"}
The step above used `unite()`, a function from **tidyr** that pastes multiple columns into one, to make a new column for `window_id` from the old `window_id` plus the `complaint_id`. This new column tells us which combination of window and complaint each word belongs to.
```
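To make the PMI calculation concrete, here is a minimal base R sketch on made-up windows. It assumes probabilities are estimated as the share of windows that contain a word (or both words); `pairwise_pmi()` may differ in its details, so treat this as an illustration of the definition rather than a reimplementation.

```r
# Toy window memberships: which words fall in which window (made-up data)
windows <- data.frame(
  window_id = c(1, 1, 1, 2, 2, 3, 3, 4, 4),
  word = c("late", "fee", "charged",
           "late", "fee",
           "credit", "report",
           "credit", "late"),
  stringsAsFactors = FALSE
)

# Binary window-by-word incidence matrix (1 if the word is in the window)
incidence <- (unclass(table(windows$window_id, windows$word)) > 0) * 1
n_windows <- nrow(incidence)

# p(word): share of windows containing the word
p_word <- colMeans(incidence)

# p(word1, word2): share of windows containing both words
p_pair <- crossprod(incidence) / n_windows

# PMI = log(p(x, y) / (p(x) * p(y))); pairs never seen together give -Inf
pmi <- log(p_pair / outer(p_word, p_word))
round(pmi["late", c("fee", "credit")], 3)
```

In this toy data, "late" and "fee" share a window more often than their individual frequencies would suggest (positive PMI), while "late" and "credit" share one less often (negative PMI).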
We can next determine the word vectors from the PMI values using singular value decomposition (SVD).
SVD\index{SVD}\index{singular value decomposition|see {SVD}} is a method for dimensionality reduction via \index{matrix factorization}matrix factorization [@Golub1970] that works by taking our data and decomposing it onto special orthogonal axes. The first axis is chosen to capture as much of the variance as possible. Keeping that first axis fixed, the remaining orthogonal axes are rotated to maximize the variance in the second. This is repeated for all the remaining axes.
In our application, we will use SVD to factor the PMI matrix into a set of smaller matrices containing the word embeddings with a size we get to choose. The embedding size is typically chosen to be in the low hundreds. Thus we get a matrix of dimension (`n_vocabulary * n_dim`) instead of dimension (`n_vocabulary * n_vocabulary`), which can be a vast reduction in size for large vocabularies.
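The reduction in size is easy to see on a small scale. The following sketch uses base R's `svd()` on a made-up symmetric matrix standing in for a PMI matrix. Scaling the left singular vectors by the singular values is one common way to form embeddings and is an assumption here, not necessarily what any particular package does.

```r
# A small symmetric matrix standing in for a word-by-word PMI matrix
# (values are made up for illustration)
words <- c("late", "fee", "charge", "credit", "report")
pmi_mat <- matrix(
  c(0.0, 1.2, 1.0, 0.1, 0.0,
    1.2, 0.0, 1.5, 0.0, 0.1,
    1.0, 1.5, 0.0, 0.2, 0.0,
    0.1, 0.0, 0.2, 0.0, 1.8,
    0.0, 0.1, 0.0, 1.8, 0.0),
  nrow = 5, byrow = TRUE, dimnames = list(words, words)
)

n_dim <- 2  # embedding size we choose

# Truncated SVD: keep only the first n_dim singular vectors
s <- svd(pmi_mat, nu = n_dim, nv = n_dim)

# One n_dim-length vector per word: n_vocabulary x n_dim instead of
# n_vocabulary x n_vocabulary
embeddings <- s$u %*% diag(s$d[1:n_dim])
rownames(embeddings) <- words
dim(embeddings)

# Singular values come back in decreasing order
s$d
```

With a real vocabulary of tens of thousands of words and an embedding size in the low hundreds, the same truncation shrinks the matrix by orders of magnitude.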
Let's use the `widely_svd()` function in **widyr** [@R-widyr], creating 100-dimensional word embeddings. This matrix factorization is much faster than the previous step of identifying the skipgram windows\index{skipgram windows} and calculating PMI.
```{r tidywordvectors, dependson="tidypmi"}
tidy_word_vectors <- tidy_pmi %>%
  widely_svd(
    item1, item2, pmi,
    nv = 100, maxit = 1000
  )
tidy_word_vectors
```
```{block, type = "rmdnote"}
`tidy_word_vectors` is not drastically smaller than `tidy_pmi` since the vocabulary is not enormous and `tidy_pmi` is represented in a sparse format.
```
We have now successfully found word embeddings, with clear and understandable code. This is a real benefit of this approach: it is based on counting, dividing, and matrix decomposition and is thus easier to understand and implement than options based on \index{deep learning}deep learning. Training word vectors or embeddings, even with this straightforward method, still requires a large data set (ideally, hundreds of thousands of documents or more) and a not insignificant investment of time and computational power.
## Exploring CFPB word embeddings
Now that we have determined word embeddings for the data set of CFPB complaints, let's explore them and talk about how they are used in modeling. We have projected the sparse, high-dimensional set of word features into a more dense, 100-dimensional set of features.
```{block, type = "rmdwarning"}
Each word can be represented as a numeric vector in this new feature space. A single word is mapped to only one vector, so be aware that all senses of a word are conflated in word embeddings. Because of this, word embeddings are limited for understanding lexical semantics.
```
Which words are close to each other in this new feature space of word embeddings? Let's create a simple function that will find the nearest words to any given example using our newly created word embeddings.
```{r nearestsynonyms}
nearest_neighbors <- function(df, token) {
  df %>%
    widely(
      ~ {
        y <- .[rep(token, nrow(.)), ]
        res <- rowSums(. * y) /
          (sqrt(rowSums(. ^ 2)) * sqrt(sum(.[token, ] ^ 2)))
        matrix(res, ncol = 1, dimnames = list(x = names(res)))
      },
      sort = TRUE
    )(item1, dimension, value) %>%
    select(-item2)
}
```
This function takes the tidy word embeddings as input, along with a word (or token, more strictly) as a string. It uses matrix multiplication and sums to calculate the cosine similarity between the word and all the words in the embedding to find which words are closer to or farther from the input word, and returns a dataframe sorted by similarity.
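The cosine similarity at the heart of this function can be checked by hand on a tiny example. The sketch below uses made-up three-dimensional vectors and a hypothetical helper, `cosine_to()`, written in base R; it is an illustration of the calculation, not a replacement for `nearest_neighbors()`.

```r
# Toy 3-dimensional "embeddings" (made-up values)
emb <- rbind(
  error   = c(0.9, 0.1, 0.0),
  mistake = c(0.8, 0.2, 0.1),
  month   = c(0.0, 0.9, 0.4),
  year    = c(0.1, 0.8, 0.5)
)

cosine_to <- function(mat, token) {
  target <- mat[token, ]
  # dot product of every row with the target vector ...
  dots <- as.vector(mat %*% target)
  # ... divided by the product of the two vector lengths
  sims <- dots / (sqrt(rowSums(mat^2)) * sqrt(sum(target^2)))
  sort(sims, decreasing = TRUE)
}

cosine_to(emb, "error")
```

A word always has a cosine similarity of 1 with itself, so the input word comes back at the top of the sorted result, followed by the vectors that point in the most similar direction.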
What words are closest to `"error"` in the data set of CFPB complaints, as determined by our word embeddings?
```{r dependson=c("tidywordvectors", "nearestsynonyms")}
tidy_word_vectors %>%
nearest_neighbors("error")
```
```{r echo=FALSE, dependson=c("tidywordvectors", "nearestsynonyms")}
test_words1 <-
tidy_word_vectors %>%
nearest_neighbors("error") %>%
slice_max(value, n = 10) %>%
pull(item1)
if (!all(c("mistake", "problem", "glitch") %in% test_words1)) {
rlang::abort('In Chapter 5 on word embeddings, words closest to "error" do not match previous results.')
}
```
Mistakes, problems, glitches -- sounds bad!
What is closest to the word `"month"`?
```{r dependson=c("tidywordvectors", "nearestsynonyms")}
tidy_word_vectors %>%
nearest_neighbors("month")
```
```{r echo=FALSE, dependson=c("tidywordvectors", "nearestsynonyms")}
test_words2 <-
tidy_word_vectors %>%
nearest_neighbors("month") %>%
slice_max(value, n = 10) %>%
pull(item1)
if (!all(c("installments", "payment", "year", "week", "monthly", "months") %in% test_words2)) {
rlang::abort('In Chapter 5 on word embeddings, words closest to "month" do not match previous results.')
}
```
We see words about installments and payments, along with other time periods such as years and weeks. Notice that we did not stem this text data (see Chapter \@ref(stemming)), but the word embeddings learned that "month", "months", and "monthly" belong together.
What words are closest in this embedding space to `"fee"`?
```{r dependson=c("tidywordvectors", "nearestsynonyms")}
tidy_word_vectors %>%
nearest_neighbors("fee")
```
We find a lot of dollar amounts, which makes sense. Let us filter out the numbers to see what non-dollar words are similar to "fee".
```{r dependson=c("tidywordvectors", "nearestsynonyms")}
tidy_word_vectors %>%
nearest_neighbors("fee") %>%
filter(str_detect(item1, "[0-9]*.[0-9]{2}", negate = TRUE))
```
```{r echo=FALSE, dependson=c("tidywordvectors", "nearestsynonyms")}
test_words3 <-
tidy_word_vectors %>%
nearest_neighbors("fee") %>%
filter(str_detect(item1, "[0-9]*.[0-9]{2}", negate = TRUE)) %>%
slice_max(value, n = 10) %>%
pull(item1)
if (!all(c("fees", "overdraft", "charge") %in% test_words3)) {
rlang::abort('In Chapter 5 on word embeddings, words closest to "fee" do not match previous results.')
}
```
We now find words about overdrafts and charges. The top two words are "fee" and "fees"; word embeddings can learn that singular and plural\index{singular versus plural} forms of words are related and belong together. In fact, word embeddings can accomplish many of the same goals of tasks like stemming (Chapter \@ref(stemming)) but more reliably and less arbitrarily.
Since we have found word embeddings via singular value decomposition, we can use these vectors to understand what principal components explain the most variation in the CFPB complaints. The orthogonal axes that SVD\index{SVD} used to represent our data were chosen so that the first axis accounts for the most variance, the second axis accounts for the next most variance, and so on. We can now explore which of the _original_ dimensions (tokens in this case) contributed to each of the resulting principal components produced using SVD, and by how much.
```{r echo=FALSE, dependson="tidywordvectors"}
test_words4 <-
tidy_word_vectors %>%
filter(dimension <= 24) %>%
group_by(dimension) %>%
top_n(12, abs(value)) %>%
ungroup()
dim12 <- test_words4 %>% filter(dimension == 12) %>% pull(item1)
dim09 <- test_words4 %>% filter(dimension == 9) %>% pull(item1)
if (!all(c("regarding", "contacted", "called") %in% dim12)) {
rlang::abort('In Chapter 5 on word embeddings, component 12 of word vectors has changed.')
}
if (!all(c("deceptive", "predatory", "unethical") %in% dim09)) {
rlang::abort('In Chapter 5 on word embeddings, component 9 of word vectors has changed.')
}
```
```{r embeddingpca, dependson="tidywordvectors", fig.width=8, fig.asp = 1.5, fig.cap="Word embeddings for Consumer Financial Protection Bureau complaints"}
tidy_word_vectors %>%
  filter(dimension <= 24) %>%
  group_by(dimension) %>%
  top_n(12, abs(value)) %>%
  ungroup() %>%
  mutate(item1 = reorder_within(item1, value, dimension)) %>%
  ggplot(aes(item1, value, fill = dimension)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~dimension, scales = "free_y", ncol = 4) +
  scale_x_reordered() +
  coord_flip() +
  labs(
    x = NULL,
    y = "Value",
    title = "First 24 principal components for text of CFPB complaints",
    subtitle = paste("Top words contributing to the components that explain",
                     "the most variation")
  )
```
It becomes very clear in Figure \@ref(fig:embeddingpca) that stop words have not been removed, but notice that we can learn meaningful relationships in how very common words are used. Component 12 shows us how common prepositions are often used with words like `"regarding"`, `"contacted"`, and `"called"`, while component 9 highlights the use of *different* common words when submitting a complaint about unethical, predatory, and/or deceptive practices. Stop words do carry information, and methods like determining word embeddings can make that information usable.
We created word embeddings\index{embeddings} and can explore them to understand our text data set, but how do we use this vector representation in modeling? The classic and simplest approach is to treat each document as a collection of words and summarize the word embeddings into *document embeddings*, either using a mean or sum. This approach loses information about word order but is straightforward to implement. Let's use `count()` to find the sum here in our example.
```{r docmatrix, dependson=c("nestedwords", "tidywordvectors")}
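word_matrix <- tidy_complaints %>%
  count(complaint_id, word) %>%
  cast_sparse(complaint_id, word, n)

embedding_matrix <- tidy_word_vectors %>%
  cast_sparse(item1, dimension, value)

# %*% matches columns to rows by position, not by name, so put the
# embedding rows in the same word order as the columns of word_matrix
doc_matrix <- word_matrix %*% embedding_matrix[colnames(word_matrix), ]

dim(doc_matrix)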
```
We have a new matrix here that we can use as the input for modeling. Notice that we still have over 100,000 documents (we did lose a few complaints, compared to our example sparse matrices\index{matrix!sparse} at the beginning of the chapter, when we filtered out rarely used words) but instead of tens of thousands of features, we have exactly 100 features.
```{block, type = "rmdnote"}
These hundred features are the word embeddings we learned from the text data itself.
```
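The matrix multiplication that turns word counts into document embeddings can be traced on a toy example. The counts and two-dimensional word vectors below are made up; the explicit alignment by name is a defensive step, since `%*%` pairs columns with rows by position only.

```r
# Toy word counts: 2 documents x 3 words (made-up values)
word_counts <- rbind(
  doc1 = c(fee = 2, late = 1, credit = 0),
  doc2 = c(fee = 0, late = 1, credit = 3)
)

# Toy 2-dimensional word embeddings, deliberately in a different row order
word_vectors <- rbind(
  credit = c(-1.0, 0.5),
  fee    = c( 1.0, 0.0),
  late   = c( 0.5, 0.5)
)

# Align the embedding rows to the columns of the count matrix
aligned <- word_vectors[colnames(word_counts), ]

# Each document embedding is the count-weighted sum of its word vectors
doc_embeddings <- word_counts %*% aligned
doc_embeddings
```

Dividing each row by `rowSums(word_counts)` would give the mean of the word vectors instead of the sum.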
If our word embeddings are of high quality, this translation of the high-dimensional space of words to the lower-dimensional space of the word embeddings allows our modeling based on such an input matrix to take advantage of the semantic meaning\index{semantics} captured in the embeddings.
This is a straightforward method for finding and using word embeddings, based on counting and linear algebra. It is valuable both for understanding what word embeddings are and how they work and for many real-world applications. This is not the method to reach for if you want to publish an academic NLP paper, but is excellent for many applied purposes. Other methods for determining word embeddings include GloVe [@Pennington2014], implemented in R in the **text2vec** package [@Selivanov2018], word2vec [@Mikolov2013], and FastText [@Bojanowski2016].
\index{embeddings!GloVe}
\index{embeddings!word2vec}
\index{embeddings!FastText}
## Use pre-trained word embeddings {#glove}
\index{embeddings!pre-trained}If your data set is too small, you typically cannot train reliable word embeddings.
```{block, type = "rmdwarning"}
How small is too small? It is hard to make definitive statements because being able to determine useful word embeddings depends on the semantic and pragmatic details of *how* words are used in any given data set. However, it may be unreasonable to expect good results with data sets smaller than about a million words or tokens. (Here, we do not mean about a million unique tokens, i.e., the vocabulary size, but instead about that many observations in the text data.)
```
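The difference between the number of tokens and the vocabulary size is easy to check for any data set. Here is a minimal base R sketch on made-up text, using naive whitespace tokenization purely for illustration.

```r
# Made-up text standing in for a small data set
text <- c("the late fee was charged again",
          "the fee was not refunded")

# Naive whitespace tokenization, for illustration only
tokens <- unlist(strsplit(tolower(text), "\\s+"))

n_tokens <- length(tokens)          # observations in the text data
n_vocab  <- length(unique(tokens))  # vocabulary size

c(tokens = n_tokens, vocabulary = n_vocab)
```

It is the first of these two numbers, the count of observed tokens, that the rough guideline above refers to.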
In such situations, we can still use word embeddings for feature creation in modeling, just not embeddings that we determine ourselves from our own data set. Instead, we can turn to *pre-trained* word embeddings, such as the GloVe\index{embeddings!GloVe} word vectors trained on six billion tokens from Wikipedia and news sources. Several pre-trained GloVe vector representations are available in R via the **textdata** package [@Hvitfeldt2020]. Let's use `dimensions = 100`, since we trained 100-dimensional word embeddings in the previous section.
```{r glove6b, R.options = list(tibble.max_extra_cols=9, tibble.print_min=10, tibble.width=80)}
library(textdata)
glove6b <- embedding_glove6b(dimensions = 100)
glove6b
```
We can transform these word embeddings into a more tidy format, using `pivot_longer()` from **tidyr**. Let's also give this tidied version the same column names as `tidy_word_vectors`, for convenience.
```{r tidyglove, dependson="glove6b"}
tidy_glove <- glove6b %>%
  pivot_longer(contains("d"),
               names_to = "dimension") %>%
  rename(item1 = token)
tidy_glove
```
We've already explored some sets of "synonyms" in the embedding space we determined ourselves from the CFPB complaints. What about this embedding space learned via the GloVe algorithm\index{embeddings!GloVe} on a much larger data set? We just need to make one change to our `nearest_neighbors()` function and add `maximum_size = NULL`, because the matrices we are multiplying together are much larger this time.
```{r nearestsynonyms2}
nearest_neighbors <- function(df, token) {
  df %>%
    widely(
      ~ {
        y <- .[rep(token, nrow(.)), ]
        res <- rowSums(. * y) /
          (sqrt(rowSums(. ^ 2)) * sqrt(sum(.[token, ] ^ 2)))
        matrix(res, ncol = 1, dimnames = list(x = names(res)))
      },
      sort = TRUE,
      maximum_size = NULL
    )(item1, dimension, value) %>%
    select(-item2)
}
```
\index{embeddings!pre-trained}Pre-trained word embeddings are trained on very large, general purpose English language data sets. Commonly used [word2vec embeddings](https://code.google.com/archive/p/word2vec/) \index{embeddings!word2vec}are based on the Google News data set, and \index{embeddings!GloVe} [GloVe embeddings](https://nlp.stanford.edu/projects/glove/) (what we are using here) and\index{embeddings!FastText} [FastText embeddings](https://fasttext.cc/docs/en/english-vectors.html) are learned from the text of Wikipedia plus other sources. Keeping that in mind, what words are closest to `"error"` in the GloVe embeddings?
```{r dependson=c("tidyglove", "nearestsynonyms2")}
tidy_glove %>%
nearest_neighbors("error")
```
```{r echo=FALSE, dependson=c("tidyglove", "nearestsynonyms2")}
test_words5 <-
tidy_glove %>%
nearest_neighbors("error") %>%
slice_max(value, n = 10) %>%
pull(item1)
if (!all(c("correct", "incorrect") %in% test_words5)) {
rlang::abort('In Chapter 5 on word embeddings, GloVe words closest to "error" do not match previous results.')
}
```
Instead of the more specific clerical mistakes and discrepancies in the CFPB embeddings, we now see more general words about being correct or incorrect. This could present a challenge for using the GloVe embeddings with the CFPB text data. Remember that different senses or uses of the same word are conflated in word embeddings; the high-dimensional space of any set of word embeddings cannot distinguish between different uses of a word, such as the word "error".
What is closest to the word `"month"` in these pre-trained \index{embeddings!GloVe}GloVe embeddings?\index{embeddings!pre-trained}
```{r dependson=c("tidyglove", "nearestsynonyms2")}
tidy_glove %>%
nearest_neighbors("month")
```
```{r echo=FALSE, dependson=c("tidyglove", "nearestsynonyms2")}
test_words6 <-
tidy_glove %>%
nearest_neighbors("month") %>%
slice_max(value, n = 10) %>%
pull(item1)
if (!all(c("week", "year") %in% test_words6)) {
rlang::abort('In Chapter 5 on word embeddings, GloVe words closest to "month" do not match previous results.')
}
```
Instead of words about payments, the GloVe results here focus on different time periods only.
What words are closest in the \index{embeddings!GloVe}GloVe embedding space to `"fee"`?
```{r dependson=c("tidyglove", "nearestsynonyms2")}
tidy_glove %>%
nearest_neighbors("fee")
```
```{r echo=FALSE, dependson=c("tidyglove", "nearestsynonyms2")}
test_words7 <-
tidy_glove %>%
nearest_neighbors("fee") %>%
slice_max(value, n = 10) %>%
pull(item1)
if (!all(c("payment", "pay", "salary") %in% test_words7)) {
rlang::abort('In Chapter 5 on word embeddings, GloVe words closest to "fee" do not match previous results.')
}
```
As with the CFPB embeddings, the most similar words are generally financial, but they are largely about salary and pay instead of about charges and overdrafts.
\index{semantics}\index{preprocessing!challenges}
```{block, type = "rmdnote"}
These examples highlight how pre-trained word embeddings can be useful because of the incredibly rich semantic relationships they encode, but also how these vector representations are often less than ideal for specific tasks.
```
If we do choose to use pre-trained word embeddings, how do we go about integrating them into a modeling workflow? Again, we can create simple document embeddings by treating each document as a collection of words and summarizing the word embeddings. The GloVe embeddings\index{embeddings!GloVe} do not contain all the tokens in the CFPB complaints, and vice versa, so let's use `inner_join()` to match up our data sets.
```{r glovematrix, dependson=c("docmatrix", "tidyglove")}
word_matrix <- tidy_complaints %>%
  inner_join(by = "word",
             tidy_glove %>%
               distinct(item1) %>%
               rename(word = item1)) %>%
  count(complaint_id, word) %>%
  cast_sparse(complaint_id, word, n)

glove_matrix <- tidy_glove %>%
  inner_join(by = "item1",
             tidy_complaints %>%
               distinct(word) %>%
               rename(item1 = word)) %>%
  cast_sparse(item1, dimension, value)

# %*% matches columns to rows by position, not by name, so put the
# GloVe rows in the same word order as the columns of word_matrix
doc_matrix <- word_matrix %*% glove_matrix[colnames(word_matrix), ]

dim(doc_matrix)
```
Since these GloVe embeddings\index{embeddings!GloVe} had the same number of dimensions as the word embeddings we found ourselves (100), we end up with the same number of columns as before but with slightly fewer documents in the data set. We have lost documents that contain only words not included in the GloVe embeddings.
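The way documents drop out when vocabularies are matched can be seen in a small base R sketch. The tokens and the `pretrained_vocab` vector are made up for illustration; `%in%` plays the role that `inner_join()` plays in the chunk above.

```r
# Tokens per document (made-up data)
doc_tokens <- data.frame(
  complaint_id = c(1, 1, 2, 3, 3),
  word = c("fee", "overdraft", "xxxx", "late", "fee"),
  stringsAsFactors = FALSE
)

# Hypothetical pre-trained vocabulary that lacks the token "xxxx"
pretrained_vocab <- c("fee", "late", "overdraft", "salary")

# Keep only tokens present in both vocabularies
matched <- doc_tokens[doc_tokens$word %in% pretrained_vocab, ]

# Documents whose tokens were all unmatched disappear entirely
dropped <- setdiff(unique(doc_tokens$complaint_id),
                   unique(matched$complaint_id))
dropped
```

Checking which documents are dropped this way is worthwhile before modeling, because those rows have no document embedding at all.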
```{block2, type = "rmdpackage"}
The package **wordsalad** [@R-wordsalad] provides a unified interface for finding different kinds of word vectors from text using pre-trained embeddings. The options include fastText, GloVe, and word2vec.
```
\index{embeddings!word2vec}
\index{embeddings!FastText}
## Fairness and word embeddings {#fairnessembeddings}
Perhaps more than any of the other preprocessing steps this book has covered so far, using word embeddings opens an analysis or model up to the possibility of being influenced by systemic unfairness and bias.\index{bias}
\index{corpus}
```{block, type = "rmdwarning"}
Embeddings are trained or learned from a large corpus of text data, and whatever human prejudice or bias exists in the corpus becomes imprinted into the vector data of the embeddings.
```
This is true of all machine learning to some extent (models learn, reproduce, and often amplify whatever biases exist in training data), but this is literally, concretely true of word embeddings. @Caliskan2016 show how the GloVe word embeddings (the same embeddings we used in Section \@ref(glove)) replicate human-like semantic biases:
- Typically Black first names are associated with more unpleasant feelings than typically white first names.
- Women's first names are more associated with family and men's first names are more associated with career.
- Terms associated with women are more associated with the arts and terms associated with men are more associated with science.
\index{preprocessing!challenges}Results like these have been confirmed over and over again, such as when @Bolukbasi2016 demonstrated gender stereotypes in how word embeddings encode professions or when Google Translate [exhibited apparently sexist behavior when translating text from languages with no gendered pronouns](https://twitter.com/seyyedreza/status/935291317252493312). Google has since [worked to correct this problem](https://www.blog.google/products/translate/reducing-gender-bias-google-translate/), but in 2021 the problem [still exists for some languages](https://twitter.com/doravargha/status/1373211762108076034). @Garg2018 even used the way bias and stereotypes can be found in word embeddings to quantify how social attitudes towards women and minorities have changed over time.
Remember that word embeddings are *learned* or trained from some large data set of text; this training data is the source of the biases we observe when applying word embeddings to NLP tasks. @Bender2021 outline how the very large data sets used in large language models do not mean that such models reflect representative or diverse viewpoints, or can even respond to changing social views. As one concrete example, a common data set used to train large embedding models is the text of Wikipedia, but Wikipedia [itself has problems with, for example, gender bias](https://en.wikipedia.org/wiki/Gender_bias_on_Wikipedia). Some of the gender discrepancies on Wikipedia can be attributed to social and historical factors, but some can be attributed to the site mechanics of Wikipedia itself [@Wagner2016].
```{block, type = "rmdnote"}
It's safe to assume that any large corpus of language will contain latent structure reflecting the biases of the people who generated that language.
```
\index{corpus}
\index{bias}
When embeddings with these kinds of stereotypes are used as a preprocessing step in training a predictive model, the final model can exhibit racist, sexist, or otherwise biased characteristics. @Speer2017 demonstrated how using pre-trained word embeddings\index{embeddings!pre-trained} to train a straightforward sentiment analysis model can result in text such as
> Let's go get Italian food
being scored much more positively than text such as\index{preprocessing!challenges}\index{bias}
> Let's go get Mexican food
because of characteristics of the text the word embeddings were trained on.
## Using word embeddings in the real world
Given these profound and fundamental challenges with word embeddings, what options are out there? First, consider not using word embeddings when building a text model. Depending on the particular analytical question you are trying to answer, another numerical representation of text data (such as word frequencies or tf-idf\index{tf-idf} of single words or n-grams) may be more appropriate. Consider this option even more seriously if the model you want to train is already entangled with issues of bias, such as the sentiment analysis example in Section \@ref(fairnessembeddings).
Consider whether finding your own word embeddings, instead of relying on pre-trained embeddings\index{embeddings!pre-trained} created using an algorithm like GloVe\index{embeddings!GloVe} or word2vec\index{embeddings!word2vec}, may help you. Building your own vectors is likely to be a good option when the text domain you are working in is *specific* rather than general purpose; some examples of such domains could include customer feedback for a clothing e-commerce site, comments posted on a coding Q&A site, or legal documents.
Learning good quality word embeddings is only realistic when you have a large corpus of text data (say, a million tokens), but if you have that much data, it is possible that embeddings learned from scratch based on your own data may not exhibit the same kind of semantic biases that exist in pre-trained word embeddings. Almost certainly there will be some kind of bias latent in any large text corpus, but when you use your own training data for learning word embeddings, you avoid the problem of *adding* historic, systemic prejudice from general purpose language data sets.
```{block, type = "rmdnote"}
You can use the same approaches discussed in this chapter to check any new embeddings for dangerous biases such as racism or sexism.
```
\index{bias}
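One way to carry out such a check is to compare how close a word sits to two contrasting sets of attribute words, in the spirit of the association tests used by @Caliskan2016. The following is a minimal sketch and not their full procedure: `cosine_sim()` and `association()` are hypothetical helper functions, and the tiny embedding matrix is made up purely for illustration, so its numbers carry no empirical meaning.

```r
# Cosine similarity between two numeric vectors
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Mean similarity of `word` to set_a minus its mean similarity to set_b;
# `emb` is a matrix with one row per word and the words as row names
association <- function(emb, word, set_a, set_b) {
  sim_to <- function(set) {
    mean(vapply(set, function(s) cosine_sim(emb[word, ], emb[s, ]), numeric(1)))
  }
  sim_to(set_a) - sim_to(set_b)
}

# Made-up two-dimensional vectors, for illustration only
toy_emb <- matrix(
  c(1.0, 0.1,
    0.9, 0.2,
    0.1, 1.0,
    0.2, 0.9,
    0.8, 0.3),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("pleasant", "lovely", "unpleasant", "awful", "flower"),
                  NULL)
)

# Positive values mean "flower" sits closer to the first attribute set
association(toy_emb, "flower",
            set_a = c("pleasant", "lovely"),
            set_b = c("unpleasant", "awful"))
```

With real embeddings, `toy_emb` would be replaced by a matrix of learned word vectors and the attribute sets by larger, carefully chosen word lists. A score far from zero for words that should be neutral is a signal worth investigating before the embeddings are used in a model.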
NLP researchers have also proposed methods for debiasing embeddings. @Bolukbasi2016 aim to remove stereotypes by postprocessing\index{postprocessing} pre-trained word vectors, choosing specific sets of words that are reprojected in the vector space so that some specific bias, such as gender bias, is mitigated. This is the most established method for reducing bias in embeddings to date, although other methods have been proposed as well, such as augmenting data with counterfactuals [@Lu2018]. Recent work [@Ethayarajh2019] has explored whether the association tests used to measure bias are even useful, and under what conditions debiasing can be effective.\index{preprocessing!challenges}
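The geometric idea behind this kind of postprocessing can be sketched as a projection: estimate a direction in the vector space that captures the bias, then remove each chosen word's component along that direction. The code below is a simplified illustration of that single step and not the full method of @Bolukbasi2016; the helper `neutralize()`, the use of a single word pair to define the direction, and all of the vectors are assumptions made for this example.

```r
# Remove the component of vector `v` that lies along `direction`
neutralize <- function(v, direction) {
  g <- direction / sqrt(sum(direction^2))  # unit-length bias direction
  v - sum(v * g) * g
}

# Made-up three-dimensional vectors, for illustration only
vec_she <- c(0.5, 0.8, 0.1)
vec_he <- c(0.5, -0.8, 0.1)
vec_job <- c(0.7, 0.3, 0.6)  # a word we would like to be neutral

# A crude bias direction from one word pair (an assumption of this sketch)
bias_direction <- vec_she - vec_he

vec_job_debiased <- neutralize(vec_job, bias_direction)

# After the projection, the word has no component along the bias direction
sum(vec_job_debiased * bias_direction)
```

Removing one direction only guarantees that the chosen words are orthogonal to that direction; it says nothing about other ways the same bias may be encoded in the vector space.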
Other researchers, such as @Caliskan2016, suggest that corrections for fairness should happen at the point of *decision* or action rather than earlier in the process of modeling, such as preprocessing steps like building word embeddings. The concern is that methods for debiasing word embeddings may allow the stereotypes to seep back in, and more recent work shows that this is exactly what can happen. @Gonen2019 highlight how pervasive and consistent gender bias is across different word embedding models, *even after* applying current debiasing methods.
## Summary {#embeddingssummary}
Mapping words (or other tokens) to an embedding in a special vector space is a powerful approach in natural language processing. This chapter started from fundamentals to demonstrate how to determine word embeddings from a text data set, but a whole host of highly sophisticated techniques have been built on this foundation. For example, document embeddings can be learned from text directly [@Le2014] rather than summarized from word embeddings. More recently, embeddings have acted as one part of larger language models, such as the recurrent models ULMFiT [@Howard2018] and ELMo [@Peters2018] and, later, models built on transformers. It's important to keep in mind that even more advanced natural language algorithms, such as these language models, also exhibit such systemic biases [@Sheng2019].
### In this chapter, you learned:
- what word embeddings are and why we use them
- how to determine word embeddings from a text data set
- how the vector space of word embeddings encodes word similarity
- about a simple strategy to find document similarity
- how to handle pre-trained word embeddings
- why word embeddings carry historic and systemic bias
- about approaches for debiasing word embeddings