# Classification {#mlclassification}
```{r, include=FALSE}
hook_output = knit_hooks$get('output')
knit_hooks$set(output = function(x, options) {
# this hook is used only when the linewidth option is not NULL
if (!is.null(n <- options$linewidth)) {
x = knitr:::split_lines(x)
# any lines wider than n should be wrapped
if (any(nchar(x) > n)) x = strwrap(x, width = n)
x = paste(x, collapse = '\n')
}
hook_output(x, options)
})
```
In Chapter \@ref(mlregression), we focused on modeling to predict *continuous values* for documents, such as what year a Supreme Court opinion was published. This is an example of a regression model. We can also use machine learning to predict *labels* on documents using a classification model. For both types of prediction questions, we develop a learner or model to describe the relationship between a target or outcome variable and our input features; what is different about a classification model is the nature of that outcome.
- A **regression model** predicts a numeric or continuous value.
- A **classification model** predicts a class label or group membership.
For our classification example in this chapter, let's consider the data set of consumer complaints submitted to the US Consumer Finance Protection Bureau. Let's read in the complaint data (Section \@ref(cfpb-complaints)) with `read_csv()`.
```{r complaints, message=FALSE}
library(tidyverse)
complaints <- read_csv("data/complaints.csv.gz")
```
We can start by taking a quick `glimpse()` at the data to see what we have to work with. This data set contains a text field with the complaint, along with information regarding what it was for,
how and when it was filed, and the response from the bureau.
```{r, dependson="complaints"}
glimpse(complaints)
```
In this chapter, we will build classification models to predict what type of financial `product` the complaints are referring to, i.e., a label or categorical variable. The goal of predictive modeling with text input features and a categorical outcome is to learn and model the relationship between those input features, typically created through steps as outlined in Chapters \@ref(language) through \@ref(embeddings), and the class label or categorical outcome. Most classification models do predict the probability of a class (a numeric output), but the particular characteristics of this output make classification models different enough from regression models that we handle them differently.
## A first classification model {#classfirstattemptlookatdata}
For our first model, let's build a binary classification model to predict whether a submitted complaint is about "Credit reporting, credit repair services, or other personal consumer reports" or not.
```{block, type = "rmdnote"}
This kind of "yes or no" binary classification model is both common and useful in real-world text machine learning problems.
```
The outcome variable `product` contains more categories than this, so we need to transform this variable to only contain the values "Credit reporting, credit repair services, or other personal consumer reports" and "Other".
It is always a good idea to look at your data! Here are the first six complaints:
```{r, dependson="complaints", linewidth=80}
head(complaints$consumer_complaint_narrative)
```
The complaint narratives contain many series of capital `"X"`'s. These strings (like "XX/XX" or "XXXX XXXX XXXX XXXX") are used to protect personally identifiable information (PII)\index{PII}\index{personally identifiable information|see {PII}} in this publicly available data set. This is not a universal censoring mechanism; censoring and PII protection will vary from source to source. Hopefully you will be able to find information on PII\index{PII} censoring\index{censoring} in a data dictionary, but you should always look at the data yourself to verify.
We also see that monetary amounts are surrounded by curly brackets (like `"{$21.00}"`); this is another text preprocessing\index{preprocessing} step that has been taken care of for us. We could craft a regular expression\index{regex} to extract all the dollar amounts.
```{r, dependson="complaints"}
complaints$consumer_complaint_narrative %>%
str_extract_all("\\{\\$[0-9\\.]*\\}") %>%
compact() %>%
head()
```
In Section \@ref(customfeatures), we will use an approach like this for custom feature engineering from the text.
### Building our first classification model {#classfirstmodel}
This data set includes more possible predictors than the text alone, but for this first model we will only use the text variable `consumer_complaint_narrative`.
Let's create a factor outcome variable `product` with two levels, "Credit" and "Other".
Then, we split the data into training and testing data sets.
We can use the `initial_split()` function from **rsample** to create this binary split of the data.
The `strata` argument ensures that the distribution of `product` is similar in the training set and testing set.
Since the split uses random sampling, we set a seed so we can reproduce our results.
```{r, complaintssplit}
library(tidymodels)
set.seed(1234)
complaints2class <- complaints %>%
mutate(product = factor(if_else(
product == paste("Credit reporting, credit repair services,",
"or other personal consumer reports"),
"Credit", "Other"
)))
complaints_split <- initial_split(complaints2class, strata = product)
complaints_train <- training(complaints_split)
complaints_test <- testing(complaints_split)
```
The dimensions of the two splits show that this first step worked as we planned.
```{r, dependson="complaintssplit"}
dim(complaints_train)
dim(complaints_test)
```
Next we need to preprocess\index{preprocessing} this data to prepare it for modeling; we have text data, and we need to build numeric features for machine learning from that text.
The **recipes** package, part of tidymodels, allows us to create a specification of preprocessing steps we want to perform. These transformations are estimated (or "trained") on the training set so that they can be applied in the same way on the testing set or new data at prediction time, without data leakage.
We initialize our set of preprocessing transformations with the `recipe()` function, using a formula expression to specify the variables, our outcome plus our predictor, along with the data set.
```{r complaintrec1, dependson="complaintssplit"}
complaints_rec <-
recipe(product ~ consumer_complaint_narrative, data = complaints_train)
```
Now we add steps to process the text of the complaints; we use **textrecipes** to handle the `consumer_complaint_narrative` variable. First we tokenize the text to words with `step_tokenize()`. By default this uses `tokenizers::tokenize_words()`.
Before we calculate tf-idf\index{tf-idf}, we use `step_tokenfilter()` to keep only the 1000 most frequent tokens, to avoid creating too many variables in our first model. To finish, we use `step_tfidf()` to compute tf-idf.
```{r complaintrec4, dependson="complaintrec1"}
library(textrecipes)
complaints_rec <- complaints_rec %>%
step_tokenize(consumer_complaint_narrative) %>%
step_tokenfilter(consumer_complaint_narrative, max_tokens = 1e3) %>%
step_tfidf(consumer_complaint_narrative)
```
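The `workflow()` we build next will estimate and apply this recipe for us, but it can be helpful to see what the recipe produces. Here is a minimal sketch using the standard **recipes** functions `prep()` and `bake()`; the object name `complaints_prep` is ours, for illustration.
```{r, eval=FALSE}
# A sketch only: prep() estimates the preprocessing from the training
# data (which 1000 tokens to keep, the idf weights), and bake() applies
# those same estimates to any data set.
complaints_prep <- prep(complaints_rec)

bake(complaints_prep, new_data = NULL)             # processed training data
bake(complaints_prep, new_data = complaints_test)  # processed testing data
```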
Now that we have a full specification of the preprocessing recipe, we can build up a tidymodels `workflow()` to bundle together our modeling components.
```{r complaintwf, dependson="complaintrec4"}
complaint_wf <- workflow() %>%
add_recipe(complaints_rec)
```
Let's start with a naive Bayes model [@kim2006; @Kibriya2005; @Eibe2006], which is available in the tidymodels package **discrim**.
One of the main advantages of a naive Bayes model is its ability to handle a large number of features, such as those we deal with when using word count methods.
Here we have only kept the 1000 most frequent tokens, but we could have kept more tokens and a naive Bayes model would still be able to handle such predictors well. For now, we will limit the model to a moderate number of tokens.
```{block2, type = "rmdpackage"}
In **tidymodels**, the package for creating model specifications is **parsnip** [@R-parsnip]. The **parsnip** package provides the functions for creating all the models we have used so far, but other extra packages provide more. The **discrim** package is an extension package for **parsnip** that contains model definitions for various discriminant analysis models, including naive Bayes.
```
```{r nbspec}
library(discrim)
nb_spec <- naive_Bayes() %>%
set_mode("classification") %>%
set_engine("naivebayes")
nb_spec
```
Now we have everything we need to fit our first classification model. We can add the naive Bayes model to our workflow, and then we can fit this workflow to our training data.
```{r nbfit, dependson=c("nbspec", "complaintwf")}
nb_fit <- complaint_wf %>%
add_model(nb_spec) %>%
fit(data = complaints_train)
```
We have trained our first classification model!
### Evaluation
Like we discussed in Section \@ref(firstregressionevaluation), we should not use the test set to compare models or different model parameters. The test set is a precious resource that should only be used at the end of the model training process to estimate performance on new data. Instead, we will use *resampling* methods to evaluate our model.
Let's use resampling to estimate the performance of the naive Bayes classification model we just fit. We can do this using resampled data sets built from the training set. Let's create 10-fold cross-validation sets, and use these resampled sets for performance estimates.
```{r complaintsfolds, dependson="complaintssplit"}
set.seed(234)
complaints_folds <- vfold_cv(complaints_train)
complaints_folds
```
Each of these splits contains information about how to create cross-validation folds from the original training data. In this example, 90% of the training data is included in each fold, and the other 10% is held out for evaluation.
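If you want to peek inside a single resample, the **rsample** accessors `analysis()` and `assessment()` (loaded with tidymodels) return those two pieces. A quick sketch, where `first_fold` is a name of our choosing:
```{r, eval=FALSE}
# A sketch of inspecting one fold: analysis() returns the ~90% of the
# training data used to fit the model for this resample, and
# assessment() returns the ~10% held out for evaluation.
first_fold <- complaints_folds$splits[[1]]
dim(analysis(first_fold))
dim(assessment(first_fold))
```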
For convenience, let's again use a `workflow()` for our resampling estimates of performance.
```{block, type = "rmdwarning"}
Using a `workflow()` isn't required (you can fit or tune a model specification together with a separate preprocessor, such as a recipe), but it can make your code easier to read and organize.
```
```{r nbwf, dependson=c("nbspec", "complaintrec4")}
nb_wf <- workflow() %>%
add_recipe(complaints_rec) %>%
add_model(nb_spec)
nb_wf
```
In the last section, we fit one time to the training data as a whole. Now, to estimate how well that model performs, let's fit the model many times, once to each of these resampled folds, and then evaluate on the heldout part of each resampled fold.
```{r nbrs, dependson=c("nbwf", "complaintsfolds")}
nb_rs <- fit_resamples(
nb_wf,
complaints_folds,
control = control_resamples(save_pred = TRUE)
)
```
We can extract the relevant information using `collect_metrics()` and `collect_predictions()`.
```{r nbrsmetrics}
nb_rs_metrics <- collect_metrics(nb_rs)
nb_rs_predictions <- collect_predictions(nb_rs)
```
What results do we see, in terms of performance metrics?
```{r}
nb_rs_metrics
```
The default performance parameters for binary classification are accuracy and ROC AUC (area under the receiver operating characteristic curve). For these resamples, the average accuracy is `r nb_rs_metrics %>% filter(.metric == "accuracy") %>% pull(mean) %>% scales::percent(accuracy = 0.1)`.
\index{accuracy}
\index{ROC AUC}
\index{area under the receiver operating characteristic curve|see {ROC AUC}}
```{block, type = "rmdnote"}
Accuracy and ROC AUC are performance metrics used for classification models. For both, values closer to 1 are better.
Accuracy is the proportion of the data that is predicted correctly. Be aware that accuracy can be misleading in some situations, such as for imbalanced data sets.
ROC AUC measures how well a classifier performs at different thresholds. The ROC curve plots the true positive rate against the false positive rate; AUC closer to 1 indicates a better-performing model, while AUC closer to 0.5 indicates a model that does no better than random guessing.
```
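These metrics can also be computed directly from the resampled predictions with the **yardstick** functions `accuracy()` and `roc_auc()`, rather than relying on `collect_metrics()`. A minimal sketch:
```{r, eval=FALSE}
# A sketch computing the same metrics per resample by hand. "Credit" is
# the first factor level, so .pred_Credit is the probability column used
# for ROC AUC.
nb_rs_predictions %>%
  group_by(id) %>%
  accuracy(truth = product, estimate = .pred_class)

nb_rs_predictions %>%
  group_by(id) %>%
  roc_auc(truth = product, .pred_Credit)
```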
Figure \@ref(fig:firstroccurve) shows the ROC curve, a visualization of how well a classification model can distinguish between classes, for our first classification model on each of the resampled data sets.
```{r firstroccurve, opts.label = "fig.square", fig.cap="ROC curve for naive Bayes classifier with resamples of US Consumer Finance Bureau complaints"}
nb_rs_predictions %>%
group_by(id) %>%
roc_curve(truth = product, .pred_Credit) %>%
autoplot() +
labs(
color = NULL,
title = "ROC curve for US Consumer Finance Complaints",
subtitle = "Each resample fold is shown in a different color"
)
```
The area under each of these curves is the `roc_auc` metric we have computed. If the curve were close to the diagonal line, then the model's predictions would be no better than random guessing.
Another way to evaluate our model is to examine the \index{matrix!confusion}confusion matrix. A confusion matrix tabulates a model's false positives and false negatives for each class.
The function `conf_mat_resampled()` computes a separate confusion matrix for each resample and takes the average of the cell counts. This allows us to visualize an overall confusion matrix rather than needing to examine each resample individually.
```{r firstheatmap, dependson="nbrs", fig.cap='Confusion matrix for naive Bayes classifier, showing some bias toward predicting the credit category'}
conf_mat_resampled(nb_rs, tidy = FALSE) %>%
autoplot(type = "heatmap")
```
In \index{matrix!confusion}Figure \@ref(fig:firstheatmap), the squares for "Credit"/"Credit" and "Other"/"Other" have a darker shade than the off-diagonal squares. This is a good sign, meaning that our model is right more often than not! However, this first model is struggling somewhat, since many observations from the "Other" class are being mispredicted as "Credit".
```{block, type = "rmdwarning"}
One metric alone cannot give you a complete picture of how well your classification model is performing. The confusion matrix is a good starting point to get an overview of your model performance, as it includes rich information.
```
\index{matrix!confusion}
This is real data from a government agency, and these kinds of performance metrics must be interpreted in the context of how such a model would be used. What happens if the model we trained gets a classification wrong for a consumer complaint? What impact will it have if more "Other" complaints are correctly identified than "Credit" complaints, either for consumers or for policymakers?
## Compare to the null model {#classnull}
Like we did in Section \@ref(regnull), we can assess a model like this one by comparing its performance to a "null model" or baseline model, a simple, non-informative model that always predicts the largest class for classification. Such a model is perhaps the simplest heuristic or rule-based alternative that we can consider as we assess our modeling efforts.
We can build a classification `null_model()` specification and add it to a `workflow()` with the same preprocessing recipe we used in the previous section, to estimate performance.
```{r nullrs2, dependson=c("nbwf", "complaintsfolds")}
null_classification <- null_model() %>%
set_engine("parsnip") %>%
set_mode("classification")
null_rs <- workflow() %>%
add_recipe(complaints_rec) %>%
add_model(null_classification) %>%
fit_resamples(
complaints_folds
)
```
What results do we obtain from the null model, in terms of performance metrics?
```{r, dependson="nullrs"}
null_rs %>%
collect_metrics()
```
The accuracy and ROC AUC indicate that this null model is, like in the regression case, dramatically worse than even our first model. The text of the CFPB complaints is predictive of the category we are building models for.
## Compare to a lasso classification model {#comparetolasso}
Regularized linear models are a class of statistical model that can be used in regression and classification tasks. Linear models are not considered cutting edge in NLP research, but are a \index{models!in production}workhorse in real-world practice. Here we will use a lasso regularized model [@Tibshirani1996], where the regularization method also performs variable selection. In text analysis, we typically have many tokens, which are the features in our machine learning problem.
```{block, type = "rmdnote"}
Using regularization helps us choose a simpler model that we expect to generalize better to new observations, and variable selection helps us identify which features to include in our model.
```
Lasso regression or classification learns how much of a _penalty_ to put on some features (sometimes penalizing all the way down to zero) so that we can select only some features out of the high-dimensional space of original possible variables (tokens) for the final model.
Let's create a specification of a lasso regularized model. Remember that in tidymodels, specifying a model has three components: the algorithm, the mode, and the computational engine.
```{r lassospec}
lasso_spec <- logistic_reg(penalty = 0.01, mixture = 1) %>%
set_mode("classification") %>%
set_engine("glmnet")
lasso_spec
```
Then we can create another `workflow()` object with the lasso specification. Notice that we can reuse our text preprocessing recipe.
```{r lassowf, dependson=c("lassospec", "complaintrec4")}
lasso_wf <- workflow() %>%
add_recipe(complaints_rec) %>%
add_model(lasso_spec)
lasso_wf
```
Now we estimate the performance of this first lasso classification model with `fit_resamples()`.
```{r lassors, dependson=c("lassowf", "complaintsfolds")}
set.seed(2020)
lasso_rs <- fit_resamples(
lasso_wf,
complaints_folds,
control = control_resamples(save_pred = TRUE)
)
```
Let's again extract the relevant information using `collect_metrics()` and `collect_predictions()`
```{r lassorsmetrics, dependson="lassors"}
lasso_rs_metrics <- collect_metrics(lasso_rs)
lasso_rs_predictions <- collect_predictions(lasso_rs)
```
Now we can see that `lasso_rs_metrics` contains the same default performance metrics we have been using so far in this chapter.
```{r, dependson="lassorsmetrics"}
lasso_rs_metrics
```
This looks pretty promising, considering we haven't yet done any tuning of the lasso hyperparameters.
Figure \@ref(fig:lassoroccurve) shows the ROC curves for this regularized model on each of the resampled data sets.
```{r lassoroccurve, opts.label = "fig.square", fig.cap="ROC curve for lasso regularized classifier with resamples of US Consumer Finance Bureau complaints"}
lasso_rs_predictions %>%
group_by(id) %>%
roc_curve(truth = product, .pred_Credit) %>%
autoplot() +
labs(
color = NULL,
title = "ROC curve for US Consumer Finance Complaints",
subtitle = "Each resample fold is shown in a different color"
)
```
Let's finish this section by generating a confusion matrix\index{matrix!confusion}, shown in Figure \@ref(fig:lassoheatmap).
Our lasso model is better at separating the classes than the naive Bayes model in Section \@ref(classfirstmodel), and our results are more symmetrical than those for the naive Bayes model in Figure \@ref(fig:firstheatmap).
```{r lassoheatmap, dependson="lassorsmetrics", fig.cap="Confusion matrix for a lasso regularized classifier, with more symmetric results"}
conf_mat_resampled(lasso_rs, tidy = FALSE) %>%
autoplot(type = "heatmap")
```
## Tuning lasso hyperparameters {#tunelasso}
\index{models!tuning}The value `penalty = 0.01` for regularization in Section \@ref(comparetolasso) was picked somewhat arbitrarily. How do we know the *right* or *best* regularization parameter penalty? This is a model hyperparameter, and we cannot learn its best value during model training, but we can estimate the best value by training many models on resampled data sets and exploring how well all these models perform. Let's build a new model specification for **model tuning**.
```{r complaintstunespec}
tune_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
set_mode("classification") %>%
set_engine("glmnet")
tune_spec
```
After the tuning process, we can select a single best numeric value.
```{block, type = "rmdnote"}
Think of `tune()` here as a placeholder for the regularization penalty.
```
We can create a regular grid of values to try, using the convenience function `grid_regular()` together with `penalty()`.
```{r complaintslambdagrid}
lambda_grid <- grid_regular(penalty(), levels = 30)
lambda_grid
```
The function `grid_regular()` is from the **dials** package. It chooses sensible values to try for a parameter like the regularization penalty; here, we asked for 30 different possible values.
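If you are curious where these values come from, you can inspect the `penalty()` parameter object itself; a quick sketch using the **dials** helper `range_get()`:
```{r, eval=FALSE}
# A sketch: penalty() is a dials parameter object with a default range on
# the log10 scale, which grid_regular() fills with evenly spaced values.
penalty()
penalty() %>% range_get()
```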
Now it is time to tune! Let's use `tune_grid()` to fit a model at each of the values for the regularization penalty in our regular grid.
```{block, type = "rmdpackage"}
In **tidymodels**, the package for tuning is called **tune**. Tuning a model uses syntax very similar to fitting a model to a set of resampled data sets for evaluation (`fit_resamples()`) because the two tasks are so alike. The difference is that when you tune, each model that you fit has _different_ parameters and you want to find the best one.
```
We add our tunable model specification `tune_spec` to a workflow with the same preprocessing recipe we've been using so far, and then fit it to every possible parameter in `lambda_grid` and every resample in `complaints_folds` with `tune_grid()`.
```{r complaintstuners, dependson=c("complaintsfolds", "complaintslambdagrid", "complaintstunespec")}
tune_wf <- workflow() %>%
add_recipe(complaints_rec) %>%
add_model(tune_spec)
set.seed(2020)
tune_rs <- tune_grid(
tune_wf,
complaints_folds,
grid = lambda_grid,
control = control_resamples(save_pred = TRUE)
)
tune_rs
```
```{block, type = "rmdwarning"}
Like when we used `fit_resamples()`, tuning in tidymodels can use multiple cores or multiple machines via parallel processing, because the resampled data sets and possible parameters are independent of each other. A discussion of parallel processing for all possible operating systems is beyond the scope of this book, but it is well worth your time to learn how to parallelize your machine learning tasks on _your_ system.
```
\index{computational speed}
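As one possible setup (a sketch only, assuming the **doParallel** package is installed), registering a **foreach** backend before tuning lets tune process resamples in parallel:
```{r, eval=FALSE}
# A sketch only: register a parallel backend so tune_grid() and
# fit_resamples() can work on resamples in parallel. Adjust the number
# of workers to your machine, and stop the cluster when you are done.
library(doParallel)
cl <- makePSOCKcluster(4)
registerDoParallel(cl)

# ... run tune_grid() or fit_resamples() as usual ...

stopCluster(cl)
```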
Now, instead of one set of metrics, we have a set of metrics for each value of the regularization penalty.
```{r dependson="complaintstuners"}
collect_metrics(tune_rs)
```
Let's visualize these metrics, accuracy and ROC AUC, in Figure \@ref(fig:complaintstunevis) to see what the best model is.
```{r complaintstunevis, dependson="complaintstuners", opts.label = "fig.square", fig.cap="We can identify the best regularization penalty from model performance metrics, for example, at the highest ROC AUC. Note the logarithmic scale for the regularization penalty."}
autoplot(tune_rs) +
labs(
title = "Lasso model performance across regularization penalties",
subtitle = "Performance metrics can be used to identity the best penalty"
)
```
We can view the best results with `show_best()` and a choice for the metric, such as ROC AUC.
```{r dependson="complaintstuners"}
tune_rs %>%
show_best("roc_auc")
```
```{r tunersauc, include=FALSE, dependson="complaintstuners"}
tune_rs_auc <- show_best(tune_rs, "roc_auc") %>%
pull(mean) %>%
max() %>%
round(3)
```
The best value for ROC AUC from this tuning run is `r tune_rs_auc`. We can extract the best regularization parameter for this value of ROC AUC from our tuning results with `select_best()`, or a simpler model with higher regularization with `select_by_pct_loss()` or `select_by_one_std_err()`. Let's choose the model with the best ROC AUC within one standard error of the numerically best model [@Breiman1984].
```{r complaintschooseauc, dependson="complaintstuners"}
chosen_auc <- tune_rs %>%
select_by_one_std_err(metric = "roc_auc", -penalty)
chosen_auc
```
Next, let's finalize our tunable workflow with this particular regularization penalty. This is the regularization penalty that our tuning results indicate is a good choice, within one standard error of the numerically best model.
```{r tunedlasso, dependson=c("complaintschooseauc", "complaintstuners")}
final_lasso <- finalize_workflow(tune_wf, chosen_auc)
final_lasso
```
Instead of `penalty = tune()` like before, now our workflow has finalized values for all arguments. The preprocessing recipe has been evaluated on the training data, and we tuned the regularization penalty so that we have a penalty value of `r format(round(chosen_auc$penalty, 5), scientific = FALSE)`. This workflow is ready to go! It can now be fit to our training data.
```{r lassofit, dependson=c("tunedlasso", "complaintssplit")}
fitted_lasso <- fit(final_lasso, complaints_train)
```
What does the result look like? We can access the fit using `extract_fit_parsnip()`, and even `tidy()` the model coefficient results into a convenient dataframe format.
```{r dependson="lassofit"}
fitted_lasso %>%
extract_fit_parsnip() %>%
tidy() %>%
arrange(-estimate)
```
We see here, for the penalty we chose, what terms contribute the most to a complaint _not_ being about credit. The words are largely about mortgages and other financial products.
What terms contribute to a complaint being about credit reporting, for this tuned model? Here we see the names of the credit reporting agencies and words about credit inquiries.
```{r dependson="lassofit"}
fitted_lasso %>%
extract_fit_parsnip() %>%
tidy() %>%
arrange(estimate)
```
```{block, type = "rmdnote"}
Since we are using a linear model, the model coefficients are directly interpretable and transparently give us variable importance. Many models useful for machine learning with text do _not_ have such transparent variable importance; in those situations, you can use other model-independent or model-agnostic approaches like [permutation variable importance](https://juliasilge.com/blog/last-airbender/).
```
## Case study: sparse encoding {#casestudysparseencoding}
We can change how our text data is represented to take advantage of its sparsity, especially for models like lasso regularized models. The regularized regression model we have been training in previous sections used `set_engine("glmnet")`; this computational engine can be more efficient when text data is transformed to a \index{matrix!sparse}sparse matrix (Section \@ref(motivatingsparse)), rather than a dense dataframe or tibble representation.
To keep our text data sparse throughout modeling and use the sparse capabilities of `set_engine("glmnet")`, we need to explicitly set a non-default preprocessing blueprint, using the package **hardhat** [@R-hardhat].
```{block, type = "rmdpackage"}
The **hardhat** package is used by other tidymodels packages like recipes and parsnip under the hood. As a tidymodels user, you typically don't use hardhat functions directly. The exception is when you need to customize something about your model or preprocessing, like in this sparse data example.
```
```{r sparsebp}
library(hardhat)
sparse_bp <- default_recipe_blueprint(composition = "dgCMatrix")
```
This "blueprint" lets us specify during modeling how we want our data passed around from the \index{preprocessing}preprocessing into the model. The composition `"dgCMatrix"` is the most common sparse matrix type, from the Matrix package [@R-Matrix], used in R for modeling. We can use this `blueprint` argument when we add our recipe to our modeling workflow, to define how the data should be passed into the model.
```{r sparsewf, dependson=c("complaintstunespec", "complaintrec4", "sparsebp")}
sparse_wf <- workflow() %>%
add_recipe(complaints_rec, blueprint = sparse_bp) %>%
add_model(tune_spec)
sparse_wf
```
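If you have not worked with sparse matrices before, a tiny hand-made example (a sketch, unrelated to the complaints data) shows what this representation looks like:
```{r, eval=FALSE}
# A sketch: a dgCMatrix stores only the non-zero entries and their
# positions, which saves a lot of memory when most entries are zero,
# as in a document-term or tf-idf matrix.
library(Matrix)
m <- Matrix(c(0, 0, 2,
              0, 1, 0,
              0, 0, 3), nrow = 3, byrow = TRUE, sparse = TRUE)
class(m)
m
```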
The last time we tuned a lasso model, we used the defaults for the penalty parameter and 30 levels. Let's restrict the values this time using the `range` argument, so we don't test values of regularization that are quite so small, and let's try only 20 levels.
```{r smallerlambda}
smaller_lambda <- grid_regular(penalty(range = c(-5, 0)), levels = 20)
smaller_lambda
```
We can tune this lasso regression model, in the same way that we did in Section \@ref(tunelasso). We will fit and assess each possible regularization parameter on each resampling fold, to find the best amount of regularization.
```{r sparsetuners, dependson=c("sparsewf", "complaintsfolds", "smallerlambda")}
set.seed(2020)
sparse_rs <- tune_grid(
sparse_wf,
complaints_folds,
grid = smaller_lambda
)
sparse_rs
```
How did this model turn out, especially compared to the tuned model that did not use the sparse capabilities of `set_engine("glmnet")`?
```{r dependson="sparsetuners"}
sparse_rs %>%
show_best("roc_auc")
```
The best ROC AUC is nearly identical; the best ROC AUC for the non-sparse tuned lasso model in Section \@ref(tunelasso) was `r tune_rs_auc`. The best regularization parameter (`penalty`) is a little different (the best value in Section \@ref(tunelasso) was `r select_best(tune_rs, "roc_auc") %>% pull(penalty) %>% round(5) %>% format(scientific = FALSE)`), but we used a different grid so we didn't try out exactly the same values. We ended up with nearly the same performance and best tuned model.
Importantly, this tuning also took a bit less time to complete.\index{computational speed}
- The _preprocessing_\index{preprocessing} was not much faster, because tokenization and computing tf-idf take a long time.\index{tf-idf}
- The _model fitting_ was much faster, because for highly sparse data, this implementation of regularized regression is much faster for sparse matrix\index{matrix!sparse} input than any dense input.
Overall, the whole tuning workflow is about 10% faster using the sparse preprocessing blueprint. Depending on how computationally expensive your preprocessing is relative to your model and how sparse your data is, you may expect to see larger (or smaller) gains from moving to a sparse data representation.
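The easiest way to see the difference on your own system is to time both approaches; a rough sketch (exact timings depend on hardware and on any parallel backend you have registered):
```{r, eval=FALSE}
# A sketch of a rough timing comparison. Using the same grid for both
# workflows keeps the comparison fair; each call refits every resample,
# so this takes a while.
system.time(tune_grid(tune_wf, complaints_folds, grid = smaller_lambda))
system.time(tune_grid(sparse_wf, complaints_folds, grid = smaller_lambda))
```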
```{block, type = "rmdnote"}
Since our model performance is about the same and we see gains in training time, let's use this sparse representation for the rest of this chapter.
```
## Two-class or multiclass? {#mlmulticlass}
Most of this chapter focuses on binary classification, where we have two classes in our outcome variable (such as "Credit" and "Other") and each observation can either be one or the other. This is a simple scenario with straightforward evaluation strategies because the results only have a two-by-two contingency matrix.
However, it is not always possible to limit a modeling question to two classes. Let's explore how to deal with situations where we have more than two classes.
The CFPB complaints data set in this chapter has nine different `product` classes. In decreasing frequency, they are:
- Credit reporting, credit repair services, or other personal consumer reports
- Debt collection
- Credit card or prepaid card
- Mortgage
- Checking or savings account
- Student loan
- Vehicle loan or lease
- Money transfer, virtual currency, or money service
- Payday loan, title loan, or personal loan
We assume that there is a reason why these product classes have been created in this fashion by this government agency.
Perhaps complaints from different classes are handled by different people or organizations.
Whatever the reason, in this section we would like to build a multiclass classifier to identify these nine specific product classes.
We need to create a new split of the data using `initial_split()` on the unmodified `complaints` data set.
```{r, multicomplaintssplit}
set.seed(1234)
multicomplaints_split <- initial_split(complaints, strata = product)
multicomplaints_train <- training(multicomplaints_split)
multicomplaints_test <- testing(multicomplaints_split)
```
Before we continue, let us take a look at the number of cases in each of the classes.
```{r, dependson="multicomplaintssplit"}
multicomplaints_train %>%
count(product, sort = TRUE) %>%
select(n, product)
```
There is significant imbalance between the classes that we must address, with over 20 times more cases in the majority class than in the smallest class.
This kind of imbalance is a common problem\index{classification!challenges} with multiclass classification, with few multiclass data sets in the real world exhibiting balance between classes.
Compared to binary classification, there are several additional issues to keep in mind when working with multiclass classification:
- Many machine learning algorithms do not handle imbalanced data well and are likely to have a hard time predicting minority classes.
- Not all machine learning algorithms are built for multiclass classification at all.
- Many evaluation metrics need to be reformulated to describe multiclass predictions.
When you have multiple classes in your data, it is possible to formulate the multiclass problem in two ways. With one approach, any given observation can belong to multiple classes. With the other approach, an observation can belong to one and only one class. We will be sticking to the second, "one class per observation" model formulation in this section.
There are many different ways to deal with imbalanced data.
We will demonstrate one of the simplest methods, downsampling\index{downsampling}, where observations from the majority classes are removed during training to achieve a balanced class distribution.
We will be using the **themis** [@R-themis] add-on package for recipes which provides the `step_downsample()` function to perform downsampling.
```{block, type = "rmdpackage"}
The **themis** package provides many more algorithms to deal with imbalanced data during data preprocessing.
```
We have to create a new recipe specification from scratch, since we are dealing with new training data this time.
The specification `multicomplaints_rec` is similar to what we created in Section \@ref(classfirstattemptlookatdata). The only changes are that different data is passed to the `data` argument in the `recipe()` function (it is now `multicomplaints_train`) and we have added `step_downsample(product)` to the end of the recipe specification to downsample after all the text preprocessing. We want to downsample last so that we still generate features on the full training data set. The downsampling\index{downsampling} will then _only_ affect the modeling step, not the preprocessing\index{preprocessing} steps, with hopefully better results.
```{r multicomplaintsrec, message=FALSE, dependson="multicomplaintssplit"}
library(themis)
multicomplaints_rec <-
recipe(product ~ consumer_complaint_narrative,
data = multicomplaints_train) %>%
step_tokenize(consumer_complaint_narrative) %>%
step_tokenfilter(consumer_complaint_narrative, max_tokens = 1e3) %>%
step_tfidf(consumer_complaint_narrative) %>%
step_downsample(product)
```
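The sketch below shows what one alternative from **themis** might look like, swapping downsampling for upsampling; we won't use it further in this chapter, and the object name `multicomplaints_rec_up` is ours, for illustration.
```{r, eval=FALSE}
# A sketch only: upsample the minority classes to match the majority class
# instead of removing observations. Because this step comes after
# step_tfidf(), the predictors are numeric, so steps like step_smote()
# would also be options here.
multicomplaints_rec_up <-
  recipe(product ~ consumer_complaint_narrative,
         data = multicomplaints_train) %>%
  step_tokenize(consumer_complaint_narrative) %>%
  step_tokenfilter(consumer_complaint_narrative, max_tokens = 1e3) %>%
  step_tfidf(consumer_complaint_narrative) %>%
  step_upsample(product)
```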
We also need a new cross-validation object since we are using a different data set.
```{r multicomplaintsfolds, dependson="multicomplaintssplit"}
multicomplaints_folds <- vfold_cv(multicomplaints_train)
```
We cannot reuse the tunable lasso classification specification from Section \@ref(tunelasso) because it only works for binary classification. Some model algorithms and computational engines (such as most random forests and SVMs) automatically detect when we perform multiclass classification from the number of classes in the outcome variable and do not require any changes to our model specification. For lasso regularization, we need to create a new special model specification just for multiclass classification using `multinom_reg()`.
```{r multispec}
multi_spec <- multinom_reg(penalty = tune(), mixture = 1) %>%
set_mode("classification") %>%
set_engine("glmnet")
multi_spec
```
We used the same arguments for `penalty` and `mixture` as in Section \@ref(tunelasso), as well as the same mode and engine, but this model specification is set up to handle more than just two classes. We can combine this model specification with our preprocessing recipe for multiclass data in a `workflow()`.
```{r multilassowf, dependson=c("multispec", "multicomplaintsrec")}
multi_lasso_wf <- workflow() %>%
add_recipe(multicomplaints_rec, blueprint = sparse_bp) %>%
add_model(multi_spec)
multi_lasso_wf
```
Now we have everything we need to tune the regularization penalty and find an appropriate value. Note that we specify `save_pred = TRUE`, so we can create ROC curves and a confusion matrix later. This is especially beneficial for multiclass classification.
```{r multilassors, dependson=c("multilassowf", "multicomplaintsfolds")}
multi_lasso_rs <- tune_grid(
multi_lasso_wf,
multicomplaints_folds,
grid = smaller_lambda,
control = control_resamples(save_pred = TRUE)
)
```
What do we see, in terms of performance metrics?
```{r dependson="multilassors"}
best_acc <- multi_lasso_rs %>%
show_best("accuracy")
best_acc
```
The accuracy metric naturally extends to multiclass tasks, but even the very best value is quite low at `r collect_metrics(multi_lasso_rs) %>% filter(.metric == "accuracy") %>% pull(mean) %>% max() %>% scales::percent(accuracy = 0.1)`, significantly lower than for the binary case in Section \@ref(tunelasso). This is expected since multiclass classification is a harder task than binary classification.
```{block, type = "rmdwarning"}
In binary classification, there is one right answer and one wrong answer; in this multiclass case, there is one right answer and _eight_ wrong answers.
```
To get a more detailed view of how our classifier is performing, let us look at one of the confusion matrices\index{matrix!confusion} in Figure \@ref(fig:multiheatmap).
```{r multiheatmap, fig.width = 10, fig.asp = 0.8, fig.cap="Confusion matrix for multiclass lasso regularized classifier, with most of the classifications along the diagonal"}
multi_lasso_rs %>%
collect_predictions() %>%
filter(penalty == best_acc$penalty) %>%
filter(id == "Fold01") %>%
conf_mat(product, .pred_class) %>%
autoplot(type = "heatmap") +
scale_y_discrete(labels = function(x) str_wrap(x, 20)) +
scale_x_discrete(labels = function(x) str_wrap(x, 20))
```
The diagonal is fairly well populated, which is a good sign. This means that the model generally predicted the right class.
The off-diagonal numbers are all the failures and where we should direct our focus.
It is a little hard to see these cases well since the majority class affects the scale.
A trick to deal with this problem is to remove all the correctly predicted observations.
```{r multiheatmapminusdiag, fig.width = 10, fig.asp = 0.8, fig.cap="Confusion matrix for multiclass lasso regularized classifier without diagonal"}
multi_lasso_rs %>%
collect_predictions() %>%
filter(penalty == best_acc$penalty) %>%
filter(id == "Fold01") %>%
filter(.pred_class != product) %>%
conf_mat(product, .pred_class) %>%
autoplot(type = "heatmap") +
scale_y_discrete(labels = function(x) str_wrap(x, 20)) +
scale_x_discrete(labels = function(x) str_wrap(x, 20))
```
Now we can more clearly see where our model breaks down in \index{matrix!confusion}Figure \@ref(fig:multiheatmapminusdiag). Some of the most common errors are "Credit reporting, credit repair services, or other personal consumer reports" complaints being wrongly predicted as "Debt collection" or "Credit card or prepaid card" complaints. Those mistakes are not hard to understand, since all of these categories deal with credit and debt and overlap in vocabulary.
Knowing what the problem is helps us figure out how to improve our model.
The next step for improving our model is to revisit the data preprocessing\index{preprocessing} steps and model selection.
We can look at different models or model engines that might be able to more easily separate the classes.
Now that we have an idea of where the model isn't working, we can look more closely at the data and attempt to create features that could distinguish between these classes. In Section \@ref(customfeatures) we will demonstrate how you can create your own custom features.
## Case study: including non-text data
We are building a model from a data set that includes more than text data alone. Annotations and labels have been added by the CFPB that we can use during modeling, but we need to ensure that only information that would be available at the time of prediction is included in the model.
Otherwise we will be very disappointed once our model is used to predict on new data!
The variables we identify as available for use as predictors are:
- `date_received`
- `issue`
- `sub_issue`
- `consumer_complaint_narrative`
- `company`
- `state`
- `zip_code`
- `tags`
- `submitted_via`
Let's try including `date_received` in our modeling, along with the text variable we have already used, `consumer_complaint_narrative`, and a new variable `tags`.
The `submitted_via` variable could have been a viable candidate, but all the entries are "web".
The other variables like ZIP code could be of use too, but they are categorical variables with many values so we will exclude them for now.
```{r complaintrec5, dependson="complaintssplit"}
more_vars_rec <-
recipe(product ~ date_received + tags + consumer_complaint_narrative,
data = complaints_train)
```
How should we preprocess the `date_received` variable? We can use the `step_date()` function to extract the month and day of the week (`"dow"`). Then we remove the original date variable and convert the new month and day-of-the-week columns to \index{variables!dummy}indicator variables with `step_dummy()`.
```{block, type = "rmdnote"}
Categorical variables like the month can be stored as strings or factors, but for some kinds of models, they must be converted to indicator or dummy variables. These are numeric binary variables for the levels of the original categorical variable. For example, a variable called `December` would be created that is all zeroes and ones specifying which complaints were submitted in December, plus a variable called `November`, a variable called `October`, and so on.
```
```{r complaintrec6, dependson="complaintrec1"}
more_vars_rec <- more_vars_rec %>%
step_date(date_received, features = c("month", "dow"), role = "dates") %>%
step_rm(date_received) %>%
step_dummy(has_role("dates"))
```
The `tags` variable has some missing data. We can deal with this by using `step_unknown()`, which adds a new level to this factor variable for cases of missing data. Then we "dummify" (create dummy/indicator variables) the variable with `step_dummy()`.\index{variables!dummy}
```{r complaintrec7, dependson="complaintrec2"}
more_vars_rec <- more_vars_rec %>%
step_unknown(tags) %>%
step_dummy(tags)
```
Now we add steps to process the text of the complaints, as before.
```{r complaintrec8, dependson="complaintrec1"}
more_vars_rec <- more_vars_rec %>%
step_tokenize(consumer_complaint_narrative) %>%
step_tokenfilter(consumer_complaint_narrative, max_tokens = 1e3) %>%
step_tfidf(consumer_complaint_narrative)
```
Let's combine this more extensive preprocessing recipe, which handles more variables, with the tunable lasso regularized classification model specification.
```{r lassowfmorevars, dependson=c("complaintstunespec", "complaintrec8")}
more_vars_wf <- workflow() %>%
add_recipe(more_vars_rec, blueprint = sparse_bp) %>%
add_model(tune_spec)
more_vars_wf
```
Let's tune this `workflow()` with our resampled data sets, find a good value for the regularization penalty, and estimate the model's performance.
```{r morevarsrs, dependson=c("lassowfmorevars", "complaintsfolds")}
set.seed(123)
more_vars_rs <- tune_grid(
more_vars_wf,
complaints_folds,
grid = smaller_lambda
)
```
We can extract the metrics for the best-performing regularization penalties from these results with `show_best()` with an option like `"roc_auc"` or `"accuracy"` if we prefer. How did our chosen performance metric turn out for our model that included more than just the text data?
```{r dependson="morevarsrs"}
more_vars_rs %>%
show_best("roc_auc")
```
We see here that including more predictors did not measurably improve our model performance or even change the regularization. With only text features in Section \@ref(casestudysparseencoding) and the same grid and sparse encoding, we achieved a best ROC AUC of `r collect_metrics(sparse_rs) %>% dplyr::filter(.metric == "roc_auc") %>% pull(mean) %>% max() %>% round(3)`, the same as what we see now by including the features dealing with dates and tags as well. The best regularization penalty in Section \@ref(casestudysparseencoding) was `r select_best(sparse_rs, "roc_auc") %>% pull(penalty) %>% round(5) %>% format(scientific = FALSE)` and is about the same here. We can use `tidy()` and some **dplyr** manipulation to find at what rank (`term_rank`), by absolute value of the model coefficient, any of the date or tag variables were included in the regularized results.
```{r dependson="morevarsrs"}
finalize_workflow(more_vars_wf,
select_best(more_vars_rs, "roc_auc")) %>%
fit(complaints_train) %>%
extract_fit_parsnip() %>%
tidy() %>%
arrange(-abs(estimate)) %>%
mutate(term_rank = row_number()) %>%
filter(!str_detect(term, "tfidf"))
```
In our example here, some of the non-text predictors are included in the model with non-zero coefficients but ranked down in the 700s of all model terms, with smaller coefficients than many text terms. They are not that important.
```{block, type = "rmdnote"}
This whole book focuses on supervised machine learning for text data, but models can combine _both_ text predictors and other kinds of predictors.
```
## Case study: data censoring
The complaints data set already has sensitive information (PII)\index{PII} censored or protected using strings such as "XXXX" and "XX".
This data censoring\index{censoring} can be viewed as data _annotation_; specific account numbers and birthdays are protected, but we know they were there. These values would be mostly unique anyway, and likely filtered out in their original form.
Figure \@ref(fig:censoredtrigram) shows the most frequent trigrams (Section \@ref(tokenizingngrams)) in our training data set.
```{r censoredtrigram, dependson="complaintssplit", fig.cap="Many of the most frequent trigrams feature censored information"}
library(tidytext)
complaints_train %>%
slice(1:1000) %>%
unnest_tokens(trigrams,
consumer_complaint_narrative, token = "ngrams",
collapse = NULL) %>%
count(trigrams, sort = TRUE) %>%
mutate(censored = str_detect(trigrams, "xx")) %>%
slice(1:20) %>%
ggplot(aes(n, reorder(trigrams, n), fill = censored)) +
geom_col() +
scale_fill_manual(values = c("grey40", "firebrick")) +
labs(y = "Trigrams", x = "Count")
```
The vast majority of trigrams in Figure \@ref(fig:censoredtrigram) include one or more censored words.
Not only do the most used trigrams include some kind of censoring,
but the censoring itself is informative as it is not used uniformly across the product classes.
In Figure \@ref(fig:trigram25), we take the top-25 most frequent trigrams that include censoring,
and plot the proportions for "Credit" and "Other".
```{r trigram25, fig.cap="Many of the most frequent trigrams feature censored words, but there is a difference in how often they are used within each class"}
top_censored_trigrams <- complaints_train %>%
slice(1:1000) %>%
unnest_tokens(trigrams,
consumer_complaint_narrative, token = "ngrams",
collapse = NULL) %>%
count(trigrams, sort = TRUE) %>%
filter(str_detect(trigrams, "xx")) %>%
slice(1:25)
plot_data <- complaints_train %>%
unnest_tokens(trigrams,
consumer_complaint_narrative, token = "ngrams",
collapse = NULL) %>%
right_join(top_censored_trigrams, by = "trigrams") %>%
count(trigrams, product, .drop = FALSE)
plot_data %>%
ggplot(aes(n, trigrams, fill = product)) +
geom_col(position = "fill")
```
There is a difference in these proportions across classes. Tokens like "on xx xx" are used when referencing a date, e.g., "we had a problem on 06/25/2018".
Remember that the current tokenization engine strips punctuation before tokenizing.
This means that the above example will be turned into "we had a problem on 06 25 2018" before creating n-grams^[The censored\index{censoring} trigrams that include "oh" seem mysterious but upon closer examination, they come from censored addresses, with "oh" representing the US state of Ohio. Most two-letter state abbreviations are censored, but this one is not since it is ambiguous. This highlights the real challenge of anonymizing text.].
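We can check this behavior directly with the default word tokenizer from the **tokenizers** package; a quick sketch:
```{r, eval=FALSE}
# A sketch: the "/" characters act as token boundaries, so the date is
# split into separate number tokens, roughly
# "we" "had" "a" "problem" "on" "06" "25" "2018".
library(tokenizers)
tokenize_words("we had a problem on 06/25/2018")
```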
To crudely simulate what the data might look like before it was censored, we can replace all cases of "XX" and "XXXX" with random integers.
This isn't quite right since dates will be given values between `10` and `99`, and we don't know for sure that only numerals have been censored, but it gives us a place to start.
Below is a simple function `uncensor_vec()` that locates all instances of `"XX"` and replaces them with a number between 10 and 99.
We don't need to handle the special case of `XXXX` separately, since each `XX` within it is replaced automatically.
```{r uncensorvec}
uncensor <- function(n) {
as.character(sample(seq(10 ^ (n - 1), 10 ^ n - 1), 1))
}
uncensor_vec <- function(x) {
locs <- str_locate_all(x, "XX")
map2_chr(x, locs, ~ {
for (i in seq_len(nrow(.y))) {
str_sub(.x, .y[i, 1], .y[i, 2]) <- uncensor(2)
}
.x
})
}
```
We can run a quick test to see how it works.
```{r, dependson="uncensorvec"}
uncensor_vec("In XX/XX/XXXX I leased a XXXX vehicle")
```
Now we can produce the same visualization as Figure \@ref(fig:censoredtrigram) but can also apply our uncensoring function to the text before tokenizing.
```{r uncensoredtrigram, dependson=c("complaintssplit", "uncensor_vec"), fig.cap="Trigrams without numbers float to the top as the uncensored tokens are too spread out"}
complaints_train %>%
slice(1:1000) %>%
mutate(text = uncensor_vec(consumer_complaint_narrative)) %>%
unnest_tokens(trigrams, text, token = "ngrams",
collapse = NULL) %>%
count(trigrams, sort = TRUE) %>%
mutate(censored = str_detect(trigrams, "xx")) %>%
slice(1:20) %>%
ggplot(aes(n, reorder(trigrams, n), fill = censored)) +
geom_col() +
scale_fill_manual(values = c("grey40", "firebrick")) +
labs(y = "Trigrams", x = "Count")
```
Here in Figure \@ref(fig:uncensoredtrigram), we see the same trigrams that appeared in Figure \@ref(fig:censoredtrigram).
However, none of the uncensored words appear, because of our uncensoring function.
This is expected, because while `"xx xx 2019"` appears in the first plot, indicating a date in the year 2019, after we uncensor it, it is split into 365 buckets (actually more, since we replaced each `"XX"` with a value between `10` and `99`).
Censoring the dates in these complaints gives more power to a date as a general construct.
```{block, type = "rmdwarning"}
What happens when we use these censored dates as a feature in supervised machine learning? We have a higher chance of understanding if dates in the complaint text are important to predicting the class, but we are blinded to the possibility that certain dates and months are more important.
```
Data censoring\index{censoring} can be a form of preprocessing in your data pipeline.
For example, it is highly unlikely to be useful (or ethical/legal) to have any specific person's social security number, credit card number, or any other kind of PII\index{PII} embedded into your model. Such values appear rarely and are most likely highly correlated with other known variables in your data set.
More importantly, that information can become embedded in your model and begin to leak as demonstrated by @carlini2018secret, @Fredrikson2014, and @Fredrikson2015.
Both of these issues are important, and one of them could land you in a lot of legal trouble.
Exposing such PII to modeling is an example of where we should all stop to ask, "Should we even be doing this?" as we discussed in the overview to these chapters.
If you have social security numbers in text data, you should definitely not pass them on to your machine learning model, but you may consider the option of annotating the _presence_ of a social security number.
Since a social security number has a very specific form, we can easily construct a \index{regex}regular expression (Appendix \@ref(regexp)) to locate them.
```{block, type = "rmdnote"}
A social security number comes in the form `AAA-BB-CCCC` where `AAA` is a number between `001` and `899` excluding `666`, `BB` is a number between `01` and `99` and `CCCC` is a number between `0001` and `9999`. This gives us the following regex:
`(?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4}`
```
We can use a function to replace each social security number with an indicator that can be detected later by \index{preprocessing}preprocessing steps.
It's a good idea to use a "word" that won't be accidentally broken up by a tokenizer.
```{r}
ssn_text <- c("My social security number is 498-08-6333",
"No way, mine is 362-60-9159",
"My parents numbers are 575-32-6985 and 576-36-5202")
ssn_pattern <- "(?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4}"
str_replace_all(string = ssn_text,
pattern = ssn_pattern,
replacement = "ssnindicator")
```
This technique isn't useful only for personally identifiable information\index{PII} but can be used anytime you want to gather similar words in the same bucket; hashtags, email addresses, and usernames can sometimes benefit from being annotated in this way.
```{block2, type = "rmdwarning"}
The practice of data re-identification or de-anonymization, where seemingly or partially "anonymized" data sets are mined to identify individuals, is out of scope for this section and our book. However, this is a significant and important issue for any data practitioner dealing with PII, and we encourage readers to familiarize themselves with results such as @Sweeney2000 and current best practices to protect against such mining.
```
## Case study: custom features {#customfeatures}
Most of what we have looked at so far has boiled down to counting tokens and weighting them in one way or another.
This approach is quite broad and domain agnostic, but you as a data practitioner often have specific knowledge about your data set that you should use in feature engineering.\index{feature engineering}
Your domain knowledge allows you to build more predictive features than a naive count of simple tokens can provide.
As long as you can reasonably formulate what you are trying to count, chances are you can write a function that can detect it.
This is where having a little bit of knowledge about regular expressions pays off.
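For example, reusing the dollar-amount regex from Section \@ref(classfirstattemptlookatdata), here is a minimal sketch of such a counting function (the function name `count_money()` is ours, for illustration):
```{r, eval=FALSE}
# A sketch of a custom feature: how many censored dollar amounts, like
# "{$21.00}", appear in a complaint narrative.
count_money <- function(x) {
  str_count(x, "\\{\\$[0-9\\.]*\\}")
}

count_money("I was charged {$21.00} and then {$45.00} more")
```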