# Convolutional neural networks {#dlcnn}
```{r include = FALSE}
library(keras)
knitr::opts_chunk$set(message = FALSE,
                      warning = FALSE, error = TRUE)
tensorflow::tf$random$set_seed(1234)
junk <- keras_model_sequential()

hook_output <- knitr::knit_hooks$get("output")
knitr::knit_hooks$set(output = function(x, options) {
  # this hook is used only when the linewidth option is not NULL
  if (!is.null(n <- options$linewidth)) {
    x <- knitr:::split_lines(x)
    # any lines wider than n should be wrapped
    if (any(nchar(x) > n)) x <- strwrap(x, width = n)
    x <- paste(x, collapse = "\n")
  }
  hook_output(x, options)
})
```
The first neural networks\index{network architecture} we built in Chapter \@ref(dldnn) did not have the capacity to learn much about structure, sequences, or long-range dependencies in our text data. The LSTM networks we trained in Chapter \@ref(dllstm) were especially suited to learning long-range dependencies. In this final chapter, we will focus on \index{neural network!convolutional} **convolutional neural network** (CNN) architecture [@kim2014], which can learn local, spatial structure within data.
CNNs can be well-suited for modeling text data because text often contains quite a lot of local structure. A CNN does not learn long-range structure within a sequence like an LSTM, but instead detects local patterns. A CNN network layer takes data (like text) as input and then hopefully produces output that represents specific structures\index{language!structure} in the data.
```{block, type = "rmdnote"}
Let's take more time with CNNs in this chapter to explore their construction, different features, and the hyperparameters we can tune.
```
## What are CNNs?
CNNs can work with data of different dimensions (like two-dimensional images or three-dimensional video), but for text modeling, we typically work in one dimension. The illustrations and explanations in this chapter use only one dimension to match the text use case.
Figure \@ref(fig:cnn-architecture) illustrates a typical CNN architecture.
A convolutional filter slides along the sequence to produce a new, smaller sequence. This is repeated multiple times, typically with different parameters for each layer, until we are left with a small data cube that we can transform into our required output shape, a value between 0 and 1 in the case of binary classification.
```{r cnn-architecture, echo= FALSE, fig.cap="A template CNN architecture for one-dimensional input data. A sequence of consecutive CNN layers incrementally reduces the size, ending with a single output value.", out.width="100%"}
knitr::include_graphics("diagram-files/cnn-architecture.png")
```
This figure isn't entirely accurate because we technically don't feed characters into a CNN, but instead use one-hot sequence encoding (Section \@ref(onehotsequence)) with a possible word embedding.
Let's talk about two of the most important CNN concepts, **kernels** and **kernel size**.
### Kernel
The kernel is a small vector of weights that slides along the input. At each position, it multiplies the input values element-wise by its own weights and then sums the products to get a single value.
Sometimes an activation function is applied as well.
It is these weights that are trained via gradient descent to find the best fit.
In Keras, the `filters` represent how many different kernels are trained in each layer. You typically start with fewer `filters` at the beginning of your network and then increase them as you go along.
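As a rough illustration of what a single kernel computes, here is a minimal sketch in base R. The helper `conv_1d_sketch()` and the toy values are made up for this example, and the sketch assumes a stride of 1 and no padding; it is not meant to show how Keras implements the operation internally.

```{r}
# Hypothetical helper: slide `kernel` along `x` (stride 1, no padding).
# At each position, multiply element-wise and sum to get one value.
conv_1d_sketch <- function(x, kernel) {
  n_out <- length(x) - length(kernel) + 1
  vapply(
    seq_len(n_out),
    function(i) sum(x[i:(i + length(kernel) - 1)] * kernel),
    numeric(1)
  )
}

toy_input <- c(1, 2, 3, 4, 5, 6)
toy_kernel <- c(1, 0, -1)  # made-up weights; a real kernel learns these
conv_1d_sketch(toy_input, toy_kernel)
```

The output has one value per position the kernel can occupy, so it is shorter than the input. A layer with several `filters` would repeat this with a different set of weights for each filter.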
### Kernel size
The most prominent hyperparameter is the kernel size.
The kernel size is the length of the vector that contains the weights. A kernel of size 5 will have 5 weights. These kernels can capture local information, much as n-grams capture patterns among neighboring tokens. Increasing the size of the kernel decreases the size of the output, as shown in Figure \@ref(fig:cnn-kernel-size).
```{r cnn-kernel-size, echo= FALSE, fig.cap="The kernel size affects the size of the output. A kernel size of 3 uses the information from 3 values to compute 1 value.", out.width="100%"}
knitr::include_graphics("diagram-files/cnn-kernel-size.png")
```
Larger kernels learn larger and less frequent patterns, while smaller kernels will find fine-grained features.
Notice how the choice of token affects how we think about kernel size.
For character tokenization, a kernel size of 5 will (in early layers) find patterns in subwords more often than patterns across words, since 5 characters will typically not span multiple words.
By contrast, a kernel size of 5 with word tokenization will learn patterns within sentences instead.
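To make the contrast concrete, the following sketch lists what a window of 5 tokens "sees" under character and word tokenization. The example blurb and the helper `token_windows()` are made up for illustration, and splitting on single spaces is a simplification of real tokenization.

```{r}
toy_blurb <- "a new board game for cat lovers"
toy_chars <- strsplit(toy_blurb, "")[[1]]   # character tokens
toy_words <- strsplit(toy_blurb, " ")[[1]]  # word tokens (naive split)

# Hypothetical helper: all windows of `size` consecutive tokens
token_windows <- function(tokens, size, sep) {
  n_windows <- length(tokens) - size + 1
  vapply(
    seq_len(n_windows),
    function(i) paste(tokens[i:(i + size - 1)], collapse = sep),
    character(1)
  )
}

head(token_windows(toy_chars, size = 5, sep = ""), 3)
head(token_windows(toy_words, size = 5, sep = " "), 3)
```

The character windows cover at most a word or two, while each word window covers most of the blurb.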
## A first CNN model {#firstcnn}
\index{neural network!convolutional}We will be using the same data, which we examined in Sections \@ref(kickstarter) and \@ref(kickstarter-blurbs) and used throughout Chapters \@ref(dldnn) and \@ref(dllstm). This data set contains short text blurbs for prospective crowdfunding campaigns on Kickstarter, along with whether they were successful. Our modeling goal is to predict successful campaigns from the text contained in the blurb. We will also use the same \index{preprocessing}preprocessing and feature engineering recipe that we created and described in Sections \@ref(dnnrecipe) and \@ref(firstlstm).
```{r include=FALSE}
library(tidyverse)

kickstarter <- read_csv("data/kickstarter.csv.gz")
kickstarter

library(tidymodels)
set.seed(1234)
kickstarter_split <- kickstarter %>%
  filter(nchar(blurb) >= 15) %>%
  initial_split()

kickstarter_train <- training(kickstarter_split)
kickstarter_test <- testing(kickstarter_split)

library(textrecipes)

max_words <- 20000
max_length <- 30

kick_rec <- recipe(~blurb, data = kickstarter_train) %>%
  step_tokenize(blurb) %>%
  step_tokenfilter(blurb, max_tokens = max_words) %>%
  step_sequence_onehot(blurb, sequence_length = max_length)

set.seed(234)
kick_val <- validation_split(kickstarter_train, strata = state)

kick_prep <- prep(kick_rec)

kick_analysis <- bake(kick_prep, new_data = analysis(kick_val$splits[[1]]),
                      composition = "matrix")
kick_assess <- bake(kick_prep, new_data = assessment(kick_val$splits[[1]]),
                    composition = "matrix")

state_analysis <- analysis(kick_val$splits[[1]]) %>% pull(state)
state_assess <- assessment(kick_val$splits[[1]]) %>% pull(state)
```
Our first CNN will look a lot like what is shown in Figure \@ref(fig:cnn-architecture).
We start with an embedding layer, followed by a single one-dimensional convolution layer `layer_conv_1d()`, then a global max pooling layer `layer_global_max_pooling_1d()`, a densely connected layer, and end with a dense layer with a sigmoid activation function to give us one value between 0 and 1 to use in our binary classification task.
```{r}
library(keras)
simple_cnn_model <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_words + 1, output_dim = 16,
                  input_length = max_length) %>%
  layer_conv_1d(filters = 32, kernel_size = 5, activation = "relu") %>%
  layer_global_max_pooling_1d() %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

simple_cnn_model
```
We are using the same embedding layer with the same `max_length` as in the previous networks so there is nothing new there.
The `layer_global_max_pooling_1d()` layer collapses the remaining CNN output into one dimension so we can finish it off with a densely connected layer and the sigmoid activation function.
This might not end up being the best CNN configuration, but it is a good starting point.
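As a small illustration of what global max pooling does, the sketch below uses a made-up matrix of activations with one row per sequence position and one column per filter. The values and the name `toy_activations` are assumptions for this example only.

```{r}
# Made-up activations: 4 positions (rows) by 3 filters (columns)
toy_activations <- matrix(
  c(0.1, 0.7, 0.2,
    0.9, 0.0, 0.3,
    0.4, 0.5, 0.8,
    0.2, 0.6, 0.1),
  nrow = 4, byrow = TRUE
)

# Global max pooling keeps the largest activation of each filter,
# collapsing the position dimension entirely
apply(toy_activations, 2, max)
```

Whatever the sequence length, the result has one value per filter, which is why the layer's output can be passed straight to a dense layer.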
One of the challenges when working with CNNs is to ensure that we manage the dimensionality correctly. With the default settings we use here, the length of the sequence decreases by `(kernel_size - 1)` for each layer. For this input, we have a sequence of length `max_length = 30`, which is decreased by `(5 - 1) = 4`, resulting in a sequence of 26, as shown in the printed output of `simple_cnn_model`. We could create seven layers with `kernel_size = 5`, since we would end with `30 - 4 - 4 - 4 - 4 - 4 - 4 - 4 = 2` elements in the resulting sequence. However, we would not be able to build a network with 3 layers of `kernel_size = 7` followed by 3 layers of `kernel_size = 5`, since the resulting sequence would be `30 - 6 - 6 - 6 - 4 - 4 - 4 = 0` and we must have a positive length for our sequence.
Remember that `kernel_size` is not the only argument that will change the length of the resulting sequence. \index{network architecture}
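The length arithmetic above can be sketched as a small function. `conv_output_length()` is a hypothetical helper written for this illustration, and it assumes the settings used in this section: a stride of 1, no padding, and no dilation.

```{r}
# Hypothetical helper: sequence length left after stacking convolutional
# layers with the given kernel sizes (stride 1, no padding, no dilation)
conv_output_length <- function(input_length, kernel_sizes) {
  input_length - sum(kernel_sizes - 1)
}

conv_output_length(30, 5)                    # one layer of size 5
conv_output_length(30, rep(5, 7))            # seven layers of size 5
conv_output_length(30, c(7, 7, 7, 5, 5, 5))  # nothing left: invalid
```

A result of zero or less signals an architecture that cannot be built for that input length.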
```{block, type = "rmdnote"}
Constructing a model layer by layer and using the print method from **keras** to check the configuration is a great way to make sure your architecture is valid.
```
The compilation and fitting are the same as we have seen before, using a validation split created with tidymodels as shown in Sections \@ref(evaluate-dnn) and \@ref(lstmevaluation).
```{r}
simple_cnn_model %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

cnn_history <- simple_cnn_model %>% fit(
  x = kick_analysis,
  y = state_analysis,
  batch_size = 512,
  epochs = 10,
  validation_data = list(kick_assess, state_assess)
)
```
\index{optimization algorithm}
```{block, type = "rmdnote"}
We are using the `"adam"` optimizer since it performs well for many kinds of models. You may have to experiment to find the optimizer that works best for your model and data.
```
Now that the model is done fitting, we can evaluate it on the validation data set using the same `keras_predict()` function we created in Section \@ref(evaluate-dnn) and used throughout Chapters \@ref(dldnn) and \@ref(dllstm).
```{r}
val_res <- keras_predict(simple_cnn_model, kick_assess, state_assess)
val_res
```
We can calculate some standard metrics with `metrics()`.
```{r}
metrics(val_res, state, .pred_class, .pred_1)
```
We already see improvement over the densely connected network from Chapter \@ref(dldnn), our best performing model on the Kickstarter data so far.
The heatmap\index{matrix!confusion} in Figure \@ref(fig:cnnheatmap) shows that the model performs about the same for the two classes, success and failure for the crowdfunding campaigns; we are getting fairly good results from a baseline CNN model!
```{r cnnheatmap, fig.cap="Confusion matrix for first CNN model predictions of Kickstarter campaign success"}
val_res %>%
  conf_mat(state, .pred_class) %>%
  autoplot(type = "heatmap")
```
The ROC curve in Figure \@ref(fig:cnnroccurve) shows how the model performs at different thresholds.
```{r cnnroccurve, opts.label = "fig.square", fig.cap="ROC curve for first CNN model predictions of Kickstarter campaign success"}
val_res %>%
  roc_curve(truth = state, .pred_1) %>%
  autoplot() +
  labs(
    title = "Receiver operating characteristic curve for Kickstarter blurbs"
  )
```
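To see what "different thresholds" means for an ROC curve, here is a minimal base R sketch using made-up labels and predicted probabilities. The values and the helper `rates_at()` are assumptions for illustration and are unrelated to the Kickstarter results.

```{r}
# Toy, made-up values for illustration only
toy_truth <- c(1, 1, 1, 0, 0, 0)
toy_prob <- c(0.9, 0.7, 0.4, 0.6, 0.2, 0.1)

# Hypothetical helper: sensitivity and specificity at one threshold
rates_at <- function(threshold) {
  pred <- as.integer(toy_prob >= threshold)
  c(
    threshold = threshold,
    sensitivity = sum(pred == 1 & toy_truth == 1) / sum(toy_truth == 1),
    specificity = sum(pred == 0 & toy_truth == 0) / sum(toy_truth == 0)
  )
}

toy_rates <- t(sapply(c(0.3, 0.5, 0.8), rates_at))
toy_rates
```

Each row corresponds to one point on an ROC curve: raising the threshold trades sensitivity for specificity.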
## Case study: adding more layers
Now that we know how our basic CNN performs, we can see what happens when we apply some common modifications to it.
This case study will examine:
- how we can add additional _convolutional_ layers to our base model and
- how additional _dense_ layers can be added.
\index{network architecture}Let's start by adding another fully connected layer. We take the architecture we used in `simple_cnn_model` and add another `layer_dense()` after the first `layer_dense()` in the model.
Increasing the depth of the model via the fully connected layers allows the model to find more complex patterns.
There is, however, a trade-off. Adding more layers adds more weights to the model, making it more complex and harder to train. If you don't have enough data or the patterns you are trying to classify aren't that complex, then model performance will suffer since the model will start overfitting as it starts memorizing patterns in the training data that don't generalize to new data.
```{block, type = "rmdwarning"}
When working with CNNs, the different layers perform different tasks. A convolutional layer extracts local patterns as it slides along the sequences, while a fully connected layer finds global patterns.
```
We can think of the convolutional layers as doing preprocessing\index{preprocessing} on the text, which is then fed into the dense neural network that tries to fit the best curve. Adding more fully connected layers allows the network to create more intricate curves, and adding more convolutional layers creates richer features that are used when fitting the curves. Your job when constructing a CNN is to make the architecture just complex enough to match the data without overfitting. One ad-hoc rule to follow when refining your network architecture is to start small and keep adding layers until the validation error does not improve anymore.
```{r}
cnn_double_dense <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_words + 1, output_dim = 16,
                  input_length = max_length) %>%
  layer_conv_1d(filters = 32, kernel_size = 5, activation = "relu") %>%
  layer_global_max_pooling_1d() %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

cnn_double_dense
```
We can compile and fit this new model. We will try to keep as much as we can constant as we compare the different models.
```{r}
cnn_double_dense %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

history <- cnn_double_dense %>% fit(
  x = kick_analysis,
  y = state_analysis,
  batch_size = 512,
  epochs = 10,
  validation_data = list(kick_assess, state_assess)
)
```
```{r}
val_res_double_dense <- keras_predict(
  cnn_double_dense,
  kick_assess,
  state_assess
)

metrics(val_res_double_dense, state, .pred_class, .pred_1)
```
This model performs well, but it is not entirely clear that it is working much better than the first CNN model we tried. This could be an indication that the original model had enough fully connected layers for the amount of training data we have available.
```{block, type = "rmdwarning"}
If we have two models with nearly identical performance, we should choose the less complex of the two, since it will be faster to train and to generate predictions.
```
We can also change the number of convolutional layers, by adding more such layers.
```{r}
cnn_double_conv <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_words + 1, output_dim = 16,
                  input_length = max_length) %>%
  layer_conv_1d(filters = 32, kernel_size = 5, activation = "relu") %>%
  layer_max_pooling_1d(pool_size = 2) %>%
  layer_conv_1d(filters = 64, kernel_size = 3, activation = "relu") %>%
  layer_global_max_pooling_1d() %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

cnn_double_conv
```
There are a lot of different ways we can extend the network by adding convolutional layers with `layer_conv_1d()`. We must consider the individual characteristics of each layer, with respect to kernel size, as well as other CNN parameters we have not discussed in detail yet like stride, padding, and dilation rate. We also have to consider the progression of these layers within the network itself.
The model is using an increasing number of filters in each layer, doubling the number of filters for each layer. This is to ensure that there are more filters later on to capture enough of the global information.
This model is using a kernel size of 5 in the first convolutional layer and a kernel size of 3 in the second. There aren't any hard rules about how you structure kernel sizes, but the sizes you choose will change what features the model can detect.
```{block, type = "rmdnote"}
The early layers extract general or low-level features while the later layers learn finer detail or high-level features in the data. The choice of kernel size determines the size of these features.
```
Having a small kernel size in the first layer will let the model detect low-level features locally.
We are also including a max-pooling layer with `layer_max_pooling_1d()` between the convolutional layers. This layer performs a pooling operation that calculates the maximum values in its pooling window; in this model, that is set to 2.
This is done in the hope that the pooled features will be able to perform better by weeding out the small activations.
This is another parameter you can tinker with when you are designing the network.
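The pooling operation itself is simple enough to sketch in base R. The helper `max_pool_1d_sketch()` and the toy values are made up for this illustration; the sketch assumes non-overlapping windows and drops any leftover values that do not fill a whole window.

```{r}
# Hypothetical helper: non-overlapping max pooling along one dimension.
# Leftover values that do not fill a whole window are dropped, which is
# an assumption of this sketch.
max_pool_1d_sketch <- function(x, pool_size = 2) {
  n_windows <- length(x) %/% pool_size
  vapply(
    seq_len(n_windows),
    function(i) max(x[((i - 1) * pool_size + 1):(i * pool_size)]),
    numeric(1)
  )
}

max_pool_1d_sketch(c(0.1, 0.8, 0.3, 0.2, 0.9, 0.4), pool_size = 2)
```

With a pooling window of 2, the sequence length is halved. In `cnn_double_conv`, this would take the sequence of length 26 produced by the first convolutional layer down to 13 before the second convolutional layer, with its kernel size of 3, reduces it to 11; the printed model is the place to confirm these shapes.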
We compile this model like the others, again trying to keep as much as we can constant. The only thing that changed in this model compared to the first is the addition of a `layer_max_pooling_1d()` and a `layer_conv_1d()`.
```{r}
cnn_double_conv %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

history <- cnn_double_conv %>% fit(
  x = kick_analysis,
  y = state_analysis,
  batch_size = 512,
  epochs = 10,
  validation_data = list(kick_assess, state_assess)
)
```
```{r}
val_res_double_conv <- keras_predict(
  cnn_double_conv,
  kick_assess,
  state_assess
)

metrics(val_res_double_conv, state, .pred_class, .pred_1)
```
This model also performs well compared to earlier results. Let's combine the predictions from our three models, which we extracted using the `keras_predict()` function we defined in Section \@ref(evaluate-dnn).
```{r}
all_cnn_model_predictions <- bind_rows(
  mutate(val_res, model = "Basic CNN"),
  mutate(val_res_double_dense, model = "Double Dense"),
  mutate(val_res_double_conv, model = "Double Conv")
)

all_cnn_model_predictions
```
Now that the results are combined in `all_cnn_model_predictions` we can calculate group-wise evaluation statistics by grouping them by the `model` variable.
```{r}
all_cnn_model_predictions %>%
  group_by(model) %>%
  metrics(state, .pred_class, .pred_1)
```
We can also compute ROC curves for all our models so far. Figure \@ref(fig:allcnnroccurve) shows the three different ROC curves together in one chart.
```{r allcnnroccurve, opts.label = "fig.square", fig.cap="ROC curve for three CNN variants' predictions of Kickstarter campaign success"}
all_cnn_model_predictions %>%
  group_by(model) %>%
  roc_curve(truth = state, .pred_1) %>%
  autoplot() +
  labs(
    title = "Receiver operating characteristic curve for Kickstarter blurbs"
  )
```
The curves are _very_ close in this chart, indicating that adding more layers does not improve performance substantively.
This doesn't mean that we are done with CNNs! There are still many things we can explore, like different tokenization approaches and hyperparameters that can be tuned.
## Case study: byte pair encoding
\index{tokenization!subword}In our models in this chapter so far we have used words as the token of interest. We saw in Section \@ref(casestudyngrams) how n-grams can be used in modeling as well.
One of the reasons why the Kickstarter data set is hard to work with is that the text is quite short, so we don't have that many individual tokens to work with in a given blurb.
Another choice of token is _subwords_, where we split the text into smaller units than words; longer words especially will be broken into multiple subword units. One way to tokenize text into subword units is _byte pair encoding_ [@Gage1994ANA].
This algorithm has been repurposed to work on text by iteratively merging frequently occurring subword pairs.
Methods such as [BERT](https://github.com/google-research/bert) and [GPT-2](https://openai.com/blog/better-language-models/) use subword units for text with great success.
The byte pair encoding algorithm has a hyperparameter controlling the size of the vocabulary. Setting it to higher values allows the models to find more rarely used character sequences in the text.
Byte pair encoding offers a good trade-off between character-level and word-level information, and can also encode unknown words. For example, suppose that the model is aware of the word "woman". A simple tokenizer would have to put a word such as "womanhood" into an unknown bucket or ignore it completely, whereas byte pair encoding should be able to pick up on the subwords "woman" and "hood" (or "woman", "h", and "ood", depending on whether the model found "hood" as a common enough subword).
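To give a feel for the merging idea, here is a deliberately simplified base R sketch that learns a few merges from a tiny made-up word list. The helper `learn_bpe_merges()` is hypothetical; it ignores word frequencies, special tokens, and the other details a real implementation such as the one behind **tokenizers.bpe** handles, and it assumes the words contain no spaces.

```{r}
# Hypothetical helper: learn `n_merges` byte pair encoding merges from a
# character vector of words. A simplified sketch, not production code.
learn_bpe_merges <- function(words, n_merges) {
  # each word starts out as a vector of single characters
  symbols <- strsplit(words, "")
  merges <- character(0)
  for (step in seq_len(n_merges)) {
    # count every adjacent pair of symbols across all words
    pairs <- unlist(lapply(symbols, function(s) {
      if (length(s) < 2) return(character(0))
      paste(s[-length(s)], s[-1])
    }))
    if (length(pairs) == 0) break
    pair_counts <- table(pairs)
    best <- names(pair_counts)[which.max(pair_counts)]
    parts <- strsplit(best, " ")[[1]]
    merged <- paste0(parts, collapse = "")
    merges <- c(merges, merged)
    # replace each occurrence of the most frequent pair with one symbol
    symbols <- lapply(symbols, function(s) {
      out <- character(0)
      i <- 1
      while (i <= length(s)) {
        if (i < length(s) && s[i] == parts[1] && s[i + 1] == parts[2]) {
          out <- c(out, merged)
          i <- i + 2
        } else {
          out <- c(out, s[i])
          i <- i + 1
        }
      }
      out
    })
  }
  list(merges = merges, symbols = symbols)
}

toy_bpe <- learn_bpe_merges(c("woman", "woman", "womanhood", "hood"),
                            n_merges = 4)
toy_bpe$merges
toy_bpe$symbols[[3]]  # how "womanhood" is split after four merges
```

In this toy example the pairs inside "woman" are the most frequent, so they are merged first, and "womanhood" ends up as the subword "woman" followed by smaller pieces. Allowing more merges, which corresponds to a larger vocabulary, would let longer pieces such as "hood" become single tokens as well.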
Using a subword tokenizer such as byte pair encoding should let us see the text with more granularity since we will have more and smaller tokens for each observation.
```{block2, type = "rmdnote"}
Character-level CNNs have also proven successful in some contexts. They have been explored by @Zhang2015 and work quite well on some shorter texts such as headlines and tweets [@Vosoughi2016].
```
We need to remind ourselves that these models don't contain any linguistic knowledge at all; they only "learn" the morphological\index{morphology} patterns of sequences of characters (Section \@ref(morphology)) in the training set. This does not make the models useless, but it should set our expectations about what any given model is capable of.
Since we are using a completely different preprocessing approach, we need to specify a new feature engineering recipe.
```{block, type = "rmdpackage"}
The **textrecipes** package has a tokenization engine to perform byte pair encoding, but we need to determine the vocabulary size and the appropriate sequence length.
```
Let's write a function that takes a character vector and a vocabulary size and returns a dataframe with the number of tokens in each observation.
```{r}
library(textrecipes)
get_bpe_token_dist <- function(vocab_size, x) {
  recipe(~text, data = tibble(text = x)) %>%
    step_mutate(text = tolower(text)) %>%
    step_tokenize(text,
                  engine = "tokenizers.bpe",
                  training_options = list(vocab_size = vocab_size)) %>%
    prep() %>%
    bake(new_data = NULL) %>%
    transmute(n_tokens = lengths(textrecipes:::get_tokens(text)),
              vocab_size = vocab_size)
}
```
We can use `map_dfr()` to try a handful of different vocabulary sizes and bind the results into a single dataframe.
```{r}
bpe_token_dist <- map_dfr(
  c(2500, 5000, 10000, 20000),
  get_bpe_token_dist,
  kickstarter_train$blurb
)

bpe_token_dist
```
If we compare with the word count distribution we saw in Figure \@ref(fig:kickstarterwordlength), then we see in Figure \@ref(fig:kickstartersubwordlength) that any of these choices for vocabulary size will result in more tokens overall.
```{r kickstartersubwordlength, fig.cap="Distribution of subword count for Kickstarter campaign blurbs for different vocabulary sizes"}
bpe_token_dist %>%
  ggplot(aes(n_tokens)) +
  geom_bar() +
  facet_wrap(~vocab_size) +
  labs(x = "Number of subwords per campaign blurb",
       y = "Number of campaign blurbs")
```
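One rough way to reason about a sequence length is to ask what share of observations would fit without truncation. The sketch below uses made-up token counts in `toy_n_tokens` rather than the Kickstarter data, so the numbers are purely illustrative.

```{r}
# Made-up token counts standing in for one vocabulary size
toy_n_tokens <- c(12, 18, 22, 25, 27, 30, 31, 33, 36, 41)

# Share of observations that fit without truncation at each length
candidate_lengths <- c(20, 30, 40)
toy_coverage <- sapply(candidate_lengths,
                       function(len) mean(toy_n_tokens <= len))
setNames(toy_coverage, candidate_lengths)
```

A longer sequence length truncates fewer observations but adds more padding for the short ones, so the choice is a trade-off rather than a fixed rule.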
Let's pick a vocabulary size of 10,000 and a corresponding sequence length of 40. To use byte pair encoding as a tokenizer in textrecipes, set `engine = "tokenizers.bpe"`; the vocabulary size can be specified using the `training_options` argument. Everything else in the recipe stays the same.
```{r}
max_subwords <- 10000
bpe_max_length <- 40

bpe_rec <- recipe(~blurb, data = kickstarter_train) %>%
  step_mutate(blurb = tolower(blurb)) %>%
  step_tokenize(blurb,
                engine = "tokenizers.bpe",
                training_options = list(vocab_size = max_subwords)) %>%
  step_sequence_onehot(blurb, sequence_length = bpe_max_length)

bpe_prep <- prep(bpe_rec)

bpe_analysis <- bake(bpe_prep, new_data = analysis(kick_val$splits[[1]]),
                     composition = "matrix")
bpe_assess <- bake(bpe_prep, new_data = assessment(kick_val$splits[[1]]),
                   composition = "matrix")
```
Our model will be very similar to the baseline CNN model from Section \@ref(firstcnn); we'll use a larger kernel size of 7 to account for the finer detail in the tokens.
```{r}
cnn_bpe <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_words + 1, output_dim = 16,
                  input_length = bpe_max_length) %>%
  layer_conv_1d(filters = 32, kernel_size = 7, activation = "relu") %>%
  layer_global_max_pooling_1d() %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

cnn_bpe
```
We can compile and train like we have done so many times now.
```{r}
cnn_bpe %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)

bpe_history <- cnn_bpe %>% fit(
  bpe_analysis,
  state_analysis,
  epochs = 10,
  validation_data = list(bpe_assess, state_assess),
  batch_size = 512
)

bpe_history
```
The model performs quite well, which is a pleasant surprise! This is what we hoped would happen if we switched to a higher-detail tokenizer.
The \index{matrix!confusion}confusion matrix in Figure \@ref(fig:bpeheatmap) also clearly shows that there isn't much bias between the two classes with this new tokenizer.
```{r bpeheatmap, fig.cap="Confusion matrix for CNN model using byte pair encoding tokenization"}
val_res_bpe <- keras_predict(cnn_bpe, bpe_assess, state_assess)

val_res_bpe %>%
  conf_mat(state, .pred_class) %>%
  autoplot(type = "heatmap")
```
What are the subwords being used in this model? We can extract them from `step_sequence_onehot()` using `tidy()` on the prepped recipe. All the tokens that start with an `"h"` are seen here.
```{r}
# reuse the recipe we already prepped for the model
bpe_prep %>%
  tidy(3) %>%
  filter(str_detect(token, "^h")) %>%
  pull(token)
```
Notice how some of these subword tokens\index{tokenization!subword} are full words, and some are parts of words. This is what allows the model to be able to "read" long unknown words by combining many smaller subwords.
We can also look at common long words.
```{r}
# reuse the recipe we already prepped for the model
bpe_prep %>%
  tidy(3) %>%
  arrange(desc(nchar(token))) %>%
  slice_head(n = 25) %>%
  pull(token)
```
These 25 words were common enough to get their own subword token, and they help us understand the nature of these Kickstarter crowdfunding campaigns.
```{block, type = "rmdwarning"}
Examining the longest subword tokens gives you a good sense of the data you are working with!
```
## Case study: explainability with LIME {#lime}
\index{models!explainability}
\index{models!interpretability|see {models, explainability}}
We noted in Section \@ref(dllimitations) that one of the significant limitations of deep learning models is that they are hard to reason about. One of the ways to understand a predictive model, even a "black box"\index{black box} one, is using an algorithm for observation-level variable importance like the *Local Interpretable Model-Agnostic Explanations* [@ribeiro2016why] algorithm, or **LIME** for short.
```{block, type = "rmdnote"}
As indicated by its name, LIME is an approach to compute local feature importance, or explainability at the individual observation level. It does not offer global feature importance, or explainability for the model as a whole.
```
```{block2, type = "rmdpackage"}
The **lime** package in R [@R-lime] implements the LIME algorithm; it can take a prediction from a model and determine a small set of features in the original data that drives the outcome of the prediction.
```
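Before using the package, it can help to see the general idea in miniature. The sketch below is a heavily simplified, base R illustration of a LIME-style explanation for one sentence: perturb the sentence by dropping words, ask the model for predictions, and fit a weighted linear model to those predictions. Everything in it is made up for illustration, including the stand-in model `toy_black_box()` and the similarity weighting; the **lime** package uses its own, more careful choices.

```{r}
set.seed(123)

# Made-up "black box": scores a sentence higher when it mentions "game"
# and lower when it mentions "delay". Stands in for a trained model.
toy_black_box <- function(words) {
  plogis(-0.5 + 2 * ("game" %in% words) - 1.5 * ("delay" %in% words))
}

toy_sentence <- c("a", "board", "game", "without", "delay")

# 1. Perturb: randomly keep (1) or drop (0) each word
n_perturbations <- 200
keep <- matrix(
  rbinom(n_perturbations * length(toy_sentence), 1, 0.5),
  ncol = length(toy_sentence),
  dimnames = list(NULL, toy_sentence)
)

# 2. Predict each perturbed sentence with the black box
toy_preds <- apply(keep, 1, function(k) toy_black_box(toy_sentence[k == 1]))

# 3. Weight perturbations by how much of the original sentence they keep
#    (a crude similarity measure, assumed for this sketch)
similarity <- rowMeans(keep)

# 4. Fit an interpretable linear model to the black-box predictions
local_fit <- lm(toy_preds ~ ., data = as.data.frame(keep),
                weights = similarity)
round(coef(local_fit), 2)
```

The coefficients play the role of feature weights: words the stand-in model relies on get weights far from zero, while words it ignores get weights near zero.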
To use this package we need to write a helper function to get the data in the format we want. The `lime()` function takes two mandatory arguments, `x` and `model`. The `model` argument is the trained model we are trying to explain. The `lime()` function works out of the box with Keras models so we should be good to go there. The `x` argument is the training data used for training the model. This is where we need to create a helper function; the lime package is expecting `x` to be a character vector so we'll need a function that takes a character vector as input and returns the matrix the Keras model is expecting.
```{r}
kick_prepped_rec <- prep(kick_rec)

text_to_matrix <- function(x) {
  bake(
    kick_prepped_rec,
    new_data = tibble(blurb = x),
    composition = "matrix"
  )
}
```
```{block, type = "rmdnote"}
Since the function needs to be able to work with just the `x` parameter alone, we need to refer to `kick_prepped_rec` inside the function rather than passing it in as an argument. This will work with R's scoping rules but does require you to create a new function for each recipe.
```
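If writing one function per recipe becomes tedious, one possible alternative is a function factory, which relies on the same scoping rules. The sketch below shows the pattern with a made-up lookup vector standing in for a prepped recipe; `make_text_to_index()` and `toy_lookup` are hypothetical names used only for this illustration.

```{r}
# Hypothetical factory: returns a one-argument function that remembers
# `lookup` through R's lexical scoping
make_text_to_index <- function(lookup) {
  force(lookup)  # evaluate now so later changes to the input do not leak in
  function(x) unname(lookup[x])
}

# Made-up lookup standing in for a prepped recipe
toy_lookup <- c(game = 1L, board = 2L, cat = 3L)
toy_text_to_index <- make_text_to_index(toy_lookup)

toy_text_to_index(c("cat", "game"))
```

The returned function takes only `x`, as `lime()` requires, while the object it depends on is supplied once when the function is created.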
Let's select a couple of training observations to explain.
```{r, linewidth=80}
sentence_to_explain <- kickstarter_train %>%
  slice(c(1, 5)) %>%
  pull(blurb)

sentence_to_explain
```
We now load the lime package and pass observations into `lime()` along with the model we are trying to explain and the preprocess function.
```{block, type = "rmdwarning"}
Be sure that the preprocessing function _matches_ the preprocessing that was used to train the model.
```
```{r}
library(lime)

explainer <- lime(
  x = sentence_to_explain,
  model = simple_cnn_model,
  preprocess = text_to_matrix
)
```
This `explainer` object can now be used with `explain()` to generate explanations for the sentences. We set `n_labels = 1` to only get explanations for the first label, since we are working with a binary classification model^[The explanations of the second label would just be the inverse of the first label. If you have more than two labels, it makes sense to explore some or all of them.]. We set `n_features = 12` to return the 12 most important features. If we were dealing with longer text, we might want to change `n_features` to return more features (tokens).
```{r}
explanation <- explain(
x = sentence_to_explain,
explainer = explainer,
n_labels = 1,
n_features = 12
)
explanation
```
The output is a tibble that includes `feature` and `feature_weight` columns, and lime provides functions to visualize these weights. Figure \@ref(fig:limeplotfeatures) shows the result of using `plot_features()`, with each facet containing an observation-label pair and the bars showing the weight of the different tokens. Bars in the positive direction (darker) indicate that the weights _support_ the prediction and bars in the negative direction (lighter) indicate _contradictions_. This chart is great for finding the most prominent features in an observation.
```{r, eval=FALSE}
plot_features(explanation)
```
```{r limeplotfeatures, echo=FALSE, fig.cap="Plot of most important features for a CNN model predicting two observations."}
suppressMessages(
plot_features(explanation) +
scale_fill_manual(values = discrete_colors[1:2], drop = FALSE)
)
```
\index{models!explainability}Figure \@ref(fig:limeplottextexplanations) shows the weights by highlighting the words directly in the text. This gives us a way to see if any local patterns contain a lot of weight.
```{r, eval=FALSE}
plot_text_explanations(explanation)
```
```{r limeplottextexplanations, echo=FALSE, fig.cap="Feature highlighting of words for two examples explained by a CNN model.", out.width="90%"}
if (knitr::is_html_output()) {
positive_colors <- prismatic::clr_lighten(
rep(discrete_colors[1], 6),
c(-0.1, 0.1, 0.3, 0.5, 0.7, 0.9)
)
negative_colors <- prismatic::clr_lighten(
rep(discrete_colors[2], 6),
c(-0.1, 0.1, 0.3, 0.5, 0.7, 0.9)
)
all_styles <- paste(collapse = "\n",
paste0(".match_positive, .positive_1, .positive_2, .positive_3, .positive_4, .positive_5
{ border: 1px solid ", positive_colors[1], ";}"),
paste0(".match_negative, .negative_1, .negative_2, .negative_3, .negative_4, .negative_5
{ border: 1px solid ", negative_colors[1], ";}"),
paste0(".plot_text_explanations .", "positive", "_", 1:5, " {
background-color: ", positive_colors[-1], ";}"),
paste0(".plot_text_explanations .", "negative", "_", 1:5, " {
background-color: ", negative_colors[-1], ";}")
)
plot_text_explanations(explanation) %>%
htmlwidgets::prependContent(htmltools::tags$style(all_styles))
} else {
knitr::include_graphics("images/plot_text_explanations_1.png")
}
```
```{block, type = "rmdnote"}
The `interactive_text_explanations()` function can be used to launch an interactive Shiny app where you can explore the model weights.
```
\index{models!explainability}One reason deep learning models are hard to explain is that a change to one part of the input can affect how the input as a whole is used. Remember that in bag-of-words models, adding another token when predicting just adds that token's weight to the prediction; this is not always the case when using deep learning models.
The following example shows this effect. We have created two very similar sentences in `fake_sentences`.
```{r}
fake_sentences <- c(
"Fun and exciting dice game for the whole family",
"Fun and exciting dice game for the family"
)
explainer <- lime(
x = fake_sentences,
model = simple_cnn_model,
preprocess = text_to_matrix
)
explanation <- explain(
x = fake_sentences,
explainer = explainer,
n_labels = 1,
n_features = 12
)
```
Explanations based on these two sentences are fairly similar as we can see in Figure \@ref(fig:robustlimeplottextexplanations). However, notice how the removal of the word "whole" affects the weights of the other words in the examples, in some cases switching the sign from supporting to contradicting.
```{r, eval=FALSE}
plot_text_explanations(explanation)
```
```{r robustlimeplottextexplanations, echo=FALSE, fig.cap="Feature highlighting of words in two examples explained by a CNN model.", out.width="90%"}
if (knitr::is_html_output()) {
plot_text_explanations(explanation) %>%
htmlwidgets::prependContent(htmltools::tags$style(all_styles))
} else {
knitr::include_graphics("images/plot_text_explanations_2.png")
}
```
\index{models!explainability}It is these kinds of correlated patterns that can make deep learning models hard to reason about and can deliver surprising results.
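For contrast, consider what happens in a linear bag-of-words model. The following base R sketch uses hypothetical token weights (they do not come from any model in this book) to show that removing one token changes the score by exactly that token's weight and leaves the contribution of every other token untouched.

```r
# Hypothetical token weights from a linear bag-of-words model
weights <- c(fun = 0.4, exciting = 0.3, dice = -0.2,
             game = 0.1, whole = 0.05, family = 0.25)

# The score is the sum of the weights of the tokens that are present
score_bow <- function(tokens, weights) {
  sum(weights[tokens[tokens %in% names(weights)]])
}

with_whole <- c("fun", "exciting", "dice", "game", "whole", "family")
without_whole <- setdiff(with_whole, "whole")

# The difference equals the weight of "whole" and nothing else
score_bow(with_whole, weights) - score_bow(without_whole, weights)
```

A CNN offers no such guarantee, because a convolutional filter sees several neighboring tokens at once.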
```{block, type = "rmdnote"}
The LIME algorithm and **lime** R package are not limited to explaining CNNs. This approach can be used with any of the models we have used in this book, even the ones trained with **parsnip**.
```
## Case study: hyperparameter search {#keras-hyperparameter}
\index{models!tuning}So far in all our deep learning models, we have only used one configuration of hyperparameters. Sometimes we want to try different hyperparameters out and find what works best for our model like we did in Sections \@ref(mlregressionfull) and \@ref(mlclassificationfull) using the **tune** package. We can use the [**tfruns**](https://tensorflow.rstudio.com/tools/tfruns/overview/) package to run multiple Keras models and compare the results.
This workflow will be a little different than what we have seen in the book so far since we will have to create a `.R` file that contains the necessary modeling steps and then use that file to fit multiple models. Such an example file named `cnn-spec.R` used for the following models is available [on GitHub](https://raw.githubusercontent.com/EmilHvitfeldt/smltar/master/cnn-spec.R). The first thing we need to do is specify what hyperparameters we want to vary. By convention, this object is named `FLAGS` and it is created using the `flags()` function. For each parameter we want to tune, we add a corresponding `flag_*()` function, which can be `flag_integer()`, `flag_boolean()`, `flag_numeric()`, or `flag_string()` depending on what we need to tune.
```{block, type = "rmdwarning"}
Be sure you are using the right type for each of these flags; Keras is quite picky! If Keras is expecting an integer and gets a numeric then you will get an error.
```
```{r, eval=FALSE}
FLAGS <- flags(
flag_integer("kernel_size1", 5),
flag_integer("strides1", 1)
)
```
Notice how we give each flag a name and a default value. The default itself isn't important, as it is overridden once we start running multiple models, but it needs to be the right type for the model we are using.
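As a small illustration of the type distinction, the sketch below first shows how base R treats integers and doubles, and then declares one flag of each type. Only `kernel_size1` comes from `cnn-spec.R`; the flag names `dropout_rate`, `use_bias`, and `activation` are hypothetical and are included only to show the four `flag_*()` functions side by side.

```r
library(tfruns)

# In base R, a bare number is a double; the L suffix creates an integer
is.integer(5)
is.integer(5L)

# One flag of each type; all names except kernel_size1 are hypothetical
FLAGS <- flags(
  flag_integer("kernel_size1", 5),
  flag_numeric("dropout_rate", 0.2),
  flag_boolean("use_bias", TRUE),
  flag_string("activation", "relu")
)

# Flag values are accessed by name
FLAGS$kernel_size1
FLAGS$activation
```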
Next, we specify the Keras model we want to run.
```{r, eval=FALSE}
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_words + 1, output_dim = 16,
                  input_length = max_length) %>%
  layer_conv_1d(filters = 32,
                kernel_size = FLAGS$kernel_size1,
                strides = FLAGS$strides1,
                activation = "relu") %>%
  layer_global_max_pooling_1d() %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",
  metrics = c("accuracy")
)
```
We target the hyperparameters we want to change by marking them as `FLAGS$name`. So in this model, we are tuning different values of `kernel_size` and `strides`, which are denoted by the `kernel_size1` and `strides1` flags, respectively.
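To see what these two hyperparameters do to the shape of the data, we can compute how many positions the convolutional layer outputs. The sketch below uses a hypothetical helper, `conv_output_length()`, and assumes the Keras defaults of `"valid"` padding and no dilation, together with the sequence length of 30 used in this chapter.

```r
# Hypothetical helper: number of output positions of a 1D convolution,
# assuming "valid" padding and no dilation
conv_output_length <- function(input_length, kernel_size, strides) {
  floor((input_length - kernel_size) / strides) + 1
}

max_length <- 30
kernel_sizes <- c(3, 5, 7)

# Stride 1: the window moves one token at a time
conv_output_length(max_length, kernel_sizes, strides = 1)

# Stride 2: the window skips every other position
conv_output_length(max_length, kernel_sizes, strides = 2)
```

Larger kernels and larger strides both shorten the sequence that is passed on to the pooling layer.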
Lastly, we must specify how the model is trained and evaluated.
```{r, eval=FALSE}
history <- model %>%
fit(
x = kick_analysis,
y = state_analysis,
batch_size = 512,
epochs = 10,
validation_data = list(kick_assess, state_assess)
)
plot(history)
score <- model %>% evaluate(
kick_assess, state_assess
)
cat("Test accuracy:", score["accuracy"], "\n")
```
This is mostly the same as what we have seen before. When we are running these different models, the scripts will be run in the environment they are initialized from, so the models will have access to objects like `kick_analysis` and `state_analysis`, and we don't have to create them inside the file.
Now that we have the file set up we need to specify the different hyperparameters we want to try. Three different values for the kernel size and two different values for the stride length give us `3 * 2 = 6` different runs.
```{r}
hyperparams <- list(
kernel_size1 = c(3, 5, 7),
strides1 = c(1, 2)
)
```
```{block, type = "rmdnote"}
This is a small selection of hyperparameters and ranges. There is much more room for experimentation.
```
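If you want to see which combinations a grid like this contains before fitting anything, base R's `expand.grid()` can enumerate them. This is only a way to inspect the grid; it is not part of the **tfruns** workflow, and `flag_grid` is a name introduced here for illustration.

```r
hyperparams <- list(
  kernel_size1 = c(3, 5, 7),
  strides1 = c(1, 2)
)

# Every combination of the candidate values, one row per run
flag_grid <- expand.grid(hyperparams)
flag_grid

# Three kernel sizes times two stride lengths
nrow(flag_grid)
```

The number of runs grows multiplicatively with each flag we add. For larger grids, `tuning_run()` has a `sample` argument that fits only a random fraction of the combinations.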
Now we have everything we need for hyperparameter searching. Load up **tfruns** and pass the name of the file we just created along with `hyperparams` to the `tuning_run()` function.
```{r, results='hide', cache=FALSE, message=FALSE, eval=FALSE}
library(tfruns)
runs <- tuning_run(
  file = "cnn-spec.R",
  runs_dir = "_tuning",
  flags = hyperparams
)

# list the runs stored in the same directory we wrote to above
runs_results <- as_tibble(ls_runs(runs_dir = "_tuning"))
```
```{r, echo=FALSE, message=FALSE}
runs_results <- readr::read_csv("inst/runs_results.csv")
runs_results
```
Although it is optional, we have manually specified the `runs_dir` argument, which sets where the results of the tuning are saved.
A summary of all the runs in the folder can be retrieved with `ls_runs()`; here we use `as_tibble()` to get the results as a tibble.
```{r, cache=FALSE}
runs_results
```
We can condense the results down a little bit by only pulling out the flags we are looking at and arranging them according to their performance.
```{r, cache=FALSE}
best_runs <- runs_results %>%
select(metric_val_accuracy, flag_kernel_size1, flag_strides1) %>%
arrange(desc(metric_val_accuracy))
best_runs
```
There isn't much performance difference between the different choices, but using a kernel size of `r best_runs$flag_kernel_size1[1]` and a stride length of `r best_runs$flag_strides1[1]` narrowly came out on top.
## Cross-validation for evaluation
In Section \@ref(dnncross), we saw how we can use resampling to create cross-validation folds for evaluation. The Kickstarter data set we are using is big enough that we have ample data for a single training set, validation set, and testing set that all contain enough observations in them to give reliable performance metrics. However, it is important to understand how to implement other resampling strategies for situations when your data budget may not be as plentiful or when you need to compute performance metrics that are more precise.
```{r}
set.seed(345)
kick_folds <- vfold_cv(kickstarter_train, v = 5)
kick_folds
```
Each of these folds has an analysis or training set and an assessment or validation set. Instead of training our model one time and getting one measure of performance, we can train our model `v` times and get `v` measures (five, in this case), for more reliability.
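The logic of a v-fold split can be sketched in a few lines of base R. This is an illustration of the idea on 20 hypothetical row indices, not the code that `vfold_cv()` runs internally.

```r
set.seed(345)
n <- 20  # hypothetical number of observations
v <- 5   # number of folds

# Randomly assign each row to one of v folds of (nearly) equal size
fold_id <- sample(rep(seq_len(v), length.out = n))

# For fold k, rows in fold k are assessed and all other rows are analyzed
folds <- lapply(seq_len(v), function(k) {
  list(
    analysis = which(fold_id != k),
    assessment = which(fold_id == k)
  )
})

# Size of each assessment set
sapply(folds, function(f) length(f$assessment))

# Every row is held out exactly once across the v folds
held_out <- sort(unlist(lapply(folds, `[[`, "assessment")))
all(held_out == seq_len(n))
```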
Last time we saw how to create a custom function to handle preprocessing, fitting, and evaluation. We will use the same approach of creating the function, but this time use the model specification from Section \@ref(firstcnn).
```{r}
fit_split <- function(split, prepped_rec) {
  ## preprocessing
  x_train <- bake(prepped_rec, new_data = analysis(split),
                  composition = "matrix")
  x_val <- bake(prepped_rec, new_data = assessment(split),
                composition = "matrix")

  ## outcomes
  y_train <- analysis(split) %>% pull(state)
  y_val <- assessment(split) %>% pull(state)

  ## create model
  mod <- keras_model_sequential() %>%
    layer_embedding(input_dim = max_words + 1, output_dim = 16,
                    input_length = max_length) %>%
    layer_conv_1d(filters = 32, kernel_size = 5, activation = "relu") %>%
    layer_global_max_pooling_1d() %>%
    layer_dense(units = 64, activation = "relu") %>%
    layer_dense(units = 1, activation = "sigmoid") %>%
    compile(
      optimizer = "adam",
      loss = "binary_crossentropy",
      metrics = c("accuracy")
    )

  ## fit model
  mod %>%
    fit(
      x_train,
      y_train,
      epochs = 10,
      validation_data = list(x_val, y_val),
      batch_size = 512,
      verbose = FALSE
    )

  ## evaluate model
  keras_predict(mod, x_val, y_val) %>%
    metrics(state, .pred_class, .pred_1)
}
```
We can `map()` this function across all our cross-validation folds. This takes longer than our previous models to train, since we are training for 10 epochs each on 5 folds.
```{r}
cv_fitted <- kick_folds %>%
mutate(validation = map(splits, fit_split, kick_prep))
cv_fitted
```
Now we can use `unnest()` to find the metrics we computed.
```{r}
cv_fitted %>%
unnest(validation)
```
We can summarize the unnested results to match what we normally would get from `collect_metrics()`.
```{r}
cv_fitted %>%
unnest(validation) %>%
group_by(.metric) %>%
summarize(
mean = mean(.estimate),
n = n(),
std_err = sd(.estimate) / sqrt(n)
)
```
The metrics have little variance just like they did last time, which is reassuring; our model is robust with respect to the evaluation metrics.
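The standard error in the summary above is the standard deviation of the per-fold estimates divided by the square root of the number of folds. As a worked example, the base R sketch below applies that formula to five hypothetical accuracy values; these numbers are made up for illustration and are not results from our model.

```r
# Hypothetical accuracy estimates, one per fold
fold_accuracy <- c(0.81, 0.82, 0.80, 0.83, 0.81)

n <- length(fold_accuracy)
fold_mean <- mean(fold_accuracy)

# Standard error of the mean across folds
std_err <- sd(fold_accuracy) / sqrt(n)

c(mean = fold_mean, n = n, std_err = std_err)
```

A small standard error relative to the mean indicates that the folds agree with each other.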
## The full game: CNN {#cnnfull}
We've come a long way in this chapter, and looked at many different modifications to the simple CNN model we started with. Most of the alterations didn't add much, so this final model is not going to be much different from what we have seen so far.
\index{models!challenges}
```{block, type = "rmdwarning"}
There are an incredible number of ways to change a deep learning network architecture, but in most realistic situations, the benefit in model performance from such changes is modest.
```
### Preprocess the data {#cnnfullpreprocess}
For this final model, we are not going to use our separate validation data again, so we only need to \index{preprocess}preprocess the training data.
```{r}
max_words <- 2e4
max_length <- 30
kick_rec <- recipe(~ blurb, data = kickstarter_train) %>%
step_tokenize(blurb) %>%
step_tokenfilter(blurb, max_tokens = max_words) %>%
step_sequence_onehot(blurb, sequence_length = max_length)
kick_prep <- prep(kick_rec)
kick_matrix <- bake(kick_prep, new_data = NULL, composition = "matrix")
dim(kick_matrix)
```
### Specify the model {#cnnfullmodel}
Instead of using specific validation data that we can then compute performance metrics for, let's go back to specifying `validation_split = 0.1` and let the Keras model choose the validation set.
```{r cnnfinalmod}
final_mod <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_words + 1, output_dim = 16,
                  input_length = max_length) %>%
  layer_conv_1d(filters = 32, kernel_size = 7,
                strides = 1, activation = "relu") %>%
  layer_global_max_pooling_1d() %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

final_mod %>%
  compile(
    optimizer = "adam",
    loss = "binary_crossentropy",
    metrics = c("accuracy")
  )

final_history <- final_mod %>%
  fit(
    kick_matrix,
    kickstarter_train$state,
    epochs = 10,
    validation_split = 0.1,
    batch_size = 512,
    verbose = FALSE
  )

final_history
```
This looks promising! Let's finally turn to the testing set, for the first time during this chapter, to evaluate this last model on data that has never been touched as part of the fitting process.
```{r}
kick_matrix_test <- bake(kick_prep, new_data = kickstarter_test,
composition = "matrix")
final_res <- keras_predict(final_mod, kick_matrix_test, kickstarter_test$state)
final_res %>% metrics(state, .pred_class, .pred_1)
```
This is our best-performing model in this chapter on CNN models, although not by much. We can again create an ROC curve, this time using the test data, as shown in Figure \@ref(fig:cnnfinalroc).
```{r cnnfinalroc, opts.label = "fig.square", fig.cap="ROC curve for final CNN model predictions on testing set of Kickstarter campaign success"}
final_res %>%
roc_curve(state, .pred_1) %>%
autoplot()
```
We have been able to incrementally improve our model by adding to the structure and making good choices about \index{preprocessing}preprocessing. We can visualize this final CNN model's performance using a \index{matrix!confusion}confusion matrix as well, in Figure \@ref(fig:cnnheatmapfinal).
```{r cnnheatmapfinal, fig.cap="Confusion matrix for final CNN model predictions on testing set of Kickstarter campaign success"}
final_res %>%
conf_mat(state, .pred_class) %>%
autoplot(type = "heatmap")
```
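To connect a confusion matrix to the metrics we have been reporting, the base R sketch below computes accuracy, sensitivity, and specificity from a matrix of hypothetical counts. The counts are made up for illustration and are not the results of our final model. Here class `"1"` is treated as the event of interest, which is an assumption of this sketch.

```r
# Hypothetical counts; rows are predictions, columns are true classes
cm <- matrix(
  c(800, 150,
    200, 850),
  nrow = 2, byrow = TRUE,
  dimnames = list(prediction = c("0", "1"), truth = c("0", "1"))
)

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: share of true class 1 observations predicted as 1
sensitivity <- cm["1", "1"] / sum(cm[, "1"])

# Specificity: share of true class 0 observations predicted as 0
specificity <- cm["0", "0"] / sum(cm[, "0"])

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```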
Notice that this final model performs better than any of the models we have tried so far in this chapter, Chapter \@ref(dldnn), and Chapter \@ref(dllstm).
```{block, type = "rmdnote"}
For this particular data set of short text blurbs, a CNN model able to learn local features performed the best, better than either a densely connected neural network or an LSTM.
```
## Summary {#dlcnnsummary}
CNNs are a type of neural network that can learn local spatial patterns. They essentially perform feature extraction\index{feature engineering}, which can then be used efficiently in later layers of a network. Their simplicity and fast running time, compared to models like LSTMs, make them excellent candidates for supervised models for text.
### In this chapter, you learned:
- how to preprocess text data for CNN models
- about CNN network architectures
- how CNN layers can be stacked to extract patterns of varying detail
- how byte pair encoding can be used to tokenize for finer detail
- how to do hyperparameter search in Keras with **tfruns**
- how to evaluate CNN models for text