---
title: "Preparing and processing texts"
author: "Kenneth Benoit and Paul Nulty"
date: "18 October 2015"
output: html_document
---
Here we will step through the basic elements of preparing a text for analysis. These are tokenization, conversion to lower case, stemming, removing or selecting features, and defining equivalency classes for features, including the use of dictionaries.
### 1. Tokenization
Tokenization in quanteda is very *conservative*: by default, it only removes separator characters.
```{r}
require(quanteda, quietly = TRUE, warn.conflicts = FALSE)
txt <- c(text1 = "This is $10 in 999 different ways,\n up and down; left and right!",
         text2 = "@kenbenoit working: on #quanteda 2day\t4ever, http://textasdata.com?page=123.")
tokenize(txt)
tokenize(txt, verbose=TRUE)
tokenize(txt, removeNumbers=TRUE, removePunct=TRUE)
tokenize(txt, removeNumbers=FALSE, removePunct=TRUE)
tokenize(txt, removeNumbers=TRUE, removePunct=FALSE)
tokenize(txt, removeNumbers=FALSE, removePunct=FALSE)
tokenize(txt, removeNumbers=FALSE, removePunct=FALSE, removeSeparators=FALSE)
```
There are several options to the `what` argument:
```{r}
# sentence level
tokenize(c("Kurt Vongeut said; only assholes use semi-colons.",
"Today is Thursday in Canberra: It is yesterday in London.",
"Today is Thursday in Canberra: \nIt is yesterday in London.",
"To be? Or\not to be?"),
what = "sentence")
tokenize(inaugTexts[2], what = "sentence")
# character level
tokenize("My big fat text package.", what="character")
tokenize("My big fat text package.", what="character", removeSeparators=FALSE)
```
If performance is a key issue, two other options provide really fast and simple tokenization: `"fastestword"` and `"fasterword"`. These are less intelligent than the boundary detection used in the default `"word"` method, which is based on `stringi`/ICU boundary detection.
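For instance, the faster methods essentially split on whitespace, so punctuation stays attached to adjacent tokens. A quick sketch, reusing the `txt` object from above:
```{r}
# whitespace-only splitting: note the punctuation left attached to words
tokenize(txt, what = "fasterword")
```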
### 2. Conversion to lower case
This is a tricky one in our workflow, since it is a form of equivalency declaration rather than a tokenization step. It turns out that it is more efficient to perform this conversion at the pre-tokenization stage.
As a result, the method `toLower()` is defined for many classes of quanteda objects.
```{r}
methods(toLower)
```
We include options designed to preserve acronyms.
```{r}
test1 <- c(text1 = "England and France are members of NATO and UNESCO",
text2 = "NASA sent a rocket into space.")
toLower(test1)
toLower(test1, keepAcronyms = TRUE)
test2 <- tokenize(test1, removePunct=TRUE)
toLower(test2)
toLower(test2, keepAcronyms = TRUE)
```
`toLower()` is based on `stringi`, and is therefore nicely Unicode-compliant.
```{r}
# Russian
cat(iconv(encodedTexts[8], "windows-1251", "UTF-8"))
cat(toLower(iconv(encodedTexts[8], "windows-1251", "UTF-8")))
head(toLower(stopwords("russian")), 20)
# Arabic
cat(iconv(encodedTexts[6], "ISO-8859-6", "UTF-8"))
cat(toLower(iconv(encodedTexts[6], "ISO-8859-6", "UTF-8")))
head(toLower(stopwords("arabic")), 20)
```
**Note**: `dfm()`, the Swiss army knife, converts to lower case by default, but this can be turned off using the `toLower = FALSE` argument.
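To see the difference, compare the features produced with and without lowercasing, reusing `test1` from above:
```{r}
features(dfm(test1, verbose = FALSE))                   # lowercased by default
features(dfm(test1, toLower = FALSE, verbose = FALSE))  # case preserved
```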
### 3. Removing and selecting features
This can be done when creating a dfm:
```{r}
# with English stopwords and stemming
dfmsInaug2 <- dfm(subset(inaugCorpus, Year > 1980),
                  ignoredFeatures = stopwords("english"), stem = TRUE)
```
Or it can be done **after** creating a dfm:
```{r}
myDfm <- dfm(c("My Christmas was ruined by your opposition tax plan.",
"Does the United_States or Sweden have more progressive taxation?"),
toLower = FALSE, verbose = FALSE)
selectFeatures(myDfm, c("s$", ".y"), "keep", valuetype = "regex")
selectFeatures(myDfm, c("s$", ".y"), "remove", valuetype = "regex")
selectFeatures(myDfm, stopwords("english"), "keep", valuetype = "fixed")
selectFeatures(myDfm, stopwords("english"), "remove", valuetype = "fixed")
```
More examples:
```{r}
# removing stopwords
testText <- "The quick brown fox named Seamus jumps over the lazy dog also named Seamus, with
the newspaper from a boy named Seamus, in his mouth."
testCorpus <- corpus(testText)
# note: "also" is not in the default stopwords("english")
features(dfm(testCorpus, ignoredFeatures = stopwords("english")))
# for ngrams
features(dfm(testCorpus, ngrams = 2, ignoredFeatures = stopwords("english")))
features(dfm(testCorpus, ngrams = 1:2, ignoredFeatures = stopwords("english")))
## removing stopwords before constructing ngrams
tokensAll <- tokenize(toLower(testText), removePunct = TRUE)
tokensNoStopwords <- removeFeatures(tokensAll, stopwords("english"))
tokensNgramsNoStopwords <- ngrams(tokensNoStopwords, 2)
features(dfm(tokensNgramsNoStopwords, ngrams = 1:2))
# keep only certain words
dfm(testCorpus, keptFeatures = "*s", verbose = FALSE) # keep only words ending in "s"
dfm(testCorpus, keptFeatures = "s$", valuetype = "regex", verbose = FALSE)
# testing Twitter functions
testTweets <- c("My homie @justinbieber #justinbieber shopping in #LA yesterday #beliebers",
"2all the ha8ers including my bro #justinbieber #emabiggestfansjustinbieber",
"Justin Bieber #justinbieber #belieber #fetusjustin #EMABiggestFansJustinBieber")
dfm(testTweets, keptFeatures = "#*", removeTwitter = FALSE) # keep only hashtags
dfm(testTweets, keptFeatures = "^#.*$", valuetype = "regex", removeTwitter = FALSE)
```
One very nice, recently added feature is the ability to create a new dfm with the same feature set as an existing one. This is very useful if, for instance, we train a model on one dfm and need to predict on counts from another, but the feature sets must be equivalent.
```{r}
# selecting on a dfm
textVec1 <- c("This is text one.", "This, the second text.", "Here: the third text.")
textVec2 <- c("Here are new words.", "New words in this text.")
features(dfm1 <- dfm(textVec1))
features(dfm2a <- dfm(textVec2))
(dfm2b <- selectFeatures(dfm2a, dfm1))
identical(features(dfm1), features(dfm2b))
```
### 4. Applying equivalency classes: dictionaries, thesauruses
Dictionary creation is done through the `dictionary()` function, which turns a named list of character vectors into a dictionary object.
```{r}
# import the Laver-Garry dictionary from http://bit.ly/1FH2nvf
lgdict <- dictionary(file = "http://www.kenbenoit.net/courses/essex2014qta/LaverGarry.cat",
                     format = "wordstat")
dfm(inaugTexts, dictionary=lgdict)
# import a LIWC formatted dictionary
liwcdict <- dictionary(file = "http://www.kenbenoit.net/files/LIWC2001_English.dic",
                       format = "LIWC")
dfm(inaugTexts, dictionary=liwcdict)
```
We apply dictionaries to a dfm using the `applyDictionary()` function. Through the `valuetype` argument, we can match patterns of one of three types: `"glob"`, `"regex"`, or `"fixed"`.
```{r}
myDict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                          opposition = c("Opposition", "reject", "notincorpus"),
                          taxglob = "tax*",
                          taxregex = "tax.+$",
                          country = c("United_States", "Sweden")))
myDfm <- dfm(c("My Christmas was ruined by your opposition tax plan.",
"Does the United_States or Sweden have more progressive taxation?"),
ignoredFeatures = stopwords("english"), verbose = FALSE)
myDfm
# glob format
applyDictionary(myDfm, myDict, valuetype = "glob")
applyDictionary(myDfm, myDict, valuetype = "glob", case_insensitive = FALSE)
# regex v. glob format: note that "united_states" is a regex match for "tax*"
applyDictionary(myDfm, myDict, valuetype = "glob")
applyDictionary(myDfm, myDict, valuetype = "regex", case_insensitive = TRUE)
# fixed format: no pattern matching
applyDictionary(myDfm, myDict, valuetype = "fixed")
applyDictionary(myDfm, myDict, valuetype = "fixed", case_insensitive = FALSE)
```
It is also possible to pass through a dictionary at the time of `dfm()` creation.
```{r}
# dfm with dictionaries
mycorpus <- subset(inaugCorpus, Year > 1900)
mydict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                          opposition = c("Opposition", "reject", "notincorpus"),
                          taxing = "taxing",
                          taxation = "taxation",
                          taxregex = "tax*",
                          country = "united states"))
dictDfm <- dfm(mycorpus, dictionary=mydict)
head(dictDfm)
```
Finally, there is a related "thesaurus" feature, which collapses words in a dictionary but is not exclusive.
```{r}
mytexts <- c("British English tokenises differently, with more colour.",
"American English tokenizes color as one word.")
mydict <- dictionary(list(color = "colo*r", tokenize = "tokeni?e*"))
dfm(mytexts, thesaurus = mydict)
```
### 5. Stemming
Stemming relies on the `SnowballC` package's implementation of the Porter stemmer, and is available for the following languages:
```{r}
SnowballC::getStemLanguages()
```
It's not perfect:
```{r}
wordstem(c("win", "winning", "wins", "won", "winner"))
```
but it's fast.
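A rough way to check the speed for yourself, as a sketch (timings will vary by machine):
```{r, eval=FALSE}
# time stemming across all tokenized inaugural addresses
toks <- tokenize(toLower(inaugTexts), removePunct = TRUE)
system.time(wordstem(toks))
```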
Objects to be stemmed must already be tokenized, but they can be of several different quanteda classes:
```{r, error = TRUE}
methods(wordstem)
wordstem("This is a winning package, of many packages.")
wordstem(tokenize("This is a winning package, of many packages."))
head(wordstem(dfm(inaugTexts[1:2], verbose = FALSE)))
# same as
head(dfm(inaugTexts[1:2], stem = TRUE, verbose = FALSE))
```
### 6. `dfm()` and its many options
It operates on `character` vectors, `corpus` objects, or `tokenizedText` objects:
```{r, eval=FALSE}
## S3 method for class 'character'
dfm(x, verbose = TRUE, toLower = TRUE,
    removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
    removeTwitter = FALSE, stem = FALSE, ignoredFeatures = NULL,
    keptFeatures = NULL, matrixType = c("sparse", "dense"),
    language = "english", thesaurus = NULL, dictionary = NULL,
    valuetype = c("glob", "regex", "fixed"), dictionary_regex = FALSE, ...)
```
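Putting a few of these options together, a sketch using the built-in `inaugTexts`:
```{r, eval=FALSE}
# lowercase (the default), drop numbers and punctuation,
# remove English stopwords, and stem the remaining features
dfm(inaugTexts,
    removeNumbers = TRUE, removePunct = TRUE,
    ignoredFeatures = stopwords("english"),
    stem = TRUE)
```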