---
title: "Assignment 8 - Working With Textual Data"
author: "Ken Benoit and Slava Mikhaylov"
output: html_document
---
Assignments for the course focus on practical aspects of the concepts covered in the lectures. Assignments are largely based on the material covered in James et al. (2013). You will start working on the assignment in the lab sessions after the lectures, but may need to finish them after class.
You will be asked to submit your assignments via Moodle by 7pm on the day of the class. We will subsequently open up solutions to the problem sets.
### Exercise summary
This exercise is designed to get you working with [quanteda](http://github.com/kbenoit/quanteda). The focus will be on exploring the package and getting some texts into the **corpus** object format. The [quanteda](http://github.com/kbenoit/quanteda) package has several functions for creating a corpus of texts, which we will use in this exercise.
1. Getting Started.
You can use R or RStudio for these exercises. You will first need to install the package,
using the commands below. Also see the instructions for installation from the dev branch page of
http://github.com/kbenoit/quanteda.
```{r, eval=FALSE}
# needs the devtools package for this to work
if (!require(devtools)) install.packages("devtools", dependencies=TRUE)
# PREFERABLY: install the latest version from GitHub
# see the installation instructions on http://github.com/kbenoit/quanteda
devtools::install_github("kbenoit/quanteda")
# and quantedaData
devtools::install_github("quantedaData", username="kbenoit")
```
1. Exploring **quanteda** functions.
Look at the Quick Start vignette, and browse the manual for quanteda. You can use the `example()` function for any function in the package to run the examples and see how the function works (an illustration follows the next chunks). Of course you should also browse the documentation, especially `?corpus`, to see the structure of a corpus and how to construct one.
```{r, eval = FALSE}
help(package = "quanteda")
```
```{r, echo=FALSE}
require(quanteda, quietly = TRUE, warn.conflicts = FALSE)
```
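For instance, you can run the packaged examples for a function like this (using `kwic()` purely as an illustration):
```{r, eval=FALSE}
# run the built-in examples shipped with a function's documentation
example("kwic", package = "quanteda")
```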
1. Making a corpus and corpus structure
1. From a vector of texts already in memory.
The simplest way to create a corpus is to use a vector of texts already present in
R's global environment. Some text and corpus objects are built into the package,
for example `inaugTexts` is the UTF-8 encoded set of 57 presidential inaugural
addresses. Try using `corpus()` on this set of texts to create a corpus.
Once you have constructed this corpus, use the `summary()` method to see a brief
description of the corpus. The names of the character vector `inaugTexts` should
have become the document names.
```{r}
str(inaugTexts)
myInaugCorpus <- corpus(inaugTexts, notes = "Corpus created in lab")
summary(myInaugCorpus)
```
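To confirm that the names of `inaugTexts` became the document names, you can compare them directly:
```{r}
head(docnames(myInaugCorpus))
head(names(inaugTexts))
```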
1. From a directory of text files.
The `textfile()` function can read (almost) any set of files into an object that you can then call the `corpus()` function on to create a corpus. (See `?textfile` for an example.)
Here you are encouraged to select any directory of plain text files of your own. How did it work? Try using `docvars()` to assign a set of document-level variables; a sketch follows the chunk below. If you do not have a set of text files to work with, then you can use the UK 2010 manifesto texts on immigration, in the Day 8 folder, like this:
```{r, eval=FALSE}
require(quanteda)
manfiles <- textfile("https://github.com/kbenoit/ME114/raw/master/day8/UKimmigTexts.zip")
mycorpus <- corpus(manfiles)
```
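As a sketch of assigning document-level variables (the variable name and values here are purely illustrative placeholders):
```{r, eval=FALSE}
# illustrative only: attach a placeholder docvar, one value per document
docvars(mycorpus, "docOrder") <- seq_len(ndoc(mycorpus))
summary(mycorpus)
```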
1. From `.csv` or `.json` files --- see the documentation with `?textfile`.
Here you can try one of your own examples, or just file this in your mental catalogue for future reference.
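If you do want to try it, here is a minimal sketch for the `.csv` case, assuming a hypothetical file `mytexts.csv` whose documents sit in a column named `text` (the `textField` argument names that column; check `?textfile` for the exact arguments):
```{r, eval=FALSE}
# hypothetical file and column name, for illustration only
csvfiles <- textfile("mytexts.csv", textField = "text")
mycsvCorpus <- corpus(csvfiles)
summary(mycsvCorpus)
```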
1. Explore some phrases in the text.
You can do this using the `kwic()` function (for "keywords-in-context") to explore a specific word or phrase.
```{r}
kwic(inaugCorpus, "terror", 3)
```
Try substituting your own search terms, or working with your own corpus.
```{r}
kwic(ie2010Corpus, "Christmas", 3)
kwic(ie2010Corpus, "euro")
kwic(ie2010Corpus, "euro", wholeword = TRUE)
```
1. Create a document-feature matrix, using `dfm`. First, read the documentation using
`?dfm` to see the available options.
```{r}
mydfm <- dfm(inaugCorpus, ignoredFeatures = stopwords("english"))
mydfm
topfeatures(mydfm, 20)
```
Experiment with different `dfm` options, such as `stem = TRUE`. The function `trim()` allows you to reduce the size of the dfm following its construction; a sketch follows the next chunk.
```{r}
mydfmStemmed1 <- wordstem(mydfm)
topfeatures(mydfmStemmed1, 20)
mydfmStemmed2 <- dfm(inaugCorpus,
ignoredFeatures = stopwords("english"), stem = TRUE)
topfeatures(mydfmStemmed2, 20)
```
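A sketch of `trim()`, keeping only features that occur at least five times overall and in at least three documents (argument names as described in `?trim`):
```{r, eval=FALSE}
# reduce the dfm to reasonably frequent features
mydfmTrimmed <- trim(mydfm, minCount = 5, minDoc = 3)
mydfmTrimmed
```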
Grouping on a variable is an excellent feature of `dfm()`, in fact one of my favorites.
For instance, if you want to aggregate the inaugural speeches by presidential name, you can execute
```{r}
mydfm <- dfm(inaugCorpus, groups = "President")
mydfm
docnames(mydfm)
```
Note that this groups Theodore and Franklin D. Roosevelt together -- to separate them, we would need to add a first-name variable using `docvars()` and group on that as well.
Do this to aggregate the Irish budget corpus (`ie2010Corpus`) by political party, when
creating a dfm.
```{r}
docvars(ie2010Corpus)
partyDfm <- dfm(ie2010Corpus, groups = "party")
partyDfm[, 1:10]
```
1. Explore the ability to subset a corpus.
There is a `subset()` method defined for a corpus, which works just like R's normal `subset()` command. For instance, if you want a wordcloud of just Obama's two inaugural addresses, you would need to subset the corpus first:
```{r}
obamaCorpus <- subset(inaugCorpus, President=="Obama")
obamadfm <- dfm(obamaCorpus)
plot(obamadfm)
```
Try producing that plot without the stopwords. See `removeFeatures()` to remove stopwords from the dfm object directly, or supply
the `ignoredFeatures` argument to `dfm()`.
```{r}
obamadfm2 <- removeFeatures(obamadfm, stopwords("english"))
plot(obamadfm2)
# other method:
obamadfm2a <- dfm(subset(inaugCorpus, President=="Obama"),
ignoredFeatures = stopwords("english"))
all.equal(obamadfm2, obamadfm2a)
```
1. **Preparing and pre-processing texts**
1. "Cleaning"" texts
It is common to "clean" texts before processing, usually by removing
punctuation, digits, and converting to lower case. Look at the
documentation for `toLower()` and use the
command on the `exampleString` text (you can load this from
**quantedaData** using `data(exampleString)`. Can you think of cases
where cleaning could introduce homonymy?
```{r}
exampleString
toLower(exampleString)
```
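One case where lower-casing introduces homonymy: the acronym "US" becomes indistinguishable from the pronoun "us".
```{r}
# lower-casing conflates the acronym "US" with the pronoun "us"
toLower(c("The US economy", "please join us"))
```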
1. Tokenizing texts
In order to count word frequencies, we first need to split the text
into words through a process known as **tokenization**. Look at the
documentation for **quanteda**'s `tokenize()` function. Use the
`tokenize` command on `exampleString`, and examine the results. Are
there cases where it is unclear where the boundary between two words lies?
You can experiment with the options to `tokenize()`; one more example follows the next chunk.
```{r}
tokenize(toLower(exampleString))
tokenize(toLower(exampleString), removePunct = TRUE, removeNumbers = TRUE)
```
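For instance, character-level tokenization (see `?tokenize` for the full set of `what` options):
```{r}
# look at the first 20 character-level tokens
head(tokenize(toLower(exampleString), what = "character")[[1]], 20)
```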
Try segmenting `exampleString` into sentences, using `tokenize(x, what = "sentence")`.
```{r}
tokenize(exampleString, what = "sentence")
```
1. Stemming.
Stemming removes word suffixes using the Porter stemmer, found in the **SnowballC** package. The **quanteda** function to invoke the stemmer is `wordstem()`. Apply stemming to `exampleString` and examine the results. Why does it not appear to work, and what do you need to do to make it work? How would you apply this to the sentence-segmented vector? (A sketch follows the chunk below.)
```{r}
# wordstem(exampleString) # fails
wordstem(tokenize(toLower(exampleString), removePunct = TRUE))
```
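For the sentence-segmented vector, one approach (a sketch) is to segment into sentences first, then tokenize and stem each sentence separately:
```{r}
# a sketch: segment into sentences, then tokenize and stem each one
sents <- tokenize(toLower(exampleString), what = "sentence")[[1]]
lapply(sents, function(s) wordstem(tokenize(s, removePunct = TRUE)[[1]]))
```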
1. Applying "pre-processing" to the creation of a **dfm**.
**quanteda**'s `dfm()` function makes it easy to pass cleaning arguments, which are executed as part of the tokenization implemented by `dfm()`. Compare the steps required in a similar text preparation package, [**tm**](http://cran.r-project.org/package=tm):
```{r}
require(tm)
data(crude)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, stemDocument)
tdm <- TermDocumentMatrix(crude)
# same in quanteda
crudeCorpus <- corpus(crude)
crudeDfm <- dfm(crudeCorpus)
```
Inspect the dimensions of the resulting objects, including the names of the words extracted as features. It is also worth comparing the structure of the document-feature matrices returned by each package. **tm** uses the [slam](http://cran.r-project.org/web/packages/slam/index.html) *simple triplet matrix* format for representing a [sparse matrix](http://en.wikipedia.org/wiki/Sparse_matrix).
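Comparing the dimensions directly shows the two packages' different orientations:
```{r}
dim(tdm)      # terms x documents
dim(crudeDfm) # documents x features
```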
It is also -- in fact almost always -- useful to inspect the structure of this object:
```{r}
str(tdm)
```
This indicates that we can extract the names of the words from the **tm** TermDocumentMatrix object through its `dimnames` element:
```{r}
head(tdm$dimnames$Terms, 20)
```
Compare this to the results of the same operations from **quanteda**. To get the "words" from a quanteda object, you can use the `features()` function:
```{r}
features_quanteda <- features(crudeDfm)
head(features_quanteda, 20)
str(crudeDfm)
```
What proportion of the `crudeDfm` are zeros? Compare the sizes of `tdm` and `crudeDfm` using the `object.size()` function.
```{r}
# note that length() applied to a matrix gives the total number of cells
sum(crudeDfm == 0) / length(crudeDfm)
```
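And to compare the storage sizes, as the question asks:
```{r}
object.size(tdm)
object.size(crudeDfm)
```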
1. **Keywords-in-context**
1. **quanteda** provides a keyword-in-context
function that is easily usable and configurable to explore texts
in a descriptive way. Type `?kwic` to view the documentation.
1. For the Irish budget debate speeches corpus for the year 2010, called `ie2010Corpus`,
experiment with the
`kwic` function, following the syntax specified on the help page
for `kwic`. `kwic` can be used either on a character vector or a
corpus object. What class of object is returned? Try assigning the
return value from `kwic` to a new object and then examine the
object by clicking on it in the environment
pane in RStudio (or using the inspection method of your choice).
```{r}
myKwic <- kwic(ie2010Corpus, "Christmas")
class(myKwic)
str(myKwic)
```
4. Use the `kwic` function to discover the context of the word
"clean". Is this associated with environmental policy?
```{r}
kwic(inaugCorpus, "clean")
str(myKwic)
```
5. By default, `kwic` matches any word containing the pattern, since it interprets the
pattern as a "regular expression". What if we wanted to see only the literal,
entire word "disaster"? Hint: Look at the arguments using `?kwic`.
```{r}
kwic(inaugCorpus, "disaster", wholeword = TRUE)
str(myKwic)
```
1. **Descriptive statistics**
1. We can extract basic descriptive statistics from a corpus from
its document feature matrix. Make a dfm from the 2010 Irish budget
speeches corpus.
```{r}
ieDfm <- dfm(ie2010Corpus)
```
1. Examine the most frequent word features using `topfeatures()`. What are
the five most frequent words in the corpus?
```{r}
topfeatures(ieDfm)
```
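To list exactly the top five:
```{r}
topfeatures(ieDfm, 5)
```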
5. **quanteda** provides a function to count syllables in a
word — `syllables`. Try the function at the prompt. The
code below will apply this function to all the words in the
corpus, to give you a count of the total syllables in the
corpus.
```{r}
# count syllables from texts in the 2010 speech corpus
textSyls <- syllables(texts(ie2010Corpus))
# sum the syllable counts
totalSyls <- sum(textSyls)
```
How would you get the total syllables per text?
```{r}
textSyls
```
**`textSyls` is already the vector of syllable counts, one per budget text.**
3. **Lexical Diversity over Time**
1. We can plot the type-token ratio of the Irish budget speeches
over time. To do this, begin by extracting a subset of `iebudgetsCorpus`
that contains only the first speaker from each year:
```{r}
data(iebudgetsCorpus, package = "quantedaData")
finMins <- subset(iebudgetsCorpus, number=="01")
tokeninfo <- summary(finMins)
```
Note the quotation marks around the value for `number`. Why are these required here?
**Because it is a character data type, not numeric.**
2. Get the type-token ratio for each text from this subset, and
plot the resulting vector of TTRs as a function of the year. Hint: See `?lexdiv`.
```{r}
(TTR <- lexdiv(dfm(finMins, verbose = FALSE), measure = "TTR"))
plot(docvars(finMins, "year"), TTR, xlab = "", type = "b")
```
4. **Document and word associations**
1. Load the presidential inauguration corpus, select the speeches from 1900 to 1950, and create a dfm from this corpus.
```{r}
presDfm <- dfm(subset(inaugCorpus, Year >= 1900 & Year <= 1950))
```
2. Measure the term similarities for the following words: *economy*, *health*,
*women*.
```{r}
simils <- similarity(presDfm, c("economy", "health", "women"), margin = "features")
head(simils)
```
1. **Working with dictionaries**
Dictionaries are named lists, consisting of a "key" and a set of entries defining
the equivalence class for the given key. To create a simple dictionary of parts of speech, for instance, we could define a dictionary consisting of articles and conjunctions, using:
```{r}
posDict <- dictionary(list(articles = c("the", "a", "an"),
conjunctions = c("and", "but", "or", "nor", "for", "yet", "so")))
```
To let this define a set of features, we can use this dictionary when we create a `dfm`,
for instance:
```{r}
posDfm <- dfm(inaugCorpus, dictionary = posDict, groups = "President")
posDfm[1:10,]
```
Weight the `posDfm` by term frequency using `weight()`, and plot the values of articles and conjunctions (actually, here just the coordinating conjunctions) across the speeches. (**Hint:** you can use `docvars(inaugCorpus, "Year")` for the *x*-axis.) A sketch of such a plot follows the weighting chunk.
```{r}
posDfm <- weight(posDfm, "relFreq")
posDfm[1:10,]
```
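The grouped `posDfm` no longer lines up with the per-speech `Year` variable, so for the plot one can build an ungrouped dictionary dfm; a minimal sketch:
```{r}
# ungrouped, so each row still corresponds to one speech (and one year)
posDfmYearly <- weight(dfm(inaugCorpus, dictionary = posDict, verbose = FALSE), "relFreq")
yrs <- docvars(inaugCorpus, "Year")
plot(yrs, as.matrix(posDfmYearly)[, "articles"], type = "b",
     xlab = "", ylab = "Relative frequency")
lines(yrs, as.matrix(posDfmYearly)[, "conjunctions"], type = "b", lty = 2)
legend("topright", legend = c("articles", "conjunctions"), lty = 1:2, bty = "n")
```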
Is the distribution of normalized articles and conjunctions relatively constant across
years, as you would expect?
```{r, fig.height = 10}
dotchart(as.matrix(posDfm))
```
1. Settle a Scrabble word value dispute.
Look up the Scrabble values of "aerie" and "queue", and ask yourself: how can an English word have five letters and just one consonant? It's downright **eerie**.
```{r}
scrabble(c("aerie", "queue", "eerie"))
```