-
Notifications
You must be signed in to change notification settings - Fork 8
/
Copy pathworkflow.Rmd
99 lines (74 loc) · 4.11 KB
/
workflow.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
---
title: "Example workflow"
author: Kenneth Benoit and Paul Nulty
date: 27 October 2016
output: html_document
---
This file demonstrates a basic workflow to take some pre-loaded texts and perform elementary text analysis tasks quickly. The **quanteda** package comes with a built-in set of inaugural addresses from US Presidents. We begin by loading quanteda and examining these texts. The `summary` command will output the name of each text along with the number of types, tokens and sentences contained in the text. Below we use R's indexing syntax to selectivly use the summary command on the first five texts.
```{r}
require(quanteda)
?devtools::install_github
summary(inaugTexts)
summary(inaugTexts[1:5])
```
Indexing a single text from our character vector of inaugural addresses returns a single character vector -- note that unlike strings in some other languages, in R this is an atomic unit of length one which cannot be sub-indexed further with bracket notation. Use the `substr` function to extract or replace substrings, and the `nchar` function to count the number of characters in a character vector of length one.
```{r}
oneText <- inaugTexts[1]
length(inaugTexts)
length(oneText)
nchar(oneText)
nchar(inaugTexts[5:7])
```
One of the most fundamental text analysis tasks is tokenization. To tokenize a text is to split it into smaller units, most commonly words, which can be counted and to form the basis of a quantitative analysis. The quanteda package has a function for tokenization: `tokenize`. Examine the manual page for this function.
```{r}
?tokenize
```
**quanteda**'s tokenize function can be used on a simple character vector, a vector of character vectors, or a corpus. Here are some examples:
```{r}
tokenize("Today is Thursday in Canberra. It is yesterday in London.")
vec <- c(one = "This is text one", two = "This, however, is the second text")
tokenize(vec)
```
Consider the default arguments to the tokenize function. To remove punctuation, you should set the `removePunct` argument to be TRUE. We can combine this with the `toLower` function to get a cleaned and tokenized version of our text.
```{r}
tokenize(toLower(vec), removePunct = TRUE)
```
Using this function with the inaugural addresses:
```{r}
require(dplyr)
inaugTokens <- tokenize(toLower(inaugTexts))
inaugTokens[2]
```
Once each text has been split into words, we can use the `dfm` function to create a matrix of counts of the occurrences of each word in each document:
```{r}
inaugDfm <- dfm(inaugTokens)
trimmedInaugDfm <- trim(inaugDfm, minDoc = 5, minCount = 10)
weightedTrimmedDfm <- weight(trimmedInaugDfm, type = "tfidf")
require(magrittr)
inaugDfm2 <- dfm(inaugTokens) %>% trim(minDoc=5, minCount=10) %>%
weight(type = "tfidf")
```
Note that `dfm()` works on a variety of object types, including character vectors, corpus objects, and tokenized text objects. This gives the user maximum flexibility and power, while also making it easy to achieve similar results by going directly from texts to a document-by-feature matrix.
To see what objects for which any particular *method* (function) is defined, you can use the `methods()` function:
```{r}
methods(dfm)
```
Likewise, you can also figure out what methods are defined for any given *class* of object, using the same function:
```{r}
methods(class = "tokenizedTexts")
```
If we are interested in analysing the texts with respect to some other variables, we can create a corpus object to associate the texts with this metadata. For example, consider the last six inaugural addresses:
```{r}
summary(inaugTexts[52:57])
```
We can use the `docvars` option to the `corpus` command to record the party with which each text is associated:
```{r}
dv <- data.frame(Party = c("dem", "dem", "rep", "rep", "dem", "dem"))
recentCorpus <- corpus(inaugTexts[52:57], docvars = dv)
summary(recentCorpus)
```
We can use this metadata to combine features across documents when creating a document-feature matrix:
```{r fig.width=10, fig.height=8, warning=FALSE}
partyDfm <- dfm(recentCorpus, groups = "Party", ignoredFeatures = stopwords("english"), verbose = FALSE)
plot(partyDfm, comparison = TRUE)
```