---
title: "Descriptive Analysis of Texts"
author: "Kenneth Benoit and Paul Nulty"
date: "18 October 2015"
output: html_document
---

quanteda has a number of descriptive statistics available for reporting on texts. The **simplest** of these is the `summary()` method:
```{r}
require(quanteda)
txt <- c(sent1 = "This is an example of the summary method for character objects.",
         sent2 = "The cat in the hat swung the bat.")
summary(txt)
```
This also works for corpus objects:
```{r}
summary(corpus(ukimmigTexts, notes = "Created as a demo."))
```
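For a large corpus, printing statistics for every document quickly becomes unwieldy. A minimal sketch, assuming this version's `summary()` method for corpus objects accepts an `n` argument limiting how many documents are described:
```{r}
# n = 5 is an assumption about this version's summary() API for corpora
summary(corpus(ukimmigTexts), n = 5)
```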
To access the **syllables** of a text, we use `syllables()`:
```{r}
syllables(c("Superman.", "supercalifragilisticexpialidocious", "The cat in the hat."))
```
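Because `syllables()` vectorizes over words, it can feed simple word-level statistics. A small sketch computing average syllables per word (the word list is invented for illustration):
```{r}
words <- c("text", "analysis", "readability", "statistics")
mean(syllables(words))  # average syllables per word
```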
We can even compute the **Scrabble value** of English words, using `scrabble()`:
```{r}
scrabble(c("cat", "quixotry", "zoo"))
```
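As a quick usage example, we can pick out the highest-scoring word with base R's `which.max()`:
```{r}
words <- c("cat", "quixotry", "zoo")
words[which.max(scrabble(words))]  # the word with the highest Scrabble value
```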
We can analyze the **lexical diversity** of texts, using `lexdiv()` on a dfm:
```{r}
myDfm <- dfm(subset(inaugCorpus, Year > 1980), verbose = FALSE)
lexdiv(myDfm, "R")
dotchart(sort(lexdiv(myDfm, "R")))
```
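Other diversity measures are available as well; a sketch assuming this version's `lexdiv()` accepts a vector of measure names such as `"TTR"` (the type-token ratio) and `"C"`:
```{r}
# the measure names are an assumption about this quanteda version
lexdiv(myDfm, c("TTR", "C"))
```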
We can analyze the **readability** of texts, using `readability()` on a vector of texts or a corpus:
```{r}
readab <- readability(subset(inaugCorpus, Year > 1980), measure = "Flesch.Kincaid")
dotchart(sort(readab))
```
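`readability()` implements many other indexes too; a sketch assuming the `measure` argument also accepts several names at once, e.g. `"Flesch"` alongside `"Flesch.Kincaid"`:
```{r}
# passing multiple measure names is an assumption about this version
readability(subset(inaugCorpus, Year > 1980), measure = c("Flesch", "Flesch.Kincaid"))
```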
We can **identify documents and terms that are similar to one another**, using `similarity()`:
```{r}
## Presidential Inaugural Address Corpus
presDfm <- dfm(inaugCorpus, ignoredFeatures = stopwords("english"))
# compute some document similarities
similarity(presDfm, "1985-Reagan", n=5, margin="documents")
similarity(presDfm, c("2009-Obama" , "2013-Obama"), n=5, margin="documents", method = "cosine")
similarity(presDfm, c("2009-Obama" , "2013-Obama"), n=5, margin="documents", method = "Hellinger")
similarity(presDfm, c("2009-Obama" , "2013-Obama"), n=5, margin="documents", method = "eJaccard")
# compute some term similarities
similarity(presDfm, c("fair", "health", "terror"), method="cosine")
```
And this can be used for **clustering documents**:
```{r, fig.height=6, fig.width=10}
data(SOTUCorpus, package="quantedaData")
presDfm <- dfm(subset(SOTUCorpus, lubridate::year(Date) > 1981), verbose = FALSE, stem = TRUE,
               ignoredFeatures = stopwords("english", verbose = FALSE))
presDfm <- trim(presDfm, minCount = 5, minDoc = 3)
# hierarchical clustering - get distances on normalized dfm
presDistMat <- dist(as.matrix(weight(presDfm, "relFreq")))
# hierarchical clustering of the distance object
presCluster <- hclust(presDistMat)
# label with document names
presCluster$labels <- docnames(presDfm)
# plot as a dendrogram
plot(presCluster)
```
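The fitted tree can also be cut into a fixed number of groups with base R's `cutree()`; a small sketch (k = 3 is an arbitrary choice):
```{r}
# assign each address to one of three clusters and list the groups
groups <- cutree(presCluster, k = 3)
split(names(groups), groups)
```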
Or we could look at **term clustering** instead:
```{r, fig.height=8, fig.width=12}
# word dendrogram with tf-idf weighting
wordDfm <- sort(weight(presDfm, "tfidf"))
wordDfm <- t(wordDfm)[1:100, ]  # keep the 100 top features (features are rows after transposing)
wordDistMat <- dist(wordDfm)
wordCluster <- hclust(wordDistMat)
plot(wordCluster, xlab="", main="tf-idf Frequency weighting")
```
Finally, there are a number of helper functions to extract information from quanteda objects:
```{r}
myCorpus <- subset(inaugCorpus, Year > 1980)
# return the number of documents
ndoc(myCorpus)
ndoc(dfm(myCorpus, verbose = FALSE))
# how many tokens (total words)
ntoken(myCorpus)
ntoken("How many words in this sentence?")
# arguments to tokenize can be passed
ntoken("How many words in this sentence?", removePunct = TRUE)
# how many types (unique words)
ntype(myCorpus)
ntype("Yada yada yada. (TADA.)")
ntype("Yada yada yada. (TADA.)", removePunct = TRUE)
ntype(toLower("Yada yada yada. (TADA.)"), removePunct = TRUE)
# can count documents and features
ndoc(inaugCorpus)
myDfm1 <- dfm(inaugCorpus, verbose = FALSE)
ndoc(myDfm1)
nfeature(myDfm1)
myDfm2 <- dfm(inaugCorpus, ignoredFeatures = stopwords("english"), stem = TRUE, verbose = FALSE)
nfeature(myDfm2)
# can extract feature labels and document names
head(features(myDfm1), 20)
head(docnames(myDfm1))
# and topfeatures
topfeatures(myDfm1)
topfeatures(myDfm2) # without stopwords
```
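These helpers combine naturally into a per-document overview; a minimal base R sketch using the corpus from above:
```{r}
# tokens and types per document as a small summary table
data.frame(tokens = ntoken(myCorpus), types = ntype(myCorpus))
```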