---
title: "Descriptive Analysis of Texts"
author: "Kenneth Benoit and Paul Nulty"
date: "18 October 2015"
output: html_document
---

quanteda has a number of descriptive statistics available for reporting on texts. The **simplest** of these is the `summary()` method:
```{r}
require(quanteda)
txt <- c(sent1 = "This is an example of the summary method for character objects.",
         sent2 = "The cat in the hat swung the bat.")
summary(txt)
```
This also works for corpus objects:
```{r}
summary(corpus(ukimmigTexts, notes = "Created as a demo."))
```
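For a large corpus, printing statistics for every document quickly becomes unwieldy. A minimal sketch, assuming this version's `summary()` method for corpus objects accepts an `n` argument limiting how many documents are described:
```{r}
# n = 5 is an assumption about this version's summary() API for corpora
summary(corpus(ukimmigTexts), n = 5)
```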
To access the **syllables** of a text, we use `syllables()`:
```{r}
syllables(c("Superman.", "supercalifragilisticexpialidocious", "The cat in the hat."))
```
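Because `syllables()` vectorizes over words, it can feed simple word-level statistics. A small sketch computing average syllables per word (the word list is invented for illustration):
```{r}
words <- c("text", "analysis", "readability", "statistics")
mean(syllables(words))  # average syllables per word
```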
We can even compute the **Scrabble value** of English words, using `scrabble()`:
```{r}
scrabble(c("cat", "quixotry", "zoo"))
```
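As a quick usage example, we can pick out the highest-scoring word with base R's `which.max()`:
```{r}
words <- c("cat", "quixotry", "zoo")
words[which.max(scrabble(words))]  # the word with the highest Scrabble value
```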
We can analyze the **lexical diversity** of texts, using `lexdiv()` on a dfm:
```{r}
myDfm <- dfm(subset(inaugCorpus, Year > 1980), verbose = FALSE)
lexdiv(myDfm, "R")
dotchart(sort(lexdiv(myDfm, "R")))
```
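Other diversity measures are available as well; a sketch assuming this version's `lexdiv()` accepts a vector of measure names such as `"TTR"` (the type-token ratio) and `"C"`:
```{r}
# the measure names are an assumption about this quanteda version
lexdiv(myDfm, c("TTR", "C"))
```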
We can analyze the **readability** of texts, using `readability()` on a vector of texts or a corpus:
```{r}
readab <- readability(subset(inaugCorpus, Year > 1980), measure = "Flesch.Kincaid")
dotchart(sort(readab))
```
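`readability()` implements many other indexes too; a sketch assuming the `measure` argument also accepts several names at once, e.g. `"Flesch"` alongside `"Flesch.Kincaid"`:
```{r}
# passing multiple measure names is an assumption about this version
readability(subset(inaugCorpus, Year > 1980), measure = c("Flesch", "Flesch.Kincaid"))
```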
We can **identify documents and terms that are similar to one another**, using `similarity()`:
```{r}
## Presidential Inaugural Address Corpus
presDfm <- dfm(inaugCorpus, ignoredFeatures = stopwords("english"))
# compute some document similarities
similarity(presDfm, "1985-Reagan", n=5, margin="documents")
similarity(presDfm, c("2009-Obama" , "2013-Obama"), n=5, margin="documents", method = "cosine")
similarity(presDfm, c("2009-Obama" , "2013-Obama"), n=5, margin="documents", method = "Hellinger")
similarity(presDfm, c("2009-Obama" , "2013-Obama"), n=5, margin="documents", method = "eJaccard")
# compute some term similarities
similarity(presDfm, c("fair", "health", "terror"), method="cosine")
```
And this can be used for **clustering documents**:
```{r, fig.height=6, fig.width=10}
data(SOTUCorpus, package="quantedaData")
presDfm <- dfm(subset(SOTUCorpus, lubridate::year(Date) > 1981), verbose = FALSE, stem = TRUE,
               ignoredFeatures = stopwords("english", verbose = FALSE))
presDfm <- trim(presDfm, minCount = 5, minDoc = 3)
# hierarchical clustering - get distances on normalized dfm
presDistMat <- dist(as.matrix(weight(presDfm, "relFreq")))
# hierarchical clustering of the distance object
presCluster <- hclust(presDistMat)
# label with document names
presCluster$labels <- docnames(presDfm)
# plot as a dendrogram
plot(presCluster)
```
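The fitted tree can also be cut into a fixed number of groups with base R's `cutree()`; a small sketch (k = 3 is an arbitrary choice):
```{r}
# assign each address to one of three clusters and list the groups
groups <- cutree(presCluster, k = 3)
split(names(groups), groups)
```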
Or we could look at **term clustering** instead:
```{r, fig.height=8, fig.width=12}
# word dendrogram with tf-idf weighting
wordDfm <- sort(weight(presDfm, "tfidf"))
wordDfm <- t(wordDfm)[1:100, ]  # keep the 100 top features (features are rows after transposing)
wordDistMat <- dist(wordDfm)
wordCluster <- hclust(wordDistMat)
plot(wordCluster, xlab="", main="tf-idf Frequency weighting")
```
Finally, there are a number of helper functions to extract information from quanteda objects:
```{r}
myCorpus <- subset(inaugCorpus, Year > 1980)
# return the number of documents
ndoc(myCorpus)
ndoc(dfm(myCorpus, verbose = FALSE))
# how many tokens (total words)
ntoken(myCorpus)
ntoken("How many words in this sentence?")
# arguments to tokenize can be passed
ntoken("How many words in this sentence?", removePunct = TRUE)
# how many types (unique words)
ntype(myCorpus)
ntype("Yada yada yada. (TADA.)")
ntype("Yada yada yada. (TADA.)", removePunct = TRUE)
ntype(toLower("Yada yada yada. (TADA.)"), removePunct = TRUE)
# can count documents and features
ndoc(inaugCorpus)
myDfm1 <- dfm(inaugCorpus, verbose = FALSE)
ndoc(myDfm1)
nfeature(myDfm1)
myDfm2 <- dfm(inaugCorpus, ignoredFeatures = stopwords("english"), stem = TRUE, verbose = FALSE)
nfeature(myDfm2)
# can extract feature labels and document names
head(features(myDfm1), 20)
head(docnames(myDfm1))
# and topfeatures
topfeatures(myDfm1)
topfeatures(myDfm2) # without stopwords
```
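These helpers combine naturally into a per-document overview; a minimal base R sketch using the corpus from above:
```{r}
# tokens and types per document as a small summary table
data.frame(tokens = ntoken(myCorpus), types = ntype(myCorpus))
```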