---
title: "Preparing and processing texts"
author: "Kenneth Benoit and Paul Nulty"
date: "18 October 2015"
output: html_document
---
Here we will step through the basic elements of preparing a text for analysis. These are tokenization, conversion to lower case, stemming, removing or selecting features, and defining equivalency classes for features, including the use of dictionaries.
### 1. Tokenization
Tokenization in quanteda is very *conservative*: by default, it only removes separator characters.
```{r}
require(quanteda, quietly = TRUE, warn.conflicts = FALSE)
txt <- c(text1 = "This is $10 in 999 different ways,\n up and down; left and right!",
         text2 = "@kenbenoit working: on #quanteda 2day\t4ever, http://textasdata.com?page=123.")
tokenize(txt)
tokenize(txt, verbose=TRUE)
tokenize(txt, removeNumbers=TRUE, removePunct=TRUE)
tokenize(txt, removeNumbers=FALSE, removePunct=TRUE)
tokenize(txt, removeNumbers=TRUE, removePunct=FALSE)
tokenize(txt, removeNumbers=FALSE, removePunct=FALSE)
tokenize(txt, removeNumbers=FALSE, removePunct=FALSE, removeSeparators=FALSE)
```
There are several options to the `what` argument:
```{r}
# sentence level
tokenize(c("Kurt Vongeut said; only assholes use semi-colons.",
"Today is Thursday in Canberra: It is yesterday in London.",
"Today is Thursday in Canberra: \nIt is yesterday in London.",
"To be? Or\not to be?"),
what = "sentence")
tokenize(inaugTexts[2], what = "sentence")
# character level
tokenize("My big fat text package.", what="character")
tokenize("My big fat text package.", what="character", removeSeparators=FALSE)
```
If performance is a key issue, two other options provide really fast and simple tokenization: `"fastestword"` and `"fasterword"`. These are less intelligent than the boundary detection used in the default `"word"` method, which is based on `stringi`/ICU boundary detection.
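For instance, the faster methods essentially split on whitespace, so punctuation stays attached to adjacent tokens. A quick sketch, reusing the `txt` object from above:
```{r}
# whitespace-only splitting: note the punctuation left attached to words
tokenize(txt, what = "fasterword")
```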
### 2. Conversion to lower case
This is a tricky one in our workflow, since it is a form of equivalency declaration rather than a tokenization step. It turns out that it is more efficient to perform this conversion at the pre-tokenization stage.
As a result, the method `toLower()` is defined for many classes of quanteda objects.
```{r}
methods(toLower)
```
We include options designed to preserve acronyms.
```{r}
test1 <- c(text1 = "England and France are members of NATO and UNESCO",
text2 = "NASA sent a rocket into space.")
toLower(test1)
toLower(test1, keepAcronyms = TRUE)
test2 <- tokenize(test1, removePunct=TRUE)
toLower(test2)
toLower(test2, keepAcronyms = TRUE)
```
`toLower()` is based on `stringi`, and is therefore nicely Unicode-compliant.
```{r}
# Russian
cat(iconv(encodedTexts[8], "windows-1251", "UTF-8"))
cat(toLower(iconv(encodedTexts[8], "windows-1251", "UTF-8")))
head(toLower(stopwords("russian")), 20)
# Arabic
cat(iconv(encodedTexts[6], "ISO-8859-6", "UTF-8"))
cat(toLower(iconv(encodedTexts[6], "ISO-8859-6", "UTF-8")))
head(toLower(stopwords("arabic")), 20)
```
**Note**: `dfm()`, the Swiss army knife, converts to lower case by default, but this can be turned off using the `toLower = FALSE` argument.
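To see the difference, compare the features produced with and without lowercasing, reusing `test1` from above:
```{r}
features(dfm(test1, verbose = FALSE))                   # lowercased by default
features(dfm(test1, toLower = FALSE, verbose = FALSE))  # case preserved
```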
### 3. Removing and selecting features
This can be done when creating a dfm:
```{r}
# with English stopwords and stemming
dfmsInaug2 <- dfm(subset(inaugCorpus, Year > 1980),
                  ignoredFeatures = stopwords("english"), stem = TRUE)
```
Or it can be done **after** creating a dfm:
```{r}
myDfm <- dfm(c("My Christmas was ruined by your opposition tax plan.",
"Does the United_States or Sweden have more progressive taxation?"),
toLower = FALSE, verbose = FALSE)
selectFeatures(myDfm, c("s$", ".y"), "keep", valuetype = "regex")
selectFeatures(myDfm, c("s$", ".y"), "remove", valuetype = "regex")
selectFeatures(myDfm, stopwords("english"), "keep", valuetype = "fixed")
selectFeatures(myDfm, stopwords("english"), "remove", valuetype = "fixed")
```
More examples:
```{r}
# removing stopwords
testText <- "The quick brown fox named Seamus jumps over the lazy dog also named Seamus, with
the newspaper from a boy named Seamus, in his mouth."
testCorpus <- corpus(testText)
# note: "also" is not in the default stopwords("english")
features(dfm(testCorpus, ignoredFeatures = stopwords("english")))
# for ngrams
features(dfm(testCorpus, ngrams = 2, ignoredFeatures = stopwords("english")))
features(dfm(testCorpus, ngrams = 1:2, ignoredFeatures = stopwords("english")))
## removing stopwords before constructing ngrams
tokensAll <- tokenize(toLower(testText), removePunct = TRUE)
tokensNoStopwords <- removeFeatures(tokensAll, stopwords("english"))
tokensNgramsNoStopwords <- ngrams(tokensNoStopwords, 2)
features(dfm(tokensNgramsNoStopwords, ngrams = 1:2))
# keep only certain words
dfm(testCorpus, keptFeatures = "*s", verbose = FALSE) # keep only words ending in "s"
dfm(testCorpus, keptFeatures = "s$", valuetype = "regex", verbose = FALSE)
# testing Twitter functions
testTweets <- c("My homie @justinbieber #justinbieber shopping in #LA yesterday #beliebers",
"2all the ha8ers including my bro #justinbieber #emabiggestfansjustinbieber",
"Justin Bieber #justinbieber #belieber #fetusjustin #EMABiggestFansJustinBieber")
dfm(testTweets, keptFeatures = "#*", removeTwitter = FALSE) # keep only hashtags
dfm(testTweets, keptFeatures = "^#.*$", valuetype = "regex", removeTwitter = FALSE)
```
One very nice, recently added feature is the ability to create a new dfm with the same feature set as an existing one. This is very useful if, for instance, we train a model on one dfm and need to predict on counts from another, but the feature sets must be equivalent.
```{r}
# selecting on a dfm
textVec1 <- c("This is text one.", "This, the second text.", "Here: the third text.")
textVec2 <- c("Here are new words.", "New words in this text.")
features(dfm1 <- dfm(textVec1))
features(dfm2a <- dfm(textVec2))
(dfm2b <- selectFeatures(dfm2a, dfm1))
identical(features(dfm1), features(dfm2b))
```
### 4. Applying equivalency classes: dictionaries, thesauruses
Dictionary creation is done through the `dictionary()` function, which turns a named list of character vectors into a dictionary object.
```{r}
# import the Laver-Garry dictionary from http://bit.ly/1FH2nvf
lgdict <- dictionary(file = "http://www.kenbenoit.net/courses/essex2014qta/LaverGarry.cat",
                     format = "wordstat")
dfm(inaugTexts, dictionary=lgdict)
# import a LIWC formatted dictionary
liwcdict <- dictionary(file = "http://www.kenbenoit.net/files/LIWC2001_English.dic",
                       format = "LIWC")
dfm(inaugTexts, dictionary=liwcdict)
```
We apply dictionaries to a dfm using the `applyDictionary()` function. Through the `valuetype` argument, we can match patterns of one of three types: `"glob"`, `"regex"`, or `"fixed"`.
```{r}
myDict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                          opposition = c("Opposition", "reject", "notincorpus"),
                          taxglob = "tax*",
                          taxregex = "tax.+$",
                          country = c("United_States", "Sweden")))
myDfm <- dfm(c("My Christmas was ruined by your opposition tax plan.",
"Does the United_States or Sweden have more progressive taxation?"),
ignoredFeatures = stopwords("english"), verbose = FALSE)
myDfm
# glob format
applyDictionary(myDfm, myDict, valuetype = "glob")
applyDictionary(myDfm, myDict, valuetype = "glob", case_insensitive = FALSE)
# regex v. glob format: note that "united_states" is a regex match for "tax*"
applyDictionary(myDfm, myDict, valuetype = "glob")
applyDictionary(myDfm, myDict, valuetype = "regex", case_insensitive = TRUE)
# fixed format: no pattern matching
applyDictionary(myDfm, myDict, valuetype = "fixed")
applyDictionary(myDfm, myDict, valuetype = "fixed", case_insensitive = FALSE)
```
It is also possible to pass through a dictionary at the time of `dfm()` creation.
```{r}
# dfm with dictionaries
mycorpus <- subset(inaugCorpus, Year > 1900)
mydict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                          opposition = c("Opposition", "reject", "notincorpus"),
                          taxing = "taxing",
                          taxation = "taxation",
                          taxregex = "tax*",
                          country = "united states"))
dictDfm <- dfm(mycorpus, dictionary=mydict)
head(dictDfm)
```
Finally, there is a related "thesaurus" feature, which collapses words in a dictionary but is not exclusive.
```{r}
mytexts <- c("British English tokenises differently, with more colour.",
"American English tokenizes color as one word.")
mydict <- dictionary(list(color = "colo*r", tokenize = "tokeni?e*"))
dfm(mytexts, thesaurus = mydict)
```
### 5. Stemming
Stemming relies on the `SnowballC` package's implementation of the Porter stemmer, and is available for the following languages:
```{r}
SnowballC::getStemLanguages()
```
It's not perfect:
```{r}
wordstem(c("win", "winning", "wins", "won", "winner"))
```
but it's fast.
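A rough way to check the speed for yourself, as a sketch (timings will vary by machine):
```{r, eval=FALSE}
# time stemming across all tokenized inaugural addresses
toks <- tokenize(toLower(inaugTexts), removePunct = TRUE)
system.time(wordstem(toks))
```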
Objects to be stemmed must already be tokenized, but they can be of several different quanteda classes:
```{r, error = TRUE}
methods(wordstem)
wordstem("This is a winning package, of many packages.")
wordstem(tokenize("This is a winning package, of many packages."))
head(wordstem(dfm(inaugTexts[1:2], verbose = FALSE)))
# same as
head(dfm(inaugTexts[1:2], stem = TRUE, verbose = FALSE))
```
### 6. `dfm()` and its many options
It operates on `character` vectors, `corpus` objects, or `tokenizedText` objects:
```{r, eval=FALSE}
## S3 method for class 'character'
dfm(x, verbose = TRUE, toLower = TRUE,
    removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE,
    removeTwitter = FALSE, stem = FALSE, ignoredFeatures = NULL,
    keptFeatures = NULL, matrixType = c("sparse", "dense"),
    language = "english", thesaurus = NULL, dictionary = NULL,
    valuetype = c("glob", "regex", "fixed"), dictionary_regex = FALSE, ...)
```
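Putting a few of these options together, a sketch using the built-in `inaugTexts`:
```{r, eval=FALSE}
# lowercase (the default), drop numbers and punctuation,
# remove English stopwords, and stem the remaining features
dfm(inaugTexts,
    removeNumbers = TRUE, removePunct = TRUE,
    ignoredFeatures = stopwords("english"),
    stem = TRUE)
```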