---
title: "Assignment 8 - Working With Textual Data"
author: "Ken Benoit and Slava Mikhaylov"
output: html_document
---
Assignments for the course focus on practical aspects of the concepts covered in the lectures. Assignments are largely based on the material covered in James et al. (2013). You will start working on the assignment in the lab sessions after the lectures, but may need to finish them after class.
You will be asked to submit your assignments via Moodle by 7pm on the day of the class. We will subsequently open up solutions to the problem sets.
### Exercise summary
This exercise is designed to get you working with [quanteda](http://github.com/kbenoit/quanteda). The focus will be on exploring the package and getting some texts into the **corpus** object format. The [quanteda](http://github.com/kbenoit/quanteda) package has several functions for creating a corpus of texts, which we will use in this exercise.
1. Getting Started.
You can use R or RStudio for these exercises. You will first need to install the package,
using the commands below. Also see the instructions for installation from the dev branch page of
http://github.com/kbenoit/quanteda.
```{r, eval=FALSE}
# needs the devtools package for this to work
if (!require(devtools)) install.packages("devtools", dependencies=TRUE)
# PREFERABLY: install the latest version from GitHub
# see the installation instructions on http://github.com/kbenoit/quanteda
devtools::install_github("kbenoit/quanteda")
# and quantedaData
devtools::install_github("quantedaData", username="kbenoit")
```
1. Exploring **quanteda** functions.
Look at the Quick Start vignette, and browse the manual for quanteda. You can use the `example()` function for any function in the package to run the examples and see how the function works (an illustration follows the next chunks). Of course you should also browse the documentation, especially `?corpus`, to see the structure of a corpus and how to construct one.
```{r, eval = FALSE}
help(package = "quanteda")
```
```{r, echo=FALSE}
require(quanteda, quietly = TRUE, warn.conflicts = FALSE)
```
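For instance, you can run the packaged examples for a function like this (using `kwic()` purely as an illustration):
```{r, eval=FALSE}
# run the built-in examples shipped with a function's documentation
example("kwic", package = "quanteda")
```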
1. Making a corpus and corpus structure
1. From a vector of texts already in memory.
The simplest way to create a corpus is to use a vector of texts already present in
R's global environment. Some text and corpus objects are built into the package,
for example `inaugTexts` is the UTF-8 encoded set of 57 presidential inaugural
addresses. Try using `corpus()` on this set of texts to create a corpus.
Once you have constructed this corpus, use the `summary()` method to see a brief
description of the corpus. The names of the character vector `inaugTexts` should
have become the document names.
```{r}
str(inaugTexts)
myInaugCorpus <- corpus(inaugTexts, notes = "Corpus created in lab")
summary(myInaugCorpus)
```
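To confirm that the names of `inaugTexts` became the document names, you can compare them directly:
```{r}
head(docnames(myInaugCorpus))
head(names(inaugTexts))
```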
1. From a directory of text files.
The `textfile()` function can read (almost) any set of files into an object that you can then call the `corpus()` function on to create a corpus. (See `?textfile` for an example.)
Here you are encouraged to select any directory of plain text files of your own. How did it work? Try using `docvars()` to assign a set of document-level variables; a sketch follows the chunk below. If you do not have a set of text files to work with, then you can use the UK 2010 manifesto texts on immigration, in the Day 8 folder, like this:
```{r, eval=FALSE}
require(quanteda)
manfiles <- textfile("https://github.com/kbenoit/ME114/raw/master/day8/UKimmigTexts.zip")
mycorpus <- corpus(manfiles)
```
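As a sketch of assigning document-level variables (the variable name and values here are purely illustrative placeholders):
```{r, eval=FALSE}
# illustrative only: attach a placeholder docvar, one value per document
docvars(mycorpus, "docOrder") <- seq_len(ndoc(mycorpus))
summary(mycorpus)
```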
1. From `.csv` or `.json` files --- see the documentation with `?textfile`.
Here you can try one of your own examples, or just file this in your mental catalogue for future reference.
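If you do want to try it, here is a minimal sketch for the `.csv` case, assuming a hypothetical file `mytexts.csv` whose documents sit in a column named `text` (the `textField` argument names that column; check `?textfile` for the exact arguments):
```{r, eval=FALSE}
# hypothetical file and column name, for illustration only
csvfiles <- textfile("mytexts.csv", textField = "text")
mycsvCorpus <- corpus(csvfiles)
summary(mycsvCorpus)
```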
1. Explore some phrases in the text.
You can do this using the `kwic()` function (for "keywords-in-context") to explore a specific word or phrase.
```{r}
kwic(inaugCorpus, "terror", 3)
```
Try substituting your own search terms, or working with your own corpus.
```{r}
kwic(ie2010Corpus, "Christmas", 3)
kwic(ie2010Corpus, "euro")
kwic(ie2010Corpus, "euro", wholeword = TRUE)
```
1. Create a document-feature matrix, using `dfm`. First, read the documentation using
`?dfm` to see the available options.
```{r}
mydfm <- dfm(inaugCorpus, ignoredFeatures = stopwords("english"))
mydfm
topfeatures(mydfm, 20)
```
Experiment with different `dfm` options, such as `stem = TRUE`. The function `trim()` allows you to reduce the size of the dfm following its construction; a sketch follows the next chunk.
```{r}
mydfmStemmed1 <- wordstem(mydfm)
topfeatures(mydfmStemmed1, 20)
mydfmStemmed2 <- dfm(inaugCorpus,
ignoredFeatures = stopwords("english"), stem = TRUE)
topfeatures(mydfmStemmed2, 20)
```
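A sketch of `trim()`, keeping only features that occur at least five times overall and in at least three documents (argument names as described in `?trim`):
```{r, eval=FALSE}
# reduce the dfm to reasonably frequent features
mydfmTrimmed <- trim(mydfm, minCount = 5, minDoc = 3)
mydfmTrimmed
```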
Grouping on a variable is an excellent feature of `dfm()`, in fact one of my favorites.
For instance, if you want to aggregate the inaugural speeches by presidential name, you can execute
```{r}
mydfm <- dfm(inaugCorpus, groups = "President")
mydfm
docnames(mydfm)
```
Note that this groups Theodore and Franklin D. Roosevelt together -- to separate them, we would need to add a first-name variable using `docvars()` and group on that as well.
Do this to aggregate the Irish budget corpus (`ie2010Corpus`) by political party, when
creating a dfm.
```{r}
docvars(ie2010Corpus)
partyDfm <- dfm(ie2010Corpus, groups = "party")
partyDfm[, 1:10]
```
1. Explore the ability to subset a corpus.
There is a `subset()` method defined for a corpus, which works just like R's normal `subset()` command. For instance, if you want a wordcloud of just Obama's two inaugural addresses, you would need to subset the corpus first:
```{r}
obamaCorpus <- subset(inaugCorpus, President=="Obama")
obamadfm <- dfm(obamaCorpus)
plot(obamadfm)
```
Try producing that plot without the stopwords. See `removeFeatures()` to remove stopwords from the dfm object directly, or supply
the `ignoredFeatures` argument to `dfm()`.
```{r}
obamadfm2 <- removeFeatures(obamadfm, stopwords("english"))
plot(obamadfm2)
# other method:
obamadfm2a <- dfm(subset(inaugCorpus, President=="Obama"),
ignoredFeatures = stopwords("english"))
all.equal(obamadfm2, obamadfm2a)
```
1. **Preparing and pre-processing texts**
1. "Cleaning"" texts
It is common to "clean" texts before processing, usually by removing
punctuation, digits, and converting to lower case. Look at the
documentation for `toLower()` and use the
command on the `exampleString` text (you can load this from
**quantedaData** using `data(exampleString)`. Can you think of cases
where cleaning could introduce homonymy?
```{r}
exampleString
toLower(exampleString)
```
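One case where lower-casing introduces homonymy: the acronym "US" becomes indistinguishable from the pronoun "us".
```{r}
# lower-casing conflates the acronym "US" with the pronoun "us"
toLower(c("The US economy", "please join us"))
```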
1. Tokenizing texts
In order to count word frequencies, we first need to split the text
into words through a process known as **tokenization**. Look at the
documentation for **quanteda**'s `tokenize()` function. Use the
`tokenize` command on `exampleString`, and examine the results. Are
there cases where it is unclear where the boundary between two words lies?
You can experiment with the options to `tokenize()`; one more example follows the next chunk.
```{r}
tokenize(toLower(exampleString))
tokenize(toLower(exampleString), removePunct = TRUE, removeNumbers = TRUE)
```
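For instance, character-level tokenization (see `?tokenize` for the full set of `what` options):
```{r}
# look at the first 20 character-level tokens
head(tokenize(toLower(exampleString), what = "character")[[1]], 20)
```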
Try segmenting `exampleString` into sentences, using `tokenize(x, what = "sentence")`.
```{r}
tokenize(exampleString, what = "sentence")
```
1. Stemming.
Stemming removes word suffixes using the Porter stemmer, found in the **SnowballC** package. The **quanteda** function to invoke the stemmer is `wordstem()`. Apply stemming to `exampleString` and examine the results. Why does it not appear to work, and what do you need to do to make it work? How would you apply this to the sentence-segmented vector? (A sketch follows the chunk below.)
```{r}
# wordstem(exampleString) # fails
wordstem(tokenize(toLower(exampleString), removePunct = TRUE))
```
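For the sentence-segmented vector, one approach (a sketch) is to segment into sentences first, then tokenize and stem each sentence separately:
```{r}
# a sketch: segment into sentences, then tokenize and stem each one
sents <- tokenize(toLower(exampleString), what = "sentence")[[1]]
lapply(sents, function(s) wordstem(tokenize(s, removePunct = TRUE)[[1]]))
```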
1. Applying "pre-processing" to the creation of a **dfm**.
**quanteda**'s `dfm()` function makes it easy to pass cleaning arguments, which are executed as part of the tokenization implemented by `dfm()`. Compare the steps required in a similar text preparation package, [**tm**](http://cran.r-project.org/package=tm):
```{r}
require(tm)
data(crude)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, stemDocument)
tdm <- TermDocumentMatrix(crude)
# same in quanteda
crudeCorpus <- corpus(crude)
crudeDfm <- dfm(crudeCorpus)
```
Inspect the dimensions of the resulting objects, including the names of the words extracted as features. It is also worth comparing the structure of the document-feature matrices returned by each package. **tm** uses the [slam](http://cran.r-project.org/web/packages/slam/index.html) *simple triplet matrix* format for representing a [sparse matrix](http://en.wikipedia.org/wiki/Sparse_matrix).
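Comparing the dimensions directly shows the two packages' different orientations:
```{r}
dim(tdm)      # terms x documents
dim(crudeDfm) # documents x features
```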
It is also -- in fact almost always -- useful to inspect the structure of this object:
```{r}
str(tdm)
```
This indicates that we can extract the names of the words from the **tm** TermDocumentMatrix object through its `dimnames` element:
```{r}
head(tdm$dimnames$Terms, 20)
```
Compare this to the results of the same operations from **quanteda**. To get the "words" from a quanteda object, you can use the `features()` function:
```{r}
features_quanteda <- features(crudeDfm)
head(features_quanteda, 20)
str(crudeDfm)
```
What proportion of the `crudeDfm` are zeros? Compare the sizes of `tdm` and `crudeDfm` using the `object.size()` function.
```{r}
# note that length() applied to a matrix gives the total number of cells
sum(crudeDfm == 0) / length(crudeDfm)
```
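And to compare the storage sizes, as the question asks:
```{r}
object.size(tdm)
object.size(crudeDfm)
```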
1. **Keywords-in-context**
1. **quanteda** provides a keyword-in-context
function that is easily usable and configurable to explore texts
in a descriptive way. Type `?kwic` to view the documentation.
1. For the Irish budget debate speeches corpus for the year 2010, called `ie2010Corpus`,
experiment with the
`kwic` function, following the syntax specified on the help page
for `kwic`. `kwic` can be used either on a character vector or a
corpus object. What class of object is returned? Try assigning the
return value from `kwic` to a new object and then examine the
object by clicking on it in the environment
pane in RStudio (or using the inspection method of your choice).
```{r}
myKwic <- kwic(ie2010Corpus, "Christmas")
class(myKwic)
str(myKwic)
```
4. Use the `kwic` function to discover the context of the word
"clean". Is this associated with environmental policy?
```{r}
kwic(inaugCorpus, "clean")
str(myKwic)
```
5. By default, `kwic` matches any word containing the pattern, since it interprets the
pattern as a "regular expression". What if we wanted to see only the literal,
entire word "disaster"? Hint: Look at the arguments using `?kwic`.
```{r}
kwic(inaugCorpus, "disaster", wholeword = TRUE)
str(myKwic)
```
1. **Descriptive statistics**
1. We can extract basic descriptive statistics from a corpus from
its document feature matrix. Make a dfm from the 2010 Irish budget
speeches corpus.
```{r}
ieDfm <- dfm(ie2010Corpus)
```
1. Examine the most frequent word features using `topfeatures()`. What are
the five most frequent words in the corpus?
```{r}
topfeatures(ieDfm)
```
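To list exactly the top five:
```{r}
topfeatures(ieDfm, 5)
```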
5. **quanteda** provides a function to count syllables in a
word — `syllables`. Try the function at the prompt. The
code below will apply this function to all the words in the
corpus, to give you a count of the total syllables in the
corpus.
```{r}
# count syllables from texts in the 2010 speech corpus
textSyls <- syllables(texts(ie2010Corpus))
# sum the syllable counts
totalSyls <- sum(textSyls)
```
How would you get the total syllables per text?
```{r}
textSyls
```
**`textSyls` is already the vector of syllable counts, one per budget text.**
3. **Lexical Diversity over Time**
1. We can plot the type-token ratio of the Irish budget speeches
over time. To do this, begin by extracting a subset of `iebudgetsCorpus`
that contains only the first speaker from each year:
```{r}
data(iebudgetsCorpus, package = "quantedaData")
finMins <- subset(iebudgetsCorpus, number=="01")
tokeninfo <- summary(finMins)
```
Note the quotation marks around the value for `number`. Why are these required here?
**Because it is a character data type, not numeric.**
2. Get the type-token ratio for each text from this subset, and
plot the resulting vector of TTRs as a function of the year. Hint: See `?lexdiv`.
```{r}
(TTR <- lexdiv(dfm(finMins, verbose = FALSE), measure = "TTR"))
plot(docvars(finMins, "year"), TTR, xlab = "", type = "b")
```
4. **Document and word associations**
1. Load the presidential inauguration corpus, select the speeches from 1900 to 1950, and create a dfm from this corpus.
```{r}
presDfm <- dfm(subset(inaugCorpus, Year >= 1900 & Year <= 1950))
```
2. Measure the term similarities for the following words: *economy*, *health*,
*women*.
```{r}
simils <- similarity(presDfm, c("economy", "health", "women"), margin = "features")
head(simils)
```
1. **Working with dictionaries**
Dictionaries are named lists, consisting of a "key" and a set of entries defining
the equivalence class for the given key. To create a simple dictionary of parts of speech, for instance, we could define a dictionary consisting of articles and conjunctions, using:
```{r}
posDict <- dictionary(list(articles = c("the", "a", "an"),
conjunctions = c("and", "but", "or", "nor", "for", "yet", "so")))
```
To let this define a set of features, we can use this dictionary when we create a `dfm`,
for instance:
```{r}
posDfm <- dfm(inaugCorpus, dictionary = posDict, groups = "President")
posDfm[1:10,]
```
Weight the `posDfm` by term frequency using `weight()`, and plot the values of articles and conjunctions (actually, here just the coordinating conjunctions) across the speeches. (**Hint:** you can use `docvars(inaugCorpus, "Year")` for the *x*-axis.) A sketch of such a plot follows the weighting chunk.
```{r}
posDfm <- weight(posDfm, "relFreq")
posDfm[1:10,]
```
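The grouped `posDfm` no longer lines up with the per-speech `Year` variable, so for the plot one can build an ungrouped dictionary dfm; a minimal sketch:
```{r}
# ungrouped, so each row still corresponds to one speech (and one year)
posDfmYearly <- weight(dfm(inaugCorpus, dictionary = posDict, verbose = FALSE), "relFreq")
yrs <- docvars(inaugCorpus, "Year")
plot(yrs, as.matrix(posDfmYearly)[, "articles"], type = "b",
     xlab = "", ylab = "Relative frequency")
lines(yrs, as.matrix(posDfmYearly)[, "conjunctions"], type = "b", lty = 2)
legend("topright", legend = c("articles", "conjunctions"), lty = 1:2, bty = "n")
```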
Is the distribution of normalized articles and conjunctions relatively constant across
years, as you would expect?
```{r, fig.height = 10}
dotchart(as.matrix(posDfm))
```
1. Settle a Scrabble word value dispute.
Look up the Scrabble values of "aerie" and "queue", and ask yourself: how can an English word have five letters and just one consonant? It's downright **eerie**.
```{r}
scrabble(c("aerie", "queue", "eerie"))
```