---
title: "Tutorial 8: Part-of-Speech tagging / Named Entity Recognition"
author: "Andreas Niekler, Gregor Wiedemann"
date: "`r format(Sys.time(), '%Y-%m-%d')`"
output:
  pdf_document:
    toc: yes
  html_document:
    toc: true
    theme: united
    toc_float: true
    number_sections: yes
    highlight: tango
bibliography: references.bib
csl: springer.csl
---
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy()
options(width = 98)
```
This tutorial shows how to use Part-of-Speech (POS) tagging with the openNLP package. The openNLP package relies on the `rJava` package. For this to work properly, you need a Java installation (e.g. OpenJDK) that matches your R installation with respect to the 32- or 64-bit architecture. The `JAVA_HOME` environment variable also needs to be set, pointing to your Java installation directory.
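Before loading openNLP, it can help to verify that R can see your Java installation. A minimal sketch of such a check (the `install.packages()` call is only needed once; package names are as on CRAN):

```r
# Check which Java installation R will use; an empty result may still be
# fine if Java is on the system path, but errors loading rJava usually
# point to this variable being unset or pointing to the wrong architecture.
Sys.getenv("JAVA_HOME")

# Install the required packages once, if not yet present;
# openNLPdata is pulled in as a dependency of openNLP.
# install.packages(c("rJava", "openNLP"))

# On Unix-like systems, if rJava fails to load after installing Java,
# re-register Java with R from the command line: R CMD javareconf
```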
# Annotate POS
We extract proper nouns (tag NNP for singular and NNPS for plural proper nouns) from paragraphs of the presidents' State of the Union speeches.
```{r initalisierung, message=FALSE, warning=FALSE}
options(stringsAsFactors = FALSE)
library(quanteda)
library(NLP)
# read SOTU paragraphs
textdata <- read.csv("data/sotu_paragraphs.csv", sep = ";", encoding = "UTF-8")
english_stopwords <- readLines("resources/stopwords_en.txt", encoding = "UTF-8")
# create corpus object
sotu_corpus <- corpus(textdata$text, docnames = textdata$doc_id)
require(openNLP)
require(openNLPdata)
# openNLP annotator objects
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
pos_tag_annotator <- Maxent_POS_Tag_Annotator()
annotator_pipeline <- Annotator_Pipeline(
  sent_token_annotator,
  word_token_annotator,
  pos_tag_annotator
)
# function for annotation
annotateDocuments <- function(doc, pos_filter = NULL) {
  doc <- as.String(doc)
  doc_with_annotations <- NLP::annotate(doc, annotator_pipeline)
  tags <- sapply(subset(doc_with_annotations, type == "word")$features, `[[`, "POS")
  tokens <- doc[subset(doc_with_annotations, type == "word")]
  if (!is.null(pos_filter)) {
    res <- tokens[tags %in% pos_filter]
  } else {
    res <- paste0(tokens, "_", tags)
  }
  res <- paste(res, collapse = " ")
  return(res)
}
# run annotation on a sample of the corpus
annotated_corpus <- lapply(texts(sotu_corpus)[1:10], annotateDocuments)
# have a look at the first annotated documents
annotated_corpus[1]
annotated_corpus[2]
```
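The `pos_filter` argument can also be used to keep only specific word classes of a single document. A small usage sketch (the tag names follow the Penn Treebank tag set used by the openNLP English models):

```r
# Keep only common nouns (singular and plural) of the first paragraph;
# without a filter, every token would be returned with its tag appended.
annotateDocuments(texts(sotu_corpus)[1], pos_filter = c("NN", "NNS"))
```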
# Filter NEs for further applications
We annotate the first 1,000 paragraphs of the corpus, extract proper nouns, also referred to as Named Entities (NEs), such as person names and locations, and compute the statistical significance of their co-occurrences.
```{r cooc, message=FALSE, warning=FALSE, cache=T}
sample_corpus <- sapply(texts(sotu_corpus)[1:1000], annotateDocuments, pos_filter = c("NNP", "NNPS", ""))
# Binary term matrix
require(Matrix)
minimumFrequency <- 2
filtered_corpus <- corpus(sample_corpus)
binDTM <- filtered_corpus %>%
  tokens(what = "fastestword") %>%
  tokens_tolower() %>%
  dfm() %>%
  dfm_weight(scheme = "boolean")
# Matrix multiplication for cooccurrence counts
coocCounts <- t(binDTM) %*% binDTM
source("calculateCoocStatistics.R")
# Define the term for which co-occurrence statistics are calculated
coocTerm <- "spain"
coocs <- calculateCoocStatistics(coocTerm, binDTM, measure="LOGLIK")
print(coocs[1:20])
```
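Why does `t(binDTM) %*% binDTM` count co-occurrences? With a binary document-term matrix, entry (i, j) of this cross-product is the number of documents in which terms i and j appear together. A toy sketch with three invented documents and terms:

```r
library(Matrix)
# Toy binary DTM: rows = documents, columns = terms (invented data)
toyDTM <- Matrix(c(1, 1, 0,   # doc1 contains: spain, france
                   1, 0, 1,   # doc2 contains: spain, war
                   1, 1, 1),  # doc3 contains all three terms
                 nrow = 3, byrow = TRUE, sparse = TRUE,
                 dimnames = list(NULL, c("spain", "france", "war")))
toyCooc <- t(toyDTM) %*% toyDTM
toyCooc
# The off-diagonal entry ["spain", "france"] is 2, because the two terms
# co-occur in doc1 and doc3. Diagonal entries are document frequencies.
```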
# German language support
For German language support, install and load the German openNLP models:
```{r German, eval = F}
# install.packages("openNLPmodels.de", repos = "http://datacube.wu.ac.at")
# require("openNLPmodels.de")
```
# Optional exercises
1. Plot a co-occurrence graph for the term "family_NN" and its collocates, as in Tutorial 5.
2. Merging tokens with identical consecutive POS tags can be a useful approach to identifying multi-word units (MWUs). Modify the function `annotateDocuments` so that tokens with identical consecutive POS tags are merged into a single token (e.g. "United_NNP States_NNP" becomes "United_States_NNP").
3. Bring it all together: create a topic model visualization (topic distribution per decade, Tutorial: Topic Models) based only on paragraphs related to foreign policy (Tutorial: Text Classification). Use only nouns (NN, NNS) and proper nouns (NNP, NNPS) for the model (Tutorial: POS-tagging).