---
title: "Tutorial 8: Part-of-Speech tagging / Named Entity Recognition"
author: "Andreas Niekler, Gregor Wiedemann"
date: "`r format(Sys.time(), '%Y-%m-%d')`"
output:
  pdf_document:
    toc: yes
  html_document:
    toc: true
    theme: united
    toc_float: true
    number_sections: yes
    highlight: tango
bibliography: references.bib
csl: springer.csl
---
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy()
options(width = 98)
```
This tutorial shows how to use Part-of-Speech (POS) tagging with the openNLP package. The openNLP package relies on the `rJava` package. For this to work properly, you need a Java installation (e.g. OpenJDK) that matches your R installation with respect to the 32- or 64-bit architecture. The `JAVA_HOME` environment variable also needs to be set, pointing to your Java installation directory.
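Before loading openNLP, it can help to verify that R can see your Java installation. A minimal sketch of such a check (the `install.packages()` call is only needed once; package names are as on CRAN):

```r
# Check which Java installation R will use; an empty result may still be
# fine if Java is on the system path, but errors loading rJava usually
# point to this variable being unset or pointing to the wrong architecture.
Sys.getenv("JAVA_HOME")

# Install the required packages once, if not yet present;
# openNLPdata is pulled in as a dependency of openNLP.
# install.packages(c("rJava", "openNLP"))

# On Unix-like systems, if rJava fails to load after installing Java,
# re-register Java with R from the command line: R CMD javareconf
```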
# Annotate POS
We extract proper nouns (tag NNP for singular and NNPS for plural proper nouns) from paragraphs of the presidents' State of the Union speeches.
```{r initalisierung, message=FALSE, warning=FALSE}
options(stringsAsFactors = FALSE)
library(quanteda)
library(NLP)
# read SOTU paragraphs
textdata <- read.csv("data/sotu_paragraphs.csv", sep = ";", encoding = "UTF-8")
english_stopwords <- readLines("resources/stopwords_en.txt", encoding = "UTF-8")
# create corpus object
sotu_corpus <- corpus(textdata$text, docnames = textdata$doc_id)
require(openNLP)
require(openNLPdata)
# openNLP annotator objects
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
pos_tag_annotator <- Maxent_POS_Tag_Annotator()
annotator_pipeline <- Annotator_Pipeline(
  sent_token_annotator,
  word_token_annotator,
  pos_tag_annotator
)
# function for annotation
annotateDocuments <- function(doc, pos_filter = NULL) {
  doc <- as.String(doc)
  doc_with_annotations <- NLP::annotate(doc, annotator_pipeline)
  tags <- sapply(subset(doc_with_annotations, type == "word")$features, `[[`, "POS")
  tokens <- doc[subset(doc_with_annotations, type == "word")]
  if (!is.null(pos_filter)) {
    res <- tokens[tags %in% pos_filter]
  } else {
    res <- paste0(tokens, "_", tags)
  }
  res <- paste(res, collapse = " ")
  return(res)
}
# run annotation on a sample of the corpus
annotated_corpus <- lapply(texts(sotu_corpus)[1:10], annotateDocuments)
# have a look at the first annotated documents
annotated_corpus[1]
annotated_corpus[2]
```
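The `pos_filter` argument can also be used to keep only specific word classes of a single document. A small usage sketch (the tag names follow the Penn Treebank tag set used by the openNLP English models):

```r
# Keep only common nouns (singular and plural) of the first paragraph;
# without a filter, every token would be returned with its tag appended.
annotateDocuments(texts(sotu_corpus)[1], pos_filter = c("NN", "NNS"))
```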
# Filter NEs for further applications
We annotate the first 1,000 paragraphs of the corpus, extract proper nouns, also referred to as Named Entities (NEs), such as person names and locations, and compute the statistical significance of their co-occurrences.
```{r cooc, message=FALSE, warning=FALSE, cache=T}
sample_corpus <- sapply(texts(sotu_corpus)[1:1000], annotateDocuments, pos_filter = c("NNP", "NNPS", ""))
# Binary term matrix
require(Matrix)
minimumFrequency <- 2
filtered_corpus <- corpus(sample_corpus)
binDTM <- filtered_corpus %>%
  tokens(what = "fastestword") %>%
  tokens_tolower() %>%
  dfm() %>%
  dfm_weight(scheme = "boolean")
# Matrix multiplication for cooccurrence counts
coocCounts <- t(binDTM) %*% binDTM
source("calculateCoocStatistics.R")
# Define the term for which co-occurrence statistics are calculated
coocTerm <- "spain"
coocs <- calculateCoocStatistics(coocTerm, binDTM, measure="LOGLIK")
print(coocs[1:20])
```
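Why does `t(binDTM) %*% binDTM` count co-occurrences? With a binary document-term matrix, entry (i, j) of this cross-product is the number of documents in which terms i and j appear together. A toy sketch with three invented documents and terms:

```r
library(Matrix)
# Toy binary DTM: rows = documents, columns = terms (invented data)
toyDTM <- Matrix(c(1, 1, 0,   # doc1 contains: spain, france
                   1, 0, 1,   # doc2 contains: spain, war
                   1, 1, 1),  # doc3 contains all three terms
                 nrow = 3, byrow = TRUE, sparse = TRUE,
                 dimnames = list(NULL, c("spain", "france", "war")))
toyCooc <- t(toyDTM) %*% toyDTM
toyCooc
# The off-diagonal entry ["spain", "france"] is 2, because the two terms
# co-occur in doc1 and doc3. Diagonal entries are document frequencies.
```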
# German language support
For German language support, install and load the German openNLP models:
```{r German, eval = F}
# install.packages("openNLPmodels.de", repos = "http://datacube.wu.ac.at")
# require("openNLPmodels.de")
```
# Optional exercises
1. Plot a co-occurrence graph for the term "family_NN" and its collocates, as in Tutorial 5.
2. Merging tokens with identical consecutive POS tags can be a useful approach to identifying multi-word units (MWUs). Modify the function `annotateDocuments` so that tokens with identical consecutive POS tags are merged into a single token (e.g. "United_NNP States_NNP" becomes "United_States_NNP").
3. Bring it all together: create a topic model visualization (topic distribution per decade, Tutorial: Topic Models) based only on paragraphs related to foreign policy (Tutorial: Text Classification). Use only nouns (NN, NNS) and proper nouns (NNP, NNPS) for the model (Tutorial: POS-tagging).