---
title: "File import"
author: Kenneth Benoit and Paul Nulty
date: October 18th 2015
output: html_document
---
In this section we show how to load texts from different file sources. The `quanteda` package loads a corpus from a `corpusSource` object, which is created using the `textfile()` command.
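The basic pattern, then, is a `textfile()` call whose result is passed to `corpus()`. A minimal sketch, assuming a directory of plain text files (the path `mytexts/` below is a placeholder, not a file supplied with this tutorial):
```{r eval=FALSE}
# textfile() reads the files and returns a corpusSource object;
# corpus() then converts that object into a quanteda corpus
mySource <- textfile(file = "mytexts/*.txt")  # placeholder path
myCorpus <- corpus(mySource)
summary(myCorpus)
```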
### Three ways to create a `corpus` object
**quanteda can construct a `corpus` object** from several input sources:
1. a character vector object
```{r}
require(quanteda)
# inaugTexts[1:2]: the two inaugural addresses of George Washington
myTinyCorpus <- corpus(inaugTexts[1:2], notes = "Just G.W.")
summary(myTinyCorpus)
```
2. a `VCorpus` object from the **tm** package, and
```{r}
require(tm)
data(crude, package = "tm")  # 20 Reuters news articles about crude oil
myTmCorpus <- corpus(crude)
summary(myTmCorpus, 5)
detach("package:tm")  # detach tm so that it does not mask quanteda functions
```
3. a `corpusSource` object, created by `textfile()`.
In most cases you will need to load input files from outside of R, so you will use this third method. The remainder of this tutorial focuses on `textfile()`, which is designed to be a simple, powerful, and all-purpose method to load texts.
### Using `textfile()` to import texts
In the simplest case, we would like to load a set of plain text files from a single directory. To do this, we use the `textfile()` command with the 'glob' wildcard `*` to indicate that we want to load multiple files:
```{r message=FALSE}
# each call loads all .txt files from one directory into its own corpus
myInaugCorpus <- corpus(textfile(file='inaugural/*.txt'))
mySOTUCorpus <- corpus(textfile(file='sotu/*.txt'))
```
Often, we have metadata encoded in the names of the files. For example, the file names of the inaugural addresses contain the year and the president's name. With the `docvarsfrom` argument, we can instruct `textfile()` to treat these elements as document variables.
```{r}
mytf <- textfile("inaugural/*.txt", docvarsfrom = "filenames", dvsep = "-",
                 docvarnames = c("Year", "President"))
inaugCorpus <- corpus(mytf)
summary(inaugCorpus, 5)
```
If the texts and document variables are stored separately, we can easily add document variables to the corpus, as long as the data frame containing them has one row per text, in the same order:
```{r}
SOTUdocvars <- read.csv("SOTU_metadata.csv", stringsAsFactors = FALSE)
SOTUdocvars$Date <- as.Date(SOTUdocvars$Date, "%B %d, %Y")  # dates like "January 8, 1790"
SOTUdocvars$delivery <- as.factor(SOTUdocvars$delivery)
SOTUdocvars$type <- as.factor(SOTUdocvars$type)
SOTUdocvars$party <- as.factor(SOTUdocvars$party)
SOTUdocvars$nwords <- NULL  # drop the nwords column; it is not needed as a docvar

sotuCorpus <- corpus(textfile(file='sotu/*.txt'), encodingFrom = "UTF-8-BOM")
docvars(sotuCorpus) <- SOTUdocvars
```
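A quick sanity check that the variables were attached as intended (this assumes the rows of `SOTUdocvars` appear in the same order as the loaded text files):
```{r}
# the corpus summary should now show the added document variables
summary(sotuCorpus, 5)
```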
Another common case is that our texts are stored alongside the document variables in a structured file, such as a JSON, CSV, or Excel file. The `textfile()` command can read in the texts and document variables simultaneously from these files when the name of the field containing the texts is specified.
```{r}
# a CSV file in which the column 'inaugSpeech' contains the texts
tf1 <- textfile(file='inaugTexts.csv', textField = 'inaugSpeech')
inaugCsvCorpus <- corpus(tf1)
# a second CSV example, with the texts stored in the 'Title' column
tf2 <- textfile("text_example.csv", textField = "Title")
exampleCorpus <- corpus(tf2)
head(docvars(exampleCorpus))
```
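The same pattern should extend to a JSON source; a sketch, in which the file name `my_documents.json` and the field name `text` are hypothetical, and the JSON layouts accepted depend on the quanteda version installed:
```{r eval=FALSE}
# hypothetical JSON file with one record per document, text in a "text" field
tf3 <- textfile("my_documents.json", textField = "text")
jsonCorpus <- corpus(tf3)
summary(jsonCorpus, 5)
```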
Once we have loaded a corpus with some document-level variables, we can subset the corpus using these variables, create document-feature matrices aggregated by these variables, or extract the texts concatenated by variable (the last of these is sketched after the following code).
```{r}
# subset the inaugural corpus by the Year document variable
recentCorpus <- subset(inaugCorpus, Year > 1980)
oldCorpus <- subset(inaugCorpus, Year < 1880)
require(dplyr)  # loaded for the %>% pipe operator
# top weighted features of Democratic State of the Union addresses
demCorpus <- subset(sotuCorpus, party == 'Democratic')
demFeatures <- dfm(demCorpus, ignoredFeatures=stopwords('english')) %>%
    trim(minDoc=3, minCount=5) %>% weight(type='tfidf') %>% topfeatures
# and of Republican addresses
repCorpus <- subset(sotuCorpus, party == 'Republican')
repFeatures <- dfm(repCorpus, ignoredFeatures=stopwords('english')) %>%
    trim(minDoc=3, minCount=5) %>% weight(type='tfidf') %>% topfeatures
```
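One way to do the last of these, extracting the texts concatenated by a document variable, is with base R; a sketch, not the only approach:
```{r eval=FALSE}
# paste together all SOTU texts within each party, giving one long
# "document" per party, then build a dfm over those grouped documents
partyTexts <- sapply(split(texts(sotuCorpus), docvars(sotuCorpus, "party")),
                     paste, collapse = " ")
partyDfm <- dfm(corpus(partyTexts), ignoredFeatures = stopwords("english"))
topfeatures(partyDfm)
```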
Two `quanteda` corpus objects can be combined using the `+` operator:
```{r}
# demCorpus and repCorpus are subsets of sotuCorpus, so adding them
# recombines the State of the Union corpus
sotuCombined <- demCorpus + repCorpus
allFeatures <- dfm(sotuCombined, ignoredFeatures=stopwords('english')) %>%
    trim(minDoc=3, minCount=5) %>% weight(type='tfidf') %>% topfeatures
```
It should also be possible to load a zip file containing texts directly from a URL. Whether this operation succeeds, however, can depend on the access permission settings of your particular system (for example, it may fail on Windows):
```{r eval=FALSE}
immigfiles <- textfile("https://github.com/kbenoit/ME114/raw/master/day8/UKimmigTexts.zip")
mycorpus <- corpus(immigfiles)
summary(mycorpus)
```
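If the direct download fails, one workaround is to fetch the archive first and then point `textfile()` at the local copy; a sketch, assuming `textfile()` also accepts a local zip archive:
```{r eval=FALSE}
# download the zip archive to a temporary file, then read the texts from it
zipfile <- tempfile(fileext = ".zip")
download.file("https://github.com/kbenoit/ME114/raw/master/day8/UKimmigTexts.zip",
              destfile = zipfile, mode = "wb")
mycorpus <- corpus(textfile(zipfile))
summary(mycorpus)
```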