# Google Natural Language API
<chauthors>Paul C. Bauer, Camille Landesvatter, Malte Söhren</chauthors>
<br><br>
```{r google-nlp-1, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE, cache=TRUE)
```
You will need to install the following packages for this chapter (run the code):
```{r google-nlp-2, echo=FALSE, comment=NA}
# Run this and copy output into the p_load function in next chunk
file_name <- dir()[stringr::str_detect(dir(), "Google_nlp_api.Rmd")]
lines_text <- readLines(file_name)
packages <- gsub("library\\(|\\)", "", unlist(str_extract_all(lines_text, "library\\([a-zA-z0-9]*\\)|p_load\\([a-zA-z0-9]*\\)")))
packages <- packages[packages!="pacman"]
packages <- paste("# install.packages('pacman')", "library(pacman)", "p_load('", paste(packages, collapse="', '"), "')",sep="")
packages <- str_wrap(packages, width = 80)
packages <- gsub("install.packages\\('pacman'\\)", "install.packages\\('pacman'\\)\n", packages)
packages <- gsub("library\\(pacman\\)", "library\\(pacman\\)\n", packages)
cat(packages)
```
## Provided services/data
* *What data/service is provided by the API?*
The API is provided by Google.
Google Cloud offers two Natural Language products: AutoML Natural Language and the Natural Language API. See [here](https://cloud.google.com/natural-language#section-3) to read about which of the two products is the more useful one for you.
In short, option 1, AutoML Natural Language, allows you to train a new, custom model to classify text, extract entities, or detect sentiment. For instance, you could provide a pre-labeled subset of your data, which the API then uses to train a custom classifier. With this classifier at hand, you could then classify and analyze further, similar data.
This API review focuses on option 2, the Natural Language API. This API uses pre-trained models to analyze your data. Put differently, instead of providing a pre-labeled subset of your data for training, you provide the API with your complete (unlabeled) data, which it then analyzes.
The following requests are available:
* Analyzing Sentiment (`analyzeSentiment`)
* Analyzing Entities (`analyzeEntities`)
* Analyzing Syntax (`analyzeSyntax`)
* Analyzing Entity Sentiment (`analyzeEntitySentiment`)
* Classifying Content (`classifyText`)
A demo of the API that allows you to input text and explore the different classification capabilities can be found [here](https://cloud.google.com/natural-language#section-2).
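Under the hood, each of these request types corresponds to a REST endpoint of the form `https://language.googleapis.com/v1/documents:<method>` (version v1 of the API). A minimal sketch that only builds these URLs in R, without sending any request:
```{r google-nlp-endpoints, echo=TRUE}
# The five request types map onto REST endpoints of the v1 API;
# this only constructs the URLs for illustration, no call is made
methods <- c("analyzeSentiment", "analyzeEntities", "analyzeSyntax",
             "analyzeEntitySentiment", "classifyText")
paste0("https://language.googleapis.com/v1/documents:", methods)
```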
## Prerequisites
* *What are the prerequisites to access the API (authentication)? *
The prerequisite to access the Google Natural Language API is a Google Cloud Project. To create one, you will need a Google account to log into the Google Cloud Platform (GCP). Within the GCP Console, you must then enable the Natural Language API for your respective Google Cloud Project [here](https://console.cloud.google.com/marketplace/product/google/language.googleapis.com).
Additionally, if you plan to call the Natural Language API from outside a Google Cloud environment (e.g., from R), you will need a private (service account) key. This can be achieved by creating a [service account](https://cloud.google.com/docs/authentication/production#create_service_account), which in turn allows you to download your private key as a JSON file. To create an API key for authentication from within the GCP, go to [APIs & Services > Credentials](https://console.cloud.google.com/apis/credentials). Below we provide an example of how to authenticate from within the Google Cloud Platform (Cloud Shell + API key) and how to authenticate from within R (via the JSON key file).
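For later use from R, it is convenient not to hard-code the path to the downloaded JSON key in your scripts. A minimal sketch, assuming you store the path in an environment variable in your `.Renviron` file (the variable name `GL_AUTH` follows the convention of the `googleLanguageR` package used further below):
```{r google-nlp-envvar, eval=FALSE}
# In ~/.Renviron (restart R afterwards), add a line such as:
# GL_AUTH=/path/to/your-service-account-key.json

# In R, the path to the key file can then be retrieved with:
Sys.getenv("GL_AUTH")
```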
## Simple API call
* *What does a simple API call look like?*
Here we describe how a simple API call can be made from within the Google Cloud Platform environment via the Google Cloud Shell:
* To activate your Cloud Shell, inspect the upper right-hand corner of your Google Cloud Platform Console and click the icon called “Activate Shell”. Google Cloud Shell is a command line environment running in the cloud.
* Via the Cloud Shell command line, add your individual API key to the environment variables, so it does not have to be passed explicitly with each request.
```{r google-nlp-3, echo=TRUE, eval=FALSE}
export API_KEY=<YOUR_API_KEY>
```
* Via the built-in Editor in Cloud Shell, create a JSON file (call it, for instance, ‘request.json’) with the text that you would like to analyze. Note that text can either be uploaded directly in the request (shown below) or integrated via Cloud Storage. Supported document types are PLAIN_TEXT (shown below) and HTML.
```{r google-nlp-4, echo=TRUE, eval=FALSE}
{
"document":{
"type":"PLAIN_TEXT",
"content":"Enjoy your vacation!"
},
"encodingType": "UTF8"
}
```
* To send your data, pass a curl command to your Cloud Shell command line in which you refer (via @) to the request.json file from the previous step.
```{r google-nlp-5, echo=TRUE, eval=FALSE}
curl "https://language.googleapis.com/v1/documents:analyzeEntities?key=${API_KEY}" -s -X POST -H "Content-Type: application/json" --data-binary @request.json
```
* Depending on which endpoint you send the request to (here: `analyzeEntities`), you will receive a response with different insights into your text data.
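For instance, to obtain sentiment instead of entities, the same request body can be sent to the `analyzeSentiment` endpoint:
```{r google-nlp-sentiment-curl, echo=TRUE, eval=FALSE}
curl "https://language.googleapis.com/v1/documents:analyzeSentiment?key=${API_KEY}" -s -X POST -H "Content-Type: application/json" --data-binary @request.json
```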
## API access
* *How can we access the API from R (httr + other packages)?*
The input (i.e., text data) one provides to the API will most often go beyond a single word or sentence. You can call the REST endpoints directly from R, for example with `httr` (see the sketch below). The most convenient way, however, which also produces the most insightful and structured results (on which you can directly perform further analyses), is the [‘googleLanguageR’ R package](https://cran.r-project.org/web/packages/googleLanguageR/index.html) - a package which among other options (there are other examples in this review) allows calling the Natural Language API.
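A minimal sketch of such a direct call with `httr`, assuming your API key is stored in an environment variable called `NL_API_KEY` (a name we chose for illustration):
```{r google-nlp-httr, eval=FALSE}
library(httr)

# Direct POST request to the analyzeSentiment endpoint of the v1 API;
# NL_API_KEY is an environment variable name chosen for this illustration
response <- POST(
  url = paste0("https://language.googleapis.com/v1/documents:analyzeSentiment",
               "?key=", Sys.getenv("NL_API_KEY")),
  body = list(
    document = list(type = "PLAIN_TEXT", content = "Enjoy your vacation!"),
    encodingType = "UTF8"
  ),
  encode = "json"
)

# Parse the JSON response into an R list
content(response)
```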
In the following small example with `googleLanguageR` we demonstrate how to..
* .. authenticate with your Google Cloud account from within R
* .. analyze the syntax of exemplary newspaper articles (Guardian articles obtained via the `quanteda.corpora` package)
* .. extract those terms that are nouns
* .. plot these nouns in a word cloud
*Step 1: Load packages*
```{r google-nlp-6, message=FALSE, warning=FALSE}
library(googleLanguageR)
library(tidyverse)
library(tm) # stopwords
```
*Step 2: Authentication*
```{r google-nlp-7, eval=FALSE}
gl_auth("./your-key.json")
```
*Step 3: Analysis*
Start by loading your text data. For this example, we retrieve data from the quanteda.corpora R package, which is associated with the well-known [quanteda](https://quanteda.io/) package.
```{r google-nlp-8, echo=FALSE}
# Load package
library(quanteda.corpora)
```
The corpus we download ('data_corpus_guardian') contains Guardian newspaper articles from the politics, economy, society, and international sections, published between 2012 and 2016. See [here](https://github.com/quanteda/quanteda.corpora) for a list of further publicly available text corpora from the quanteda.corpora package.
<!-- A cache version of df is available in "figures/rds/google_nlp_data.RDS" -->
```{r google-nlp-9, eval=F}
# Download and store corpus
guardian_corpus <- quanteda.corpora::download("data_corpus_guardian")
# Keep text only from the corpus
text <- guardian_corpus[["documents"]][["texts"]]
# For demonstration purposes, subset the text data to 20 observations only
text <- text[1:20]
# Turn text into a data frame and add an identifier
df <- as.data.frame(text)
df <- tibble::rowid_to_column(df, "ID")
```
```{r, google-nlp-10, echo = FALSE, message = FALSE}
df <- readRDS("figures/rds/google_nlp_data.RDS")
```
*Note*: Whenever you work with textual data, a very common procedure is to pre-process the data via a set of transformations. For instance, you may convert all letters to lower case, remove numbers and punctuation, reduce words to their word stem (stemming), and remove so-called stopwords. There are many tutorials describing these steps (for example [here](https://datamathstat.wordpress.com/2019/10/25/text-preprocessing-for-nlp-and-machine-learning-using-r/) or [here](http://rstudio-pubs-static.s3.amazonaws.com/256588_57b585da6c054349825cba46685d8464.html)).
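A minimal sketch of what such pre-processing could look like for the data frame `df` created above (purely illustrative; we do not apply these steps here, and the exact choice of transformations depends on your use case):
```{r google-nlp-preprocessing, eval=FALSE}
# Illustrative pre-processing of the text column (not applied below)
df_clean <- df %>%
  mutate(text = tolower(text),                              # lower case
         text = str_replace_all(text, "[0-9]+", ""),        # remove numbers
         text = str_replace_all(text, "[[:punct:]]+", " "), # remove punctuation
         text = tm::removeWords(text, tm::stopwords("en")), # remove stopwords
         text = str_squish(text))                           # trim extra whitespace
```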
After having retrieved and prepared the text data to be analyzed, we can now call the API via the function `gl_nlp()`. Here you have to specify the type of analysis you are interested in via the `nlp_type` argument (here: `analyzeSyntax`). Depending on which type you choose (e.g., `analyzeSyntax`, `analyzeSentiment`, etc.), a list with information on different characteristics of your text is returned, e.g., sentences, tokens, and tags of tokens.
<!-- A cache version of syntax_analysis is available in "figures/rds/google_nlp_syntax_analysis.RDS" -->
```{r google-nlp-11, eval=F}
syntax_analysis <- gl_nlp(df$text, nlp_type = "analyzeSyntax")
```
```{r, google-nlp-12, echo = FALSE, message = FALSE}
syntax_analysis <- readRDS("figures/rds/google_nlp_syntax_analysis.RDS")
```
Of particular interest is the element `tokens` within the large list `syntax_analysis`. Among other things, it stores the variables `content` (the token itself) and `tag` (the part-of-speech tag, e.g., verb or noun). Let's have a look at the first document.
```{r google-nlp-13, message=FALSE, warning=FALSE}
head(syntax_analysis[["tokens"]][[1]][,1:3])
```
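To get an overview of how often each part-of-speech tag occurs in the first document, one can simply tabulate the `tag` column:
```{r google-nlp-tag-table, message=FALSE, warning=FALSE}
# Frequency of part-of-speech tags in the first document
table(syntax_analysis[["tokens"]][[1]]$tag)
```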
Now imagine you are interested in all the nouns that were used in the Guardian articles, while discarding all other types of words (e.g., adjectives, verbs, etc.). We can simply filter for those using the `tag` variable.
```{r google-nlp-14, message=FALSE, warning=FALSE}
# Add tokens from syntax analysis to original dataframe
df$tokens <- syntax_analysis[["tokens"]]
# Keep nouns only
df <- df %>% dplyr::mutate(nouns = map(tokens,
~ dplyr::filter(., tag == "NOUN")))
```
*Step 4: Visualization*
Finally, we can plot our nouns in a word cloud using the [ggwordcloud](https://cran.r-project.org/web/packages/ggwordcloud/vignettes/ggwordcloud.html) package.
```{r google-nlp-15, message=FALSE, warning=FALSE}
# Load package
library(ggwordcloud)
```
```{r google-nlp-16, message=FALSE, warning=FALSE, fig.cap = "Wordcloud of nouns found within guardian articles"}
# Create the data for the plot
data_plot <- df %>%
  # Keep only the content variable within each tokens data frame
  mutate(nouns = map(nouns,
                     ~ select(., content))) %>%
  # Unnest the list column into one row per noun
  unnest(nouns) %>%
  # Split the content into individual lowercase words
  tidytext::unnest_tokens(output = word, input = content) %>%
  # Remove stopwords
  anti_join(tidytext::stop_words) %>%
  # Count word frequencies
  dplyr::count(word) %>%
  # Only plot words that appear more than 10 times
  filter(n > 10)
# Visualize in a word cloud
data_plot %>%
ggplot(aes(label = word,
size = n)) +
geom_text_wordcloud() +
scale_size_area(max_size = 10) +
theme_minimal()
```
## Social science examples
* *Are there social science research examples using the API?*
Text-as-data has become quite a common approach in the social sciences (see, e.g., @Grimmer2013-xe for an overview). Google's Natural Language API, however, seems to be relatively unknown among social scientists working with NLP. Hence, we want to emphasize how useful the API could be in many research projects.
If you are considering making use of it, keep two things in mind:
* Some might interpret using the API as a "black box" approach (see @Dobbrick2021-iz for recent developments of "glass-box" machine learning approaches for text analysis), potentially standing in the way of transparency and replication (two important criteria of good research). However, it is always possible to perform robustness and sensitivity analyses and to report the version of the API that was used.
* Depending on how large your corpus of text data is, Google might charge you. However, for up to 5,000 units, the different variants of sentiment and syntax analysis are free. Check [this overview](https://cloud.google.com/natural-language/pricing) by Google to learn about prices for larger volumes (a small sketch for estimating the number of units for our example data is shown below). Generally, also consider that if you are pursuing a CSS-related project in which the GCP products would come in useful, you can apply for Google Cloud research credits (see [here](https://edu.google.com/programs/credits/research/?modal_active=none)).
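A minimal sketch of how one might estimate the number of billable units for the example data above, under the assumption stated in Google's pricing documentation that each document counts as at least one unit and every additional 1,000 characters as a further unit (please double-check the current pricing rules yourself):
```{r google-nlp-units, eval=FALSE}
# Rough estimate of billable units for the 20 Guardian articles in df:
# each document is at least one unit; longer documents count as one unit
# per started block of 1,000 characters (assumption based on Google's
# pricing documentation)
units_per_doc <- pmax(1, ceiling(nchar(df$text) / 1000))
sum(units_per_doc)
```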