---
output: github_document
---
```{r, echo=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##",
  fig.path = "man/images/"
)
```
# Semisupervised LDA for theory-driven text analysis
**NOTICE:** This R package has been renamed from **quanteda.seededlda** to **seededlda** for CRAN submission.
**seededlda** is an R package that implements seeded-LDA for semisupervised topic modeling using **quanteda**. The seeded-LDA model was proposed by [Lu et al. (2010)](https://dl.acm.org/citation.cfm?id=2119585). Up to version 0.3, the package was a simple wrapper around the **topicmodels** package, but the LDA estimator has since been reimplemented in C++ using the [GibbsLDA++](http://gibbslda.sourceforge.net/) library so that the package can be submitted to CRAN in August 202. The author believes that this implementation follows the original proposal of the seeded-LDA model more closely.
Please see [*Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches*](https://journals.sagepub.com/doi/full/10.1177/0894439320907027) for an overview of semisupervised topic classification techniques and their advantages in social science research.
[**keyATM**](https://github.com/keyATM/keyATM) is the latest addition to the family of semisupervised topic models. Users of seeded-LDA are encouraged to try that package as well.
## Install
```{r eval=FALSE}
install.packages("devtools")
devtools::install_github("koheiw/seededlda")
```
## Example
The corpus and seed words in this example are from [*Conspiracist propaganda: How Russia promotes anti-establishment sentiment online?*](https://koheiw.net/wp-content/uploads/2019/06/Sputnik-05-ECPR.pdf).
```{r message=FALSE}
require(quanteda)
require(seededlda) # changed from quanteda.seededlda to seededlda
```
Users of seeded-LDA have to construct a small dictionary of keywords (seed words) to define the desired topics.
```{r}
dict <- dictionary(file = "tests/data/topics.yml")
print(dict)
```
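A dictionary can also be defined inline instead of being read from a YAML file. The following is a minimal sketch: the topic names and seed words are hypothetical and chosen only for illustration, not taken from `topics.yml`.

```r
library(quanteda)

# Hypothetical topics and seed words, for illustration only;
# replace them with keywords that define the topics of interest.
dict_inline <- dictionary(list(
    economy = c("market*", "money", "bank*", "stock*"),
    politics = c("parliament*", "congress*", "white house", "party leader*")
))
print(dict_inline)
```

Each key becomes one seeded topic. Values are matched as glob patterns by default, so `bank*` covers "bank", "banks" and "banking", and multi-word values such as "white house" can be joined into single tokens with `tokens_compound()` before the document-feature matrix is built.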
```{r}
corp <- readRDS("tests/data/data_corpus_sputnik.RDS")
toks <- tokens(corp) %>%
    tokens_select("^[A-Za-z]+$", valuetype = "regex", min_nchar = 2) %>%
    tokens_compound(dict) # multi-word expressions
dfmt <- dfm(toks) %>%
    dfm_remove(stopwords('en')) %>%
    dfm_trim(min_termfreq = 0.90, termfreq_type = "quantile",
             max_docfreq = 0.2, docfreq_type = "prop")
```
Many of the top terms of the seeded-LDA topics are seed words, but other topic-related words are also identified.
```{r}
set.seed(1234)
slda <- textmodel_seededlda(dfmt, dict, residual = TRUE)
print(terms(slda, 20))
```
```{r}
topic <- table(topics(slda))
print(topic)
```