---
output: github_document
---

```{r, echo=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##",
  fig.path = "man/images/"
)
```

# Semisupervised LDA for theory-driven text analysis

**NOTICE:** This R package has been renamed from **quanteda.seededlda** to **seededlda** for CRAN submission.

**seededlda** is an R package that implements seeded-LDA for semisupervised topic modeling using **quanteda**. The seeded-LDA model was proposed by [Lu et al. (2010)](https://dl.acm.org/citation.cfm?id=2119585). Until version 0.3, the package was a simple wrapper around the **topicmodels** package, but the LDA estimator has been newly implemented in C++ using the [GibbsLDA++](http://gibbslda.sourceforge.net/) library for submission to CRAN in August 202. The author believes that this implementation follows the original proposal of the seeded-LDA model more closely.

Please see [*Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches*](https://journals.sagepub.com/doi/full/10.1177/0894439320907027) for an overview of semisupervised topic classification techniques and their advantages in social science research.

[**keyATM**](https://github.com/keyATM/keyATM) is the latest addition to the family of semisupervised topic models. Users of seeded-LDA are also encouraged to try that package.

## Install

```{r eval=FALSE}
install.packages("devtools")
devtools::install_github("koheiw/seededlda") 
```

## Example

The corpus and seed words in this example are from [*Conspiracist propaganda: How Russia promotes anti-establishment sentiment online?*](https://koheiw.net/wp-content/uploads/2019/06/Sputnik-05-ECPR.pdf).

```{r message=FALSE}
require(quanteda)
require(seededlda) # changed from quanteda.seededlda to seededlda
```

Users of seeded-LDA have to construct a small dictionary of keywords (seed words) to define the desired topics.

```{r}
dict <- dictionary(file = "tests/data/topics.yml")
print(dict)
```
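A dictionary can also be built inline from a named list instead of a YAML file. The sketch below is only an illustration: the topic names and seed words are hypothetical and are not the contents of `tests/data/topics.yml`.

```{r eval=FALSE}
library(quanteda)

# Hypothetical topics and seed words, for illustration only.
# Glob patterns such as "market*" match any word starting with "market";
# "white house" is a multi-word seed.
dict_toy <- dictionary(list(
    economy = c("market*", "money", "bank*"),
    politics = c("parliament*", "congress*", "white house")
))
print(dict_toy)
```

Each top-level key becomes one seeded topic, so the number of keys determines how many seeded topics the model will have.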

```{r}
corp <- readRDS("tests/data/data_corpus_sputnik.RDS")
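# keep purely alphabetic tokens of two or more characters
toks <- tokens(corp) %>%
    tokens_select("^[A-Za-z]+$", valuetype = "regex", min_nchar = 2) %>%
    tokens_compound(dict) # join multi-word seed words into single tokens
# drop stopwords, keep terms at or above the 90th percentile of term frequency,
# and drop terms that appear in more than 20% of documents
dfmt <- dfm(toks) %>%
    dfm_remove(stopwords("en")) %>%
    dfm_trim(min_termfreq = 0.90, termfreq_type = "quantile",
             max_docfreq = 0.2, docfreq_type = "prop")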
```
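Compounding with the dictionary matters when seed words span several tokens: a multi-word seed can only become a single feature in the document-feature matrix if its tokens are joined first. A minimal sketch of that step on a toy sentence, using a hypothetical one-topic dictionary:

```{r eval=FALSE}
library(quanteda)

# Hypothetical dictionary with one multi-word seed, for illustration only.
dict_toy <- dictionary(list(politics = c("white house", "congress*")))

toks_toy <- tokens("The White House and Congress disagreed on the budget.")
# Join the tokens of multi-word dictionary entries with "_"
toks_toy <- tokens_compound(toks_toy, dict_toy)
print(toks_toy)
```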

Many of the top terms in each seeded-LDA topic are seed words, but other topic-related words are also identified.

```{r}
set.seed(1234)
slda <- textmodel_seededlda(dfmt, dict, residual = TRUE)
print(terms(slda, 20))
```
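One way to see which top terms are seed words is to match each topic's terms against that topic's seed patterns. The sketch below uses base R only and a toy stand-in for the term matrix; it assumes that `terms()` returns a character matrix with one column per topic, and all terms and seeds shown are hypothetical.

```{r eval=FALSE}
# Toy stand-in for the matrix returned by terms(slda, n); values are hypothetical.
top_toy <- cbind(
    economy = c("market", "money", "oil"),
    politics = c("parliament", "election", "congress")
)
# Hypothetical seed words in glob style, one vector per topic.
seeds_toy <- list(
    economy = c("market*", "money"),
    politics = c("parliament*", "congress*")
)

# For each topic, flag the top terms that match any of its seed patterns.
is_seed <- sapply(names(seeds_toy), function(k) {
    rx <- paste(utils::glob2rx(seeds_toy[[k]]), collapse = "|")
    grepl(rx, top_toy[, k])
})
print(is_seed)
print(colSums(is_seed)) # number of seed words among the top terms of each topic
```

Terms flagged `FALSE` are the ones the model associated with a topic without being told to.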

```{r}
topic <- table(topics(slda))
print(topic)
```
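Raw counts can be turned into shares of the corpus with `prop.table()`. The sketch below uses a toy factor in place of `topics(slda)`, assuming that function returns one topic label per document; the labels are hypothetical.

```{r eval=FALSE}
# Toy stand-in for topics(slda): one (hypothetical) topic label per document.
topic_toy <- factor(
    c("economy", "politics", "economy", "other"),
    levels = c("economy", "politics", "other")
)

# Share of documents assigned to each topic
share <- prop.table(table(topic_toy))
print(round(share, 2))
```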