Semisupervised LDA for theory-driven text analysis

NOTICE: This R package has been renamed from quanteda.seededlda to seededlda for CRAN submission.

seededlda is an R package that implements seeded-LDA for semisupervised topic modeling using quanteda. The seeded-LDA model was proposed by Lu et al. (2010). Until version 0.3, the package was a simple wrapper around the topicmodels package, but the LDA estimator has been newly implemented in C++ using the GibbsLDA++ library for submission to CRAN in August 2020. The author believes that this implementation follows the original proposal of the seeded-LDA model more closely.

Please see Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches for an overview of semisupervised topic classification techniques and their advantages in social science research.

keyATM is the latest addition to the semisupervised topic models. Users of seeded-LDA are also encouraged to try that package.

Install

install.packages("devtools")
devtools::install_github("koheiw/seededlda")

Example

The corpus and seed words in this example are from Conspiracist propaganda: How Russia promotes anti-establishment sentiment online?.

library(quanteda)
library(seededlda) # changed from quanteda.seededlda to seededlda

Users of seeded-LDA have to construct a small dictionary of keywords (seed words) to define the desired topics.

dict <- dictionary(file = "tests/data/topics.yml")
print(dict)
## Dictionary object with 5 key entries.
## - [economy]:
##   - market*, money, bank*, stock*, bond*, industry, company, shop*
## - [politics]:
##   - parliament*, congress*, white house, party leader*, party member*, voter*, lawmaker*, politician*
## - [society]:
##   - police, prison*, school*, hospital*
## - [diplomacy]:
##   - ambassador*, diplomat*, embassy, treaty
## - [military]:
##   - military, soldier*, terrorist*, air force, marine, navy, army
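
A seed dictionary does not have to come from a YAML file. As a minimal sketch, the snippet below builds part of the dictionary printed above directly in R with quanteda's `dictionary()` function. The object name `dict_inline` is hypothetical, and only two of the five topics are reproduced to keep the example short.

```r
library(quanteda)

# Illustrative only: two of the five topics from the dictionary printed above.
# Each list element is one topic; its values are the seed words (glob patterns).
dict_inline <- dictionary(list(
    economy = c("market*", "money", "bank*", "stock*", "bond*",
                "industry", "company", "shop*"),
    diplomacy = c("ambassador*", "diplomat*", "embassy", "treaty")
))
print(dict_inline)
```

A dictionary built this way can be passed to `tokens_compound()` and `textmodel_seededlda()` in the same way as one loaded from a file.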

corp <- readRDS("tests/data/data_corpus_sputnik.RDS")
toks <- tokens(corp) %>%
        tokens_select("^[A-Za-z]+$", valuetype = "regex", min_nchar = 2) %>% 
        tokens_compound(dict) # multi-word expressions
dfmt <- dfm(toks) %>% 
    dfm_remove(stopwords('en')) %>% 
    dfm_trim(min_termfreq = 0.90, termfreq_type = "quantile", 
             max_docfreq = 0.2, docfreq_type = "prop")

Many of the top terms of the seeded-LDA are seed words, but other topic words are also identified.

set.seed(1234)
slda <- textmodel_seededlda(dfmt, dict, residual = TRUE)
print(terms(slda, 20))
##       economy    politics        society         diplomacy    military    
##  [1,] "company"  "parliament"    "police"        "embassy"    "army"      
##  [2,] "money"    "congress"      "school"        "diplomatic" "terrorist" 
##  [3,] "market"   "white_house"   "hospital"      "ambassador" "navy"      
##  [4,] "bank"     "politicians"   "prison"        "treaty"     "terrorists"
##  [5,] "industry" "parliamentary" "media"         "diplomat"   "air_force" 
##  [6,] "banks"    "lawmakers"     "reported"      "diplomats"  "soldiers"  
##  [7,] "markets"  "voters"        "information"   "north"      "marine"    
##  [8,] "banking"  "lawmaker"      "local"         "trump"      "syria"     
##  [9,] "china"    "politician"    "video"         "nuclear"    "defense"   
## [10,] "chinese"  "uk"            "women"         "korea"      "syrian"    
## [11,] "percent"  "european"      "found"         "south"      "forces"    
## [12,] "year"     "minister"      "public"        "sanctions"  "weapons"   
## [13,] "economic" "eu"            "investigation" "iran"       "nato"      
## [14,] "trade"    "party"         "news"          "korean"     "israel"    
## [15,] "oil"      "political"     "court"         "foreign"    "daesh"     
## [16,] "project"  "prime"         "report"        "security"   "turkish"   
## [17,] "billion"  "german"        "group"         "meeting"    "turkey"    
## [18,] "india"    "germany"       "department"    "relations"  "air"       
## [19,] "million"  "british"       "children"      "donald"     "iraq"      
## [20,] "system"   "world"         "man"           "moscow"     "saudi"     
##       other   
##  [1,] "like"  
##  [2,] "now"   
##  [3,] "just"  
##  [4,] "even"  
##  [5,] "think" 
##  [6,] "trump" 
##  [7,] "way"   
##  [8,] "going" 
##  [9,] "many"  
## [10,] "years" 
## [11,] "say"   
## [12,] "want"  
## [13,] "really"
## [14,] "back"  
## [15,] "made"  
## [16,] "get"   
## [17,] "world" 
## [18,] "come"  
## [19,] "need"  
## [20,] "much"

topic <- table(topics(slda))
print(topic)
## 
## diplomacy   economy  military     other  politics   society 
##       140       137       145       164       166       248
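
The raw counts are easier to compare as shares. The sketch below uses base R only and copies the counts from the table printed above into a named vector (the names `topic_counts` and `topic_share` are hypothetical); with a fitted model, `prop.table(table(topics(slda)))` should give the same proportions directly.

```r
# Document counts per topic, copied from the table printed above.
topic_counts <- c(diplomacy = 140, economy = 137, military = 145,
                  other = 164, politics = 166, society = 248)

# Convert counts to proportions of all classified documents.
topic_share <- round(prop.table(topic_counts), 3)

# Show the topics from most to least frequent.
print(sort(topic_share, decreasing = TRUE))
```

The counts sum to 1,000 documents, so society is the largest topic with roughly a quarter of the documents, while the residual "other" topic accounts for about 16 percent.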
