To run LDA using Mallet:
1. To generate the raw input for Mallet:
For the full LDA over all spans:
./generate-lda --input wotr-spans.jun-10-616pm --output wotr.jun-10-616pm.combined.input
For the LDA over "colored" spans:
./filter-colored-generate-lda --input wotr-spans.jun-10-616pm --output wotr.jun-10-616pm.colored.combined.input
Note that this uses the directory of predicted spans produced by the steps
in GENERATE-ARTICLE-SPANS. It's also possible to use a TextDB as input by
specifying the TextDB data file using '--input' and adding the argument
'--textdb'.
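As a sketch of the TextDB variant just described: the data file name
'wotr-spans.data.txt' below is a placeholder, not a file from this project,
and the command is printed rather than run so it can be checked first.

```shell
# Placeholder TextDB data file (hypothetical name); substitute the real one.
textdb_file=wotr-spans.data.txt
output=wotr.jun-10-616pm.combined.input

# Same as the directory-based command, with '--textdb' added.
cmd="./generate-lda --input $textdb_file --textdb --output $output"
echo "$cmd"
```

Once the printed command looks right, it can be pasted into the shell as is.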
2. Then "cook" the input into a format Mallet understands:
E.g. for the full LDA over all spans:
~/devel/mallet-2.0.7/bin/mallet import-file --input wotr.jun-10-616pm.combined.input --output wotr.jun-10-616pm.mallet --keep-sequence --remove-stopwords
3. To run Mallet:
E.g. for the full LDA over all spans, 40 topics:
./run-mallet-lda --input wotr.jun-10-616pm.mallet --num-topics 40
This outputs various files named, in this case, 'wotr.jun-10-616pm.mallet.40.*',
e.g. 'wotr.jun-10-616pm.mallet.40.topic-keys' listing the top words in each
topic, and 'wotr.jun-10-616pm.mallet.40.doc-topics' listing the topic
proportions for each document.
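To skim the topics quickly, something like the following sketch may help. It
assumes the '.topic-keys' file uses Mallet's usual layout of one topic per
line (topic number, a weight, then the top words, separated by tabs); whether
'run-mallet-lda' writes exactly this layout is an assumption. The helper name
'top_words' and the sample data are made up for illustration.

```shell
# Print the first N top words of each topic.
# Assumed line layout: topic<TAB>weight<TAB>space-separated top words.
top_words() {  # usage: top_words FILE N
  awk -F'\t' -v n="$2" '{
    split($3, w, " ")                      # break the word list into w[1..]
    line = "topic " $1 ":"
    for (i = 1; i <= n && (i in w); i++)   # stop early if fewer than N words
      line = line " " w[i]
    print line
  }' "$1"
}

# Made-up sample in the assumed layout, not real output from this corpus.
sample=$(mktemp)
printf '0\t1.25\tarmy general corps\n1\t1.25\triver bridge road\n' > "$sample"
top_words "$sample" 2
```

On a real run, the call would be
'top_words wotr.jun-10-616pm.mallet.40.topic-keys 5' or similar.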
4. To generate region-specific topics for predicted locations of all spans:
./split-spans-geographically --spans ~cowr/volspans-predicted-wotr-jun-10-616pm-predicted-deg1/volspans-predicted-training.data.txt.bz2 --geojson CivilWar_AnalysisRegions.geojson --output topic-props.volspans-predicted-wotr-jun-10-616pm-predicted-deg1.tex --doc-topics wotr.jun-10-616pm.mallet.40.doc-topics --lda --topic-keys wotr.jun-10-616pm.mallet.40.topic-keys --latex --ntop 4 --nreg 6
This saves the resulting top 4 topics for each region, for the top 6 regions
by number of spans, into the file
'topic-props.volspans-predicted-wotr-jun-10-616pm-predicted-deg1.tex'.
The results are saved in LaTeX format; omit '--latex' to get a rawer,
human-readable format instead. The input is assumed to be a TextDB
in which the spans have predicted locations, as generated by the instructions
in GENERATE-SPAN-PREDICTIONS.
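A sketch of the same invocation without '--latex' follows. The output name
ending in '.txt' is a hypothetical choice made here, and the command is
printed rather than run so it can be reviewed before use.

```shell
lda=wotr.jun-10-616pm.mallet.40
stem=volspans-predicted-wotr-jun-10-616pm-predicted-deg1

# Hypothetical output name; the '.txt' suffix is only a convention chosen here.
out=topic-props.$stem.txt

# Same arguments as the LaTeX example above, minus '--latex'.
cmd="./split-spans-geographically --spans ~cowr/$stem/volspans-predicted-training.data.txt.bz2"
cmd="$cmd --geojson CivilWar_AnalysisRegions.geojson --output $out"
cmd="$cmd --doc-topics $lda.doc-topics --lda --topic-keys $lda.topic-keys --ntop 4 --nreg 6"
echo "$cmd"
```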