-
Notifications
You must be signed in to change notification settings - Fork 1
/
GENERATE-SPAN-PREDICTIONS
112 lines (88 loc) · 6.07 KB
/
GENERATE-SPAN-PREDICTIONS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
To generate span predictions:
First, you need the following:
1. Text of pages 100-200 of all volumes of War of the Rebellion (WOTR), e.g. in
'../wotr-text-100'. This should contain only the '*.txt.joined' files
(use symlinks if necessary).
2. Files containing manually annotated docgeo spans, e.g. in
'docgeo-spans-jun-10-616pm'.
3. Files containing predicted spans, including the text, e.g. in
'wotr-spans.jun-10-616pm' (see GENERATE-ARTICLE-SPANS).
The steps below produce the following output files/directories:
1. 'wotr-jun-10-616pm-100-0-0' or similar: Directory containing the manually
annotated docgeo spans in TextDB format, split with 100% in the training
set.
2. 'wotr-jun-10-616pm-60-20-20' or similar: Directory containing the manually
annotated docgeo spans in TextDB format, split 60/20/20 in
training/dev/test.
3. 'volspans-predicted-wotr-jun-10-616pm' or similar: Directory containing
the predicted WOTR spans in TextDB format.
4. 'volspans-predicted-wotr-jun-10-616pm-predicted-deg1' or similar (and
likewise for '...-deg1.5', '...-deg2', '...-deg2.5', '...-deg3.5',
'...-kd5', '...-kd10', '...-kd15', '...-kd20', '...-kd30', '...-kd40'):
Directory containing the predicted WOTR spans in TextDB format, with
predicted coordinates at various grid sizes using Naive Bayes.
Then:
1. Generate the manually annotated docgeo spans in TextDB format, both
60/20/20 split and 100/0/0 split.
~tgpy/wotr_to_corpus.py --spans docgeo-spans-jun-10-616pm --text ../wotr-text-100 -o ~co/wotr/wotr-jun-10-616pm-60-20-20/wotr-jun-10-616pm-60-20-20 --fractions 60:20:20
~tgpy/wotr_to_corpus.py --spans docgeo-spans-jun-10-616pm --text ../wotr-text-100 -o ~co/wotr/wotr-jun-10-616pm-100-0-0/wotr-jun-10-616pm-100-0-0 --fractions 100:0:0
2. Generate the predicted WOTR spans in TextDB format, 0/100/0 split (this
puts everything in the dev set so that we can geolocate it), with the
manually annotated docgeo spans as training data.
~tgpy/wotr_to_corpus.py --predicted-spans wotr-spans.jun-10-616pm -o ~co/wotr/volspans-predicted-wotr-jun-10-616pm/volspans-predicted --include-non-coord-paras --fractions 0:100:0
cp ~co/wotr/wotr-jun-10-616pm-100-0-0/*-training.* ~co/wotr/volspans-predicted-wotr-jun-10-616pm
rm ~co/wotr/volspans-predicted-wotr-jun-10-616pm/volspans-predicted-{training,test}*
bzip2 ~co/wotr/volspans-predicted-wotr-jun-10-616pm/volspans-predicted-dev.data.txt
3. Geolocate the predicted WOTR spans.
To do this locally:
mkdir ~tge/wotr/volspans-predicted-wotr-jun-10-616pm
cd ~tge/wotr/volspans-predicted-wotr-jun-10-616pm
for x in 1 1.5 2 2.5 3.5; do nohup-tg-geolocate wotr/volspans-predicted-wotr-jun-10-616pm --ranker nb --dpc $x; done
for x in 5 10 15 20 25 30 40; do nohup-tg-geolocate wotr/volspans-predicted-wotr-jun-10-616pm --ranker nb --kd-tree --kd-bucket-size $x; done
To do this on Maverick:
rsync -avz ~co/wotr/volspans-predicted-wotr-jun-10-616pm maverick:/work/01683/benwing/corpora/wotr
ssh maverick
mkdir ~tge/wotr/volspans-predicted-wotr-jun-10-616pm
cd ~tge/wotr/volspans-predicted-wotr-jun-10-616pm
Then create a file job.gpu.24hr.volspans-predicted-wotr-may-18-15pm.nb as follows:
---------------------------- cut -------------------------------
#!/bin/bash
#
#SBATCH -J benwing1 # Job name
#SBATCH -o benwing-%j.out # Name of stdout output file (%j expands to jobId)
#SBATCH -p gpu # Queue name
#SBATCH -N 1 # Total number of nodes requested (20 cores/node)
#SBATCH -n 20 # Total number of mpi tasks requested
#SBATCH -t 24:00:00 # Run time (hh:mm:ss) - 24 hours
# Run Naive Bayes or other simple ranking method (including possible
# IGR filtering) on combined WOTR May 18 715pm in-domain as training data,
# full WOTR as dev data. See the file '$WORK/exps/jobscripts/job.corpus.nb'
# for more documentation.
corpusdir=wotr
corpus=volspans-predicted-wotr-jun-10-616pm
igrdir=$WORK/exps/$corpusdir/$corpus/igr
jobscripts=$WORK/exps/jobscripts
. $jobscripts/job.corpus.nb
# vi:filetype=sh
---------------------------- cut -------------------------------
sbatch job.gpu.24hr.volspans-predicted-wotr-jun-10-616pm.nb uniform "1 1.5 2 2.5 3.5"
sbatch job.gpu.24hr.volspans-predicted-wotr-jun-10-616pm.nb kd "5 10 15 20 25 30 40"
Then wait for them to finish (in 20 minutes or so).
exit
mkdir ~tge/wotr/volspans-predicted-wotr-jun-10-616pm
cd ~tge/wotr/volspans-predicted-wotr-jun-10-616pm
rsync -avz maverick:/work/01683/benwing/exps/wotr/volspans-predicted-wotr-jun-10-616pm/\* .
4. Then combine the geolocated results back into the predicted spans.
cd ~tge/wotr/volspans-predicted-wotr-jun-10-616pm
for x in 1 1.50 2 2.50 3.50; do textgrounder run opennlp.textgrounder.preprocess.FrobTextDB --coords-from-results results.volspans-predicted-wotr-jun-10-616pm.nbayes.deg$x.2015*.data.txt -i ~cowr/volspans-predicted-wotr-jun-10-616pm -s dev --output-dir ~cowr/volspans-predicted-wotr-jun-10-616pm-predicted-deg$x &; done
for x in 1 1.50 2 2.50 3.50; do bzip2 ~cowr/volspans-predicted-wotr-jun-10-616pm-predicted-deg$x/*.data.txt &; done
for x in 5 10 15 20 25 30 40; do textgrounder run opennlp.textgrounder.preprocess.FrobTextDB --coords-from-results results.volspans-predicted-wotr-jun-10-616pm.nbayes.Kd.bucketsz$x.2015*.data.txt -i ~cowr/volspans-predicted-wotr-jun-10-616pm -s dev --output-dir ~cowr/volspans-predicted-wotr-jun-10-616pm-predicted-kd$x &; done
for x in 5 10 15 20 25 30 40; do bzip2 ~cowr/volspans-predicted-wotr-jun-10-616pm-predicted-kd$x/*.data.txt &; done
5. Then, move the dev data to become training data so that further work
can be done on it (e.g. creating KML graphs or heatmaps).
cd ~co/wotr
for x in 1 1.50 2 2.50 3.50; do cd volspans-predicted-wotr-jun-10-616pm-predicted-deg$x; mmv *-dev.* *-training.*; gr dev training *.schema.txt; cd ..; done
for x in 5 10 15 20 25 30 40; do cd volspans-predicted-wotr-jun-10-616pm-predicted-kd$x; mmv *-dev.* *-training.*; gr dev training *.schema.txt; cd ..; done
6. Finally, copy it back to local.
cd ~co/wotr
rsync -avz maverick:/work/01683/benwing/corpora/wotr/volspans-predicted-wotr-jun-10-616pm-predicted-\* .