-
Notifications
You must be signed in to change notification settings - Fork 1
/
PREPROCESSING
76 lines (47 loc) · 2.51 KB
/
PREPROCESSING
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
Process to generate War of the Rebellion files:
1. Text comes in the form of separate HTML pages, e.g. 001/0100 - 001/0199,
in the following directory:
~/devel/war-of-the-rebellion/ehistory.osu.edu/books/official-records
2. Run process-wotr-html-downloaded on each directory:
for x in 0*; do if [ -d "$x" ]; then ~/devel/GeoAnnotate/process-wotr-html-downloaded $x; fi done
This generates e.g. 001.html and 001.txt.
(Alternatively, use process-wotr-html-combined on the HTML files.)
3. Copy each *.txt file to *.txt.orig:
for x in *.txt; do cp $x $x.orig; done
4. Go through each *.txt file, removing trailing spaces:
perl -pi -e 's/ +$//' *.txt
5. Go through each *.txt file by hand, looking for occurrences of MAP and
removing them, fixing up the paragraphs as necessary. (FIXME, this should
probably be done automatically.)
6. Also in each *.txt file, delete cases where you have "Page ###" at a
page break; mostly happens in volume 004.
7. Also check for and fix up extraneous text breaks, e.g. in volume 019, 021;
look for "CHAP.". Text of this sort at the beginning of a page break will
automatically be handled.
8. Other fixups:
gr 'PAGEs ' 'pages ' *.txt
gr 'PAGE([a-z])' 'page $1' *.txt
9. Run get-description on each HTML file:
~/devel/GeoAnnotate/get-description *.html
This generates *.desc files.
(At this point, I removed newlines from the end of each *.desc file using
perl, and edited some of the descriptions by hand, but this entire
step doesn't need to be repeated.)
10. Run fix-page-breaks on each *.txt file:
~/devel/GeoAnnotate/fix-page-breaks *.txt
This generates *.txt.joined-pagebreak files.
11. Copy each *.txt.joined-pagebreak file to *.txt.joined-pagebreak.orig.
12. Go through and delete footnotes and extraneous R- lines in each
*.txt.joined-pagebreak file, cleaning up the surrounding text, grepping
as follows:
grep 'R *-' *.joined-pagebreak |m
grep -C 3 '[-]-----------' *.txt.joined-pagebreak |m
13. Copy each *.txt.joined-pagebreak file to
*.txt.joined-pagebreak.remove-preface. Edit the latter files and remove
the preface from each one, generally up to where it says "CORRESPONDENCE"
or "REPORTS". Beware of the fact that sometimes there is a letter or a
few letters (sometimes partial) stuck at the beginning before the preface.
14. Copy each *.txt.joined-pagebreak.remove-preface file to *.txt.joined.
15. Remove the pagebreaks as follows:
gr '\nPAGEBREAK.*\n' '' *.txt.joined
gr '\Z' '\n' *.txt.joined