-
Notifications
You must be signed in to change notification settings - Fork 14
/
README.poligrounder
87 lines (70 loc) · 4.46 KB
/
README.poligrounder
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
See `README.textgrounder` for an introduction to TextGrounder and to the
Poligrounder subproject. This file describes how exactly to run the
applications in Poligrounder.
=====================================
Running FindPolitical in Poligrounder
=====================================
DOCUMENT ME.
=======================================
Running PoligrounderApp in Poligrounder
=======================================
PoligrounderApp compares tweet corpora for two different time periods,
looking for differences in the distribution of individual words or n-grams.
Unlike ParseTweets and FindPolitical, this application does not currently
use Scoobi or support running under Hadoop. In practice, this has not yet
proved to be a major issue in terms of speed. However, it may require
running the application on a machine with a large amount of memory (e.g.
possibly 20 GB or more, depending on the size of the input corpus and the
length of the time periods involved, since the entire distribution of
words or n-grams in the two periods in question must currently be read
into memory).
The two time periods are specified using '--from' and '--to' (or '-f' and
'-t'). Each time period is specified in the form "TIME/OFFSET", where
TIME specifies an individual point of time and OFFSET an offset to
another point of time. (This is similar to specifying a starting time
and a length, except that the offset can be negative, which allows
specifying an interval by its ending time.)
Points of time are specified by the general format YYYYMMDDhhmmsszzz
(year - month - day - hour - minute - second - time zone), where any of the
lowercase portions can be omitted and colons can optionally be inserted for
legibility. For example, "2012061510EDT" specifies June 15, 2012, 10:00AM,
Eastern Daylight Time; equivalent times are "20120615:10:00EDT",
"2012:06:15:10:00:00:EDT", "20120615:10amEDT", etc., as well as "2012051510"
if the current time zone is Eastern Daylight.
Offsets can be given using the abbreviations "s" = second, "m" = minute,
"h" = hour, "d" = day, "w" = week, e.g. "5m2s" for "5 minutes, 2 seconds"
or "3w2d12h" for "3 weeks, 2 days, 12 hours". Negative values are
possible, e.g. "5h-2m" for "5 hours, less 2 minutes" (i.e. 4 hours,
58 minutes). Overall negative offsets, e.g. "-6h", lead to intervals
backwards in time from the starting point.
PoligrounderApp can operate in two modes:
* 2-way (or "combined") mode simply compares the two time periods directly,
looking for words whose distributions differ most significantly across the
periods, using the log-likelihood statistic. The words with the highest
log-likelihood values are output.
* 4-way (or "ideo-users") mode additionally divides the tweets into separate
categories according to the ideology of the users tweeting them --
liberal, conservative or centrist -- and throws away the centrist tweets,
leading to two axes of comparison (before vs. after and liberal vs.
conservative), for a total of four subcorpora to compare. It then does a
4-way log-likelihood test, looking for words whose distribution differs
the most from what a simple assumption of independence between the two
axes would predict: i.e. words where the difference between before and
after is most correlated with the difference between liberal and
conservative.
4-way mode is chosen automatically when a list of "ideological users"
(Twitter users with associated ideologies on a scale from 0 to 1, where 0
means liberal and 1 conservative) is given using '--ideological-user-corpus'
(or '--iuc' for short). This should be in textdb format, and is typically
the output of a run of FindPolitical.
A sample command line is as follows:
$ textgrounder run opennlp.textgrounder.poligrounder.PoligrounderApp -i parsed-all-spritzer-immigration-jun-8-to-22 --iuc copy-in/out-all-spritzer-find-political-ideo-users --from '2012061200EDT/3d' --to '2012061510EDT/1d'
This operates in 4-way mode on the tweet corpus located in
'parsed-all-spritzer-immigration-jun-8-to-22', using the list of ideological
users in 'copy-in/out-all-spritzer-find-political-ideo-users', comparing two
time periods, a before period running 3 days starting June 12, 2012 at
midnight EDT, and an after period running exactly 1 day starting at 10 AM EDT
on June 15, 2012 (right around the initial press releases of Obama's imminent
announcement of an implementation of the "Dream Act", allowing for young
illegal immigrants who have been in this country since childhood to get
reprieves from deportation).