-
Notifications
You must be signed in to change notification settings - Fork 7
Expand file tree
/
Copy pathtask-assignment.txt
More file actions
198 lines (162 loc) · 6.53 KB
/
task-assignment.txt
File metadata and controls
198 lines (162 loc) · 6.53 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
update 28.04.2014:
IMPORTANT: everyone is required to document theoretical aspects of their tasks
* Data collection
* Twitter
* Web archive filtering of keywords and region detection to see
if there's anything of value there
[Gabriel, Ching-Chia (Indonesia keywords)]
-- ready
* distribution of tasks to azure nodes
[Gabriel, Joseph (use/adapt for user id download)]
-- ready
* write simple script for inserting tweets into cassandra
[Gabriel]
-- ready
* setup Spark, Shark, Cassandra cluster
[Gabriel]
-- ready
* db schema for cassandra
[Gabriel, Anton, Fabian]
-- ready
* Get user ids of twitter users from India with many tweets
[Joseph, Stefan&Gabriel helps]
-- >100k users ready, still runnin
* Download tweets from previously fetched users
[Joseph]
-- > approx. 180M tweets downloaded
* storing features of filtered tweets in Cassandra
[Gabriel, Anton, Fabian]
-- ready
* storing tweets with MongoDB
[Duy, contact Joseph and Gabriel]
* Crawl daily and weekly price data and stocks; sources in data/india/sources.txt
[Anton, request help where needed]
* additional data (India+Indonesia): food production, food stocks, weather, Consumer Price Index (CPI), currency exchange rate (w.r.t $), crude oil price
[Alex, Joseph, Duy]
IMPORTANT:
- Joseph scrape Indian weather data history from TuTiempo.net, data is located in html tables (Duy helps)
- Alex: analyze food production against prices before 2011
* Twitter analysis
- Categories for Tweets
[Anton, Gabriel, Fabian]
- Normalize tweet indicators and write template query
[Gabriel]
* Association Rule Mining
- implement with the orange package
- price analysis
[Alex]
done
* Data processing
- All data sources must be tabular, csv with the columns discussed on
github
[Anton]
- Provide a data source in wide format and with categorial values
[Alex]
done
* Compute statistics and analyze the data to have more information on
which models can work
[Ching-Chia, Stefan]
done
* Data Cleaning and interpolating the missing values in each (R,C,P,SP) time-series
[Ching-Chia]
done
* Run PCA (or aggregate with averages) the (R,C,P,SP) time-series into
(R,P) time-series
discarded
[Ching-Chia, Stefan helps]
* Prediction model
* Time Series Analysis (stationarity, autocorrelations, seasonal patterns)
[Ching-Chia]
* Choosing input params (how to relate series):
Price transmission analysis, Association rule learning
[Alex, Fabian]
IMPORTANT: we need to do more work on this part
* Regression model with ARMA error shock
[Ching-Chia]
* FF neural networks
[Fabian, Stefan]
-- use pybrain to implement, train and test the model at
models/ffnn/model?.txt
-- high priority
[Fabian, Stefan]
* Echo-state neural networks
-- implemented
-- train and test the model with and without twitter indicators
-- add weather data and oil price as additional input
-- teacher force network on commodity price time series
-- high priority
[Fabian, Stefan]
* Support Vector Regression
-- train and test the model at models/svr/model1.txt
-- very low priority
* Maps, visualizations, displaying the data and results
[Duy]
-- All mapping and charts
-- Data reprocessing into consistent format from raw data in database
* Organizing price and Tweet indicator dataset (Shark output) into consistent formats for visualization
[Ching-Chia]
* Documentation:
* Data collection & processing
-- Time Series data
- Data types
[Fabian, Anton, Alex]
- Collection (short)
[Anton]
- Choice of commodities
[Ching-Chia, Fabian]
- Data Preprocessing
[Ching-Chia, Stefan]
-- Twitter
- Issue of localization - mapping tweets to user location
[Gabriel]
- Historical Tweets
Approach 1 - through API by user (extensive)
[Joseph]
Approach 2 - through webarchive
[Gabriel]
Filtering
[Anton, Gabriel]
Categorization, basic NLP (stemming, spellchecking)
[Anton]
- Infrastructure, Cassandra
[Gabriel]
* Data analysis and prediction
* Analysis
-- Time Series analysis
[Ching-Chia]
-- Association rules
[Alex]
* Prediction Models
-- Time Series Forecasting
[Ching-Chia]
-- (FeedForward NN)
[Stefan, Fabian]
-- Recurrent NN --> Echo State NN
[Fabian, Stefan]
Input
Training (Teacher forcing), Testing, Evaluation
* Visualization
[Duy]
* Observations
[Duy, Fabian]
* Presentation
[Joseph]
* Schema of Prediction Model (e.g. Echo State Network)
* Demo of Map
[Ching-Chia]
* Data-cleaning slide
[Duy]
* Video demo for visualization
Older assignments:
============================================================
assignments 25th of march --> milestone 6th april
- Stefan: Twitter Requests, Recurrent Neural Networks Models and Methods, "how to" incorporate social media indicators in model
- Fabian: same + finish setting up git repo
- Ching-Chia: Time Series Analysis Models and Methods
- Gabriel: Feature extraction from tweets and other textual content, Clustering of these features using Spark's MLlib
- Joseph: In charge of tweet aggregator (architecture), Scrapy
- Duy: storing tweets in database (NoSQL)
- Anton: @me send him pdfs, further check out sentiment packages, help Gabriel on feature extraction (on lexical level of NLP)
- Alexander: Research package for association rule learning
- Julien: Regression Models and Methods (@me send pdf), Inquire about Microsoft computing resources, set up environment in case of success
============================================================