This is the code for predicting the geolocation of tweets, trained on token frequencies using Decision Tree and Naïve Bayes classifiers.
In `util/preprocessing/merge.py`,
- `feature_filter` shows it drops single-character features like `[a, b, ..., n]`.
- `merge` shows it intuitively merges similar features like `[aha, ahah, ..., ahahahaha]` and `[taco, tacos]`.
- `merge` also shows it uses Recursive Feature Elimination to rank features and select the best 300 features.
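Below is a minimal sketch of the single-character filtering and the RFE selection described above, assuming a pandas token-frequency matrix. It is not the repository's code, and the `DecisionTreeClassifier` used as the RFE estimator is an assumption; the README only states that RFE ranks features and keeps the best 300.

```python
# Minimal sketch, not the repository's exact code.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import RFE

def feature_filter(features: pd.DataFrame) -> pd.DataFrame:
    """Drop single-character token features such as 'a', 'b', ..., 'n'."""
    return features[[col for col in features.columns if len(col) > 1]]

def select_best_features(X: pd.DataFrame, y, n_features: int = 300):
    """Rank token features with Recursive Feature Elimination and keep the top 300.

    The estimator passed to RFE is an assumption made for illustration.
    """
    selector = RFE(DecisionTreeClassifier(random_state=42),
                   n_features_to_select=n_features)
    selector.fit(X, y)
    return X.columns[selector.support_]
```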
In `preprocess/merge.py`,
- `result_combination` shows it uses the results of Decision Tree, Random Forest, Bernoulli Naïve Bayes, Complement Naïve Bayes and Multinomial Naïve Bayes to vote for the majority prediction; note that the trained models can differ slightly between training runs.
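A minimal sketch of such majority voting, assuming each model's predictions are already available as arrays of labels (the function body and variable names are illustrative, not the repository's implementation):

```python
import numpy as np
from collections import Counter

def result_combination(predictions: list) -> np.ndarray:
    """Return the per-sample majority vote across several models' predictions."""
    stacked = np.vstack(predictions)  # shape: (n_models, n_samples)
    return np.array([Counter(column).most_common(1)[0][0]
                     for column in stacked.T])

# e.g. result_combination([dt_pred, rf_pred, bnb_pred, cnb_pred, mnb_pred])
```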
In `util/train.py`,
- `complement_nb` shows it uses bagging, and also 42-fold cross-validation, to generate multiple training datasets.
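The sketch below illustrates these two resampling ideas. Only the 42 folds come from the description above; the number of bags, the shuffling, and the random seeds are assumptions for illustration.

```python
from sklearn.utils import resample
from sklearn.model_selection import KFold

def bagging_datasets(X, y, n_bags: int = 5):
    """Draw one bootstrap sample (with replacement) of the training data per bag."""
    return [resample(X, y, replace=True, random_state=seed) for seed in range(n_bags)]

def cv_splits(X, n_splits: int = 42):
    """Produce 42 train/validation index splits via 42-fold cross-validation."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    return list(kf.split(X))
```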
In `util/train.py`,
- `complement_nb` also shows it uses GridSearchCV to generate multiple classifiers and select the best one based on accuracy.
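A minimal sketch of that grid search; the parameter grid and the number of CV folds are assumptions, not the repository's settings.

```python
from sklearn.naive_bayes import ComplementNB
from sklearn.model_selection import GridSearchCV

def complement_nb_search(X, y):
    """Fit several ComplementNB candidates and return the most accurate one."""
    search = GridSearchCV(ComplementNB(),
                          param_grid={"alpha": [0.1, 0.5, 1.0, 2.0]},  # assumed grid
                          scoring="accuracy",
                          cv=5)
    search.fit(X, y)
    return search.best_estimator_
```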
- See Eisenstein, Jacob, et al.
Although some files in `datasets` are larger than 50.00 MB, they were still added to
`datasets` for convenience. (See http://git.io/iEPt8g )
- python3+
pip install -r requirements.txt
Note: The code removes the old models and results on every run. MAKE SURE you have saved the models you want to keep.
python run.py -t datasets/train-best200.csv datasets/dev-best200.csv

The output would look like:
INFO:root:[*] Merging datasets/train-best200.csv
42%|████████ | 1006/2396 [00:05<00:20, 92.03 users/s]
...
...
[*] Saved models/0.8126_2019-10-02_20:02
[*] Accuracy: 0.8125955095803455
              precision    recall   f_score
California     0.618944  0.835128  0.710966
NewYork        0.899371  0.854647  0.876439
Georgia        0.788070  0.622080  0.695305
weighted       0.827448  0.812596  0.814974
python run.py -p models/ datasets/dev-best200.csv

The output would look like:
...
INFO:root:[*] Saved results/final_results.csv
INFO:root:[*] Time costs in seconds:
PredictTime_cost 11.98s
python run.py -s results/final_results.csv datasets/dev-best200.csv

The output would look like:
[*] Accuracy: 0.8224697308099213
              precision    recall   f_score
California     0.653035  0.852199  0.739441
NewYork        0.747993  0.647940  0.694381
Georgia        0.909456  0.858296  0.883136
weighted       0.833854  0.822470  0.824577
INFO:root:[*] Time costs in seconds:
ScoreTime_cost 1.48s
python run.py \
-t datasets/train-best200.csv datasets/dev-best200.csv \
-p models/ datasets/dev-best200.csv \
-s results/final_results.csv datasets/dev-best200.csv

For the full list of options, run:

python run.py -h

- sklearn for easily using Complement Naive Bayes, some feature selectors and other learning tools.
- pandas, numpy for easily handling data.
- tqdm for showing the progress of loops.
- joblib for dumping/loading objects to/from disk.
- nltk for capturing word types for the purpose of feature filtering.
See LICENSE file.