File spec
・Balancing data.ipynb: code to balance the data by class. It generates 5 training files, each with the same percentage of toxic and non-toxic comments. train1, train2, train3, train4 and train5 contain the same toxic comments but different non-toxic comments.
・FeatureSelectionbyInformationGain.ipynb: code to calculate the frequency of words and information gain
・TextPreprocessing.ipynb: code to pre-process data
・RandomForest_validation_all(biased).ipynb: code to run a basic random forest on bag-of-words features and check the confusion matrix and ROC curve.
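The balancing step described for Balancing data.ipynb could look roughly like the sketch below. It is not the notebook's actual code: the function name `make_balanced_splits`, the `(text, is_toxic)` row format, and the choice to give each file a disjoint, equally sized chunk of non-toxic comments are all assumptions made for illustration.

```python
import random

def make_balanced_splits(rows, n_splits=5, seed=0):
    """Build n_splits training sets from (text, is_toxic) rows.

    Every set contains all toxic rows plus its own chunk of non-toxic rows,
    so the toxic percentage is identical across sets.
    (Hypothetical helper; disjoint chunks are an assumption.)
    """
    toxic = [r for r in rows if r[1]]
    non_toxic = [r for r in rows if not r[1]]

    rng = random.Random(seed)
    rng.shuffle(non_toxic)

    # Equal chunk sizes keep the class ratio the same in every file;
    # leftover non-toxic rows (if any) are simply dropped.
    chunk = len(non_toxic) // n_splits

    splits = []
    for i in range(n_splits):
        part = non_toxic[i * chunk:(i + 1) * chunk]
        split = toxic + part
        rng.shuffle(split)  # avoid having all toxic rows first
        splits.append(split)
    return splits

# Toy data: 4 toxic and 20 non-toxic comments.
rows = [("t%d" % i, True) for i in range(4)] + \
       [("n%d" % i, False) for i in range(20)]
splits = make_balanced_splits(rows)
```

With the toy data above, each of the 5 sets holds the same 4 toxic rows and 4 non-toxic rows that appear in no other set.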
Run these files in this order:
- Data Balancing
- Data Cleaning
- Models
- Submit to Kaggle
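For reference, the information gain mentioned for FeatureSelectionbyInformationGain.ipynb can be sketched as below. This is a standard textbook formulation (class entropy minus the entropy conditioned on whether a word is present), not necessarily what the notebook implements; the function names and the `(set_of_words, is_toxic)` document format are assumptions.

```python
import math

def entropy(pos, neg):
    """Binary entropy in bits for a group with pos/neg label counts."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:  # skip empty classes, since 0 * log(0) is taken as 0
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(docs, word):
    """docs: list of (set_of_words, is_toxic). Hypothetical helper."""
    with_w = [label for words, label in docs if word in words]
    without_w = [label for words, label in docs if word not in words]

    def h(labels):
        pos = sum(labels)  # True counts as 1
        return entropy(pos, len(labels) - pos)

    n = len(docs)
    # Entropy of the class after splitting on presence of the word.
    conditional = (len(with_w) / n) * h(with_w) \
                + (len(without_w) / n) * h(without_w)
    return h(with_w + without_w) - conditional

# Toy corpus: "bad" separates the classes perfectly, "word" not at all.
docs = [
    ({"bad", "word"}, True),
    ({"bad"}, True),
    ({"nice"}, False),
    ({"nice", "word"}, False),
]
ig_bad = information_gain(docs, "bad")
ig_word = information_gain(docs, "word")
```

Words with higher information gain are more useful for telling toxic from non-toxic comments, so a feature selection step would typically keep the top-scoring words.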