In this competition, we need to detect toxic comments. We use NLP techniques and deep learning to build a classification model.
pip install -r requirement
We tried converting all sentences to lower case for the LSTM model, but keeping the original case scored higher on the LB than lower-casing.
Preprocessing was as follows:
- all URLs (http...) were replaced with url
- all emoji were replaced with ' '
- flashtext was used to find misspelled words and replace them with the correct words
- \n and \t were replaced with ' '
- \s{2,} was replaced with ' '
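The steps above can be sketched roughly as follows. This is a minimal illustration, not our exact code: the emoji ranges and the `MISSPELLINGS` dictionary are assumptions made for the example, and flashtext's keyword replacement is swapped here for a plain regex so the snippet only needs the standard library.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Rough emoji ranges (an assumption, not the list used in the competition).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
MULTI_SPACE_RE = re.compile(r"\s{2,}")

# Hypothetical misspelling dictionary; the real one was much larger.
MISSPELLINGS = {"teh": "the", "recieve": "receive"}
MISSPELL_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, MISSPELLINGS)) + r")\b"
)


def preprocess(text):
    """Apply the preprocessing steps in the order listed above."""
    text = URL_RE.sub("url", text)    # URLs -> 'url'
    text = EMOJI_RE.sub(" ", text)    # emoji -> ' '
    # Misspelled word -> correct word (regex stand-in for flashtext).
    text = MISSPELL_RE.sub(lambda m: MISSPELLINGS[m.group(1)], text)
    text = re.sub(r"[\n\t]", " ", text)    # \n and \t -> ' '
    text = MULTI_SPACE_RE.sub(" ", text)   # runs of whitespace -> ' '
    return text


cleaned = preprocess("See http://a.com/x now\n\tteh  end 😀!")
```

Note that the text is not lower-cased, since keeping the original case scored better for us.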
We also tried extracting some statistical features and feeding them into the LSTM model during training.
The statistical features were:
- swear words
- upper-case words
- unique words
- emoji
- characters
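A minimal sketch of how such features could be computed is shown below. It assumes each feature is a simple per-comment count, which is our reading for this example only; the `SWEAR_WORDS` set and the emoji ranges are hypothetical placeholders.

```python
import re

# Hypothetical placeholder list; a real list would be far longer.
SWEAR_WORDS = {"damn", "hell"}
# Rough emoji ranges (an assumption for this sketch).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def stat_features(text):
    """Return simple count features for one comment."""
    words = text.split()
    return {
        # Words found in the swear list, ignoring case and edge punctuation.
        "n_swear": sum(w.lower().strip(".,!?") in SWEAR_WORDS for w in words),
        # Words written entirely in upper case.
        "n_upper": sum(w.isupper() for w in words),
        # Distinct words, case-insensitive.
        "n_unique": len({w.lower() for w in words}),
        "n_emoji": len(EMOJI_RE.findall(text)),
        "n_chars": len(text),
    }


feats = stat_features("DAMN this is is bad 😀")
```

Features like these would normally be scaled before being concatenated with the LSTM output.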
We used the following pretrained word embeddings:
- fastText
- GloVe
In our experiments, fastText was slightly better than GloVe. We didn't concatenate fastText and GloVe because it was too time-consuming. (However, near the end of the competition, everyone was using BERT models, haha.)
Our LSTM model differs from the public version: it consists of LSTM cells only, without GRU cells.
- Attention didn't improve the LB significantly.
- Spatial dropout improved the LB.
- Blending three models: each LSTM model scored LB 0.935x~0.938x. After blending the three models, we got LB 0.93963.
We used a pretrained BERT model from pytorch-pretrained-BERT, with BertForSequenceClassification for sequence classification.
The results with and without text preprocessing were similar.
Increasing the batch size from 16 to 32 improved the LB. I think the batch size significantly influences accuracy.
The learning rate we set is
Our single models scored LB 0.9415x~0.94220.
We ensembled five single BERT models and got LB 0.94294.
- A single GPT2 model only reached around LB 0.938, so we focused on training BERT models.
- We blended the ensemble of 3 LSTM models and the ensemble of 5 BERT models with weights 0.3 and 0.7 respectively.
- In the end, we got LB 0.9443.
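The final blend can be sketched as below. The 0.3 and 0.7 weights are the ones stated above; averaging the models inside each ensemble with equal weights is an assumption for this example, and the prediction values are made up.

```python
def mean_ensemble(model_preds):
    """Average per-comment predictions across models (equal weights assumed)."""
    n_models = len(model_preds)
    return [sum(col) / n_models for col in zip(*model_preds)]


def blend(lstm_pred, bert_pred, w_lstm=0.3, w_bert=0.7):
    """Weighted blend of the LSTM ensemble and the BERT ensemble."""
    return [w_lstm * l + w_bert * b for l, b in zip(lstm_pred, bert_pred)]


# Made-up toxicity probabilities for two comments.
lstm_models = [[0.9, 0.2], [0.8, 0.1], [0.7, 0.3]]   # 3 LSTM models
bert_models = [[0.95, 0.05] for _ in range(5)]       # 5 BERT models

final = blend(mean_ensemble(lstm_models), mean_ensemble(bert_models))
```

Since the LB metric is AUC-based, only the ranking of the blended scores matters, so averaging ranks instead of raw probabilities is a common alternative.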