Description
When using an imbalanced sklearn dataset (imbalance ratio 99:1), our DFA job performs poorly: overall_accuracy is 0.02.
Steps to reproduce:
On the latest master build (Jul 19):
- In Data Visualizer, import the imbalance.csv file into the index imbalance.
- During the import, change the mapping of column 30 from double to long:
"30": {
  "type": "long"
}
- Create and start the DFA job from the dev console:
PUT _ml/data_frame/analytics/imbalance
{
  "source": {
    "index": [
      "imbalance"
    ],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "dest-imbalance",
    "results_field": "ml"
  },
  "analysis": {
    "classification" : {
      "dependent_variable" : "30",
      "class_assignment_objective" : "maximize_minimum_recall",
      "num_top_classes" : 2,
      "prediction_field_name" : "30_prediction",
      "training_percent" : 80.0,
      "randomize_seed" : 4642014714383011104,
      "early_stopping_enabled" : true
    }
  },
  "model_memory_limit": "1gb",
  "allow_lazy_start": false,
  "max_num_threads": 1
}
POST _ml/data_frame/analytics/imbalance/_start
- Once the job finishes, run the evaluation:
POST _ml/data_frame/_evaluate
{
  "index": "dest-imbalance",
  "query": {
    "term": {
      "ml.is_training": {
        "value": "false"
      }
    }
  },
  "evaluation": {
    "classification": {
      "actual_field": "30",
      "predicted_field": "ml.30_prediction",
      "metrics": {
        "accuracy" : {}
      }
    }
  }
}
Result:
{
  "classification" : {
    "accuracy" : {
      "classes" : [
        {
          "class_name" : "0",
          "value" : 0.02
        },
        {
          "class_name" : "1",
          "value" : 0.02
        }
      ],
      "overall_accuracy" : 0.02
    }
  }
}
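
For context on how far below a trivial baseline this result is, here is a rough sketch in Python. It assumes a hypothetical held-out split of 10,000 documents at the reported 99:1 ratio (the real test-set size is not stated in this report), and the helper name accuracy is illustrative only. It compares the reported overall_accuracy with what a classifier that always predicts the majority class "0" would score.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of documents whose predicted class equals the actual class."""
    total = tp + tn + fp + fn
    return (tp + tn) / total


# Hypothetical test split at the 99:1 ratio from the report (size is an assumption).
n_majority, n_minority = 9900, 100
n_total = n_majority + n_minority

# Baseline: always predict the majority class "0", treating class "1" as positive.
# Every majority document is a true negative, every minority document a false negative.
baseline = accuracy(tp=0, tn=n_majority, fp=0, fn=n_minority)

# Number of correct predictions implied by the reported overall_accuracy.
reported = 0.02
correct = round(reported * n_total)

print(f"majority-class baseline accuracy: {baseline:.2f}")
print(f"correct predictions at accuracy {reported}: {correct} of {n_total}")
```

Under these assumed counts, the trivial baseline scores 0.99, while an accuracy of 0.02 means only about 200 of 10,000 documents were classified correctly, so nearly all majority-class documents must have been assigned to the wrong class.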