Example on how to use SDK for creating classifiers metrics #29
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,139 @@ | ||
/* This is an example of how to add additional evaluation metrics to GraphLab Create using the SDK. | ||
In this example, we develop seven evaluation metric functions for a binary classifier. | ||
These functions take as input two SArrays with the actual and predicted classifier values, plus the label of the positive class, | ||
and calculate the classifier's true positives (TP), false positives (FP), true negatives (TN), false negatives (FN), precision score, recall score, and F1 score. | ||
|
||
The file can be compiled using the following command: | ||
g++ -std=c++11 metrics.cpp -I <graphlab-sdk directory> -shared -fPIC -o metrics.so | ||
|
||
After compilation, the file can be loaded into Python using the following line: | ||
>>> import metrics | ||
|
||
Python Code Example: | ||
>>> import graphlab | ||
>>> import metrics | ||
|
||
|
||
>>> actual = graphlab.SArray([1,1,0,0,0]) | ||
>>> predicted = graphlab.SArray([0,1,0,0,1]) | ||
>>> metrics.true_positive(actual,predicted, 1) # 1 | ||
>>> metrics.false_positive(actual,predicted, 1) # 1 | ||
>>> metrics.true_negative(actual,predicted, 1) # 2 | ||
>>> metrics.false_negative(actual,predicted, 1) # 1 | ||
|
||
>>> actual = graphlab.SArray(['a','b','a','a','a','b']) | ||
>>> predicted = graphlab.SArray(['a','a','b','b','b','a']) | ||
>>> metrics.true_positive(actual,predicted, 'a') # 1 | ||
>>> metrics.false_positive(actual,predicted, 'a') # 2 | ||
>>> metrics.true_negative(actual,predicted, 'a') # 0 | ||
>>> metrics.false_negative(actual,predicted, 'a') # 3 | ||
|
||
|
||
For more details please read: | ||
https://dato.com/products/create/sdk/docs/index.html | ||
https://github.com/dato-code/GraphLab-Create-SDK | ||
|
||
|
||
Written by: Michael Fire | ||
*/ | ||
|
||
#include <graphlab/sdk/toolkit_function_macros.hpp> | ||
#include <graphlab/sdk/gl_sarray.hpp> | ||
using namespace graphlab; | ||
int true_positive(gl_sarray actual, gl_sarray predicted, flexible_type pos_label) { | ||
if(actual.size() != predicted.size()) | ||
This can be done in 1 line. (actual == pos_label) * (predicted == pos_label) |
||
{ | ||
throw "Error: Cannot calculate True-Positive. Input arrays of different size"; | ||
} | ||
int tp = 0; | ||
for(int i=0;i<actual.size(); i++) | ||
{ | ||
if(actual[i] == pos_label && actual[i] == predicted[i]) | ||
{ | ||
tp += 1; | ||
} | ||
|
||
} | ||
return tp; | ||
} | ||
|
||
int false_positive(gl_sarray actual, gl_sarray predicted, flexible_type pos_label) { | ||
Same as above.

I personally think that the code needs to be kept "simple and stupid". On Thu, Mar 5, 2015 at 8:12 PM, Srikrishna Sridhar <[email protected]

I generally agree but not in this case for 2 main reasons:
(a) Using the array access (i.e. actual[i]) is very slow and that defeats the purpose of using the C++ SDK. One only wants to write code in C++ for performance.
(b) Making the code look like a lot of lines of code can possibly intimidate people by thinking it's too much work.
(c) Regarding separate Python code, I think it's generally useful to have a Python-only version of our evaluation metrics in a How-To, at least until we have them on our own in our toolkits.

In addition to the above, having a pure Python version is much more portable than having a C++ SDK version because I can run this code anywhere. Another option to consider is to add your metrics to the C++ SDK examples while adding a pure Python evaluation in the how-to? What do you think?

I am far from being an expert in C++ but why using direct access to an array On Thu, Mar 5, 2015 at 8:57 PM, Srikrishna Sridhar <[email protected]
|
||
if(actual.size() != predicted.size()) | ||
{ | ||
throw "Error: Cannot calculate False-Positive. Input arrays of different size"; | ||
} | ||
int fp = 0; | ||
for(int i=0;i<actual.size(); i++) | ||
{ | ||
if(predicted[i] == pos_label && actual[i] != pos_label) | ||
{ | ||
fp += 1; | ||
} | ||
|
||
} | ||
return fp; | ||
} | ||
|
||
int true_negative(gl_sarray actual, gl_sarray predicted, flexible_type pos_label) { | ||
if(actual.size() != predicted.size()) | ||
{ | ||
throw "Error: Cannot calculate True-Negative. Input arrays of different size"; | ||
} | ||
int tn = 0; | ||
for(int i=0;i<actual.size(); i++) | ||
{ | ||
if(predicted[i] != pos_label && actual[i] != pos_label) | ||
{ | ||
tn += 1; | ||
} | ||
|
||
} | ||
return tn; | ||
} | ||
|
||
int false_negative(gl_sarray actual, gl_sarray predicted, flexible_type pos_label) { | ||
if(actual.size() != predicted.size()) | ||
{ | ||
throw "Error: Cannot calculate False-Negative. Input arrays of different size"; | ||
} | ||
int fn = 0; | ||
for(int i=0;i<actual.size(); i++) | ||
{ | ||
if(predicted[i] != pos_label && actual[i] == pos_label) | ||
{ | ||
fn += 1; | ||
} | ||
|
||
} | ||
return fn; | ||
} | ||
|
||
float precision_score(gl_sarray actual, gl_sarray predicted, flexible_type pos_label) { | ||
float tp = (float) true_positive(actual,predicted,pos_label); | ||
float fp = (float) false_positive(actual,predicted,pos_label); | ||
return tp/(tp + fp); | ||
} | ||
|
||
float recall_score(gl_sarray actual, gl_sarray predicted, flexible_type pos_label) { | ||
float tp = (float) true_positive(actual,predicted,pos_label); | ||
float fn = (float) false_negative(actual,predicted,pos_label); | ||
return tp/(tp + fn); | ||
} | ||
|
||
float f1_score(gl_sarray actual, gl_sarray predicted, flexible_type pos_label) { | ||
float recall = recall_score(actual,predicted,pos_label); | ||
float precision = precision_score(actual,predicted,pos_label); | ||
return (2 * (precision * recall)) / (precision + recall); | ||
} | ||
|
||
|
||
BEGIN_FUNCTION_REGISTRATION | ||
REGISTER_FUNCTION(true_positive, "actual", "predicted", "pos_label"); // provide named parameters | ||
REGISTER_FUNCTION(false_positive, "actual", "predicted", "pos_label"); // provide named parameters | ||
REGISTER_FUNCTION(true_negative, "actual", "predicted", "pos_label"); // provide named parameters | ||
REGISTER_FUNCTION(false_negative, "actual", "predicted", "pos_label"); // provide named parameters | ||
REGISTER_FUNCTION(precision_score, "actual", "predicted", "pos_label"); // provide named parameters | ||
REGISTER_FUNCTION(recall_score, "actual", "predicted", "pos_label"); // provide named parameters | ||
REGISTER_FUNCTION(f1_score, "actual", "predicted", "pos_label"); // provide named parameters | ||
END_FUNCTION_REGISTRATION | ||
|
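The score functions in the file above call the counting functions separately, so each score scans the inputs more than once, and the divisions yield NaN when a denominator is zero (for example, when nothing is predicted positive). The following is only a minimal sketch of a single-pass alternative in standard C++. It uses `std::vector` as a stand-in for `gl_sarray`, and the names `ConfusionCounts`, `count_confusion`, `safe_ratio`, `precision_of`, `recall_of`, and `f1_of` are hypothetical helpers, not part of the GraphLab Create SDK. Returning 0.0 for an undefined score is an assumption made here, not something the pull request specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper type (not part of the SDK): the four confusion-matrix cells.
struct ConfusionCounts {
  std::size_t tp = 0, fp = 0, tn = 0, fn = 0;
};

// Count TP, FP, TN and FN in one pass over both inputs.
template <typename Label>
ConfusionCounts count_confusion(const std::vector<Label>& actual,
                                const std::vector<Label>& predicted,
                                const Label& pos_label) {
  if (actual.size() != predicted.size()) {
    throw std::invalid_argument("Input arrays of different size");
  }
  ConfusionCounts c;
  for (std::size_t i = 0; i < actual.size(); ++i) {
    const bool is_pos = (actual[i] == pos_label);
    const bool pred_pos = (predicted[i] == pos_label);
    if (is_pos && pred_pos) ++c.tp;
    else if (!is_pos && pred_pos) ++c.fp;
    else if (!is_pos && !pred_pos) ++c.tn;
    else ++c.fn;
  }
  return c;
}

// Assumption: an undefined ratio (zero denominator) is reported as 0.0.
inline double safe_ratio(double num, double den) {
  return den == 0.0 ? 0.0 : num / den;
}

inline double precision_of(const ConfusionCounts& c) {
  return safe_ratio(static_cast<double>(c.tp), static_cast<double>(c.tp + c.fp));
}

inline double recall_of(const ConfusionCounts& c) {
  return safe_ratio(static_cast<double>(c.tp), static_cast<double>(c.tp + c.fn));
}

inline double f1_of(const ConfusionCounts& c) {
  const double p = precision_of(c);
  const double r = recall_of(c);
  return safe_ratio(2.0 * p * r, p + r);
}
```

With the first example from the header comment (`actual = [1,1,0,0,0]`, `predicted = [0,1,0,0,1]`, positive label `1`), this sketch gives TP = 1, FP = 1, TN = 2, FN = 1, and precision, recall and F1 all equal to 0.5.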
I feel like you need to have a python version as well so people can get started more easily. You can have 2 how to's with SDK and python (same content)
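To make the one-line form suggested in the review, `(actual == pos_label) * (predicted == pos_label)`, concrete: it builds two 0/1 masks and multiplies them element-wise, so summing the product gives the true-positive count. The sketch below spells that out in standard C++ with `std::vector` standing in for `gl_sarray`. The helpers `equals_mask`, `invert_mask`, and `masked_count` are hypothetical names introduced for illustration, and this does not claim to show which operators `gl_sarray` actually provides.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <vector>

// Element-wise comparison against a scalar: 1 where values[i] == label, else 0.
template <typename Label>
std::vector<int> equals_mask(const std::vector<Label>& values, const Label& label) {
  std::vector<int> mask(values.size());
  for (std::size_t i = 0; i < values.size(); ++i) {
    mask[i] = (values[i] == label) ? 1 : 0;
  }
  return mask;
}

// Flip a 0/1 mask, i.e. the element-wise "!= label" counterpart.
inline std::vector<int> invert_mask(const std::vector<int>& mask) {
  std::vector<int> out(mask.size());
  for (std::size_t i = 0; i < mask.size(); ++i) {
    out[i] = 1 - mask[i];
  }
  return out;
}

// Sum of the element-wise product of two equal-length masks.
inline long long masked_count(const std::vector<int>& a, const std::vector<int>& b) {
  if (a.size() != b.size()) {
    throw std::invalid_argument("Input arrays of different size");
  }
  return std::inner_product(a.begin(), a.end(), b.begin(), 0LL);
}
```

Under these assumptions, true positives are `masked_count(equals_mask(actual, pos), equals_mask(predicted, pos))`, and the other three cells follow by inverting one or both masks, which reproduces the counts listed for the `'a'`/`'b'` example in the header comment.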