Example of how to use the SDK to create classifier metrics #29

Open — wants to merge 1 commit into base: master
metrics.cpp — 139 additions, 0 deletions
/* This is an example of how to add additional evaluation metrics to GraphLab Create using the SDK.
In this example, we develop seven evaluation metric functions for a binary classifier.
Contributor:

I feel like you need to have a Python version as well, so people can get started more easily. You could have two how-tos, one using the SDK and one in Python, with the same content.

These functions take as input two SArrays with the actual and predicted classifier values, plus the label of the positive class,
and calculate the classifier's true positives (TP), false positives (FP), true negatives (TN), false negatives (FN), precision score, recall score, and F1 score.

The file can be compiled using the following command:
g++ -std=c++11 metrics.cpp -I <graphlab-sdk directory> -shared -fPIC -o metrics.so

After compilation, the file can be loaded into Python using the following line:
>>> import metrics

Python Code Example:
>>> import graphlab
>>> import metrics


>>> actual = graphlab.SArray([1,1,0,0,0])
>>> predicted = graphlab.SArray([0,1,0,0,1])
>>> metrics.true_positive(actual,predicted, 1) #1
>>> metrics.false_positive(actual,predicted, 1) # 1
>>> metrics.true_negative(actual,predicted, 1) # 2
>>> metrics.false_negative(actual,predicted, 1) # 1

>>> actual = graphlab.SArray(['a','b','a','a','a','b'])
>>> predicted = graphlab.SArray(['a','a','b','b','b','a'])
>>> metrics.true_positive(actual,predicted, 'a') # 1
>>> metrics.false_positive(actual,predicted, 'a') # 2
>>> metrics.true_negative(actual,predicted, 'a') # 0
>>> metrics.false_negative(actual,predicted, 'a') # 3


For more details please read:
https://dato.com/products/create/sdk/docs/index.html
https://github.com/dato-code/GraphLab-Create-SDK


Written by: Michael Fire
*/

#include <graphlab/sdk/toolkit_function_macros.hpp>
#include <graphlab/sdk/gl_sarray.hpp>
using namespace graphlab;
int true_positive(gl_sarray actual, gl_sarray predicted, flexible_type pos_label) {
    if (actual.size() != predicted.size())
Contributor:

This can be done in one line: (actual == pos_label) * (predicted == pos_label)
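To illustrate the vectorized idea in the reviewer's one-liner, here is a NumPy analogue (an assumption for illustration only: NumPy stands in for gl_sarray, whose overloaded == and * operators behave element-wise in the same spirit; this is not SDK code):

```python
import numpy as np

# Same data as the first Python example in the docstring.
actual = np.array([1, 1, 0, 0, 0])
predicted = np.array([0, 1, 0, 0, 1])
pos_label = 1

# Element-wise: nonzero only where actual and predicted both equal pos_label.
tp_mask = (actual == pos_label) * (predicted == pos_label)
tp = int(tp_mask.sum())
print(tp)  # 1
```

The product of the two boolean masks is itself a mask of the true positives, so a single reduction (`sum`) replaces the explicit loop.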

    {
        throw "Error: Cannot calculate True-Positive. Input arrays of different size";
    }
    int tp = 0;
    for (size_t i = 0; i < actual.size(); i++)
    {
        if (actual[i] == pos_label && actual[i] == predicted[i])
        {
            tp += 1;
        }
    }
    return tp;
}

int false_positive(gl_sarray actual, gl_sarray predicted, flexible_type pos_label) {
Contributor:

Same as above.

Author:

I personally think the code needs to be kept "simple and stupid".
Making it a one-liner may be nice, but it would take the average coder a lot
of time to figure out what the one-liner actually does.
Regarding separate Python code, I think it would get lost among the other
files in the folder.


Contributor:

I generally agree, but not in this case, for three main reasons:

(a) Element-wise array access (i.e. actual[i]) is very slow, and that defeats the purpose of using the C++ SDK; one only writes C++ here for performance.

(b) Code that looks like many lines of work can intimidate people into thinking it is too much effort.

(c) Regarding separate Python code, I think it's generally useful to have a Python-only version of our evaluation metrics in a how-to, at least until we have them in our own toolkits.

Contributor:

In addition to the above, having a pure Python version is much more portable than a C++ SDK version, because the code can run anywhere.

Another option to consider: add your metrics to the C++ SDK examples, while adding a pure-Python evaluation in the how-to.

What do you think?
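The pure-Python route suggested here could look something like the sketch below, using plain lists instead of SArrays (the function names mirror the C++ ones but are otherwise hypothetical; the length check is shown once for brevity):

```python
# Pure-Python sketch of the same metrics, operating on plain lists.
# Illustration only, not part of the PR's C++ file.

def true_positive(actual, predicted, pos_label):
    if len(actual) != len(predicted):
        raise ValueError("input arrays of different size")
    return sum(a == pos_label and a == p for a, p in zip(actual, predicted))

def false_positive(actual, predicted, pos_label):
    return sum(p == pos_label and a != pos_label for a, p in zip(actual, predicted))

def true_negative(actual, predicted, pos_label):
    return sum(p != pos_label and a != pos_label for a, p in zip(actual, predicted))

def false_negative(actual, predicted, pos_label):
    return sum(p != pos_label and a == pos_label for a, p in zip(actual, predicted))

# Same data as the second example in the C++ docstring.
actual = ['a', 'b', 'a', 'a', 'a', 'b']
predicted = ['a', 'a', 'b', 'b', 'b', 'a']
print(true_positive(actual, predicted, 'a'))   # 1
print(false_positive(actual, predicted, 'a'))  # 2
print(true_negative(actual, predicted, 'a'))   # 0
print(false_negative(actual, predicted, 'a'))  # 3
```

Such a version runs anywhere Python does, with no compilation step.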

Author:

I am far from being an expert in C++, but why is direct access to an array
element very slow? (Can you give me an example of how to make it faster?)
To make the Python version work we would also need to add the compiled
C++ version, which makes this how-to messier.
It would take me only minutes to add this; however, we need to decide whether
the advantages of the addition outweigh its disadvantages.


    if (actual.size() != predicted.size())
    {
        throw "Error: Cannot calculate False-Positive. Input arrays of different size";
    }
    int fp = 0;
    for (size_t i = 0; i < actual.size(); i++)
    {
        if (predicted[i] == pos_label && actual[i] != pos_label)
        {
            fp += 1;
        }
    }
    return fp;
}

int true_negative(gl_sarray actual, gl_sarray predicted, flexible_type pos_label) {
    if (actual.size() != predicted.size())
    {
        throw "Error: Cannot calculate True-Negative. Input arrays of different size";
    }
    int tn = 0;
    for (size_t i = 0; i < actual.size(); i++)
    {
        if (predicted[i] != pos_label && actual[i] != pos_label)
        {
            tn += 1;
        }
    }
    return tn;
}

int false_negative(gl_sarray actual, gl_sarray predicted, flexible_type pos_label) {
    if (actual.size() != predicted.size())
    {
        throw "Error: Cannot calculate False-Negative. Input arrays of different size";
    }
    int fn = 0;
    for (size_t i = 0; i < actual.size(); i++)
    {
        if (predicted[i] != pos_label && actual[i] == pos_label)
        {
            fn += 1;
        }
    }
    return fn;
}

float precision_score(gl_sarray actual, gl_sarray predicted, flexible_type pos_label) {
    float tp = (float) true_positive(actual, predicted, pos_label);
    float fp = (float) false_positive(actual, predicted, pos_label);
    return tp / (tp + fp);
}

float recall_score(gl_sarray actual, gl_sarray predicted, flexible_type pos_label) {
    float tp = (float) true_positive(actual, predicted, pos_label);
    float fn = (float) false_negative(actual, predicted, pos_label);
    return tp / (tp + fn);
}

float f1_score(gl_sarray actual, gl_sarray predicted, flexible_type pos_label) {
    float recall = recall_score(actual, predicted, pos_label);
    float precision = precision_score(actual, predicted, pos_label);
    return (2 * (precision * recall)) / (precision + recall);
}


BEGIN_FUNCTION_REGISTRATION
REGISTER_FUNCTION(true_positive, "actual", "predicted", "pos_label"); // provide named parameters
REGISTER_FUNCTION(false_positive, "actual", "predicted", "pos_label");
REGISTER_FUNCTION(true_negative, "actual", "predicted", "pos_label");
REGISTER_FUNCTION(false_negative, "actual", "predicted", "pos_label");
REGISTER_FUNCTION(precision_score, "actual", "predicted", "pos_label");
REGISTER_FUNCTION(recall_score, "actual", "predicted", "pos_label");
REGISTER_FUNCTION(f1_score, "actual", "predicted", "pos_label");
END_FUNCTION_REGISTRATION
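As a quick arithmetic check of the derived scores, plugging in the counts from the second Python example in the docstring (TP = 1, FP = 2, FN = 3 for pos_label 'a') gives the following (plain-Python arithmetic, no SDK needed):

```python
# Counts taken from the second example: actual = ['a','b','a','a','a','b'],
# predicted = ['a','a','b','b','b','a'], pos_label = 'a'.
tp, fp, fn = 1.0, 2.0, 3.0

precision = tp / (tp + fp)                            # 1/3
recall = tp / (tp + fn)                               # 1/4
f1 = (2 * precision * recall) / (precision + recall)  # 2/7, about 0.2857
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Note that, as written, precision_score and recall_score divide by zero when the corresponding counts are all zero; callers should keep that edge case in mind.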