The code implements the Label Filter algorithm for speeding up prediction in Extreme Classification problems (i.e. multi-label classification problems with an extremely large label set). Label Filters are a computationally efficient technique for pre-selecting a small set of candidate labels for each test example before applying a more expensive multi-label classifier. For details, please see the following paper:
Alexandru Niculescu-Mizil, Ehsan Abbasnejad: "Label Filters for Large Scale Multilabel Classification" - Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS '17).
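The pre-selection idea can be illustrated with a small sketch. It assumes, following the description in the paper, that each filter projects an example onto a direction and keeps a label only if the projection falls inside that label's [lower, upper] interval, and that a label must pass every filter to remain a candidate. The function names and toy numbers below are hypothetical and are not part of this library's API.

```python
def dot(w, x):
    """Dot product of two equal-length dense vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def candidate_labels(x, directions, lower, upper, n_classes):
    """Return the labels that survive every filter.

    directions: nFilters x nFeatures; lower/upper: nFilters x nClasses.
    Assumptions: bounds are inclusive, filters are combined by intersection.
    """
    keep = set(range(n_classes))
    for d, lo, hi in zip(directions, lower, upper):
        p = dot(d, x)  # projection of the example onto this filter direction
        keep = {c for c in keep if lo[c] <= p <= hi[c]}
    return sorted(keep)


def predict(x, weights, candidates):
    """Score only the candidate labels with the more expensive classifiers."""
    return {c: dot(weights[c], x) for c in candidates}


# Toy problem: 2 features, 3 labels, 1 filter (all values are made up).
directions = [[1.0, 0.0]]
lower = [[-1.0, 0.5, 2.0]]
upper = [[1.0, 3.0, 4.0]]
weights = [[1.0, 1.0], [0.5, -1.0], [2.0, 0.0]]

x = [0.8, 0.2]
cands = candidate_labels(x, directions, lower, upper, n_classes=3)
scores = predict(x, weights, cands)
print(cands, scores)
```

In this toy case the projection is 0.8, so label 2 (interval [2.0, 4.0]) is discarded and only labels 0 and 1 are scored.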
The code requires the following libraries:
- boost for Dynamic Bitset and Program Options libraries
- gperftools for tcmalloc and profiler
- Eigen (included as a submodule)
- CRoaring (included as a submodule)
First modify src/Makefile to point to the correct path for the boost libraries, then execute:
cd src
make CRoaring #if using the included submodule
make
This compiles the label filter library and places the results in the bin/ directory.
The compilation will produce the following files:
- libmcfilter.so -- label filter library, for dynamic linking
- libmcfilter.a -- label filter library, for static linking
- mcsolve -- example program to learn a filter
- mcproj -- example program to apply the filter with linear classifiers
- convert_data -- program to convert data from the XML text format to binary dense/sparse format
- convert_linearmodel -- program to convert model files from text to binary format
Example usage on the Mediamill dataset:
cd Mediamill
gunzip train_split1_Mediamill_data.txt.gz test_split1_Mediamill_data.txt.gz
# learn the label filters
../bin/mcsolve -o mediamill_C1_0.1_C2_0.1.filter --C1=0.1 --C2=0.1 --nfilters=2 -x train_split1_Mediamill_data.txt --maxiter=1000000
# classify using an SVM model
numactl --interleave=all -- ../bin/mcproj --nProj={0,2} -f mediamill_C1_0.1_C2_0.1.filter -x test_split1_Mediamill_data.txt --modelFiles svm_C10_split1_Mediamil.svmmodel
# list all options
../bin/mcsolve --help
../bin/mcproj --help
The data file uses the XML (extreme multi-label) text format.
The first line is an optional header:
nExamples nFeatures nClasses
Subsequently the file should contain one data point per line, in the following sparse format:
label1,label2,label3 feature:value feature:value ... feature:value #comment
Labels are consecutive integers from 0 to nClasses-1. Features are consecutive integers from 0 to nFeatures-1.
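The text format above can be read with a few lines of code. The following is a minimal sketch, not the parser used by this library: the function names are hypothetical, and it assumes that a first line consisting of exactly three integers is the header, that blank lines can be skipped, and that an example without labels simply starts with a feature:value token.

```python
def parse_data_line(line):
    """Parse 'l1,l2 f:v f:v ... #comment' into (labels, features)."""
    body = line.split("#", 1)[0]  # drop the trailing comment, if any
    tokens = body.split()
    labels = []
    # The label token is the only one without a ':' separator.
    if tokens and ":" not in tokens[0]:
        labels = [int(t) for t in tokens[0].split(",") if t]
        tokens = tokens[1:]
    features = {}
    for tok in tokens:
        idx, val = tok.split(":", 1)
        features[int(idx)] = float(val)
    return labels, features


def parse_data(lines):
    """Return (header, examples); header is (nExamples, nFeatures, nClasses) or None."""
    header = None
    examples = []
    first = True
    for raw in lines:
        body = raw.split("#", 1)[0].strip()
        if not body:
            continue  # assumption: blank lines carry no example
        tokens = body.split()
        if first and len(tokens) == 3 and all(t.isdigit() for t in tokens):
            header = tuple(int(t) for t in tokens)
        else:
            examples.append(parse_data_line(body))
        first = False
    return header, examples


sample = [
    "2 4 3",
    "0,2 0:1.5 3:2 #first example",
    "1 1:0.25",
]
header, examples = parse_data(sample)
print(header, examples)
```

Features are kept as a dictionary from feature index to value, which mirrors the sparse layout of the file.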
The code also supports a binary format with both dense and sparse storage. Use convert_data to convert from text to binary format.
The label filter file is composed of three matrices:
- The filter directions, as an nFilters x nFeatures matrix
- The lower bounds, as an nFilters x nClass matrix
- The upper bounds, as an nFilters x nClass matrix
Each matrix starts with a header:
nRows nCols
followed by nRows rows of nCols space-separated floats.
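As an illustration of this layout, the sketch below reads the three matrices from the text form of a filter file. It is not the library's loader: the function names are hypothetical, and it assumes the file contains nothing but the three matrices, in the order listed above, separated by whitespace.

```python
def read_matrix(tokens):
    """Read one matrix: header 'nRows nCols', then nRows * nCols floats."""
    n_rows = int(next(tokens))
    n_cols = int(next(tokens))
    return [[float(next(tokens)) for _ in range(n_cols)] for _ in range(n_rows)]


def parse_filter_text(text):
    """Return (directions, lower, upper) from the text of a filter file."""
    tokens = iter(text.split())
    directions = read_matrix(tokens)  # nFilters x nFeatures
    lower = read_matrix(tokens)       # nFilters x nClass
    upper = read_matrix(tokens)       # nFilters x nClass
    if not (len(directions) == len(lower) == len(upper)):
        raise ValueError("all three matrices must have nFilters rows")
    return directions, lower, upper


# Made-up example: 1 filter, 2 features, 3 classes.
sample = """1 2
0.6 -0.8
1 3
-1.0 0.5 2.0
1 3
1.0 3.0 4.0
"""
directions, lower, upper = parse_filter_text(sample)
print(directions, lower, upper)
```

Reading by token rather than by line keeps the sketch tolerant of extra line breaks inside a row.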
The linear classifiers are stored in an SVMLight-like format.
The first row in the file is an optional header:
nClasses nFeatures
Next, the weights of each classifier are stored, one classifier per line, in the order of the labels (i.e. the classifier corresponding to label '0' is first, then the classifier corresponding to label '1', and so on). The format is:
intercept feature:weight feature:weight ... feature:weight
Features are consecutive integers from 0 to nFeatures-1.
The intercept is optional. If not present, it is treated as 0.
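The following sketch shows one way to read this text format and score a sparse example. It is not the library's loader: the function names are hypothetical, and it assumes that a first line with exactly two integers is the header, that an empty line is a classifier with no non-zero weights, and that the score is the dot product plus the intercept.

```python
def parse_model_line(line):
    """Parse '[intercept] f:w f:w ...' into (intercept, weights)."""
    tokens = line.split()
    intercept = 0.0  # a missing intercept is treated as 0
    if tokens and ":" not in tokens[0]:
        intercept = float(tokens[0])
        tokens = tokens[1:]
    weights = {}
    for tok in tokens:
        idx, w = tok.split(":", 1)
        weights[int(idx)] = float(w)
    return intercept, weights


def parse_model(lines):
    """Return (header, classifiers); header is (nClasses, nFeatures) or None."""
    lines = list(lines)
    header = None
    if lines:
        tokens = lines[0].split()
        if len(tokens) == 2 and all(t.isdigit() for t in tokens):
            header = (int(tokens[0]), int(tokens[1]))
            lines = lines[1:]
    # Classifier i corresponds to label i, in file order.
    return header, [parse_model_line(line) for line in lines]


def score(classifier, features):
    """Assumed scoring rule: intercept + sum of weight * feature value."""
    intercept, weights = classifier
    return intercept + sum(w * features.get(i, 0.0) for i, w in weights.items())


sample = [
    "2 3",
    "0.5 0:1.0 2:-2.0",
    "1:4.0",
]
header, classifiers = parse_model(sample)
s0 = score(classifiers[0], {0: 2.0, 2: 1.0})
print(header, classifiers, s0)
```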
The code also supports a binary format with both dense and sparse storage. Binary model files are smaller and much faster to load. Use convert_linearmodel to convert from text to binary format.
The main classes are:
- MClearnFilter -- main class for training the label filters from data.
- MClinearClass -- main class for testing using a linear model and optional label filters. Handles prediction and evaluation.
- MCxyDataArgs -- defines parameters and command line options for handling the data.
- MCsolveArgs -- defines parameters and command line options for learning the label filter.
- MCprojectorArgs -- defines parameters and command line options for applying the label filters.
- MCclassifierArgs -- defines parameters and command line options for applying the classifier.
- MCxyData -- encapsulates the data for training/testing. Handles both dense and sparse data. Manages loading and saving data in various formats.
- MCsoln -- encapsulates the label filter parameters. Manages loading and saving label filters in text or binary formats.
- PredictionSet -- encapsulates the predictions of the model.
- linearModel -- encapsulates a linear classifier. Handles loading, saving and prediction (not training).