-
Notifications
You must be signed in to change notification settings - Fork 0
License
victoroliv2/LibOPF
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
LibOPF is a library of functions and programs for free usage in the design of optimum-path forest classifiers. This second version contains some additional resources related to the supervised OPF classifier reported in reference [PapaIJIST09], and also contains the unsupervised version of OPF reported in reference [RochaIJIST09]. A short explanation about the method can be found in http://www.ic.unicamp.br/~afalcao/LibOPF. Please read the COPYRIGHT file before using LibOPF. For large datasets (thousands/millions of samples), it is usually desirable to keep some maximum size for the training set. However, an evaluation set can improve the training samples during pseudo tests (learning procedure). Therefore, LibOPF provides a program to randomly split the dataset into training, evaluation and test sets (opf_split). One can project an OPF classifier by using the program 'opf_train' and test it by using the program 'opf_classify'. However, for large datasets, the program 'opf_learn' substitutes the 'opf_train' by learning from classification errors in the evaluation set without increasing the training set size. Afterwards, the classifier is tested by using 'opf_classify'. In the case of time-consuming distance functions, one can generate a precomputed distance file in the format specified in section 'opf_distance' Usage. LibOPF also provides a program (opf_distance) with some options of distance functions, which generates a precomputed distance file. Table of Contents ================= - Installation - OPF file format for datasets - Programs included SUPERVISED OPF - 'opf_split' Usage - 'opf_train' Usage - 'opf_learn' Usage - 'opf_classify' Usage - 'opf_accuracy' Usage UNSUPERVISED OPF - 'opf_cluster' Usage - 'opf_knn_classify' Usage COMMON (Auxiliary functions) - 'opf_distance' Usage - Datasets - Examples of scripts - Tools - Accuracy computation - Fixed bugs - Additional Information Installation ============ On Unix systems, decompress the LibOPF.tar.bz2 file (use tar -xvjf LibOPF.tar.bz2) into a directory. The system will create a directory named 'LibOPF'. Type 'cd LibOPF', and then type `make' to build the programs 'opf_split', 'opf_accuracy', 'opf_train', 'opf_learn', 'opf_distance', and 'opf_classify' into the /bin directory. Run them without arguments to show their usage. OPF file format for datasets ============================ The original dataset and its parts training, evaluation and test sets must be in the following BINARY file format: <# of samples> <# of labels> <# of features> <0> <label> <feature 1 from element 0> <feature 2 from element 0> ... <1> <label> <feature 1 from element 1> <feature 2 from element 1> ... . . <i> <label> <feature 1 from element i> <feature 2 from element i> ... <i+1> <label> <feature 1 from element i+1> <feature 2 from element i+1> ... . . <n-1> <label> <feature 1 from element n-1> <feature 2 from element n-1> ... The first number of each line, <0>, <1>, ... <n-1>, is a sample identifier (for n samples in the dataset), which is used in the case of precomputed distances. However, the identifier must be specified anyway. For unlabeled datasets, please use label 0 for all samples (unsupervised OPF). Example: Suppose that you have a dataset with 5 samples, distributed into 3 classes, with 2 elements from label 1, 2 elements from label 2 and 1 element from label 3. Each sample is represented by a feature vector of size 2. So, the OPF file format should look like as below: 5 3 2 0 1 0.21 0.45 1 1 0.22 0.43 2 2 0.67 1.12 3 2 0.60 1.11 4 3 0.79 0.04 Comment #1: Note that, the file must be binary with no blank spaces. This ASCII representation is just for illustration. Comment #2: The first line of the file, 5 3 2, contains, respectively, the dataset size, the number of labels (classes) and the number of features in the feature vectors. The remaining lines contain the sample identifier (integer from 0 to n-1, in which n is the dataset size), its label and the feature values for each sample. Programs included ================= In this package, we have the following programs: SUPERVISED OPF ------------- opf_split: This is a program to randomly split the dataset into training, evaluation and test sets. opf_train: This is a program to execute the training phase. opf_learn: This is a program to execute the learning phase from classification errors in the evaluation set. It substitutes 'opf_train'. opf_classify: This is a program to execute the test phase by classifying the test set. opf_accuracy: This is a program to compute the accuracy over training and/or test set. UNSUPERVISED OPF ------------- opf_cluster: This is a program to compute clusters by OPF. It assigns a consecutive number starting from 1 to N for N clusters, when the training set is unlabeled. Otherwise, it propagates the true labels of the roots to the labels of the nodes in their respective trees in order to evaluate the quality of the clustering. The resulting classifier is written in classifier.opf opf_knn_classify: This program uses the classifier from opf_cluster to assign labels to a test or training set. The accuracy of the process makes sense only when these sets have true labels. This accuracy can be measured by opf_accuracy and it represents the quality of the clustering process. COMMON (Auxiliary functions) ------------- opf_distance: This is a program to compute distance functions and store them into a precomputed distance file. 'opf_split' Usage ================= Usage: opf_split <P1> <P2> <P3> <P4> <P5> P1: dataset in the OPF file format P2: percentage of the training set size [0,1] P3: percentage of the evaluation set size [0,1] (leave 0 in the case of no learning) P4: percentage of the test set size [0,1]" P5: normalize features? 1 - Yes 0 - No The sum P2+P3+P4 must be 1. The features are normalized with the following equation: N_i = (F_i - M_i)/S_i, where F_i, M_i and S_i are, respectively, the feature i, the average of F_i and the standard deviation of F_i in the dataset. The program splits the dataset into two new files, training.opf and testing.opf, when P3=0, and it splits the dataset into three files, training.opf, evaluating.opf and testing.opf, otherwise. 'opf_train' Usage ================= Usage: opf_train <P1> <P2> P1: training set in the OPF file format P2: precomputed distance file (leave it in blank if you are not using this resource) The program designs a classifier from training.opf and outputs it in a file named classifier.opf, which is used by 'opf_classify' for testing. The opf_train also outputs the following files: -> .out: it contains the predicted labels (training phase) -> .time: it contains the execution time in seconds (training phase) -> .acc: it contains the accuracy (training phase) 'opf_learn' Usage ================= Usage: opf_learn <P1> <P2> <P3> P1: training set in the OPF file format P2: evaluation set in the OPF file format P3: precomputed distance file (leave it in blank if you are not using this resource) The program substitutes 'opf_learn' when there is evaluation set. It learns from the classification errors in the evaluation set without increasing the training set size, and outputs a final classifier in a file named classifier.opf, which is used for testing by the program 'opf_classify'. The opf_learning outputs the following file: -> .time: it contains the execution time in seconds (learning phase) 'opf_classify' Usage ==================== Usage: opf_classify <P1> <P2> P1: test/training set in the OPF file format P2: precomputed distance file (leave it in blank if you are not using this resource) The opf_classify outputs the following files: -> .out: it contains the predicted labels (test phase) -> .time: it contains the execution time in seconds (test phase) 'opf_accuracy' Usage ==================== Usage: opf_accuracy <P1> P1: data set in the OPF file format The opf_accuracy will look for a classified file with the same name of the data set file in P1 and extension ".out" in order to compute the accuracy of that classification. It outputs a text file with the same name and extension ".acc". 'opf_cluster' Usage ==================== Usage: opf_cluster <P1> <P2> <P3> <P4> <P5> P1: unlabeled data set in the OPF file format P2: kmax(maximum degree for the knn graph) P3: 0 (height), 1(area) and 2(volume) P4: value of parameter P3 (integer) in (0-1) P5: precomputed distance file (leave it in blank if you are not using this resource P3: allows to remove maxima from the pdf based on height, area or volume criteria. note: the opf_cluster outputs the k value that minimized the cut in the graph as well as the number of obtained clusters and a classifier written in a file classifier.opf. The labeled samples (predicted) are also outputed in a ".out file" 'opf_knn_classify' Usage ======================== Usage: opf_knn_classify <P1> <P2> P1: test/training set in the OPF file format P2: precomputed distance file (leave it in blank if you are not using this resource) The opf_knn_classify outputs the following files: -> .out: it contains the predicted labels (test phase) -> .time: it contains the execution time in seconds (test phase) 'opf_distance' Usage ==================== One of the most important characteristic of the OPF classifier is the possibility of working with any distance function. Its default is the Euclidean metric. The user can execute the program 'opf_distance' with the following options of distance functions. Usage: opf_distance <P1> <P2> <P3> P1: Dataset in the OPF file format P2: Distance ID 1 - Euclidean 2 - Chi-Square 3 - Manhattan (L1) 4 - Canberra 5 - Squared Chord 6 - Squared Chi-Squared 7 - BrayCurtis P3: Distance normalization? 1- yes 0 - no The program computes the selected distance function between every pair of samples in the dataset and outputs a precomputed distance file (distances.dat). The sample identifier in the dataset is used here. The distance values may be or not be normalized with P3. The user can also create his/her own distance file. The file BINARY format is: <# of samples> <Distance from sample 0 to sample 0> <Distance from sample 0 to sample 1> ... <Distance from sample 1 to sample 0> <Distance from sample 1 to sample 1> ... . . <Distance from sample n-1 to sample 0> <Distance from sample n-1 to sample 1> ... Comment #1: Note that, the file is an N x N matrix of distance values. It must be binary with no blank spaces. This ASCII representation is just for illustration. Datasets ================= In directory 'LibOPF/data', the user can find several artificial and synthetic datasets, which are divided into: Synthetic: datasets obtained from [Kuncheva] -> cone-torus.dat: dataset containg 400 samples distributed into 3 classes. Each sample is represented by 2 features -> petals.dat: dataset containg 100 samples distributed into 4 classes. Each sample is represented by 2 features -> boat.dat: dataset containing 100 samples distributed into 3 classes. Each sample is represented by 2 features -> saturn.dat: dataset containing 200 samples distributed into 2 classes. Each sample is represented by 2 features Artificial: MPEG-7 shape dataset [MPEG-7]. The MPEG-7 shape dataset contains 1400 samples distributed into 70 classes -> mpeg7_FOURIER.dat: Fourier shape descritor extracted from the whole dataset (126 features) -> mpeg7_BAS.dat: BAS (Beam Angle Statistics) shape descritor extracted from the whole dataset (180 features) Details about the features extraction procedure can be found in [PapaIJIST09] Examples of scripts ================= The LibOPF package contains a directory 'LibOPF/examples', in which you can find some scripts. To run a simple example, type # This command will split the cone-torus dataset into a training set # with 40% of the samples (training.dat) and a test set (testing.dat) # with 60% of the samples > bin/opf_split data/cone-torus.dat 0.4 0 0.6 0 # This command will execute the training phase and generates an output # file named classifier.opf, which represents the OPF classifier. > bin/opf_train training.dat # This command will execute the test phase using classifier.opf > bin/opf_classify testing.dat # This command will compute the accuracy over the test set > bin/opf_accuracy testing.dat testing.dat.out You can also execute the following scripts inside the directory LibOPF/examples: example1.sh: A script to train and test the classifier without the learning procedure. example2.sh: A script to learn and test the classifier. example3.sh: A script to train and test the classifier without the learning procedure and with precomputed distances. example4.sh: A script to learn and test the classifier with precomputed distances. example5.sh: A script to run the unsupervised OPF. Tools ================= The LibOPF package contains a directory 'LibOPF/tools', in which you can find some useful tools. At this moment, we have available: -> runOPF.sh: a script that executes the supervised OPF several times (defined by the user) and outputs the mean accuracy and the mean training and test execution times (seconds), with their respectively standard deviations. 'runOPF.sh' Usage ==================== Usage: run_sh <P1> <P2> <P3> <P4> <P5> P1: dataset file name P2: percentage for the training set size [0,1] P3: percentage for the test set size [0,1] P4: normalize features? 1 - Yes 0 - No P5: running times -> runLearning.sh: a script that executes the supervised OPF with and without learning several times (defined by the user) and outputs the mean accuracy and the mean training and test execution times (seconds), with their respectively standard deviations. This script allow you to compare the OPF performance with and without learning algorithm. 'runLearning.sh' Usage ==================== Usage: run_sh <P1> <P2> <P3> <P4> <P5> P1: dataset file name P2: percentage for the training set size [0,1] P3: percentage for the evaluating set size [0,1] P4: percentage for the test set size [0,1] P5: normalize features? 1 - Yes 0 - No P6: running times -> txt2opf: a program to convert OPF files written in ASCII format to binary format. Usage of 'txt2opf' ==================== Usage: txt2opf <P1> <P2> P1: input OPF file name in the ASCII format P2: output OPF file name in the binary format Accuracy computation ==================== If you intend to compare the OPF classifier using LibOPF with others in the literature, you need to pay attention to our measure of accuracy. The accuracy of a classifier can be measured in any set: training, evaluation, and test. The accuracy in LibOPF is measured by taking into account that the classes may have different sizes. If there are two classes, for example, with very different sizes and a classifier always assigns the label of the largest class, its accuracy will fall drastically due to the high error rate on the smallest class. The details about the formula of accuracy are given in http://www.ic.unicamp.br/~afalcao/LibOPF. Fixed bugs ====================== Some fixed bugs regarding LibOPF 1.0 - division by zero in "NormalizeFeatures" function used in "opf_split" program. Thanks to Ishihara Takashi, from University of Electro-Communications, Tokyo, Japan. Additional Information ====================== If you find LibOPF helpful, please cite the references @Manual{LibOPF09, author = {J.P. Papa and A.X. Falc{\~{a}}o and C.T.N. Suzuki}, title = {{LibOPF}: {A} library for the design of optimum-path forest classifiers}, institution = {Institute of Computing, University of Campinas}, year = {2009}, note = {Software version 2.0 available at \url{http://www.ic.unicamp.br/\~{}afalcao/LibOPF}} } @article{PapaIJIST09, author = {J. P. Papa and A. X. Falc{\~{a}}o and Celso T. N. Suzuki}, title = {Supervised Pattern Classification based on Optimum-Path Forest}, journal = {International Journal of Imaging Systems and Technology}, volume = {19}, issue = {2}, pages = {120--131}, publisher = {Wiley-Interscience}, year = {2009}, } @article{RochaIJIST09, author = {L.M. Rocha and F.A.M. Cappabianco and A.X. Falc{\~{a}}o}, title = {Data clustering as an optimum-path forest problem with applications in image analysis}, journal = {International Journal of Imaging Systems and Technology}, publisher = {Wiley Periodicals}, volume = {19}, number = {2}, pages = {50--68}, year = {2009} } If you used some synthetic dataset from LibOPF, please cite: @article{Kuncheva, Author = {L. Kuncheva}, title= {Artificial Data}, Journal = {School of Informatics, University of Wales, Bangor}, url = {http://www.informatics.bangor.ac.uk/{\~{}}kuncheva}; year = {1996} } If you used MPEG-7 dataset from LibOPF, please cite: @article{MPEG-7, author = {MPEG-7}, title = {MPEG-7: The Generic Multimedia Content Description Standard, Part 1}, journal = {IEEE MultiMedia}, volume = {09}, number = {2}, issn = {1070-986X}, pages = {78-87}, doi = {http://doi.ieeecomputersociety.org/10.1109/MMUL.2002.10016}, publisher = {IEEE Computer Society}, address = {Los Alamitos, CA, USA}, year = {2002} } For any questions and comments, please send your email to [email protected] or [email protected].
About
No description, website, or topics provided.
Resources
License
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published