TrainTestClassifier Tutorial

linfrank edited this page Aug 17, 2012 · 3 revisions

Classification means learning labels for entire documents. TrainTestClassifier tasks take text data as input. For this example we will use sample3.train as the training data and sample3.test as the testing data. These samples are built into the code, so they require no additional setup. To see how to label and load your own data for this task, look at the Labeling and Loading Data Tutorial.

This experiment trains on one set of data and tests on another. The test set is determined either by specifying test data or by splitting the data. The experiment outputs statistics such as error rate, standard deviation, and kappa.

To run this type of task start with:

$ java -Xmx500M edu.cmu.minorthird.ui.TrainTestClassifier

Editing Parameters

Like all UI tasks, all the parameters for TrainTestClassifier may be specified using either the GUI or the command line. To use the GUI, add -gui to the command line. It is also possible to mix and match where the parameters are specified. For example, one can specify two parameters on the command line and use the GUI to select the rest. For this reason, the step-by-step process for this experiment first explains how to select a parameter value in the GUI and then how to set the same parameter on the command line.
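As a rough illustration of mixing the two, the sketch below sets two parameters on the command line and leaves the rest to the GUI. It uses only flags named in this tutorial (-labels, -spanType, -gui). The CMD and RUN variables are illustrative helpers, not part of MinorThird, and the sketch assumes the MinorThird classes are already on the CLASSPATH.

```shell
# Assemble the command piece by piece so each flag can be commented.
CMD="java -Xmx500M edu.cmu.minorthird.ui.TrainTestClassifier"
CMD="$CMD -labels sample3.train"   # set on the command line
CMD="$CMD -spanType fun"           # set on the command line
CMD="$CMD -gui"                    # choose the remaining parameters in the GUI

echo "$CMD"                        # print the assembled command for inspection

# Launch only when RUN=1 is set (hypothetical switch used by this sketch).
if [ "${RUN:-0}" = "1" ]; then
  $CMD
fi
```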

To view a list of parameters and their functions run:

$ java -Xmx500M edu.cmu.minorthird.ui.TrainTestClassifier -help

or

$ java -Xmx500M edu.cmu.minorthird.ui.TrainTestClassifier -gui

Click the Parameters button next to Help, or click the ? button next to each field in the Property Editor, to see what each parameter is used for. If you are using the GUI, click the Edit button next to TrainTestClassifier. A Property Editor window will appear.

There are five groups of parameters to specify for this experiment. The only required parameters are labelsFilename (-labels) and either spanType or spanProp.

  • baseParameters contains the options for loading the collection of documents.
      • GUI: enter sample3.train in the labelsFilename text field.
      • Command Line: use the -labels option followed by the repository key or the directory of files to load. For this tutorial, specify -labels sample3.train.
  • saveParameters contains one parameter, which specifies a file to save the result to. Saving is optional, but it is useful if you want to use the resulting classifier in TestClassifier and ApplyAnnotator experiments.
      • GUI: enter sample3.ann in the saveAs text field.
      • Command Line: -saveAs sample3.ann
  • signalParameters: either spanType or spanProp must be specified as the type to learn. For this experiment we will use the spanType fun.
      • GUI: click the Edit button next to signalParameters. Select fun from the pull-down menu next to spanType.
      • Command Line: -spanType fun
  • splitterParameters: either a splitter or a test file name may be specified. In this experiment, set testFilename to sample3.test. Entering a test file name tells MinorThird to ignore the splitter and use the test file. To use a splitter instead, leave the test file name unspecified and select the appropriate splitter from the pull-down menu. The default splitter is RandomSplitter, which is used if no other splitter is selected.
      • GUI: enter sample3.test next to testFilename.
      • Command Line: -test sample3.test
  • trainingParameters contains parameters for specifying learning options, most importantly the learner used. We will use the default learner, NaiveBayes, for this experiment, but feel free to change the learner for future experiments.
      • GUI: change the learner by selecting a new learner from the pull-down menu.
      • Command Line: selecting a different learner (or any other class) on the command line can be tricky, because the full class must be specified. See the API Javadoc for the learner classes. Most learners may be specified on the command line like this: -learner "new Recommended.LEARNER_NAME()". See the Javadoc for possible initialization parameters.
  • Feel free to try changing any of the other parameters, including the ones in advanced options.
      • GUI: click the help buttons to get a feeling for what each parameter does and how changing it may affect your results. Once all the parameters are set, click the OK button on the Property Editor.
      • Command Line: add other parameters to the command line (use the -help option to see the other parameter options). If an option can be set in the GUI but has no specific parameter in the -help output, the -other option may be used. To see how to use this option, look at the Command Line Other Option Tutorial.
  • If you are using the GUI, once you have finished editing parameters, save your changes by clicking the OK button on the Property Editor.
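Put together, the command-line settings from the list above might look like the sketch below. Every flag and file name comes from this tutorial. The learner is left at its default, so no -learner flag appears. The CMD and RUN variables are illustrative helpers rather than MinorThird features, and the sketch assumes the MinorThird classes are on the CLASSPATH.

```shell
# Build the full TrainTestClassifier command from the parameters above.
CMD="java -Xmx500M edu.cmu.minorthird.ui.TrainTestClassifier"
CMD="$CMD -labels sample3.train"   # baseParameters: training data (required)
CMD="$CMD -spanType fun"           # signalParameters: the type to learn (required)
CMD="$CMD -test sample3.test"      # splitterParameters: test file, used instead of a splitter
CMD="$CMD -saveAs sample3.ann"     # saveParameters: optional, saves the result

echo "$CMD"                        # print the assembled command for inspection

# Launch only when RUN=1 is set (hypothetical switch used by this sketch).
if [ "${RUN:-0}" = "1" ]; then
  $CMD
fi
```

Dropping the -test line would leave the choice of test data to the splitter, as described under splitterParameters.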

Show Labeled Data

  • GUI: press the Show Labels button if you would like to view the input data for the classification task.
  • Command Line: add -showLabels to the command line.

Getting and Interpreting Results

  1. Opening the result window:
      • GUI: press Start Task under Execution Controls to run the experiment. How long the task takes depends on the size of the data set and on the learner and splitter you choose. When the task is finished, the error rates appear in the output text area along with the total time it took to run the experiment.
      • Command Line: specify -showResult to see the graphical result. If this option is not set, only the basic statistics of the task are shown.
  2. Once the experiment is completed, click the View Results button in the Execution Controls section to see detailed results in the GUI. The window appears automatically if the -showResult option was specified on the command line. The Test Partition tab shows the testing examples in the top left, the classifier in the top right, the selected test example's features, source, and subpopulation in the bottom left, and the explanation for the classification of the selected test example in the bottom right (expand the tree to see the details of the explanation).

  3. Click the Overall Evaluation tab at the top and the Summary tab below it to view your results. The Summary tab shows the results that were printed in the output window when you ran the experiment (numbers such as error rate and F1). The Precision/Recall tab shows the graph of recall vs. precision for this experiment. The Confusion Matrix tab shows how many examples the classifier predicted as positive that are actually positive, how many it predicted as positive that are actually negative, and vice versa.

  4. Press the Clear Window button to clear all output from the output and error messages window.
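
The statistics mentioned in this tutorial (error rate, precision, recall, F1, and kappa) can all be derived from the four counts in a binary confusion matrix. The standard textbook definitions are sketched below, writing TP, FP, FN, and TN for true positives, false positives, false negatives, and true negatives. Whether MinorThird computes each statistic in exactly this form is an assumption; this page does not say.

```latex
\begin{align*}
N &= TP + FP + FN + TN \\
\text{error rate} &= \frac{FP + FN}{N} \\
\text{precision} &= \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \\
p_o &= \frac{TP + TN}{N}, \qquad
p_e = \frac{(TP + FP)(TP + FN) + (FN + TN)(FP + TN)}{N^2} \\
\kappa &= \frac{p_o - p_e}{1 - p_e}
\end{align*}
```

Here p_o is the observed agreement between predictions and true labels (one minus the error rate), and p_e is the agreement expected by chance given the row and column totals.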
