TrainExtractor Tutorial
Extraction means identifying spans of particular types within documents (such as names or places). TrainExtractor tasks take text data as input. For this example we will use sample1.train as the training data. This sample is built into the code, so it requires no additional setup. To see how to label and load your own data for this task, look at the Labeling and Loading Data Tutorial.
This experiment will train on a training set and output an extractor, which can be used to test another dataset or applied to an unlabeled dataset to add labels (such as extracted_name).
- To run this type of task using the GUI type:
$ java -Xmx500M edu.cmu.minorthird.ui.TrainExtractor -gui
- A window will appear. To view and change the parameters of the experiment, press the Edit button located next to TrainExtractor. A Property Editor dialog box will appear:
- To view what each parameter does and/or how to set it, click the ? button next to each field. The parameters that must be entered for the experiment to run are baseParameters (e.g., -labels) and signalParameters (e.g., -spanType or -spanProp). All other parameters have defaults or are not required. There are four groups of parameters that can be modified when running a TrainExtractor experiment:
- Options for how MinorThird learns from the training data are in additionalParameters. These options all have defaults, so they do not need to be set explicitly. Most importantly, the learner can be changed by selecting a learner from the pull-down menu and edited by pressing the Edit button next to learner. To view the Javadoc documentation for the currently selected learner, press the ? button for a link to the Javadoc. The output parameter specifies how MinorThird labels extracted types. By default it is set to prediction, but it is useful to change this to something more informative, such as predicted_trueName.
- First, training data for the experiment must be entered by specifying a labelsFilename. Since the samples are built into the code, sample1.train can simply be typed into the text field under labelsFilename to load the data. Note: data from a directory can be loaded by using the Browse button.
- To save the results from the experiment, enter a file to which to write the results in the saveAs text field. Note: this is optional, but it is required if you want to reuse the trained extractor later.
- Once labelsFilename is specified, click the Edit button next to signalParameters. Important: labelsFilename must be specified BEFORE clicking Edit. Another Property Editor will appear. Select trueName from the pull-down menu. Then press the OK button to close the Property Editor for signalParameters.
- Feel free to try changing any of the other parameters, including the ones in advanced options. Click the help buttons to get a feeling for what each parameter does and how changing it may affect your results. Once all the parameters are set, click the OK button on the Property Editor window.
- Press the Show Labels button if you would like to view the input data for the extraction task. This will pop up the same TextBaseViewer that you would see if you ran ViewLabels on the training data.
- Now press Start Task under execution controls. The running time depends on the size of the dataset and the learner chosen, but extraction tasks usually take a minute or two. When the task is finished, the error rates will appear in the output text area along with the total time it took to run the experiment.
- When the experiment has finished running, click the View Results button to view the extractor results. The features in the extractor may be sorted by name, weight (seen on the left), or absolute weight, or viewed in a tree (as seen on the right) where the root contains the highest value of the leaves below. Features with the largest weights are most highly correlated with having the specified SpanType. In this case, tokens with the charTypePattern of a capital letter followed by a lower-case letter are most highly correlated with trueName, since this is the feature with the largest weight in the extractor.
- Press the Clear Window button to clear all output from the output and error messages window. This is useful if you would like to run another experiment.
- To get started using the command line for an extractor experiment type:
$ java -Xmx500M edu.cmu.minorthird.ui.TrainExtractor -help
Note: You can enter as many command line arguments as you like along with the -gui argument. This way you can use the command line to specify the parameters you want and use the GUI to set any additional parameters or view the results.
2. Show options. Specifying these options pops up informative windows from the command line:
  - -showData: interactively show the dataset in a new window
  - -showLabels: view the training data and its labels
  - -showResult: display the experiment result in a new window
- The first thing you probably want to enter on the command line is the data you would like to train on (or train and test on). To do this, type -labels and the repository key of the dataset you would like to use. For this experiment, use the following option: -labels sample1.train.
- The next required parameter is either spanProp or spanType. To specify it, type -spanType TYPE. For this dataset the type to extract is trueName, so use the following option: -spanType trueName.
- Other parameters you may want to specify are:
  - -learner for specifying the learning algorithm
  - -saveAs if you want to save the trained results
  - -help for descriptions and examples of options and parameters. If you are unsure of which learner to use, use the -gui command so that you can see the list of learners and feature extractors available (under trainingParameters). For this tutorial, use:
-learner "new VPHMMLearner(new CollinsPerceptronLearner(1,5), new Recommended.TokenFE(), new InsideOutsideReduction())"
- As you can see from this example, the sequenceClassifierLearner, spanFeatureExtractor, and taggingReduction are defined with the learner. If you would like to see the options for these variables, use the -gui command. Once the parameter modification window pops up, click Edit under Parameter Modification and click Edit next to trainingParameters. To see what learners are available, scroll through the pull-down list next to learner. Once you have chosen a learner, click the Edit button next to learner to choose your sequenceClassifierLearner, spanFeatureExtractor, and taggingReduction. To edit any of these training parameters, press the Edit button next to them.
- Optional parameters to define include -mixup, -embed, and -output. Use the -help command to learn more about these parameters. -output is set to the default _prediction, so you only need to set this parameter if you would like to name the property learned.
- Specify other complex parameters on the command line using the -other option. See the Command Line Other Option Tutorial for details.
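
Putting the command-line options from this tutorial together, a complete invocation might look like the sketch below. It is an illustration rather than an official script. It assumes the MinorThird classes are already on the Java classpath. The wrapper function train_extractor, the DRY_RUN switch, and the output file name sample1-extractor.ann are hypothetical names introduced here. The learner string is the one given in this tutorial, with the reduction class spelled InsideOutsideReduction.

```shell
#!/bin/sh
# Sketch: a full TrainExtractor command line built from the options in this tutorial.
# Assumption: the MinorThird classes are on the Java classpath.

# Hypothetical wrapper: prints the command by default; set DRY_RUN=0 to launch Java.
# Note: the printed form drops the quotes around the -learner value, so it is for
# review only and cannot be pasted back into a shell unchanged.
train_extractor() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "java -Xmx500M edu.cmu.minorthird.ui.TrainExtractor $*"
  else
    java -Xmx500M edu.cmu.minorthird.ui.TrainExtractor "$@"
  fi
}

# Learner definition from this tutorial, kept as a single argument.
LEARNER='new VPHMMLearner(new CollinsPerceptronLearner(1,5), new Recommended.TokenFE(), new InsideOutsideReduction())'

# -labels, -spanType: the required data set and span type
# -learner:           the learning algorithm
# -saveAs:            where to write the trained extractor (hypothetical file name)
# -showResult:        display the experiment result in a new window
train_extractor \
  -labels sample1.train \
  -spanType trueName \
  -learner "$LEARNER" \
  -saveAs sample1-extractor.ann \
  -showResult
```

Run the script with DRY_RUN=0 to start training. Appending -gui to the argument list should open the GUI with these options already specified, so the remaining parameters can be set there.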