linfrank edited this page Aug 16, 2012 · 7 revisions

MinorThird Frequently Asked Questions

Basic Questions

  • What is MinorThird?

    MinorThird stands for "Methods for Identifying Names and Ontological Relationships in Text using Heuristics for Identifying Relationships in Data". It is a collection of Java classes for storing text, annotating text, categorizing text, and learning to extract entities from text.

  • Does it really work?

    Depends mostly on the weather in Pittsburgh.

  • Do I need a license to use Minorthird?

    Please see Licensing and Use.

  • Where can I find some documentation?

    Documentation can be found at http://github.com/TeamCohen/MinorThird/wiki/.

Download, Setup and Configuration

  • How can I download it?

    You can download it from GitHub at http://github.com/TeamCohen/MinorThird/. You will need Ant and a recent version of the JDK (1.5 or higher).

  • Why can't I make it compile?

    Most of the time it's because the environment variables are not set correctly. Check that the following variables are set in your environment before trying to compile:

  • ANT_HOME set to the directory where you put Ant (e.g., /usr0/local/apache-ant-1.6.4)

  • JAVA_HOME set to wherever you put JDK (e.g., /usr0/local/java/jdk1.6.0)

  • MINORTHIRD set to wherever you installed MinorThird (e.g., /usr0/project/minorthird)

  • CLASSPATH set according to the directions in the $MINORTHIRD/script/setup.xx file. For example, if you are using the Bash shell on Linux, you should do the following:

$ export MINORTHIRD=/usr0/project/minorthird
$ export CLASSPATH=$MINORTHIRD:$MINORTHIRD/class:$MINORTHIRD/lib:$MINORTHIRD/lib/minorThirdIncludes.jar:$MINORTHIRD/lib/mixup:$MINORTHIRD/config
  • I'm frequently running out of memory. What should I do?

    Try changing the Java Virtual Machine memory settings (initial heap size, maximum heap size, etc.). Type java -X for details. For instance:

$ java -Xmx1000m edu.cmu.minorthird...

Experiments

  • If I want to run a classification experiment, in which format should I transform the data to be compatible with MinorThird? How do I perform a classification experiment if I already have a dataset with all features extracted?

    The possible dataset formats are specified in the documentation of the DatasetLoader class. One way to run a classification experiment is to format your data into one of those formats and then use NumericDemo.java to select the experiment you want. You can change this file to use different classifiers, splitters, test sets, etc.

  • How can I run a classification experiment using data in SVM-Light format?

    You can use DatasetLoader to convert a dataset from SVM-Light format (see method loadSVM). Then you have a standard Dataset and should be able to do whatever you want with it.

  • Is there a way to use non-recommended learners (like something using one-vs-all) using the minorthird.ui tools?

    Yes. If you specify the learner on the command line with -learner, you can use any learner you like, not just the ones that pop up in the GUI. This could include a OneVsAll-based learner.
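For example, such an invocation might look like the sketch below. The label file name is a placeholder and the OneVsAllLearner constructor arguments are omitted; check the classify package Javadoc for the exact class name and constructors.

```shell
# Sketch only: mydata.train is a placeholder label set, and the
# OneVsAllLearner constructor arguments are omitted (see the Javadoc).
java edu.cmu.minorthird.ui.TrainTestClassifier \
    -labels mydata.train \
    -learner "new OneVsAllLearner(...)"
```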

  • For the one-vs-all case, what's the data format and how do we communicate the k different classes to the learning algorithm? What kind of span types do I need? Over the entire input or over a single token (or some other span of tokens)? If either are possible, what impact does it have on the algorithm?

    To classify documents into multiple classes using the minorthird.ui package tools (like TrainTestClassifier) you should define a span property, which is a mapping from spans to strings. For instance, if the span types deleteCommand, insertCommand, replaceCommand have been defined, you could use this mixup script to define the span prop whatCommand:

defSpanProp whatCommand:delete =: [@deleteCommand];
defSpanProp whatCommand:insert =: [@insertCommand];
defSpanProp whatCommand:replace =: [@replaceCommand];

After the property is defined, you can specify a OneVsAll learner with the -learner option, and specify that you want to train against the whatCommand property with the option -spanProp whatCommand (in place of -spanType deleteCommand, or similar).

The result of training will be an Annotator that assigns some other span property (as specified by the -output option) to a document.

To summarize the training, you need to:

  • Specify the class (insert, delete, replace) for every training document, using some SpanProperty (e.g., whatCommand).
  • Tell the UI what SpanProperty you're training against (e.g., -spanProp whatCommand).
  • Tell the UI what learner to use, and make sure it's a learner that can handle non-binary classification.

After training (e.g., after using TestClassifier or ApplyAnnotator in the minorthird.ui package) the learned annotator will add its predicted class to every document as the value of the span property specified by -output.
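Putting the steps above together, a training run might look like the following sketch. The label files, the output property name, and the learner constructor are hypothetical; the -spanProp, -learner, and -output options are the ones described above.

```shell
# Sketch only: commands.train/commands.test and predictedCommand are
# placeholders; the OneVsAllLearner constructor arguments are omitted.
java edu.cmu.minorthird.ui.TrainTestClassifier \
    -labels commands.train -test commands.test \
    -spanProp whatCommand \
    -learner "new OneVsAllLearner(...)" \
    -output predictedCommand
```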

  • How do I communicate the k classes to OneVsAll? Does it simply take all declared span types in the training set? Or do I pass it a set of labels?

    The set of labels will be inferred from data.

  • If I want to run an extractor learner that uses an experimental NameFE class as the feature extractor, how would I specify that on the command line? Or is it a matter of changing the source files and recompiling?

    Follow the output of java edu.cmu.minorthird.ui.TrainTestExtractor -help. If you've put NameFE on the classpath (which it is not, by default), then you can specify it

    • with the command line option -fe "new NameFE()", or
    • via the GUI, if you add the class name of the NameFE into your selectableTypes.txt file.

    There may be a bug in the UI which keeps you from inspecting the changed FE, but it does work.

    If you want to parameterize NameFE, the simplest approach is to add the parameters to the constructor, so you can specify them directly using the -fe argument. If you look at the Javadoc for util.CommandLineProcessor and have your NameFE implementation implement the CommandLineProcessor.Configurable interface, you can also pass in additional arguments on the command line.
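For instance, a run with a custom feature extractor might look like this sketch. The label files and span type are the sample data used elsewhere in this FAQ; NameFE stands for your own compiled class and must be on the classpath.

```shell
# Sketch: assumes NameFE is compiled and on the CLASSPATH; sample1.train,
# sample1.test, and trueName are the sample data/type used in this FAQ.
java edu.cmu.minorthird.ui.TrainTestExtractor \
    -labels sample1.train -test sample1.test \
    -spanType trueName \
    -fe "new NameFE()"
```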

  • How can the supported token extractors be configured?

    The supported token extractors (e.g., Recommended.TokenFE()) can be configured in the GUI, or by first explicitly specifying them with -fe and then using one of the command-line options, which you can discover by running -fe "new Recommended.TokenFE()" -help or by looking up the API Javadoc.

    All of the supported FEs need a Mixup program that provides a specified type of annotation.

  • Where would I look to get a better understanding of how the spans are built?

    Try text.learn.SpanFE.

  • When I look at the features of each span (in the window I get when checking displayDataSetBeforeLearning), each span seems to have only a single token. Is that correct?

    Many of the extraction learners work by reducing extraction to tagging, i.e., labeling each word with one label, like "inside a name" vs. "outside a name". So yes: in that view, each span corresponds to a single token.

  • Is it true that there could be multiple tokens per span? Where is the code that translates the sentences from string into a set of tokens/spans?

    Text is translated from strings to tokens in text.TextBase. A Span is basically a sequence of tokens. The conversion from a Span to an Instance is inside text.learn.SpanFE.

  • I have it working with a feature extractor that I wrote and compiled, but I currently have no way to test it other than looking at the final error values and seeing that they have changed. How can I inspect what it is doing?

    We recommend:

    1. Engineering as much of the FE process as possible in Mixup, and using the ui.LabelViewer to check the results, and
    2. Using the database viewer to view the final results of extraction.
  • How do I get additional details (weights, features, etc.) in a name extraction test?

    Use these options: -showResult and -showTestDetails.
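    For example, added to a typical extraction run (file names and span type as in the sample data used elsewhere in this FAQ):

```shell
# Adds detailed output (weights, features, etc.) to a standard extraction test.
java edu.cmu.minorthird.ui.TrainTestExtractor \
    -labels sample1.train -test sample1.test -spanType trueName \
    -showResult -showTestDetails
```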

  • Must the feature extractor be declared as a Serializable class when building a serialized model and passing in a hand-coded feature extractor?

    The feature extractor should indeed be Serializable. If it's not, you can still perform cross-validation experiments, but an error may be thrown when you try to write it out, e.g., with TrainExtractor.

    Serialization only saves instance-dependent (non-static) data; it doesn't save the code associated with a class, so you'll need to have the feature extraction code in your classpath when you load it back in.

    It's probably possible to hack the Mixup interpreter to serialize Mixup code and dictionaries along with an extractor, if you really want to do that (if the dictionaries are pretty big, you might not want to save multiple copies).

  • How can I run an experiment to learn multiple types?

    There are three ways to learn to extract multiple types:

    1. Define a SpanProperty and pass it to TrainTestExtractor with the -spanProp option:

$ java edu.cmu.minorthird.ui.TrainTestExtractor -labels sample1.train -test sample1.test -spanProp inCapsBecause

    2. Use the -spanProp option with a comma-separated list of non-overlapping types:

$ java edu.cmu.minorthird.ui.TrainTestExtractor -labels sample1.train -test sample1.test -spanProp inCapsBecauseStart,trueName

    Note that there are no spaces around the comma between types.

    3. Run ui.TrainExtractor multiple times and learn multiple extractors, which of course might extract overlapping spans.

In the first two cases what is learned is an extractor that inserts a new SpanProperty (by default named _prediction).

Other Issues

  • How do I report bugs?

    Bugs?! There are no bugs in MinorThird!!! But in case you really think you found one, please use the issues page.