linfrank edited this page Aug 16, 2012 · 5 revisions

Tutorials

MinorThird is a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text. It consists of four main packages:

  • edu.cmu.minorthird.classify contains the machine learning algorithms for extraction and classification, as well as data structures for storing non-text data, classifiers, and evaluations of experiments. The classify package can stand on its own, so it should not call any of the other packages.
  • edu.cmu.minorthird.text contains the classes necessary to process text data such as emails and newswire texts. The text package also contains Mixup (My Information eXtraction and Understanding Program), which is a matching language for modifying TextLabels.
  • edu.cmu.minorthird.ui provides user interfaces for running learning experiments on text data.
  • edu.cmu.minorthird.util provides utilities such as command line processors and methods for string manipulation.

To download and run MinorThird, see Getting Started.

The classify package is where the learning is performed. It can be used on its own to run experiments on non-text data, where each data instance is a list of features and comes with a classification label (e.g., "POS" or "NEG"). To learn how to use the classify package, see the Classify Package Tutorial.

The ui package contains several classes for viewing, editing, and running experiments on text data. To learn how to put data in a format that MinorThird can recognize, look at the Labeling and Loading Data Tutorial.

Before looking at the different classes, here are some MinorThird terms and classes that are helpful to know:

  • Document - a single example or file. For example, if a directory of emails is loaded into MinorThird, each email is a separate document.
  • TextToken - a particular substring of a particular document.
  • Span - a series of adjacent tokens from the same document.
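  • SpanType - a string label that is associated with a span. It is binary: a span either has the type or does not.
  • SpanProp - a string label that is associated with a span. It is k-ary: the property takes one of several possible values.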
  • TextBase - a collection of documents.
  • TextLabels - assertions about types and properties of certain spans in a TextBase. In other words, the structure that stores the labels of each document in the TextBase.
  • Classifier - a structure that holds what MinorThird has learned from the training documents. This structure holds all the tokens from a TextBase and how strongly they are associated with the learned SpanType.
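
The relationship between documents, tokens, and spans can be sketched in a few lines of Java. This is only an illustration of the terminology above, not MinorThird's own classes: the names SpanSketch, Token, tokenize, and spanText are hypothetical, and the whitespace tokenizer is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not MinorThird's API: a document is split into tokens,
// and a span is a run of adjacent tokens taken from that one document.
final class SpanSketch {

    /** A token is a substring of a document, identified by character offsets. */
    static final class Token {
        final String doc;
        final int start; // inclusive character offset
        final int end;   // exclusive character offset

        Token(String doc, int start, int end) {
            this.doc = doc;
            this.start = start;
            this.end = end;
        }

        String text() {
            return doc.substring(start, end);
        }
    }

    /** Splits on whitespace only (an assumption; real tokenizers do more). */
    static List<Token> tokenize(String doc) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < doc.length()) {
            if (Character.isWhitespace(doc.charAt(i))) {
                i++;
                continue;
            }
            int j = i;
            while (j < doc.length() && !Character.isWhitespace(doc.charAt(j))) {
                j++;
            }
            tokens.add(new Token(doc, i, j));
            i = j;
        }
        return tokens;
    }

    /** Text covered by the span of `length` adjacent tokens starting at token `first`. */
    static String spanText(String doc, List<Token> tokens, int first, int length) {
        Token lo = tokens.get(first);
        Token hi = tokens.get(first + length - 1);
        return doc.substring(lo.start, hi.end);
    }

    public static void main(String[] args) {
        String doc = "Meeting with Alice Smith on Friday";
        List<Token> tokens = tokenize(doc);
        // A two-token span that a labels structure could mark with the type "name".
        System.out.println(spanText(doc, tokens, 2, 2));
    }
}
```

In this picture, a TextBase would hold many such documents, and TextLabels would record which spans carry which SpanType or SpanProp.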

Here is a breakdown of the UI classes, their functionality, and their respective tutorials:

ViewLabels is a viewing tool for a collection of documents and their labels that have been loaded into MinorThird. See the ViewLabels Tutorial.

RunMixup and DebugMixup - Running a Mixup program annotates a set of documents based on the rules defined in the program. RunMixup will create new labels for the documents you loaded. DebugMixup will run a Mixup program and pop up a TextBaseEditor so that the Mixup results can be corrected by hand. See the Mixup Tutorial.

Extraction - extracts portions of a document. Examples include named entities such as person names and place names.

  • TrainExtractor tells MinorThird to learn a certain SpanType (such as name) based on a set of labeled documents you give it. This class will output a classifier, which can be tested on other labeled documents or used to annotate unlabeled documents. See the TrainExtractor Tutorial.
  • TestExtractor requires a set of labeled text documents and a classifier, and lets you test the performance of the classifier on those documents. Reminder: the test documents MUST be hand-labeled in order for you to find out how accurate your classifier is. This tool will NOT output the predicted labels of the classifier; it will output statistics on how the classifier performed. If you would like the classifier's predicted labels, run ApplyAnnotator. See the TestExtractor Tutorial.
  • TrainTestExtractor requires that you input either a set of labeled documents that will be split, or a set of labeled train documents together with a set of labeled test documents. This tool will tell MinorThird to create a classifier for a certain SpanType and test it on documents either input by the user or created by the splitter. See the TrainTestExtractor Tutorial.
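
To see why the test documents must be hand-labeled, consider how extraction quality is typically scored: predicted spans are compared against the hand-labeled ones. The following is a minimal sketch of span-level precision and recall, assuming exact-match scoring and spans encoded as strings such as "d1:2-4" (document id plus token range). The class SpanScoreSketch and this encoding are hypothetical, and the statistics MinorThird actually reports may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: exact-match span scoring against hand-labeled spans.
final class SpanScoreSketch {

    /** Number of predicted spans that also appear among the hand-labeled spans. */
    static int correct(Set<String> predicted, Set<String> handLabeled) {
        Set<String> both = new HashSet<>(predicted);
        both.retainAll(handLabeled);
        return both.size();
    }

    /** Fraction of predicted spans that are correct. */
    static double precision(Set<String> predicted, Set<String> handLabeled) {
        if (predicted.isEmpty()) {
            return 0.0;
        }
        return (double) correct(predicted, handLabeled) / predicted.size();
    }

    /** Fraction of hand-labeled spans that were found. */
    static double recall(Set<String> predicted, Set<String> handLabeled) {
        if (handLabeled.isEmpty()) {
            return 0.0;
        }
        return (double) correct(predicted, handLabeled) / handLabeled.size();
    }

    public static void main(String[] args) {
        Set<String> handLabeled = new HashSet<>(Arrays.asList("d1:2-4", "d1:7-8", "d2:0-2"));
        Set<String> predicted = new HashSet<>(Arrays.asList("d1:2-4", "d2:0-2", "d2:5-6", "d2:9-10"));
        System.out.println("precision = " + precision(predicted, handLabeled)); // 2 of 4 predictions correct
        System.out.println("recall    = " + recall(predicted, handLabeled));    // 2 of 3 labeled spans found
    }
}
```

Without the hand-labeled set, neither number can be computed, which is why a tool that only reports statistics cannot work on unlabeled data.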

Classification - classifies entire documents. An example would be classifying emails as "real" or "spam".

  • TrainClassifier tells MinorThird to learn a SpanType (such as "spam") and create a classifier for this SpanType based on a set of labeled documents that you input. The classifier can be tested on other labeled documents or used to output predicted labels for unlabeled documents. See the TrainClassifier Tutorial.
  • TestClassifier requires a set of labeled text documents and a classifier, and lets you test the performance of the classifier on those documents. Reminder: the test documents MUST be hand-labeled in order for you to find out how accurate your classifier is. This tool will NOT output the predicted labels of the classifier. If you would like the classifier's predicted labels, run ApplyAnnotator. See the TestClassifier Tutorial.
  • TrainTestClassifier requires that you input either a set of labeled documents that will be split, or a set of labeled train documents together with a set of labeled test documents. This tool will tell MinorThird to create a classifier for a certain SpanType and test it on documents either input by the user or created by the splitter. See the TrainTestClassifier Tutorial.
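
The "split" mentioned for the TrainTest tools can be pictured as partitioning one labeled collection into a training part and a held-out test part. Below is a generic sketch of a seeded random split, assuming documents are identified by strings. SplitSketch and its parameters are hypothetical and do not reflect how MinorThird's own splitters are configured.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: a random train/test split with a fixed seed.
final class SplitSketch {

    /** Returns a two-element list: index 0 is the train part, index 1 is the test part. */
    static List<List<String>> split(List<String> docs, double trainFraction, long seed) {
        List<String> shuffled = new ArrayList<>(docs); // copy so the caller's list is untouched
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainFraction);
        List<List<String>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            docs.add("email" + i + ".txt");
        }
        List<List<String>> parts = split(docs, 0.7, 42L);
        System.out.println("train size: " + parts.get(0).size());
        System.out.println("test size:  " + parts.get(1).size());
    }
}
```

Keeping the two parts disjoint is what makes the test statistics meaningful: the classifier is scored on documents it did not see during training.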

MultiClassification - like classification, except that it can learn, test, and annotate multiple dimensions per document. For example, it can learn and classify both color and shape. Note: this is different from learning one dimension with several options (e.g., learning whether something is square, rectangular, or circular); in that case, you are learning only one label rather than several. Tutorials are not yet available, but it works in much the same way as classification.

OnlineLearner - the OnlineTextClassifierLearner allows you to add documents to a learner by passing in a document string rather than a document span. The OnlineTextClassifierLearner also returns a TextClassifier with a call to getTextClassifier() which returns the score of a document string rather than a document span. See the OnlineLearner Tutorial.
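
The idea of an online learner that accepts document strings one at a time and can score a new string at any point can be illustrated with a tiny perceptron over lowercase words. This is a sketch of the general technique only. OnlinePerceptronSketch, addDocument, and score are hypothetical names, and nothing here is the OnlineTextClassifierLearner API or its learning algorithm.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a mistake-driven perceptron trained one document string at a time.
final class OnlinePerceptronSketch {

    private final Map<String, Double> weights = new HashMap<>();

    /** Sum of the weights of the words in the document; positive means the positive class. */
    double score(String doc) {
        double total = 0.0;
        for (String word : doc.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                total += weights.getOrDefault(word, 0.0);
            }
        }
        return total;
    }

    /** Adds one labeled document; weights change only when the current guess is wrong or undecided. */
    void addDocument(String doc, boolean positive) {
        double y = positive ? 1.0 : -1.0;
        if (y * score(doc) <= 0.0) {
            for (String word : doc.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    weights.merge(word, y, Double::sum);
                }
            }
        }
    }

    public static void main(String[] args) {
        OnlinePerceptronSketch learner = new OnlinePerceptronSketch();
        learner.addDocument("cheap pills now", true);          // positive example ("spam")
        learner.addDocument("meeting agenda attached", false); // negative example
        System.out.println(learner.score("cheap pills"));
        System.out.println(learner.score("meeting agenda"));
    }
}
```

The point of the design is that no collection of documents has to be assembled up front: each string updates the model immediately, and the current model can be queried between updates.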

ApplyAnnotator applies a saved classifier to a set of documents to output a set of predicted labels. You can use this to either label unlabeled data or compare the predicted labels to actual labels. See the ApplyAnnotator Tutorial.

EditLabels is a tool for adding and/or removing labels from the collection of documents you loaded into MinorThird and for saving a new labels document. It is useful for debugging the results of ApplyAnnotator. See the EditLabels Tutorial.
