Tutorials
MinorThird is a collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text. It consists of four main packages:
- edu.cmu.minorthird.classify contains the machine learning algorithms for extraction and classification, as well as data structures for storing non-text data, classifiers, and evaluations of experiments. The classify package can stand on its own, so it should not call any of the other packages.
- edu.cmu.minorthird.text contains the classes necessary to process text data such as emails and newswire texts. The text package also contains Mixup (My Information eXtraction and Understanding Program), which is a matching language for modifying TextLabels.
- edu.cmu.minorthird.ui provides user interfaces for running learning experiments on text data.
- edu.cmu.minorthird.util provides utilities such as command line processors and methods for string manipulation.
To download and run MinorThird, see Getting Started.
The classify package is where the learning is performed. It can be used on its own to run experiments on non-text data, where each data instance is a list of features and comes with a classification label (e.g., "POS" or "NEG"). To learn how to use the classify package, see the Classify Package Tutorial.
The ui package contains several classes for viewing, editing, and running experiments on text data. To learn how to put data in a format that MinorThird can recognize, look at the Labeling and Loading Data Tutorial.
Before looking at the different classes, here are some MinorThird terms and classes that are helpful to know:
- Document - a single example or file. For example, if a directory of emails is loaded into MinorThird, each email is a separate document.
- TextToken - a particular substring of a particular document.
- Span - a series of adjacent tokens from the same document.
- SpanType - a binary string label associated with a span: the span either has the type or does not.
- SpanProp - a k-ary string label associated with a span: the property takes one of several values.
- TextBase - a collection of documents.
- TextLabels - assertions about types and properties of certain spans in a TextBase. In other words, the structure that stores the labels of each document in the TextBase.
- Classifier - a structure that holds what MinorThird has learned from the training documents. This structure holds all the tokens from a TextBase and how strongly they are associated with the learned SpanType.
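To make the relationships among these terms concrete, here is a minimal Java sketch. It is a toy model under stated assumptions, not MinorThird's API: the class names ToySpan, ToyTextBase, and ToyLabels, their methods, and the whitespace tokenizer are all hypothetical and chosen only to mirror the definitions above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Toy model of the terminology above; these are NOT the MinorThird classes.

/** A series of adjacent tokens from one document: tokens [start, start + length). */
final class ToySpan {
    final String documentId;
    final int start;
    final int length;

    ToySpan(String documentId, int start, int length) {
        this.documentId = documentId;
        this.start = start;
        this.length = length;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ToySpan)) return false;
        ToySpan s = (ToySpan) o;
        return documentId.equals(s.documentId) && start == s.start && length == s.length;
    }

    @Override
    public int hashCode() {
        return Objects.hash(documentId, start, length);
    }
}

/** A collection of documents, each split into tokens (assumption: on whitespace). */
final class ToyTextBase {
    private final Map<String, String[]> tokensById = new LinkedHashMap<>();

    void loadDocument(String id, String text) {
        tokensById.put(id, text.trim().split("\\s+"));
    }

    /** The text covered by a span, with its tokens rejoined by single spaces. */
    String textOf(ToySpan span) {
        String[] tokens = tokensById.get(span.documentId);
        StringBuilder sb = new StringBuilder();
        for (int i = span.start; i < span.start + span.length; i++) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(tokens[i]);
        }
        return sb.toString();
    }
}

/** Assertions about spans: binary types and k-ary properties. */
final class ToyLabels {
    private final Map<String, List<ToySpan>> spansByType = new HashMap<>();
    private final Map<ToySpan, Map<String, String>> props = new HashMap<>();

    void addToType(ToySpan span, String type) {
        spansByType.computeIfAbsent(type, k -> new ArrayList<>()).add(span);
    }

    boolean hasType(ToySpan span, String type) {
        return spansByType.getOrDefault(type, new ArrayList<>()).contains(span);
    }

    void setProperty(ToySpan span, String prop, String value) {
        props.computeIfAbsent(span, k -> new HashMap<>()).put(prop, value);
    }

    String getProperty(ToySpan span, String prop) {
        return props.getOrDefault(span, new HashMap<>()).get(prop);
    }
}

class TerminologySketch {
    public static void main(String[] args) {
        ToyTextBase base = new ToyTextBase();
        base.loadDocument("email1", "Meet Jane Doe at noon");

        ToySpan name = new ToySpan("email1", 1, 2);        // tokens 1 and 2
        ToyLabels labels = new ToyLabels();
        labels.addToType(name, "name");                    // binary: the span is a name
        labels.setProperty(name, "entityKind", "person");  // k-ary: one of several values

        System.out.println(base.textOf(name));                       // prints "Jane Doe"
        System.out.println(labels.hasType(name, "name"));            // prints "true"
        System.out.println(labels.getProperty(name, "entityKind"));  // prints "person"
    }
}
```

In this sketch the labels live outside the text base, which reflects the definition of TextLabels as assertions about spans rather than part of the documents themselves.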
Here is a breakdown of the UI classes, their functionality, and their respective tutorials:
ViewLabels is a viewing tool for a collection of documents and their labels that have been loaded into MinorThird. See the ViewLabels Tutorial.
RunMixup and DebugMixup - Running Mixup with a Mixup program annotates a set of documents based on the rules defined in the program. RunMixup creates new labels for the documents you loaded. DebugMixup runs a Mixup program and pops up a TextBaseEditor so that the Mixup results can be corrected by hand. See the Mixup Tutorial.
Extraction - extracts portions of a document. Examples include named entities such as person names and place names.
- TrainExtractor tells MinorThird to learn a certain SpanType (such as name) based on a set of labeled documents you give it. This class outputs a classifier, which can be tested on other labeled documents or used to annotate unlabeled documents. See the TrainExtractor Tutorial.
- TestExtractor requires a set of labeled text documents and a classifier. It lets you test the performance of a classifier on a set of test text documents. Reminder: the test documents MUST be hand-labeled for you to find out how accurate your classifier is. This tool will NOT output the predicted labels of the classifier; it outputs statistics on how the classifier performed. If you would like the classifier's predicted labels, run ApplyAnnotator. See the TestExtractor Tutorial.
- TrainTestExtractor requires either a set of labeled documents that will be split, or a set of labeled train documents together with a set of labeled test documents. This tool tells MinorThird to create a classifier for a certain SpanType and test it on documents either supplied by the user or created by the splitter. See the TrainTestExtractor Tutorial.
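The reason test documents must be hand-labeled is that scoring means comparing predicted spans against true ones. The sketch below shows one common way to do that: exact-match precision, recall, and F1 over spans. Which statistics MinorThird actually reports is not stated here, so treat the metric choice as an assumption; the class SpanScoreSketch and the "documentId:startToken:length" span encoding are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of span-level scoring; not MinorThird's evaluation code.
// A span is encoded as "documentId:startToken:length" for brevity.
class SpanScoreSketch {

    /** Number of predicted spans that exactly match a hand-labeled span. */
    static int truePositives(Set<String> predicted, Set<String> gold) {
        Set<String> both = new HashSet<>(predicted);
        both.retainAll(gold);
        return both.size();
    }

    /** Fraction of predicted spans that are correct (0 if nothing was predicted). */
    static double precision(Set<String> predicted, Set<String> gold) {
        if (predicted.isEmpty()) return 0.0;
        return (double) truePositives(predicted, gold) / predicted.size();
    }

    /** Fraction of hand-labeled spans that were found (0 if there are none). */
    static double recall(Set<String> predicted, Set<String> gold) {
        if (gold.isEmpty()) return 0.0;
        return (double) truePositives(predicted, gold) / gold.size();
    }

    /** Harmonic mean of precision and recall (0 if both are 0). */
    static double f1(Set<String> predicted, Set<String> gold) {
        double p = precision(predicted, gold);
        double r = recall(predicted, gold);
        return (p + r == 0.0) ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Hand-labeled spans for the test documents.
        Set<String> gold = new HashSet<>(Arrays.asList("d1:1:2", "d1:7:1", "d2:0:3"));
        // Spans predicted by some extractor: one exact match, one boundary error.
        Set<String> predicted = new HashSet<>(Arrays.asList("d1:1:2", "d2:0:2"));

        System.out.println("precision = " + precision(predicted, gold));
        System.out.println("recall    = " + recall(predicted, gold));
        System.out.println("f1        = " + f1(predicted, gold));
    }
}
```

Under exact matching, a span with the right start but the wrong length counts as both a false positive and a missed span, which is why the second prediction earns no credit.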
Classification - classifies entire documents. An example would be classifying emails as "real" or "spam".
- TrainClassifier tells MinorThird to learn a SpanType (such as "spam") and create a classifier for this SpanType based on a set of labeled documents that you input. The classifier can be tested on other labeled documents or used to output predicted labels for unlabeled documents. See the TrainClassifier Tutorial.
- TestClassifier requires a set of labeled text documents and a classifier. It lets you test the performance of a classifier on a set of test text documents. Reminder: the test documents MUST be hand-labeled for you to find out how accurate your classifier is. This tool will NOT output the predicted labels of the classifier. If you would like the classifier's predicted labels, run ApplyAnnotator. See the TestClassifier Tutorial.
- TrainTestClassifier requires either a set of labeled documents that will be split, or a set of labeled train documents together with a set of labeled test documents. This tool tells MinorThird to create a classifier for a certain SpanType and test it on documents either supplied by the user or created by the splitter. See the TrainTestClassifier Tutorial.
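When a single labeled set is supplied, a splitter divides it into train and test portions. The sketch below assumes the simplest scheme, a seeded random split by fraction; the splitters MinorThird offers may work differently (for example, by cross-validation). The class SplitSketch and the method randomSplit are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of a random train/test splitter; not MinorThird's splitter.
class SplitSketch {

    /**
     * Shuffles a copy of docs with the given seed and returns two lists:
     * index 0 is the training portion, index 1 is the test portion.
     * trainFraction is expected to lie in [0, 1].
     */
    static <T> List<List<T>> randomSplit(List<T> docs, double trainFraction, long seed) {
        List<T> shuffled = new ArrayList<>(docs);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(trainFraction * shuffled.size());

        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "mail01", "mail02", "mail03", "mail04", "mail05",
            "mail06", "mail07", "mail08", "mail09", "mail10");

        List<List<String>> parts = randomSplit(docs, 0.7, 42L);
        System.out.println("train: " + parts.get(0));
        System.out.println("test:  " + parts.get(1));
    }
}
```

Fixing the seed keeps the split reproducible, so two experiments that differ only in the learner are scored on the same test documents.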
MultiClassification - like classification, except that it can learn, test, and annotate multiple dimensions per document. For example, it can learn and classify both color and shape. Note: this is different from learning one dimension with several options (e.g., learning whether something is square, rectangular, or circular); in that case, you are learning only one label rather than several. Tutorials are not yet available, but the tools work in much the same way as classification.
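One way to picture the difference is that each document carries one label per dimension, so the data can be regrouped into one labeling problem per dimension. The sketch below does only that regrouping. Whether MinorThird treats the dimensions independently is not stated here, and MultiDimensionSketch and byDimension are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of multi-dimensional labels; not MinorThird code.
class MultiDimensionSketch {

    /** Regroups docId -> (dimension -> label) into dimension -> (docId -> label). */
    static Map<String, Map<String, String>> byDimension(
            Map<String, Map<String, String>> docLabels) {
        Map<String, Map<String, String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> doc : docLabels.entrySet()) {
            for (Map.Entry<String, String> dim : doc.getValue().entrySet()) {
                out.computeIfAbsent(dim.getKey(), k -> new LinkedHashMap<>())
                   .put(doc.getKey(), dim.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> docLabels = new LinkedHashMap<>();

        Map<String, String> first = new LinkedHashMap<>();
        first.put("color", "red");      // dimension 1
        first.put("shape", "square");   // dimension 2
        docLabels.put("doc1", first);

        Map<String, String> second = new LinkedHashMap<>();
        second.put("color", "blue");
        second.put("shape", "circular");
        docLabels.put("doc2", second);

        // Two dimensions, each with its own set of possible labels.
        System.out.println(byDimension(docLabels));
    }
}
```

By contrast, choosing among square, rectangular, and circular alone would be a single dimension with three options, and each document would carry just one label.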
OnlineLearner - the OnlineTextClassifierLearner lets you add documents to a learner by passing in a document string rather than a document span. A call to getTextClassifier() on the OnlineTextClassifierLearner returns a TextClassifier, which returns the score of a document string rather than a document span. See the OnlineLearner Tutorial.
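The general idea of online learning from document strings can be sketched as follows. This is not OnlineTextClassifierLearner and does not reproduce its API: ToyOnlineLearner, addDocument, and score are hypothetical names, and the perceptron update over lowercase word counts is an assumption chosen for brevity.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical perceptron-style online learner over word counts.
// It only illustrates adding documents as strings one at a time and
// scoring a string; it is not MinorThird's OnlineTextClassifierLearner.
class ToyOnlineLearner {
    private final Map<String, Double> weights = new HashMap<>();

    /** Counts lowercase word tokens in a document string. */
    private static Map<String, Integer> bagOfWords(String document) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : document.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    /** A positive score favors the positive class; a negative score the other. */
    double score(String document) {
        double sum = 0.0;
        for (Map.Entry<String, Integer> e : bagOfWords(document).entrySet()) {
            sum += weights.getOrDefault(e.getKey(), 0.0) * e.getValue();
        }
        return sum;
    }

    /** Adds one labeled document; weights change only if it is misclassified. */
    void addDocument(String document, boolean isPositive) {
        double target = isPositive ? 1.0 : -1.0;
        if (target * score(document) <= 0.0) {
            for (Map.Entry<String, Integer> e : bagOfWords(document).entrySet()) {
                weights.merge(e.getKey(), target * e.getValue(), Double::sum);
            }
        }
    }
}

class OnlineLearnerSketch {
    public static void main(String[] args) {
        ToyOnlineLearner learner = new ToyOnlineLearner();
        learner.addDocument("win free money now", true);          // "spam"
        learner.addDocument("meeting agenda for monday", false);  // "real"

        System.out.println(learner.score("free money"));      // prints 2.0
        System.out.println(learner.score("monday meeting"));  // prints -2.0
    }
}
```

Because each document is folded into the weights as it arrives, the learner never needs the whole training set at once, which is the practical appeal of the online setting.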
ApplyAnnotator applies a saved classifier to a set of documents to output a set of predicted labels. You can use this either to label unlabeled data or to compare the predicted labels to actual labels. See the ApplyAnnotator Tutorial.
EditLabels is a tool for adding and removing labels in the collection of documents you loaded into MinorThird and for saving a new labels document. It is useful for debugging the results of ApplyAnnotator. See the EditLabels Tutorial.