-
What is MinorThird?
MinorThird stands for "Methods for Identifying Names and Ontological Relationships in Text using Heuristics for Identifying Relationships in Data". It is a collection of Java classes for storing text, annotating text, categorizing text, and learning to extract entities from text.
-
Does it really work?
Depends mostly on the weather in Pittsburgh.
-
Do I need a license to use MinorThird?
Please see Licensing and Use.
-
Where can I find some documentation?
Documentation can be found at http://github.com/TeamCohen/MinorThird/wiki/.
-
How can I download it?
You can download it from GitHub at http://github.com/TeamCohen/MinorThird/. You will need Ant and a recent JDK (version 1.5 or higher).
-
Why can't I make it compile?
Most of the time it's because the environment variables are not set correctly. Please check that the following variables are set in your environment before trying to compile:
- ANT_HOME: set to the directory where you installed Ant (e.g., /usr0/local/apache-ant-1.6.4)
- JAVA_HOME: set to wherever you installed the JDK (e.g., /usr0/local/java/jdk1.6.0)
- MINORTHIRD: set to wherever you installed MinorThird (e.g., /usr0/project/minorthird)
- CLASSPATH: set according to the directions in the $MINORTHIRD/script/setup.xx file
For example, if you are using the Bash shell on Linux, you should do the following:
$ export MINORTHIRD=/usr0/project/minorthird
$ export CLASSPATH=$MINORTHIRD:$MINORTHIRD/class:$MINORTHIRD/lib:$MINORTHIRD/lib/minorThirdIncludes.jar:$MINORTHIRD/lib/mixup:$MINORTHIRD/config
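To verify the setup quickly, a small shell check like the following can help. This is just a convenience sketch, not part of MinorThird; the variable names are the ones listed above.

```shell
# Report any required build variable that is not set in the environment.
check_vars() {
  for v in "$@"; do
    if ! printenv "$v" > /dev/null; then
      echo "$v is not set"
    fi
  done
}

check_vars ANT_HOME JAVA_HOME MINORTHIRD CLASSPATH
```

Run it in the same shell you use for compiling; any variable it reports as unset needs an export line like the ones above.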
-
I'm frequently running out of memory. What should I do?
Try changing the Java Virtual Machine memory settings (initial heap size, maximum heap size, etc.). Type java -X for details. For instance:
$ java -Xmx1000m edu.cmu.minorthird...
-
If I want to run a classification experiment, in which format should I transform the data to be compatible with MinorThird? How do I perform a classification experiment if I already have a dataset with all features extracted?
The possible dataset formats are specified in the documentation of the DatasetLoader class. One way to run a classification experiment is, after formatting your data into MinorThird format, to use NumericDemo.java to select which experiment you want. You can change this file to use different classifiers, splitters, test sets, etc.
-
How can I run a classification experiment using data in SVM-Light format?
You can use DatasetLoader to convert a dataset from SVM-Light format (see the loadSVM method). Then you have a standard Dataset and should be able to do whatever you want with it.
-
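For reference, SVM-Light format puts one example per line as "<label> <index>:<value> ...", with 1-based feature indices and zero-valued features omitted. A minimal sketch of creating such a file (the file name tiny.svm is just an example):

```shell
# Write a two-example dataset in SVM-Light format:
# "<label> <index>:<value> ..." -- sparse, 1-based feature indices.
cat > tiny.svm <<'EOF'
1 1:0.5 3:1.2
-1 2:2.0
EOF

cat tiny.svm
```

A file like this should be what the loadSVM method reads; check the DatasetLoader Javadoc for the exact call signature.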
Is there a way to use non-recommended learners (like something using one-vs-all) using the minorthird.ui tools?
Yes. If you specify the learner on the command line with -learner, you can specify any learner you like, not simply the ones that pop up in the GUI. This could include a OneVsAll-based learner.
-
For the one-vs-all case, what's the data format and how do we communicate the k different classes to the learning algorithm? What kind of span types do I need? Over the entire input or over a single token (or some other span of tokens)? If either are possible, what impact does it have on the algorithm?
To classify documents into multiple classes using the minorthird.ui package tools (like TrainTestClassifier), you should define a span property, which is a mapping from spans to strings. For instance, if the span types deleteCommand, insertCommand, and replaceCommand have been defined, you could use this mixup script to define the span property whatCommand:
defSpanProp whatCommand:delete =: [@deleteCommand];
defSpanProp whatCommand:insert =: [@insertCommand];
defSpanProp whatCommand:replace =: [@replaceCommand];
After the property is defined, you can specify a OneVsAll learner with the -learner option, and specify that you want to train against the whatCommand property with the option -spanProp whatCommand (replacing -spanType deleteCommand, or whatever).
The result of training will be an Annotator that assigns some other span property (as specified by the -output option) to a document.
To summarize the training, you need to:
- specify the class (insert, delete, replace) for every training document, using some SpanProperty (e.g., whatCommand);
- tell the UI what SpanProperty you're training against (e.g., -spanProp whatCommand);
- tell the UI what learner to use, and make sure it's a learner that can handle non-binary classification.
After training (e.g., after using TestClassifier or the ApplyAnnotator method in the minorthird.ui package), the learned annotator will add to every document its predicted class as the value of the span property specified by -output.
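Putting those pieces together, a training command might look like the following sketch. The file names are placeholders, and the learner expression is deliberately elided because the exact class name of a one-vs-all learner is not shown above; run the tool with -help to see what is available.

```shell
# Hypothetical invocation: train against the whatCommand span property with a
# multi-class learner, writing predictions to the _p span property.
# (Learner expression and file names are placeholders; check -help for specifics.)
java edu.cmu.minorthird.ui.TrainTestClassifier \
    -labels commands.train -test commands.test \
    -spanProp whatCommand \
    -learner "..." \
    -output _p
```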
-
How do I communicate the k classes to OneVsAll? Does it simply take all declared span types in the training set? Or do I pass it a set of labels?
The set of labels will be inferred from the data.
-
If I want to run an extractor learner that uses an experimental NameFE class as the feature extractor, how would I specify that on the command line? Or is it a matter of changing the source files and compiling with the new one?
Follow the output of java edu.cmu.minorthird.ui.TrainTestExtractor -help. If you've put NameFE on the classpath (which it's not, by default), then you can specify it:
- with the command line option -fe "new NameFE()", or
- via the GUI, if you add the class name of NameFE to your selectableTypes.txt file.
There may be a bug in the UI which keeps you from inspecting the changed FE, but it does work.
If you want to parameterize NameFE, the simplest approach is to add the parameters to the constructor, so you can specify them directly using the -fe argument. If you look at the Javadoc for util.CommandLineProcessor and have your NameFE implementation implement the CommandLineProcessor.Configurable interface, then you can pass in additional arguments on the command line as well.
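Concretely, a run with a hand-coded feature extractor might be invoked like this sketch. The data files and span type are the sample names used elsewhere in this FAQ; only the -fe option and the class NameFE come from the answer above.

```shell
# Hypothetical invocation: NameFE must already be on the CLASSPATH.
java edu.cmu.minorthird.ui.TrainTestExtractor \
    -labels sample1.train -test sample1.test \
    -spanType trueName \
    -fe "new NameFE()"
```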
-
How can the supported token extractors be configured?
The supported token extractors (e.g., Recommended.TokenFE()) can be configured in the GUI, or by first explicitly specifying them with -fe and then using one of the command line options, which you can discover by running with -fe Recommended.TokenFE() -help or by looking at the API Javadoc.
All of the supported FEs need a Mixup program that provides a specified type of annotation.
-
Where would I look to get a better understanding of how the spans are built?
Try text.learn.SpanFE.
-
When I look at the features of each span (in the window I get when checking displayDataSetBeforeLearning), each span seems to have only a single token. Is that correct?
Many of the extraction learners work by reducing extraction to tagging; i.e., labeling each word with one label, like "inside a name" vs. "outside a name".
-
Is it true that there could be multiple tokens per span? Where is the code that translates the sentences from strings into a set of tokens/spans?
Text is translated from strings to tokens in text.TextBase. A Span is basically a sequence of tokens. The conversion from a Span to an Instance is inside text.learn.SpanFE.
-
While I have it working with a feature extractor that I wrote and compiled, I currently have no way to test this other than looking at the final error values and seeing that they have changed.
We recommend:
- engineering as much of the FE process as possible in Mixup, and using ui.LabelViewer to check the results, and
- using the database viewer to view the final results of extraction.
-
How do I get additional details (weights, features, etc) in a name extraction test?
Use the -showResult and -showTestDetails options.
-
Must the feature extractor be declared as a Serializable class when building a serialized model and passing in a hand-coded feature extractor?
The feature extractor should indeed be Serializable. If it's not, you can still perform cross-validation experiments, but an error may be thrown when you try to write it out; e.g., with TrainExtractor.
Serialization only saves instance-dependent (non-static) data; it doesn't save the code associated with a class, so you'll need to have the feature extraction code in your classpath when you load it back in.
It's probably possible to hack the Mixup interpreter to serialize Mixup code and dictionaries along with an extractor, if you really want to do that (if the dictionaries are pretty big, you might not want to save multiple copies).
-
How can I run an experiment to learn multiple types?
There are three ways to learn to extract multiple types:
- Define a SpanProperty and pass that into TrainTestExtractor with the -spanProp option:
$ java edu.cmu.minorthird.ui.TrainTestExtractor -labels sample1.train -test sample1.test -spanProp inCapsBecause
- Use the -spanProp option with a comma-separated list of non-overlapping types:
$ java edu.cmu.minorthird.ui.TrainTestExtractor -labels sample1.train -test sample1.test -spanProp inCapsBecauseStart,trueName
Note that there are no spaces between the commas and types.
- You can also run ui.TrainExtractor multiple times and learn multiple extractors, which of course might extract overlapping spans.
In the first two cases what is learned is an extractor that inserts a new SpanProperty (by default named _prediction).
-
How do I report bugs?
Bugs?! There are no bugs in MinorThird!!! But in case you really think you found one, please use the issues page.