There are several available parsing models for BLLIP Parser. This
document is designed to help you determine which one will perform best
for your task. Each parsing model discussed here pairs a Charniak parser
model with a Johnson reranker model designed to work together (this is
called a unified parsing model).
If you don't already have the Python bllipparser module, run the
following in your shell:
shell% pip install --user bllipparser
Or, if you can run sudo:
shell% sudo pip install bllipparser
Once you have bllipparser, you can use the ModelFetcher
functionality to list and download parsing models. To list parsing models,
run the following in your shell:
shell% python -mbllipparser.ModelFetcher -l
8 known unified parsing models: [uncompressed size]
GENIA+PubMed:
Self-trained model on GENIA treebank and approx. 200k sentences
from PubMed [152MB]
OntoNotes-WSJ:
WSJ portion of OntoNotes [61MB]
SANCL2012-Uniform:
Self-trained model on OntoNotes-WSJ and the Google Web Treebank
[890MB]
WSJ:
Wall Street Journal corpus from Penn Treebank, version 2
("AnyDomain" version) [52MB]
WSJ+Gigaword:
Self-trained model on PTB2-WSJ and approx. two million sentences
from Gigaword (deprecated) [473MB]
WSJ+Gigaword-v2:
Improved self-trained model on PTB WSJ and two million sentences
from Gigaword [435MB]
WSJ-PTB3:
Wall Street Journal corpus from Penn Treebank, version 3 [55MB]
WSJ-with-AUX:
Wall Street Journal corpus from Penn Treebank, version 2 (AUXified
version, deprecated) [55MB]
This list may change as new parsing models are added.
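If you want to script against this listing (for example, to check download sizes before installing), the output can be parsed with a few lines of Python. This is a minimal sketch: parse_listing is a hypothetical helper, not part of bllipparser, and it assumes the listing keeps the layout shown above (a model name ending in a colon, followed by a description that ends with a size in square brackets).

```python
import re

# Sample text copied from the listing above; in practice this would be the
# captured output of "python -mbllipparser.ModelFetcher -l".
LISTING = """\
GENIA+PubMed:
Self-trained model on GENIA treebank and approx. 200k sentences
from PubMed [152MB]
OntoNotes-WSJ:
WSJ portion of OntoNotes [61MB]
SANCL2012-Uniform:
Self-trained model on OntoNotes-WSJ and the Google Web Treebank
[890MB]
"""


def parse_listing(text):
    """Return {model_name: size_in_mb} (hypothetical helper).

    Assumes each model name sits on its own line ending in ':' and that
    its description ends with a size such as '[152MB]'.
    """
    sizes = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        # A model name line has no spaces and ends with a colon.
        if line.endswith(":") and " " not in line:
            current = line[:-1]
        match = re.search(r"\[(\d+)MB\]", line)
        if match and current is not None:
            sizes[current] = int(match.group(1))
    return sizes


if __name__ == "__main__":
    for name, size in sorted(parse_listing(LISTING).items()):
        print(name, size)
```

The header line of the real output ("8 known unified parsing models: ...") is skipped because it contains spaces and no size in megabytes.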
To download and install WSJ+Gigaword-v2 (as an example), run the
following in your shell:
shell% python -mbllipparser.ModelFetcher -i WSJ+Gigaword-v2
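The same listing and install commands can be driven from a Python script, which is handy in setup code. The sketch below only uses the -l and -i flags shown above; model_fetcher_command and install_model are hypothetical helper names, and the sketch assumes that ModelFetcher exits with a non-zero status when an install fails.

```python
import subprocess
import sys


def model_fetcher_command(model_name=None):
    """Build the ModelFetcher command line (hypothetical helper).

    With no model name, the command lists models (-l); otherwise it
    downloads and installs the named model (-i).
    """
    command = [sys.executable, "-m", "bllipparser.ModelFetcher"]
    if model_name is None:
        command.append("-l")
    else:
        command.extend(["-i", model_name])
    return command


def install_model(model_name):
    """Run the install command in a subprocess.

    check=True raises subprocess.CalledProcessError if the command exits
    with a non-zero status (assumed here to indicate failure).
    """
    subprocess.run(model_fetcher_command(model_name), check=True)


if __name__ == "__main__":
    # Print the command rather than running it, so nothing is downloaded.
    print(" ".join(model_fetcher_command("WSJ+Gigaword-v2")))
```

Using sys.executable keeps the download tied to the same Python installation that has bllipparser installed.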
Depending on the text that you'd like to parse, there are different optimal parsing models. Here are the current recommendations:
- News text: WSJ+Gigaword-v2
- Web text: SANCL2012-Uniform
- Biomedical (PubMed) text: GENIA+PubMed
- WSJ section 23 evaluations to replicate papers: For purely supervised
  parser or parser/reranker results, use either WSJ-PTB3 (for Penn
  Treebank WSJ) or OntoNotes-WSJ (for the OntoNotes version of WSJ). Use
  WSJ+Gigaword to replicate self-training results, though WSJ+Gigaword-v2
  performs slightly better.
- Everything else: In general, it's probably best to use
  SANCL2012-Uniform or WSJ+Gigaword-v2 depending on how well-formed your
  text is (SANCL2012-Uniform for more informal web/email text).
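The recommendations above can be encoded as a small lookup, sketched here under some assumptions. RECOMMENDED_MODELS, recommend_model and load_recommended_parser are hypothetical names introduced for illustration. The loading step assumes the RerankingParser.fetch_and_load class method from the bllipparser Python module; check it against the version you have installed.

```python
# Domain-to-model mapping taken from the recommendations above.
RECOMMENDED_MODELS = {
    "news": "WSJ+Gigaword-v2",
    "web": "SANCL2012-Uniform",
    "biomedical": "GENIA+PubMed",
}


def recommend_model(domain, informal=False):
    """Return the recommended model name for a text domain (hypothetical).

    Unknown domains fall under "everything else": SANCL2012-Uniform for
    informal web/email-like text, WSJ+Gigaword-v2 otherwise.
    """
    key = domain.lower()
    if key in RECOMMENDED_MODELS:
        return RECOMMENDED_MODELS[key]
    return "SANCL2012-Uniform" if informal else "WSJ+Gigaword-v2"


def load_recommended_parser(domain, informal=False):
    """Download (if needed) and load the recommended model.

    Requires bllipparser; imported lazily so recommend_model works
    without it. fetch_and_load is assumed from the bllipparser API.
    """
    from bllipparser import RerankingParser
    model_name = recommend_model(domain, informal)
    return RerankingParser.fetch_and_load(model_name, verbose=True)


if __name__ == "__main__":
    print(recommend_model("biomedical"))
    print(recommend_model("email", informal=True))
```

The WSJ section 23 replication case is left out of the lookup on purpose, since the right model there depends on which published result you are reproducing.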