TextPreprocessingPipeline
Here we describe everything that's involved in going from raw text to data we can work with directly in Chipa.
This will end up as a significant part of the tasks/evaluation chapter.
We have two sources of Bibles.
Some of our Bibles were provided by Chris Loza and Rada Mihalcea, and I'd like to thank them for sharing their data. Chris describes the preprocessing work he did in his master's thesis, "Cross Language Information Retrieval for Languages with Scarce Resources" (University of North Texas, 2009).
The others we downloaded from the web. In both cases, we convert them into a single standard format, which is used by scripts further down the pipeline.
We end up producing four Bibles suitable for experiments: English, Spanish, Quechua, and Guarani.
The English version is in USFM format and is intentionally public domain, which is rad. We wrote a short script to handle the USFM format.
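The script itself isn't reproduced here, but as a rough sketch of the idea: USFM marks chapters with \c and verses with \v, so a minimal reader (hypothetical, and ignoring most of the markup real USFM allows) looks something like this:

```python
import re
from collections import OrderedDict

def verses_from_usfm(path, book):
    """Tiny USFM reader: keep only chapter (\\c) and verse (\\v) content,
    crudely stripping footnotes and any other inline markers."""
    verses = OrderedDict()
    chapter, key = None, None
    with open(path, encoding="utf-8") as infile:
        for line in infile:
            line = line.strip()
            chap = re.match(r"\\c\s+(\S+)", line)
            vers = re.match(r"\\v\s+(\S+)\s*(.*)", line)
            if chap:
                chapter = chap.group(1)
            elif vers:
                key = (book, chapter, vers.group(1))
                verses[key] = strip_markers(vers.group(2))
            elif key and line and not line.startswith("\\"):
                # a bare continuation line extends the current verse
                verses[key] = (verses[key] + " " + strip_markers(line)).strip()
    return verses

def strip_markers(text):
    text = re.sub(r"\\f.*?\\f\*", "", text)  # drop footnote spans
    text = re.sub(r"\\\S+\s*", "", text)     # drop any remaining markers
    return text.strip()
```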
The English text is lemmatized with NLTK's WordNet lemmatizer.
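For reference, the NLTK call looks roughly like this (it needs the WordNet data from nltk.download('wordnet'); whether our script passes part-of-speech tags isn't pinned down here, so treat the pos argument as illustrative):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Without a POS tag the lemmatizer assumes a noun, so verbs need pos="v".
print(lemmatizer.lemmatize("churches"))           # -> church
print(lemmatizer.lemmatize("believed", pos="v"))  # -> believe
```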
Our Guarani text is spidered from the web. We use BeautifulSoup to pull the flat text out of the HTML.
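The extraction step is essentially BeautifulSoup's get_text(); here is a sketch of that piece (the fetching code and any site-specific cleanup are assumptions, not the actual spider):

```python
import urllib.request
from bs4 import BeautifulSoup

def flat_text(url):
    """Fetch a page and return its visible text with the tags stripped out."""
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # drop non-content elements
    return soup.get_text(separator=" ", strip=True)
```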
Guarani text requires morphological analysis: we want to pick lemmas rather than fully inflected forms.
For Spanish text, we can try several different Bible translations, all of which are available electronically.
We have the Traducción en Lenguaje Actual (TLA) and the Reina Valera (1995 edition), both of which we found on the web.
In either case, we do lowercasing, tokenization and lemmatization on the input text. For lemmatization, we use Mike's morphological analyzer from Paramorfo.
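A sketch of the first two steps, lowercasing and tokenization, follows (whether we actually use NLTK's Spanish tokenizer is an assumption, and it needs NLTK's punkt data; the Paramorfo lemmatization call is left out because its API isn't described here):

```python
from nltk.tokenize import word_tokenize

def preprocess_spanish_verse(verse):
    """Lowercase and tokenize one Spanish verse; lemmatization with the
    Paramorfo analyzer would then map each token to its lemma."""
    return word_tokenize(verse.lower(), language="spanish")

print(preprocess_spanish_verse("En el principio creó Dios los cielos y la tierra."))
# ['en', 'el', 'principio', 'creó', 'dios', 'los', 'cielos', 'y', 'la', 'tierra', '.']
```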
- We have a Quechua translation of the Catholic Bible, provided by Rada's group.
- We also have an already-preprocessed version from the SQUOIA group.
For Quechua, we use the morphological analyzer (MA) from Mike's AntiMorfo.
We just take all the verse numbers that line up and call those matching sentences. Should we add a heuristic that checks whether the paired verses are about the same length? Probably. Moses and cdec have something like that, yeah?
Note that when working with Bible text, we use verses rather than sentences as the basic unit of text.
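To make the matching step concrete, here is a sketch of the verse-pairing idea with the length-ratio check we're considering; the dict-of-verses input format and the ratio threshold (9, in the spirit of Moses' corpus-cleaning script) are assumptions, not what the scripts currently do:

```python
def matched_verses(source_verses, target_verses, max_ratio=9.0):
    """Pair verses that share a (book, chapter, verse) key in both Bibles,
    skipping pairs whose token counts differ wildly."""
    pairs = []
    for key, src in source_verses.items():
        tgt = target_verses.get(key)
        if tgt is None:
            continue
        src_len, tgt_len = len(src.split()), len(tgt.split())
        if min(src_len, tgt_len) == 0:
            continue
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_ratio:
            continue  # lengths too different; probably a bad pairing
        pairs.append((key, src, tgt))
    return pairs
```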
Alignment is done with cdec in the default way. From this, we get one-to-many alignments, where each source word is aligned to 0 or more target words.
From http://www.cdec-decoder.org/guide/fast_align.html
$ ~/cdec/word-aligner/fast_align -i corpus.de-en -d -v -o > corpus.de-en.fwd_align
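fast_align writes one line per sentence pair, made of i-j pairs (zero-based source and target word indices). Here is a sketch of reading that back into the one-to-many form described above; the filename follows the command above, the rest is an assumption:

```python
from collections import defaultdict

def read_alignments(path):
    """Parse fast_align output: each line holds pairs like '0-0 1-2 1-3',
    meaning source word i is aligned to target word j. Return one dict per
    sentence, mapping each aligned source index to its set of target indices."""
    sentences = []
    with open(path, encoding="utf-8") as infile:
        for line in infile:
            links = defaultdict(set)
            for pair in line.split():
                i, j = pair.split("-")
                links[int(i)].add(int(j))
            sentences.append(dict(links))
    return sentences

alignments = read_alignments("corpus.de-en.fwd_align")
```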