TextPreprocessingPipeline

Alex Rudnick edited this page Aug 28, 2014 · 8 revisions

Here we want to write about everything that's involved in going from raw text to stuff we can directly work with in Chipa.

This will end up as a significant part of the tasks/evaluation chapter.

bible chopping

We have two sources of Bibles.

Some of our Bibles were provided by Chris Loza and Rada Mihalcea, and I'd like to thank them for sharing their data. Chris describes the preprocessing work he did in his master's thesis, "Cross Language Information Retrieval for Languages with Scarce Resources" (University of North Texas, 2009).

The rest of our Bibles we downloaded from the web. In both cases, we convert them into a single standard format, which the scripts further down the pipeline consume.
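The standard format itself isn't specified on this page. As an illustration only (the actual convention may differ), one simple scheme is one verse per line, keyed by book, chapter, and verse so that verses can be matched across translations:

```python
def format_verse(book, chapter, verse, text):
    """Render one verse as a single tab-separated line: a stable
    (book, chapter, verse) key, then the whitespace-normalized text.
    This key format is an assumption for illustration."""
    key = "{}_{}_{}".format(book, chapter, verse)
    return "{}\t{}".format(key, " ".join(text.split()))
```

Keying on verse coordinates rather than line position means the later bitext step only has to intersect keys.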

We end up producing four Bibles suitable for experiments: English, Spanish, Quechua, and Guarani.

bible chopping: English

The English version is in USFM format and is intentionally public domain, which is rad. We wrote a short script to handle the USFM format.
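A minimal sketch of what such a USFM script has to do. This is simplified: real USFM has many more markers, and verse text can span lines; the function name here is illustrative, not the actual script.

```python
import re

def usfm_verses(lines):
    """Walk USFM-marked lines, tracking the current chapter (\\c)
    and yielding (chapter, verse, text) for each verse marker (\\v).
    Other markers (\\id, \\p, ...) are skipped in this sketch."""
    chapter = None
    for line in lines:
        m = re.match(r'\\c\s+(\d+)', line)
        if m:
            chapter = int(m.group(1))
            continue
        m = re.match(r'\\v\s+(\d+)\s+(.*)', line)
        if m:
            yield chapter, int(m.group(1)), m.group(2).strip()
```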

We lemmatize with NLTK's WordNet lemmatizer.

bible chopping: Guarani

Our Guarani text is spidered from the web. We use BeautifulSoup to pull the flat text out of the HTML.
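A sketch of that extraction step. The function name and the choice of parser are illustrative, not necessarily what our actual script does:

```python
from bs4 import BeautifulSoup

def flat_text(html):
    """Strip markup from a spidered page: drop script/style contents,
    then return the whitespace-normalized visible text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return " ".join(soup.get_text().split())
```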

Guarani text requires morphological analysis: we want to pick lemmas rather than fully inflected forms.

bible chopping: Spanish

For Spanish text, we can try several different Bible translations, all of which are available electronically.

We have the Traducción en Lenguaje Actual (TLA), and the Reina Valeria (1995 edition), which were both found on the web.

In either case, we lowercase, tokenize, and lemmatize the input text. For lemmatization, we use Mike's morphological analyzer from Paramorfo.
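A sketch of the lowercase/tokenize step. Paramorfo's actual interface isn't shown on this page, so the `lemmatize_es` hook below is a hypothetical stand-in for it:

```python
import re

def preprocess_es(verse, lemmatize_es=lambda tok: tok):
    """Lowercase a Spanish verse, split it into word and punctuation
    tokens, and run each token through a lemmatizer hook (here a
    placeholder for Paramorfo's analyzer)."""
    tokens = re.findall(r"\w+|[^\w\s]", verse.lower())
    return [lemmatize_es(tok) for tok in tokens]
```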

bible chopping: Quechua

  • We have a Quechua translation of the Catholic Bible, provided by Rada's group.
  • We also have a preprocessed version from the SQUOIA group.

We do morphological analysis with Mike's AntiMorfo.

producing bitext

We just take all the verse numbers that line up in both texts and treat those verses as matching sentence pairs. Should we also apply a length-ratio heuristic to check that paired verses are about the same length? Probably; Moses and cdec ship something like that.

Note that when working with Bible text, we use verses rather than sentences as the basic unit of text.
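The pairing-plus-filtering idea sketched above, assuming verses are stored in dicts keyed by a book/chapter/verse identifier (the key scheme and ratio threshold are assumptions, and the filter is in the spirit of Moses's corpus-cleaning heuristic rather than a copy of it):

```python
def verse_bitext(src, tgt, max_ratio=3.0):
    """Pair up verses whose keys occur in both Bibles; src and tgt
    map verse keys to token lists.  Drop pairs whose token counts
    differ by more than max_ratio, since those are likely bad pairs."""
    pairs = []
    for key in sorted(set(src) & set(tgt)):
        s, t = src[key], tgt[key]
        if s and t and max(len(s), len(t)) / min(len(s), len(t)) <= max_ratio:
            pairs.append((key, s, t))
    return pairs
```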

alignment

Alignment is done with cdec in the default way. From this, we get one-to-many alignments, where each source word is aligned to 0 or more target words.

From http://www.cdec-decoder.org/guide/fast_align.html

$ ~/cdec/word-aligner/fast_align -i corpus.de-en -d -v -o > corpus.de-en.fwd_align
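fast_align writes one line per sentence pair, with alignment links as `i-j` pairs of 0-based source and target indices. Reading those back into the one-to-many view described above (the function name is ours, not cdec's):

```python
from collections import defaultdict

def parse_alignment_line(line):
    """Parse one line of fast_align output ('0-0 1-2 ...') into a
    dict mapping each aligned source index to the set of target
    indices it links to; unaligned source words simply don't appear."""
    links = defaultdict(set)
    for pair in line.split():
        i, j = pair.split("-")
        links[int(i)].add(int(j))
    return dict(links)
```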