Heilmeier's Catechism

Alex Rudnick edited this page Jan 14, 2014 · 6 revisions

  • What are you trying to do? Articulate your objectives using absolutely no jargon.
    • I want to make translation systems make better word choices, even when there are few example translations for the language pair and little available target-language text. I also want to make it easier for existing translation systems of varying designs to make use of new example translations as they become available.
  • How is it done today, and what are the limits of current practice?
    • There are a number of different strategies used for lexical selection in current MT systems:
      • Currently, RBMT systems like SQUOIA make word choices with arbitrary dictionary lookups and a few hand-crafted semantic rules. This is limited because the rules must be written by hand, which is brittle and time-consuming; it would be better to have a strategy that improves as more bitext becomes available.
      • SMT systems use multiword phrases and language models to encourage good, coherent word choices. This works better as more training data becomes available, but for dealing with under-resourced languages, we don't have lots of that to start with, so our language models and phrase tables may not be very good.
      • Currently Apertium uses lexical selection rules that compile to FSTs and take into account the lemmas and the tags of the local context in order to pick an appropriate translation. These rules can either be learned from data or written by hand, and the learned rules are stored in a human-editable format. This is good because the system can improve with more data, but limited in that the formalism can only take into account specific local features, and each rule makes a hard decision. Our system is more flexible in that its classifiers can use richer features, many features simultaneously, and in the case of terere, can be weighted against other evidence.
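The contrast above, soft classifier decisions over many context features versus hard local rules, can be sketched with a toy discriminative lexical selector. Everything here (the feature set, the training pairs, the "banco"/"orilla" example) is hypothetical and deliberately simplified; it is not the chipa implementation, just an illustration of the technique.

```python
from collections import defaultdict

def extract_features(context_words):
    """Bag-of-words features over the local context. A deliberately simple,
    hypothetical feature set; richer features could include lemmas, POS tags,
    and positions of the surrounding words."""
    return {"ctx=" + w for w in context_words}

class LexicalSelector:
    """Toy perceptron-style discriminative selector: picks a target-language
    translation for an ambiguous source word by scoring context features."""

    def __init__(self):
        self.weights = defaultdict(float)  # (translation, feature) -> weight
        self.labels = set()                # candidate translations seen

    def train(self, examples, epochs=3):
        """examples: list of (context_words, correct_translation) pairs."""
        for _, gold in examples:
            self.labels.add(gold)
        for _ in range(epochs):
            for context, gold in examples:
                guess = self.predict(context)
                if guess != gold:
                    # Standard perceptron update: reward the gold label's
                    # features, penalize the wrong guess's features.
                    for f in extract_features(context):
                        self.weights[(gold, f)] += 1.0
                        self.weights[(guess, f)] -= 1.0

    def predict(self, context):
        feats = extract_features(context)
        # sorted() makes tie-breaking deterministic
        return max(sorted(self.labels),
                   key=lambda lab: sum(self.weights[(lab, f)] for f in feats))

# Disambiguating English "bank" into Spanish "banco" vs. "orilla"
# (made-up training data, for illustration only).
selector = LexicalSelector()
selector.train([(["river", "water"], "orilla"),
                (["money", "deposit"], "banco")])
```

Because the decision is a weighted sum rather than a hard rule match, evidence from many features (and from other models) can be combined, which is the flexibility claimed above.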
  • What's new in your approach and why do you think it will be successful?
    • This is new because we have discriminative classifiers that use many different contextual features to make lexical selections, leveraging our resources (both good tools and lots of data, monolingual and multilingual) for the source languages. I think it will be successful because we have lots of tools and data for analyzing Spanish (and English).
  • Who cares?
    • Broadly, anybody who cares about under-resourced languages and translation into them. Anybody who already has an RBMT system and would like to make it better.
  • If you're successful, what difference will it make?
    • We'll have made it easier to bring up translation systems for under-resourced languages. And there are a lot of under-resourced languages in the world.
  • What are the risks and the payoffs?
    • The major risk is that it won't work as well as we suspect and I'll have spent some time. But even in that case, we'll have learned about what doesn't work. The payoffs will be better translations for Guarani and Quechua (at least), reusable software for future MT systems, and some knowledge about how to translate into languages of that shape!
  • How much will it cost?
    • Just my time and effort. And the time and effort of any collaborators, both developers and Paraguayans who want to help by contributing Guampa translations.
  • How long will it take?
    • From now until September 2014, or possibly later into the fall, should things not work out as we hope.
  • What are the midterm and final "exams" to check for success?
    • In the midterm, we want to do in-vitro WSD experiments that show that our system does better than the most-frequent-sense baseline, and that at least some of our more interesting features are helping.
    • Finally, we'll have integrated chipa into some actual running MT systems for in-vivo improved machine translation. We should see improvements on MT metrics like BLEU and output sentences that users like better.
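The most-frequent-sense baseline mentioned above is simple enough to sketch directly: always predict whichever sense (translation) occurred most often in training, and measure accuracy on held-out data. The sense labels and counts below are made up for illustration.

```python
from collections import Counter

def mfs_baseline(train_senses, test_senses):
    """Most-frequent-sense baseline for one ambiguous word: predict the sense
    seen most often in training, and return (prediction, accuracy) on the
    held-out gold senses. A toy sketch, not an evaluation harness."""
    mfs = Counter(train_senses).most_common(1)[0][0]
    accuracy = sum(1 for gold in test_senses if gold == mfs) / len(test_senses)
    return mfs, accuracy

# Hypothetical sense-annotated occurrences of one ambiguous word.
pred, acc = mfs_baseline(["banco", "banco", "orilla"],
                         ["banco", "orilla", "banco", "banco"])
```

Beating this baseline is a meaningful midterm check because for skewed sense distributions it can be surprisingly strong; a classifier only earns its keep where context actually changes the best choice.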