  • 1: introduction, overview
    • background for the problem: WSD in machine translation, why CL-WSD matters and some good lexical choice examples
    • thesis statement: CL-WSD is a feasible and practical means for lexical selection in a hybrid MT system for a language pair with relatively modest resources.
    • Intuition: when translating into a lower-resourced language, our target-language LM cannot do as much of the lexical selection work for us, so we need to do more work understanding the source-language text. Hence, WSD.
    • We'll demonstrate this with a number of CL-WSD experiments (to show that we can do it) and by integrating CL-WSD into several MT systems (to show that it helps).
    • research questions
      • Can we use monolingual evidence from the source language to help with CL-WSD?
      • Can we use multilingual evidence from bitext pairing the source language with other languages?
      • Can we use sequence labeling techniques?
      • Which kinds of MT systems can benefit from CL-WSD?
    • structure of the rest of the dissertation
      • here's what happens in the subsequent chapters
  • 2: background
    • history of WSD and CL-WSD
    • hybrid MT
    • Guarani language
    • Quechua language
  • 3: related work
    • CL-WSD approaches in vitro
    • WSD with sequence models
    • lexical selection in RBMT
    • CL-WSD for SMT
    • WSD for lower-resourced languages
    • SemEval and Senseval
    • translation into morphologically rich languages (MRLs)
  • 4: overview of tasks, evaluation, baseline system
    • measuring CL-WSD classification accuracy
    • measuring MT improvements (extrinsic evaluation)
    • using the Bible: how and why
      • background on using the Bible as a corpus ...
      • preprocessing steps
    • exploring our bitext
    • description of the baseline system (a minimal sketch follows this chapter's items)
    • source-side annotations
    • classification results: the baseline system
      • Zipfian observation: the most common words tend to be the most polysemous.
      • test set on the 100 most common words
      • test set on Bible sentences: sample sentences that use those 100 most common words
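
A minimal sketch of the baseline classifier idea from this chapter: one classifier per ambiguous Spanish word, with labels drawn from the aligned Guarani translations in the bitext and bag-of-words context features. scikit-learn is used here for concreteness (not necessarily the toolkit in the actual experiments), and the training instances are made up.

```python
# Sketch of a baseline CL-WSD classifier: one classifier per source word;
# labels are the target-language translations observed in the aligned bitext.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def context_features(tokens, index, window=3):
    """Bag-of-words features from a window around the focus word."""
    feats = {}
    for i in range(max(0, index - window), min(len(tokens), index + window + 1)):
        if i != index:
            feats["bow=" + tokens[i]] = 1.0
    return feats

# Hypothetical training instances for Spanish "casa": (sentence tokens,
# position of the focus word, Guarani label from the word alignments).
instances = [
    (["la", "casa", "es", "grande"], 1, "óga"),      # óga: 'house'
    (["la", "casa", "de", "cambio"], 1, "SENSE_2"),  # placeholder label
]
X = [context_features(toks, i) for toks, i, _ in instances]
y = [label for _, _, label in instances]

clf = Pipeline([("vec", DictVectorizer()), ("lr", LogisticRegression())])
clf.fit(X, y)
print(clf.predict([context_features(["una", "casa", "nueva"], 1)]))
```

Measuring classification accuracy is then just comparing predictions like these against held-out aligned translations.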
  • 5: learning from monolingual data
    • unsupervised learning from big corpora: word representations
    • word2vec embeddings (see the sketch after this chapter's items)
    • Brown clusters
    • extracting features with existing NLP tools: analysis with FreeLing
    • CL-WSD experiments
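
A sketch of how the word representations above could plug into those classifiers, using gensim's word2vec (gensim >= 4.0 API) as one convenient implementation; the toy corpus and feature names are illustrative. Brown cluster features would be analogous, with one string-valued feature per cluster-bitstring prefix instead of real-valued dimensions.

```python
# Sketch: unsupervised word representations as extra CL-WSD features.
import numpy as np
from gensim.models import Word2Vec

# A real setup would train on a large monolingual Spanish corpus;
# this toy corpus only makes the sketch runnable.
corpus = [["la", "casa", "es", "grande"], ["una", "casa", "nueva"]]
w2v = Word2Vec(corpus, vector_size=50, window=5, min_count=1, epochs=20)

def embedding_features(tokens, index, window=3):
    """Average the embeddings of context words and expose each dimension
    as a real-valued feature that a linear classifier can use."""
    context = [t for i, t in enumerate(tokens)
               if i != index and abs(i - index) <= window and t in w2v.wv]
    if not context:
        return {}
    mean = np.mean([w2v.wv[t] for t in context], axis=0)
    return {"emb_%d" % d: float(v) for d, v in enumerate(mean)}
```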
  • 6: learning from multilingual data
    • Europarl
    • prepackaged multilingual data: the PPDB
    • other bitext for Spanish
    • joint classification with MRFs (Markov random fields)
    • classifier stacking (sketched after this chapter's items)
    • CL-WSD experiments
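
A sketch of the classifier-stacking idea, reusing context_features from the baseline sketch: classifiers trained on bitext pairing Spanish with better-resourced languages (say es-fr and es-it from Europarl) make predictions that become features for the es-gn classifier. The helper_classifiers mapping is hypothetical.

```python
def stacked_features(tokens, index, helper_classifiers):
    """Baseline context features plus one feature per helper language,
    recording the translation that language's classifier predicts."""
    feats = context_features(tokens, index)
    for lang, clf in helper_classifiers.items():
        pred = clf.predict([context_features(tokens, index)])[0]
        feats["%s_pred=%s" % (lang, pred)] = 1.0
    return feats
```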
  • 7: sequence models
    • MEMMs (maximum-entropy Markov models)
    • linear-chain CRFs (maybe; see the sketch after this chapter's items)
    • CL-WSD experiments
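
One way to make the sequence-labeling framing concrete: treat each sentence as a sequence in which the label at each position is the aligned target-language word ("O" for unaligned tokens), so a linear-chain CRF can score neighboring translation decisions jointly. sklearn-crfsuite is just one possible toolkit here, and the tiny training set is made up.

```python
# Sketch: CL-WSD as sequence labeling with a linear-chain CRF.
import sklearn_crfsuite

def sentence_to_features(tokens):
    return [{"word=" + t: 1.0,
             "prev=" + (tokens[i - 1] if i else "<s>"): 1.0}
            for i, t in enumerate(tokens)]

# Hypothetical training data: token sequences paired with the aligned
# Guarani word at each position, "O" where nothing aligns.
X_train = [sentence_to_features(["la", "casa", "es", "grande"])]
y_train = [["O", "óga", "O", "guasu"]]  # óga: 'house', guasu: 'big'

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict([sentence_to_features(["una", "casa", "nueva"])]))
```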
  • 8: all together now: combining different signals (a brief sketch follows this chapter's items)
    • CL-WSD experiments
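
With dict-valued features throughout, combining the different signals can be as simple as merging the feature dictionaries from the earlier sketches before vectorizing; a minimal illustration, assuming the helper functions defined above:

```python
def combined_features(tokens, index, helper_classifiers):
    """Union of the feature dicts from the earlier sketches: baseline
    bag-of-words, monolingual embeddings, and stacked multilingual
    predictions all land in one training instance."""
    feats = context_features(tokens, index)
    feats.update(embedding_features(tokens, index))
    feats.update(stacked_features(tokens, index, helper_classifiers))
    return feats
```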
  • 9: integration into MT systems
    • SQUOIA (skip?)
    • Terere: hybrid MT with cdec for es-gn
      • language models for Guarani
      • CL-WSD at runtime with cdec in Python (the reranking idea is sketched after this chapter's items)
      • getting morphology right
      • important: don't spend too much time on this, e.g., on comparing different SMT styles
      • training by symmetrization to extract phrases: those phrase pairs are our training set for CL-PSD
        • this may be really important for es-gn: Spanish phrases often translate to single Guarani words
    • MT experiments
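
A decoder-agnostic sketch of the runtime-integration idea: score each candidate translation of a source word with that word's CL-WSD classifier and expose the log-probability as an extra feature for the decoder to weight alongside the LM and TM scores. The glue to cdec's Python interface is omitted; clf is the pipeline from the baseline sketch, and the smoothing constant is arbitrary.

```python
import math

def clwsd_feature(clf, tokens, index, candidate):
    """Log-probability the word's classifier assigns to this candidate
    translation; unseen candidates get a small smoothed score."""
    if candidate not in list(clf.classes_):
        return math.log(1e-6)
    probs = clf.predict_proba([context_features(tokens, index)])[0]
    return math.log(probs[list(clf.classes_).index(candidate)])
```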
  • 10: conclusions
    • we can do it
    • it works well
    • nobody had investigated these particular strategies before: now we can make better word choices when translating from a resource-rich language to a resource-poor one