Outline
- 1: introduction, overview
    - background for the problem: WSD in machine translation, why CL-WSD matters, and some good lexical choice examples
    - thesis statement: CL-WSD is a feasible and practical means for lexical selection in a hybrid MT system for a language pair with relatively modest resources.
    - intuition: when translating into a lower-resourced language, our target-language LM cannot do as much work in lexical selection for us, so we need to do more work understanding the source-language text. Hence, WSD.
    - we'll demonstrate this with a number of CL-WSD experiments (showing we can do it), and integrate CL-WSD into a number of MT systems (showing that it helps).
    - research questions
        - Can we use monolingual evidence from the source language to help with CL-WSD?
        - Can we use multilingual evidence relating the source language to other languages?
        - Can we use sequence labeling techniques?
        - Which kinds of MT systems can benefit from CL-WSD?
    - structure of the rest of the dissertation
        - here's what happens in the subsequent chapters
- 2: background
    - history of WSD and CL-WSD
    - hybrid MT
    - Guarani language
    - Quechua language
- 3: related work
    - CL-WSD approaches in vitro
    - WSD with sequence models
    - lexical selection in RBMT
    - CL-WSD for SMT
    - WSD for lower-resourced languages
    - SemEval and Senseval
    - translation into MRLs
- 4: overview of tasks, evaluation, baseline system
    - measuring CL-WSD classification accuracy
    - measuring MT improvements (extrinsic evaluation)
    - using the Bible: how and why
        - background on using the Bible as a corpus ...
        - preprocessing steps
        - exploring our bitext
    - description of the baseline system (see the sketch below)
        - source-side annotations
    - classification results: the baseline system
        - a Zipfian observation: the most common words tend to be the most polysemous
        - test set on the 100 most common words
        - test set on Bible sentences: sample a bunch of sentences that use those 100 most common words
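As a concrete stand-in for the evaluation setup, here is a minimal sketch of a most-frequent-translation baseline scored on classification accuracy. This is illustration only: the actual baseline system and its features are what the chapter describes, and the Spanish-Guarani pairs below are toy data.

```python
from collections import Counter, defaultdict

def train_mft(pairs):
    """Most-frequent-translation baseline: for each source word, memorize
    the target translation it most often received in training."""
    counts = defaultdict(Counter)
    for src_word, translation in pairs:
        counts[src_word][translation] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def accuracy(model, pairs):
    """Fraction of instances where the predicted translation matches gold."""
    return sum(model.get(w) == gold for w, gold in pairs) / len(pairs)

# Toy Spanish -> Guarani labeled instances (hypothetical data).
train = [("casa", "óga"), ("casa", "óga"), ("casa", "tapỹi"), ("agua", "y")]
test = [("casa", "óga"), ("agua", "y"), ("casa", "tapỹi")]
print(accuracy(train_mft(train), test))  # 2/3: "tapỹi" loses to "óga"
```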
- 5: learning from monolingual data
    - unsupervised learning from big corpora: word representations
        - word2vec: embeddings (see the sketch below)
        - Brown clusters
    - extracting features with existing NLP tools: analysis with FreeLing
    - CL-WSD experiments
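A minimal sketch of the word2vec side, assuming the gensim package (the `vector_size` parameter is called `size` in gensim versions before 4.0). The corpus and the context-averaging scheme are illustrative stand-ins, not necessarily the features used in the experiments.

```python
import numpy as np
from gensim.models import Word2Vec

# Tokenized source-language (Spanish) sentences; a real corpus goes here.
sentences = [
    ["la", "casa", "es", "grande"],
    ["el", "agua", "está", "fría"],
]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

def context_embedding(tokens, i):
    """Average the embeddings of the words surrounding the target token,
    giving a dense context representation for a CL-WSD classifier."""
    vectors = [model.wv[w] for w in tokens[:i] + tokens[i + 1:] if w in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

print(context_embedding(["la", "casa", "es", "grande"], 1).shape)  # (100,)
```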
- 6: learning from multilingual data
    - Europarl
    - prepackaged multilingual data: the PPDB
    - other bitext for Spanish
    - joint classification with MRFs
    - classifier stacking (see the sketch below)
    - CL-WSD experiments
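For classifier stacking, a minimal sketch assuming scikit-learn; the features, helper language, and label sets are random toy stand-ins. The idea: first-level classifiers predict translations into helper languages, and their outputs become extra features for the final es-gn classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(200, 20)                 # contextual features for one source word
y_helper = rng.randint(0, 3, 200)     # its translations into a helper language (toy)
y_target = rng.randint(0, 2, 200)     # its Guarani translations (toy)

# First level: predict the helper-language translation.
helper = LogisticRegression(max_iter=1000).fit(X, y_helper)

# Stack: append the helper classifier's probability estimates to the features.
X_stacked = np.hstack([X, helper.predict_proba(X)])
final = LogisticRegression(max_iter=1000).fit(X_stacked, y_target)
print(final.score(X_stacked, y_target))
```

In practice the helper predictions fed to the second level should come from cross-validation, so the stacked classifier never sees predictions the helper made on its own training data.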
- sequence models
    - MEMMs
    - linear chain CRFs (maybe; see the sketch below)
    - CL-WSD experiments
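A minimal sketch of the sequence-labeling formulation, assuming the sklearn-crfsuite package; the features and the two toy "sentences" labeled with Guarani translations are illustrative only.

```python
import sklearn_crfsuite

def token_features(sentence, i):
    """Simple per-token features: the word itself and its neighbors."""
    feats = {"word": sentence[i]}
    if i > 0:
        feats["prev_word"] = sentence[i - 1]
    if i < len(sentence) - 1:
        feats["next_word"] = sentence[i + 1]
    return feats

# Each training "sentence" is labeled token-by-token with target-language
# translations (toy Spanish tokens, toy Guarani labels).
X_train = [[token_features(s, i) for i in range(len(s))]
           for s in [["la", "casa", "grande"], ["el", "agua", "fría"]]]
y_train = [["la", "óga", "guasu"], ["el", "y", "ro'ysã"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```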
- all together now: combining different signals (see the sketch below)
    - CL-WSD experiments
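One simple way to combine the signals, sketched here with scikit-learn's DictVectorizer: merge the per-signal feature dictionaries under distinguishing prefixes. The extractors below are toy stand-ins for the contextual and helper-language features of the earlier chapters.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def context_features(tokens, i):
    # Toy contextual signal: the word and its left neighbor.
    return {"word": tokens[i], "prev": tokens[i - 1] if i > 0 else "<s>"}

def helper_language_features(tokens, i):
    # Toy stand-in for helper-language classifier predictions.
    return {"fr_pred": "maison" if tokens[i] == "casa" else "autre"}

def combined(tokens, i):
    """Merge the signals, prefixing keys so their sources stay distinct."""
    feats = {}
    for prefix, extract in [("ctx_", context_features),
                            ("multi_", helper_language_features)]:
        feats.update({prefix + k: v for k, v in extract(tokens, i).items()})
    return feats

data = [(["la", "casa", "grande"], 1, "óga"),
        (["el", "agua"], 1, "y")]
X = DictVectorizer().fit_transform([combined(t, i) for t, i, _ in data])
y = [label for _, _, label in data]
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X))
```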
- integration into MT systems
    - SQUOIA (skip?)
    - Terere: hybrid MT with cdec for es-gn
        - language models for Guarani
        - CL-WSD at runtime with cdec in Python
        - getting morphology right
            - important: don't spend too much time on this, like comparing different SMT styles
        - training by symmetrization to extract phrases: that's our training set for CL-PSD (see the sketch below)
            - maybe this is really important for es-gn -- Spanish phrases often translate to single Guarani words.
    - MT experiments
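A minimal sketch of the phrase-extraction step, assuming the word alignment has already been symmetrized (e.g., grow-diag-final). This simplified version applies the standard consistency criterion but skips the usual extension over unaligned boundary words. The toy example shows the point above: the Spanish phrase "la casa" comes out paired with the single Guarani word "óga".

```python
def extract_phrases(src, tgt, alignment, max_len=3):
    """Collect (source phrase, target phrase) pairs consistent with the
    word alignment: no link may cross the phrase pair's boundary."""
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to this source span.
            linked = [t for (s, t) in alignment if i1 <= s <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            # Consistency: every link into [j1, j2] must start in [i1, i2].
            if all(i1 <= s <= i2 for (s, t) in alignment if j1 <= t <= j2):
                pairs.append((" ".join(src[i1:i2 + 1]),
                              " ".join(tgt[j1:j2 + 1])))
    return pairs

# Toy es-gn example; alignment links are (source index, target index).
src = ["la", "casa", "grande"]
tgt = ["óga", "guasu"]
alignment = [(0, 0), (1, 0), (2, 1)]
print(extract_phrases(src, tgt, alignment))
# [('la casa', 'óga'), ('la casa grande', 'óga guasu'), ('grande', 'guasu')]
```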
- conclusions
    - we can do it
    - it works well
    - nobody had investigated these particular strategies before: now we can make better word choices when translating from a resource-rich language to a resource-poor one