  • 1: introduction, overview
    • background for the problem: WSD in machine translation, why CL-WSD matters and some good lexical choice examples
    • thesis statement: CL-WSD is a feasible and practical means for lexical selection in a hybrid MT system for a language pair with relatively modest resources.
    • Intuition: when translating into a lower-resourced language, our target-language LM cannot do as much of the lexical selection work for us, so we need to do more work understanding the source-language text. Hence, WSD.
    • We'll demonstrate this with a number of CL-WSD experiments (to show that we can do it) and by integrating CL-WSD into several MT systems (to show that it helps).
    • research questions
      • Can we use monolingual evidence from the source language to help with CL-WSD?
      • Can we use multilingual evidence from bitext pairing the source language with other languages?
      • Can we use sequence labeling techniques?
      • Which kinds of MT systems can benefit from CL-WSD?
    • structure of the rest of the dissertation
      • here's what happens in the subsequent chapters
  • 2: background
    • history of WSD and CL-WSD
    • hybrid MT
    • Guarani language
    • Quechua language
  • 3: related work
    • CL-WSD approaches in vitro
    • WSD with sequence models
    • lexical selection in RBMT
    • CL-WSD for SMT
    • WSD for lower-resourced languages
    • SemEval and Senseval
    • translation into morphologically rich languages (MRLs)
  • 4: overview of tasks, evaluation, baseline system
    • measuring CL-WSD classification accuracy
    • measuring MT improvements (extrinsic evaluation)
    • using the Bible: how and why
      • background on using the Bible as a corpus ...
      • preprocessing steps
    • exploring our bitext
    • description of the baseline system (a minimal sketch follows this chapter's items)
    • source-side annotations
    • classification results: the baseline system
      • Zipfian observation: the most common words tend to be the most polysemous.
      • test set on the 100 most common words
      • test set on Bible sentences: sample sentences that use those 100 most common words
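
A minimal sketch of the baseline classifier idea from this chapter: one classifier per ambiguous Spanish word, with labels drawn from the aligned Guarani translations in the bitext and bag-of-words context features. scikit-learn is used here for concreteness (not necessarily the toolkit in the actual experiments), and the training instances are made up.

```python
# Sketch of a baseline CL-WSD classifier: one classifier per source word;
# labels are the target-language translations observed in the aligned bitext.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def context_features(tokens, index, window=3):
    """Bag-of-words features from a window around the focus word."""
    feats = {}
    for i in range(max(0, index - window), min(len(tokens), index + window + 1)):
        if i != index:
            feats["bow=" + tokens[i]] = 1.0
    return feats

# Hypothetical training instances for Spanish "casa": (sentence tokens,
# position of the focus word, Guarani label from the word alignments).
instances = [
    (["la", "casa", "es", "grande"], 1, "óga"),      # óga: 'house'
    (["la", "casa", "de", "cambio"], 1, "SENSE_2"),  # placeholder label
]
X = [context_features(toks, i) for toks, i, _ in instances]
y = [label for _, _, label in instances]

clf = Pipeline([("vec", DictVectorizer()), ("lr", LogisticRegression())])
clf.fit(X, y)
print(clf.predict([context_features(["una", "casa", "nueva"], 1)]))
```

Measuring classification accuracy is then just comparing predictions like these against held-out aligned translations.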
  • 5: learning from monolingual data
    • unsupervised learning from big corpora: word representations
    • word2vec embeddings (see the sketch after this chapter's items)
    • Brown clusters
    • extracting features with existing NLP tools: analysis with FreeLing
    • CL-WSD experiments
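
A sketch of how the word representations above could plug into those classifiers, using gensim's word2vec (gensim >= 4.0 API) as one convenient implementation; the toy corpus and feature names are illustrative. Brown cluster features would be analogous, with one string-valued feature per cluster-bitstring prefix instead of real-valued dimensions.

```python
# Sketch: unsupervised word representations as extra CL-WSD features.
import numpy as np
from gensim.models import Word2Vec

# A real setup would train on a large monolingual Spanish corpus;
# this toy corpus only makes the sketch runnable.
corpus = [["la", "casa", "es", "grande"], ["una", "casa", "nueva"]]
w2v = Word2Vec(corpus, vector_size=50, window=5, min_count=1, epochs=20)

def embedding_features(tokens, index, window=3):
    """Average the embeddings of context words and expose each dimension
    as a real-valued feature that a linear classifier can use."""
    context = [t for i, t in enumerate(tokens)
               if i != index and abs(i - index) <= window and t in w2v.wv]
    if not context:
        return {}
    mean = np.mean([w2v.wv[t] for t in context], axis=0)
    return {"emb_%d" % d: float(v) for d, v in enumerate(mean)}
```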
  • 6: learning from multilingual data
    • Europarl
    • prepackaged multilingual data: the PPDB
    • other bitext for Spanish
    • joint classification with MRFs (Markov random fields)
    • classifier stacking (sketched after this chapter's items)
    • CL-WSD experiments
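
A sketch of the classifier-stacking idea, reusing context_features from the baseline sketch: classifiers trained on bitext pairing Spanish with better-resourced languages (say es-fr and es-it from Europarl) make predictions that become features for the es-gn classifier. The helper_classifiers mapping is hypothetical.

```python
def stacked_features(tokens, index, helper_classifiers):
    """Baseline context features plus one feature per helper language,
    recording the translation that language's classifier predicts."""
    feats = context_features(tokens, index)
    for lang, clf in helper_classifiers.items():
        pred = clf.predict([context_features(tokens, index)])[0]
        feats["%s_pred=%s" % (lang, pred)] = 1.0
    return feats
```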
  • 7: sequence models
    • MEMMs (maximum-entropy Markov models)
    • linear-chain CRFs (maybe; see the sketch after this chapter's items)
    • CL-WSD experiments
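
One way to make the sequence-labeling framing concrete: treat each sentence as a sequence in which the label at each position is the aligned target-language word ("O" for unaligned tokens), so a linear-chain CRF can score neighboring translation decisions jointly. sklearn-crfsuite is just one possible toolkit here, and the tiny training set is made up.

```python
# Sketch: CL-WSD as sequence labeling with a linear-chain CRF.
import sklearn_crfsuite

def sentence_to_features(tokens):
    return [{"word=" + t: 1.0,
             "prev=" + (tokens[i - 1] if i else "<s>"): 1.0}
            for i, t in enumerate(tokens)]

# Hypothetical training data: token sequences paired with the aligned
# Guarani word at each position, "O" where nothing aligns.
X_train = [sentence_to_features(["la", "casa", "es", "grande"])]
y_train = [["O", "óga", "O", "guasu"]]  # óga: 'house', guasu: 'big'

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict([sentence_to_features(["una", "casa", "nueva"])]))
```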
  • 8: all together now: combining different signals (a brief sketch follows this chapter's items)
    • CL-WSD experiments
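
With dict-valued features throughout, combining the different signals can be as simple as merging the feature dictionaries from the earlier sketches before vectorizing; a minimal illustration, assuming the helper functions defined above:

```python
def combined_features(tokens, index, helper_classifiers):
    """Union of the feature dicts from the earlier sketches: baseline
    bag-of-words, monolingual embeddings, and stacked multilingual
    predictions all land in one training instance."""
    feats = context_features(tokens, index)
    feats.update(embedding_features(tokens, index))
    feats.update(stacked_features(tokens, index, helper_classifiers))
    return feats
```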
  • 9: integration into MT systems
    • SQUOIA (skip?)
    • Terere: hybrid MT with cdec for es-gn
      • language models for Guarani
      • CL-WSD at runtime with cdec in Python (the reranking idea is sketched after this chapter's items)
      • getting morphology right
      • important: don't spend too much time on this, e.g., on comparing different SMT styles
      • training by symmetrization to extract phrases: those phrase pairs are our training set for CL-PSD
        • this may be really important for es-gn: Spanish phrases often translate to single Guarani words
    • MT experiments
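
A decoder-agnostic sketch of the runtime-integration idea: score each candidate translation of a source word with that word's CL-WSD classifier and expose the log-probability as an extra feature for the decoder to weight alongside the LM and TM scores. The glue to cdec's Python interface is omitted; clf is the pipeline from the baseline sketch, and the smoothing constant is arbitrary.

```python
import math

def clwsd_feature(clf, tokens, index, candidate):
    """Log-probability the word's classifier assigns to this candidate
    translation; unseen candidates get a small smoothed score."""
    if candidate not in list(clf.classes_):
        return math.log(1e-6)
    probs = clf.predict_proba([context_features(tokens, index)])[0]
    return math.log(probs[list(clf.classes_).index(candidate)])
```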
  • 10: conclusions
    • we can do it
    • it works well
    • nobody had investigated these particular strategies before: now we can make better word choices when translating from a resource-rich language to a resource-poor one