The code in this repository supports the experiments in the paper "On a Novel Application of Wasserstein-Procrustes for Unsupervised Cross-Lingual Learning".
The script iterative_hungarian.py takes an initialisation matrix W_0 and refines it.
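The refinement idea can be illustrated with a minimal sketch. It is not the repository's implementation: the function name, the argument names, and the convention that embeddings are rows mapped as `X @ W` are all assumptions made for illustration. Each round fixes W and solves a one-to-one matching with the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`), then fixes the matching and solves orthogonal Procrustes in closed form via an SVD.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def iterative_hungarian_sketch(X, Y, W0, n_refinements=5):
    """Alternate a Hungarian matching with an orthogonal Procrustes update.

    X, Y: (n, d) arrays of source / target embeddings, one word per row.
    W0:   (d, d) initial mapping, applied as X @ W0 (assumed convention).
    Returns the refined mapping and the final matching (target row per source row).
    """
    W = W0
    cols = np.arange(X.shape[0])
    for _ in range(n_refinements):
        # Step 1: with W fixed, pick the one-to-one matching that maximises
        # the total inner product between mapped source rows and target rows.
        similarity = (X @ W) @ Y.T
        rows, cols = linear_sum_assignment(similarity, maximize=True)
        # Step 2: with the matching fixed, solve
        # argmin_W ||X[rows] @ W - Y[cols]||_F over orthogonal W.
        U, _, Vt = np.linalg.svd(X[rows].T @ Y[cols])
        W = U @ Vt
    return W, cols
```

The similarity matrix is dense with n x n entries, and the Hungarian step scales roughly cubically in n, so a sketch like this is only practical on a limited number of rows.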
The experiments from Section 5.1 can be reproduced as follows (the example uses English-Spanish):
The source and target embeddings can be downloaded as follows (change the link for other languages):
- English fastText Wikipedia embeddings:
curl -Lo wiki.en.vec https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec
- Spanish fastText Wikipedia embeddings:
curl -Lo wiki.es.vec https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.es.vec
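The downloaded `.vec` files are plain text: a header line with the word count and the dimension, then one line per word holding the token followed by its vector components. A minimal loader sketch, assuming that layout, is shown here; `load_vec` and `max_rows` are hypothetical names and are not part of this repository.

```python
import numpy as np


def load_vec(path, max_rows=None):
    """Read a fastText .vec text file into a word list and a float32 matrix."""
    words, vectors = [], []
    with open(path, encoding="utf-8", errors="ignore") as f:
        # Header line: "<number of words> <dimension>"
        _, dim = map(int, f.readline().split())
        for line in f:
            if max_rows is not None and len(words) >= max_rows:
                break
            token, *values = line.rstrip().split(" ")
            if len(values) != dim:
                continue  # skip malformed lines (e.g. unusual tokens)
            words.append(token)
            vectors.append(np.array([float(v) for v in values], dtype=np.float32))
    return words, np.vstack(vectors)
```

For example, `words, X = load_vec("wiki.en.vec", max_rows=45000)` would keep only the first 45000 well-formed rows of the file.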
Obtaining the initialisation matrix
- MUSE:
python unsupervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec --n_refinement 5
- Procrustes:
python supervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec --n_refinement 5 --dico_train default
- ICP:
python get_data.py
python run_icp.py
python eval.py
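For orientation, the Procrustes initialisation has a closed form once a seed dictionary is available. The sketch below shows only that textbook computation under an assumed row-vector convention (source mapped as `X @ W`); `procrustes_init` and `pairs` are hypothetical names, and the sketch does not reproduce what `supervised.py` does internally.

```python
import numpy as np


def procrustes_init(X, Y, pairs):
    """Closed-form orthogonal map from a seed dictionary.

    X, Y:  (n_src, d) and (n_tgt, d) embedding matrices, one word per row.
    pairs: list of (source_row, target_row) index pairs from the dictionary.
    Returns the orthogonal W minimising ||X[src] @ W - Y[tgt]||_F.
    """
    src_idx, tgt_idx = zip(*pairs)
    A, B = X[list(src_idx)], Y[list(tgt_idx)]
    # SVD of the cross-covariance gives the optimal rotation U @ Vt.
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt
```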
Running IH (iterative Hungarian):
python iterative_hungarian.py --grows 45000 --write_path AUX --src_path PATH_SRC_EMBEDDINGS --tgt_path PATH_TGT_EMBEDDINGS --w_path PATH_INITIALIZATION_MATRIX --nrefin 5
The experiments from Section 5.2 can be reproduced as follows:
- Word embeddings are obtained with fastText, following the instructions in the paper "Unsupervised Alignment of Embeddings with Wasserstein Procrustes":
python iterative_hungarian.py --grows 10000 --write_path AUX --src_path PATH_SRC_EMBEDDINGS --tgt_path PATH_TGT_EMBEDDINGS --w_path PATH_INITIALIZATION_MATRIX