Cantonese-Mandarin Machine Translation

Package management

If using conda:

conda create -n 2590-project poetry
conda activate 2590-project
poetry install --no-root

If not using conda, first install Poetry, then run:

poetry install --no-root

To install a new package into the virtual environment, first add it to pyproject.toml, then run:

poetry lock
poetry install --no-root

ALMA fine-tuning

Code and documentation are in the ALMA submodule.

Hugging Face model: https://huggingface.co/superaidesu/cantonese-alma-2-7b-oasst-v1-lora

Evaluation script and results

Run evaluate.py -h to see the options. The script takes a plain-text output file, whose rows are the model translations, and outputs BLEU and ChrF++ scores.
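To illustrate the input format (one translation per row, scored against references), here is a minimal standard-library sketch. It is not the code in evaluate.py: the helper names `read_rows` and `char_fscore` are hypothetical, the parallel reference rows are an assumption, and the metric is a simplified character n-gram F-score in the spirit of ChrF. ChrF++ additionally counts word n-grams, which this sketch omits, so its numbers will not match the tables below.

```python
from collections import Counter
from pathlib import Path


def read_rows(path):
    """Read a plain-text file and return one stripped string per row."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines()]


def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def char_fscore(hypotheses, references, max_n=2, beta=2.0):
    """Simplified corpus-level character n-gram F-beta (0 to 100).

    An F-score is computed for each n-gram order from 1 to max_n and the
    results are averaged. This is an illustrative stand-in, not ChrF++.
    """
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same number of rows")
    scores = []
    for n in range(1, max_n + 1):
        matched = hyp_total = ref_total = 0
        for hyp, ref in zip(hypotheses, references):
            hyp_counts, ref_counts = char_ngrams(hyp, n), char_ngrams(ref, n)
            matched += sum((hyp_counts & ref_counts).values())  # clipped matches
            hyp_total += sum(hyp_counts.values())
            ref_total += sum(ref_counts.values())
        if matched == 0 or hyp_total == 0 or ref_total == 0:
            scores.append(0.0)
            continue
        precision = matched / hyp_total
        recall = matched / ref_total
        scores.append((1 + beta**2) * precision * recall / (beta**2 * precision + recall))
    return 100 * sum(scores) / len(scores)


# Made-up example rows, not taken from the project data.
hypotheses = ["我哋去食飯", "佢喺邊度"]
references = ["我哋去食飯", "佢而家喺邊度"]
print(char_fscore(hypotheses, references))
```

With real files, the two lists would come from `read_rows` applied to the model output file and to a reference file with the same number of rows.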

There are 2 directions (Mandarin-Cantonese, Cantonese-Mandarin), 2 data sources (Main, Tatoeba), 2 splits (validation, test), and 2 metrics, giving 16 values for each method.
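The count of 16 can be checked by enumerating the combinations. This is a small sketch; the label strings are illustrative and are not taken from evaluate.py.

```python
from itertools import product

directions = ["Mandarin-Cantonese", "Cantonese-Mandarin"]
sources = ["main", "tatoeba"]
splits = ["validation", "test"]
metrics = ["BLEU", "ChrF++"]

# One entry per reported value for a single method.
combinations = list(product(directions, sources, splits, metrics))
for direction, source, split, metric in combinations:
    print(f"{direction} | {source} | {split} | {metric}")

print(len(combinations))  # 2 * 2 * 2 * 2 = 16
```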

Mandarin to Cantonese:

Method(validation/test)	BLEU(main)	BLEU(tatoeba)	ChrF++(main)	ChrF++(tatoeba)
Naive baseline	12.356/13.103	21.987/24.761	10.872/11.373	16.195/16.645
Existing work	./24.941	./36.878	./19.056	./24.717
0-shot	10.458/11.217	20.286/20.977	9.824/10.368	15.641/16.085
5-shot	9.825/10.346	18.119/18.895	8.946/9.059	13.818/14.584
Finetuned	37.738/35.371	49.522/44.583	28.841/26.197	38.884/35.274
GPT-3.5-turbo	/25.840		/20.326

Cantonese to Mandarin:

Method(validation/test)	BLEU(main)	BLEU(tatoeba)	ChrF++(main)	ChrF++(tatoeba)
Naive baseline	12.437/13.181	21.974/24.711	10.617/11.136	16.450/16.825
Existing work	./16.534	./28.999	./13.259	./20.304
0-shot	7.749/8.411	13.687/13.208	7.177/7.643	11.061/10.995
5-shot	5.849/6.723	10.582/12.332	6.552/6.832	9.520/10.064
Finetuned	36.469/36.553	44.444/47.719	27.028/27.471	31.874/37.925
GPT-3.5-turbo	/28.232		/23.845
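Each table cell packs two numbers as validation/test. For further analysis the cells can be split programmatically. The sketch below assumes that "." or an empty field means no score was reported; `parse_cell` is a hypothetical helper and is not part of the repository.

```python
from typing import Optional, Tuple


def parse_cell(cell: str) -> Tuple[Optional[float], Optional[float]]:
    """Split a 'validation/test' cell into two floats; None where no score is given."""
    cell = cell.strip()
    if not cell:
        return (None, None)
    validation, _, test = cell.partition("/")

    def to_score(text: str) -> Optional[float]:
        text = text.strip()
        # Assumption: "." and empty fields mark a score that was not reported.
        return float(text) if text not in ("", ".") else None

    return (to_score(validation), to_score(test))


# Cells copied from the BLEU(main) column of the Mandarin to Cantonese table.
bleu_main = {
    "Existing work": "./24.941",
    "Finetuned": "37.738/35.371",
    "GPT-3.5-turbo": "/25.840",
}

test_scores = {method: parse_cell(cell)[1] for method, cell in bleu_main.items()}
gain = round(test_scores["Finetuned"] - test_scores["Existing work"], 3)
print(test_scores)
print(gain)  # test-set BLEU difference between Finetuned and Existing work
```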

