# Zulu to English #190

Open · wants to merge 6 commits into `master`
`benchmarks/zu-en/jw300-baseline/README.md` (70 additions, 0 deletions)
## Data
JW300: Zulu to English (the reverse of the English-Zulu machine translation direction)

Author: Musa Ntuli


## Model
[Google Drive folder with the model](https://drive.google.com/drive/folders/10MSShnH8-V4ssOCpbEvFW4ZdVkop7iVd?usp=sharing)


## Model Architecture
### Text Preprocessing
- Removed blank/empty rows: 9037 (0.85 %) samples
- Removed duplicates from the source text: 82999 (7.88 %) samples
- Removed duplicates from the target text: 5045 (0.52 %) samples
- Removed all numeric-only text: 182 (0.02 %) samples
- Removed rows where the source text is 8 characters long or fewer: 6272 (0.65 %) samples
- Removed rows where the target text is 8 characters long or fewer: 713 (0.07 %) samples
- Removed rows whose text also appears in the test set: 1068 (0.11 %) samples
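
The steps above map naturally onto a few pandas filters. Below is a minimal sketch, assuming the parallel corpus sits in a DataFrame with `source` and `target` columns; the column names, the `test_sentences` set, and the numeric-only pattern are illustrative assumptions, not taken from the original notebook.

```python
import pandas as pd

def clean_corpus(df: pd.DataFrame, test_sentences: set) -> pd.DataFrame:
    """Apply the filters listed above to a source/target parallel corpus."""
    # Remove blank/empty rows.
    df = df.dropna(subset=["source", "target"])
    df = df[df["source"].str.strip().astype(bool)]
    df = df[df["target"].str.strip().astype(bool)]
    # Remove duplicates from the source text, then from the target text.
    df = df.drop_duplicates(subset="source")
    df = df.drop_duplicates(subset="target")
    # Remove numeric-only text (digits plus whitespace and basic punctuation).
    df = df[~df["source"].str.fullmatch(r"[\d\s.,]+")]
    # Remove rows where either side is 8 characters long or fewer.
    df = df[(df["source"].str.len() > 8) & (df["target"].str.len() > 8)]
    # Remove rows that also appear in the test set (guards against leakage).
    df = df[~df["source"].isin(test_sentences)]
    return df
```

Note that the filters run in the order listed, so the per-step sample counts depend on that order.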

### BPE Tokenization
- Vocabulary size: 4000 (gave better results than a 10x larger vocabulary)
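
A BPE model at the chosen vocabulary size can be trained as sketched below. This uses SentencePiece as a stand-in; the original notebook may use a different BPE implementation (e.g. subword-nmt), and the file names are illustrative.

```python
import sentencepiece as spm

# Train a joint BPE model over the combined Zulu+English training text.
spm.SentencePieceTrainer.train(
    input="train.zuen.txt",  # one sentence per line, both languages concatenated
    model_prefix="bpe_zuen",
    model_type="bpe",
    vocab_size=4000,         # 4000 outperformed a 10x larger vocabulary here
)

sp = spm.SentencePieceProcessor(model_file="bpe_zuen.model")
print(sp.encode("Ukuhlakanipha nokuqonda nako kwakubalulekile .", out_type=str))
```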

### Model Config
- Details are in the supplied config file; the model uses fewer transformer layers than the default notebook, with more attention heads and a smaller embedding size
- Trained for 235000 steps
- Training took a few hours at a time on a single P100 GPU on Google Colab, spread over three days (training was stopped, the best model saved, and that model reloaded the next day)
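
The stop-and-resume routine can be scripted around Joey NMT's command-line entry point. A hedged sketch, assuming the config sets a `model_dir` for checkpoints and supports resuming via a `load_model` entry in its training section (standard Joey NMT behaviour, but worth verifying against the supplied config; the config path is illustrative):

```python
import subprocess

# Day 1: train until the Colab session runs out. Joey NMT checkpoints the
# best-scoring model in the configured model_dir as it goes.
subprocess.run(
    ["python", "-m", "joeynmt", "train", "configs/transformer_zuen.yaml"],
    check=True,
)

# Next day: point `load_model` in the config's training section at the saved
# best checkpoint, then rerun the same command to continue toward 235000 steps.
```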

## Analysis

Example #0
Source: 20 Ungaqonda ukuthi kungani kunjalo ngabaningi .
Reference: 20 You can understand why that is the case with many .
Hypothesis: 20 You can understand why so many .

Example #1
Source: OFakazi BakaJehova endaweni yakini bayokujabulela ukuxoxa nawe okwengeziwe mayelana nemfundo yeBhayibheli eqhutshwa emphakathini wakini . *
Reference: Jehovah ’ s Witnesses in your area will be glad to share more information with you about the Bible education program that is currently being carried on in your community . *
Hypothesis: Jehovah ’ s Witnesses in your area will enjoy discussing more about Bible education that are conducted in your society . *

Example #2
Source: Ukuhlakanipha nokuqonda nako kwakubalulekile .
Reference: Wisdom and discretion were also important .
Hypothesis: Wisdom and discernment was important .

Example #3
Source: Ngemva komhlangano wezizwe eRome ngalelo hlobo , ngaba nelungelo lokuba khona emhlanganweni owawuseNuremberg , eJalimane .
Reference: After the international convention in Rome that summer , I was privileged to attend the convention in Nuremberg , Germany .
Hypothesis: After the international convention in Rome that summer , I had the privilege of attending the convention in Nuremberg , Germany .


## Results
- BLEU dev: 28.48
- BLEU test: 38.33

> **Collaborator:** These are great results. Thank you also for sharing the insights into the data processing and the analysis.
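
As an external sanity check on scores like these, BLEU can be recomputed over the decoded test set with sacrebleu. This is an assumption about tooling (Joey NMT also reports BLEU itself), and the file names are illustrative:

```python
import sacrebleu

with open("test.hyps.en", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("test.refs.en", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# corpus_bleu expects a list of hypothesis strings and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
```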

### Curious analysis of the tokenization
> There are 66255 English tokens in the test set, of which 2072 are unique
>
> There are 67851 Zulu tokens in the test set, of which 2336 are unique
>
> These counts were computed in the same notebook used for training. (Could something similar help inform BPE vocabulary size choices?)
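
Counts like these are cheap to reproduce. A minimal sketch over whitespace-tokenized test files, with illustrative file names:

```python
from collections import Counter

def token_stats(path):
    """Return (total tokens, unique tokens) for a whitespace-tokenized file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    return sum(counts.values()), len(counts)

for name, path in [("English", "test.en"), ("Zulu", "test.zu")]:
    total, unique = token_stats(path)
    print(f"{total} {name} tokens in the test set, {unique} unique")
```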

### Translation results
> 2019-11-13 07:43:32,728 Hello! This is Joey-NMT.
>
> 2019-11-13 07:44:03,502 dev bleu: 13.64 [Beam search decoding with beam size = 5 and alpha = 1.0]
>
> 2019-11-13 07:44:24,289 test bleu: 4.87 [Beam search decoding with beam size = 5 and alpha = 1.0]
> **Collaborator:** How do these results square with the ones reported above?
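
For reference, a log like the one quoted is produced by Joey NMT's `test` mode, which decodes the dev and test sets with the beam settings from the config's `testing` section (config path illustrative):

```python
import subprocess

# Decodes dev and test with beam size 5 and alpha 1.0 (taken from the config)
# and logs "dev bleu" / "test bleu" lines like those quoted above.
subprocess.run(
    ["python", "-m", "joeynmt", "test", "configs/transformer_zuen.yaml"],
    check=True,
)
```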


Download the model weights from [here](https://drive.google.com/open?id=1-QLxP7xLqu-AqDQkm1XaCtDEex1Oseo0)