Zulu to English #190
Open: godide wants to merge 6 commits into masakhane-io:master from godide:master
Commits
- ddc708f Created using Colaboratory (godide)
- a342e8a Latest version (godide)
- 6c8e295 Successfully ran for epoch equal to 1. Updated paths for reverse trai… (godide)
- 7f9871f Updated epoch to 2 rather than reducing truncating the dataset. (godide)
- 1c8820a TODO detokenized the data. (godide)
- 54ff0c5 Musa Ntuli Zulu to English (musa-ntuli)
@@ -0,0 +1,70 @@
## Data
JW300: Zulu to English (English-Zulu Reverse Machine Translation)

Author: Musa Ntuli

## Model
[Link to Google Drive folder with model](https://drive.google.com/drive/folders/10MSShnH8-V4ssOCpbEvFW4ZdVkop7iVd?usp=sharing)

## Model Architecture
### Text Preprocessing
- Removed blank/empty rows: 9037 (0.85 %) samples
- Removed duplicates from source text: 82999 (7.88 %) samples
- Removed duplicates from target text: 5045 (0.52 %) samples
- Removed all numeric-only text: 182 (0.02 %) samples
- Removed rows where the source text is 8 characters or fewer: 6272 (0.65 %) samples
- Removed rows where the target text is 8 characters or fewer: 713 (0.07 %) samples
- Removed rows whose text appears in the test set: 1068 (0.11 %) samples
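
The filtering steps above could be sketched roughly as follows. This is a minimal single-pass illustration, not the notebook's actual code: the function names, the definition of "numeric-only" (at least one digit and no letters), and checking test-set overlap on the source side are all assumptions. A single pass may also give slightly different per-step counts than applying each filter in sequence.

```python
def is_numeric_only(text):
    # Assumption: "numeric-only" means at least one digit and no letters.
    return any(c.isdigit() for c in text) and not any(c.isalpha() for c in text)


def clean_parallel(pairs, test_sources, max_short_len=8):
    """Filter (source, target) sentence pairs (hypothetical helper).

    pairs         -- iterable of (source, target) strings
    test_sources  -- set of source sentences that appear in the test set
    max_short_len -- rows with text of this many characters or fewer are dropped
    """
    seen_src, seen_tgt = set(), set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Drop blank/empty rows.
        if not src or not tgt:
            continue
        # Drop rows whose source or target duplicates an already kept row.
        if src in seen_src or tgt in seen_tgt:
            continue
        # Drop numeric-only text.
        if is_numeric_only(src) or is_numeric_only(tgt):
            continue
        # Drop rows where either side is too short.
        if len(src) <= max_short_len or len(tgt) <= max_short_len:
            continue
        # Drop rows that overlap with the test set.
        if src in test_sources:
            continue
        seen_src.add(src)
        seen_tgt.add(tgt)
        kept.append((src, tgt))
    return kept
```

Keeping the test-set check inside the same function makes it harder to leak test sentences into training data by accident when the other filters change.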

### BPE Tokenization
- Vocab size: 4000 (gave better results than a vocab size 10x larger)
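
To illustrate what the BPE size setting controls, here is a toy byte-pair-encoding learner. It is a sketch for intuition only, assuming the size roughly corresponds to the number of merge operations; it is not the tokenizer used in the notebook, and `learn_bpe`, `merge_pair`, and `segment` are hypothetical names.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from an iterable of words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Apply learned merges, in order, to a single word."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols
```

With few merges, words stay split into many short subword units; with more merges, frequent words collapse into single tokens. On a small corpus, a smaller setting keeps rare words decomposed into units the model has seen often, which may be one reason the smaller size worked better here.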

### Model Config
- Details are in the supplied config file; it uses fewer transformer layers than the default notebook, with more attention heads and a lower embedding size
- Trained for 235000 steps
- Took a few hours on a single P100 GPU on Google Colab, spread over three days (training was stopped, the best model saved, and that model reloaded the next day)

## Analysis

Example #0
Source: 20 Ungaqonda ukuthi kungani kunjalo ngabaningi .
Reference: 20 You can understand why that is the case with many .
Hypothesis: 20 You can understand why so many .

Example #1
Source: OFakazi BakaJehova endaweni yakini bayokujabulela ukuxoxa nawe okwengeziwe mayelana nemfundo yeBhayibheli eqhutshwa emphakathini wakini . *
Reference: Jehovah ’ s Witnesses in your area will be glad to share more information with you about the Bible education program that is currently being carried on in your community . *
Hypothesis: Jehovah ’ s Witnesses in your area will enjoy discussing more about Bible education that are conducted in your society . *

Example #2
Source: Ukuhlakanipha nokuqonda nako kwakubalulekile .
Reference: Wisdom and discretion were also important .
Hypothesis: Wisdom and discernment was important .

Example #3
Source: Ngemva komhlangano wezizwe eRome ngalelo hlobo , ngaba nelungelo lokuba khona emhlanganweni owawuseNuremberg , eJalimane .
Reference: After the international convention in Rome that summer , I was privileged to attend the convention in Nuremberg , Germany .
Hypothesis: After the international convention in Rome that summer , I had the privilege of attending the convention in Nuremberg , Germany .

## Results
- BLEU dev: 28.48
- BLEU test: 38.33

### Curious analysis of the tokenization
> There are 66255 English tokens in the test set vocab, 2072 are unique
>
> There are 67851 Zulu tokens in the test set vocab, 2336 are unique
>
> These results are in the same notebook as used for training. (Could something similar help inform BPE vocab size choices?)
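
Counts like the ones above can be reproduced with a few lines of Python. This is a sketch, assuming tokens are whitespace-separated in an already tokenized file; `token_stats` and the file name in the usage note are hypothetical, not taken from the notebook.

```python
from collections import Counter


def token_stats(lines):
    """Return (total token count, number of unique tokens) for whitespace-split lines."""
    counts = Counter(tok for line in lines for tok in line.split())
    return sum(counts.values()), len(counts)
```

For a file on disk, something like `with open("test.bpe.en", encoding="utf-8") as f: total, unique = token_stats(f)` would apply it. Comparing the unique count against the configured BPE size gives a rough sense of how much of the subword vocabulary the test set actually exercises.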

### Translation results
> 2019-11-13 07:43:32,728 Hello! This is Joey-NMT.
>
> 2019-11-13 07:44:03,502 dev bleu: 13.64 [Beam search decoding with beam size = 5 and alpha = 1.0]
>
> 2019-11-13 07:44:24,289 test bleu: 4.87 [Beam search decoding with beam size = 5 and alpha = 1.0]
Review comment: How do these results go with the above reported ones?

Download model weights from: [here](https://drive.google.com/open?id=1-QLxP7xLqu-AqDQkm1XaCtDEex1Oseo0)
These are great results. Thank you for sharing also the insights into the data processing and the analysis.