This repository was archived by the owner on Nov 28, 2022. It is now read-only.

Submission for WEDO Team #9

Open. Wants to merge 102 commits into base: main.

Commits (102):
- 2edf5c8 Add pretrained model (BeamNC, Oct 12, 2022)
- 89c27b5 add inference code (BeamNC, Oct 12, 2022)
- 97eef82 Update script (BeamNC, Oct 12, 2022)
- 669fd1d first_commit (nattanaa, Oct 12, 2022)
- 4d5cd37 commit2 (nattanaa, Oct 12, 2022)
- 8e6bf57 commit3 (nattanaa, Oct 12, 2022)
- 917eddc Add setup.sh for install (BeamNC, Oct 12, 2022)
- cfa2d43 Delete submit/Gender_Category/gender_classification/pretrained_model … (beam11221, Oct 12, 2022)
- 2f68b8b Delete hparams_inference.yaml (beam11221, Oct 12, 2022)
- c772afc Update setup.sh (BeamNC, Oct 14, 2022)
- 129e3d9 Add readme (BeamNC, Oct 14, 2022)
- 44e555f From local. Fixed conflict (BeamNC, Oct 14, 2022)
- 6128900 Update README.md (beam11221, Oct 14, 2022)
- 75a3089 Merge pull request #1 from KongpolC/gender_clf (KongpolC, Oct 14, 2022)
- d12ba78 commit3 (nattanaa, Oct 14, 2022)
- 5239ad1 commit4 (nattanaa, Oct 14, 2022)
- f8706aa setup update (nattanaa, Oct 14, 2022)
- 75ba2a1 update setup.sh to download validated.tsv for cv11 (BeamNC, Oct 14, 2022)
- fe2da24 Add cv11 gender inference. Update setup.sh (BeamNC, Oct 14, 2022)
- 13b49e7 Update README.md (beam11221, Oct 14, 2022)
- 8fceeb3 Add Thai-ser download script, add data preprocessing notebook; mp3 ->… (BeamNC, Oct 14, 2022)
- 281de7e fixed local conflict (BeamNC, Oct 14, 2022)
- 82d6689 Merge pull request #2 from KongpolC/gender_clf_2 (beam11221, Oct 14, 2022)
- 67750a8 2 update (nattanaa, Oct 14, 2022)
- 73725db Merge branch 'main' of https://github.com/KongpolC/our-voices-model-c… (nattanaa, Oct 14, 2022)
- 799d40c Update readme (nattanaa, Oct 14, 2022)
- 31e0126 Update all (nattanaa, Oct 16, 2022)
- a934cd3 Update readme (nattanaa, Oct 16, 2022)
- 7e9b9ee Add training script (BeamNC, Oct 17, 2022)
- 7794fe5 Merge pull request #3 from KongpolC/gender_clf_3 (beam11221, Oct 17, 2022)
- e1e985c first commit (KongpolC, Oct 17, 2022)
- b0552d6 Merge branch 'main' of github.com:KongpolC/our-voices-model-competition (KongpolC, Oct 17, 2022)
- 0175562 Change download directory to ./models (BeamNC, Oct 17, 2022)
- 5b6f4f4 Fix setup.sh (BeamNC, Oct 17, 2022)
- eb1132e Merge pull request #4 from KongpolC/edit_pretrain_path (beam11221, Oct 17, 2022)
- 0eed352 Update dataset internal path (BeamNC, Oct 17, 2022)
- ee12e67 change paths to data (KongpolC, Oct 17, 2022)
- 4a9fd68 Merge branch 'main' of github.com:KongpolC/our-voices-model-competition (KongpolC, Oct 17, 2022)
- b475c29 Add requirements (BeamNC, Oct 17, 2022)
- d7cbf71 Update main.ipynb (nattanaa, Oct 17, 2022)
- 346b645 Update main.ipynb (nattanaa, Oct 17, 2022)
- 83f9276 Add gitignore, add audio preprocessing script (BeamNC, Oct 17, 2022)
- 18171af Update training config (BeamNC, Oct 17, 2022)
- af94756 Add load_dataset.sh (BeamNC, Oct 17, 2022)
- 2f12875 Update load_dataset.sh; Add copy tsv file from cv11 to commonvoice11/… (BeamNC, Oct 17, 2022)
- 5058376 change paths and add more explanation (KongpolC, Oct 17, 2022)
- 03f550b Update readme (BeamNC, Oct 17, 2022)
- 0865f00 Update comment in model_inference notebook. UPdate readme (BeamNC, Oct 17, 2022)
- 5606b7a update_all (nattanaa, Oct 17, 2022)
- 7aa9265 Merge branch 'main' of https://github.com/KongpolC/our-voices-model-c… (nattanaa, Oct 17, 2022)
- 5f97e9f modify analysis 5.3 (KongpolC, Oct 18, 2022)
- a165796 Merge branch 'main' of github.com:KongpolC/our-voices-model-competition (KongpolC, Oct 18, 2022)
- b5b2002 modify analysis 5.3 (KongpolC, Oct 18, 2022)
- fdca75c Merge remote-tracking branch 'origin/migrate' (KongpolC, Oct 18, 2022)
- e9c0959 rearange data (KongpolC, Oct 18, 2022)
- 15a2e4f sample clips (KongpolC, Oct 18, 2022)
- 6744a25 ignores .wav files except one (KongpolC, Oct 18, 2022)
- 6e22c0d remove README (KongpolC, Oct 18, 2022)
- c1b3a8a rename (KongpolC, Oct 18, 2022)
- 4d8b3f7 Move data file to scripts (BeamNC, Oct 19, 2022)
- ff8f973 Fix path to compat with new directory (BeamNC, Oct 19, 2022)
- d16231c Merge pull request #5 from KongpolC/migrate_script (beam11221, Oct 19, 2022)
- ad13bac Add training script (BeamNC, Oct 19, 2022)
- 36ade8e Update training config (BeamNC, Oct 19, 2022)
- c7cf4f2 update path (nattanaa, Oct 19, 2022)
- 4377c4b add_floder_data_prep (nattanaa, Oct 19, 2022)
- e2296e0 Update readme (BeamNC, Oct 19, 2022)
- 1294cfb Merge branch 'main' of https://github.com/KongpolC/our-voices-model-c… (BeamNC, Oct 19, 2022)
- 7f93408 update sh (nattanaa, Oct 19, 2022)
- aa6cac3 Merge branch 'main' of https://github.com/KongpolC/our-voices-model-c… (nattanaa, Oct 19, 2022)
- d0df94a update setup to train (nattanaa, Oct 19, 2022)
- 7583333 Update README.md (nattanaa, Oct 19, 2022)
- b277dbe Split load_dataset.sh into 2 files for commonvoice11 and Thai-SER (BeamNC, Oct 19, 2022)
- 32d49b0 Update load_commonvoice11.sh (beam11221, Oct 19, 2022)
- d74af66 Update commonvoice11 loading script (BeamNC, Oct 19, 2022)
- 9cd17c1 Merge pull request #7 from KongpolC/split_load_dataset (beam11221, Oct 19, 2022)
- e7ec1a3 Update dataset path (BeamNC, Oct 19, 2022)
- 8d61e07 Update setup.sh & requirement (BeamNC, Oct 19, 2022)
- 12e1846 Update parameter for inference (BeamNC, Oct 19, 2022)
- b572dc8 Add Commonvoice11 annotation genereator & annotation (BeamNC, Oct 19, 2022)
- 44b8110 Add Thai-SER annotation and annotation generate scripts (BeamNC, Oct 19, 2022)
- 2815225 Update ds_path (BeamNC, Oct 19, 2022)
- 3fe660e Add manifest generator for gender classification training (BeamNC, Oct 19, 2022)
- fad601a Update readme.md (BeamNC, Oct 19, 2022)
- 01bc75d Merge pull request #8 from KongpolC/add_create_anno (beam11221, Oct 19, 2022)
- 82b77b3 Update README.md (nattanaa, Oct 20, 2022)
- 0f01361 Update README.md (nutchascg, Oct 20, 2022)
- da49f73 Update README.md (nutchascg, Oct 20, 2022)
- 8793047 Update README.md (nutchascg, Oct 20, 2022)
- eabc1e6 Update README.md (nutchascg, Oct 20, 2022)
- c581687 Update README.md (nutchascg, Oct 20, 2022)
- a42a529 Update README.md (nutchascg, Oct 20, 2022)
- 24c09b9 Update README.md (nutchascg, Oct 20, 2022)
- f06ea43 Add README (BeamNC, Oct 20, 2022)
- 0e32565 Update README (BeamNC, Oct 20, 2022)
- 0187234 Update README.md (nutchascg, Oct 20, 2022)
- 21a19a5 Remove files (BeamNC, Oct 20, 2022)
- eeb86c7 Update README.md (nutchascg, Oct 20, 2022)
- db4345c Update README.md (nutchascg, Oct 20, 2022)
- 23dc3ad Update main notebook (BeamNC, Oct 20, 2022)
- 5a61c44 Merge branch 'doc_string' of https://github.com/KongpolC/our-voices-m… (BeamNC, Oct 20, 2022)
- ffadf04 Merge pull request #9 from KongpolC/doc_string (beam11221, Oct 20, 2022)
135 changes: 135 additions & 0 deletions .gitignore
@@ -0,0 +1,135 @@
submit/Gender_Category/data/*
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# Clips data
*wav
!/submit/Gender_Category/data/commonvoice11/clips/clips_wav/common_voice_th_25691315.wav
!/submit/Gender_Category/data/commonvoice11/clips/clips_wav_trimmed/common_voice_th_25691315.wav
52 changes: 50 additions & 2 deletions submit/Gender_Category/README.md
@@ -1,4 +1,52 @@
## Gender: An STT model for an under-resourced language that performs equally well for women
# Gender: A Speech-to-Text Model for Thai

### Please use this folder to submit for the Gender Category
## Introduction
WEDO's mission is to create sustainable and purposeful innovations using voice technology that is fairer and more open to everyone. We have developed our own voice command system and embedded it into smart home appliances, such as speakers and faucets, for better living. In addition, we are building a voice-enabled device equipped with a camera that gives blind people directions while they walk on the street. Thanks to Mozilla Common Voice and its open-source datasets, we are able to build things that have a positive impact on people's lives. This motivates us to build a speech-to-text (STT) model with gender-inclusive performance. More precisely, our STT model should perform equally well for both male and female speakers.

First, we explored the Common Voice 11 dataset and found an apparent gender bias. The dataset contains 135,897 files in total: 52,769 from male speakers, 30,283 from female speakers, 1,821 from other genders, and 51,024 with no gender label. In other words, there is roughly 40% less female data than male data. A model trained on such male-dominant data could be biased toward male voices, so we looked for ways to build a gender-inclusive model. However, our experiments showed that our models were not biased toward male voices; in fact, in most of our experiments they recognized female voices slightly better than male voices.
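The imbalance can be checked with a few lines of plain Python, using the counts from the analysis above:

```python
# Gender counts for Common Voice 11 (Thai), taken from the analysis above.
counts = {"male": 52769, "female": 30283, "other": 1821, "undefined": 51024}

total = sum(counts.values())
shares = {gender: round(100 * n / total, 1) for gender, n in counts.items()}

# How much smaller the female subset is relative to the male subset.
female_deficit = (counts["male"] - counts["female"]) / counts["male"]

print(total)                     # 135897
print(shares)                    # {'male': 38.8, 'female': 22.3, 'other': 1.3, 'undefined': 37.5}
print(round(female_deficit, 2))  # 0.43
```

Note that over a third of the clips carry no gender label at all, which is what motivates the gender classification model described later.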

Our contributions can be summarized as follows.

1. We perform an exploratory data analysis to understand gender bias in the Common Voice 11 dataset.
2. We propose an STT model for Thai, fine-tuned from Data2Vec, a state-of-the-art self-supervised learning model for speech.
3. We conduct an experiment to understand performance bias possibly caused by data bias in the Common Voice 11 dataset.
4. To augment the Common Voice 11 data, we propose a gender classification model that infers gender with an F1-score of 0.95.
5. Finally, we conduct further analysis to validate various assumptions about performance bias caused by the data imbalance between male and female speakers.

## Prerequisites
In this project, we trained models and conducted experiments on a Linux server with a GPU. The machine specification is listed below:
* CPU: AMD Ryzen 7 5800X 3.8GHz
* RAM: 32GB x 2
* GPU: NVIDIA RTX A6000 48GB

## Project Outline
To evaluate our project, please perform each step in the following order:

1. Understand the overall study
   * Open the notebook `main.ipynb`.
   * You should see:
     1. Dataset analysis based on gender distribution
     2. Dataset preparation for ASR model training
     3. Model performance evaluation, reported as WER and CER for the STT model and accuracy for the gender classifier

2. Reproduce our STT model
   * Visit `./STT`. Inside this directory, you can find a `README.md` file containing the steps to reproduce the STT model.

3. Reproduce our gender classification model
   * Visit `./gender_classification`. Inside this directory, you can find a `README.md` file containing the steps to reproduce the gender classification model.

Please note that reproducing our work requires downloading and preprocessing the datasets. Visit `./data/scripts/` and run `load_commonvoice11.sh` to download the Common Voice 11 dataset and `load_thai_ser.sh` to download the Thai-SER dataset.
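For reference, the WER and CER reported in step 1 are edit-distance metrics: the number of word (or character) insertions, deletions, and substitutions divided by the reference length. A minimal, dependency-free sketch of the standard definitions (not necessarily the exact implementation used in this repo):

```python
def edit_distance(ref_tokens, hyp_tokens):
    """Levenshtein distance via dynamic programming."""
    m, n = len(ref_tokens), len(hyp_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_tokens[i - 1] == hyp_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def wer(reference, hypothesis):
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, `wer("a b c", "a x c")` is 1/3: one substitution against a three-word reference.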

## Setup
It is recommended to execute these commands before running inference with our repo.
```console
cd ./gender_classification
bash ./setup.sh
```
```console
cd ./STT
bash ./setup.sh
```
98 changes: 98 additions & 0 deletions submit/Gender_Category/STT/README.md
@@ -0,0 +1,98 @@
## STT finetune

Thank you for visiting our work. There are two options for accessing our STT model:
- If you want to reproduce our work, follow the Setup and Model training sections to retrain the base model.
- If you just want to evaluate our trained models, simply download the models and follow the Evaluation section to get the `result.csv` file.

### Setup

```
pip install -r requirements.txt
```

Then, download the following files or run this script to automatically download the essential files for model training.
```
bash ./setup.sh
```

- <a href="https://drive.google.com/drive/folders/1zM_yEi0eEiAItiVSIlQeSgIGderRemHu?usp=sharing">Pretrained model</a>
- <a href="https://drive.google.com/drive/folders/1bsj7DV6Y9hYf4C-Tx0P6tmvPr2hJtwsp?usp=sharing">Processor</a>
- <a href="https://drive.google.com/file/d/1TX-Fp9CWz7U2AicAjhy3gmDoM7XHqSty/view?usp=sharing">Language Model</a>
- <a href="https://drive.google.com/drive/folders/1LAkmsgQ1KrxuFO54UOTnrA7NWcOGAshX?usp=sharing">WavAugment</a>







### Model training
Our base model is the Data2VecAudio model with a language-modeling head on top for Connectionist Temporal Classification (CTC). Data2VecAudio was proposed in "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language" by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. For more information, visit https://huggingface.co/docs/transformers/model_doc/data2vec


```py
import sys

from datasets import load_dataset
from transformers import Wav2Vec2Processor

# pretrained model
BASE_MODEL = "./train/data2vec-thai-pretrained/1"

# load data
mixed_train = load_dataset("./cv11_dataloader.py", "th", split="train+validation")
mixed_test = load_dataset("./cv11_dataloader.py", "th", split="test")

# processor
processor = Wav2Vec2Processor.from_pretrained("./train/processor")

# make WavAugment importable
sys.path.append("./train/WavAugment")

# clips path
abs_path_to_clips = "../data/commonvoice11/clips"
```
### Evaluation

Our trained models can be downloaded below, or run this script to automatically download all of them:
```
bash ./load_models.sh
```
- trained with the 1st dataset (original gender ratio)
<a href="https://drive.google.com/drive/folders/1YPmUk3ZsfMxqq2nFwUV3fWL3uKFxz13q?usp=sharing">load model</a>

- trained with the 2nd dataset (balanced female & male ratio)
<a href="https://drive.google.com/drive/folders/19ufxw8j2jOt3t8_a3Li5tIzMI2idicVk?usp=sharing">load model</a>

- trained with the 3rd dataset (balanced female & male ratio, with the same sentences spoken)
<a href="https://drive.google.com/drive/folders/10DZLSO6ftUzZlvfme2FMbUIpH2ZZoYvS?usp=sharing">load model</a>

Models trained on the upsampled training set:

We upsampled the data by applying our gender classification model to infer the gender of recordings in the unlabeled ("not-filling") class.

- trained with the 2nd dataset plus upsampled data (balanced female & male ratio)
<a href="https://drive.google.com/drive/folders/1nsyl3VLo76DIRNg0Zrrrvy_o4QYlUtXJ?usp=sharing">load model</a>

- trained with the 3rd dataset plus upsampled data (balanced female & male ratio, with the same sentences spoken)
<a href="https://drive.google.com/drive/folders/1lBu9JD-_cQOBjsN747ElV-kAsAhR6rD6?usp=sharing">load model</a>
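The label-then-upsample procedure described above can be sketched as follows; the row format and the `predict_gender` interface are hypothetical stand-ins for the real annotation files and classifier:

```python
import random

# Hypothetical annotation rows; "" stands for the unlabeled ("not-filling") gender class.
rows = [
    {"path": "a.wav", "gender": "male"},
    {"path": "b.wav", "gender": "male"},
    {"path": "c.wav", "gender": "male"},
    {"path": "d.wav", "gender": "female"},
    {"path": "e.wav", "gender": ""},
]

def predict_gender(path):
    # Stand-in for the real gender classification model; assumed interface.
    return "female"

# 1) Label the "not-filling" rows with classifier predictions.
for row in rows:
    if not row["gender"]:
        row["gender"] = predict_gender(row["path"])

# 2) Upsample the minority gender until the counts match.
males = [r for r in rows if r["gender"] == "male"]
females = [r for r in rows if r["gender"] == "female"]
minority, majority = sorted((males, females), key=len)
random.seed(0)
balanced = rows + [random.choice(minority) for _ in range(len(majority) - len(minority))]
```

Here `balanced` ends up with equal male and female counts, which corresponds to the "balanced ratio plus upsampled data" configurations above.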


```py
# processor (inside the evaluator class)
self.processor = Wav2Vec2Processor.from_pretrained("./train/processor")

# model and language model paths
model_path = <MODEL_PATH>  # path to one of the downloaded models
lm_path = "./train/newmm_4gram.bin"

# you must first specify a dataset
dataset_name = "dataset_1"
cv11_test_paths = [
    "../data/commonvoice11/annotation/dataset_1/test.csv"  # test set
]

audio_paths = [
    "../data/commonvoice11/clips/clips_wav"
]
```
- The output of `data2vec_evaluate.py` is a .csv file with WER and CER scores per record. You can group by gender to see the final results.
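Grouping the per-record scores by gender can be sketched with the standard library; the column names of the output .csv are assumed here for illustration:

```python
import csv
import io
from collections import defaultdict

# Stand-in for the evaluation output; column names are assumed for illustration.
result_csv = """path,gender,wer,cer
a.wav,male,0.20,0.05
b.wav,female,0.10,0.03
c.wav,male,0.30,0.07
d.wav,female,0.12,0.04
"""

wers = defaultdict(list)
for row in csv.DictReader(io.StringIO(result_csv)):
    wers[row["gender"]].append(float(row["wer"]))

mean_wer = {gender: round(sum(scores) / len(scores), 4)
            for gender, scores in wers.items()}
print(mean_wer)  # {'male': 0.25, 'female': 0.11}
```

With pandas this is a one-liner (`df.groupby("gender")[["wer", "cer"]].mean()`), but the sketch above avoids the extra dependency.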




