This repository was archived by the owner on Nov 28, 2022. It is now read-only.

Submission for WEDO Team #9

Open. Wants to merge 102 commits into base: main.

Commits (102):
- 2edf5c8 Add pretrained model (BeamNC, Oct 12, 2022)
- 89c27b5 add inference code (BeamNC, Oct 12, 2022)
- 97eef82 Update script (BeamNC, Oct 12, 2022)
- 669fd1d first_commit (nattanaa, Oct 12, 2022)
- 4d5cd37 commit2 (nattanaa, Oct 12, 2022)
- 8e6bf57 commit3 (nattanaa, Oct 12, 2022)
- 917eddc Add setup.sh for install (BeamNC, Oct 12, 2022)
- cfa2d43 Delete submit/Gender_Category/gender_classification/pretrained_model … (beam11221, Oct 12, 2022)
- 2f68b8b Delete hparams_inference.yaml (beam11221, Oct 12, 2022)
- c772afc Update setup.sh (BeamNC, Oct 14, 2022)
- 129e3d9 Add readme (BeamNC, Oct 14, 2022)
- 44e555f From local. Fixed conflict (BeamNC, Oct 14, 2022)
- 6128900 Update README.md (beam11221, Oct 14, 2022)
- 75a3089 Merge pull request #1 from KongpolC/gender_clf (KongpolC, Oct 14, 2022)
- d12ba78 commit3 (nattanaa, Oct 14, 2022)
- 5239ad1 commit4 (nattanaa, Oct 14, 2022)
- f8706aa setup update (nattanaa, Oct 14, 2022)
- 75ba2a1 update setup.sh to download validated.tsv for cv11 (BeamNC, Oct 14, 2022)
- fe2da24 Add cv11 gender inference. Update setup.sh (BeamNC, Oct 14, 2022)
- 13b49e7 Update README.md (beam11221, Oct 14, 2022)
- 8fceeb3 Add Thai-ser download script, add data preprocessing notebook; mp3 ->… (BeamNC, Oct 14, 2022)
- 281de7e fixed local conflict (BeamNC, Oct 14, 2022)
- 82d6689 Merge pull request #2 from KongpolC/gender_clf_2 (beam11221, Oct 14, 2022)
- 67750a8 2 update (nattanaa, Oct 14, 2022)
- 73725db Merge branch 'main' of https://github.com/KongpolC/our-voices-model-c… (nattanaa, Oct 14, 2022)
- 799d40c Update readme (nattanaa, Oct 14, 2022)
- 31e0126 Update all (nattanaa, Oct 16, 2022)
- a934cd3 Update readme (nattanaa, Oct 16, 2022)
- 7e9b9ee Add training script (BeamNC, Oct 17, 2022)
- 7794fe5 Merge pull request #3 from KongpolC/gender_clf_3 (beam11221, Oct 17, 2022)
- e1e985c first commit (KongpolC, Oct 17, 2022)
- b0552d6 Merge branch 'main' of github.com:KongpolC/our-voices-model-competition (KongpolC, Oct 17, 2022)
- 0175562 Change download directory to ./models (BeamNC, Oct 17, 2022)
- 5b6f4f4 Fix setup.sh (BeamNC, Oct 17, 2022)
- eb1132e Merge pull request #4 from KongpolC/edit_pretrain_path (beam11221, Oct 17, 2022)
- 0eed352 Update dataset internal path (BeamNC, Oct 17, 2022)
- ee12e67 change paths to data (KongpolC, Oct 17, 2022)
- 4a9fd68 Merge branch 'main' of github.com:KongpolC/our-voices-model-competition (KongpolC, Oct 17, 2022)
- b475c29 Add requirements (BeamNC, Oct 17, 2022)
- d7cbf71 Update main.ipynb (nattanaa, Oct 17, 2022)
- 346b645 Update main.ipynb (nattanaa, Oct 17, 2022)
- 83f9276 Add gitignore, add audio preprocessing script (BeamNC, Oct 17, 2022)
- 18171af Update training config (BeamNC, Oct 17, 2022)
- af94756 Add load_dataset.sh (BeamNC, Oct 17, 2022)
- 2f12875 Update load_dataset.sh; Add copy tsv file from cv11 to commonvoice11/… (BeamNC, Oct 17, 2022)
- 5058376 change paths and add more explanation (KongpolC, Oct 17, 2022)
- 03f550b Update readme (BeamNC, Oct 17, 2022)
- 0865f00 Update comment in model_inference notebook. UPdate readme (BeamNC, Oct 17, 2022)
- 5606b7a update_all (nattanaa, Oct 17, 2022)
- 7aa9265 Merge branch 'main' of https://github.com/KongpolC/our-voices-model-c… (nattanaa, Oct 17, 2022)
- 5f97e9f modify analysis 5.3 (KongpolC, Oct 18, 2022)
- a165796 Merge branch 'main' of github.com:KongpolC/our-voices-model-competition (KongpolC, Oct 18, 2022)
- b5b2002 modify analysis 5.3 (KongpolC, Oct 18, 2022)
- fdca75c Merge remote-tracking branch 'origin/migrate' (KongpolC, Oct 18, 2022)
- e9c0959 rearange data (KongpolC, Oct 18, 2022)
- 15a2e4f sample clips (KongpolC, Oct 18, 2022)
- 6744a25 ignores .wav files except one (KongpolC, Oct 18, 2022)
- 6e22c0d remove README (KongpolC, Oct 18, 2022)
- c1b3a8a rename (KongpolC, Oct 18, 2022)
- 4d8b3f7 Move data file to scripts (BeamNC, Oct 19, 2022)
- ff8f973 Fix path to compat with new directory (BeamNC, Oct 19, 2022)
- d16231c Merge pull request #5 from KongpolC/migrate_script (beam11221, Oct 19, 2022)
- ad13bac Add training script (BeamNC, Oct 19, 2022)
- 36ade8e Update training config (BeamNC, Oct 19, 2022)
- c7cf4f2 update path (nattanaa, Oct 19, 2022)
- 4377c4b add_floder_data_prep (nattanaa, Oct 19, 2022)
- e2296e0 Update readme (BeamNC, Oct 19, 2022)
- 1294cfb Merge branch 'main' of https://github.com/KongpolC/our-voices-model-c… (BeamNC, Oct 19, 2022)
- 7f93408 update sh (nattanaa, Oct 19, 2022)
- aa6cac3 Merge branch 'main' of https://github.com/KongpolC/our-voices-model-c… (nattanaa, Oct 19, 2022)
- d0df94a update setup to train (nattanaa, Oct 19, 2022)
- 7583333 Update README.md (nattanaa, Oct 19, 2022)
- b277dbe Split load_dataset.sh into 2 files for commonvoice11 and Thai-SER (BeamNC, Oct 19, 2022)
- 32d49b0 Update load_commonvoice11.sh (beam11221, Oct 19, 2022)
- d74af66 Update commonvoice11 loading script (BeamNC, Oct 19, 2022)
- 9cd17c1 Merge pull request #7 from KongpolC/split_load_dataset (beam11221, Oct 19, 2022)
- e7ec1a3 Update dataset path (BeamNC, Oct 19, 2022)
- 8d61e07 Update setup.sh & requirement (BeamNC, Oct 19, 2022)
- 12e1846 Update parameter for inference (BeamNC, Oct 19, 2022)
- b572dc8 Add Commonvoice11 annotation genereator & annotation (BeamNC, Oct 19, 2022)
- 44b8110 Add Thai-SER annotation and annotation generate scripts (BeamNC, Oct 19, 2022)
- 2815225 Update ds_path (BeamNC, Oct 19, 2022)
- 3fe660e Add manifest generator for gender classification training (BeamNC, Oct 19, 2022)
- fad601a Update readme.md (BeamNC, Oct 19, 2022)
- 01bc75d Merge pull request #8 from KongpolC/add_create_anno (beam11221, Oct 19, 2022)
- 82b77b3 Update README.md (nattanaa, Oct 20, 2022)
- 0f01361 Update README.md (nutchascg, Oct 20, 2022)
- da49f73 Update README.md (nutchascg, Oct 20, 2022)
- 8793047 Update README.md (nutchascg, Oct 20, 2022)
- eabc1e6 Update README.md (nutchascg, Oct 20, 2022)
- c581687 Update README.md (nutchascg, Oct 20, 2022)
- a42a529 Update README.md (nutchascg, Oct 20, 2022)
- 24c09b9 Update README.md (nutchascg, Oct 20, 2022)
- f06ea43 Add README (BeamNC, Oct 20, 2022)
- 0e32565 Update README (BeamNC, Oct 20, 2022)
- 0187234 Update README.md (nutchascg, Oct 20, 2022)
- 21a19a5 Remove files (BeamNC, Oct 20, 2022)
- eeb86c7 Update README.md (nutchascg, Oct 20, 2022)
- db4345c Update README.md (nutchascg, Oct 20, 2022)
- 23dc3ad Update main notebook (BeamNC, Oct 20, 2022)
- 5a61c44 Merge branch 'doc_string' of https://github.com/KongpolC/our-voices-m… (BeamNC, Oct 20, 2022)
- ffadf04 Merge pull request #9 from KongpolC/doc_string (beam11221, Oct 20, 2022)
135 changes: 135 additions & 0 deletions .gitignore
@@ -0,0 +1,135 @@
submit/Gender_Category/data/*
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# Clips data
*wav
!/submit/Gender_Category/data/commonvoice11/clips/clips_wav/common_voice_th_25691315.wav
!/submit/Gender_Category/data/commonvoice11/clips/clips_wav_trimmed/common_voice_th_25691315.wav
52 changes: 50 additions & 2 deletions submit/Gender_Category/README.md
@@ -1,4 +1,52 @@
## Gender: An STT model for an under-resourced language that performs equally well for women
# Gender: A Speech-to-Text Model for Thai

### Please use this folder to submit for the Gender Category
## Introduction
WEDO's mission is to create sustainable and purposeful innovations using voice technology that is fairer and more open to everyone. We have developed our own voice command system and embedded it into smart home appliances, such as speakers and faucets, for better living. In addition, we are building a voice-enabled device equipped with a camera that gives blind people directions while they walk on the street. Thanks to Mozilla Common Voice and its open-source datasets, we are able to build things that have a positive impact on people's lives. This motivates us to build a speech-to-text (STT) model with gender-inclusive performance. More precisely, our STT model should perform equally well for both male and female speakers.

First, we explored the Common Voice 11 dataset and found an apparent gender bias. The dataset contains 135,897 files in total: 52,769 from male speakers, 30,283 from female speakers, 1,821 from other genders, and 51,024 with no gender label. In other words, there is roughly 40% less female data than male data. A model trained on such male-dominant data could be biased toward male voices, so we looked for ways to build a gender-inclusive model. However, our experiments showed that our models were not biased toward male voices; in fact, in most of our experiments they recognized female voices slightly better than male voices.
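The imbalance can be checked with a few lines of plain Python, using the counts from the analysis above:

```python
# Gender counts for Common Voice 11 (Thai), taken from the analysis above.
counts = {"male": 52769, "female": 30283, "other": 1821, "undefined": 51024}

total = sum(counts.values())
shares = {gender: round(100 * n / total, 1) for gender, n in counts.items()}

# How much smaller the female subset is relative to the male subset.
female_deficit = (counts["male"] - counts["female"]) / counts["male"]

print(total)                     # 135897
print(shares)                    # {'male': 38.8, 'female': 22.3, 'other': 1.3, 'undefined': 37.5}
print(round(female_deficit, 2))  # 0.43
```

Note that over a third of the clips carry no gender label at all, which is what motivates the gender classification model described later.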

Our contributions can be summarized as follows.

1. We perform an exploratory data analysis to understand gender bias in the Common Voice 11 dataset.
2. We propose an STT model for Thai, fine-tuned from Data2Vec, a state-of-the-art self-supervised learning model for speech.
3. We conduct an experiment to understand performance bias possibly caused by data bias in the Common Voice 11 dataset.
4. To augment the Common Voice 11 data, we propose a gender classification model that infers gender with an F1-score of 0.95.
5. Finally, we conduct further analysis to validate various assumptions about performance bias caused by the data imbalance between male and female speakers.

## Prerequisites
In this project, we trained models and conducted experiments on a Linux server with a GPU. The machine specification is listed below:
* CPU: AMD Ryzen 7 5800X 3.8GHz
* RAM: 32GB x 2
* GPU: NVIDIA RTX A6000 48GB

## Project Outline
To evaluate our project, please perform each step in the following order:

1. Understand the overall study
   * Open the notebook `main.ipynb`.
   * You should see:
     1. Dataset analysis based on gender distribution
     2. Dataset preparation for ASR model training
     3. Model performance evaluation, reported as WER and CER for the STT model and accuracy for the gender classifier

2. Reproduce our STT model
   * Visit `./STT`. Inside this directory, you can find a `README.md` file containing the steps to reproduce the STT model.

3. Reproduce our gender classification model
   * Visit `./gender_classification`. Inside this directory, you can find a `README.md` file containing the steps to reproduce the gender classification model.

Please note that reproducing our work requires downloading and preprocessing the datasets. Visit `./data/scripts/` and run `load_commonvoice11.sh` to download the Common Voice 11 dataset and `load_thai_ser.sh` to download the Thai-SER dataset.
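For reference, the WER and CER reported in step 1 are edit-distance metrics: the number of word (or character) insertions, deletions, and substitutions divided by the reference length. A minimal, dependency-free sketch of the standard definitions (not necessarily the exact implementation used in this repo):

```python
def edit_distance(ref_tokens, hyp_tokens):
    """Levenshtein distance via dynamic programming."""
    m, n = len(ref_tokens), len(hyp_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_tokens[i - 1] == hyp_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def wer(reference, hypothesis):
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, `wer("a b c", "a x c")` is 1/3: one substitution against a three-word reference.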

## Setup
It is recommended to execute these commands before running inference with our repo.
```console
cd ./gender_classification
bash ./setup.sh
```
```console
cd ./STT
bash ./setup.sh
```
98 changes: 98 additions & 0 deletions submit/Gender_Category/STT/README.md
@@ -0,0 +1,98 @@
## STT finetune

Thank you for visiting our work. There are two options for accessing our STT model:
- If you want to reproduce our work, follow the Setup and Model training sections to retrain the base model.
- If you just want to evaluate our trained models, simply download the models and follow the Evaluation section to get the `result.csv` file.

### Setup

```
pip install -r requirements.txt
```

Then, download the following files or run this script to automatically download the essential files for model training.
```
bash ./setup.sh
```

- <a href="https://drive.google.com/drive/folders/1zM_yEi0eEiAItiVSIlQeSgIGderRemHu?usp=sharing">Pretrained model</a>
- <a href="https://drive.google.com/drive/folders/1bsj7DV6Y9hYf4C-Tx0P6tmvPr2hJtwsp?usp=sharing">Processor</a>
- <a href="https://drive.google.com/file/d/1TX-Fp9CWz7U2AicAjhy3gmDoM7XHqSty/view?usp=sharing">Language Model</a>
- <a href="https://drive.google.com/drive/folders/1LAkmsgQ1KrxuFO54UOTnrA7NWcOGAshX?usp=sharing">WavAugment</a>







### Model training
Our base model is the Data2VecAudio model with a language-modeling head on top for Connectionist Temporal Classification (CTC). Data2VecAudio was proposed in "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language" by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. For more information, visit https://huggingface.co/docs/transformers/model_doc/data2vec


```py
import sys

from datasets import load_dataset
from transformers import Wav2Vec2Processor

# pretrained model
BASE_MODEL = "./train/data2vec-thai-pretrained/1"

# load data
mixed_train = load_dataset("./cv11_dataloader.py", "th", split="train+validation")
mixed_test = load_dataset("./cv11_dataloader.py", "th", split="test")

# processor
processor = Wav2Vec2Processor.from_pretrained("./train/processor")

# make WavAugment importable
sys.path.append("./train/WavAugment")

# clips path
abs_path_to_clips = "../data/commonvoice11/clips"
```
### Evaluation

Our trained models can be downloaded below, or run this script to automatically download all of them:
```
bash ./load_models.sh
```
- trained with the 1st dataset (original gender ratio)
<a href="https://drive.google.com/drive/folders/1YPmUk3ZsfMxqq2nFwUV3fWL3uKFxz13q?usp=sharing">load model</a>

- trained with the 2nd dataset (balanced female & male ratio)
<a href="https://drive.google.com/drive/folders/19ufxw8j2jOt3t8_a3Li5tIzMI2idicVk?usp=sharing">load model</a>

- trained with the 3rd dataset (balanced female & male ratio, with the same sentences spoken)
<a href="https://drive.google.com/drive/folders/10DZLSO6ftUzZlvfme2FMbUIpH2ZZoYvS?usp=sharing">load model</a>

Models trained on the upsampled training set:

We upsampled the data by applying our gender classification model to infer the gender of recordings in the unlabeled ("not-filling") class.

- trained with the 2nd dataset plus upsampled data (balanced female & male ratio)
<a href="https://drive.google.com/drive/folders/1nsyl3VLo76DIRNg0Zrrrvy_o4QYlUtXJ?usp=sharing">load model</a>

- trained with the 3rd dataset plus upsampled data (balanced female & male ratio, with the same sentences spoken)
<a href="https://drive.google.com/drive/folders/1lBu9JD-_cQOBjsN747ElV-kAsAhR6rD6?usp=sharing">load model</a>
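The label-then-upsample procedure described above can be sketched as follows; the row format and the `predict_gender` interface are hypothetical stand-ins for the real annotation files and classifier:

```python
import random

# Hypothetical annotation rows; "" stands for the unlabeled ("not-filling") gender class.
rows = [
    {"path": "a.wav", "gender": "male"},
    {"path": "b.wav", "gender": "male"},
    {"path": "c.wav", "gender": "male"},
    {"path": "d.wav", "gender": "female"},
    {"path": "e.wav", "gender": ""},
]

def predict_gender(path):
    # Stand-in for the real gender classification model; assumed interface.
    return "female"

# 1) Label the "not-filling" rows with classifier predictions.
for row in rows:
    if not row["gender"]:
        row["gender"] = predict_gender(row["path"])

# 2) Upsample the minority gender until the counts match.
males = [r for r in rows if r["gender"] == "male"]
females = [r for r in rows if r["gender"] == "female"]
minority, majority = sorted((males, females), key=len)
random.seed(0)
balanced = rows + [random.choice(minority) for _ in range(len(majority) - len(minority))]
```

Here `balanced` ends up with equal male and female counts, which corresponds to the "balanced ratio plus upsampled data" configurations above.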


```py
# processor (inside the evaluator class)
self.processor = Wav2Vec2Processor.from_pretrained("./train/processor")

# model and language model paths
model_path = <MODEL_PATH>  # path to one of the downloaded models
lm_path = "./train/newmm_4gram.bin"

# you must first specify a dataset
dataset_name = "dataset_1"
cv11_test_paths = [
    "../data/commonvoice11/annotation/dataset_1/test.csv"  # test set
]

audio_paths = [
    "../data/commonvoice11/clips/clips_wav"
]
```
- The output of `data2vec_evaluate.py` is a .csv file with WER and CER scores per record. You can group by gender to see the final results.
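Grouping the per-record scores by gender can be sketched with the standard library; the column names of the output .csv are assumed here for illustration:

```python
import csv
import io
from collections import defaultdict

# Stand-in for the evaluation output; column names are assumed for illustration.
result_csv = """path,gender,wer,cer
a.wav,male,0.20,0.05
b.wav,female,0.10,0.03
c.wav,male,0.30,0.07
d.wav,female,0.12,0.04
"""

wers = defaultdict(list)
for row in csv.DictReader(io.StringIO(result_csv)):
    wers[row["gender"]].append(float(row["wer"]))

mean_wer = {gender: round(sum(scores) / len(scores), 4)
            for gender, scores in wers.items()}
print(mean_wer)  # {'male': 0.25, 'female': 0.11}
```

With pandas this is a one-liner (`df.groupby("gender")[["wer", "cer"]].mean()`), but the sketch above avoids the extra dependency.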




