Documentation for finetuning a zipformer model #1509

Merged
merged 1 commit on Feb 21, 2024
140 changes: 140 additions & 0 deletions docs/source/recipes/Finetune/from_supervised/finetune_zipformer.rst
@@ -0,0 +1,140 @@
Finetune from a supervised pre-trained Zipformer model
======================================================

This tutorial shows you how to fine-tune a supervised pre-trained **Zipformer**
transducer model on a new dataset.

.. HINT::

   We assume you have read the page :ref:`install icefall` and have set up
   the environment for ``icefall``.

.. HINT::

   We recommend using one or more GPUs to run this recipe.


For illustration purposes, we fine-tune the Zipformer transducer model
pre-trained on `LibriSpeech`_ using the small subset of `GigaSpeech`_. You can also use your
own data for fine-tuning, as long as you create a manifest for your new dataset.
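
If you go the own-data route, the sketch below shows one hedged way to create such manifests
with `lhotse <https://lhotse.readthedocs.io/>`_, assuming your data is already in Kaldi format
(``wav.scp``, ``text``, optionally ``segments``). The paths and the 16 kHz sampling rate are
placeholders, and you still need to compute features and build cuts as done in the recipe's
``prepare.sh``; double-check ``lhotse kaldi import --help`` for the exact arguments:

.. code-block:: bash

   # Hedged sketch: import an existing Kaldi-style data directory into Lhotse
   # recordings/supervisions manifests. Adapt the paths and sampling rate to your data.
   $ lhotse kaldi import /path/to/kaldi/data/train 16000 data/manifests/my_dataset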

Data preparation
----------------

Please follow the instructions in the `GigaSpeech recipe <https://github.com/k2-fsa/icefall/tree/master/egs/gigaspeech/ASR>`_
to prepare the fine-tuning data used in this tutorial. Only the small subset of GigaSpeech is required.
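
A hedged sketch of invoking the recipe's data preparation is shown below; the exact stage
numbers and the variable that selects the small subset are defined inside ``prepare.sh``,
so please inspect that script before running it:

.. code-block:: bash

   # Run the GigaSpeech data preparation from the recipe directory.
   # --stage/--stop-stage follow the usual icefall prepare.sh convention.
   $ cd egs/gigaspeech/ASR
   $ ./prepare.sh --stage 0 --stop-stage 100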


Model preparation
-----------------

We use the Zipformer model trained on the full LibriSpeech dataset (960 hours) as the initialization. The
checkpoint of the model can be downloaded with the following command:

.. code-block:: bash

   $ GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Zengwei/icefall-asr-librispeech-zipformer-2023-05-15
   $ cd icefall-asr-librispeech-zipformer-2023-05-15/exp
   $ git lfs pull --include "pretrained.pt"
   $ ln -s pretrained.pt epoch-99.pt
   $ cd ../data/lang_bpe_500
   $ git lfs pull --include bpe.model
   $ cd ../../..
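
Optionally, you can check that ``git lfs pull`` actually fetched the files and did not leave
small LFS pointer files behind (the exact size depends on the model, but the checkpoint should
be on the order of a few hundred MB):

.. code-block:: bash

   # Sanity check: both files should have realistic sizes, not a few hundred bytes.
   $ ls -lh icefall-asr-librispeech-zipformer-2023-05-15/exp/pretrained.pt
   $ ls -lh icefall-asr-librispeech-zipformer-2023-05-15/data/lang_bpe_500/bpe.model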

Before fine-tuning, let's test the model's WER on the new domain. The following command performs
decoding on the GigaSpeech test sets:

.. code-block:: bash

   ./zipformer/decode_gigaspeech.py \
       --epoch 99 \
       --avg 1 \
       --exp-dir icefall-asr-librispeech-zipformer-2023-05-15/exp \
       --use-averaged-model 0 \
       --max-duration 1000 \
       --decoding-method greedy_search

You should see the following numbers:

.. code-block:: text

   For dev, WER of different settings are:
   greedy_search 20.06 best for dev

   For test, WER of different settings are:
   greedy_search 19.27 best for test
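
The per-utterance hypotheses and WER summaries are typically written to a sub-directory of
``--exp-dir`` named after the decoding method (the exact layout may differ between icefall
versions), so you can inspect them with, e.g.:

.. code-block:: bash

   # List the decoding outputs produced by the command above.
   $ ls icefall-asr-librispeech-zipformer-2023-05-15/exp/greedy_search/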


Fine-tune
---------

Since LibriSpeech and GigaSpeech are both English datasets, we can initialize the whole
Zipformer model from the checkpoint downloaded in the previous step (otherwise, the stateless
decoder and joiner should be initialized from scratch because the output vocabularies would
not match). The following command starts a fine-tuning experiment:

.. code-block:: bash

   $ use_mux=0
   $ do_finetune=1

   $ ./zipformer/finetune.py \
       --world-size 2 \
       --num-epochs 20 \
       --start-epoch 1 \
       --exp-dir zipformer/exp_giga_finetune${do_finetune}_mux${use_mux} \
       --use-fp16 1 \
       --base-lr 0.0045 \
       --bpe-model data/lang_bpe_500/bpe.model \
       --do-finetune $do_finetune \
       --use-mux $use_mux \
       --master-port 13024 \
       --finetune-ckpt icefall-asr-librispeech-zipformer-2023-05-15/exp/pretrained.pt \
       --max-duration 1000

The following arguments are related to fine-tuning:

- ``--base-lr``
  The learning rate used for fine-tuning. We suggest setting a **small** learning rate for fine-tuning,
  otherwise the model may quickly forget its initialization. A reasonable value is around
  1/10 of the original learning rate, i.e., 0.0045.

- ``--do-finetune``
  If True, perform fine-tuning by initializing the model from a pre-trained checkpoint.
  **Note that if you want to resume your fine-tuning experiment from a certain epoch, you
  need to set this to False.**

- ``--finetune-ckpt``
  The path to the pre-trained checkpoint (used for initialization).

- ``--use-mux``
  If True, mix the fine-tuning data with the original training data using `CutSet.mux <https://lhotse.readthedocs.io/en/latest/api.html#lhotse.supervision.SupervisionSet.mux>`_.
  This helps maintain the model's performance on the original domain if the original training
  data is available. **If you don't have the original training data, please set it to False.**
  (An example command with ``--use-mux`` enabled is shown after this list.)
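
For reference, a hedged sketch of the same fine-tuning command with ``--use-mux`` enabled
(assuming the original LibriSpeech training manifests are still available on disk) would
look like this:

.. code-block:: bash

   # Same command as above, but the GigaSpeech cuts are muxed with the original
   # LibriSpeech training cuts to reduce forgetting on the original domain.
   $ use_mux=1
   $ do_finetune=1

   $ ./zipformer/finetune.py \
       --world-size 2 \
       --num-epochs 20 \
       --start-epoch 1 \
       --exp-dir zipformer/exp_giga_finetune${do_finetune}_mux${use_mux} \
       --use-fp16 1 \
       --base-lr 0.0045 \
       --bpe-model data/lang_bpe_500/bpe.model \
       --do-finetune $do_finetune \
       --use-mux $use_mux \
       --master-port 13024 \
       --finetune-ckpt icefall-asr-librispeech-zipformer-2023-05-15/exp/pretrained.pt \
       --max-duration 1000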

After fine-tuning, let's test the WERs. You can do this via the following command:

.. code-block:: bash

   $ use_mux=0
   $ do_finetune=1
   $ ./zipformer/decode_gigaspeech.py \
       --epoch 20 \
       --avg 10 \
       --exp-dir zipformer/exp_giga_finetune${do_finetune}_mux${use_mux} \
       --use-averaged-model 1 \
       --max-duration 1000 \
       --decoding-method greedy_search

You should see numbers similar to the ones below:

.. code-block:: text

   For dev, WER of different settings are:
   greedy_search 13.47 best for dev

   For test, WER of different settings are:
   greedy_search 13.66 best for test
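
Greedy search is only one of the supported decoding methods. Below is a hedged sketch of
decoding with modified beam search, assuming ``decode_gigaspeech.py`` exposes the same
``--decoding-method`` and ``--beam-size`` options as the standard ``zipformer/decode.py``:

.. code-block:: bash

   # Decode the fine-tuned model with modified beam search instead of greedy search.
   $ ./zipformer/decode_gigaspeech.py \
       --epoch 20 \
       --avg 10 \
       --exp-dir zipformer/exp_giga_finetune1_mux0 \
       --use-averaged-model 1 \
       --max-duration 1000 \
       --decoding-method modified_beam_search \
       --beam-size 4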

Compared to the original checkpoint, the fine-tuned model achieves much lower WERs
on the GigaSpeech test sets.
15 changes: 15 additions & 0 deletions docs/source/recipes/Finetune/index.rst
@@ -0,0 +1,15 @@
Fine-tune a pre-trained model
=============================

After pre-training on publicly available datasets, the ASR model is already capable of
performing general speech recognition with relatively high accuracy. However, its accuracy
may still be low on domains that differ substantially from the original training
set. In this case, we can fine-tune the model with a small amount of additional labelled
data to improve its performance on the new domain.


.. toctree::
   :maxdepth: 2
   :caption: Table of Contents

   from_supervised/finetune_zipformer
1 change: 1 addition & 0 deletions docs/source/recipes/index.rst
@@ -17,3 +17,4 @@ We may add recipes for other tasks as well in the future.
Streaming-ASR/index
RNN-LM/index
TTS/index
Finetune/index