We used a Conformer-like model consisting of a Transformer encoder, a 1D convolution with CBAM, and a Bi-LSTM. The overall model size is about 16MB after INT8 quantization. The training objective is CTC with an InterCTC loss.
- Used landmarks
- 1 for nose, 21 for dominant hand, 40 for lips
- x, y coordinates
- Normalization
- Coordinates standardized relative to the nose landmark
- Feature engineering
- Concatenation of the normalized locations, the difference to the next frame, and pairwise joint distances of the hand (see the sketch below)
- 582 dims in total
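A minimal sketch of this feature pipeline, assuming the landmarks are already gathered into a `(T, 62, 2)` array (1 nose + 21 hand + 40 lip points). The exact breakdown of the 582 dims is not spelled out above; one combination that sums to 582, assumed here, is locations (124) + differences to the next and previous frame (124 each) + pairwise hand joint distances (C(21, 2) = 210).

```python
import numpy as np

HAND_SLICE = slice(1, 22)  # assumed layout: [nose, hand(21), lips(40)]

def extract_features(landmarks: np.ndarray) -> np.ndarray:
    """landmarks: (T, 62, 2) array of x/y coordinates."""
    # Normalization: standardize coordinates relative to the nose landmark.
    nose = landmarks[:, :1, :]                      # (T, 1, 2)
    centered = landmarks - nose
    centered = (centered - centered.mean()) / (centered.std() + 1e-6)

    loc = centered.reshape(len(centered), -1)       # (T, 124)

    # Frame differences (next and previous), zero-padded at the ends.
    diff_next = np.zeros_like(loc)
    diff_next[:-1] = loc[1:] - loc[:-1]
    diff_prev = np.zeros_like(loc)
    diff_prev[1:] = loc[1:] - loc[:-1]

    # Pairwise distances between the 21 hand joints: C(21, 2) = 210 dims.
    hand = centered[:, HAND_SLICE, :]               # (T, 21, 2)
    i, j = np.triu_indices(21, k=1)
    joint_dist = np.linalg.norm(hand[:, i] - hand[:, j], axis=-1)  # (T, 210)

    # 124 + 124 + 124 + 210 = 582 dims per frame.
    return np.concatenate([loc, diff_next, diff_prev, joint_dist], axis=-1)
```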
- Removed data whose input is shorter than twice the length of the target phrase
- Horizontal flip of landmarks
- Interpolation
- Affine transform (all three are sketched below)
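The length filter and the three landmark transforms could look like the following; the value ranges are assumptions, and a true horizontal flip would also swap left/right landmark indices, which is omitted here.

```python
import numpy as np

def keep_sample(landmarks: np.ndarray, phrase: str) -> bool:
    # Drop samples whose input is shorter than 2x the target phrase length.
    return len(landmarks) >= 2 * len(phrase)

def horizontal_flip(landmarks: np.ndarray) -> np.ndarray:
    # Mirror x; assumes coordinates normalized to [0, 1].
    flipped = landmarks.copy()
    flipped[..., 0] = 1.0 - flipped[..., 0]
    return flipped

def time_interpolate(landmarks: np.ndarray, scale: float) -> np.ndarray:
    # Resample along the time axis (speed perturbation via interpolation).
    T = len(landmarks)
    new_T = max(2, int(T * scale))
    src = np.linspace(0, T - 1, new_T)
    flat = landmarks.reshape(T, -1)
    out = np.stack([np.interp(src, np.arange(T), flat[:, d])
                    for d in range(flat.shape[1])], axis=1)
    return out.reshape(new_T, *landmarks.shape[1:])

def random_affine(landmarks: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    # Small random rotation/scale/shift applied to the x/y coordinates.
    theta = rng.uniform(-0.2, 0.2)
    scale = rng.uniform(0.9, 1.1)
    shift = rng.uniform(-0.05, 0.05, size=2)
    rot = scale * np.array([[np.cos(theta), -np.sin(theta)],
                            [np.sin(theta),  np.cos(theta)]])
    return landmarks @ rot.T + shift
```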
- 2 stacked encoders, each consisting of a Transformer, a 1D convolution with CBAM, and a Bi-LSTM (see the sketch after this subsection)
- hidden dim: 352
- CTC loss, plus an InterCTC loss after the first encoder
- 17M parameters and INT8 quantization
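A minimal PyTorch sketch of this architecture; the layer counts, attention heads, kernel sizes, and vocabulary size are assumptions, and the CBAM module follows the standard channel-then-spatial attention recipe adapted to 1D.

```python
import torch
import torch.nn as nn

class CBAM1d(nn.Module):
    """Channel + spatial attention for 1D feature maps (CBAM)."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv1d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                      # x: (B, C, T)
        ch = torch.sigmoid(self.mlp(x.mean(-1)) + self.mlp(x.amax(-1)))
        x = x * ch.unsqueeze(-1)
        sp = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(sp))

class EncoderBlock(nn.Module):
    """Transformer layer + 1D conv with CBAM + Bi-LSTM."""
    def __init__(self, dim: int = 352):
        super().__init__()
        self.transformer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=dim * 4,
            batch_first=True, norm_first=True)
        self.conv = nn.Conv1d(dim, dim, kernel_size=5, padding=2)
        self.cbam = CBAM1d(dim)
        self.lstm = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)

    def forward(self, x):                      # x: (B, T, dim)
        x = self.transformer(x)
        x = self.cbam(self.conv(x.transpose(1, 2))).transpose(1, 2)
        return self.lstm(x)[0]

class Model(nn.Module):
    def __init__(self, in_dim: int = 582, dim: int = 352, vocab: int = 60):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)
        self.enc1, self.enc2 = EncoderBlock(dim), EncoderBlock(dim)
        self.inter_head = nn.Linear(dim, vocab)   # InterCTC branch
        self.head = nn.Linear(dim, vocab)         # final CTC head

    def forward(self, x):                      # x: (B, T, 582)
        h1 = self.enc1(self.proj(x))
        h2 = self.enc2(h1)
        return self.head(h2), self.inter_head(h1)
```

Both heads would then receive `F.ctc_loss`, with the InterCTC branch down-weighted (the exact weight is not stated above).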
- 2-stage training (300 + 200 epochs)
- Used supplemental and train data in the first 300 epochs
- Used only train data for the last 200 epochs
- Ranger optimizer
- Cosine decay scheduler with 12-epoch warmup (sketched below)
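A sketch of this schedule, assuming the `Ranger` implementation from the `torch_optimizer` package and a per-epoch scheduler step:

```python
import math
import torch
from torch_optimizer import Ranger  # assumed package; any Ranger port works

def make_optimizer_and_scheduler(model, epochs=300, warmup=12, lr=1e-3):
    optimizer = Ranger(model.parameters(), lr=lr)

    def lr_lambda(epoch):
        # Linear warmup for the first 12 epochs, then cosine decay to 0.
        if epoch < warmup:
            return (epoch + 1) / warmup
        progress = (epoch - warmup) / max(1, epochs - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```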
- AWP
- It prevents the validation loss from diverging, but it doesn't seem to improve the edit distance.
- Used to enable training for many epochs (sketched below)
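A minimal AWP sketch in the style commonly used in Kaggle solutions: perturb the weights along the gradient direction scaled by the parameter norm, backprop an adversarial loss, then restore; the hyperparameter values are assumptions.

```python
import torch

class AWP:
    """Adversarial Weight Perturbation: perturb weights along the gradient
    (scaled by parameter norm), backprop the adversarial loss, restore."""
    def __init__(self, model, adv_lr=1e-4, adv_eps=1e-2):
        self.model, self.adv_lr, self.adv_eps = model, adv_lr, adv_eps
        self.backup = {}

    def perturb(self):
        for name, p in self.model.named_parameters():
            if p.requires_grad and p.grad is not None and "weight" in name:
                self.backup[name] = p.data.clone()
                norm = p.grad.norm() + 1e-12
                r = self.adv_lr * p.grad / norm * p.data.norm()
                p.data.add_(r.clamp(-self.adv_eps, self.adv_eps))

    def restore(self):
        for name, p in self.model.named_parameters():
            if name in self.backup:
                p.data.copy_(self.backup[name])
        self.backup = {}

# Usage inside a training step (after the normal backward pass):
#   awp.perturb(); loss_adv = criterion(model(x), y); loss_adv.backward()
#   awp.restore(); optimizer.step(); optimizer.zero_grad()
```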
- Augmentation
- Time, spatial, landmark masking
- Time reverse (these augmentations are sketched below)
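Hedged sketches of these augmentations; mask sizes and probabilities are assumptions, and time reversal is assumed to reverse the target phrase as well.

```python
import numpy as np

def time_mask(x, rng, max_frac=0.15):
    # Zero out a random contiguous span of frames. x: (T, N, 2)
    T = len(x)
    w = rng.integers(1, max(2, int(T * max_frac)))
    t0 = rng.integers(0, max(1, T - w))
    x[t0:t0 + w] = 0
    return x

def spatial_mask(x, rng, size=0.3):
    # Zero out landmarks falling inside a random x/y box.
    x0, y0 = rng.uniform(0, 1 - size, size=2)
    inside = ((x[..., 0] > x0) & (x[..., 0] < x0 + size) &
              (x[..., 1] > y0) & (x[..., 1] < y0 + size))
    x[inside] = 0
    return x

def landmark_mask(x, rng, p=0.1):
    # Drop a random subset of landmark indices across all frames.
    drop = rng.random(x.shape[1]) < p
    x[:, drop] = 0
    return x

def time_reverse(x, phrase):
    # Reverse frames; the target phrase is reversed accordingly.
    return x[::-1].copy(), phrase[::-1]
```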
- Autoregressive decoder
- Used a joint loss: CTC loss for the encoder and cross-entropy for the decoder (see the sketch below)
- Inference is slower and the parameter count is larger than with the CTC encoder alone, but the performance improvement was not clear, so it was not used.
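For reference, the joint objective could be sketched as below; the 0.7/0.3 weighting and the tensor layouts are assumptions.

```python
import torch.nn.functional as F

def joint_loss(enc_logits, enc_lens, ctc_targets, ctc_target_lens,
               dec_logits, dec_targets, ctc_weight=0.7, blank=0):
    # CTC loss on the encoder output (frames-first layout for F.ctc_loss).
    log_probs = enc_logits.log_softmax(-1).transpose(0, 1)   # (T, B, V)
    ctc = F.ctc_loss(log_probs, ctc_targets, enc_lens, ctc_target_lens,
                     blank=blank)
    # Cross-entropy on the autoregressive decoder's next-token predictions.
    ce = F.cross_entropy(dec_logits.flatten(0, 1), dec_targets.flatten(),
                         ignore_index=-100)
    return ctc_weight * ctc + (1 - ctc_weight) * ce
```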