General Plan #71

egorsmkv · 2025-02-26T17:40:27Z

What we want:

A large corpus (~5k-10k hours) of Ukrainian speech from different domains: audiobooks, broadcast speech, in-room speech, online conferences, etc.
Train open-source models that developers can easily use
Evaluation datasets to check the quality of existing Speech-to-Text models: Evaluate Speech-to-Text models #52
A test-machine for all STT/TTS models that generates JSONL files for automated evaluation (predictions and references, in the STT case) together with metadata (RTF, GPU card, etc.). It should be a container-based project.
Create leaderboards for STT and TTS tasks: Add Speech-to-Text leaderboard #60, Add Text-to-Speech leaderboard #63
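To make the test-machine output more concrete, here is a minimal sketch of what one JSONL record per utterance could look like. Every field name (`model`, `audio_id`, `audio_seconds`, and so on) is an assumption made for illustration; the plan only says the files should hold predictions, references, and metadata such as RTF and the GPU card. RTF is computed here as processing time divided by audio duration.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class EvalRecord:
    # Hypothetical schema: the plan does not fix any field names.
    model: str
    audio_id: str
    reference: str
    prediction: str
    audio_seconds: float
    inference_seconds: float
    gpu: str

    @property
    def rtf(self) -> float:
        # Real-time factor: inference time divided by audio duration.
        return self.inference_seconds / self.audio_seconds


def to_jsonl_line(record: EvalRecord) -> str:
    """Serialize one record as a single JSON line, keeping Cyrillic readable."""
    row = asdict(record)
    row["rtf"] = round(record.rtf, 4)
    return json.dumps(row, ensure_ascii=False)


def write_jsonl(records, path):
    """Write one JSON object per line, which is all JSONL requires."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(to_jsonl_line(record) + "\n")
```

Storing the raw durations next to the derived RTF would let the leaderboard recompute the metric later if its definition changes.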

How to achieve it:

Task 1:

Create a dataset with pseudo-labels using a multilingual ASR model (for example, Whisper)
Filter out non-Ukrainian samples
Align the data using a CTC-based model to produce a cleaner dataset for further modeling
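The filtering step above could be sketched roughly as follows. It assumes each pseudo-labeled sample is a dict with hypothetical keys `text`, `language`, and `language_prob` filled in by the ASR model; these names and the 0.8 threshold are illustrative, not part of the plan. On top of the detected language code (`uk`), it applies a cheap heuristic: the letters ы, э, ъ, ё do not belong to the Ukrainian alphabet, so a transcript containing them is treated as suspicious.

```python
# Cyrillic letters that are not part of the Ukrainian alphabet.
NON_UKRAINIAN_LETTERS = set("ыэъё")


def is_probably_ukrainian(sample: dict, min_prob: float = 0.8) -> bool:
    """Heuristic check on one pseudo-labeled sample (hypothetical schema)."""
    if sample.get("language") != "uk":
        return False
    if sample.get("language_prob", 0.0) < min_prob:
        return False
    text = sample.get("text", "").lower()
    # Reject transcripts containing letters foreign to Ukrainian.
    return not (set(text) & NON_UKRAINIAN_LETTERS)


def filter_ukrainian(samples, min_prob: float = 0.8):
    """Keep only the samples that pass the heuristic."""
    return [s for s in samples if is_probably_ukrainian(s, min_prob)]
```

A heuristic like this will drop some valid samples (for example, Ukrainian speech quoting another language), so the threshold and letter set would need tuning on real data.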

Task 2:

Task 3:

Task 4:

Task 5:

Create the leaderboards as tables with all the metadata we need. The tables should be generated automatically from the JSONL files produced by the test-machine.
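As a rough illustration of the automatic table generation, the sketch below aggregates JSONL lines into a Markdown table sorted by word error rate. It assumes each line carries the hypothetical keys `model`, `gpu`, `reference`, `prediction`, and `rtf`; the real schema is whatever the test-machine ends up emitting. WER is computed as word-level edit distance divided by the number of reference words, with no text normalization.

```python
import json
from collections import defaultdict


def word_errors(reference: str, hypothesis: str):
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            # Deletion, insertion, or substitution/match.
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1], len(ref)


def leaderboard_rows(jsonl_lines):
    """Aggregate per-utterance records into (model, gpu, WER, mean RTF) rows."""
    stats = defaultdict(lambda: {"errors": 0, "words": 0, "rtf": []})
    for line in jsonl_lines:
        if not line.strip():
            continue
        row = json.loads(line)
        errors, words = word_errors(row["reference"], row["prediction"])
        entry = stats[(row["model"], row["gpu"])]
        entry["errors"] += errors
        entry["words"] += words
        entry["rtf"].append(row["rtf"])
    rows = []
    for (model, gpu), entry in stats.items():
        wer = entry["errors"] / entry["words"] if entry["words"] else 0.0
        rows.append((model, gpu, wer, sum(entry["rtf"]) / len(entry["rtf"])))
    return sorted(rows, key=lambda r: r[2])  # best (lowest) WER first


def to_markdown(rows):
    """Render the aggregated rows as a Markdown table."""
    out = ["| Model | GPU | WER | Mean RTF |", "| --- | --- | --- | --- |"]
    for model, gpu, wer, rtf in rows:
        out.append(f"| {model} | {gpu} | {wer:.2%} | {rtf:.3f} |")
    return "\n".join(out)
```

Calling `to_markdown(leaderboard_rows(open(path, encoding="utf-8")))` would yield a table that can be pasted into a README or regenerated in CI whenever new result files appear.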

egorsmkv pinned this issue Feb 26, 2025
