Add README

graphemecluster · graphemecluster · commit 96f3d792fe50 · 2025-10-21T20:17:41.000+08:00
diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
@@ -3,6 +3,10 @@ on:
     push:
         branches:
             - main
+        paths-ignore:
+            - '*.md'
+            - 'LICENSE'
+            - '.gitignore'
 permissions:
     contents: read
     pages: write
diff --git a/README.md b/README.md
@@ -1,9 +1,120 @@
-# 香港圍頭話及客家話文字轉語音
+<h1>
+  <a href="https://hkilang.github.io/TTS/"><img src="./public/assets/favicon-256x256.png" width="84" align="left" /></a>
+  <div lang="zh-HK">香港圍頭話及客家話文字轉語音</div>
+  <div><sub>Hong Kong Waitau & Hakka Text-to-Speech</sub></div>
+</h1>
 
-## Data Preprocessing
+> <p>
+>   <div lang="zh-HK">輸入文字，聆聽圍頭話、客家話發音，傳承本土語言。</div>
+>   <div>Hear Waitau and Hakka from your words. Keep the languages alive.</div>
+> </p>
 
-**Inputs:** `dictionary.csv`, `public.csv`, `HakkaWords.csv`, `WaitauWords.csv`
-**Process:** `compile.py`
-**Outputs:** `chars.csv`, `hakka_words.csv`, `waitau_words.csv`
+<p>
+  <div lang="zh-HK">本儲存庫包含<a href="https://hkilang.github.io/TTS/"><strong>香港圍頭話及客家話文字轉語音</strong></a>朗讀器前端部分之原始碼。</div>
+  <div>This repository contains the source code of the front-end part of the <a href="https://hkilang.github.io/TTS/"><strong>Hong Kong Waitau & Hakka Text-to-Speech</strong></a> reader.</div>
+</p>
 
-In addition to words from `HakkaWords.csv` and `WaitauWords.csv`, extra words are automatically generated from collocations from the note column of `dictionary.csv` and entries with frequencies ≥ 10 from `public.csv`. Only entries which include at least one polyphone in the target language are included.
+<p>
+  <a href="https://hkilang.org"><img src="./public/assets/hkilang-logo.svg" width="64" align="left" /></a>
+  <div lang="zh-HK">本程式由<a href="https://hkilang.org">香港本土語言保育協會</a>開發及提供。</div>
+  <div>This application is developed and made available by the <a href="https://hkilang.org">Association for Conservation of Hong Kong Indigenous Languages</a> (HKILANG).</div>
+</p>
+
+## 簡介 Introduction
+
+<p>
+  <div lang="zh-HK"><strong>圍頭話</strong>及<strong>客家話</strong>皆是香港的非物質文化遺產，然而這些本土語言傳承因城市化出現了斷層。圍村新一代接觸圍頭話、客家話的機會甚少，或「曉聽唔曉講」。</div>
+  <div><strong>Waitau</strong> and <strong>Hakka</strong> are both recognised as intangible cultural heritage in Hong Kong. However, urbanisation has been disrupting the transmission of these indigenous languages. Younger generations in walled villages are rarely exposed to Waitau and Hakka, and many are what the elders call “<span lang="zh-HK">曉聽唔曉講</span>” — able to understand, but unable to speak.</div>
+</p>
+
+<p>
+  <div lang="zh-HK">使用本文字轉語音朗讀器，可以作為學習圍頭話、客家話的資源，亦可以成為與圍村長輩溝通的工具，延續家庭和社區的語言傳承。</div>
+  <div>This text-to-speech reader serves not only as a resource for learning Waitau and Hakka, but also as a communication tool for engaging with the elderly in walled villages, helping to preserve the linguistic heritage within families and communities.</div>
+</p>
+
+## Development
+
+This app is a static single-page application (SPA) built with [TypeScript](https://www.typescriptlang.org), [React](https://reactjs.org), [Tailwind CSS](https://tailwindcss.com) and [daisyUI](https://daisyui.com).
+
+To ship it as native Android and iOS applications, the app is first turned into a progressive web application (PWA) with the [Vite PWA plugin](https://vite-pwa-org.netlify.app) and then packaged with [PWABuilder](https://www.pwabuilder.com). For iOS, the output from PWABuilder is further compiled with Xcode.
+
+### Files Overview
+
+* `public/`
+  * `assets/`: contains pre-generated icons and screenshots of different sizes for use as a PWA
+  * `site.webmanifest`: web application manifest for use as a PWA
+* `src/`
+  * `db/`: contains database initialisation & manipulation logic that downloads and saves model & audio data into an IndexedDB for offline usage.
+  * `res/`: contains both raw and processed Waitau & Hakka pronunciation data of Chinese characters and words, as well as the compilation script. See the _Data Preprocessing_ section below.
+  * `inference/`: contains the code of a Web Worker for offline model inference as well as the API of the worker. A brief description of `infer.ts` is given in the _Models & Inference_ section below.
+  * `index.tsx` is the entry point of the app.
+  * `index.css` is a Tailwind CSS stylesheet containing repeatedly used styles that are not suitable to be inlined.
+  * `App.tsx` contains the outermost React component.
+  * The remaining files contain the definitions of other components, hooks, types and utility functions.
+
+### Technical Overview of How the App Works
+
+The app first segments the input into characters and non-characters (punctuation and symbols) in `src/parse.ts`. Then, characters are converted into pronunciations using pronunciation data loaded into a trie data structure in `src/Resource.ts`. The input and the conversion result are then displayed as a `SentenceCard` component (`src/SentenceCard.ts`). The user can choose the desired pronunciation inside the card, and the audio is generated by feeding that pronunciation to the text-to-speech model as its input.
+
+### Pronunciation Data
+
+Developers can safely ignore the actual content of the `src/res/` folder and only need to keep the following in mind:
+
+* The app only loads and should only load the processed outputs, `chars.csv`, `waitau_words.csv` and `hakka_words.csv`. Among them:
+  * `chars.csv` contains four columns:
+    * `char`: the Chinese character, consisting of a single Unicode code point. This column is **not unique** and may repeat if the character is a polyphone (has multiple pronunciations).
+    * `waitau`, `hakka`: the Waitau and Hakka pronunciation of the character in [HKILANG](https://hkilang.org)'s own romanisation scheme, if any. You can refer to the website for the details of the romanisation scheme, but do not make further assumptions about the format beyond `/^[a-zäöüæ]+[1-6]$/` for Waitau and `/^[a-z]+[1-6]$/` for Hakka in the code.
+    * `notes`: further explanation/clarification/disambiguation displayed underneath the character, if any
+  * `waitau_words.csv` and `hakka_words.csv` each contain two columns:
+    * `char`: the Chinese characters of the word, consisting of two or more Unicode code points. Again, this column is **not unique** and may repeat if the word can be pronounced in more than one way.
+    * `pron`: the Waitau or Hakka pronunciation of the word, with the **same number of syllables** as the number of characters in the `char` column. Adjacent syllables are separated by a single ASCII (ordinary) space.
+* The raw data, `dictionary.csv`, `WaitauWords.csv`, `HakkaWords.csv` and `public.csv`, as well as `compile.py`, should **never** be referenced in the code.
+
+The outputs are precompiled and managed as part of the Git repository, so you need not generate them manually unless you have modified the inputs or the compilation script described below.
+
+#### Data Preprocessing
+
+> [!NOTE]
+> <p>
+>   <div>This section is intended for dictionary maintainers. Developers of the body of the app need not read it.</div>
+>   <div>Read the above section for the description of the compilation outputs.</div>
+> </p>
+
+`src/res/compile.py` gathers pronunciation data from the following sources as the inputs:
+
+* `dictionary.csv`, `WaitauWords.csv`, `HakkaWords.csv`: Pronunciation data from [HKILANG](https://hkilang.org)'s dictionary, surveyed and collected from villages in Hong Kong in an earlier project. These are the core sources.
+* `public.csv`: Lexicon table from the [TypeDuck](https://typeduck.hk) Cantonese keyboard, used to supplement relatively uncommon words so that the pronunciation of polyphones (characters with multiple pronunciations) can be chosen automatically. This is done inside the `generate` function in `compile.py`: Jyutping is first converted into HKILANG's romanisation scheme by the `rom_map` function, and the Waitau/Hakka equivalent of the Cantonese pronunciation is then looked up in the `dictionary.csv` table. Only entries with frequencies ≥ 10 which include at least one polyphone in the target language are included.
+
+In addition to words from `WaitauWords.csv`, `HakkaWords.csv` and `public.csv`, words are also extracted from collocations from the note column of `dictionary.csv`.
+
+The compilation script cleanses and normalises the inputs, computes extra words and outputs the result into the three files described in the above section. All monosyllabic results, whether linguistically a word or not, are included in the `chars.csv` file, and the polysyllabic results are written to `waitau_words.csv` and `hakka_words.csv`.
+
+### Audio Generation
+
+The app provides 3 different modes for generating audio from the input:
+
+1. **Online inference**: The app requests (`fetch`es) audio from the backend at the following URL:
+
+   > `https://Chaak2.pythonanywhere.com/TTS/${language}/${text}?voice=${voice}&speed=${speed}`
+
+   where the parameters are:
+
+   * `${language}`, which must be either `waitau` or `hakka`;
+   * `${text}`, which is the **romanised** text input, with syllables separated by spaces (`%20`) or `+`. There are 7 available punctuation marks: `.`, `,`, `!`, `?`, `…`, `'` and `-`. Separators are required both before and after punctuation marks. Percent-encoding is not mandatory except for the punctuation mark `?` (`%3F`).  
+     Currently, only transliterations in HKILANG's own romanisation system are accepted. **Chinese characters are not yet supported** in the API, so you will need to first convert Chinese text into pronunciation using this app's interface.
+   * `${voice}`, which may be either `male` or `female` (optional, defaults to `male`); and
+   * `${speed}`, which may be any number between 0.5 and 2 (optional, defaults to 1).
+
+   `${` and `}` indicate a parameter and should not be included as part of the URL.
+
+   The backend is deployed as a [PythonAnywhere](https://www.pythonanywhere.com) instance and its code is open-sourced at [github.com/hkilang/TTS-API](../../../TTS-API), a reduced version of [Bert-VITS2](../../../../fishaudio/Bert-VITS2) with dead code eliminated. The pre-trained [PyTorch](https://pytorch.org) machine learning models used for inference are published on [the release page](../../../TTS-API/releases).
+
+2. **Offline inference**: The app performs inference within itself in the Web Worker, `src/inference/worker.ts`, using the same machine learning models used for online inference but exported in [ONNX](https://onnx.ai) format, available in the [github.com/hkilang/TTS-models](../../../TTS-models) repo. Each model consists of several components, some of which are split into smaller chunks due to size limitations. In the app, each model is downloaded and stored into an IndexedDB at the user's request. The user must download the desired model manually before audio generation.
+
+   The `infer` method in `src/inference/infer.ts` resembles the `SynthesizerTrn.infer()` method in [`models.py`](../../../TTS-API/blob/main/models.py) in the TTS-API repo. In the method, each model component is loaded from the IndexedDB and the weights are released immediately after use to avoid out-of-memory errors on low-end devices with limited memory. A custom class, `NDArray`, is written for performing mathematical computation on the intermediate results inferred by the model components.
+
+3. **Lightweight mode**: The app **concatenates the pre-generated audio files** available in the [github.com/hkilang/TTS-audios](../../../TTS-audios) repo. These files are created as follows: for each character in the dictionary, an audio file is generated using the same model as offline inference. The generated files are then concatenated into a single file, `chars.bin`, and the corresponding pronunciation and the start offset for each audio file are saved into an offset table, `chars.csv`. The same is done for each word in the dictionary, producing the files `words.bin` (split into smaller chunks due to size limitations) and `words.csv`. These 4 files form an audio pack.
+
+   In the app, each audio pack is downloaded and stored into an IndexedDB per user request. The user must download the desired audio pack manually before audio generation. During audio generation, the app loads the audio components from IndexedDB, uses the offset tables to locate the byte ranges for each phrase, slices those segments from the audio components and decodes them. Then, all the decoded audio segments are concatenated in order into a single audio buffer for playback.
+
+   Although the generation process is fast thanks to its computational simplicity, **the use of this mode is discouraged** because of the poor quality of its output. It is intended only as a last resort on extremely low-end devices without an Internet connection, when even offline inference fails.
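
The online inference URL described in the README above can be assembled as in the following minimal TypeScript sketch. `buildTtsUrl`, its token-array input and its validation rules are hypothetical and are not part of the repository. Only the URL shape, the allowed parameter values, the punctuation set and the syllable patterns are taken from the README text. The README does not say whether a separator is also needed after a trailing punctuation mark, so the sketch only places one between adjacent tokens.

```typescript
// Hypothetical helper for illustration; not code from the repository.
type Language = "waitau" | "hakka";
type Voice = "male" | "female";

const BASE_URL = "https://Chaak2.pythonanywhere.com/TTS";

// The only syllable formats the README says code may assume.
const SYLLABLE: Record<Language, RegExp> = {
  waitau: /^[a-zäöüæ]+[1-6]$/,
  hakka: /^[a-z]+[1-6]$/,
};

// The 7 punctuation marks listed in the README.
const PUNCTUATION = new Set<string>([".", ",", "!", "?", "…", "'", "-"]);

function buildTtsUrl(
  language: Language,
  tokens: readonly string[], // romanised syllables and punctuation marks, one per element
  voice: Voice = "male",
  speed = 1,
): string {
  if (!(speed >= 0.5 && speed <= 2)) {
    throw new RangeError(`speed must be between 0.5 and 2, got ${speed}`);
  }
  for (const token of tokens) {
    if (!PUNCTUATION.has(token) && !SYLLABLE[language].test(token)) {
      throw new Error(`not a ${language} syllable or supported punctuation mark: ${token}`);
    }
  }
  // Joining with spaces puts a separator between every pair of adjacent tokens.
  // encodeURIComponent turns each space into %20 and "?" into %3F.
  const text = encodeURIComponent(tokens.join(" "));
  return `${BASE_URL}/${language}/${text}?voice=${voice}&speed=${speed}`;
}

// The syllables below are placeholders chosen only to match the documented pattern.
console.log(buildTtsUrl("hakka", ["ngai2", "he4", "?"]));
// https://Chaak2.pythonanywhere.com/TTS/hakka/ngai2%20he4%20%3F?voice=male&speed=1
```

Percent-encoding everything is a conservative choice: the README only requires it for `?`, but encoding the whole text avoids special-casing that one mark. The returned string could then be passed to `fetch`.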