This project implements an enhanced pipeline for automated interlinear glossing. The system combines four components: a Transformer-based character encoder with relative positional encodings; an unsupervised morpheme segmentation module with adaptive thresholding and a full forward–backward algorithm (with utility masking); a translation encoder; and a Transformer decoder that uses cross-attention to generate gloss sequences from source words and their translations.
- **Encoder.py**
  Implements a Transformer-based character encoder with relative positional encodings.
  - Input: one-hot encoded source characters of shape `(batch_size, seq_len, input_size)` and the valid sequence lengths.
  - Output: contextualized embeddings of shape `(batch_size, seq_len, embed_dim)`, which serve as input to the segmentation module.
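  A minimal sketch of this interface, using PyTorch's stock `nn.TransformerEncoder` with learned absolute positions standing in for the relative positional encodings used here (class and argument names are illustrative, not the repo's actual code):

  ```python
  import torch
  import torch.nn as nn

  class CharEncoderSketch(nn.Module):
      """Illustrative stand-in: projects one-hot characters and contextualizes them."""
      def __init__(self, input_size, embed_dim=128, num_heads=4, num_layers=2, max_len=512):
          super().__init__()
          self.project = nn.Linear(input_size, embed_dim)    # one-hot -> dense embedding
          self.positions = nn.Embedding(max_len, embed_dim)  # absolute positions (real module: relative)
          layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
          self.encoder = nn.TransformerEncoder(layer, num_layers)

      def forward(self, one_hot, lengths):
          # one_hot: (batch, seq_len, input_size); lengths: (batch,)
          seq_len = one_hot.size(1)
          pos = torch.arange(seq_len, device=one_hot.device)
          x = self.project(one_hot) + self.positions(pos)
          pad_mask = pos.unsqueeze(0) >= lengths.unsqueeze(1)    # True marks padding
          return self.encoder(x, src_key_padding_mask=pad_mask)  # (batch, seq_len, embed_dim)
  ```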
- **MorphemeSegmenter.py**
  Implements an unsupervised segmentation module that computes a segmentation probability for each character, predicts an adaptive threshold from rich encoder statistics (max, mean, variance), and applies a forward–backward algorithm with utility masks (via `make_mask_2d` and `make_mask_3d`) to produce a binary segmentation mask and auxiliary outputs (predicted morpheme count and raw segmentation probabilities).
  - Input: encoder outputs and valid sequence lengths (and optionally target morpheme counts).
  - Output: binary segmentation mask, morpheme count, adaptive threshold, and segmentation probabilities.
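  A hedged sketch of the adaptive-thresholding idea: per-character boundary probabilities are compared against a threshold predicted from pooled encoder statistics. The layer shapes are assumptions, and the forward–backward pass over the probabilities is omitted:

  ```python
  import torch
  import torch.nn as nn

  class AdaptiveThresholdSketch(nn.Module):
      """Illustrative only: boundary probs plus a threshold predicted from stats."""
      def __init__(self, embed_dim):
          super().__init__()
          self.boundary = nn.Linear(embed_dim, 1)   # per-character segmentation logit
          self.threshold = nn.Sequential(           # (max, mean, var) stats -> threshold in (0, 1)
              nn.Linear(3 * embed_dim, embed_dim), nn.ReLU(),
              nn.Linear(embed_dim, 1), nn.Sigmoid(),
          )

      def forward(self, encodings):
          # encodings: (batch, seq_len, embed_dim)
          probs = torch.sigmoid(self.boundary(encodings)).squeeze(-1)  # (batch, seq_len)
          stats = torch.cat([encodings.max(dim=1).values,
                             encodings.mean(dim=1),
                             encodings.var(dim=1)], dim=-1)            # (batch, 3 * embed_dim)
          tau = self.threshold(stats)                                  # (batch, 1), broadcasts below
          mask = (probs > tau).float()                                 # binary segmentation mask
          return mask, probs, tau
  ```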
- **GlossingDecoder.py**
  Implements a Transformer decoder that generates gloss tokens. It cross-attends over a memory formed by concatenating aggregated morpheme representations (derived from the segmentation mask via an aggregation function) with a translation representation.
  - Input: target gloss token indices (for teacher forcing during training) and a memory tensor (aggregated segments + translation representation).
  - Output: logits over the gloss vocabulary for each token.
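  A minimal sketch of this memory construction using PyTorch's stock `nn.TransformerDecoder` (names and dimensions are assumptions, not the repo's actual interface):

  ```python
  import torch
  import torch.nn as nn

  class GlossDecoderSketch(nn.Module):
      def __init__(self, vocab_size, embed_dim=128, num_heads=4, num_layers=2):
          super().__init__()
          self.embed = nn.Embedding(vocab_size, embed_dim)
          layer = nn.TransformerDecoderLayer(embed_dim, num_heads, batch_first=True)
          self.decoder = nn.TransformerDecoder(layer, num_layers)
          self.out = nn.Linear(embed_dim, vocab_size)

      def forward(self, gloss_tokens, morpheme_repr, translation_repr):
          # morpheme_repr: (batch, num_morphemes, embed_dim)
          # translation_repr: (batch, trans_len, embed_dim)
          memory = torch.cat([morpheme_repr, translation_repr], dim=1)  # joint cross-attention memory
          tgt = self.embed(gloss_tokens)                                # (batch, gloss_len, embed_dim)
          causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
          hidden = self.decoder(tgt, memory, tgt_mask=causal)           # teacher-forced decoding
          return self.out(hidden)                                       # logits over gloss vocabulary
  ```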
- **Utilities.py**
  Contains helper functions for the pipeline:
  - Masking: `make_mask`, `make_mask_2d`, and `make_mask_3d` create boolean masks to handle variable-length sequences.
  - Aggregation: `aggregate_segments` pools contiguous character encoder outputs (based on the segmentation mask) into fixed-size morpheme-level representations.
  - Pooling: `max_pool_2d` is available for auxiliary pooling operations where needed.
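  A hedged sketch of the masking and aggregation ideas; both function bodies here are assumptions (in particular, `aggregate_segments_mean` mean-pools each segment and assumes the final valid character is marked as a boundary):

  ```python
  import torch

  def make_mask_2d_sketch(lengths, max_len=None):
      # Boolean mask of shape (batch, max_len): True marks padding positions.
      max_len = max_len or int(lengths.max())
      positions = torch.arange(max_len, device=lengths.device)
      return positions.unsqueeze(0) >= lengths.unsqueeze(1)

  def aggregate_segments_mean(encodings, boundaries):
      # encodings: (seq_len, embed_dim); boundaries: (seq_len,) long tensor with 1
      # at the last character of each morpheme. Mean-pools the characters per segment.
      seg_ids = torch.cumsum(boundaries, dim=0) - boundaries  # segment id of each character
      num_segments = int(boundaries.sum())
      summed = torch.zeros(num_segments, encodings.size(1)).index_add_(0, seg_ids, encodings)
      counts = torch.zeros(num_segments).index_add_(0, seg_ids, torch.ones(len(seg_ids)))
      return summed / counts.unsqueeze(1)                     # (num_segments, embed_dim)
  ```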
- **GlossingModel.py**
  Integrates the above modules into a full end-to-end system, implemented as a PyTorch Lightning module. The pipeline takes one-hot encoded source features, source lengths, target gloss tokens, and translation tokens, and passes them through the encoder, segmentation module, and decoder to generate gloss predictions.
  - Output: logits over the gloss vocabulary, along with auxiliary segmentation outputs (morpheme count, adaptive threshold, and segmentation probabilities).
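  A hypothetical wiring of the forward pass, following the interfaces described above (the class name and exact call signatures are assumptions; the real module lives in `GlossingModel.py`):

  ```python
  import torch
  import pytorch_lightning as pl

  class GlossingPipelineSketch(pl.LightningModule):
      """Hypothetical wiring only, composing the documented components."""
      def __init__(self, encoder, segmenter, decoder, aggregate_fn):
          super().__init__()
          self.encoder, self.segmenter, self.decoder = encoder, segmenter, decoder
          self.aggregate_fn = aggregate_fn

      def forward(self, source_onehot, source_lengths, gloss_tokens, translation_repr):
          encodings = self.encoder(source_onehot, source_lengths)
          seg_mask, morph_count, threshold, seg_probs = self.segmenter(encodings, source_lengths)
          morphemes = self.aggregate_fn(encodings, seg_mask)        # morpheme-level memory
          memory = torch.cat([morphemes, translation_repr], dim=1)  # + translation representation
          logits = self.decoder(gloss_tokens, memory)               # gloss-vocabulary logits
          return logits, (morph_count, threshold, seg_probs)
  ```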
- **main.py**
  Contains the training script, which:
  - loads the glossing data (via `data.py`),
  - builds the necessary vocabularies,
  - creates DataLoader objects,
  - trains the integrated glossing model using PyTorch Lightning,
  - saves the model checkpoint, and
  - runs inference and returns metrics.

  Command-line arguments (batch size, epochs, language, etc.) are parsed using `argparse` to configure training dynamically (see the sketch below). Additionally, the script prints the predictions alongside the true glosses for evaluation.
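  A sketch of what such a parser might look like; the flag names follow the hyperparameter list at the end of this README, but the defaults are assumptions:

  ```python
  import argparse

  def parse_args():
      parser = argparse.ArgumentParser(description="Train the glossing model")
      parser.add_argument("--language", type=str, default="Gitksan", help="dataset language")
      parser.add_argument("--batch", type=int, default=32, help="batch size")
      parser.add_argument("--layers", type=int, default=2, help="number of layers")
      parser.add_argument("--dropout", type=float, default=0.1, help="dropout rate per layer")
      parser.add_argument("--lr", type=float, default=1e-3, help="learning rate")
      parser.add_argument("--embdim", type=int, default=128, help="embedding dimension")
      parser.add_argument("--ffdim", type=int, default=512, help="feed-forward dimension")
      parser.add_argument("--numheads", type=int, default=4, help="number of attention heads")
      parser.add_argument("--epochs", type=int, default=20, help="number of training epochs")
      return parser.parse_args()
  ```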
- **data.py**
  Loads the datasets and wraps them in PyTorch DataLoader objects.
  - Input: a data file.
  - Output: DataLoader objects for the train, validation, and test sets.
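  A minimal sketch of the kind of loader construction involved; the field layout and padding collate function are assumptions, not the repo's actual code:

  ```python
  import torch
  from torch.utils.data import DataLoader

  def collate(batch):
      # batch: list of (source_char_ids, gloss_ids) pairs; pads each field to the longest item.
      sources, glosses = zip(*batch)
      lengths = torch.tensor([len(s) for s in sources])
      pad = lambda seqs: torch.nn.utils.rnn.pad_sequence(
          [torch.tensor(s) for s in seqs], batch_first=True, padding_value=0)
      return pad(sources), lengths, pad(glosses)

  # Hypothetical usage, for any torch Dataset yielding (source, gloss) pairs:
  # train_loader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate)
  ```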
- **metrics.py**
  Contains helper functions for our metric calculations:
  - word-level gloss accuracy, and
  - morpheme-level gloss accuracy.
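  One plausible reading of these two metrics, as a sketch (the actual definitions live in `metrics.py`; the hyphen morpheme separator is an assumption):

  ```python
  def word_accuracy(pred_glosses, gold_glosses):
      """Fraction of words whose full gloss string matches the gold gloss exactly."""
      correct = sum(p == g for p, g in zip(pred_glosses, gold_glosses))
      return correct / max(len(gold_glosses), 1)

  def morpheme_accuracy(pred_glosses, gold_glosses, sep="-"):
      """Fraction of gold morpheme glosses matched position by position."""
      correct = total = 0
      for pred, gold in zip(pred_glosses, gold_glosses):
          pred_parts, gold_parts = pred.split(sep), gold.split(sep)
          total += len(gold_parts)
          correct += sum(p == g for p, g in zip(pred_parts, gold_parts))
      return correct / max(total, 1)
  ```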
- **old_main.py**
  Contains the earlier training script, which:
  - loads the dataset (from `Dummy_Dataset.csv`),
  - builds vocabularies for source characters, gloss tokens, and translation tokens,
  - creates a PyTorch DataLoader,
  - trains the integrated glossing model using PyTorch Lightning, and
  - saves a model checkpoint (e.g., `glossing_model.ckpt`).
- **Sample.py**
  Provides a sample prediction script that:
  - loads the dataset (to retrieve vocabularies and a sample input),
  - loads the trained model checkpoint,
  - converts the source input into the required one-hot format,
  - runs the integrated model to generate gloss predictions, and
  - outputs the predicted gloss (truncated at the first occurrence of the `</s>` token).
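  The truncation step might look like this sketch (the `</s>` token comes from the description above; the function name is hypothetical):

  ```python
  def truncate_at_eos(tokens, eos="</s>"):
      """Keep everything before the first end-of-sequence token."""
      return tokens[:tokens.index(eos)] if eos in tokens else tokens

  # Example: ["house", "-PL", "</s>", "<pad>"] -> ["house", "-PL"]
  ```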
Follow these steps:
- Use a Python version between 3.11.0 and 3.12.7.
PYTHON VIEW
- Clone the Repository:
  ```bash
  git clone https://github.com/gfkaceli/COMP8730_Project.git
  cd COMP8730_Project
  ```
- Create Environment and Install Dependencies
  - A. Create a virtual environment:
    ```bash
    python -m venv venv
    ```
  - B. Activate the virtual environment:
    - For Windows:
      ```bash
      venv\Scripts\activate
      ```
    - For Linux/macOS:
      ```bash
      source venv/bin/activate
      ```
  - C. Install the dependencies:
    ```bash
    pip install -r requirements.txt
    ```
- Make a models directory (if you do not have one already):
  ```bash
  mkdir models
  ```
- Run the training and inference scripts:
  ```bash
  python main.py --language Gitksan --batch 7 --epochs 20
  python main.py --language Lezgi --batch 128 --numheads 32 --epochs 30
  python main.py --language Natugu --dropout 0.1354 --batch 128 --numheads 64 --epochs 20
  python main.py --language Tsez --batch 128 --epochs 25
  ```
CONDA VIEW
- The Conda version used is 24.11.3.
- Clone the Repository:
  ```bash
  git clone https://github.com/gfkaceli/COMP8730_Project.git
  cd COMP8730_Project
  ```
- Create Environment and Install Dependencies:
  ```bash
  conda create -n glossing_env python=3.12.7 pip  # any Python from 3.11.0 to 3.12.7 should work
  conda activate glossing_env
  pip install -r requirements.txt
  ```
- Make a models directory (if you do not have one already):
  ```bash
  mkdir models
  ```
- Run the training and inference scripts:
  ```bash
  python main.py --language Gitksan --batch 7 --epochs 20
  python main.py --language Lezgi --batch 128 --numheads 32 --epochs 30
  python main.py --language Natugu --dropout 0.1354 --batch 128 --numheads 64 --epochs 20
  python main.py --language Tsez --batch 128 --epochs 25
  ```
- After training, a `lightning_logs` directory will be created. In it you will find the training metrics, parameters, and checkpoints for the model; the final model checkpoint is saved as `models/glossing_model_{lang}.ckpt`.
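  To reuse a saved checkpoint, PyTorch Lightning's standard loading call should apply (a sketch; the class name is assumed to match the module defined in `GlossingModel.py`, and the path follows the naming scheme above):

  ```python
  from GlossingModel import GlossingModel  # assumption: the Lightning module is named this

  model = GlossingModel.load_from_checkpoint("models/glossing_model_Gitksan.ckpt")
  model.eval()  # switch to inference mode before predicting
  ```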
- Training and prediction are done in `main.py`; feel free to tune or adjust the hyperparameters to your liking.
- The hyperparameters available for tuning are as follows:
  - `--batch`: the batch size
  - `--layers`: the number of layers
  - `--dropout`: the dropout rate per layer
  - `--lr`: the learning rate
  - `--embdim`: the embedding dimension
  - `--ffdim`: the feed-forward dimension
  - `--numheads`: the number of attention heads
  - `--epochs`: the number of epochs
For any questions or concerns, reach out to [email protected] or [email protected].