Official implementation of the paper "Skeleton Motion Words for Unsupervised Skeleton-Based Temporal Action Segmentation".
Paper | Project Page | Poster | Video
Skeleton Motion Quantization (SMQ) is an unsupervised framework that temporally segments long, untrimmed skeleton sequences into meaningful actions.
SMQ learns discrete, patch-level representations of skeleton motion using vector quantization.
A dilated TCN encodes each joint independently into the latent space. The embeddings are grouped into short temporal patches and quantized into "motion words" using a learned codebook. A TCN decoder reconstructs the original skeleton sequence from these discrete patches.
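To make the core idea concrete, below is a minimal sketch of patch-level vector quantization in PyTorch. It is illustrative only, not the repository's implementation: the module name PatchQuantizer, the mean-pooling of frames into patches, and the hyperparameters num_words, dim, and patch_len are all assumptions.

```python
# Minimal sketch of patch-level vector quantization ("motion words").
# Illustrative assumptions throughout: names, shapes, and hyperparameters
# are NOT the repository's actual API.
import torch
import torch.nn as nn

class PatchQuantizer(nn.Module):
    def __init__(self, num_words=256, dim=64, patch_len=8):
        super().__init__()
        self.patch_len = patch_len
        self.codebook = nn.Embedding(num_words, dim)  # learned motion words

    def forward(self, z):
        # z: (B, T, dim) frame-wise embeddings from the encoder
        B, T, D = z.shape
        P = T // self.patch_len
        # group frames into short temporal patches (mean-pooling here;
        # the paper's grouping may differ): (B, P, dim)
        patches = z[:, :P * self.patch_len].view(B, P, self.patch_len, D).mean(dim=2)
        # squared Euclidean distance to every codeword: (B, P, num_words)
        dist = (patches.pow(2).sum(-1, keepdim=True)
                - 2 * patches @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(-1))
        idx = dist.argmin(dim=-1)        # discrete motion-word index per patch
        quantized = self.codebook(idx)   # (B, P, dim) quantized patches
        # straight-through estimator: copy gradients to the encoder
        quantized = patches + (quantized - patches).detach()
        return quantized, idx
```

The straight-through estimator is the standard VQ-VAE trick for passing gradients through the non-differentiable argmin. Once each patch carries a motion-word index, one simple way to read off candidate segments is wherever the index changes between consecutive patches.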
We present qualitative results from the HuGaDB, LARa, and BABEL datasets, illustrating the action segments predicted by SMQ alongside the ground truth.
The results show that SMQ effectively detects recurring actions and avoids the over-segmentation and under-segmentation issues observed in prior unsupervised methods.
The datasets can be downloaded from:
- HuGaDB (v2): https://github.com/romanchereshnev/HuGaDB
- LARa (v3 - OMoCap annotated): https://zenodo.org/records/8189341
- BABEL: https://babel.is.tue.mpg.de/
Each dataset has a preprocessing script in src/data/ that converts raw files into:
- features/ (.npy skeleton tensors)
- groundTruth/ (.txt frame-wise labels)
- mapping/ (action-id mapping)
HuGaDB
python src/data/hugadb.py -i <raw_hugadb_dir> -o data/
LARa
python src/data/lara.py -i <raw_lara_dir> -o data/
BABEL
First, create the BABEL subsets following https://github.com/line/Skeleton-Temporal-Action-Localization. This produces train and val .pkl files for each subset. Then run:
python src/data/babel.py -i train.pkl val.pkl -o data/ -p {babel1|babel2|babel3}
Example (BABEL Subset 1):
python src/data/babel.py -i train_split1.pkl val_split1.pkl -o data/ -p babel1
The preprocessing scripts create a subfolder under the data/ directory for each dataset. Inside each dataset folder (e.g., data/hugadb/), the structure should be as follows:
data/                      # root path for all datasets
├─ dataset_name/           # root path for a single dataset
│  ├─ features/            # skeleton features
│  │  ├─ fname1.npy
│  │  ├─ fname2.npy
│  │  ├─ ...
│  ├─ groundTruth/         # ground-truth labels
│  │  ├─ fname1.txt
│  │  ├─ fname2.txt
│  │  ├─ ...
│  ├─ mapping/             # mapping folder
│  │  ├─ mapping.txt       # action-id mapping file
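After preprocessing, a short snippet like the following can sanity-check the layout. The helper load_sample is hypothetical and not part of the repository; it only assumes the tree above, with frame-wise labels stored one per line.

```python
# Hypothetical sanity check for the layout above; not part of the repository.
import numpy as np
from pathlib import Path

def load_sample(root, dataset, fname):
    base = Path(root) / dataset
    feats = np.load(base / "features" / f"{fname}.npy")               # skeleton tensor
    labels = (base / "groundTruth" / f"{fname}.txt").read_text().splitlines()
    return feats, labels                                              # one label per frame

feats, labels = load_sample("data", "hugadb", "fname1")
print(feats.shape, len(labels))  # the label count should match the frame count
```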
To create and activate the conda environment, run the following commands:
conda env create -n smq -f environment.yml
conda activate smq
pip install --no-deps -r requirements.txt
To train the model, run main.py with the train action:
python main.py --action=train --dataset=DS
where DS is one of hugadb, lara, babel1, babel2, or babel3.
To evaluate the model, run main.py with the eval action:
python main.py --action=eval --dataset=DS
To evaluate a pretrained model (provided in models/pretrained/):
python main.py --action=eval --dataset=DS --ckpt models/pretrained/DS.model
In our code, we made use of the following repositories: MS-TCN, CTE, and VQ. We sincerely thank the authors for their codebases!
If you use the code, please cite our paper:
@InProceedings{Gokay_2025_ICCV,
author = {G\"okay, Uzay and Spurio, Federico and Bach, Dominik R. and Gall, Juergen},
title = {Skeleton Motion Words for Unsupervised Skeleton-Based Temporal Action Segmentation},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
pages = {12101-12111}
}