by Xin Jiang*, Junwei Zheng*, Ruiping Liu, Jiahang Li, Jiaming Zhang†, Sven Matthiesen, Rainer Stiefelhagen
* denotes equal contribution and † denotes corresponding author
- [2024.09.17] ATBench (Assistive Technology Benchmark) is accepted to WACV2025.
- [2024.10.13] We are excited to release the ATModel (Assistive Technology Model) training code. See INSTALL.md, DATASET.md, TRAIN.md, and EVALUATION.md.
ATBench is designed based on a pre-design user study with PVIs (People with Visual Impairments) and covers the five most crucial vision-language tasks: Panoptic Segmentation, Image Captioning, Visual Question Answering (VQA), Depth Estimation, and Optical Character Recognition (OCR). We also propose a novel ATModel that addresses all five tasks simultaneously.
More details can be found in our arXiv paper.
Checkpoints and Numbers:
| Model | PS (ADE-150)<br>PQ | DE (NYU-V2)<br>RMSE | OCR (6 datasets avg)<br>Acc (%) | IC (VizWiz_Cap)<br>CIDEr | VQA (VizWiz_VQA)<br>Acc (%) | #Params |
|---|---|---|---|---|---|---|
| Unified-IO (S) | - | 0.649 | - | - | 42.4 | 71M |
| Unified-IO (B) | - | 0.469 | - | - | 45.8 | 241M |
| Unified-IO (L) | - | 0.402 | - | - | 47.7 | 776M |
| X-Decoder (T) | 41.6 | - | - | - | - | 164M |
| GIT (T) | - | - | - | 113.1 | 68.0 | 0.7B |
| PaLI (T) | - | - | - | 117.2 | 67.5 | 3.0B |
| ATModel | 38.5 | 0.425 | 80.1 | 52.5 | 53.7 | 62M |
Installation, Dataset, Training and Evaluation Guide:
- Our work is built on top of X-Decoder and reuses its codebase. We thank the authors of X-Decoder for their open-source repository.
If you find our work useful in your research, please cite:
@inproceedings{jiang2025atbench,
title={@BENCH: Benchmarking Vision-Language Models for Human-centered Assistive Technology},
author={Jiang, Xin and Zheng, Junwei and Liu, Ruiping and Li, Jiahang and Zhang, Jiaming and Matthiesen, Sven and Stiefelhagen, Rainer},
booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
year={2025}
}