This repo is the official implementation of "End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer" (arXiv). The paper has been accepted to AAAI 2026.
Existing multi-person video pose estimation methods typically adopt a two-stage pipeline: detecting individuals in each frame, followed by temporal modeling for single-person pose estimation. This design relies on heuristic operations such as detection, RoI cropping, and non-maximum suppression (NMS), limiting both accuracy and efficiency. In this paper, we present a fully end-to-end framework for multi-person 2D pose estimation in videos, effectively eliminating heuristic operations. A key challenge is to associate individuals across frames under complex and overlapping temporal trajectories. To address this, we introduce a novel Pose-Aware Video transformEr Network (PAVE-Net), which features a spatial encoder to model intra-frame relations and a spatiotemporal pose decoder to capture global dependencies across frames. To achieve accurate temporal association, we propose a pose-aware attention mechanism that enables each pose query to selectively aggregate features corresponding to the same individual across consecutive frames. Additionally, we explicitly model spatiotemporal dependencies among pose keypoints to improve accuracy. Notably, our approach is the first end-to-end method for multi-frame 2D human pose estimation. Extensive experiments show that PAVE-Net substantially outperforms prior image-based end-to-end methods, achieving a 6.0 mAP improvement on PoseTrack2017, and delivers accuracy competitive with state-of-the-art two-stage video-based approaches, while offering significant gains in efficiency.
Pretrained model weights are available for download: ResNet-50 and Swin-L.
The following table shows the accuracy advantage of our method over recent end-to-end methods based on static images.
| Method | Backbone | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Mean |
|---|---|---|---|---|---|---|---|---|---|
| Image-Based | |||||||||
| PETR (2022) | ResNet-50 | 80.5 | 80.8 | 71.3 | 62.1 | 73.4 | 68.5 | 61.2 | 71.7 |
| GroupPose (2023) | ResNet-50 | 82.4 | 82.1 | 73.3 | 64.3 | 74.4 | 70.7 | 63.7 | 73.6 |
| PETR (2022) | HRNet-W48 | 82.4 | 83.2 | 74.4 | 70.8 | 74.5 | 72.3 | 66.9 | 75.4 |
| GroupPose (2023) | HRNet-W48 | 83.3 | 84.3 | 77.8 | 70.3 | 75.6 | 72.8 | 66.8 | 76.3 |
| PETR (2022) | Swin-L | 83.3 | 84.3 | 78.3 | 71.3 | 76.4 | 73.4 | 67.6 | 76.8 |
| GroupPose (2023) | Swin-L | 83.9 | 84.7 | 78.8 | 70.6 | 77.5 | 74.4 | 68.7 | 77.4 |
| Video-Based | |||||||||
| PAVE-Net (Ours) | ResNet-50 | 86.5 | 87.4 | 78.9 | 69.3 | 78.2 | 73.8 | 65.8 | 77.7 |
| PAVE-Net (Ours) | HRNet-W48 | 87.1 | 88.4 | 80.9 | 73.9 | 80.3 | 76.9 | 69.9 | 80.1 |
| PAVE-Net (Ours) | Swin-L | 88.2 | 89.1 | 81.7 | 74.8 | 81.6 | 78.5 | 71.8 | 81.3 |
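
The gains in the Mean column can be checked with a few lines of arithmetic. The sketch below simply copies the Mean values from the table above and subtracts each image-based baseline from PAVE-Net for the same backbone. The dictionary layout and the helper name `gains_over_baselines` are illustrative only and are not part of this repository.

```python
# Mean mAP values copied from the table above, grouped by backbone.
mean_ap = {
    "ResNet-50": {"PETR": 71.7, "GroupPose": 73.6, "PAVE-Net": 77.7},
    "HRNet-W48": {"PETR": 75.4, "GroupPose": 76.3, "PAVE-Net": 80.1},
    "Swin-L": {"PETR": 76.8, "GroupPose": 77.4, "PAVE-Net": 81.3},
}


def gains_over_baselines(scores):
    """Return PAVE-Net's Mean mAP gain over each baseline (hypothetical helper)."""
    ours = scores["PAVE-Net"]
    return {
        method: round(ours - value, 1)
        for method, value in scores.items()
        if method != "PAVE-Net"
    }


gains = {backbone: gains_over_baselines(s) for backbone, s in mean_ap.items()}
for backbone, per_method in gains.items():
    print(backbone, per_method)
```

With the ResNet-50 backbone, the gain over PETR comes out to 6.0 mAP, which matches the improvement quoted in the abstract.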
The following table shows the speed advantage of our method over recent two-stage methods, especially when many people are present. Note: all results were measured on a single A800 GPU, and times are in milliseconds (ms).
| Method | 1 person | 3 persons | 5 persons | 10 persons | 20 persons |
|---|---|---|---|---|---|
| Two-Stage (Top-Down) | |||||
| DCPose | 150 | 204 | 292 | 431 | 721 |
| DSTA | 122 | 181 | 265 | 418 | 631 |
| End-to-End | |||||
| PETR | 116 | ||||
| GroupPose | 89 | ||||
| PAVE-Net (Ours) | 153 | ||||
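
To make the comparison concrete, the sketch below finds the smallest listed person count at which PAVE-Net is faster than each two-stage method, and the speed ratio at 20 persons. It assumes that the single value listed for PAVE-Net (153 ms) applies to every person count; the table lists only one number per end-to-end method, so treat this as an assumption. The helper name `first_faster_count` is illustrative.

```python
# Latencies in ms copied from the table above.
persons = [1, 3, 5, 10, 20]
two_stage_ms = {
    "DCPose": [150, 204, 292, 431, 721],
    "DSTA": [122, 181, 265, 418, 631],
}
# Assumption: the single PAVE-Net value holds for all person counts.
PAVE_NET_MS = 153


def first_faster_count(latencies, reference_ms):
    """Smallest listed person count where reference_ms beats the method, else None."""
    for count, ms in zip(persons, latencies):
        if reference_ms < ms:
            return count
    return None


crossover = {
    name: first_faster_count(ms, PAVE_NET_MS) for name, ms in two_stage_ms.items()
}
speedup_at_20 = {
    name: round(ms[-1] / PAVE_NET_MS, 2) for name, ms in two_stage_ms.items()
}
print(crossover)
print(speedup_at_20)
```

Under that assumption, both two-stage methods are faster for a single person, and PAVE-Net is faster from 3 persons onward.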
Here are some qualitative results from both the PoseTrack dataset and real-world scenarios:
To download some auxiliary materials, please refer to DCPose.
Follow the PETR installation instructions to install mmcv and mmdetection.
Our training data is derived from DCPose, which provides labels for 17 keypoints, whereas the original PoseTrack dataset has only 15 keypoints. To simplify data processing and model training, we removed the labels of the two keypoints (left ear and right ear) that are not used in testing. Before training with this project, you can either process the data yourself or use the script provided in this repository:
python tools/data_process.py
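If you prefer to process the annotations yourself, the ear-removal step can be sketched roughly as follows. This is a minimal illustration, not the code in `tools/data_process.py`. It assumes each person's keypoints are stored as a flat list of 17 `[x, y, visibility]` triplets in the order given by `KEYPOINT_NAMES_17`; both the order and the function name `drop_ear_keypoints` are assumptions that should be checked against your annotation files.

```python
from typing import List

# Assumed 17-keypoint order; verify against your own annotation files.
KEYPOINT_NAMES_17 = [
    "nose", "head_bottom", "head_top", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]
DROPPED = {"left_ear", "right_ear"}


def drop_ear_keypoints(keypoints: List[float]) -> List[float]:
    """Remove the two ear triplets from a flat 17 * 3 list, leaving 15 * 3 values."""
    expected = 3 * len(KEYPOINT_NAMES_17)
    if len(keypoints) != expected:
        raise ValueError(f"expected {expected} values, got {len(keypoints)}")
    kept: List[float] = []
    for index, name in enumerate(KEYPOINT_NAMES_17):
        if name not in DROPPED:
            # Copy the (x, y, visibility) triplet of every keypoint we keep.
            kept.extend(keypoints[3 * index: 3 * index + 3])
    return kept


# Usage with dummy values 0..50 so each triplet is easy to trace.
example = [float(i) for i in range(51)]
reduced = drop_ear_keypoints(example)
print(len(reduced))
```

Filtering by name rather than by hard-coded indices keeps the sketch easy to adapt if your annotation order differs.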
python tools/train.py --cfg your_config.yaml
python tools/test.py --cfg your_config.yaml
python demo/video_inference.py --cfg your_config.yaml --checkpoint your_model_path --video your_video_path
If you find our paper useful in your research, please consider citing:
@inproceedings{yu2025end,
title={End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer},
author={Yu, Yonghui and Cai, Jiahang and Wang, Xun and Yang, Wenwu},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
year={2026}
}
Our code is mainly based on PETR. Parts of it are borrowed from DSTA, DCPose, and RLE. Many thanks to the authors!




