End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer

This repo is the official implementation of End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer (arXiv). The paper has been accepted to AAAI 2026.

Introduction

Existing multi-person video pose estimation methods typically adopt a two-stage pipeline: detecting individuals in each frame, followed by temporal modeling for single-person pose estimation. This design relies on heuristic operations such as detection, RoI cropping, and non-maximum suppression (NMS), limiting both accuracy and efficiency. In this paper, we present a fully end-to-end framework for multi-person 2D pose estimation in videos, effectively eliminating heuristic operations. A key challenge is to associate individuals across frames under complex and overlapping temporal trajectories. To address this, we introduce a novel Pose-Aware Video transformEr Network (PAVE-Net), which features a spatial encoder to model intra-frame relations and a spatiotemporal pose decoder to capture global dependencies across frames. To achieve accurate temporal association, we propose a pose-aware attention mechanism that enables each pose query to selectively aggregate features corresponding to the same individual across consecutive frames. Additionally, we explicitly model spatiotemporal dependencies among pose keypoints to improve accuracy. Notably, our approach is the first end-to-end method for multi-frame 2D human pose estimation. Extensive experiments show that PAVE-Net substantially outperforms prior image-based end-to-end methods, achieving a 6.0 mAP improvement on PoseTrack2017, and delivers accuracy competitive with state-of-the-art two-stage video-based approaches, while offering significant gains in efficiency.

Weights Download

The pretrained model weights have been released and are available for download: ResNet-50 and Swin-L.

Quantitative Performance

The following table shows the accuracy advantage of our method over current state-of-the-art end-to-end algorithms based on static images.

Method	Backbone	Head	Shoulder	Elbow	Wrist	Hip	Knee	Ankle	Mean
Image-Based
PETR (2022)	ResNet-50	80.5	80.8	71.3	62.1	73.4	68.5	61.2	71.7
GroupPose (2023)	ResNet-50	82.4	82.1	73.3	64.3	74.4	70.7	63.7	73.6
PETR (2022)	HRNet-W48	82.4	83.2	74.4	70.8	74.5	72.3	66.9	75.4
GroupPose (2023)	HRNet-W48	83.3	84.3	77.8	70.3	75.6	72.8	66.8	76.3
PETR (2022)	Swin-L	83.3	84.3	78.3	71.3	76.4	73.4	67.6	76.8
GroupPose (2023)	Swin-L	83.9	84.7	78.8	70.6	77.5	74.4	68.7	77.4
Video-Based
PAVE-Net (Ours)	ResNet-50	86.5	87.4	78.9	69.3	78.2	73.8	65.8	77.7
PAVE-Net (Ours)	HRNet-W48	87.1	88.4	80.9	73.9	80.3	76.9	69.9	80.1
PAVE-Net (Ours)	Swin-L	88.2	89.1	81.7	74.8	81.6	78.5	71.8	81.3
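To make the margins in the table easier to read, the short sketch below recomputes the gain in Mean mAP of PAVE-Net over each image-based baseline with the same backbone. The numbers are copied from the table above; the dictionary layout and the helper name `gains_over_baselines` are illustrative only and are not part of this repo.

```python
# Mean mAP values copied from the table above, keyed by backbone.
MEAN_MAP = {
    "ResNet-50": {"PETR": 71.7, "GroupPose": 73.6, "PAVE-Net": 77.7},
    "HRNet-W48": {"PETR": 75.4, "GroupPose": 76.3, "PAVE-Net": 80.1},
    "Swin-L": {"PETR": 76.8, "GroupPose": 77.4, "PAVE-Net": 81.3},
}


def gains_over_baselines(table, ours="PAVE-Net"):
    """Return {backbone: {baseline: gain in Mean mAP}} (hypothetical helper)."""
    gains = {}
    for backbone, scores in table.items():
        gains[backbone] = {
            name: round(scores[ours] - score, 1)
            for name, score in scores.items()
            if name != ours
        }
    return gains


if __name__ == "__main__":
    for backbone, row in gains_over_baselines(MEAN_MAP).items():
        print(backbone, row)
```

With the ResNet-50 backbone, the gain over PETR comes out to 6.0 mAP, which matches the improvement quoted in the introduction.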

The following table shows the speed advantage of our method over current state-of-the-art two-stage algorithms, especially when many people are present. Note: all results were obtained on a single A800 GPU, and times are in milliseconds (ms).

	Number of Persons (ms)
Method	1	3	5	10	20
Two-Stage (Top-Down)
DCPose	150	204	292	431	721
DSTA	122	181	265	418	631
End-to-End
PETR	116
GroupPose	89
PAVE-Net (Ours)	153
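As a quick way to see how two-stage latency scales with crowd size, the sketch below computes the ratio between the latency at 20 persons and at 1 person for the two top-down methods. The values are copied from the table above; the helper name `growth_factor` is an illustrative assumption and not part of this repo.

```python
# Latencies in ms copied from the table above (two-stage methods only).
PERSONS = [1, 3, 5, 10, 20]
LATENCY_MS = {
    "DCPose": [150, 204, 292, 431, 721],
    "DSTA": [122, 181, 265, 418, 631],
}


def growth_factor(latencies):
    """Ratio of the latency at the largest person count to that at the smallest."""
    return latencies[-1] / latencies[0]


if __name__ == "__main__":
    for method, ms in LATENCY_MS.items():
        factor = growth_factor(ms)
        print(f"{method}: x{factor:.2f} from {PERSONS[0]} to {PERSONS[-1]} persons")
```

The end-to-end rows list a single value each, so they are left out of this calculation.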

Visualizations

Here are some qualitative results from both the PoseTrack dataset and real-world scenarios:

Video Demo

Usage and Install

To download some auxiliary materials, please refer to DCPose.

Follow the PETR instructions to install mmcv and mmdetection.

Train Data Process

The training data we use is derived from DCPose, which provides labels for 17 keypoints, while the original PoseTrack dataset has only 15. To simplify data processing and model training, we removed the labels of the two keypoints (left ear and right ear) that are not used in testing. Before training with this project, you can either process the data yourself or use the script provided by this project.

python tools/data_process.py
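If you prefer to process the annotations yourself, the following is a minimal sketch of the keypoint-removal step, not the logic of `tools/data_process.py`, which may differ. It assumes each annotation stores keypoints as a flat COCO-style list of (x, y, visibility) triples and that the 17 keypoints follow the PoseTrack2018-style order shown in `KEYPOINTS_17`; both are assumptions that should be checked against your annotation files. The name `drop_ears` is hypothetical.

```python
# Assumed 17-keypoint order (PoseTrack2018-style); verify against your files.
KEYPOINTS_17 = [
    "nose", "head_bottom", "head_top", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]
DROP = {"left_ear", "right_ear"}


def drop_ears(flat_keypoints, names=KEYPOINTS_17, drop=DROP):
    """Remove the (x, y, v) triples of the dropped joints from a flat list."""
    if len(flat_keypoints) != 3 * len(names):
        raise ValueError("expected 3 values per keypoint")
    kept = []
    for i, name in enumerate(names):
        if name not in drop:
            kept.extend(flat_keypoints[3 * i: 3 * i + 3])
    return kept


if __name__ == "__main__":
    example = [float(v) for v in range(51)]  # dummy annotation: 17 * 3 values
    print(len(drop_ears(example)))  # 45 values, i.e. 15 keypoints
```

Filtering by joint name rather than by hard-coded index keeps the sketch usable if your files list the keypoints in a different order: only `KEYPOINTS_17` needs to change.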

Training

python tools/train.py --cfg your_config.yaml

Evaluation

python tools/test.py --cfg your_config.yaml

Video Inference

python demo/video_inference.py --cfg your_config.yaml --checkpoint your_model_path --video your_video_path

Citations

If you find our paper useful in your research, please consider citing:

@inproceedings{yu2025end,
  title={End-to-End Multi-Person Pose Estimation with Pose-Aware Video Transformer},
  author={Yu, Yonghui and Cai, Jiahang and Wang, Xun and Yang, Wenwu},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2026}
}

Acknowledgment

Our code is mainly based on PETR. Parts of it are borrowed from DSTA, DCPose, and RLE. Many thanks to the authors!
