While significant progress has been made in single-view 3D human pose estimation, multi-view 3D human pose estimation remains challenging, particularly in generalizing to new camera configurations. Existing attention-based transformers often struggle to accurately model the spatial arrangement of keypoints, especially in occluded scenarios, and tend to overfit the specific camera arrangements and visual scenes in their training data, causing substantial performance drops in new settings. In this study, we introduce a novel Multi-View State Space Modeling framework, named MV-SSM, for robustly estimating 3D human keypoints. We explicitly model the joint spatial sequence at two distinct levels: the feature level from multi-view images and the person keypoint level. We propose a Projective State Space (PSS) block to learn a generalized representation of joint spatial arrangements using state space modeling. Moreover, we modify Mamba's traditional scanning into an effective Grid Token-guided Bidirectional Scanning (GTBS), which is integral to the PSS block. Extensive experiments demonstrate that MV-SSM achieves strong generalization, outperforming state-of-the-art methods: +10.8 AP25 (+24%) on the challenging three-camera setting of CMU Panoptic, +7.0 AP25 (+13%) under varying camera arrangements, and +15.3 PCP (+38%) on Campus A1 in cross-dataset evaluations.
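For background, the Mamba-style scanning referenced above builds on the discretized linear state space recurrence (standard SSM background, not the paper's exact formulation):

$$h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t$$

where $\bar{A} = \exp(\Delta A)$ and $\bar{B}$ result from discretizing a continuous SSM with step size $\Delta$; selective SSMs such as Mamba make $\bar{B}$, $C$, and $\Delta$ functions of the input $x_t$, and a scan applies this recurrence along a chosen token ordering.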
MV-SSM is the first framework to adapt visual Mamba to the multi-view multi-person 3D pose estimation task.
Our Projective State Space (PSS) block integrates state space modeling and projection attention to effectively capture joint spatial sequences.
We introduce Grid Token-guided Bidirectional Scanning (GTBS), a modification of Mamba's traditional scanning that is integral to the PSS block (see the sketch after this list).
MV-SSM significantly outperforms SOTA methods on both in-domain 3D keypoint estimation and generalization evaluations.
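To make the bidirectional scanning idea concrete, below is a minimal, self-contained PyTorch sketch of a Mamba-style linear recurrence scanned forward and backward over a token sequence, with a simple additive grid-token conditioning step. All names here (`ssm_scan`, `grid_token_bidirectional_scan`, the additive conditioning) are illustrative assumptions for exposition, not the released MV-SSM implementation.

```python
import torch

def ssm_scan(x, A, B, C):
    """Minimal discretized state-space recurrence over a token sequence:
    h_t = A_t * h_{t-1} + B_t * x_t,  y_t = C_t * h_t.
    x: (T, D) tokens; A, B, C: (T, D, N) per-token parameters
    (diagonal state transition), where N is the state size.
    """
    T, D = x.shape
    N = A.shape[-1]
    h = torch.zeros(D, N)
    ys = []
    for t in range(T):
        h = A[t] * h + B[t] * x[t].unsqueeze(-1)  # update hidden state (D, N)
        ys.append((C[t] * h).sum(-1))             # read out output token (D,)
    return torch.stack(ys)                        # (T, D)

def grid_token_bidirectional_scan(tokens, grid_tokens, params_fwd, params_bwd):
    """Conceptual GTBS sketch: condition the sequence on grid tokens,
    scan it forward and backward, and fuse the two passes."""
    conditioned = tokens + grid_tokens            # hypothetical conditioning
    y_fwd = ssm_scan(conditioned, *params_fwd)
    y_bwd = ssm_scan(conditioned.flip(0), *params_bwd).flip(0)
    return y_fwd + y_bwd

def make_params(T, D, N):
    """Toy per-token SSM parameters; A is kept in (0, 1) for stability."""
    A = torch.sigmoid(torch.randn(T, D, N))
    B = torch.randn(T, D, N) * 0.1
    C = torch.randn(T, D, N) * 0.1
    return A, B, C

# Toy usage: e.g., 15 joint tokens with feature dim 32 and state size 8.
T, D, N = 15, 32, 8
tokens, grid = torch.randn(T, D), torch.randn(T, D)
out = grid_token_bidirectional_scan(tokens, grid, make_params(T, D, N), make_params(T, D, N))
print(out.shape)  # torch.Size([15, 32])
```

Scanning in both directions lets every token's output depend on the whole sequence rather than only on tokens earlier in the scan order, which is the motivation for bidirectional variants of Mamba-style scans.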
We present a visual comparison against MVGFormer on the CMU Panoptic benchmark. Ground-truth human poses are shown in red, with predicted poses overlaid on them for direct comparison. MV-SSM produces accurate poses even in difficult scenarios: in the first row, for example, it predicts the person's left foot more accurately than MVGFormer. Note that person colors differ since we do not perform ID matching.
@InProceedings{Chharia_2025_CVPR,
author = {Chharia, Aviral and Gou, Wenbo and Dong, Haoye},
title = {MV-SSM: Multi-View State Space Modeling for 3D Human Pose Estimation},
booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
month = {June},
year = {2025},
pages = {11590-11599}
}