TBP-Former: Learning Temporal Bird's-Eye-View Pyramid for Joint Perception and Prediction in Vision-Centric Autonomous Driving

Shaoheng Fang, Zi Wang, Yiqi Zhong, Junhao Ge, Siheng Chen, Yanfeng Wang

2023-03-17 · CVPR 2023 · Autonomous Driving · Bird's-Eye View Semantic Segmentation

Abstract

Vision-centric joint perception and prediction (PnP) is an emerging trend in autonomous driving research. It predicts the future states of the traffic participants in the surrounding environment from raw RGB images. However, synchronizing features obtained from multiple camera views and timestamps remains a critical challenge because of inevitable geometric distortions, as does exploiting those spatial-temporal features further. To address this issue, we propose a temporal bird's-eye-view pyramid transformer (TBP-Former) for vision-centric PnP, which includes two novel designs. First, a pose-synchronized BEV encoder maps raw image inputs with any camera pose at any time to a shared and synchronized BEV space for better spatial-temporal synchronization. Second, a spatial-temporal pyramid transformer comprehensively extracts multi-scale BEV features and predicts future BEV states with the support of spatial-temporal priors. Extensive experiments on the nuScenes dataset show that the proposed framework overall outperforms all state-of-the-art vision-based prediction methods.

Results

Task | Dataset | Metric | Value | Model
Semantic Segmentation | nuScenes | IoU ped - 224x480 - Vis filter. - 100x100 at 0.5 | 18.6 | TBP-Former
Semantic Segmentation | nuScenes | IoU ped - 224x480 - Vis filter. - 100x100 at 0.5 | 17.2 | TBP-Former (static)
10-shot image generation | nuScenes | IoU ped - 224x480 - Vis filter. - 100x100 at 0.5 | 18.6 | TBP-Former
10-shot image generation | nuScenes | IoU ped - 224x480 - Vis filter. - 100x100 at 0.5 | 17.2 | TBP-Former (static)
Bird's-Eye View Semantic Segmentation | nuScenes | IoU ped - 224x480 - Vis filter. - 100x100 at 0.5 | 18.6 | TBP-Former
Bird's-Eye View Semantic Segmentation | nuScenes | IoU ped - 224x480 - Vis filter. - 100x100 at 0.5 | 17.2 | TBP-Former (static)
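For readers unfamiliar with the metric, the following is a minimal sketch of how a pedestrian IoU on a BEV grid could be computed. It assumes that "100x100 at 0.5" denotes a 100 m x 100 m region rasterized at 0.5 m per cell, a common setting in nuScenes BEV benchmarks; the page itself does not spell this out. The function names, the toy masks, and the empty-union convention are hypothetical and are not taken from the TBP-Former code.

```python
def bev_grid_size(extent_m=100.0, resolution_m=0.5):
    """Cells per side of a square BEV grid (assumed reading of '100x100 at 0.5')."""
    return int(round(extent_m / resolution_m))


def binary_iou(pred, target):
    """IoU in percent for two equally sized binary masks given as lists of rows."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 100.0 if union == 0 else 100.0 * intersection / union


size = bev_grid_size()  # 200 cells per side under the assumption above
target = [[False] * size for _ in range(size)]
pred = [[False] * size for _ in range(size)]

# Hypothetical toy masks: a 4x4 ground-truth blob and a prediction
# of the same shape shifted two cells to the right.
for r in range(10, 14):
    for c in range(10, 14):
        target[r][c] = True
        pred[r][c + 2] = True

iou = binary_iou(pred, target)
print(f"grid: {size}x{size}, pedestrian IoU: {iou:.1f}")
```

In the toy case the two blobs overlap in 8 cells out of a union of 24, which shows how quickly IoU drops for small objects such as pedestrians when the prediction is offset by only a cell or two.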

Related Papers

GEMINUS: Dual-aware Global and Scene-Adaptive Mixture-of-Experts for End-to-End Autonomous Driving (2025-07-19)
AGENTS-LLM: Augmentative GENeration of Challenging Traffic Scenarios with an Agentic LLM Framework (2025-07-18)
World Model-Based End-to-End Scene Generation for Accident Anticipation in Autonomous Driving (2025-07-17)
Orbis: Overcoming Challenges of Long-Horizon Prediction in Driving World Models (2025-07-17)
Channel-wise Motion Features for Efficient Motion Segmentation (2025-07-17)
LaViPlan : Language-Guided Visual Path Planning with RLVR (2025-07-17)
Safeguarding Federated Learning-based Road Condition Classification (2025-07-16)
Towards Autonomous Riding: A Review of Perception, Planning, and Control in Intelligent Two-Wheelers (2025-07-16)