Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Temporal Enhanced Training of Multi-view 3D Object Detector via Historical Object Prediction

Zhuofan Zong, Dongzhi Jiang, Guanglu Song, Zeyue Xue, Jingyong Su, Hongsheng Li, Yu Liu

2023-04-03 · ICCV 2023 · 3D Object Detection

Paper · PDF · Code (official)

Abstract

In this paper, we propose a new paradigm, named Historical Object Prediction (HoP), for multi-view 3D detection to leverage temporal information more effectively. The HoP approach is straightforward: given the current timestamp t, we generate a pseudo Bird's-Eye View (BEV) feature of timestamp t-k from its adjacent frames and utilize this feature to predict the object set at timestamp t-k. Our approach is motivated by the observation that enforcing the detector to capture both the spatial location and temporal motion of objects occurring at historical timestamps can lead to more accurate BEV feature learning. First, we elaborately design short-term and long-term temporal decoders, which can generate the pseudo BEV feature for timestamp t-k without the involvement of its corresponding camera images. Second, an additional object decoder is flexibly attached to predict the object targets using the generated pseudo BEV feature. Note that we only perform HoP during training, so the proposed method does not introduce extra overhead during inference. As a plug-and-play approach, HoP can be easily incorporated into state-of-the-art BEV detection frameworks, including BEVFormer and the BEVDet series. Furthermore, the auxiliary HoP approach is complementary to prevalent temporal modeling methods, leading to significant performance gains. Extensive experiments are conducted to evaluate the effectiveness of the proposed HoP on the nuScenes dataset. We choose the representative methods, including BEVFormer and BEVDet4D-Depth, to evaluate our method. Surprisingly, HoP achieves 68.5% NDS and 62.4% mAP with ViT-L on nuScenes test, outperforming all the 3D object detectors on the leaderboard. Code will be available at https://github.com/Sense-X/HoP.
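The training-time flow described in the abstract can be sketched in a few lines: a pseudo BEV feature for timestamp t-k is fused from the other frames in the history window (never from frame t-k itself), and an auxiliary decoder predicts the historical objects from it. The sketch below is illustrative only: a plain average stands in for the paper's short-term and long-term temporal decoders, and the function names (`pseudo_bev`, `hop_train_step`) are assumptions, not the repository's API.

```python
import numpy as np

def pseudo_bev(bev_history, k):
    # bev_history: per-frame BEV feature maps, oldest first; the last
    # entry corresponds to the current timestamp t. The pseudo feature
    # for t-k is built WITHOUT the real frame t-k feature, forcing the
    # network to infer historical state from neighboring frames.
    idx = len(bev_history) - 1 - k          # position of frame t-k
    neighbors = [f for i, f in enumerate(bev_history) if i != idx]
    return np.mean(neighbors, axis=0)       # placeholder for the temporal decoders

def hop_train_step(bev_history, k, decode, targets_tk, loss_fn):
    # Auxiliary HoP branch: decode objects at t-k from the pseudo BEV
    # feature and score them against the historical ground truth. This
    # branch runs only during training, so inference cost is unchanged.
    preds = decode(pseudo_bev(bev_history, k))
    return loss_fn(preds, targets_tk)
```

At inference time only the main detection branch runs; dropping `hop_train_step` is what makes HoP free of extra overhead at test time.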

Results

Task                | Dataset              | Metric | Value | Model
3D Object Detection | nuScenes Camera Only | NDS    | 68.5  | HoP

Related Papers

Dual LiDAR-Based Traffic Movement Count Estimation at a Signalized Intersection: Deployment, Data Collection, and Preliminary Analysis (2025-07-17)
Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations (2025-07-07)
MambaFusion: Height-Fidelity Dense Global Fusion for Multi-modal 3D Object Detection (2025-07-06)
A Survey of Multi-sensor Fusion Perception for Embodied AI: Background, Methods, Challenges and Prospects (2025-06-24)
Teleoperated Driving: a New Challenge for 3D Object Detection in Compressed Point Clouds (2025-06-13)
Vision-based Lifting of 2D Object Detections for Automated Driving (2025-06-13)
DySS: Dynamic Queries and State-Space Learning for Efficient 3D Object Detection from Multi-Camera Videos (2025-06-11)
Gaussian2Scene: 3D Scene Representation Learning via Self-supervised Learning with 3D Gaussian Splatting (2025-06-10)