When Pedestrian Detection Meets Multi-Modal Learning: Generalist Model and Benchmark Dataset

Yi Zhang, Wang Zeng, Sheng Jin, Chen Qian, Ping Luo, Wentao Liu

2024-07-14

Multispectral Object Detection, Pedestrian Detection, 3D Object Detection, Object Detection

Abstract

Recent years have witnessed increasing research attention towards pedestrian detection that takes advantage of different sensor modalities (e.g., RGB, IR, Depth, LiDAR, and Event). However, designing a unified generalist model that can effectively process diverse sensor modalities remains a challenge. This paper introduces MMPedestron, a novel generalist model for multi-modal perception. Unlike previous specialist models that process only one modality or one specific pair of modalities, MMPedestron can process multiple modal inputs and their dynamic combinations. The proposed approach comprises a unified encoder for modal representation and fusion and a general head for pedestrian detection. We introduce two extra learnable tokens, i.e., MAA and MAF, for adaptive multi-modal feature fusion. In addition, we construct the MMPD dataset, the first large-scale benchmark for multi-modal pedestrian detection. This benchmark incorporates existing public datasets and a newly collected dataset called EventPed, covering a wide range of sensor modalities including RGB, IR, Depth, LiDAR, and Event data. With multi-modal joint training, our model achieves state-of-the-art performance on a wide range of pedestrian detection benchmarks, surpassing leading models tailored for a specific sensor modality. For example, it achieves 71.1 AP on COCO-Persons and 72.6 AP on LLVIP. Notably, our model achieves performance comparable to the InternImage-H model on CrowdHuman with 30x fewer parameters. Code and data are available at https://github.com/BubblyYi/MMPedestron.
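The abstract does not spell out how the learnable tokens interact with the modality features, so the following is only a minimal illustrative sketch of the general idea, not MMPedestron's implementation. It assumes that tokens from whichever modalities are present are concatenated with a small set of extra tokens, and that one extra token then pools information through scaled dot-product attention. The names `build_input_sequence` and `attend`, and the use of plain Python lists in place of tensors, are hypothetical choices made for this example.

```python
import math
from typing import Dict, List, Optional

Token = List[float]


def build_input_sequence(
    modality_tokens: Dict[str, Optional[List[Token]]],
    extra_tokens: List[Token],
) -> List[Token]:
    """Prepend the extra tokens to the tokens of every available modality.

    A modality mapped to None is treated as absent and skipped, which is
    one simple way to support dynamic modality combinations (an assumption
    of this sketch, not a detail taken from the paper).
    """
    sequence = list(extra_tokens)
    for name in sorted(modality_tokens):  # fixed order for reproducibility
        tokens = modality_tokens[name]
        if tokens is None:
            continue
        sequence.extend(tokens)
    return sequence


def attend(query: Token, keys: List[Token]) -> Token:
    """Scaled dot-product attention of one query over keys (values = keys)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [
        sum(w * key[d] for w, key in zip(weights, keys))
        for d in range(len(query))
    ]


# Hypothetical toy inputs: two extra tokens and 2-D modality tokens.
extra = [[0.1, 0.0], [0.0, 0.1]]
rgb_only = build_input_sequence({"rgb": [[1.0, 0.0]], "ir": None}, extra)
rgb_ir = build_input_sequence({"rgb": [[1.0, 0.0]], "ir": [[0.0, 1.0]]}, extra)
fused = attend(extra[0], rgb_ir)
```

Because the sequence is built from whatever modalities are supplied, the same function handles an RGB-only input and an RGB+IR pair without any change to the code path.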

Results

Task	Dataset	Metric	Value	Model
Autonomous Vehicles	MMPD-Dataset	box mAP	79	MMPedestron
Autonomous Vehicles	LLVIP	AP	72.6	MMPedestron
Object Detection	CrowdHuman (full body)	AP	97.1	MMPedestron
Object Detection	CrowdHuman (full body)	mMR	30.8	MMPedestron
Object Detection	InOutDoor	AP	65.7	MMPedestron
Object Detection	EventPed	AP	79	MMPedestron
Object Detection	STCrowd	AP	74.9	MMPedestron
Pedestrian Detection	MMPD-Dataset	box mAP	79	MMPedestron
Pedestrian Detection	LLVIP	AP	72.6	MMPedestron
