Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Hulk: A Universal Knowledge Translator for Human-Centric Tasks

Yizhou Wang, Yixuan Wu, Shixiang Tang, Weizhen He, Xun Guo, Feng Zhu, Lei Bai, Rui Zhao, Jian Wu, Tong He, Wanli Ouyang

2023-12-04 · 3D Human Pose Estimation · Pedestrian Attribute Recognition · Skeleton Based Action Recognition · Human Part Segmentation · Semantic Segmentation · Pose Estimation · Human Mesh Recovery · Pedestrian Detection · Action Recognition · Object Detection · Pedestrian Image Caption
Paper · PDF · Code (official)

Abstract

Human-centric perception tasks, e.g., pedestrian detection, skeleton-based action recognition, and pose estimation, have wide industrial applications, such as the metaverse and sports analysis. There has been a recent surge in developing human-centric foundation models that can benefit a broad range of human-centric perception tasks. While many human-centric foundation models have achieved success, they did not explore 3D and vision-language tasks for human-centric perception and required task-specific finetuning. These limitations restrict their application to more downstream tasks and situations. To tackle these problems, we present Hulk, the first multimodal human-centric generalist model, capable of addressing 2D vision, 3D vision, skeleton-based, and vision-language tasks without task-specific finetuning. The key to achieving this is condensing various task-specific heads into two general heads, one for discrete representations, e.g., languages, and the other for continuous representations, e.g., location coordinates. The outputs of the two heads can be further stacked into four distinct input and output modalities. This uniform representation enables Hulk to treat diverse human-centric tasks as modality translation, integrating knowledge across a wide range of tasks. Comprehensive evaluations of Hulk on 12 benchmarks covering 8 human-centric tasks demonstrate the superiority of the proposed method, achieving state-of-the-art performance on 11 of the benchmarks. The code is available at https://github.com/OpenGVLab/Hulk.
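The two-general-heads idea from the abstract can be illustrated with a minimal sketch. The names below (`DiscreteHead`, `ContinuousHead`, `feats`) are hypothetical and not taken from the official code; the point is only that one head emits logits over a discrete vocabulary (language-like outputs such as attribute labels or captions) while the other regresses continuous values (such as joint coordinates or box corners), so diverse tasks reduce to translating between a small set of output modalities.

```python
import numpy as np

rng = np.random.default_rng(0)

class DiscreteHead:
    """Maps shared backbone features to logits over a token vocabulary
    (for discrete, language-like outputs). Illustrative only."""
    def __init__(self, dim, vocab_size):
        self.W = rng.normal(0.0, 0.02, (dim, vocab_size))

    def __call__(self, feats):          # feats: (seq_len, dim)
        return feats @ self.W           # (seq_len, vocab_size) logits

class ContinuousHead:
    """Regresses real-valued vectors from the same shared features
    (for continuous outputs such as coordinates). Illustrative only."""
    def __init__(self, dim, out_dim):
        self.W = rng.normal(0.0, 0.02, (dim, out_dim))

    def __call__(self, feats):
        return feats @ self.W           # (seq_len, out_dim)

# One shared feature sequence, e.g. one token per body joint,
# routed through either head depending on the task's output modality.
feats = rng.normal(size=(17, 256))
logits = DiscreteHead(256, 30000)(feats)   # discrete branch
coords = ContinuousHead(256, 3)(feats)     # continuous branch: 3D joints
```

The design choice this illustrates is that tasks share a backbone and differ only in which of the two heads decodes the output, rather than each task carrying its own specialized head.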

Results

Task | Dataset | Metric | Value | Model
Pedestrian Attribute Recognition | PA-100K | Accuracy | 88.97 | Hulk (Finetune, ViT-L)
Pedestrian Attribute Recognition | PA-100K | Accuracy | 87.85 | Hulk (Finetune, ViT-B)
Pedestrian Attribute Recognition | RAPv2 | Accuracy | 85.86 | Hulk (Finetune, ViT-L)
Pedestrian Attribute Recognition | RAPv2 | Accuracy | 85.26 | Hulk (Finetune, ViT-B)
3D Human Pose Estimation | 3DPW | MPJPE | 66.3 | Hulk (ViT-L)
3D Human Pose Estimation | 3DPW | MPVPE | 77.4 | Hulk (ViT-L)
3D Human Pose Estimation | 3DPW | PA-MPJPE | 38.5 | Hulk (ViT-L)
3D Human Pose Estimation | 3DPW | MPJPE | 67.0 | Hulk (ViT-B)
3D Human Pose Estimation | 3DPW | MPVPE | 79.8 | Hulk (ViT-B)
3D Human Pose Estimation | 3DPW | PA-MPJPE | 39.9 | Hulk (ViT-B)
Skeleton Based Action Recognition | NTU RGB+D | Accuracy (CS) | 94.3 | Hulk (Finetune, ViT-L)
Skeleton Based Action Recognition | NTU RGB+D | Accuracy (CS) | 94.0 | Hulk (Finetune, ViT-B)
Pose Estimation | COCO (Common Objects in Context) | AP | 78.7 | Hulk (Finetune, ViT-L)
Pose Estimation | COCO (Common Objects in Context) | AP | 77.5 | Hulk (Finetune, ViT-B)
Pose Estimation | AIC | AP | 37.1 | Hulk (Finetune, ViT-L)
Pose Estimation | AIC | AP | 35.6 | Hulk (Finetune, ViT-B)
Human Part Segmentation | Human3.6M | mIoU | 69.89 | Hulk (Finetune, ViT-L)
Human Part Segmentation | Human3.6M | mIoU | 68.56 | Hulk (Finetune, ViT-B)
Human Part Segmentation | CIHP | mIoU | 72.68 | Hulk (Finetune, ViT-L)
Human Part Segmentation | CIHP | mIoU | 71.26 | Hulk (Finetune, ViT-B)
Pedestrian Detection | CrowdHuman (full body) | AP | 93.0 | Hulk (Finetune, ViT-L)
Pedestrian Detection | CrowdHuman (full body) | mMR | 36.5 | Hulk (Finetune, ViT-L)
Pedestrian Detection | CrowdHuman (full body) | AP | 92.4 | Hulk (Finetune, ViT-B)
Pedestrian Detection | CrowdHuman (full body) | mMR | 40.7 | Hulk (Finetune, ViT-B)
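For readers unfamiliar with the 3DPW metrics in the results: MPJPE is the mean Euclidean distance between predicted and ground-truth joints (typically in mm), and PA-MPJPE computes the same error after a rigid Procrustes alignment (optimal scale, rotation, and translation) of the prediction to the ground truth. A minimal NumPy sketch of the standard definitions (function names are my own, not from the Hulk codebase):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance
    between predicted and ground-truth joints. pred, gt: (J, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """Procrustes-Aligned MPJPE: rigidly align the prediction
    (scale, rotation, translation) to the ground truth, then score."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g             # center both point sets
    # Optimal rotation via SVD of the cross-covariance (orthogonal Procrustes)
    U, s, Vt = np.linalg.svd(p.T @ g)
    if np.linalg.det(U @ Vt) < 0:             # avoid improper rotation
        U[:, -1] *= -1
        s[-1] *= -1
    R = U @ Vt
    scale = s.sum() / (p ** 2).sum()          # optimal isotropic scale
    aligned = scale * p @ R + mu_g
    return mpjpe(aligned, gt)
```

Because PA-MPJPE discards global pose and scale errors, it is always at most the MPJPE for the same prediction, which matches the table (e.g., 38.5 vs. 66.3 for ViT-L).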

Related Papers

- SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction (2025-07-21)
- DiffOSeg: Omni Medical Image Segmentation via Multi-Expert Collaboration Diffusion Model (2025-07-17)
- SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
- Unified Medical Image Segmentation with State Space Modeling Snake (2025-07-17)
- A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
- $π^3$: Scalable Permutation-Equivariant Visual Geometry Learning (2025-07-17)
- Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark (2025-07-17)
- DINO-VO: A Feature-based Visual Odometry Leveraging a Visual Foundation Model (2025-07-17)