Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles

Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer

2023-06-01 · Image Classification · Action Classification · Video Recognition · Instance Segmentation · Action Recognition · Action Recognition in Videos · Object Detection

Paper · PDF · Code (official)

Abstract

Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. In this paper, we argue that this additional bulk is unnecessary. By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer without losing accuracy. In the process, we create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models while being significantly faster both at inference and during training. We evaluate Hiera on a variety of tasks for image and video recognition. Our code and models are available at https://github.com/facebookresearch/hiera.
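The "strong visual pretext task (MAE)" the abstract credits is conceptually simple: hide a large random fraction of the patch tokens and train the model to reconstruct the hidden ones from the visible rest. Below is a minimal pure-Python sketch of just the random-masking step, for illustration only. The function name and signature are hypothetical, not from the Hiera codebase, and the real Hiera masks coarser "mask units" (groups of patches) rather than individual patches so that masking stays compatible with its hierarchical stages.

```python
import random

def random_mask(num_patches, mask_ratio=0.6, seed=0):
    """MAE-style random masking over patch indices (hypothetical helper).

    Returns (keep_idx, mask): keep_idx lists the visible patches fed to
    the encoder; mask[i] is True for hidden patches that the decoder
    must reconstruct from the visible ones.
    """
    rng = random.Random(seed)
    n_keep = int(num_patches * (1 - mask_ratio))
    perm = list(range(num_patches))
    rng.shuffle(perm)
    keep_idx = sorted(perm[:n_keep])
    mask = [True] * num_patches
    for i in keep_idx:
        mask[i] = False
    return keep_idx, mask

# A 14x14 grid of patches, as in a standard ViT at 224px with 16px patches.
keep_idx, mask = random_mask(196, mask_ratio=0.6)
```

Because the encoder only ever sees the kept ~40% of tokens, MAE pretraining is also cheaper per step than supervised training on full images, which is part of why it can replace hand-designed architectural components as a source of inductive bias.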

Results

Task                  | Dataset                | Metric         | Value | Model
----------------------|------------------------|----------------|-------|------------------------
Action Classification | Kinetics-400           | Top-1 Accuracy | 87.8  | Hiera-H (no extra data)
Action Classification | Kinetics-600           | Top-1 Accuracy | 88.8  | Hiera-H (no extra data)
Action Classification | Kinetics-700           | Top-1 Accuracy | 81.1  | Hiera-H (no extra data)
Action Recognition    | Something-Something V2 | Top-1 Accuracy | 76.5  | Hiera-L (no extra data)
Action Recognition    | AVA v2.2               | mAP            | 43.3  | Hiera-H (K700 PT+FT)
Object Detection      | COCO minival           | box AP         | 55    | Hiera-L
Instance Segmentation | COCO minival           | mask AP        | 48.6  | Hiera-L
Image Classification  | iNaturalist            | Top-1 Accuracy | 83.8  | Hiera-H (448px)
Image Classification  | iNaturalist 2019       | Top-1 Accuracy | 88.5  | Hiera-H (448px)
Image Classification  | Places365-Standard     | Top-1 Accuracy | 60.6  | Hiera-H (448px)
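Most rows above report Top-1 Accuracy: the fraction of samples whose highest-scoring class matches the ground-truth label. As a quick reference, here is a minimal pure-Python sketch of that metric (the helper name is hypothetical, not from any evaluation library):

```python
def top1_accuracy(logits, labels):
    """Fraction of samples whose argmax prediction equals the label."""
    correct = sum(
        1 for row, y in zip(logits, labels)
        if max(range(len(row)), key=row.__getitem__) == y
    )
    return correct / len(labels)

# Four samples over three classes; sample 3 is misclassified.
logits = [[2.0, 0.5, 0.1],
          [0.2, 1.5, 0.3],
          [0.9, 0.1, 0.4],
          [0.1, 0.2, 3.0]]
labels = [0, 1, 2, 2]
acc = top1_accuracy(logits, labels)  # 3 of 4 correct -> 0.75
```

The detection and segmentation rows instead use AP (average precision averaged over IoU thresholds, per the COCO protocol), which is computed over ranked detections rather than one prediction per sample.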

Related Papers

Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
Federated Learning for Commercial Image Sources (2025-07-17)
MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
RS-TinyNet: Stage-wise Feature Fusion Network for Detecting Tiny Objects in Remote Sensing Images (2025-07-17)