Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles

Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer

2023-06-01

Tasks: Image Classification, Action Classification, Video Recognition, Instance Segmentation, Action Recognition, Action Recognition In Videos, Object Detection

Abstract

Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. In this paper, we argue that this additional bulk is unnecessary. By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer without losing accuracy. In the process, we create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models while being significantly faster both at inference and during training. We evaluate Hiera on a variety of tasks for image and video recognition. Our code and models are available at https://github.com/facebookresearch/hiera.

Results

Task	Dataset	Metric	Value	Model
Video	Kinetics-700	Top-1 Accuracy	81.1	Hiera-H (no extra data)
Video	Kinetics-400	Top-1 Accuracy	87.8	Hiera-H (no extra data)
Video	Kinetics-600	Top-1 Accuracy	88.8	Hiera-H (no extra data)
Activity Recognition	Something-Something V2	Top-1 Accuracy	76.5	Hiera-L (no extra data)
Activity Recognition	AVA v2.2	mAP	43.3	Hiera-H (K700 PT+FT)
Object Detection	COCO minival	box AP	55	Hiera-L
Image Classification	iNaturalist	Top-1 Accuracy	83.8	Hiera-H (448px)
Image Classification	Places365-Standard	Top-1 Accuracy	60.6	Hiera-H (448px)
Image Classification	iNaturalist 2019	Top-1 Accuracy	88.5	Hiera-H (448px)
Instance Segmentation	COCO minival	mask AP	48.6	Hiera-L