XCiT: Cross-Covariance Image Transformers

Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, Hervé Jegou

2021-06-17, NeurIPS 2021

Tasks: Self-Supervised Image Classification, Image Classification, Semantic Segmentation, Instance Segmentation, Object Detection


Abstract

Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens, i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries. The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images. Our cross-covariance image transformer (XCiT) is built upon XCA. It combines the accuracy of conventional transformers with the scalability of convolutional architectures. We validate the effectiveness and generality of XCiT by reporting excellent results on multiple vision benchmarks, including image classification and self-supervised feature learning on ImageNet-1k, object detection and instance segmentation on COCO, and semantic segmentation on ADE20k.
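To make the "transposed" attention concrete, here is a minimal single-head NumPy sketch that contrasts token self-attention (an N x N map) with cross-covariance attention (a d x d map). It is an illustration under stated assumptions, not the official implementation: the function names, the L2 normalisation of queries and keys along the token axis, the `temperature` parameter, and the softmax axis are choices made for this sketch and are not specified in the abstract.

```python
import numpy as np


def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def token_attention(q, k, v):
    # Conventional self-attention: the map is N x N, so cost grows
    # quadratically with the number of tokens N.
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (N, N)
    return attn @ v, attn


def cross_covariance_attention(q, k, v, temperature=1.0):
    # q, k, v have shape (N, d): N tokens, d feature channels.
    # Assumption: each channel (column) is L2-normalised over tokens so
    # the d x d cross-covariance entries stay bounded.
    eps = 1e-12
    q_hat = q / (np.linalg.norm(q, axis=0, keepdims=True) + eps)
    k_hat = k / (np.linalg.norm(k, axis=0, keepdims=True) + eps)
    # The map is d x d; forming it costs O(N * d^2), linear in N.
    # Assumption: softmax over axis 0, so each output channel is a
    # convex combination of the value channels.
    attn = softmax(k_hat.T @ q_hat / temperature, axis=0)  # (d, d)
    return v @ attn, attn  # output shape (N, d)


rng = np.random.default_rng(0)
N, d = 196, 16  # hypothetical sizes: 14 x 14 patches, 16 channels
q, k, v = (rng.standard_normal((N, d)) for _ in range(3))

out_xca, attn_xca = cross_covariance_attention(q, k, v)
out_tok, attn_tok = token_attention(q, k, v)
print(attn_tok.shape, attn_xca.shape)
```

With these sizes, doubling the image side length quadruples N: the token attention map then grows 16-fold, while the cross-covariance map stays d x d and only the cost of forming it grows 4-fold.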

Results

Task	Dataset	Metric	Value	Model
Semantic Segmentation	ADE20K	Validation mIoU	48.4	XCiT-M24/8 (UperNet)
Semantic Segmentation	ADE20K	Validation mIoU	48.1	XCiT-S24/8 (UperNet)
Semantic Segmentation	ADE20K	Validation mIoU	47.1	XCiT-S24/8 (Semantic-FPN)
Semantic Segmentation	ADE20K	Validation mIoU	46.9	XCiT-M24/8 (Semantic-FPN)
Semantic Segmentation	ADE20K	Validation mIoU	46.6	XCiT-S12/8 (UperNet)
Semantic Segmentation	ADE20K	Validation mIoU	44.2	XCiT-S12/8 (Semantic-FPN)
Object Detection	COCO minival	box AP	48.5	XCiT-M24/8
Object Detection	COCO minival	box AP	48.1	XCiT-S24/8
Image Classification	ImageNet	GFLOPs	417.9	XCiT-L24
Image Classification	ImageNet	GFLOPs	188	XCiT-M24
Image Classification	ImageNet	GFLOPs	106	XCiT-S24
Image Classification	ImageNet	GFLOPs	55.6	XCiT-S12
Instance Segmentation	COCO minival	mask AP	43.7	XCiT-M24/8
Instance Segmentation	COCO minival	mask AP	43	XCiT-S24/8
