Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Feature Fusion Vision Transformer for Fine-Grained Visual Categorization

Jun Wang, Xiaohan Yu, Yongsheng Gao

2021-07-06 | Fine-Grained Visual Categorization | Fine-Grained Image Classification

Abstract

The core for tackling the fine-grained visual categorization (FGVC) is to learn subtle yet discriminative features. Most previous works achieve this by explicitly selecting the discriminative parts or integrating the attention mechanism via CNN-based approaches. However, these methods enhance the computational complexity and make the model dominated by the regions containing the most of the objects. Recently, vision transformer (ViT) has achieved SOTA performance on general image recognition tasks. The self-attention mechanism aggregates and weights the information from all patches to the classification token, making it perfectly suitable for FGVC. Nonetheless, the classification token in the deep layer pays more attention to the global information, lacking the local and low-level features that are essential for FGVC. In this work, we propose a novel pure transformer-based framework Feature Fusion Vision Transformer (FFVT) where we aggregate the important tokens from each transformer layer to compensate the local, low-level and middle-level information. We design a novel token selection module called mutual attention weight selection (MAWS) to guide the network effectively and efficiently towards selecting discriminative tokens without introducing extra parameters. We verify the effectiveness of FFVT on three benchmarks where FFVT achieves the state-of-the-art performance.
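
The abstract does not spell out how MAWS scores tokens, so the following is only a minimal sketch of parameter-free token selection under stated assumptions. It assumes that each layer exposes a head-averaged attention matrix with the classification token at index 0, and that a patch's "mutual" weight is the product of the attention the classification token pays to it and the (renormalised) attention it pays back. The function names `select_tokens` and `fuse_tokens` are hypothetical and are not taken from the official code.

```python
def select_tokens(attn, k):
    """Return indices of the k patch tokens with the highest mutual weight.

    attn: square list of lists holding one layer's head-averaged attention;
          attn[i][j] is how much token i attends to token j, and index 0
          is the classification token (an assumption of this sketch).
    """
    n = len(attn)
    # Attention the classification token pays to each patch.
    cls_to_patch = attn[0][1:]
    # Attention each patch pays to the classification token,
    # renormalised across patches so both directions are comparable.
    patch_to_cls = [attn[i][0] for i in range(1, n)]
    total = sum(patch_to_cls)
    patch_to_cls = [v / total for v in patch_to_cls]
    # Assumed mutual weight: large only when both directions agree.
    mutual = [a * b for a, b in zip(cls_to_patch, patch_to_cls)]
    ranked = sorted(range(len(mutual)), key=lambda i: mutual[i], reverse=True)
    # Shift by one to skip the classification token at index 0.
    return [i + 1 for i in ranked[:k]]


def fuse_tokens(layers_hidden, layers_attn, k):
    """Collect the selected tokens of every layer behind the final class token."""
    fused = [layers_hidden[-1][0]]
    for hidden, attn in zip(layers_hidden, layers_attn):
        fused.extend(hidden[i] for i in select_tokens(attn, k))
    return fused


# Toy attention over one classification token and three patches.
attn = [
    [0.1, 0.6, 0.2, 0.1],
    [0.5, 0.3, 0.1, 0.1],
    [0.1, 0.3, 0.3, 0.3],
    [0.4, 0.2, 0.2, 0.2],
]
print(select_tokens(attn, 2))  # prints [1, 3]
```

The selection itself uses only quantities the transformer already computes, which is consistent with the abstract's claim that no extra parameters are introduced. How the fused tokens are fed back into the network is simplified here and may differ from the paper.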

Results

Task                               Dataset        Metric     Value (%)  Model
Image Classification               CUB-200-2011   Accuracy   91.6       FFVT
Fine-Grained Image Classification  CUB-200-2011   Accuracy   91.6       FFVT

Related Papers

Hierarchical Mask-Enhanced Dual Reconstruction Network for Few-Shot Fine-Grained Image Classification (2025-06-25)
Structural feature enhanced transformer for fine-grained image recognition (2025-06-14)
GPLQ: A General, Practical, and Lightning QAT Method for Vision Transformers (2025-06-13)
Towards Privacy-Preserving Fine-Grained Visual Classification via Hierarchical Learning from Label Proportions (2025-05-29)
DS_FusionNet: Dynamic Dual-Stream Fusion with Bidirectional Knowledge Distillation for Plant Disease Recognition (2025-04-29)
Enhancing Multimodal In-Context Learning for Image Classification through Coreset Optimization (2025-04-19)
Cross-Hierarchical Bidirectional Consistency Learning for Fine-Grained Visual Classification (2025-04-18)
Adaptive Classification of Interval-Valued Time Series (2025-04-04)