Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Vision Transformer

Computer Vision | Introduced 2020 | 2145 papers

Description

The Vision Transformer, or ViT, is a model for image classification that applies a Transformer-like architecture to patches of an image. The image is split into fixed-size patches, each patch is linearly embedded, position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder. To perform classification, an extra learnable “classification token” is added to the sequence, following the standard approach.
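The input pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function names `patchify` and `build_encoder_input` are hypothetical, the image size (224), patch size (16), and embedding width (768) are illustrative values not taken from the description, and the projection, classification token, and position embeddings are random stand-ins for parameters that would be learned in a real model. The Transformer encoder itself is omitted.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    """
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * channels)


def build_encoder_input(image: np.ndarray, patch_size: int, embed_dim: int,
                        rng: np.random.Generator) -> np.ndarray:
    """Build the token sequence that would be fed to the Transformer encoder."""
    patches = patchify(image, patch_size)
    num_patches, patch_dim = patches.shape

    # Assumption: random values stand in for learned parameters.
    projection = rng.normal(scale=0.02, size=(patch_dim, embed_dim))
    class_token = rng.normal(scale=0.02, size=(1, embed_dim))
    position_embeddings = rng.normal(scale=0.02, size=(num_patches + 1, embed_dim))

    tokens = patches @ projection                            # linear patch embedding
    tokens = np.concatenate([class_token, tokens], axis=0)   # prepend classification token
    return tokens + position_embeddings                      # add position embeddings


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))  # illustrative input size
sequence = build_encoder_input(image, patch_size=16, embed_dim=768, rng=rng)
print(sequence.shape)  # (197, 768): 14 * 14 patches plus one classification token
```

With these illustrative sizes, the encoder would receive 197 vectors; in a full model, the encoder output at the classification token's position is typically passed to a classification head.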

Papers Using This Method

DASViT: Differentiable Architecture Search for Vision Transformer (2025-07-17)
Token Compression Meets Compact Vision Transformers: A Survey and Comparative Evaluation for Edge AI (2025-07-13)
Comparative Analysis of Vision Transformers and Traditional Deep Learning Approaches for Automated Pneumonia Detection in Chest X-Rays (2025-07-11)
Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion (2025-07-08)
Tile-Based ViT Inference with Visual-Cluster Priors for Zero-Shot Multi-Species Plant Identification (2025-07-08)
Mamba-FETrack V2: Revisiting State Space Model for Frame-Event based Visual Object Tracking (2025-06-30)
GroundingDINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models (2025-06-30)
Attention to Burstiness: Low-Rank Bilinear Prompt Tuning (2025-06-28)
Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features (2025-06-26)
Distributed Cross-Channel Hierarchical Aggregation for Foundation Models (2025-06-26)
X-SiT: Inherently Interpretable Surface Vision Transformers for Dementia Diagnosis (2025-06-25)
Vision Transformer-Based Time-Series Image Reconstruction for Cloud-Filling Applications (2025-06-24)
Deep CNN Face Matchers Inherently Support Revocable Biometric Templates (2025-06-23)
An Audio-centric Multi-task Learning Framework for Streaming Ads Targeting on Spotify (2025-06-23)
SSAVSV: Towards Unified Model for Self-Supervised Audio-Visual Speaker Verification (2025-06-21)
Exoplanet Classification through Vision Transformers with Temporal Image Analysis (2025-06-19)
DepthSeg: Depth prompting in remote sensing semantic segmentation (2025-06-17)
How Real is CARLAs Dynamic Vision Sensor? A Study on the Sim-to-Real Gap in Traffic Object Detection (2025-06-16)
MultiViT2: A Data-augmented Multimodal Neuroimaging Prediction Framework via Latent Diffusion Model (2025-06-16)
GM-LDM: Latent Diffusion Model for Brain Biomarker Identification through Functional Data-Driven Gray Matter Synthesis (2025-06-15)