Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Vision Transformer

Computer Vision | Introduced 2020 | 2145 papers

Description

The Vision Transformer, or ViT, is an image classification model that applies a Transformer-like architecture to patches of an image. The image is split into fixed-size patches, each patch is linearly embedded, position embeddings are added, and the resulting sequence of vectors is fed to a standard Transformer encoder. To perform classification, an extra learnable “classification token” is added to the sequence, following the standard approach.
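
The input pipeline described above can be sketched in plain Python. This is a minimal illustration, not the reference implementation: the function names, the tiny 8x8 image, the patch size of 4, and the embedding width of 16 are all assumptions chosen for readability. The random values stand in for parameters that would be learned in a real model, and placing the classification token at the front of the sequence is also an assumption of this sketch. The Transformer encoder itself is omitted.

```python
import random


def split_into_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # all channels of one pixel
            patches.append(flat)
    return patches


def random_matrix(rows, cols, rng):
    """Small random values standing in for learned parameters (assumption)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def linear_embed(patches, weight, bias):
    """Project each flattened patch to an embedding vector: patch @ weight + bias."""
    dim = len(bias)
    return [
        [sum(x * weight[i][j] for i, x in enumerate(patch)) + bias[j]
         for j in range(dim)]
        for patch in patches
    ]


def build_encoder_input(image, patch_size, weight, bias, cls_token, pos_embed):
    """Return the token sequence that would be fed to the Transformer encoder."""
    tokens = linear_embed(split_into_patches(image, patch_size), weight, bias)
    sequence = [list(cls_token)] + tokens  # classification token goes first here
    if len(sequence) != len(pos_embed):
        raise ValueError("need one position embedding per token")
    # Position embeddings are added element-wise to every token.
    return [[t + p for t, p in zip(token, pos)]
            for token, pos in zip(sequence, pos_embed)]


# Hypothetical sizes, chosen only to keep the example small.
rng = random.Random(0)
H, W, C = 8, 8, 3   # image height, width, channels
P, D = 4, 16        # patch size, embedding width
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
num_patches = (H // P) * (W // P)

weight = random_matrix(P * P * C, D, rng)
bias = [0.0] * D
cls_token = [rng.gauss(0.0, 0.02) for _ in range(D)]
pos_embed = random_matrix(num_patches + 1, D, rng)

sequence = build_encoder_input(image, P, weight, bias, cls_token, pos_embed)
print(len(sequence), len(sequence[0]))  # 5 16
```

With these sizes the image yields four patches, so the encoder would receive five vectors of width 16: one for the classification token and one per patch.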

Papers Using This Method

DASViT: Differentiable Architecture Search for Vision Transformer (2025-07-17)
Token Compression Meets Compact Vision Transformers: A Survey and Comparative Evaluation for Edge AI (2025-07-13)
Comparative Analysis of Vision Transformers and Traditional Deep Learning Approaches for Automated Pneumonia Detection in Chest X-Rays (2025-07-11)
Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion (2025-07-08)
Tile-Based ViT Inference with Visual-Cluster Priors for Zero-Shot Multi-Species Plant Identification (2025-07-08)
Mamba-FETrack V2: Revisiting State Space Model for Frame-Event based Visual Object Tracking (2025-06-30)
GroundingDINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models (2025-06-30)
Attention to Burstiness: Low-Rank Bilinear Prompt Tuning (2025-06-28)
Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features (2025-06-26)
Distributed Cross-Channel Hierarchical Aggregation for Foundation Models (2025-06-26)
X-SiT: Inherently Interpretable Surface Vision Transformers for Dementia Diagnosis (2025-06-25)
Vision Transformer-Based Time-Series Image Reconstruction for Cloud-Filling Applications (2025-06-24)
Deep CNN Face Matchers Inherently Support Revocable Biometric Templates (2025-06-23)
An Audio-centric Multi-task Learning Framework for Streaming Ads Targeting on Spotify (2025-06-23)
SSAVSV: Towards Unified Model for Self-Supervised Audio-Visual Speaker Verification (2025-06-21)
Exoplanet Classification through Vision Transformers with Temporal Image Analysis (2025-06-19)
DepthSeg: Depth prompting in remote sensing semantic segmentation (2025-06-17)
How Real is CARLAs Dynamic Vision Sensor? A Study on the Sim-to-Real Gap in Traffic Object Detection (2025-06-16)
MultiViT2: A Data-augmented Multimodal Neuroimaging Prediction Framework via Latent Diffusion Model (2025-06-16)
GM-LDM: Latent Diffusion Model for Brain Biomarker Identification through Functional Data-Driven Gray Matter Synthesis (2025-06-15)