Description
Multiscale Vision Transformer, or MViT, is a transformer architecture for modeling visual data such as images and videos. Unlike conventional transformers, which maintain a constant channel capacity and resolution throughout the network, Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features: early layers operate at high spatial resolution to model simple low-level visual information, while deeper layers operate on spatially coarse but complex, high-dimensional features.
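The stage-wise trade-off between resolution and channel capacity can be made concrete with a small sketch. The code below only tabulates feature-map shapes; it is not an implementation of MViT. The helper `multiscale_schedule`, the `Stage` record, the starting shape (56x56 with 96 channels), the four stages, and the factor-of-2 scaling per stage are all illustrative assumptions, not values stated in the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """Shape of the feature map entering one scale stage (hypothetical record)."""
    index: int
    height: int
    width: int
    channels: int

    @property
    def tokens(self) -> int:
        # Number of spatial positions (sequence length) at this stage.
        return self.height * self.width


def multiscale_schedule(height, width, channels, num_stages,
                        channel_mult=2, downsample=2):
    """Tabulate a channel-resolution pyramid.

    Assumption: every stage transition multiplies the channel dimension by
    `channel_mult` and divides each spatial side by `downsample`.
    """
    stages = []
    for i in range(num_stages):
        stages.append(Stage(i + 1, height, width, channels))
        height = max(1, height // downsample)
        width = max(1, width // downsample)
        channels *= channel_mult
    return stages


# Illustrative starting point: high resolution, small channel dimension.
stages = multiscale_schedule(height=56, width=56, channels=96, num_stages=4)
for s in stages:
    print(f"stage {s.index}: {s.height}x{s.width} spatial, "
          f"{s.channels} channels, {s.tokens} tokens")
```

Under these assumptions the token count falls by 4x at each transition while the channel dimension doubles, so the early stages carry many low-dimensional tokens and the late stages carry few high-dimensional ones.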
Papers Using This Method
ROI-Aware Multiscale Cross-Attention Vision Transformer for Pest Image Identification (2023-12-28)
PaReprop: Fast Parallelized Reversible Backpropagation (2023-06-15)
SVT: Supertoken Video Transformer for Efficient Video Understanding (2023-04-01)
Multi-Channel Vision Transformer for Epileptic Seizure Prediction (2022-06-29)
Benchmarking Conventional Vision Models on Neuromorphic Fall Detection and Action Recognition Dataset (2022-01-28)
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection (2021-12-02)
Efficient Video Transformers with Spatial-Temporal Token Selection (2021-11-23)
Class-agnostic Object Detection with Multi-modal Transformer (2021-11-22)
Multiscale Vision Transformers (2021-04-22)