Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles

Md Shamim Hussain, Mohammed J. Zaki, Dharmashankar Subramanian

2023-06-02 · Image Classification · Graph Regression · Graph Learning · Image Generation · Language Modelling

Links: Paper · PDF · Code (official)

Abstract

Transformers use a dense self-attention mechanism that affords great flexibility for long-range connectivity. Over the multiple layers of a deep transformer, the number of possible connectivity patterns grows exponentially. However, very few of these patterns contribute to the network's performance, and even fewer are essential. We hypothesize that there are sparsely connected sub-networks within a transformer, called information pathways, which can be trained independently. The dynamic (i.e., input-dependent) nature of these pathways makes it difficult to prune dense self-attention during training, but their overall distribution is often predictable. We take advantage of this fact to propose Stochastically Subsampled self-Attention (SSA), a general-purpose training strategy for transformers that can reduce both the memory and computational cost of self-attention by 4 to 8 times during training, while also serving as a regularization method that improves generalization over dense training. We show that an ensemble of sub-models can be formed from the subsampled pathways within a network, and that this ensemble can achieve better performance than its densely attended counterpart. We perform experiments on a variety of NLP, computer vision, and graph learning tasks, in both generative and discriminative settings, to provide empirical evidence for our claims and show the effectiveness of the proposed method.
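The core idea from the abstract, subsampling the attended positions at random during training, can be sketched in a few lines. This is a hedged illustration, not the paper's actual algorithm: the function name, the `keep_ratio` parameter, and the uniform sampling scheme are assumptions here. Since the abstract notes that the pathway distribution is often predictable, the real SSA may well sample in a structured, non-uniform way.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def subsampled_attention(q, k, v, keep_ratio=0.25, training=True, rng=None):
    # During training, attend only to a random subset of the key/value
    # positions, cutting the attention cost by roughly 1 / keep_ratio.
    # At inference time (training=False), fall back to dense attention.
    rng = rng if rng is not None else np.random.default_rng()
    if training and keep_ratio < 1.0:
        n_keys = k.shape[0]
        n_keep = max(1, int(n_keys * keep_ratio))
        idx = rng.choice(n_keys, size=n_keep, replace=False)
        k, v = k[idx], v[idx]
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (n_queries, n_kept_keys)
    return softmax(scores, axis=-1) @ v      # (n_queries, d_model)
```

With `keep_ratio=0.25` the score matrix shrinks fourfold, which matches the order of the 4x to 8x savings the abstract claims, though the output for each query remains a full `d_model`-dimensional vector.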

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Language Modelling | WikiText-103 | Test perplexity | 17.18 | Transformer+SSA+Self-ensemble |
| Language Modelling | WikiText-103 | Validation perplexity | 16.54 | Transformer+SSA+Self-ensemble |
| Language Modelling | WikiText-103 | Test perplexity | 17.6 | Transformer+SSA |
| Language Modelling | WikiText-103 | Validation perplexity | 16.91 | Transformer+SSA |
| Language Modelling | enwik8 | Bit per Character (BPC) | 1.024 | Transformer+SSA |
| Graph Regression | PCQM4Mv2-LSC | Validation MAE | 0.0865 | EGT+SSA+Self-ensemble |
| Graph Regression | PCQM4Mv2-LSC | Validation MAE | 0.0876 | EGT+SSA |
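The "+Self-ensemble" rows improve on their single-pass "+SSA" counterparts by averaging predictions over several stochastic subsampled passes of the same trained network. A minimal sketch of that inference-time averaging, assuming a `stochastic_forward` callable (a hypothetical name for one subsampled forward pass of the model):

```python
import numpy as np

def self_ensemble_predict(stochastic_forward, x, num_passes=4):
    # Run several stochastic subsampled forward passes of one trained
    # model and average their outputs, forming an ensemble of
    # sub-models without training any extra networks.
    outputs = [stochastic_forward(x) for _ in range(num_passes)]
    return np.mean(outputs, axis=0)
```

Each pass samples a different subset of attention pathways, so the average behaves like an ensemble of distinct sub-models drawn from the same weights; `num_passes` trades inference cost against the ensemble benefit.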

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- Automatic Classification and Segmentation of Tunnel Cracks Based on Deep Learning and Visual Explanations (2025-07-18)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy (2025-07-17)
- Federated Learning for Commercial Image Sources (2025-07-17)
- MUPAX: Multidimensional Problem Agnostic eXplainable AI (2025-07-17)
- SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
- fastWDM3D: Fast and Accurate 3D Healthy Tissue Inpainting (2025-07-17)