Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SwiGLU

General · Introduced 2020 · 13 papers
Source Paper: GLU Variants Improve Transformer (2020)

Description

SwiGLU is an activation function, a variant of the Gated Linear Unit (GLU) that replaces the sigmoid gate with the Swish function, Swish_β(x) = x · σ(βx). It is defined as:

SwiGLU(x, W, V, b, c, β) = Swish_β(xW + b) ⊗ (xV + c)

where ⊗ denotes elementwise multiplication, W and V are weight matrices, and b and c are bias vectors.
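The definition above can be sketched directly in NumPy. This is a minimal illustration of the formula, not the implementation from the source paper; the dimensions and initializations are arbitrary assumptions for the example.

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish_beta(x) = x * sigmoid(beta * x)
    return x * (1.0 / (1.0 + np.exp(-beta * x)))

def swiglu(x, W, V, b, c, beta=1.0):
    # SwiGLU(x, W, V, b, c, beta) = Swish_beta(xW + b) ⊗ (xV + c),
    # where ⊗ is the elementwise (Hadamard) product.
    return swish(x @ W + b, beta) * (x @ V + c)

# Toy example (shapes are illustrative, not from the paper):
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4))               # batch of 2, input dim 4
W = rng.standard_normal((4, 8)); b = np.zeros(8)
V = rng.standard_normal((4, 8)); c = np.zeros(8)
out = swiglu(x, W, V, b, c)
print(out.shape)  # (2, 8)
```

In Transformer feed-forward layers, the SwiGLU output is typically followed by a third projection back to the model dimension; with β = 1, Swish reduces to the SiLU activation.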

Papers Using This Method

TokenFLEX: Unified VLM Training for Flexible Visual Tokens Inference (2025-04-04)
DiffFormer: a Differential Spatial-Spectral Transformer for Hyperspectral Image Classification (2024-12-23)
Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking (2024-12-02)
Deriving Activation Functions Using Integration (2024-11-20)
Scaling FP8 training to trillion-token LLMs (2024-09-19)
How Lightweight Can A Vision Transformer Be (2024-07-25)
Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters (2024-06-10)
ReLU$^2$ Wins: Discovering Efficient Activation Functions for Sparse LLMs (2024-02-06)
BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model (2023-09-20)
SlimPajama-DC: Understanding Data Combinations for LLM Training (2023-09-19)
Llama 2: Open Foundation and Fine-Tuned Chat Models (2023-07-18)
PaLM: Scaling Language Modeling with Pathways (2022-04-05)
GLU Variants Improve Transformer (2020-02-12)