Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Switch FFN

General · Introduced 2021 · 20 papers
Source Paper: Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (Fedus et al., 2021)

Description

A Switch FFN is a sparse layer that operates independently on each token in an input sequence. In the source paper's figure, two tokens (x_1 = "More" and x_2 = "Parameters") are routed (solid lines) across four FFN experts, with the router choosing an expert for each token independently. The layer returns the output of the selected FFN multiplied by the router's gate value (dotted line).
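
The routing logic is straightforward to sketch. Below is a minimal, illustrative PyTorch version of a Switch FFN with top-1 routing; the class name `SwitchFFN`, the expert architecture, and the per-expert loop are our own simplifications, and the paper's capacity factor and auxiliary load-balancing loss are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwitchFFN(nn.Module):
    """Minimal sketch of a Switch FFN: top-1 routing over FFN experts.

    Hypothetical simplification; omits the capacity factor and the
    load-balancing loss used in the Switch Transformers paper.
    """

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 4):
        super().__init__()
        # Router maps each token to one logit per expert.
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(d_model, d_ff),
                    nn.ReLU(),
                    nn.Linear(d_ff, d_model),
                )
                for _ in range(num_experts)
            ]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); each token is routed independently.
        probs = F.softmax(self.router(x), dim=-1)   # router gate values
        gate, expert_idx = probs.max(dim=-1)        # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i                  # tokens sent to expert i
            if mask.any():
                # Output of the selected FFN, scaled by the router gate value.
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out


# Usage: route 10 tokens of width 512 across 4 experts.
layer = SwitchFFN(d_model=512, d_ff=2048, num_experts=4)
y = layer(torch.randn(10, 512))
```

Because each token activates exactly one expert, the layer's parameter count grows with the number of experts while the per-token compute stays roughly that of a single FFN, which is the core idea behind the method's sparsity.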

Papers Using This Method

QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration (2025-05-10)
ExpertRAG: Efficient RAG with Mixture of Experts -- Optimizing Context Retrieval for Adaptive LLM Responses (2025-03-23)
ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration (2025-03-10)
Sparse Backpropagation for MoE Training (2023-10-01)
SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget (2023-08-29)
Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model (2023-05-23)
Condensing Multilingual Knowledge with Lightweight Language-Specific Modules (2023-05-23)
Improving Transformer Performance for French Clinical Notes Classification Using Mixture of Experts on a Limited Dataset (2023-03-22)
SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing (2022-12-10)
Automatic Summarization of Russian Texts: Comparison of Extractive and Abstractive Methods (2022-06-18)
Argumentative Text Generation in Economic Domain (2022-06-18)
Build a Robust QA System with Transformer-based Mixture of Experts (2022-03-20)
Efficient Language Modeling with Sparse all-MLP (2022-03-14)
Switch Trajectory Transformer with Distributional Value Approximation for Multi-Task Reinforcement Learning (2022-03-14)
Mixture-of-Experts with Expert Choice Routing (2022-02-18)
Taming Sparsely Activated Transformer with Stochastic Experts (2021-10-08)
M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining (2021-10-08)
Random Offset Block Embedding Array (ROBE) for CriteoTB Benchmark MLPerf DLRM Model: 1000× Compression and 3.1× Faster Inference (2021-08-04)
Carbon Emissions and Large Neural Network Training (2021-04-21)
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (2021-01-11)