Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Switch FFN

General · Introduced 2021 · 20 papers
Source Paper: Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (Fedus et al., 2021)

Description

A Switch FFN is a sparse layer that operates independently on each token in an input sequence. In the source paper's figure, two tokens (x_1 = "More" and x_2 = "Parameters") are routed (solid lines) across four FFN experts, with the router choosing an expert for each token independently. The layer returns the output of the selected FFN multiplied by the router's gate value (dotted line).
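
The routing logic is straightforward to sketch. Below is a minimal, illustrative PyTorch version of a Switch FFN with top-1 routing; the class name `SwitchFFN`, the expert architecture, and the per-expert loop are our own simplifications, and the paper's capacity factor and auxiliary load-balancing loss are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwitchFFN(nn.Module):
    """Minimal sketch of a Switch FFN: top-1 routing over FFN experts.

    Hypothetical simplification; omits the capacity factor and the
    load-balancing loss used in the Switch Transformers paper.
    """

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 4):
        super().__init__()
        # Router maps each token to one logit per expert.
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(d_model, d_ff),
                    nn.ReLU(),
                    nn.Linear(d_ff, d_model),
                )
                for _ in range(num_experts)
            ]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); each token is routed independently.
        probs = F.softmax(self.router(x), dim=-1)   # router gate values
        gate, expert_idx = probs.max(dim=-1)        # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i                  # tokens sent to expert i
            if mask.any():
                # Output of the selected FFN, scaled by the router gate value.
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out


# Usage: route 10 tokens of width 512 across 4 experts.
layer = SwitchFFN(d_model=512, d_ff=2048, num_experts=4)
y = layer(torch.randn(10, 512))
```

Because each token activates exactly one expert, the layer's parameter count grows with the number of experts while the per-token compute stays roughly that of a single FFN, which is the core idea behind the method's sparsity.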

Papers Using This Method

QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration (2025-05-10)
ExpertRAG: Efficient RAG with Mixture of Experts -- Optimizing Context Retrieval for Adaptive LLM Responses (2025-03-23)
ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration (2025-03-10)
Sparse Backpropagation for MoE Training (2023-10-01)
SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget (2023-08-29)
Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model (2023-05-23)
Condensing Multilingual Knowledge with Lightweight Language-Specific Modules (2023-05-23)
Improving Transformer Performance for French Clinical Notes Classification Using Mixture of Experts on a Limited Dataset (2023-03-22)
SMILE: Scaling Mixture-of-Experts with Efficient Bi-level Routing (2022-12-10)
Automatic Summarization of Russian Texts: Comparison of Extractive and Abstractive Methods (2022-06-18)
Argumentative Text Generation in Economic Domain (2022-06-18)
Build a Robust QA System with Transformer-based Mixture of Experts (2022-03-20)
Efficient Language Modeling with Sparse all-MLP (2022-03-14)
Switch Trajectory Transformer with Distributional Value Approximation for Multi-Task Reinforcement Learning (2022-03-14)
Mixture-of-Experts with Expert Choice Routing (2022-02-18)
Taming Sparsely Activated Transformer with Stochastic Experts (2021-10-08)
M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining (2021-10-08)
Random Offset Block Embedding Array (ROBE) for CriteoTB Benchmark MLPerf DLRM Model: 1000× Compression and 3.1× Faster Inference (2021-08-04)
Carbon Emissions and Large Neural Network Training (2021-04-21)
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (2021-01-11)