Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Weight Tying

General · Introduced 2000 · 63 papers
Source Paper

Description

Weight Tying improves the performance of language models by tying (sharing) the weights of the embedding and softmax layers. It also substantially reduces the total number of parameters in the models to which it is applied.

Language models typically consist of an embedding layer, followed by a number of Transformer or LSTM layers, and finally a softmax layer. Embedding layers learn word representations such that words with similar meanings are represented by vectors that are close to each other (in cosine distance). [Press & Wolf, 2016] showed that the softmax matrix, in which every word also has a vector representation, exhibits the same property. This led them to propose sharing the softmax and embedding matrices, which is now done in nearly all language models.
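The sharing described above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from either paper; the vocabulary size, model dimension, and function names are all assumptions chosen for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10_000, 512

# One shared matrix serves as both the input embedding and the
# softmax (output projection) weights.
E = rng.normal(scale=0.02, size=(vocab_size, d_model))

def embed(token_ids):
    """Input side: look up rows of the shared matrix E."""
    return E[token_ids]                # (seq_len, d_model)

def output_logits(hidden):
    """Output side: project hidden states with the transpose of E."""
    return hidden @ E.T                # (seq_len, vocab_size)

# An untied model would need a second (vocab_size, d_model) matrix,
# so tying halves the parameters in these two layers.
tied_params = E.size
untied_params = 2 * E.size
print(untied_params - tied_params)     # 5120000 parameters saved
```

Because both directions read and write the same matrix, gradients from the softmax layer also update the embedding, which is one intuition for why tying can improve quality as well as shrink the model.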

This method was independently introduced by Press & Wolf, 2016 and Inan et al., 2016.

Additionally, the Press & Wolf paper proposes Three-way Weight Tying, a method for NMT models in which the embedding matrix of the source language, the embedding matrix of the target language, and the softmax matrix of the target language are all tied. This method was adopted by the Transformer model of Attention Is All You Need and by many other neural machine translation models.
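Three-way tying requires a single vocabulary shared by the source and target languages (in practice, subword units learned jointly over both). A minimal sketch, with all names and sizes assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A joint source+target vocabulary is a prerequisite for tying
# across languages (e.g. shared subword units).
joint_vocab_size, d_model = 32_000, 512

# One matrix plays all three roles.
W = rng.normal(scale=0.02, size=(joint_vocab_size, d_model))

def embed_source(src_ids):
    return W[src_ids]               # encoder input embedding

def embed_target(tgt_ids):
    return W[tgt_ids]               # decoder input embedding

def target_logits(decoder_hidden):
    return decoder_hidden @ W.T     # decoder softmax projection

# Untied: three (joint_vocab_size, d_model) matrices; tied: one.
print(3 * W.size - W.size)          # 32768000 parameters saved
```

The single matrix replaces what would otherwise be three separately trained matrices of the same shape.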

Papers Using This Method

Advanced Deep Learning Techniques for Analyzing Earnings Call Transcripts: Methodologies and Applications (2025-02-27)
No Argument Left Behind: Overlapping Chunks for Faster Processing of Arbitrarily Long Legal Texts (2024-10-24)
RICo: Reddit ideological communities (2024-06-05)
Exploring Multi-Level Threats in Telegram Data with AI-Human Annotation: A Preliminary Study (2023-12-15)
Illicit Darkweb Classification via Natural-language Processing: Classifying Illicit Content of Webpages based on Textual Information (2023-12-08)
Tied-Lora: Enhancing parameter efficiency of LoRA with weight tying (2023-11-16)
Headless Language Models: Learning without Predicting with Contrastive Weight Tying (2023-09-15)
Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation (2023-06-30)
Explainable and High-Performance Hate and Offensive Speech Detection (2022-06-26)
Approximately Equivariant Networks for Imperfectly Symmetric Dynamics (2022-01-28)
IIITT@Dravidian-CodeMix-FIRE2021: Transliterate or translate? Sentiment analysis of code-mixed text in Dravidian languages (2021-11-15)
Offensive Language Identification in Low-resourced Code-mixed Dravidian languages using Pseudo-labeling (2021-08-27)
Towards Offensive Language Identification for Tamil Code-Mixed YouTube Comments and Posts (2021-08-24)
Learning ULMFiT and Self-Distillation with Calibration for Medical Dialogue System (2021-07-20)
WHOSe Heritage: Classification of UNESCO World Heritage "Outstanding Universal Value" Documents with Soft Labels (2021-04-12)
L3CubeMahaSent: A Marathi Tweet-based Sentiment Analysis Dataset (2021-03-21)
On the Theory of Implicit Deep Learning: Global Convergence with Implicit Layers (2021-02-15)
indicnlp@kgp at DravidianLangTech-EACL2021: Offensive Language Identification in Dravidian Languages (2021-02-14)
Train your classifier first: Cascade Neural Networks Training from upper layers to lower layers (2021-02-09)