Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Discriminative Fine-Tuning

General · Introduced 2018

Source paper: Universal Language Model Fine-tuning for Text Classification (Howard & Ruder, 2018)

Description

Discriminative Fine-Tuning is a fine-tuning strategy used for ULMFiT-type models. Instead of applying the same learning rate to every layer of the model, discriminative fine-tuning tunes each layer with a different learning rate. For context, the regular stochastic gradient descent (SGD) update of a model's parameters $\theta$ at time step $t$ looks like the following (Ruder, 2016):

$$\theta_{t} = \theta_{t-1} - \eta \cdot \nabla_{\theta}J(\theta)$$

where $\eta$ is the learning rate and $\nabla_{\theta}J(\theta)$ is the gradient with regard to the model's objective function. For discriminative fine-tuning, we split the parameters $\theta$ into $\{\theta^{1}, \ldots, \theta^{L}\}$, where $\theta^{l}$ contains the parameters of the model at the $l$-th layer and $L$ is the number of layers of the model. Similarly, we obtain $\{\eta^{1}, \ldots, \eta^{L}\}$, where $\eta^{l}$ is the learning rate of the $l$-th layer. The SGD update with discriminative fine-tuning is then:

$$\theta_{t}^{l} = \theta_{t-1}^{l} - \eta^{l} \cdot \nabla_{\theta^{l}}J(\theta)$$

The authors find that empirically it works well to first choose the learning rate $\eta^{L}$ of the last layer by fine-tuning only the last layer, and then to use $\eta^{l-1} = \eta^{l}/2.6$ as the learning rate for the lower layers.
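The schedule and update above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names (`discriminative_lrs`, `discriminative_sgd_step`) are hypothetical, and parameters/gradients are toy nested lists rather than real model tensors.

```python
def discriminative_lrs(eta_last, num_layers, factor=2.6):
    """Per-layer learning rates following eta^{l-1} = eta^l / factor,
    with eta_last assigned to the top (last) layer."""
    lrs = [eta_last]
    for _ in range(num_layers - 1):
        lrs.append(lrs[-1] / factor)
    return lrs[::-1]  # index 0 = lowest layer, index L-1 = last layer

def discriminative_sgd_step(params, grads, lrs):
    """One update theta_t^l = theta_{t-1}^l - eta^l * grad^l, where
    params[l] and grads[l] hold layer l's parameters and gradients."""
    return [
        [p - lr * g for p, g in zip(layer_params, layer_grads)]
        for layer_params, layer_grads, lr in zip(params, grads, lrs)
    ]

# Example: a 3-layer model with last-layer learning rate 0.01.
lrs = discriminative_lrs(0.01, 3)    # lowest layer gets 0.01 / 2.6**2
params = [[1.0, 2.0], [0.5], [3.0]]  # toy per-layer parameters
grads = [[0.1, 0.1], [0.2], [0.3]]   # toy per-layer gradients
new_params = discriminative_sgd_step(params, grads, lrs)
```

In a deep-learning framework, the same idea corresponds to building one optimizer parameter group per layer; for instance, PyTorch optimizers accept a list of `{"params": ..., "lr": ...}` groups so each layer can carry its own learning rate.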

Papers Using This Method

- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
- Generative Click-through Rate Prediction with Applications to Search Advertising (2025-07-15)
- Behaviour Space Analysis of LLM-driven Meta-heuristic Discovery (2025-07-04)
- Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models (2025-06-28)
- Large Language Models Acing Chartered Accountancy (2025-06-26)
- Cat and Mouse -- Can Fake Text Generation Outpace Detector Systems? (2025-06-26)
- Large Language Model-Driven Code Compliance Checking in Building Information Modeling (2025-06-25)
- InsertRank: LLMs can reason over BM25 scores to Improve Listwise Reranking (2025-06-17)
- M2BeamLLM: Multimodal Sensing-empowered mmWave Beam Prediction with Large Language Models (2025-06-17)
- Toward a Graph Foundation Model: Pre-Training Transformers With Random Walks (2025-06-17)
- NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors (2025-06-12)
- Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization (2025-06-12)
- A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning (2025-06-11)
- Latent Multi-Head Attention for Small Language Models (2025-06-11)
- Evaluating LLMs Across Multi-Cognitive Levels: From Medical Knowledge Mastery to Scenario-Based Problem Solving (2025-06-10)
- AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP (2025-06-10)
- LLM-driven Indoor Scene Layout Generation via Scaled Human-aligned Data Synthesis and Multi-Stage Preference Optimization (2025-06-09)
- Generative Voice Bursts during Phone Call (2025-06-09)
- Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models (2025-06-08)
- RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation (2025-06-07)