Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Discriminative Fine-Tuning

General · Introduced 2018

Source paper: Universal Language Model Fine-tuning for Text Classification (Howard & Ruder, 2018)

Description

Discriminative Fine-Tuning is a fine-tuning strategy used for ULMFiT-type models. Instead of applying the same learning rate to every layer of the model, discriminative fine-tuning tunes each layer with a different learning rate. For context, the regular stochastic gradient descent (SGD) update of a model's parameters $\theta$ at time step $t$ looks like the following (Ruder, 2016):

$$\theta_{t} = \theta_{t-1} - \eta \cdot \nabla_{\theta}J(\theta)$$

where $\eta$ is the learning rate and $\nabla_{\theta}J(\theta)$ is the gradient with regard to the model's objective function. For discriminative fine-tuning, we split the parameters $\theta$ into $\{\theta^{1}, \ldots, \theta^{L}\}$, where $\theta^{l}$ contains the parameters of the model at the $l$-th layer and $L$ is the number of layers of the model. Similarly, we obtain $\{\eta^{1}, \ldots, \eta^{L}\}$, where $\eta^{l}$ is the learning rate of the $l$-th layer. The SGD update with discriminative fine-tuning is then:

$$\theta_{t}^{l} = \theta_{t-1}^{l} - \eta^{l} \cdot \nabla_{\theta^{l}}J(\theta)$$

The authors find that empirically it works well to first choose the learning rate $\eta^{L}$ of the last layer by fine-tuning only the last layer, and then to use $\eta^{l-1} = \eta^{l}/2.6$ as the learning rate for the lower layers.
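The schedule and update above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names (`discriminative_lrs`, `discriminative_sgd_step`) are hypothetical, and parameters/gradients are toy nested lists rather than real model tensors.

```python
def discriminative_lrs(eta_last, num_layers, factor=2.6):
    """Per-layer learning rates following eta^{l-1} = eta^l / factor,
    with eta_last assigned to the top (last) layer."""
    lrs = [eta_last]
    for _ in range(num_layers - 1):
        lrs.append(lrs[-1] / factor)
    return lrs[::-1]  # index 0 = lowest layer, index L-1 = last layer

def discriminative_sgd_step(params, grads, lrs):
    """One update theta_t^l = theta_{t-1}^l - eta^l * grad^l, where
    params[l] and grads[l] hold layer l's parameters and gradients."""
    return [
        [p - lr * g for p, g in zip(layer_params, layer_grads)]
        for layer_params, layer_grads, lr in zip(params, grads, lrs)
    ]

# Example: a 3-layer model with last-layer learning rate 0.01.
lrs = discriminative_lrs(0.01, 3)    # lowest layer gets 0.01 / 2.6**2
params = [[1.0, 2.0], [0.5], [3.0]]  # toy per-layer parameters
grads = [[0.1, 0.1], [0.2], [0.3]]   # toy per-layer gradients
new_params = discriminative_sgd_step(params, grads, lrs)
```

In a deep-learning framework, the same idea corresponds to building one optimizer parameter group per layer; for instance, PyTorch optimizers accept a list of `{"params": ..., "lr": ...}` groups so each layer can carry its own learning rate.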

Papers Using This Method

- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
- Generative Click-through Rate Prediction with Applications to Search Advertising (2025-07-15)
- Behaviour Space Analysis of LLM-driven Meta-heuristic Discovery (2025-07-04)
- Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models (2025-06-28)
- Large Language Models Acing Chartered Accountancy (2025-06-26)
- Cat and Mouse -- Can Fake Text Generation Outpace Detector Systems? (2025-06-26)
- Large Language Model-Driven Code Compliance Checking in Building Information Modeling (2025-06-25)
- InsertRank: LLMs can reason over BM25 scores to Improve Listwise Reranking (2025-06-17)
- M2BeamLLM: Multimodal Sensing-empowered mmWave Beam Prediction with Large Language Models (2025-06-17)
- Toward a Graph Foundation Model: Pre-Training Transformers With Random Walks (2025-06-17)
- NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors (2025-06-12)
- Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization (2025-06-12)
- A Novel Lightweight Transformer with Edge-Aware Fusion for Remote Sensing Image Captioning (2025-06-11)
- Latent Multi-Head Attention for Small Language Models (2025-06-11)
- Evaluating LLMs Across Multi-Cognitive Levels: From Medical Knowledge Mastery to Scenario-Based Problem Solving (2025-06-10)
- AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP (2025-06-10)
- LLM-driven Indoor Scene Layout Generation via Scaled Human-aligned Data Synthesis and Multi-Stage Preference Optimization (2025-06-09)
- Generative Voice Bursts during Phone Call (2025-06-09)
- Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models (2025-06-08)
- RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation (2025-06-07)