Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Target Policy Smoothing

General · Introduced 2018 · 116 papers
Source Paper: Addressing Function Approximation Error in Actor-Critic Methods (Fujimoto et al., 2018)

Description

Target Policy Smoothing is a regularization strategy for the value function in reinforcement learning. Deterministic policies can overfit to narrow peaks in the value estimate, making them highly susceptible to function approximation error and increasing the variance of the target. To reduce this variance, target policy smoothing adds a small amount of clipped random noise to the target action and averages over mini-batches, approximating a SARSA-like expectation over nearby actions.

The modified target update is:

$$y = r + \gamma Q_{\theta'}\left(s', \pi_{\theta'}(s') + \epsilon\right)$$

$$\epsilon \sim \text{clip}\left(\mathcal{N}(0, \sigma), -c, c\right)$$

where the added noise is clipped to keep the target action close to the original. The result is an algorithm reminiscent of Expected SARSA, except that the value estimate is learned off-policy and the noise added to the target policy is chosen independently of the exploration policy. The learned value estimate is therefore with respect to a noisy policy defined by the parameter $\sigma$.
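The target update above can be sketched in a few lines. This is a minimal illustration, not an implementation from the source paper: the function name and the `target_policy`/`target_q` callables (standing in for the target actor $\pi_{\theta'}$ and target critic $Q_{\theta'}$) are assumptions, and the default hyperparameters ($\sigma = 0.2$, $c = 0.5$) are the commonly used TD3 values.

```python
import numpy as np

def smoothed_td_target(reward, next_state, target_policy, target_q,
                       gamma=0.99, sigma=0.2, noise_clip=0.5,
                       action_low=-1.0, action_high=1.0):
    """TD target with target policy smoothing (TD3-style sketch).

    target_policy: callable s' -> a', the target actor pi_theta'.
    target_q:      callable (s', a') -> Q, the target critic Q_theta'.
    """
    action = target_policy(next_state)
    # Sample Gaussian noise and clip it to [-c, c] so the smoothed
    # action stays close to the original target action.
    noise = np.clip(np.random.normal(0.0, sigma, size=action.shape),
                    -noise_clip, noise_clip)
    # Keep the perturbed action inside the valid action range.
    smoothed = np.clip(action + noise, action_low, action_high)
    # y = r + gamma * Q_theta'(s', pi_theta'(s') + eps)
    return reward + gamma * target_q(next_state, smoothed)
```

Averaging this target over a mini-batch is what approximates the expectation over the noise distribution; in practice the smoothing noise is resampled independently for every transition in the batch.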

Papers Using This Method

Gradual Transition from Bellman Optimality Operator to Bellman Operator in Online Reinforcement Learning (2025-06-06)
FastTD3: Simple, Fast, and Capable Reinforcement Learning for Humanoid Control (2025-05-28)
LLM-Explorer: A Plug-in Reinforcement Learning Policy Exploration Enhancement Driven by Large Language Models (2025-05-21)
Monte Carlo Beam Search for Actor-Critic Reinforcement Learning in Continuous Control (2025-05-13)
Energy Efficient RSMA-Based LEO Satellite Communications Assisted by UAV-Mounted BD-Active RIS: A DRL Approach (2025-05-07)
AlphaGrad: Non-Linear Gradient Normalization Optimizer (2025-04-22)
Motion Control in Multi-Rotor Aerial Robots Using Deep Reinforcement Learning (2025-02-09)
TD3: Tucker Decomposition Based Dataset Distillation Method for Sequential Recommendation (2025-02-05)
EvoRL: A GPU-accelerated Framework for Evolutionary Reinforcement Learning (2025-01-25)
Enhancing UAV Path Planning Efficiency Through Accelerated Learning (2025-01-17)
An Advantage-based Optimization Method for Reinforcement Learning in Large Action Space (2024-12-17)
Provably Efficient Action-Manipulation Attack Against Continuous Reinforcement Learning (2024-11-20)
Reinforcement Learning Gradients as Vitamin for Online Finetuning Decision Transformers (2024-10-31)
NetworkGym: Reinforcement Learning Environments for Multi-Access Traffic Management in Network Simulation (2024-10-30)
Human-Readable Programs as Actors of Reinforcement Learning Agents Using Critic-Moderated Evolution (2024-10-29)
Mitigating Suboptimality of Deterministic Policy Gradients in Complex Q-functions (2024-10-15)
Navigation in a simplified Urban Flow through Deep Reinforcement Learning (2024-09-26)
Simultaneous Training of First- and Second-Order Optimizers in Population-Based Reinforcement Learning (2024-08-27)
Optimizing TD3 for 7-DOF Robotic Arm Grasping: Overcoming Suboptimality with Exploration-Enhanced Contrastive Learning (2024-08-26)
Image-Based Deep Reinforcement Learning with Intrinsically Motivated Stimuli: On the Execution of Complex Robotic Tasks (2024-07-31)