Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Target Policy Smoothing

General · Introduced 2018 · 116 papers
Source Paper: Addressing Function Approximation Error in Actor-Critic Methods (Fujimoto et al., 2018)

Description

Target Policy Smoothing is a regularization strategy for the value function in reinforcement learning. Deterministic policies can overfit to narrow peaks in the value estimate, making them highly susceptible to function approximation error and increasing the variance of the target. To reduce this variance, target policy smoothing adds a small amount of clipped random noise to the target action and averages over mini-batches, approximating a SARSA-like expectation over nearby actions.

The modified target update is:

$$y = r + \gamma Q_{\theta'}\left(s', \pi_{\theta'}(s') + \epsilon\right)$$

$$\epsilon \sim \text{clip}\left(\mathcal{N}(0, \sigma), -c, c\right)$$

where the added noise is clipped to keep the target action close to the original. The result is an algorithm reminiscent of Expected SARSA, except that the value estimate is learned off-policy and the noise added to the target policy is chosen independently of the exploration policy. The learned value estimate is therefore with respect to a noisy policy defined by the parameter $\sigma$.
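The target update above can be sketched in a few lines. This is a minimal illustration, not an implementation from the source paper: the function name and the `target_policy`/`target_q` callables (standing in for the target actor $\pi_{\theta'}$ and target critic $Q_{\theta'}$) are assumptions, and the default hyperparameters ($\sigma = 0.2$, $c = 0.5$) are the commonly used TD3 values.

```python
import numpy as np

def smoothed_td_target(reward, next_state, target_policy, target_q,
                       gamma=0.99, sigma=0.2, noise_clip=0.5,
                       action_low=-1.0, action_high=1.0):
    """TD target with target policy smoothing (TD3-style sketch).

    target_policy: callable s' -> a', the target actor pi_theta'.
    target_q:      callable (s', a') -> Q, the target critic Q_theta'.
    """
    action = target_policy(next_state)
    # Sample Gaussian noise and clip it to [-c, c] so the smoothed
    # action stays close to the original target action.
    noise = np.clip(np.random.normal(0.0, sigma, size=action.shape),
                    -noise_clip, noise_clip)
    # Keep the perturbed action inside the valid action range.
    smoothed = np.clip(action + noise, action_low, action_high)
    # y = r + gamma * Q_theta'(s', pi_theta'(s') + eps)
    return reward + gamma * target_q(next_state, smoothed)
```

Averaging this target over a mini-batch is what approximates the expectation over the noise distribution; in practice the smoothing noise is resampled independently for every transition in the batch.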

Papers Using This Method

Gradual Transition from Bellman Optimality Operator to Bellman Operator in Online Reinforcement Learning (2025-06-06)
FastTD3: Simple, Fast, and Capable Reinforcement Learning for Humanoid Control (2025-05-28)
LLM-Explorer: A Plug-in Reinforcement Learning Policy Exploration Enhancement Driven by Large Language Models (2025-05-21)
Monte Carlo Beam Search for Actor-Critic Reinforcement Learning in Continuous Control (2025-05-13)
Energy Efficient RSMA-Based LEO Satellite Communications Assisted by UAV-Mounted BD-Active RIS: A DRL Approach (2025-05-07)
AlphaGrad: Non-Linear Gradient Normalization Optimizer (2025-04-22)
Motion Control in Multi-Rotor Aerial Robots Using Deep Reinforcement Learning (2025-02-09)
TD3: Tucker Decomposition Based Dataset Distillation Method for Sequential Recommendation (2025-02-05)
EvoRL: A GPU-accelerated Framework for Evolutionary Reinforcement Learning (2025-01-25)
Enhancing UAV Path Planning Efficiency Through Accelerated Learning (2025-01-17)
An Advantage-based Optimization Method for Reinforcement Learning in Large Action Space (2024-12-17)
Provably Efficient Action-Manipulation Attack Against Continuous Reinforcement Learning (2024-11-20)
Reinforcement Learning Gradients as Vitamin for Online Finetuning Decision Transformers (2024-10-31)
NetworkGym: Reinforcement Learning Environments for Multi-Access Traffic Management in Network Simulation (2024-10-30)
Human-Readable Programs as Actors of Reinforcement Learning Agents Using Critic-Moderated Evolution (2024-10-29)
Mitigating Suboptimality of Deterministic Policy Gradients in Complex Q-functions (2024-10-15)
Navigation in a simplified Urban Flow through Deep Reinforcement Learning (2024-09-26)
Simultaneous Training of First- and Second-Order Optimizers in Population-Based Reinforcement Learning (2024-08-27)
Optimizing TD3 for 7-DOF Robotic Arm Grasping: Overcoming Suboptimality with Exploration-Enhanced Contrastive Learning (2024-08-26)
Image-Based Deep Reinforcement Learning with Intrinsically Motivated Stimuli: On the Execution of Complex Robotic Tasks (2024-07-31)