Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TRPO

Trust Region Policy Optimization

Reinforcement Learning · Introduced 2015 · 81 papers
Source Paper

Description

Trust Region Policy Optimization, or TRPO, is a policy gradient method in reinforcement learning that avoids parameter updates that change the policy too much, by enforcing a KL-divergence constraint on the size of the policy update at each iteration.

Take the case of off-policy reinforcement learning, where the policy $\beta$ used to collect trajectories on rollout workers differs from the policy $\pi$ being optimized. The objective function in an off-policy model measures the total advantage over the state visitation distribution and actions, while the mismatch between the training data distribution and the true policy state distribution is corrected with an importance sampling estimator:

$$J(\theta) = \sum_{s\in S} p^{\pi_{\theta_{old}}} \sum_{a\in\mathcal{A}} \pi_{\theta}(a\mid s)\,\hat{A}_{\theta_{old}}(s, a)$$

$$J(\theta) = \sum_{s\in S} p^{\pi_{\theta_{old}}} \sum_{a\in\mathcal{A}} \beta(a\mid s)\,\frac{\pi_{\theta}(a\mid s)}{\beta(a\mid s)}\,\hat{A}_{\theta_{old}}(s, a)$$

$$J(\theta) = \mathbb{E}_{s\sim p^{\pi_{\theta_{old}}},\, a\sim\beta}\left[\frac{\pi_{\theta}(a\mid s)}{\beta(a\mid s)}\,\hat{A}_{\theta_{old}}(s, a)\right]$$
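The importance-sampling correction above can be sanity-checked numerically. Below is a minimal sketch for a single state with three discrete actions; the distributions, advantage values, and sample count are illustrative assumptions, not from the TRPO paper:

```python
import numpy as np

rng = np.random.default_rng(0)

pi_theta = np.array([0.2, 0.5, 0.3])   # policy being optimized, pi_theta(a|s)
beta     = np.array([0.4, 0.4, 0.2])   # stale behavior policy, beta(a|s)
adv      = np.array([1.0, -0.5, 2.0])  # advantage estimates A_hat(s, a)

# Exact objective at this state: sum_a pi_theta(a|s) * A_hat(s, a)
exact = np.sum(pi_theta * adv)

# Importance-sampled Monte Carlo estimate with actions drawn from beta:
# E_{a~beta}[ (pi_theta(a|s) / beta(a|s)) * A_hat(s, a) ]
actions  = rng.choice(3, size=200_000, p=beta)
ratios   = pi_theta[actions] / beta[actions]
estimate = np.mean(ratios * adv[actions])

print(exact, estimate)  # the two values should agree closely
```

The ratio $\pi_\theta(a\mid s)/\beta(a\mid s)$ reweights each sampled action so the expectation under $\beta$ matches the expectation under $\pi_\theta$, which is exactly what lets the objective be estimated from stale rollout data.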

When training on-policy, the policy used to collect data is, in theory, the same as the policy being optimized. In practice, however, when rollout workers and optimizers run asynchronously in parallel, the behavior policy can become stale. TRPO accounts for this subtle difference: it labels the behavior policy as $\pi_{\theta_{old}}(a\mid s)$, and the objective function becomes:

$$J(\theta) = \mathbb{E}_{s\sim p^{\pi_{\theta_{old}}},\, a\sim\pi_{\theta_{old}}}\left[\frac{\pi_{\theta}(a\mid s)}{\pi_{\theta_{old}}(a\mid s)}\,\hat{A}_{\theta_{old}}(s, a)\right]$$

TRPO aims to maximize the objective function $J(\theta)$ subject to a trust-region constraint that keeps the distance between the old and new policies, measured by KL divergence, within a parameter $\delta$:

$$\mathbb{E}_{s\sim p^{\pi_{\theta_{old}}}}\left[D_{KL}\left(\pi_{\theta_{old}}(\cdot\mid s)\,\|\,\pi_{\theta}(\cdot\mid s)\right)\right] \leq \delta$$
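A minimal sketch of enforcing this constraint for a single-state categorical policy, using a plain gradient direction and a backtracking line search rather than the conjugate-gradient natural-gradient step of the full algorithm (all names and numbers below are illustrative assumptions):

```python
import numpy as np

def kl(p, q):
    """KL divergence D_KL(p || q) between discrete distributions."""
    return np.sum(p * np.log(p / q))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

theta_old = np.array([0.0, 0.5, -0.5])  # logits of the old policy
adv   = np.array([1.0, -0.5, 2.0])      # advantage estimates A_hat(s, a)
delta = 0.01                             # trust-region size

pi_old = softmax(theta_old)

# Gradient of the surrogate E_{a~pi_old}[(pi_theta/pi_old) * A_hat]
# with respect to the logits, evaluated at theta_old:
grad = pi_old * (adv - np.dot(pi_old, adv))

# Backtracking line search: shrink the step until the KL constraint holds.
step = 10.0
while kl(pi_old, softmax(theta_old + step * grad)) > delta:
    step *= 0.5

pi_new = softmax(theta_old + step * grad)
print(kl(pi_old, pi_new))  # within the trust region, i.e. <= delta
```

Because the step is accepted only once the average KL falls below $\delta$, the new policy can never move far from the old one in a single update, which is the stabilizing property the trust region is designed to provide.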

Papers Using This Method

- StaQ it! Growing neural networks for Policy Mirror Descent (2025-06-16)
- Finite-Sample Convergence Bounds for Trust Region Policy Optimization in Mean-Field Games (2025-05-28)
- Improving Value Estimation Critically Enhances Vanilla Policy Gradient (2025-05-25)
- Energy Efficient RSMA-Based LEO Satellite Communications Assisted by UAV-Mounted BD-Active RIS: A DRL Approach (2025-05-07)
- Deep Reinforcement Learning-Based User Association in Hybrid LiFi/WiFi Indoor Networks (2025-03-03)
- Fast Convergence of Softmax Policy Mirror Ascent (2024-11-18)
- Dynamics of Resource Allocation in O-RANs: An In-depth Exploration of On-Policy and Off-Policy Deep Reinforcement Learning for Real-Time Applications (2024-11-17)
- Embedding Safety into RL: A New Take on Trust Region Methods (2024-11-05)
- Matrix Low-Rank Trust Region Policy Optimization (2024-05-27)
- Linear Function Approximation as a Computationally Efficient Method to Solve Classical Reinforcement Learning Challenges (2024-05-27)
- Joint Physical-Digital Facial Attack Detection Via Simulating Spoofing Clues (2024-04-12)
- Policy Mirror Descent with Lookahead (2024-03-21)
- Convergence for Natural Policy Gradient on Infinite-State Queueing MDPs (2024-02-07)
- Simple Policy Optimization (2024-01-29)
- Clipped-Objective Policy Gradients for Pessimistic Policy Optimization (2023-11-10)
- Dropout Strategy in Reinforcement Learning: Limiting the Surrogate Objective Variance in Policy Optimization Methods (2023-10-31)
- Distributional Soft Actor-Critic with Three Refinements (2023-10-09)
- General Munchausen Reinforcement Learning with Tsallis Kullback-Leibler Divergence (2023-09-21)
- Distributional Estimation of Data Uncertainty for Surveillance Face Anti-spoofing (2023-09-18)
- ContainerGym: A Real-World Reinforcement Learning Benchmark for Resource Allocation (2023-07-06)