Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


PPO

Proximal Policy Optimization

Reinforcement Learning · Introduced 2017 · 949 papers
Source Paper

Description

Proximal Policy Optimization, or PPO, is a policy gradient method for reinforcement learning. The motivation was to have an algorithm with the data efficiency and reliable performance of TRPO, while using only first-order optimization.

Let $r_{t}(\theta)$ denote the probability ratio $r_{t}(\theta) = \frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{\text{old}}}(a_{t} \mid s_{t})}$, so $r_{t}(\theta_{\text{old}}) = 1$. TRPO maximizes a “surrogate” objective:

$$L^{\text{CPI}}(\theta) = \hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{\text{old}}}(a_{t} \mid s_{t})}\,\hat{A}_{t}\right] = \hat{\mathbb{E}}_{t}\left[r_{t}(\theta)\,\hat{A}_{t}\right]$$
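As a concrete illustration, the ratio and the CPI surrogate can be estimated from stored log-probabilities. This is a minimal NumPy sketch; the function name and batch layout are illustrative, not taken from any particular library:

```python
import numpy as np

def surrogate_cpi(logp_new, logp_old, advantages):
    """Unclipped CPI surrogate: batch mean of r_t(theta) * A_hat_t.

    The ratio is computed from log-probabilities, since policies
    typically store log pi(a|s) rather than raw probabilities.
    """
    ratio = np.exp(logp_new - logp_old)  # r_t(theta)
    return np.mean(ratio * advantages)

# At theta = theta_old the ratio is exactly 1, so the surrogate
# reduces to the mean advantage.
logp_old = np.log(np.array([0.2, 0.5, 0.3]))
adv = np.array([1.0, -0.5, 2.0])
print(surrogate_cpi(logp_old, logp_old, adv))  # mean(adv)
```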

where CPI refers to conservative policy iteration. Without a constraint, maximizing $L^{\text{CPI}}$ would lead to an excessively large policy update; hence, PPO modifies the objective to penalize changes to the policy that move $r_{t}(\theta)$ away from 1:

$$J^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_{t}\left[\min\left(r_{t}(\theta)\,\hat{A}_{t},\; \text{clip}\left(r_{t}(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_{t}\right)\right]$$

where $\epsilon$ is a hyperparameter, say $\epsilon = 0.2$. The motivation for this objective is as follows. The first term inside the min is $L^{\text{CPI}}$. The second term, $\text{clip}\left(r_{t}(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_{t}$, modifies the surrogate objective by clipping the probability ratio, which removes the incentive for moving $r_{t}$ outside of the interval $\left[1-\epsilon, 1+\epsilon\right]$. Finally, we take the minimum of the clipped and unclipped objective, so the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective. With this scheme, the change in probability ratio is ignored only when it would improve the objective, and included when it would make the objective worse.
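The clipped objective maps directly onto code. The following NumPy sketch implements $J^{\text{CLIP}}$ for a batch of samples (names are illustrative; a real implementation would operate on autograd tensors so the loss can be differentiated):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate J^CLIP, averaged over the batch."""
    ratio = np.exp(logp_new - logp_old)           # r_t(theta)
    unclipped = ratio * advantages                # L^CPI term
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # Taking the min makes the objective a pessimistic lower bound.
    return np.mean(np.minimum(unclipped, clipped))

# Positive advantage, ratio 2.0: the clip caps the term at 1.2 * A.
print(ppo_clip_objective(np.array([np.log(2.0)]),
                         np.array([0.0]),
                         np.array([1.0])))  # 1.2
```

Note the asymmetry: with a negative advantage and a small ratio, the min selects the clipped (worse) value, so harmful ratio changes are never ignored.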

One detail to note is that when PPO is applied to a network with shared parameters for the actor and critic, the objective function typically gains a value-estimation error term and an entropy term that encourages exploration.
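Putting the pieces together, the combined objective for a shared actor-critic network can be sketched as follows. The coefficients `c_v` and `c_ent` are common default choices, not values prescribed by the method, and all names here are illustrative:

```python
import numpy as np

def ppo_total_objective(logp_new, logp_old, advantages,
                        values, returns, entropy,
                        eps=0.2, c_v=0.5, c_ent=0.01):
    """Combined PPO objective: clipped policy surrogate, minus a
    value-estimation error term, plus an entropy bonus (maximized)."""
    ratio = np.exp(logp_new - logp_old)
    policy_obj = np.mean(np.minimum(
        ratio * advantages,
        np.clip(ratio, 1 - eps, 1 + eps) * advantages))
    value_loss = np.mean((values - returns) ** 2)   # critic error term
    entropy_bonus = np.mean(entropy)                # exploration term
    return policy_obj - c_v * value_loss + c_ent * entropy_bonus

# With ratio 1, a perfect critic, and entropy 0.5, the objective is
# mean(advantage) + 0.01 * 0.5.
logp = np.log(np.array([0.5, 0.5]))
print(ppo_total_objective(logp, logp,
                          np.array([1.0, 1.0]),
                          np.array([0.0, 0.0]),
                          np.array([0.0, 0.0]),
                          np.array([0.5, 0.5])))  # 1.005
```

In practice this quantity is maximized (or its negation minimized) with a first-order optimizer over several epochs of minibatch updates on the same batch of trajectories.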

Papers Using This Method

- Step-wise Policy for Rare-tool Knowledge (SPaRK): Offline RL that Drives Diverse Tool Use in LLMs (2025-07-15)
- AirLLM: Diffusion Policy-based Adaptive LoRA for Remote Fine-Tuning of LLM over the Air (2025-07-15)
- Scene-Aware Conversational ADAS with Generative AI for Real-Time Driver Assistance (2025-07-14)
- Meta-Reinforcement Learning for Fast and Data-Efficient Spectrum Allocation in Dynamic Wireless Networks (2025-07-13)
- Deep Reinforcement Learning with Gradient Eligibility Traces (2025-07-12)
- Geo-ORBIT: A Federated Digital Twin Framework for Scene-Adaptive Lane Geometry Detection (2025-07-11)
- LeAD: The LLM Enhanced Planning System Converged with End-to-end Autonomous Driving (2025-07-08)
- Model-free Optical Processors using In Situ Reinforcement Learning with Proximal Policy Optimization (2025-07-08)
- YOLO-APD: Enhancing YOLOv8 for Robust Pedestrian Detection on Complex Road Geometries (2025-07-07)
- 2048: Reinforcement Learning in a Delayed Reward Environment (2025-07-07)
- LLM-based Realistic Safety-Critical Driving Video Generation (2025-07-02)
- BIDA: A Bi-level Interaction Decision-making Algorithm for Autonomous Vehicles in Dynamic Traffic Scenarios (2025-06-19)
- Multi-Agent Reinforcement Learning for Autonomous Multi-Satellite Earth Observation: A Realistic Case Study (2025-06-18)
- Light Aircraft Game: Basic Implementation and Training Results Analysis (2025-06-17)
- TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization (2025-06-17)
- Algorithmic Approaches to Enhance Safety in Autonomous Vehicles: Minimizing Lane Changes and Merging (2025-06-17)
- AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning (2025-06-16)
- Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models (2025-06-16)
- How Real is CARLA's Dynamic Vision Sensor? A Study on the Sim-to-Real Gap in Traffic Object Detection (2025-06-16)
- Ego-centric Learning of Communicative World Models for Autonomous Driving (2025-06-09)