Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DD-PPO

Decentralized Distributed Proximal Policy Optimization

Reinforcement Learning · Introduced 2019 · 8 papers
Source Paper

Description

Decentralized Distributed Proximal Policy Optimization (DD-PPO) is a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized parameter server), and synchronous (no computation is ever "stale"), making it conceptually simple and easy to implement.

Proximal Policy Optimization, or PPO, is a policy gradient method for reinforcement learning. The motivation was to have an algorithm with the data efficiency and reliable performance of TRPO, while using only first-order optimization.

Let $r_t(\theta)$ denote the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$, so $r(\theta_{old}) = 1$. TRPO maximizes a "surrogate" objective:

$$L^{v}(\theta) = \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\,\hat{A}_t\right] = \hat{\mathbb{E}}_t\left[r_t(\theta)\,\hat{A}_t\right]$$
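The ratio and surrogate objective above can be computed directly from per-timestep log-probabilities. A minimal NumPy sketch, using illustrative values (not from a real rollout) for the log-probabilities and advantage estimates:

```python
import numpy as np

# Hypothetical per-timestep data (illustrative values only)
logp_new = np.array([-0.9, -1.2, -0.4])   # log pi_theta(a_t | s_t)
logp_old = np.array([-1.0, -1.0, -0.5])   # log pi_theta_old(a_t | s_t)
advantages = np.array([0.5, -0.3, 1.0])   # advantage estimates A_hat_t

# Probability ratio r_t(theta) = pi_theta / pi_theta_old,
# computed in log space for numerical stability
ratio = np.exp(logp_new - logp_old)

# Surrogate objective: empirical mean over t of r_t(theta) * A_hat_t
surrogate = np.mean(ratio * advantages)
```

Note that at the old parameters the ratio is exactly 1 for every timestep, so the surrogate reduces to the mean advantage, matching $r(\theta_{old}) = 1$ above.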

As a general abstraction, DD-PPO implements the following: at step $k$, worker $n$ has a copy of the parameters, $\theta^k_n$, calculates the gradient, $\delta\theta^k_n$, and updates $\theta$ via

$$\theta^{k+1}_n = \text{ParamUpdate}\Big(\theta^k_n,\ \text{AllReduce}\big(\delta\theta^k_1, \ldots, \delta\theta^k_N\big)\Big) = \text{ParamUpdate}\Big(\theta^k_n,\ \frac{1}{N}\sum_{i=1}^{N} \delta\theta^k_i\Big)$$

where ParamUpdate\text{ParamUpdate}ParamUpdate is any first-order optimization technique (e.g. gradient descent) and AllReduce\text{AllReduce}AllReduce performs a reduction (e.g. mean) over all copies of a variable and returns the result to all workers. Distributed DataParallel scales very well (near-linear scaling up to 32,000 GPUs), and is reasonably simple to implement (all workers synchronously running identical code).
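The update rule can be sketched without any distributed framework by simulating the $N$ workers in a single process. A minimal sketch, assuming plain gradient descent as the ParamUpdate step and mean as the reduction; the per-worker gradients here are fabricated for illustration:

```python
import numpy as np

def all_reduce_mean(grads):
    # AllReduce with mean reduction: average the per-worker gradients
    # and return the same result to every worker.
    return np.mean(grads, axis=0)

def param_update(theta, avg_grad, lr=0.1):
    # ParamUpdate as plain gradient descent (any first-order method works).
    return theta - lr * avg_grad

N = 4                    # number of workers
theta = np.zeros(3)      # all workers start from identical parameters

for step in range(10):
    # Each worker n would compute delta_theta_n from its own rollouts;
    # here we fake a gradient of (theta - 1)^2 plus a small per-worker offset.
    grads = [2.0 * (theta - 1.0) + 0.01 * n for n in range(N)]
    # Because the reduction is synchronous and identical on every worker,
    # all copies of theta stay in lockstep, so one theta suffices here.
    theta = param_update(theta, all_reduce_mean(grads))
```

Because every worker applies the same averaged gradient at the same step, the parameter copies never diverge; this is what makes the synchronous scheme simple compared to asynchronous designs with a parameter server.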

Papers Using This Method

Decentralized Distributed Proximal Policy Optimization (DD-PPO) for High Performance Computing Scheduling on Multi-User Systems (2025-05-06)
Sharing Lifelong Reinforcement Learning Knowledge via Modulating Masks (2023-05-18)
Comparison of Model-Free and Model-Based Learning-Informed Planning for PointGoal Navigation (2022-12-17)
VER: Scaling On-Policy RL Leads to the Emergence of Navigation in Embodied Rearrangement (2022-10-11)
Uncertainty-driven Planner for Exploration and Navigation (2022-02-24)
Integrating Egocentric Localization for More Realistic Point-Goal Navigation Agents (2020-09-07)
Auxiliary Tasks Speed Up Learning PointGoal Navigation (2020-07-09)
DD-PPO: Learning Near-Perfect PointGoal Navigators from 2.5 Billion Frames (2019-11-01)