Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Prioritized Experience Replay

Reinforcement Learning · Introduced 2015 · 138 papers
Source Paper

Description

Prioritized Experience Replay is a type of experience replay in reinforcement learning where we more frequently replay transitions with high expected learning progress, as measured by the magnitude of their temporal-difference (TD) error. This prioritization can lead to a loss of diversity, which is alleviated with stochastic prioritization, and introduce bias, which can be corrected with importance sampling.

The stochastic sampling method interpolates between pure greedy prioritization and uniform random sampling. The probability of being sampled is ensured to be monotonic in a transition's priority, while guaranteeing a non-zero probability even for the lowest-priority transition. Concretely, define the probability of sampling transition $i$ as

$$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$$

where $p_i > 0$ is the priority of transition $i$. The exponent $\alpha$ determines how much prioritization is used, with $\alpha = 0$ corresponding to the uniform case.
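As an illustration, the sampling distribution above can be sketched in a few lines of NumPy (the function name and values here are illustrative, not from the source paper):

```python
import numpy as np

def sampling_probs(priorities, alpha):
    """P(i) = p_i^alpha / sum_k p_k^alpha (illustrative helper)."""
    scaled = np.asarray(priorities, dtype=np.float64) ** alpha
    return scaled / scaled.sum()

# alpha = 0 recovers uniform sampling; larger alpha is greedier.
probs = sampling_probs([1.0, 2.0, 4.0], alpha=1.0)    # -> [1/7, 2/7, 4/7]
batch = np.random.choice(len(probs), size=2, p=probs)  # sample transition indices
```

In practice, large replay buffers use a sum-tree so that sampling and priority updates cost $O(\log N)$ rather than the $O(N)$ of this direct computation.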

Prioritized replay introduces bias because it changes this distribution in an uncontrolled fashion, and therefore changes the solution that the estimates will converge to. We can correct this bias by using importance-sampling (IS) weights:

$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^{\beta}$$

which fully compensates for the non-uniform probabilities $P(i)$ if $\beta = 1$. These weights can be folded into the Q-learning update by using $w_i \delta_i$ instead of $\delta_i$ (weighted IS rather than ordinary IS). For stability reasons, weights are always normalized by $1/\max_i w_i$ so that they only scale the update downwards.
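The importance-sampling correction can be sketched as follows, assuming NumPy; the helper name is illustrative:

```python
import numpy as np

def is_weights(probs, beta):
    """w_i = (1 / (N * P(i)))^beta, normalized by the max weight."""
    probs = np.asarray(probs, dtype=np.float64)
    w = (1.0 / (len(probs) * probs)) ** beta
    return w / w.max()  # normalization only scales updates downwards

# beta = 1 fully corrects the bias; the weighted TD error w_i * delta_i
# then replaces delta_i in the Q-learning update.
```

Note that uniform probabilities give $w_i = 1$ for every transition at any $\beta$, so the correction vanishes exactly when no prioritization is applied.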

The two variants of prioritization are proportional, where $p_i = |\delta_i| + \epsilon$, and rank-based, where $p_i = \frac{1}{\text{rank}(i)}$ and $\text{rank}(i)$ is the rank of transition $i$ when the replay memory is sorted according to $|\delta_i|$. In the source paper, the rank-based variant used $\alpha = 0.7$, $\beta_0 = 0.5$, and the proportional variant used $\alpha = 0.6$, $\beta_0 = 0.4$.
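The two priority definitions can be sketched as follows (a minimal NumPy sketch; function names are illustrative and omit the sum-tree and binary-heap structures used in practice):

```python
import numpy as np

def proportional_priority(td_errors, eps=1e-6):
    # p_i = |delta_i| + eps; eps keeps every priority strictly positive
    return np.abs(td_errors) + eps

def rank_based_priority(td_errors):
    # p_i = 1 / rank(i), where rank 1 is the largest |delta_i|
    order = np.argsort(-np.abs(td_errors))
    ranks = np.empty(len(order), dtype=np.float64)
    ranks[order] = np.arange(1, len(order) + 1)
    return 1.0 / ranks
```

The rank-based form is less sensitive to outlier TD errors, since a single very large $|\delta_i|$ changes only its rank, not the scale of every sampling probability.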

Papers Using This Method

- CAWR: Corruption-Averse Advantage-Weighted Regression for Robust Policy Optimization (2025-06-18)
- Calibrated Value-Aware Model Learning with Stochastic Environment Models (2025-05-28)
- Online Learning-based Adaptive Beam Switching for 6G Networks: Enhancing Efficiency and Resilience (2025-05-12)
- Graph Based Deep Reinforcement Learning Aided by Transformers for Multi-Agent Cooperation (2025-04-11)
- PER-DPP Sampling Framework and Its Application in Path Planning (2025-03-10)
- OptionZero: Planning with Learned Options (2025-02-23)
- Reinforcement Learning in Strategy-Based and Atari Games: A Review of Google DeepMinds Innovations (2025-02-14)
- Enhancing UAV Path Planning Efficiency Through Accelerated Learning (2025-01-17)
- SALE-Based Offline Reinforcement Learning with Ensemble Q-Networks (2025-01-07)
- Evaluating World Models with LLM for Decision Making (2024-11-13)
- Evaluating Robustness of Reinforcement Learning Algorithms for Autonomous Shipping (2024-11-07)
- Interpreting the Learned Model in MuZero Planning (2024-11-07)
- Beyond The Rainbow: High Performance Deep Reinforcement Learning on a Desktop PC (2024-11-06)
- Enhancing LLM Agents for Code Generation with Possibility and Pass-rate Prioritized Experience Replay (2024-10-16)
- Learning in complex action spaces without policy gradients (2024-10-08)
- Investigating the Interplay of Prioritized Replay and Generalization (2024-07-12)
- ROER: Regularized Optimal Experience Replay (2024-07-04)
- Combining AI Control Systems and Human Decision Support via Robustness and Criticality (2024-07-03)
- Physics-informed Imitative Reinforcement Learning for Real-world Driving (2024-06-18)
- Efficient Monte Carlo Tree Search via On-the-Fly State-Conditioned Action Abstraction (2024-06-02)