Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Retrace

Reinforcement Learning · Introduced 2000 · 31 papers
Source Paper

Description

Retrace is an off-policy Q-value estimation algorithm with guaranteed convergence for any target and behaviour policy pair $(\pi, \beta)$. With off-policy rollouts for TD learning, we must use importance sampling for the update:

$$\Delta Q^{\text{imp}}\left(S_t, A_t\right) = \gamma^{t}\prod_{1\leq\tau\leq t}\frac{\pi\left(A_\tau\mid S_\tau\right)}{\beta\left(A_\tau\mid S_\tau\right)}\delta_t$$

This product of importance ratios can have very high variance, so Retrace modifies $\Delta Q$ by truncating each importance weight at a constant $c$:

$$\Delta Q^{\text{ret}}\left(S_t, A_t\right) = \gamma^{t}\prod_{1\leq\tau\leq t}\min\left(c, \frac{\pi\left(A_\tau\mid S_\tau\right)}{\beta\left(A_\tau\mid S_\tau\right)}\right)\delta_t$$
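The truncated update above can be computed directly from per-step probabilities and TD errors. Below is a minimal NumPy sketch of that formula; the function name and arguments are illustrative, not part of any published implementation, and setting `c=np.inf` recovers the plain importance-sampling update.

```python
import numpy as np

def truncated_is_update(pi_probs, beta_probs, td_errors, gamma=0.99, c=1.0):
    """Sketch of the truncated importance-weight update (illustrative, not official).

    pi_probs, beta_probs: pi(A_tau | S_tau) and beta(A_tau | S_tau) for tau = 1..T
    td_errors: TD errors delta_t for t = 1..T
    Returns Delta Q(S_t, A_t) for each timestep t.
    """
    # Truncate each importance ratio at c: min(c, pi / beta)
    ratios = np.minimum(c, np.asarray(pi_probs) / np.asarray(beta_probs))
    # Running product over 1 <= tau <= t of the truncated ratios
    traces = np.cumprod(ratios)
    # Discount factor gamma^t for t = 1..T
    discounts = gamma ** np.arange(1, len(td_errors) + 1)
    return discounts * traces * np.asarray(td_errors)
```

When the behaviour policy matches the target policy, every ratio is 1 and the update reduces to discounted TD errors, which is a quick sanity check on the truncation logic.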

Papers Using This Method

UI-Evol: Automatic Knowledge Evolving for Computer Use Agents (2025-05-28)
Generative Artificial Intelligence: Evolving Technology, Growing Societal Impact, and Opportunities for Information Systems Research (2025-02-25)
Dynamics of Resource Allocation in O-RANs: An In-depth Exploration of On-Policy and Off-Policy Deep Reinforcement Learning for Real-Time Applications (2024-11-17)
IDRetracor: Towards Visual Forensics Against Malicious Face Swapping (2024-08-13)
Joint Physical-Digital Facial Attack Detection Via Simulating Spoofing Clues (2024-04-12)
Off-policy Distributional Q($λ$): Distributional RL without Importance Sampling (2024-02-08)
Network-thinking to optimize surveillance and control of crop parasites. A review (2023-10-11)
Distributional Estimation of Data Uncertainty for Surveillance Face Anti-spoofing (2023-09-18)
PDVN: A Patch-based Dual-view Network for Face Liveness Detection using Light Field Focal Stack (2023-01-17)
AcceRL: Policy Acceleration Framework for Deep Reinforcement Learning (2022-11-28)
Asynchronous Curriculum Experience Replay: A Deep Reinforcement Learning Approach for UAV Autonomous Motion Control in Unknown Dynamic Environments (2022-07-04)
Safe-FinRL: A Low Bias and Variance Deep Reinforcement Learning Implementation for High-Freq Stock Trading (2022-06-13)
Bias-inducing geometries: an exactly solvable data model with fairness implications (2022-05-31)
Deep Learning with Logical Constraints (2022-05-01)
Marginalized Operators for Off-policy Reinforcement Learning (2022-03-30)
Is Word Error Rate a good evaluation metric for Speech Recognition in Indic Languages? (2022-03-30)
Improving the Efficiency of Off-Policy Reinforcement Learning by Accounting for Past Decisions (2021-12-23)
Learning Reward Machines: A Study in Partially Observable Reinforcement Learning (2021-12-17)
Human Languages with Greater Information Density Increase Communication Speed, but Decrease Conversation Breadth (2021-12-15)
Dynamics of the market states in the space of correlation matrices with applications to financial markets (2021-07-12)