Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


V-trace

Reinforcement Learning · Introduced 2018 · 34 papers
Source Paper

Description

V-trace is an off-policy actor-critic reinforcement learning algorithm that helps tackle the lag between when actions are generated by the actors and when the learner estimates the gradient. Consider a trajectory $\left(x_t, a_t, r_t\right)_{t=s}^{t=s+n}$ generated by the actor following some policy $\mu$. We can define the $n$-step V-trace target for $V\left(x_s\right)$, our value approximation at state $x_s$, as:

$$v_s = V\left(x_s\right) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \left( \prod_{i=s}^{t-1} c_i \right) \delta_t V$$

where $\delta_t V = \rho_t \left( r_t + \gamma V\left(x_{t+1}\right) - V\left(x_t\right) \right)$ is a temporal difference for $V$, and $\rho_t = \min\left(\bar{\rho}, \frac{\pi\left(a_t \mid x_t\right)}{\mu\left(a_t \mid x_t\right)}\right)$ and $c_i = \min\left(\bar{c}, \frac{\pi\left(a_i \mid x_i\right)}{\mu\left(a_i \mid x_i\right)}\right)$ are truncated importance sampling weights. We assume that the truncation levels are such that $\bar{\rho} \geq \bar{c}$.
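The target above can be computed efficiently backwards along the trajectory, using the recursion $v_s = V(x_s) + \delta_s V + \gamma c_s \left(v_{s+1} - V(x_{s+1})\right)$. A minimal NumPy sketch (the function name, array layout, and default truncation levels are illustrative, not from the source):

```python
import numpy as np

def vtrace_targets(values, rewards, rhos, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute n-step V-trace targets v_s for a single trajectory.

    values:  V(x_s) for s = 0..n (length n + 1; the last entry bootstraps the tail)
    rewards: r_t for t = 0..n-1
    rhos:    importance ratios pi(a_t | x_t) / mu(a_t | x_t), length n
    """
    n = len(rewards)
    clipped_rhos = np.minimum(rho_bar, rhos)  # rho_t = min(rho_bar, ratio)
    clipped_cs = np.minimum(c_bar, rhos)      # c_t   = min(c_bar, ratio)
    # Temporal-difference terms: delta_t V = rho_t (r_t + gamma V(x_{t+1}) - V(x_t))
    deltas = clipped_rhos * (rewards + gamma * values[1:] - values[:-1])
    # Backward recursion: v_s - V(x_s) = delta_s V + gamma c_s (v_{s+1} - V(x_{s+1}))
    acc = 0.0
    vs_minus_v = np.zeros(n)
    for t in reversed(range(n)):
        acc = deltas[t] + gamma * clipped_cs[t] * acc
        vs_minus_v[t] = acc
    return values[:-1] + vs_minus_v
```

Note that when the behavior and target policies coincide ($\rho_t = c_t = 1$), the recursion reduces to the ordinary on-policy $n$-step return.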

Papers Using This Method

- World Model Agents with Change-Based Intrinsic Motivation (2025-03-26)
- Vlearn: Off-Policy Learning with Efficient State-Value Function Estimation (2024-03-07)
- Large Language Models Play StarCraft II: Benchmarks and A Chain of Summarization Approach (2023-12-19)
- Cleanba: A Reproducible and Efficient Distributed Reinforcement Learning Platform (2023-09-29)
- A Robust and Opponent-Aware League Training Method for StarCraft II (2023-09-21)
- AlphaStar Unplugged: Large-Scale Offline Reinforcement Learning (2023-08-07)
- Exploring the Promise and Limits of Real-Time Recurrent Learning (2023-05-30)
- DoMo-AC: Doubly Multi-step Off-policy Actor-Critic Algorithm (2023-05-29)
- Sharing Lifelong Reinforcement Learning Knowledge via Modulating Masks (2023-05-18)
- Lifelong Reinforcement Learning with Modulating Masks (2022-12-21)
- AcceRL: Policy Acceleration Framework for Deep Reinforcement Learning (2022-11-28)
- On Efficient Reinforcement Learning for Full-length Game of StarCraft II (2022-09-23)
- EnvPool: A Highly Parallel Reinforcement Learning Environment Execution Engine (2022-06-21)
- Semantic Exploration from Language Abstractions and Pretrained Representations (2022-04-08)
- Off-Policy Correction For Multi-Agent Reinforcement Learning (2021-11-22)
- AI in Human-computer Gaming: Techniques, Challenges and Opportunities (2021-11-15)
- A Distributed Deep Reinforcement Learning Technique for Application Placement in Edge and Fog Computing Environments (2021-10-24)
- MACRPO: Multi-Agent Cooperative Recurrent Policy Optimization (2021-09-02)
- Rethinking of AlphaStar (2021-08-07)
- An Introduction of mini-AlphaStar (2021-04-14)