Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Methods

99 machine learning methods and techniques

All · Audio · Computer Vision · General · Graphs · Natural Language Processing · Reinforcement Learning · Sequential

Q-Learning

Q-Learning is an off-policy temporal difference control algorithm: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)\right]$. The learned action-value function $Q$ directly approximates $q_*$, the optimal action-value function, independent of the policy being followed. Source: Sutton and Barto, Reinforcement Learning, 2nd Edition
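The backup above can be sketched as a single tabular update. This is a minimal illustration, assuming a dict-of-dicts Q table and illustrative `alpha`/`gamma` values not taken from the source.

```python
# A minimal tabular Q-learning sketch; table layout and constants are assumptions.
def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9, terminal=False):
    """One backup: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    bootstrap = 0.0 if terminal else max(Q[s_next].values())
    Q[s][a] += alpha * (r + gamma * bootstrap - Q[s][a])
    return Q[s][a]

# Two-state example: acting in state 0, landing in state 1.
Q = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 2.0}}
q_learning_update(Q, s=0, a=0, r=1.0, s_next=1)  # bootstraps from max(1.0, 2.0)
```

Note the max over next-state actions is what makes the update off-policy: it ignores which action the behaviour policy actually takes next.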

Reinforcement Learning · Introduced 1984 · 1734 papers

PPO

Proximal Policy Optimization

Proximal Policy Optimization, or PPO, is a policy gradient method for reinforcement learning. The motivation was to have an algorithm with the data efficiency and reliable performance of TRPO, while using only first-order optimization. Let $r_t(\theta)$ denote the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$, so $r_t(\theta_{old}) = 1$. TRPO maximizes a "surrogate" objective: $L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[r_t(\theta)\hat{A}_t\right]$, where $CPI$ refers to conservative policy iteration. Without a constraint, maximization of $L^{CPI}$ would lead to an excessively large policy update; hence, PPO modifies the objective to penalize changes to the policy that move $r_t(\theta)$ away from 1: $L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$, where $\epsilon$ is a hyperparameter, say, $\epsilon = 0.2$. The motivation for this objective is as follows. The first term inside the min is $L^{CPI}$. The second term, $\text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t$, modifies the surrogate objective by clipping the probability ratio, which removes the incentive for moving $r_t$ outside of the interval $[1-\epsilon, 1+\epsilon]$. Finally, we take the minimum of the clipped and unclipped objective, so the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective. With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse. One detail to note is that when we apply PPO to a network with shared parameters for the actor and critic, we typically add to the objective function a value-estimation error term and an entropy term to encourage exploration.
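The per-sample clipped surrogate described above can be sketched in a few lines. This is a minimal illustration without autograd; the function name is an assumption, and epsilon = 0.2 follows the value suggested in the text.

```python
# Clipped surrogate sketch: min(r * A, clip(r, 1-eps, 1+eps) * A).
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Returns a pessimistic (lower-bound) version of ratio * advantage."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, a ratio of 1.5 is clipped down to 1.2, removing the incentive to push the ratio higher; with a negative advantage, the min keeps the worse (unclipped or clipped) value, so harmful ratio changes are never ignored.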

Reinforcement Learning · Introduced 2000 · 949 papers

Experience Replay

Experience Replay is a replay memory technique used in reinforcement learning where we store the agent's experiences $e_t = (s_t, a_t, r_t, s_{t+1})$ at each time-step in a data-set $D = e_1, \ldots, e_N$, pooled over many episodes into a replay memory. We then usually sample the memory randomly for a minibatch of experience, and use this to learn off-policy, as with Deep Q-Networks. This tackles the problem of autocorrelation leading to unstable training, by making the problem more like a supervised learning problem. Image Credit: Hands-On Reinforcement Learning with Python, Sudharsan Ravichandiran
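A replay memory like the one described can be sketched with a bounded deque and uniform sampling. The class name, API, and seeded RNG are illustrative assumptions, not from the source.

```python
import random
from collections import deque

# Minimal replay memory sketch; names and capacity are assumptions.
class ReplayBuffer:
    def __init__(self, capacity, seed=0):
        self.memory = deque(maxlen=capacity)  # oldest experiences are evicted
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform random minibatch, as in the basic (non-prioritized) scheme
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

buf = ReplayBuffer(capacity=100)
for t in range(5):
    buf.push(t, 0, 1.0, t + 1, False)
batch = buf.sample(3)
```

Sampling uniformly from pooled episodes is what breaks the temporal correlation between consecutive transitions.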

Reinforcement Learning · Introduced 1993 · 865 papers

DQN

Deep Q-Network

A DQN, or Deep Q-Network, approximates the action-value function in a Q-Learning framework with a neural network. In the Atari games case, the network takes several frames of the game as input and outputs a value for each action. It is usually used in conjunction with Experience Replay, storing the episode steps in a replay memory for off-policy learning, where samples are drawn from the memory at random. Additionally, the Q-Network is usually optimized towards a frozen target network that is periodically updated with the latest weights every $C$ steps (where $C$ is a hyperparameter). The latter makes training more stable by preventing short-term oscillations from a moving target. The former tackles the autocorrelation that would occur from online learning, and having a replay memory makes the problem more like a supervised learning problem.
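The two stabilizers mentioned above, the frozen-target TD backup and the periodic sync every $C$ steps, can be sketched as follows. Plain lists and dicts stand in for networks; all names are illustrative assumptions.

```python
# Frozen-target TD backup sketch for DQN.
def dqn_td_target(reward, q_target_next, gamma=0.99, done=False):
    """y = r + gamma * max_a Q_target(s', a); no bootstrap on terminal s'."""
    return reward if done else reward + gamma * max(q_target_next)

def maybe_sync(step, C, online_params, target_params):
    """Copy online weights into the frozen target network every C steps."""
    if step % C == 0:
        target_params.update(online_params)
    return target_params
```

Because the target network only changes every $C$ steps, the regression target in `dqn_td_target` stays fixed between syncs, which is what damps the oscillations described above.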

Reinforcement Learning · Introduced 2000 · 519 papers

HOC

High-Order Consensuses

Reinforcement Learning · Introduced 2000 · 515 papers

CARLA

CARLA: An Open Urban Driving Simulator

CARLA is an open-source simulator for autonomous driving research. CARLA has been developed from the ground up to support development, training, and validation of autonomous urban driving systems. In addition to open-source code and protocols, CARLA provides open digital assets (urban layouts, buildings, vehicles) that were created for this purpose and can be used freely. Source: Dosovitskiy et al. Image source: Dosovitskiy et al.

Reinforcement Learning · Introduced 2000 · 422 papers

DPO

Direct Preference Optimization

Reinforcement Learning · Introduced 2000 · 409 papers

Counterfactuals

Counterfactual Explanations

Reinforcement Learning · Introduced 2000 · 400 papers

AM

Attention Model

Reinforcement Learning · Introduced 2000 · 274 papers

GA

Genetic Algorithms

Genetic Algorithms are search algorithms that mimic Darwinian biological evolution in order to select and propagate better solutions.

Reinforcement Learning · Introduced 2000 · 259 papers

DDPG

Deep Deterministic Policy Gradient

DDPG, or Deep Deterministic Policy Gradient, is an actor-critic, model-free algorithm based on the deterministic policy gradient that can operate over continuous action spaces. It combines the actor-critic approach with insights from DQNs: in particular, the insights that 1) the network is trained off-policy with samples from a replay buffer to minimize correlations between samples, and 2) the network is trained with a target Q network to give consistent targets during temporal difference backups. DDPG makes use of the same ideas along with batch normalization.

Reinforcement Learning · Introduced 2000 · 218 papers

REINFORCE

REINFORCE is a Monte Carlo variant of a policy gradient algorithm in reinforcement learning. The agent collects samples of an episode using its current policy and uses them to update the policy parameter $\theta$. Since one full trajectory must be completed before a sample can be constructed, the policy is updated offline, after each episode. Image Credit: Tingwu Wang
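The Monte Carlo ingredient of REINFORCE, computing discounted returns for a finished episode before scaling $\nabla_\theta \log \pi(a \mid s; \theta)$, can be sketched as below. The function name and gamma value are assumptions.

```python
# Discounted returns-to-go for one completed episode, computed backwards.
def discounted_returns(rewards, gamma=0.99):
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G       # G_t = r_t + gamma * G_{t+1}
        returns.append(G)
    returns.reverse()
    return returns
```

Each entry `returns[t]` is the sampled return from step `t` onward; in REINFORCE it multiplies the log-probability gradient of the action taken at step `t`.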

Reinforcement Learning · Introduced 1999 · 185 papers

Monte-Carlo Tree Search

Monte-Carlo Tree Search is a planning algorithm that accumulates value estimates obtained from Monte Carlo simulations in order to successively direct simulations towards more highly-rewarded trajectories. We execute MCTS after encountering each new state to select an agent's action for that state: it is executed again to select the action for the next state. Each execution is an iterative process that simulates many trajectories starting from the current state to the terminal state. The core idea is to successively focus multiple simulations starting at the current state by extending the initial portions of trajectories that have received high evaluations from earlier simulations. Source: Sutton and Barto, Reinforcement Learning (2nd Edition) Image Credit: Chaslot et al
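The "successively focus simulations on highly-rewarded trajectories" step is usually implemented with a selection rule such as UCB1 (as in Chaslot et al.); the sketch below assumes that choice, which the text above does not mandate.

```python
import math

# UCB1-style child selection sketch for the MCTS tree policy.
def ucb_select(visit_counts, total_values, c=1.4):
    """Pick the child index maximizing mean value + exploration bonus."""
    total_visits = sum(visit_counts)
    best, best_score = 0, float("-inf")
    for i, (n, w) in enumerate(zip(visit_counts, total_values)):
        # unvisited children are explored first
        score = float("inf") if n == 0 else w / n + c * math.sqrt(math.log(total_visits) / n)
        if score > best_score:
            best, best_score = i, score
    return best
```

The mean-value term exploits what earlier simulations learned, while the bonus term keeps rarely-visited children from being starved.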

Reinforcement Learning · Introduced 2006 · 166 papers

Prioritized Experience Replay

Prioritized Experience Replay is a type of experience replay in reinforcement learning where we more frequently replay transitions with high expected learning progress, as measured by the magnitude of their temporal-difference (TD) error. This prioritization can lead to a loss of diversity, which is alleviated with stochastic prioritization, and it introduces bias, which can be corrected with importance sampling. The stochastic sampling method interpolates between pure greedy prioritization and uniform random sampling. The probability of being sampled is monotonic in a transition's priority, while guaranteeing a non-zero probability even for the lowest-priority transition. Concretely, the probability of sampling transition $i$ is $P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$, where $p_i > 0$ is the priority of transition $i$. The exponent $\alpha$ determines how much prioritization is used, with $\alpha = 0$ corresponding to the uniform case. Prioritized replay introduces bias because it changes this distribution in an uncontrolled fashion, and therefore changes the solution that the estimates will converge to. We can correct this bias by using importance-sampling (IS) weights $w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\beta$, which fully compensate for the non-uniform probabilities if $\beta = 1$. These weights can be folded into the Q-learning update by using $w_i \delta_i$ instead of $\delta_i$ (weighted IS rather than ordinary IS). For stability reasons, we always normalize weights by $1/\max_i w_i$ so that they only scale the update downwards. There are two types of prioritization: proportional, where $p_i = |\delta_i| + \epsilon$, and rank-based, where $p_i = \frac{1}{\text{rank}(i)}$, with $\text{rank}(i)$ the rank of transition $i$ when the replay memory is sorted according to $|\delta_i|$. For the proportional variant, the hyperparameters used were $\alpha = 0.6$, $\beta_0 = 0.4$; for the rank-based variant, $\alpha = 0.7$, $\beta_0 = 0.5$.
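The sampling distribution and the normalized IS weights described above can be sketched directly. This is a minimal illustration; function names are assumptions, and the defaults follow the proportional-variant hyperparameters in the text.

```python
# P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 recovers uniform sampling.
def sampling_probs(priorities, alpha=0.6):
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

# w_i = (1 / (N * P(i)))^beta, normalized by max_i w_i for stability.
def importance_weights(probs, beta=0.4):
    N = len(probs)
    w = [(1.0 / (N * p)) ** beta for p in probs]
    w_max = max(w)
    return [x / w_max for x in w]
```

Setting `alpha=0.0` yields uniform probabilities, and under uniform sampling with `beta=1.0` every weight is exactly 1, i.e. no correction is needed.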

Reinforcement Learning · Introduced 2000 · 138 papers

TD3

Twin Delayed Deep Deterministic

TD3 builds on the DDPG algorithm for reinforcement learning, with a couple of modifications aimed at tackling overestimation bias in the value function. In particular, it utilises clipped double Q-learning, delayed updates of the target and policy networks, and target policy smoothing (similar to a SARSA-based update; a safer update, as it gives higher value to actions that are resistant to perturbations).

Reinforcement Learning · Introduced 2000 · 117 papers

AlphaZero

AlphaZero is a reinforcement learning agent for playing board games such as Go, chess, and shogi.

Reinforcement Learning · Introduced 2000 · 114 papers

Double Q-learning

Double Q-learning is an off-policy reinforcement learning algorithm that utilises double estimation to counteract overestimation problems with traditional Q-learning. The max operator in standard Q-learning and DQN uses the same values both to select and to evaluate an action. This makes it more likely to select overestimated values, resulting in overoptimistic value estimates. To prevent this, we can decouple the selection from the evaluation, which is the idea behind Double Q-learning. The Double Q-learning target can then be written as: $Y_t = R_{t+1} + \gamma Q\left(S_{t+1}, \arg\max_a Q(S_{t+1}, a; \theta_t); \theta'_t\right)$. Here the selection of the action, in the $\arg\max$, is still due to the online weights $\theta_t$, but we use a second set of weights $\theta'_t$ to fairly evaluate the value of this policy. Source: Deep Reinforcement Learning with Double Q-learning
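The decoupling of selection and evaluation can be sketched with plain lists standing in for the two value networks; the function name and example values are assumptions.

```python
# Double Q target sketch: select with online values, evaluate with target values.
def double_q_target(reward, q_online_next, q_target_next, gamma=0.99, done=False):
    if done:
        return reward
    # argmax under the online weights (selection)
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # value under the second set of weights (evaluation)
    return reward + gamma * q_target_next[a_star]
```

With `q_online_next=[1.0, 5.0]` and `q_target_next=[3.0, 2.0]`, the online net selects action 1 but the bootstrap value comes from the target net (2.0), not the over-optimistic `max` over target values (3.0).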

Reinforcement Learning · Introduced 2000 · 112 papers

A2C

A2C, or Advantage Actor Critic, is a synchronous version of the A3C policy gradient method. As an alternative to the asynchronous implementation of A3C, A2C is a synchronous, deterministic implementation that waits for each actor to finish its segment of experience before updating, averaging over all of the actors. This more effectively uses GPUs due to larger batch sizes. Image Credit: OpenAI Baselines

Reinforcement Learning · Introduced 2000 · 82 papers

TRPO

Trust Region Policy Optimization

Trust Region Policy Optimization, or TRPO, is a policy gradient method in reinforcement learning that avoids parameter updates that change the policy too much, via a KL-divergence constraint on the size of the policy update at each iteration. Take the case of off-policy reinforcement learning, where the policy $\beta$ for collecting trajectories on rollout workers is different from the policy $\pi_\theta$ to optimize for. The objective function in an off-policy model measures the total advantage over the state visitation distribution and actions, while the mismatch between the training data distribution and the true policy state distribution is compensated with an importance sampling estimator: $J(\theta) = \mathbb{E}_{s \sim \rho^{\pi_{\theta_{old}}},\, a \sim \beta}\left[\frac{\pi_\theta(a \mid s)}{\beta(a \mid s)} \hat{A}_{\theta_{old}}(s, a)\right]$. When training on policy, theoretically the policy for collecting data is the same as the policy that we want to optimize. However, when rollout workers and optimizers are running in parallel asynchronously, the behavior policy can get stale. TRPO considers this subtle difference: it labels the behavior policy as $\pi_{\theta_{old}}(a \mid s)$, and thus the objective function becomes: $J(\theta) = \mathbb{E}_{s \sim \rho^{\pi_{\theta_{old}}},\, a \sim \pi_{\theta_{old}}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{old}}(a \mid s)} \hat{A}_{\theta_{old}}(s, a)\right]$. TRPO aims to maximize this objective subject to a trust region constraint which enforces that the distance between the old and new policies, measured by KL-divergence, is small enough, within a parameter $\delta$: $\mathbb{E}_{s \sim \rho^{\pi_{\theta_{old}}}}\left[D_{KL}\left(\pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\right)\right] \leq \delta$.

Reinforcement Learning · Introduced 2000 · 81 papers

Soft Actor Critic

Soft Actor Critic, or SAC, is an off-policy actor-critic deep RL algorithm based on the maximum entropy reinforcement learning framework. In this framework, the actor aims to maximize expected reward while also maximizing entropy: that is, to succeed at the task while acting as randomly as possible. Prior deep RL methods based on this framework have been formulated as Q-learning methods. SAC combines off-policy updates with a stable stochastic actor-critic formulation. The SAC objective has a number of advantages. First, the policy is incentivized to explore more widely, while giving up on clearly unpromising avenues. Second, the policy can capture multiple modes of near-optimal behavior; in problem settings where multiple actions seem equally attractive, the policy will commit equal probability mass to those actions. Lastly, the authors present evidence that it improves learning speed over state-of-the-art methods that optimize the conventional RL objective function.

Reinforcement Learning · Introduced 2000 · 58 papers

A3C

A3C, Asynchronous Advantage Actor Critic, is a policy gradient algorithm in reinforcement learning that maintains a policy $\pi(a_t \mid s_t; \theta)$ and an estimate of the value function $V(s_t; \theta_v)$. It operates in the forward view and uses a mix of $n$-step returns to update both the policy and the value function. The policy and the value function are updated after every $t_{max}$ actions or when a terminal state is reached. The update performed by the algorithm can be seen as $\nabla_{\theta'} \log \pi(a_t \mid s_t; \theta') A(s_t, a_t; \theta, \theta_v)$, where $A(s_t, a_t; \theta, \theta_v)$ is an estimate of the advantage function given by $\sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}; \theta_v) - V(s_t; \theta_v)$, where $k$ can vary from state to state and is upper-bounded by $t_{max}$. The critics in A3C learn the value function while multiple actors are trained in parallel and get synced with global parameters every so often. The gradients are accumulated as part of training for stability; this is like parallelized stochastic gradient descent. Note that while the parameters $\theta$ of the policy and $\theta_v$ of the value function are shown as being separate for generality, we always share some of the parameters in practice. We typically use a convolutional neural network that has one softmax output for the policy $\pi(a_t \mid s_t; \theta)$ and one linear output for the value function $V(s_t; \theta_v)$, with all non-output layers shared.

Reinforcement Learning · Introduced 2000 · 57 papers

Sarsa

Sarsa is an on-policy TD control algorithm: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right]$. This update is done after every transition from a nonterminal state $S_t$. If $S_{t+1}$ is terminal, then $Q(S_{t+1}, A_{t+1})$ is defined as zero. To design an on-policy control algorithm using Sarsa, we estimate $q_\pi$ for a behaviour policy $\pi$, and then change $\pi$ towards greediness with respect to $q_\pi$. Source: Sutton and Barto, Reinforcement Learning, 2nd Edition
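The on-policy backup can be sketched like the Q-learning one, except it bootstraps from the action actually taken next, $A_{t+1}$, rather than a max. Table layout and constants are illustrative assumptions.

```python
# One on-policy Sarsa backup; bootstraps from the next action actually taken.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9, terminal=False):
    bootstrap = 0.0 if terminal else Q[s_next][a_next]
    Q[s][a] += alpha * (r + gamma * bootstrap - Q[s][a])
    return Q[s][a]

Q = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 2.0}}
sarsa_update(Q, s=0, a=0, r=1.0, s_next=1, a_next=0)  # uses Q[1][0], not max
```

Compare with Q-learning on the same table: Sarsa bootstraps from Q[1][0] = 1.0 because that is the action the policy chose, while Q-learning would bootstrap from max(1.0, 2.0).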

Reinforcement Learning · Introduced 1994 · 56 papers

REM

Random Ensemble Mixture

Random Ensemble Mixture (REM) is an easy-to-implement extension of DQN inspired by Dropout. The key intuition behind REM is that if one has access to multiple estimates of Q-values, then a weighted combination of the Q-value estimates is also an estimate for Q-values. Accordingly, in each training step, REM randomly combines multiple Q-value estimates and uses this random combination for robust training.

Reinforcement Learning · Introduced 2000 · 48 papers

MuZero

MuZero is a model-based reinforcement learning algorithm. It builds upon AlphaZero's search and search-based policy iteration algorithms, but incorporates a learned model into the training procedure. The main idea of the algorithm is to predict those aspects of the future that are directly relevant for planning. The model receives the observation (e.g. an image of the Go board or the Atari screen) as an input and transforms it into a hidden state. The hidden state is then updated iteratively by a recurrent process that receives the previous hidden state and a hypothetical next action. At every one of these steps the model predicts the policy (e.g. the move to play), value function (e.g. the predicted winner), and immediate reward (e.g. the points scored by playing a move). The model is trained end-to-end, with the sole objective of accurately estimating these three important quantities, so as to match the improved estimates of policy and value generated by search as well as the observed reward. There is no direct constraint or requirement for the hidden state to capture all information necessary to reconstruct the original observation, drastically reducing the amount of information the model has to maintain and predict; nor is there any requirement for the hidden state to match the unknown, true state of the environment; nor any other constraints on the semantics of state. Instead, the hidden states are free to represent state in whatever way is relevant to predicting current and future values and policies. Intuitively, the agent can invent, internally, the rules or dynamics that lead to most accurate planning.

Reinforcement Learning · Introduced 2000 · 46 papers

MADDPG

MADDPG, or Multi-agent DDPG, extends DDPG into a multi-agent policy gradient algorithm where decentralized agents learn a centralized critic based on the observations and actions of all agents. It leads to learned policies that only use local information (i.e. their own observations) at execution time, does not assume a differentiable model of the environment dynamics or any particular structure of the communication method between agents, and is applicable not only to cooperative interaction but also to competitive or mixed settings involving both physical and communicative behavior. The critic is augmented with extra information about the policies of other agents, while the actor only has access to local information. After training is completed, only the local actors are used in the execution phase, acting in a decentralized manner.

Reinforcement Learning · Introduced 2000 · 36 papers

URL

Umbrella Reinforcement Learning

Umbrella Reinforcement Learning is a computationally efficient approach for solving hard nonlinear reinforcement learning (RL) problems. It combines umbrella sampling, from computational physics and chemistry, with optimal control methods, and is realized with neural networks trained via policy gradient. It is reported to outperform available state-of-the-art algorithms, in computational efficiency and implementation universality, on hard RL problems with sparse rewards, state traps, and a lack of terminal states. The approach uses an ensemble of simultaneously acting agents, with a modified reward that includes the ensemble entropy, yielding an optimal exploration-exploitation balance.

Reinforcement Learning · Introduced 2000 · 35 papers

V-trace

V-trace is an off-policy actor-critic reinforcement learning algorithm that helps tackle the lag between when actions are generated by the actors and when the learner estimates the gradient. Consider a trajectory $(x_t, a_t, r_t)$ generated by the actor following some policy $\mu$. The $n$-steps V-trace target for $V(x_s)$, our value approximation at state $x_s$, is defined as: $v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \left(\prod_{i=s}^{t-1} c_i\right) \delta_t V$, where $\delta_t V = \rho_t\left(r_t + \gamma V(x_{t+1}) - V(x_t)\right)$ is a temporal difference for $V$, and $\rho_t = \min\left(\bar{\rho}, \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)}\right)$ and $c_i = \min\left(\bar{c}, \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\right)$ are truncated importance sampling weights. We assume that the truncation levels are such that $\bar{\rho} \geq \bar{c}$.
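The target above can be computed with a single backward pass over a trajectory segment. This is a sketch under stated assumptions: the function name and the default truncation levels $\bar{\rho} = \bar{c} = 1$ are illustrative, and one list of importance ratios is reused for both $\rho_t$ and $c_t$.

```python
# Backward-recursion sketch of the n-steps V-trace target.
def vtrace_targets(values, rewards, is_ratios, bootstrap_value,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    n = len(rewards)
    v_all = list(values) + [bootstrap_value]
    targets = [0.0] * n
    acc = 0.0
    for t in reversed(range(n)):
        rho_t = min(rho_bar, is_ratios[t])  # truncated rho_t
        c_t = min(c_bar, is_ratios[t])      # truncated c_t
        delta_t = rho_t * (rewards[t] + gamma * v_all[t + 1] - v_all[t])
        acc = delta_t + gamma * c_t * acc   # v_s - V(x_s) accumulator
        targets[t] = v_all[t] + acc
    return targets
```

A sanity check on the definition: in the on-policy case (all ratios 1) with $\gamma = 1$, the target reduces to the plain $n$-step return.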

Reinforcement Learning · Introduced 2000 · 34 papers

Retrace

Retrace is an off-policy Q-value estimation algorithm with guaranteed convergence for any target policy $\pi$ and behaviour policy $\beta$. With off-policy rollouts for TD learning, we must use importance sampling for the update, which involves the product $\prod_{1 \leq \tau \leq t} \frac{\pi(a_\tau \mid s_\tau)}{\beta(a_\tau \mid s_\tau)}$. This product term can lead to high variance, so Retrace modifies the update to use importance weights truncated at a constant: $c_\tau = \lambda \min\left(1, \frac{\pi(a_\tau \mid s_\tau)}{\beta(a_\tau \mid s_\tau)}\right)$.

Reinforcement Learning · Introduced 2000 · 31 papers

N-step Returns

$n$-step Returns are used for value function estimation in reinforcement learning. Specifically, for $n$ steps we can write the complete return as: $G_{t:t+n} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})$. We can then write an $n$-step backup, in the style of TD learning, as: $V(s_t) \leftarrow V(s_t) + \alpha\left(G_{t:t+n} - V(s_t)\right)$. Multi-step returns often lead to faster learning with a suitably tuned $n$. Image Credit: Sutton and Barto, Reinforcement Learning
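The return above is a truncated discounted sum plus a bootstrapped value at the cut-off. A minimal sketch, with illustrative argument names:

```python
# G = r_1 + gamma*r_2 + ... + gamma^{n-1}*r_n + gamma^n * V(s_{t+n})
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    G = 0.0
    for i, r in enumerate(rewards):
        G += (gamma ** i) * r
    return G + (gamma ** len(rewards)) * bootstrap_value
```

With `rewards=[1.0, 1.0]`, `bootstrap_value=10.0`, and `gamma=0.5`, this gives 1 + 0.5 + 0.25 * 10 = 4.0.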

Reinforcement Learning · Introduced 2000 · 29 papers

R2D2

Recurrent Replay Distributed DQN

Building on the recent successes of distributed training of RL agents, R2D2 is an RL approach that trains RNN-based RL agents from distributed prioritized experience replay. Using a single network architecture and a fixed set of hyperparameters, Recurrent Replay Distributed DQN quadrupled the previous state of the art on Atari-57 and matches the state of the art on DMLab-30. It was the first agent to exceed human-level performance in 52 of the 57 Atari games.

Reinforcement Learning · Introduced 2000 · 25 papers

Firefly algorithm

Metaheuristic algorithm

Reinforcement Learning · Introduced 2000 · 24 papers

Dueling Network

A Dueling Network is a type of Q-Network that has two streams to separately estimate the (scalar) state-value and the advantages for each action. Both streams share a common convolutional feature learning module and are combined via a special aggregating layer to produce an estimate of the state-action value function $Q$. The last module uses the following mapping: $Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left(A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha)\right)$. This formulation is chosen for identifiability; instead of subtracting the maximum advantage (which would force zero advantage for the chosen action), an average operator is used to increase the stability of the optimization.
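The aggregating layer reduces to a one-line combination of the value scalar and the advantage vector. A minimal sketch, with illustrative names and no network machinery:

```python
# Dueling aggregation: Q(s,a) = V(s) + (A(s,a) - mean_a' A(s,a')).
def dueling_aggregate(state_value, advantages):
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]
```

Subtracting the mean makes the decomposition identifiable: adding a constant to every advantage and subtracting it from the value would otherwise leave Q unchanged.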

Reinforcement Learning · Introduced 2000 · 23 papers

DPG

Deterministic Policy Gradient

Deterministic Policy Gradient, or DPG, is a policy gradient method for reinforcement learning. Instead of the policy function being modeled as a probability distribution, DPG considers and calculates gradients for a deterministic policy $a = \mu_\theta(s)$.

Reinforcement Learning · Introduced 2014 · 20 papers

RLAIF

Reinforcement Learning from AI Feedback

Reinforcement Learning · Introduced 2000 · 19 papers

Go-Explore

Go-Explore is a family of algorithms aiming to tackle two challenges of effective exploration in reinforcement learning: algorithms forgetting how to reach previously visited states ("detachment") and failing to first return to a state before exploring from it ("derailment"). To avoid detachment, Go-Explore builds an archive of the different states it has visited in the environment, thus ensuring that states cannot be forgotten. Starting with an archive containing only the initial state, the archive is built iteratively. In Go-Explore we: (a) Probabilistically select a state from the archive, preferring states associated with promising cells. (b) Return to the selected state, such as by restoring simulator state or by running a goal-conditioned policy. (c) Explore from that state by taking random actions or sampling from a trained policy. (d) Map every state encountered during returning and exploring to a low-dimensional cell representation. (e) Add states that map to new cells to the archive and update other archive entries.

Reinforcement Learning · Introduced 2000 · 16 papers

IQL

Implicit Q-Learning

Reinforcement Learning · Introduced 2000 · 16 papers

Accumulating Eligibility Trace

An Accumulating Eligibility Trace is a type of eligibility trace in which the trace increments accumulatively. For the memory vector $z_t$: $z_{-1} = 0$ and $z_t = \gamma\lambda z_{t-1} + \nabla \hat{v}(S_t, w_t)$.

Reinforcement Learning · Introduced 2000 · 15 papers

TD Lambda

TD($\lambda$) is a generalisation of TD reinforcement learning algorithms that employs an eligibility trace and $\lambda$-weighted returns. The eligibility trace vector is initialized to zero at the beginning of the episode, is incremented on each time step by the value gradient, and then fades away by $\gamma\lambda$: $z_{-1} = 0$, $z_t = \gamma\lambda z_{t-1} + \nabla \hat{v}(S_t, w_t)$. The eligibility trace keeps track of which components of the weight vector contribute to recent state valuations; in the linear case, $\nabla \hat{v}(S_t, w_t) = x(S_t)$, the feature vector. The TD error for state-value prediction is: $\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, w_t) - \hat{v}(S_t, w_t)$. In TD($\lambda$), the weight vector is updated on each step proportionally to the scalar TD error and the vector eligibility trace: $w_{t+1} = w_t + \alpha \delta_t z_t$. Source: Sutton and Barto, Reinforcement Learning, 2nd Edition
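One step of TD($\lambda$) with linear function approximation and an accumulating trace can be sketched as below. All constants and names are illustrative assumptions.

```python
# One TD(lambda) step with linear features and an accumulating trace.
def td_lambda_step(w, z, x, x_next, reward,
                   alpha=0.1, gamma=0.9, lam=0.8, terminal=False):
    v = sum(wi * xi for wi, xi in zip(w, x))
    v_next = 0.0 if terminal else sum(wi * xi for wi, xi in zip(w, x_next))
    delta = reward + gamma * v_next - v                  # TD error
    z = [gamma * lam * zi + xi for zi, xi in zip(z, x)]  # accumulating trace
    w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]
    return w, z

w, z = [0.0, 0.0], [0.0, 0.0]
w, z = td_lambda_step(w, z, x=[1.0, 0.0], x_next=[0.0, 1.0], reward=1.0)
```

Only the first weight moves here, because only the first feature is active and thus eligible; with later steps the fading trace would spread credit backwards over recently visited features.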

Reinforcement Learning · Introduced 2000 · 14 papers

Stochastic Dueling Network

A Stochastic Dueling Network, or SDN, is an architecture for learning a value function $Q^\pi$. The SDN learns both $V^\pi$ and $Q^\pi$ off-policy while maintaining consistency between the two estimates. At each time step it outputs a stochastic estimate $\tilde{Q}^\pi$ of $Q^\pi$ and a deterministic estimate of $V^\pi$.

Reinforcement Learning · Introduced 2000 · 12 papers

ACER

ACER, or Actor Critic with Experience Replay, is an actor-critic deep reinforcement learning agent with experience replay. It can be seen as an off-policy extension of A3C, where the off-policy estimator is made feasible by: - Using Retrace Q-value estimation. - Using truncated importance sampling with bias correction. - Using a trust region policy optimization method. - Using a stochastic dueling network architecture.

Reinforcement Learning · Introduced 2000 · 12 papers

SCST

Self-critical Sequence Training

Reinforcement Learning · Introduced 2000 · 12 papers

Eligibility Trace

An Eligibility Trace is a memory vector $z_t$ that parallels the long-term weight vector $w_t$. The idea is that when a component of $w_t$ participates in producing an estimated value, the corresponding component of $z_t$ is bumped up and then begins to fade away. Learning will then occur in that component of $w_t$ if a nonzero TD error occurs before the trace falls back to zero. The trace-decay parameter $\lambda$ determines the rate at which the trace falls. Intuitively, eligibility traces tackle the credit assignment problem by capturing both a frequency heuristic (states that are visited more often deserve more credit) and a recency heuristic (states that are visited more recently deserve more credit). Source: Sutton and Barto, Reinforcement Learning, 2nd Edition

Reinforcement Learning · Introduced 2000 · 11 papers

Noisy Linear Layer

A Noisy Linear Layer is a linear layer with parametric noise added to the weights. This induced stochasticity can be used in reinforcement learning networks for the agent's policy to aid efficient exploration. The parameters of the noise are learned with gradient descent along with any other remaining network weights. Factorized Gaussian noise is the type of noise usually employed. The noisy linear layer takes the form: $y = (\mu^w + \sigma^w \odot \epsilon^w)x + \mu^b + \sigma^b \odot \epsilon^b$, where $\epsilon^w$ and $\epsilon^b$ are random variables and $\odot$ denotes elementwise multiplication.
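A forward pass with factorized Gaussian noise can be sketched without any deep learning framework. This is a sketch under stated assumptions: the scaling $f(x) = \text{sgn}(x)\sqrt{|x|}$ follows the common factorized scheme, and all shapes and names are illustrative.

```python
import math
import random

# f(x) = sign(x) * sqrt(|x|), applied elementwise to unit Gaussians.
def _f(values):
    return [math.copysign(math.sqrt(abs(v)), v) for v in values]

# One forward pass: y = (mu_w + sigma_w * eps_w) x + mu_b + sigma_b * eps_b,
# with eps_w factorized as eps_out[j] * eps_in[i].
def noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b, rng):
    p, q = len(x), len(mu_b)
    eps_in = _f([rng.gauss(0.0, 1.0) for _ in range(p)])
    eps_out = _f([rng.gauss(0.0, 1.0) for _ in range(q)])
    y = []
    for j in range(q):
        s = sum((mu_w[j][i] + sigma_w[j][i] * eps_out[j] * eps_in[i]) * x[i]
                for i in range(p))
        y.append(s + mu_b[j] + sigma_b[j] * eps_out[j])
    return y
```

With all sigma parameters zero, the layer reduces to a plain linear map, which is how the learned noise scales can anneal exploration away.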

Reinforcement Learning · Introduced 2000 · 11 papers

D4PG

Distributed Distributional DDPG

D4PG, or Distributed Distributional DDPG, is a policy gradient algorithm that extends DDPG. The improvements include a distributional critic update, combined with the use of multiple distributed workers all writing into the same replay table. Of the other, simpler changes, the biggest performance gain came from the use of $N$-step returns. The authors found that the use of prioritized experience replay was less crucial to the overall D4PG algorithm, especially on harder problems.

Reinforcement Learning · Introduced 2000 · 11 papers

Ape-X

Ape-X is a distributed architecture for deep reinforcement learning. The algorithm decouples acting from learning: the actors interact with their own instances of the environment by selecting actions according to a shared neural network, and accumulate the resulting experience in a shared experience replay memory; the learner replays samples of experience and updates the neural network. The architecture relies on prioritized experience replay to focus only on the most significant data generated by the actors. In contrast to Gorila, Ape-X uses a shared, centralized replay memory, and instead of sampling uniformly it samples by priority, so the most useful data is drawn more often. All communication with the centralized replay is batched, increasing efficiency and throughput at the cost of some latency. And by learning off-policy, Ape-X can combine data from many distributed actors: giving the different actors different exploration policies broadens the diversity of the experience they jointly encounter.

Reinforcement Learning · Introduced 2000 · 10 papers

AlphaStar

DeepMind AlphaStar

AlphaStar is a reinforcement learning agent for tackling the game of StarCraft II. It learns a policy $\pi_\theta$ with parameters $\theta$, using a neural network that receives observations as inputs and chooses actions as outputs. Additionally, the policy conditions on a statistic $z$ that summarizes a strategy sampled from human data, such as a build order [1]. AlphaStar uses numerous types of architecture to incorporate different types of features. Observations of player and enemy units are processed with a Transformer. Scatter connections are used to integrate spatial and non-spatial information. The temporal sequence of observations is processed by a core LSTM. Minimap features are extracted with a Residual Network. To manage the combinatorial action space, the agent uses an autoregressive policy and a recurrent pointer network. The agent is trained first with supervised learning from human replays. Parameters are subsequently trained using reinforcement learning that maximizes the win rate against opponents. The RL algorithm is based on a policy-gradient algorithm similar to actor-critic. Updates are performed asynchronously and off-policy. To deal with this, a combination of TD($\lambda$) and V-trace is used, as well as a new self-imitation algorithm (UPGO). Lastly, to address game-theoretic challenges, AlphaStar is trained with league training to approximate a fictitious self-play (FSP) setting, which avoids cycles by computing a best response against a uniform mixture of all previous policies. The league of potential opponents includes a diverse range of agents, including policies from current and previous agents. Image Credit: Yekun Chai. References: 1. Chai, Yekun. "Deciphering AlphaStar on StarCraft II." (2019). https://cyk1337.github.io/notes/2019/07/21/RL/DRL/Decipher-AlphaStar-on-StarCraft-II/ Code Implementation: 1. https://github.com/opendilab/DI-star

Reinforcement Learning · Introduced 2000 · 10 papers

Expected Sarsa

Expected Sarsa is like Q-learning, but instead of taking the maximum over next state-action pairs it uses the expected value, taking into account how likely each action is under the current policy: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t)\right]$. Except for this change to the update rule, the algorithm otherwise follows the scheme of Q-learning. It is more computationally expensive than Sarsa but eliminates the variance due to the random selection of $A_{t+1}$. Source: Sutton and Barto, Reinforcement Learning, 2nd Edition
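The expectation-based target can be sketched as below; it replaces the max of Q-learning with a policy-weighted average over next-state values. Names and constants are illustrative assumptions.

```python
# Expected Sarsa target: r + gamma * sum_a pi(a|s') * Q(s', a).
def expected_sarsa_target(reward, q_next, policy_probs, gamma=0.9, terminal=False):
    if terminal:
        return reward
    expectation = sum(p * q for p, q in zip(policy_probs, q_next))
    return reward + gamma * expectation
```

Under a greedy policy the expectation collapses to the max and this target coincides with Q-learning's; under a stochastic policy it averages out the sampling noise of the next action.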

Reinforcement Learning · Introduced 2000 · 9 papers

Rainbow DQN

Rainbow DQN is an extended DQN that combines several improvements into a single learner. Specifically: - It uses Double Q-Learning to tackle overestimation bias. - It uses Prioritized Experience Replay to prioritize important transitions. - It uses dueling networks. - It uses multi-step learning. - It uses distributional reinforcement learning instead of the expected return. - It uses noisy linear layers for exploration.

Reinforcement Learning · Introduced 2000 · 9 papers

DD-PPO

Decentralized Distributed Proximal Policy Optimization

Decentralized Distributed Proximal Policy Optimization (DD-PPO) is a method for distributed reinforcement learning in resource-intensive simulated environments. DD-PPO is distributed (uses multiple machines), decentralized (lacks a centralized server), and synchronous (no computation is ever stale), making it conceptually simple and easy to implement. As a general abstraction, DD-PPO implements the following: at step $t$, worker $n$ has a copy of the parameters $\theta_n^t$, calculates the gradient $\nabla\theta_n^t$, and updates $\theta$ via $\theta_n^{t+1} = \text{ParamUpdate}\left(\theta_n^t, \text{AllReduce}\left(\nabla\theta_1^t, \ldots, \nabla\theta_N^t\right)\right)$, where ParamUpdate is any first-order optimization technique (e.g. gradient descent) and AllReduce performs a reduction (e.g. mean) over all copies of a variable and returns the result to all workers. Distributed DataParallel scales very well (near-linear scaling up to 32,000 GPUs) and is reasonably simple to implement (all workers synchronously running identical code).

Reinforcement Learning · Introduced 2000 · 8 papers

TLA

Temporally Layered Architecture

Reinforcement Learning · Introduced 2000 · 6 papers
Page 1 of 2