Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Methods

99 machine learning methods and techniques

Filters: All · Audio · Computer Vision · General · Graphs · Natural Language Processing · Reinforcement Learning · Sequential

POMO

Reinforcement Learning · Introduced 2000 · 6 papers

Prioritized Sweeping

Prioritized Sweeping is a technique for model-based reinforcement learning that prioritizes updates according to a measure of urgency and performs the most urgent updates first. A queue is maintained of every state-action pair whose estimated value would change nontrivially if updated, prioritized by the size of the change. When the top pair in the queue is updated, the effect on each of its predecessor pairs is computed; if the effect is greater than some threshold, the pair is inserted into the queue with the new priority. Source: Sutton and Barto, Reinforcement Learning, 2nd Edition
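
The queue-driven loop above can be sketched for a toy deterministic tabular model (the chain MDP, names, threshold, and step size below are illustrative, not from a specific implementation):

```python
import heapq

gamma, alpha, theta = 0.95, 0.5, 1e-4  # discount, step size, priority threshold

# Toy deterministic model of a 3-state chain: model[(s, a)] = (reward, next_state).
model = {
    (0, "right"): (0.0, 1),
    (1, "right"): (0.0, 2),
    (2, "stay"):  (1.0, 2),
}
# Predecessor pairs for each state, derived from the model.
predecessors = {}
for (s, a), (_, s2) in model.items():
    predecessors.setdefault(s2, set()).add((s, a))

Q = {pair: 0.0 for pair in model}

def best_value(state):
    return max(Q[(s, a)] for (s, a) in Q if s == state)

def td_error(s, a):
    r, s2 = model[(s, a)]
    return r + gamma * best_value(s2) - Q[(s, a)]

# Seed the queue with every pair whose value would change nontrivially.
queue = []  # min-heap over negated priorities, so the largest change pops first
for (s, a) in Q:
    p = abs(td_error(s, a))
    if p > theta:
        heapq.heappush(queue, (-p, (s, a)))

while queue:
    _, (s, a) = heapq.heappop(queue)
    Q[(s, a)] += alpha * td_error(s, a)
    # Re-queue each predecessor whose urgency now exceeds the threshold.
    for (s_bar, a_bar) in predecessors.get(s, ()):
        p = abs(td_error(s_bar, a_bar))
        if p > theta:
            heapq.heappush(queue, (-p, (s_bar, a_bar)))
```

On this chain the loop converges to Q(2, stay) ≈ 1/(1 − γ) = 20, with the predecessor pairs inheriting the discounted values.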

Reinforcement Learning · Introduced 2000 · 6 papers

Soft Actor-Critic (Autotuned Temperature)

Soft Actor-Critic (Autotuned Temperature) is a modification of the SAC reinforcement learning algorithm. SAC can be brittle with respect to the temperature hyperparameter. Unlike in conventional reinforcement learning, where the optimal policy is independent of the scaling of the reward function, in maximum-entropy reinforcement learning the scaling factor has to be compensated by the choice of a suitable temperature, and a sub-optimal temperature can drastically degrade performance. To resolve this issue, SAC with autotuned temperature uses an automatic gradient-based temperature tuning method that adjusts the expected entropy over the visited states to match a target value.
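
A minimal closed-loop sketch of the temperature update; the toy actor response below is a hypothetical stand-in for a policy whose entropy grows with the entropy bonus, not the authors' implementation:

```python
import math

# Target entropy: a common heuristic is -dim(action_space); here dim = 1.
target_entropy = -1.0
log_alpha = 0.0   # optimize log(alpha) so that alpha stays positive
lr = 0.5

for _ in range(500):
    alpha = math.exp(log_alpha)
    # Hypothetical actor response: entropy grows with the temperature alpha.
    policy_entropy = math.log(alpha) + 1.0
    log_pi = -policy_entropy               # E[log pi(a|s)] = -H(pi)
    # Temperature loss J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)];
    # its gradient w.r.t. log_alpha in this scalar case:
    grad = -alpha * (log_pi + target_entropy)
    log_alpha -= lr * grad                 # gradient descent on J

alpha = math.exp(log_alpha)
```

The iteration settles where the policy entropy matches the target; in this toy model that fixed point is alpha = e^(-2).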

Reinforcement Learning · Introduced 2000 · 6 papers

APPO

Asynchronous Proximal Policy Optimization

Reinforcement Learning · Introduced 2000 · 4 papers

DouZero

DouZero is an AI system for the card game DouDizhu that enhances traditional Monte-Carlo methods with deep neural networks, action encoding, and parallel actors. The Q-network of DouZero consists of an LSTM to encode historical actions and a six-layer MLP with a hidden dimension of 512. The network predicts a value for a given state-action pair from the concatenated representation of action and state.

Reinforcement Learning · Introduced 2000 · 4 papers

MDPO

Mirror Descent Policy Optimization

Mirror Descent Policy Optimization (MDPO) is a policy gradient algorithm based on the idea of iteratively solving a trust-region problem that minimizes a sum of two terms: a linearization of the standard RL objective function and a proximity term that restricts two consecutive updates to be close to each other. It is based on Mirror Descent, which is a general trust region method that attempts to keep consecutive iterates close to each other.
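
The mirror descent building block can be illustrated on a single-state policy over three actions: maximizing a linearized objective under a KL proximity term has a closed-form multiplicative-weights update. This is a hedged sketch of that one component, not the full MDPO algorithm:

```python
import math

def mirror_descent_step(pi, q, step_size):
    """One trust-region step over the probability simplex:
    argmax_p <p, q> - (1/step_size) * KL(p || pi)  =>  p_i ∝ pi_i * exp(step_size * q_i)."""
    logits = [math.log(p) + step_size * qi for p, qi in zip(pi, q)]
    m = max(logits)                      # stabilize the softmax
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

pi = [1 / 3, 1 / 3, 1 / 3]               # uniform policy at a single state
q = [1.0, 0.0, -1.0]                     # (fixed) action values at that state
for _ in range(50):
    pi = mirror_descent_step(pi, q, step_size=0.1)
```

Because each update multiplies the previous iterate by a small exponential tilt, consecutive policies stay close to each other while probability mass gradually concentrates on the best action.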

Reinforcement Learning · Introduced 2000 · 4 papers

TD-Gammon

TD-Gammon is a game-learning architecture for playing backgammon. It involves the use of a learning algorithm and a feedforward neural network. Credit: Temporal Difference Learning and TD-Gammon

Reinforcement Learning · Introduced 1992 · 4 papers

IQ-Learn

Inverse Q-Learning

Inverse Q-Learning (IQ-Learn) is a simple, stable, and data-efficient framework for imitation learning (IL) that directly learns soft Q-functions from expert data. IQ-Learn enables non-adversarial imitation learning, working in both offline and online IL settings. It is performant even with very sparse expert data, and scales to complex image-based environments, surpassing prior methods by more than 3x. It is very simple to implement, requiring about 15 lines of code on top of existing RL methods. Source: IQ-Learn: Inverse soft Q-Learning for Imitation

Reinforcement Learning · Introduced 2000 · 4 papers

QPT

Quantum Process Tomography

Reinforcement Learning · Introduced 2000 · 3 papers

Ape-X DQN

Ape-X DQN is a variant of a DQN with some components of Rainbow-DQN that utilizes distributed prioritized experience replay through the Ape-X architecture.

Reinforcement Learning · Introduced 2000 · 3 papers

CLIPort

CLIPort is a language-conditioned imitation-learning agent that combines the broad semantic understanding (what) of CLIP [1] with the spatial precision (where) of Transporter [2].

Reinforcement Learning · Introduced 2000 · 3 papers

GTrXL

Gated Transformer-XL

Gated Transformer-XL, or GTrXL, is a Transformer-based architecture for reinforcement learning. It introduces architectural modifications that improve the stability and learning speed of the original Transformer and XL variant. Changes include:

- Placing the layer normalization on only the input stream of the submodules. A key benefit of this reordering is that it enables an identity map from the input of the transformer at the first layer to the output of the transformer after the last layer. This is in contrast to the canonical transformer, where a series of layer normalization operations non-linearly transform the state encoding.
- Replacing residual connections with gating layers. The authors' experiments found that GRUs were the most effective form of gating.

Reinforcement Learning · Introduced 2000 · 3 papers

DDQL

Double Deep Q-Learning

Reinforcement Learning · Introduced 2000 · 3 papers

PWIL

Primal Wasserstein Imitation Learning

Primal Wasserstein Imitation Learning, or PWIL, is an imitation learning method built on the primal form of the Wasserstein distance between the expert and agent state-action distributions. The reward function is derived offline, as opposed to recent adversarial IL algorithms that learn a reward function through interactions with the environment, and it requires little fine-tuning.

Reinforcement Learning · Introduced 2000 · 3 papers

Sym-NCO

Reinforcement Learning · Introduced 2000 · 2 papers

SEED RL

SEED (Scalable, Efficient, Deep-RL) is a scalable reinforcement learning agent. It utilizes an architecture that features centralized inference and an optimized communication layer. SEED adopts two state-of-the-art distributed algorithms, IMPALA/V-trace (policy gradients) and R2D2 (Q-learning).

Reinforcement Learning · Introduced 2000 · 2 papers

TorchBeast

TorchBeast is a platform for reinforcement learning (RL) research in PyTorch. It implements a version of the popular IMPALA algorithm for fast, asynchronous, parallel training of RL agents.

Reinforcement Learning · Introduced 2000 · 2 papers

Policy Similarity Metric

Policy Similarity Metric, or PSM, is a similarity metric for measuring behavioral similarity between states in reinforcement learning. It assigns high similarity to states for which the optimal policies in those states as well as in future states are similar. PSM is reward-agnostic, making it more robust for generalization compared to approaches that rely on reward information.

Reinforcement Learning · Introduced 2000 · 2 papers

myGym

MyGym: Modular Toolkit for Visuomotor Robotic Tasks

We introduce myGym, a toolkit suitable for fast prototyping of neural networks in the area of robotic manipulation and navigation. Our toolbox is fully modular, enabling users to train their algorithms on different robots, environments, and tasks. We also include pretrained neural network modules for real-time vision that allow training visuomotor tasks with sim2real transfer. The visual modules can easily be retrained using the dataset generation pipeline with domain augmentation and randomization. Moreover, myGym provides automatic evaluation methods and baselines that help users directly compare their trained models with state-of-the-art algorithms. We additionally present a novel metric, called learnability, to compare the general learning capability of algorithms in different settings, where the complexity of the environment, robot, and task is systematically manipulated. The learnability score tracks differences between the performance of algorithms in increasingly challenging setups, and thus allows users to compare different models in a more systematic fashion. The code is accessible at https://github.com/incognite-lab/myGym

Reinforcement Learning · Introduced 2000 · 2 papers

ACKTR

ACKTR, or Actor Critic with Kronecker-factored Trust Region, is an actor-critic method for reinforcement learning that applies trust region optimization using a recently proposed Kronecker-factored approximation to the curvature. The method extends the framework of natural policy gradient and optimizes both the actor and the critic using Kronecker-factored approximate curvature (K-FAC) with trust region.

Reinforcement Learning · Introduced 2000 · 2 papers

NoisyNet-DQN

NoisyNet-DQN is a modification of a DQN that utilises noisy linear layers for exploration instead of ε-greedy exploration as in the original DQN formulation.
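
A hedged sketch of a factored-Gaussian noisy linear layer in plain Python; the class name follows common NoisyNet conventions, and a single shared sigma is used for brevity where the paper learns one per weight:

```python
import math, random

random.seed(0)

def f(x):
    # noise-shaping function from the NoisyNet paper: f(x) = sgn(x) * sqrt(|x|)
    return math.copysign(math.sqrt(abs(x)), x)

class NoisyLinear:
    """Noisy linear layer with factored Gaussian noise:
    w = mu_w + sigma * (f(eps_out) f(eps_in)^T), b = mu_b + sigma * f(eps_out)."""
    def __init__(self, n_in, n_out, sigma0=0.5):
        bound = 1.0 / math.sqrt(n_in)
        self.mu_w = [[random.uniform(-bound, bound) for _ in range(n_in)]
                     for _ in range(n_out)]
        self.mu_b = [random.uniform(-bound, bound) for _ in range(n_out)]
        # one shared sigma for simplicity; the paper learns a sigma per weight
        self.sigma = sigma0 / math.sqrt(n_in)
        self.n_in, self.n_out = n_in, n_out
        self.reset_noise()

    def reset_noise(self):
        eps_in = [f(random.gauss(0.0, 1.0)) for _ in range(self.n_in)]
        eps_out = [f(random.gauss(0.0, 1.0)) for _ in range(self.n_out)]
        self.eps_w = [[eo * ei for ei in eps_in] for eo in eps_out]
        self.eps_b = eps_out

    def __call__(self, x):
        out = []
        for j in range(self.n_out):
            s = self.mu_b[j] + self.sigma * self.eps_b[j]
            for i in range(self.n_in):
                s += (self.mu_w[j][i] + self.sigma * self.eps_w[j][i]) * x[i]
            out.append(s)
        return out

layer = NoisyLinear(4, 2)
x = [1.0, 0.5, -0.5, 2.0]
y1 = layer(x)
layer.reset_noise()   # exploration comes from resampling the noise
y2 = layer(x)
```

The same input produces different outputs after `reset_noise()`, which is what replaces ε-greedy action dithering.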

Reinforcement Learning · Introduced 2000 · 2 papers

Robust Predictable Control

Robust Predictable Control, or RPC, is an RL algorithm for learning policies that uses only a few bits of information. RPC brings together ideas from information bottlenecks, model-based RL, and bits-back coding. The main idea of RPC is that if the agent can accurately predict the future, then the agent will not need to observe as many bits from future observations. Precisely, the agent will learn a latent dynamics model that predicts the next representation using the current representation and action. In addition to predicting the future, the agent can also decrease the number of bits by changing its behavior. States where the dynamics are hard to predict will require more bits, so the agent will prefer visiting states where its learned model can accurately predict the next state.

Reinforcement Learning · Introduced 2000 · 1 paper

Bayesian REX

Bayesian Reward Extrapolation

Bayesian Reward Extrapolation is a Bayesian reward learning algorithm that scales to high-dimensional imitation learning problems by pre-training a low-dimensional feature encoding via self-supervised tasks and then leveraging preferences over demonstrations to perform fast Bayesian inference.

Reinforcement Learning · Introduced 2000 · 1 paper

Flow Normalization

Reinforcement Learning · Introduced 2000 · 1 paper

CoBERL

Contrastive BERT

Contrastive BERT (CoBERL) is a reinforcement learning agent that combines a new contrastive loss with a hybrid LSTM-transformer architecture to tackle the challenge of improving data efficiency for RL. It uses bidirectional masked prediction in combination with a generalization of recent contrastive methods to learn better representations for transformers in RL, without the need for hand-engineered data augmentations. Architecturally, a residual network encodes observations into embeddings, which are fed through a causally masked GTrXL transformer; the transformer computes the predicted masked inputs and passes them, together with the embeddings, to a learnt gate. The output of the gate is passed through a single LSTM layer to produce the values used for computing the RL loss. A contrastive loss is computed using the predicted masked inputs as targets; for this loss, the causal mask of the transformer is not used.

Reinforcement Learning · Introduced 2000 · 1 paper

NoisyNet-A3C

NoisyNet-A3C is a modification of A3C that utilises noisy linear layers for exploration instead of ε-greedy exploration as in the original DQN formulation.

Reinforcement Learning · Introduced 2000 · 1 paper

DeepCubeAI

DeepCubeA + Imagination

DeepCubeAI is an algorithm that learns a discrete world model and employs deep reinforcement learning methods to learn a heuristic function that generalizes over start and goal states. The learned model and the learned heuristic function are then integrated with heuristic search, such as Q* search, to solve sequential decision-making problems. [[paper]](https://rlj.cs.umass.edu/2024/papers/Paper225.html) [[Code]](https://github.com/misaghsoltani/DeepCubeAI) [[PyPI]](https://pypi.org/project/deepcubeai/) [[Slides]](https://cse.sc.edu/foresta/assets/files/Slides--LearningDiscreteWorldModelsforHeuristicSearch.pdf) [[Poster]](https://cse.sc.edu/foresta/assets/files/Poster--LearningDiscreteWorldModelsforHeuristicSearch.pdf)

Key contributions. DeepCubeAI comprises three components:

1. Discrete world model: learns a world model that represents states in a discrete latent space. This tackles two challenges, model degradation and state re-identification: prediction errors less than 0.5 are corrected by rounding, and states are re-identified by comparing binary vectors.
2. Generalizable heuristic function: utilizes a Deep Q-Network (DQN) and hindsight experience replay (HER) to learn a heuristic function that generalizes over start and goal states.
3. Optimized search: integrates the learned model and the learned heuristic function with heuristic search. It uses Q* search, a variant of A* search optimized for DQNs, which enables faster and more memory-efficient planning.

Main results: accurate reconstruction of ground-truth images after thousands of timesteps; 100% success on Rubik's Cube (canonical goal), Sokoban, IceSlider, and DigitJump; 99.9% success on Rubik's Cube with reversed start/goal states; and significant improvement in solving complex planning problems and generalizing to unseen goals.

Reinforcement Learning · Introduced 2000 · 1 paper

Pixel Tracking

Reinforcement Learning · Introduced 2000 · 1 paper

FORK

Forward-Looking Actor

FORK, or Forward-Looking Actor, is a type of actor for actor-critic algorithms. In particular, FORK includes a neural network that forecasts the next state given the current state and action (the system network), and a neural network that forecasts the reward given a state-action pair (the reward network). With the system network and reward network, FORK can forecast the next state and take the value of the next state into account when improving the policy.
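
A toy sketch of the one-step lookahead term this enables in the actor objective; all four functions below are hypothetical scalar stand-ins for the learned networks, not FORK's actual models:

```python
gamma = 0.99

# Scalar stand-ins for the learned networks (hypothetical placeholders):
system_net = lambda s, a: 0.9 * s + 0.1 * a          # forecasts the next state
reward_net = lambda s, a: -(s ** 2) - 0.01 * a ** 2  # forecasts the reward
critic = lambda s, a: -(s ** 2)                      # learned Q approximation
actor = lambda s: -0.5 * s                           # current deterministic policy

def fork_lookahead_objective(s):
    """One-step lookahead term used when improving the policy:
    r(s, a) + gamma * Q(s', a'), with s' forecast by the system network."""
    a = actor(s)
    s_next = system_net(s, a)
    a_next = actor(s_next)
    return reward_net(s, a) + gamma * critic(s_next, a_next)

value = fork_lookahead_objective(2.0)
```

Because `s_next` comes from the forecasting network rather than the environment, this term can be evaluated (and differentiated through, in a real implementation) without extra environment interaction.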

Reinforcement Learning · Introduced 2000 · 1 paper

TayPO

Taylor Expansion Policy Optimization

TayPO, or Taylor Expansion Policy Optimization, refers to a set of algorithms that apply K-th order Taylor expansions for policy optimization. This generalizes prior work, including TRPO as a special case, and can be thought of as unifying ideas from trust-region policy optimization and off-policy corrections. Taylor expansions share high-level similarities with both trust-region policy search and off-policy corrections. To get high-level intuition for these similarities, consider a simple 1D example of Taylor expansions. Given a sufficiently smooth real-valued function f on the real line, the K-th order Taylor expansion of f at x_0 is f_K(x) = f(x_0) + sum_{k=1}^{K} f^(k)(x_0) (x - x_0)^k / k!, where f^(k)(x_0) are the k-th order derivatives of f at x_0. First, a common feature shared by Taylor expansions and trust-region policy search is the inherent notion of a trust-region constraint: for convergence to take place, a constraint of the form |x - x_0| < R is required, with R the radius of convergence. Second, when using the truncation f_K as an approximation to the original function f, Taylor expansions satisfy the requirement of off-policy evaluation: evaluate the target policy with behavior data. Indeed, to evaluate the truncation f_K(x) at any x (target policy), we only require the behavior-policy "data" at x_0 (i.e., the derivatives f^(k)(x_0)).
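
The 1D truncation can be checked numerically (an illustrative sketch, using f = exp so all derivatives at 0 equal 1):

```python
import math

def taylor_truncation(derivs, x0, x):
    """f_K(x) = sum_{k=0}^{K} f^(k)(x0) * (x - x0)**k / k!, with derivs[k] = f^(k)(x0)."""
    return sum(d * (x - x0) ** k / math.factorial(k) for k, d in enumerate(derivs))

# Example: f = exp, whose k-th derivative at x0 = 0 is exp(0) = 1 for every k.
x0, x = 0.0, 0.5
exact = math.exp(x)
approx_k2 = taylor_truncation([1.0] * 3, x0, x)   # K = 2
approx_k5 = taylor_truncation([1.0] * 6, x0, x)   # K = 5
```

Raising K tightens the approximation inside the radius of convergence, mirroring how higher-order corrections tighten the surrogate objective.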

Reinforcement Learning · Introduced 2000 · 1 paper

MushroomRL

MushroomRL is an open-source Python library developed to simplify the process of implementing and running reinforcement learning (RL) experiments. The architecture of MushroomRL is built in such a way that every component of an RL problem is already provided, so most of the time users can focus solely on the implementation of their own algorithms and experiments. MushroomRL comes with a strongly modular architecture that makes it easy to understand how each component is structured and how it interacts with the others; moreover, it provides an exhaustive list of RL methodologies.

Reinforcement Learning · Introduced 2000 · 1 paper

KOVA

Kalman Optimization for Value Approximation

Kalman Optimization for Value Approximation, or KOVA, is a general framework for addressing uncertainties while approximating value-based functions in deep RL domains. KOVA minimizes a regularized objective function that accounts for both parameter and noisy-return uncertainties. It is feasible with non-linear approximation functions such as DNNs and can estimate the value in both on-policy and off-policy settings. It can be incorporated as a policy-evaluation component in policy optimization algorithms.

Reinforcement Learning · Introduced 2000 · 1 paper

NoisyNet-Dueling

NoisyNet-Dueling is a modification of a Dueling Network that utilises noisy linear layers for exploration instead of ε-greedy exploration as in the original Dueling formulation.

Reinforcement Learning · Introduced 2000 · 1 paper

GradientDICE

GradientDICE is a density ratio learning method for estimating the density ratio between the state distribution of the target policy and the sampling distribution in off-policy reinforcement learning. It optimizes a different objective from GenDICE by using the Perron-Frobenius theorem and eliminating GenDICE’s use of divergence, such that nonlinearity in parameterization is not necessary for GradientDICE, which is provably convergent under linear function approximation.

Reinforcement Learning · Introduced 2000 · 1 paper

gSDE

Generalized State-Dependent Exploration

Generalized State-Dependent Exploration, or gSDE, is an exploration method for reinforcement learning that uses more general features and re-samples the noise periodically. State-Dependent Exploration (SDE) is an intermediate solution for exploration that consists in adding noise as a function of the state s to the deterministic action mu(s). At the beginning of an episode, the parameters theta_eps of that exploration function are drawn from a Gaussian distribution, and the resulting action is a = mu(s) + eps(s; theta_eps). This episode-based exploration is smoother and more consistent than unstructured step-based exploration: during one episode, instead of oscillating around a mean value, the action for a given state will be the same. In the case of a linear exploration function eps(s; theta_eps) = theta_eps s, by operations on Gaussian distributions, Rückstieß et al. show that the action element a_j is normally distributed, a_j ~ N(mu_j(s), sigma_hat_j^2), where sigma_hat_j^2 = sum_i (sigma_ij s_i)^2. Because the policy distribution is known, the derivative of the log-likelihood with respect to the variance parameters sigma can be obtained and plugged into the likelihood-ratio gradient estimator, which allows sigma to be adapted during training. SDE is therefore compatible with standard policy gradient methods, while addressing most shortcomings of unstructured exploration. For gSDE, two improvements are suggested: 1. sample the parameters theta_eps of the exploration function every n steps instead of every episode; 2. instead of the state s, use any features, such as the policy features (the last layer before the deterministic output), as input to the noise function.
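
A minimal sketch of gSDE-style sampling, assuming a linear noise function over arbitrary features (all names and dimensions below are illustrative):

```python
import random

random.seed(1)

def sample_exploration_matrix(n_feat, n_act, sigma=0.3):
    # theta_eps ~ N(0, sigma^2), drawn once and kept for the next n steps
    return [[random.gauss(0.0, sigma) for _ in range(n_act)] for _ in range(n_feat)]

def gsde_action(mu, features, theta_eps):
    """a = mu(s) + eps(s; theta_eps), with linear noise eps_j = sum_i theta_ij * z_i."""
    noise = [sum(theta_eps[i][j] * features[i] for i in range(len(features)))
             for j in range(len(mu))]
    return [m + n for m, n in zip(mu, noise)]

mu = [0.2, -0.1]        # deterministic policy output for the current state
z = [1.0, 0.5, -0.25]   # features, e.g. the last policy layer

theta = sample_exploration_matrix(len(z), len(mu))
a1 = gsde_action(mu, z, theta)
a2 = gsde_action(mu, z, theta)   # same state, same noise params: identical action
theta = sample_exploration_matrix(len(z), len(mu))   # resampled every n steps
a3 = gsde_action(mu, z, theta)
```

Holding `theta` fixed makes exploration consistent within the sampling window, while periodic resampling keeps it from collapsing to a single perturbation for the whole run.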

Reinforcement Learning · Introduced 2000 · 1 paper

Protagonist Antagonist Induced Regret Environment Design

Protagonist Antagonist Induced Regret Environment Design, or PAIRED, is an adversarial method for approximate minimax regret to generate environments for reinforcement learning. It introduces an antagonist which is allied with the environment-generating adversary. The primary agent we are trying to train is the protagonist. The environment adversary's goal is to design environments in which the antagonist achieves high reward and the protagonist receives low reward. If the adversary generates unsolvable environments, the antagonist and protagonist would perform the same and the adversary would get a score of zero, but if the adversary finds environments the antagonist solves and the protagonist does not solve, the adversary achieves a positive score. Thus, the environment adversary is incentivized to create challenging but feasible environments, in which the antagonist can outperform the protagonist. Moreover, as the protagonist learns to solve the simple environments, the adversary must generate more complex environments to make the protagonist fail, increasing the complexity of the generated tasks and leading to automatic curriculum generation.

Reinforcement Learning · Introduced 2000 · 1 paper

TbUM

Table Uniformity Method

The table uniformity approach is proposed to solve the problem of CSV dialect determination. The method is based on a consistency measurement over a table Γδ, returned by parsing a CSV file with a candidate dialect ρδ, using the dispersion of records along with the inference of raw data types from fields.

Reinforcement Learning · Introduced 2000 · 1 paper

CILO

Continuous Imitation Learning from Observation

Reinforcement Learning · Introduced 2000 · 1 paper

IGSA

Improved Gravitational Search algorithm

Metaheuristic algorithm

Reinforcement Learning · Introduced 2000 · 1 paper

Ape-X DPG

Ape-X DPG combines DDPG with distributed prioritized experience replay through the Ape-X architecture.

Reinforcement Learning · Introduced 2000 · 1 paper

4D A*

Four-dimensional A-star

The aim of 4D A* is to find the shortest path between two four-dimensional (4D) nodes of a 4D search space (a starting node and a target node), as long as a path exists. It achieves both optimality and completeness: optimality because the returned path is the shortest possible, and completeness because if a solution exists the algorithm is guaranteed to find it.
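
A sketch of A* on a small bounded 4D integer lattice with unit moves and a Manhattan-distance heuristic (the grid bounds and function names are illustrative, not from the method's implementation):

```python
import heapq

def astar_4d(start, goal, blocked=frozenset(), bound=5):
    """A* on a bounded 4D integer lattice with unit moves and a Manhattan heuristic."""
    def h(p):
        return sum(abs(a - b) for a, b in zip(p, goal))

    def neighbors(p):
        for dim in range(4):
            for step in (-1, 1):
                q = list(p)
                q[dim] += step
                q = tuple(q)
                if q not in blocked and all(0 <= c < bound for c in q):
                    yield q

    open_heap = [(h(start), 0, start)]
    g = {start: 0}          # best known cost-to-come
    came = {}
    while open_heap:
        _, gs, cur = heapq.heappop(open_heap)
        if cur == goal:                       # optimal: h is admissible
            path = [cur]
            while cur in came:
                cur = came[cur]
                path.append(cur)
            return path[::-1]
        if gs > g.get(cur, float("inf")):     # skip stale heap entries
            continue
        for nb in neighbors(cur):
            ng = gs + 1
            if ng < g.get(nb, float("inf")):
                g[nb] = ng
                came[nb] = cur
                heapq.heappush(open_heap, (ng + h(nb), ng, nb))
    return None                               # complete: None only if no path exists

path = astar_4d((0, 0, 0, 0), (2, 1, 0, 1))
```

With an admissible heuristic the first time the goal is popped its path is shortest, and exhausting the open set without reaching the goal proves no path exists.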

Reinforcement Learning · Introduced 2000 · 1 paper

Blue River Controls

Blue River Controls is a tool that allows users to train and test reinforcement learning algorithms on real-world hardware. It features a simple OpenAI Gym-based interface that works directly on both simulation and hardware.

Reinforcement Learning · Introduced 2000 · 1 paper

True Online TD Lambda

True Online TD(λ) seeks to approximate the ideal online λ-return algorithm: it inverts this ideal forward-view algorithm to produce an efficient backward-view algorithm using eligibility traces. It uses dutch traces rather than accumulating traces. Source: van Seijen and Sutton
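
A sketch of the true online TD(λ) update with dutch traces for linear value functions, following the Sutton and Barto pseudocode, on a toy two-state chain (the chain MDP and hyperparameters are illustrative):

```python
gamma, lam, alpha = 0.9, 0.8, 0.1
n = 2
w = [0.0] * n               # linear weights; one-hot features make this tabular

def x_vec(s):               # one-hot features; the terminal state is the zero vector
    v = [0.0] * n
    if s is not None:
        v[s] = 1.0
    return v

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Chain MDP: 0 -> 1 -> terminal, reward 1 per step. True values: v(0)=1.9, v(1)=1.
for _ in range(500):
    x = x_vec(0)
    z = [0.0] * n           # dutch eligibility trace
    v_old = 0.0
    for s, r, s2 in [(0, 1.0, 1), (1, 1.0, None)]:
        x2 = x_vec(s2)
        v, v2 = dot(w, x), dot(w, x2)
        delta = r + gamma * v2 - v
        zx = dot(z, x)
        # dutch trace: z <- gamma*lam*z + (1 - alpha*gamma*lam*z.x) x
        z = [gamma * lam * zi + (1.0 - alpha * gamma * lam * zx) * xi
             for zi, xi in zip(z, x)]
        # true online weight update, with the extra v - v_old correction terms
        w = [wi + alpha * (delta + v - v_old) * zi - alpha * (v - v_old) * xi
             for wi, zi, xi in zip(w, z, x)]
        v_old = v2
        x = x2
```

The extra `v - v_old` terms are what distinguish this backward view from ordinary TD(λ); they make it match the online λ-return algorithm exactly in the linear case.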

Reinforcement Learning · Introduced 2000

Replacing Eligibility Trace

In a replacing eligibility trace, each time a state is revisited its trace is reset to 1, regardless of the presence of a prior trace. For the memory vector z_t, the component of the revisited state is set to 1, while all other components decay by γλ. Replacing traces can be seen as a crude approximation to dutch traces, which have largely superseded them: dutch traces perform better than replacing traces and have a clearer theoretical basis. Accumulating traces remain of interest for nonlinear function approximation, where dutch traces are not available. Source: Sutton and Barto, Reinforcement Learning, 2nd Edition
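
A quick sketch contrasting replacing and accumulating traces on a toy visit sequence (illustrative values):

```python
gamma, lam = 0.9, 0.8

def step_trace(z, s, replacing):
    z = [gamma * lam * zi for zi in z]   # decay every component
    if replacing:
        z[s] = 1.0                       # replacing: reset the revisited state to 1
    else:
        z[s] += 1.0                      # accumulating: add 1 on every visit
    return z

z_rep = [0.0, 0.0, 0.0]
z_acc = [0.0, 0.0, 0.0]
for s in [0, 0, 1, 0]:                   # state 0 is revisited repeatedly
    z_rep = step_trace(z_rep, s, replacing=True)
    z_acc = step_trace(z_acc, s, replacing=False)
```

On revisits the replacing trace is capped at 1, while the accumulating trace grows past 1; this is the difference that can make accumulating traces over-weight frequently revisited states.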

Reinforcement Learning · Introduced 2000

Sarsa Lambda

Sarsa(λ) extends eligibility traces to action-value methods. It has the same update rule as TD(λ), but uses the action-value form of the TD error, delta_t = R_{t+1} + gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t), and the action-value form of the eligibility trace, z_t = gamma lambda z_{t-1} + grad q_hat(S_t, A_t, w). Source: Sutton and Barto, Reinforcement Learning, 2nd Edition
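
A tabular Sarsa(λ) sketch with accumulating traces on a toy chain (the environment and hyperparameters below are illustrative):

```python
import random

random.seed(0)

gamma, lam, alpha, eps = 0.9, 0.9, 0.2, 0.1
n_states, n_actions = 3, 2
Q = [[0.0] * n_actions for _ in range(n_states)]

def env_step(s, a):
    # toy chain: action 1 moves right (reward 0); leaving state 1 pays 1 and ends
    if a == 1:
        if s + 1 == n_states - 1:
            return 1.0, None            # terminal
        return 0.0, s + 1
    return 0.0, s                       # action 0 stays put

def policy(s):                          # epsilon-greedy over Q
    if random.random() < eps:
        return random.randrange(n_actions)
    return 0 if Q[s][0] > Q[s][1] else 1

for _ in range(300):
    s, a = 0, policy(0)
    z = [[0.0] * n_actions for _ in range(n_states)]
    steps = 0
    while s is not None and steps < 100:
        r, s2 = env_step(s, a)
        a2 = policy(s2) if s2 is not None else None
        # action-value TD error
        delta = r + (gamma * Q[s2][a2] if s2 is not None else 0.0) - Q[s][a]
        z[s][a] += 1.0                  # accumulating trace for the visited pair
        for i in range(n_states):
            for j in range(n_actions):
                Q[i][j] += alpha * delta * z[i][j]
                z[i][j] *= gamma * lam
        s, a = s2, a2
        steps += 1
```

Each TD error is propagated to every recently visited state-action pair in proportion to its trace, so credit flows back along the trajectory in a single step.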

Reinforcement Learning · Introduced 2000

HFPSO

Hybrid Firefly and Particle Swarm Optimization

Hybrid Firefly and Particle Swarm Optimization (HFPSO) is a metaheuristic optimization algorithm that combines the strong points of firefly and particle swarm optimization. HFPSO tries to determine the start of the local search process properly by checking the previous global best fitness values. MATLAB code for the paper is available.

Reinforcement Learning · Introduced 2000
Page 2 of 2