Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Methods

99 machine learning methods and techniques

Filters: All · Audio · Computer Vision · General · Graphs · Natural Language Processing · Reinforcement Learning · Sequential

POMO

Reinforcement Learning · Introduced 2000 · 6 papers

Prioritized Sweeping

Prioritized Sweeping is a technique for model-based reinforcement learning that prioritizes updates according to a measure of urgency and performs the most urgent updates first. A queue is maintained of every state-action pair whose estimated value would change nontrivially if updated, prioritized by the size of the change. When the top pair in the queue is updated, the effect on each of its predecessor pairs is computed; if the effect is greater than some threshold, the pair is inserted into the queue with the new priority. Source: Sutton and Barto, Reinforcement Learning, 2nd Edition
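
The queue-driven loop above can be sketched for a toy deterministic tabular model (the chain MDP, names, threshold, and step size below are illustrative, not from a specific implementation):

```python
import heapq

gamma, alpha, theta = 0.95, 0.5, 1e-4  # discount, step size, priority threshold

# Toy deterministic model of a 3-state chain: model[(s, a)] = (reward, next_state).
model = {
    (0, "right"): (0.0, 1),
    (1, "right"): (0.0, 2),
    (2, "stay"):  (1.0, 2),
}
# Predecessor pairs for each state, derived from the model.
predecessors = {}
for (s, a), (_, s2) in model.items():
    predecessors.setdefault(s2, set()).add((s, a))

Q = {pair: 0.0 for pair in model}

def best_value(state):
    return max(Q[(s, a)] for (s, a) in Q if s == state)

def td_error(s, a):
    r, s2 = model[(s, a)]
    return r + gamma * best_value(s2) - Q[(s, a)]

# Seed the queue with every pair whose value would change nontrivially.
queue = []  # min-heap over negated priorities, so the largest change pops first
for (s, a) in Q:
    p = abs(td_error(s, a))
    if p > theta:
        heapq.heappush(queue, (-p, (s, a)))

while queue:
    _, (s, a) = heapq.heappop(queue)
    Q[(s, a)] += alpha * td_error(s, a)
    # Re-queue each predecessor whose urgency now exceeds the threshold.
    for (s_bar, a_bar) in predecessors.get(s, ()):
        p = abs(td_error(s_bar, a_bar))
        if p > theta:
            heapq.heappush(queue, (-p, (s_bar, a_bar)))
```

On this chain the loop converges to Q(2, stay) ≈ 1/(1 − γ) = 20, with the predecessor pairs inheriting the discounted values.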

Reinforcement Learning · Introduced 2000 · 6 papers

Soft Actor-Critic (Autotuned Temperature)

Soft Actor-Critic (Autotuned Temperature) is a modification of the SAC reinforcement learning algorithm. SAC can be brittle with respect to the temperature hyperparameter. Unlike in conventional reinforcement learning, where the optimal policy is independent of the scaling of the reward function, in maximum-entropy reinforcement learning the scaling factor has to be compensated by the choice of a suitable temperature, and a sub-optimal temperature can drastically degrade performance. To resolve this issue, SAC with autotuned temperature uses an automatic gradient-based temperature tuning method that adjusts the expected entropy over the visited states to match a target value.
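
A minimal closed-loop sketch of the temperature update; the toy actor response below is a hypothetical stand-in for a policy whose entropy grows with the entropy bonus, not the authors' implementation:

```python
import math

# Target entropy: a common heuristic is -dim(action_space); here dim = 1.
target_entropy = -1.0
log_alpha = 0.0   # optimize log(alpha) so that alpha stays positive
lr = 0.5

for _ in range(500):
    alpha = math.exp(log_alpha)
    # Hypothetical actor response: entropy grows with the temperature alpha.
    policy_entropy = math.log(alpha) + 1.0
    log_pi = -policy_entropy               # E[log pi(a|s)] = -H(pi)
    # Temperature loss J(alpha) = E[-alpha * (log pi(a|s) + target_entropy)];
    # its gradient w.r.t. log_alpha in this scalar case:
    grad = -alpha * (log_pi + target_entropy)
    log_alpha -= lr * grad                 # gradient descent on J

alpha = math.exp(log_alpha)
```

The iteration settles where the policy entropy matches the target; in this toy model that fixed point is alpha = e^(-2).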

Reinforcement Learning · Introduced 2000 · 6 papers

APPO

Asynchronous Proximal Policy Optimization

Reinforcement Learning · Introduced 2000 · 4 papers

DouZero

DouZero is an AI system for the card game DouDizhu that enhances traditional Monte-Carlo methods with deep neural networks, action encoding, and parallel actors. The Q-network of DouZero consists of an LSTM to encode historical actions and a six-layer MLP with a hidden dimension of 512. The network predicts a value for a given state-action pair from the concatenated representation of action and state.

Reinforcement Learning · Introduced 2000 · 4 papers

MDPO

Mirror Descent Policy Optimization

Mirror Descent Policy Optimization (MDPO) is a policy gradient algorithm based on the idea of iteratively solving a trust-region problem that minimizes a sum of two terms: a linearization of the standard RL objective function and a proximity term that restricts two consecutive updates to be close to each other. It is based on Mirror Descent, which is a general trust region method that attempts to keep consecutive iterates close to each other.
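
The mirror descent building block can be illustrated on a single-state policy over three actions: maximizing a linearized objective under a KL proximity term has a closed-form multiplicative-weights update. This is a hedged sketch of that one component, not the full MDPO algorithm:

```python
import math

def mirror_descent_step(pi, q, step_size):
    """One trust-region step over the probability simplex:
    argmax_p <p, q> - (1/step_size) * KL(p || pi)  =>  p_i ∝ pi_i * exp(step_size * q_i)."""
    logits = [math.log(p) + step_size * qi for p, qi in zip(pi, q)]
    m = max(logits)                      # stabilize the softmax
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

pi = [1 / 3, 1 / 3, 1 / 3]               # uniform policy at a single state
q = [1.0, 0.0, -1.0]                     # (fixed) action values at that state
for _ in range(50):
    pi = mirror_descent_step(pi, q, step_size=0.1)
```

Because each update multiplies the previous iterate by a small exponential tilt, consecutive policies stay close to each other while probability mass gradually concentrates on the best action.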

Reinforcement Learning · Introduced 2000 · 4 papers

TD-Gammon

TD-Gammon is a game-learning architecture for playing backgammon. It involves the use of a learning algorithm and a feedforward neural network. Credit: Temporal Difference Learning and TD-Gammon

Reinforcement Learning · Introduced 1992 · 4 papers

IQ-Learn

Inverse Q-Learning

Inverse Q-Learning (IQ-Learn) is a simple, stable, and data-efficient framework for imitation learning (IL) that directly learns soft Q-functions from expert data. IQ-Learn enables non-adversarial imitation learning, working in both offline and online IL settings. It is performant even with very sparse expert data, and scales to complex image-based environments, surpassing prior methods by more than 3x. It is very simple to implement, requiring about 15 lines of code on top of existing RL methods. Source: IQ-Learn: Inverse soft Q-Learning for Imitation

Reinforcement Learning · Introduced 2000 · 4 papers

QPT

Quantum Process Tomography

Reinforcement Learning · Introduced 2000 · 3 papers

Ape-X DQN

Ape-X DQN is a variant of a DQN with some components of Rainbow-DQN that utilizes distributed prioritized experience replay through the Ape-X architecture.

Reinforcement Learning · Introduced 2000 · 3 papers

CLIPort

CLIPort is a language-conditioned imitation-learning agent that combines the broad semantic understanding (what) of CLIP [1] with the spatial precision (where) of Transporter [2].

Reinforcement Learning · Introduced 2000 · 3 papers

GTrXL

Gated Transformer-XL

Gated Transformer-XL, or GTrXL, is a Transformer-based architecture for reinforcement learning. It introduces architectural modifications that improve the stability and learning speed of the original Transformer and XL variant. Changes include:

- Placing the layer normalization on only the input stream of the submodules. A key benefit of this reordering is that it enables an identity map from the input of the transformer at the first layer to the output of the transformer after the last layer. This is in contrast to the canonical transformer, where a series of layer normalization operations non-linearly transform the state encoding.
- Replacing residual connections with gating layers. The authors' experiments found that GRUs were the most effective form of gating.

Reinforcement Learning · Introduced 2000 · 3 papers

DDQL

Double Deep Q-Learning

Reinforcement Learning · Introduced 2000 · 3 papers

PWIL

Primal Wasserstein Imitation Learning

Primal Wasserstein Imitation Learning, or PWIL, is an imitation learning method built on the primal form of the Wasserstein distance between the expert and agent state-action distributions. The reward function is derived offline, as opposed to recent adversarial IL algorithms that learn a reward function through interactions with the environment, and it requires little fine-tuning.

Reinforcement Learning · Introduced 2000 · 3 papers

Sym-NCO

Reinforcement Learning · Introduced 2000 · 2 papers

SEED RL

SEED (Scalable, Efficient, Deep-RL) is a scalable reinforcement learning agent. It utilizes an architecture that features centralized inference and an optimized communication layer. SEED adopts two state-of-the-art distributed algorithms, IMPALA/V-trace (policy gradients) and R2D2 (Q-learning).

Reinforcement Learning · Introduced 2000 · 2 papers

TorchBeast

TorchBeast is a platform for reinforcement learning (RL) research in PyTorch. It implements a version of the popular IMPALA algorithm for fast, asynchronous, parallel training of RL agents.

Reinforcement Learning · Introduced 2000 · 2 papers

Policy Similarity Metric

Policy Similarity Metric, or PSM, is a similarity metric for measuring behavioral similarity between states in reinforcement learning. It assigns high similarity to states for which the optimal policies in those states as well as in future states are similar. PSM is reward-agnostic, making it more robust for generalization compared to approaches that rely on reward information.

Reinforcement Learning · Introduced 2000 · 2 papers

myGym

MyGym: Modular Toolkit for Visuomotor Robotic Tasks

We introduce myGym, a toolkit suitable for fast prototyping of neural networks in the area of robotic manipulation and navigation. Our toolbox is fully modular, enabling users to train their algorithms on different robots, environments, and tasks. We also include pretrained neural network modules for real-time vision that allow training visuomotor tasks with sim2real transfer. The visual modules can easily be retrained using the dataset generation pipeline with domain augmentation and randomization. Moreover, myGym provides automatic evaluation methods and baselines that help users directly compare their trained models with state-of-the-art algorithms. We additionally present a novel metric, called learnability, to compare the general learning capability of algorithms in different settings, where the complexity of the environment, robot, and task is systematically manipulated. The learnability score tracks differences between the performance of algorithms in increasingly challenging setups, and thus allows users to compare different models in a more systematic fashion. The code is accessible at https://github.com/incognite-lab/myGym

Reinforcement Learning · Introduced 2000 · 2 papers

ACKTR

ACKTR, or Actor Critic with Kronecker-factored Trust Region, is an actor-critic method for reinforcement learning that applies trust region optimization using a recently proposed Kronecker-factored approximation to the curvature. The method extends the framework of natural policy gradient and optimizes both the actor and the critic using Kronecker-factored approximate curvature (K-FAC) with trust region.

Reinforcement Learning · Introduced 2000 · 2 papers

NoisyNet-DQN

NoisyNet-DQN is a modification of a DQN that utilises noisy linear layers for exploration instead of ε-greedy exploration as in the original DQN formulation.
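
A hedged sketch of a factored-Gaussian noisy linear layer in plain Python; the class name follows common NoisyNet conventions, and a single shared sigma is used for brevity where the paper learns one per weight:

```python
import math, random

random.seed(0)

def f(x):
    # noise-shaping function from the NoisyNet paper: f(x) = sgn(x) * sqrt(|x|)
    return math.copysign(math.sqrt(abs(x)), x)

class NoisyLinear:
    """Noisy linear layer with factored Gaussian noise:
    w = mu_w + sigma * (f(eps_out) f(eps_in)^T), b = mu_b + sigma * f(eps_out)."""
    def __init__(self, n_in, n_out, sigma0=0.5):
        bound = 1.0 / math.sqrt(n_in)
        self.mu_w = [[random.uniform(-bound, bound) for _ in range(n_in)]
                     for _ in range(n_out)]
        self.mu_b = [random.uniform(-bound, bound) for _ in range(n_out)]
        # one shared sigma for simplicity; the paper learns a sigma per weight
        self.sigma = sigma0 / math.sqrt(n_in)
        self.n_in, self.n_out = n_in, n_out
        self.reset_noise()

    def reset_noise(self):
        eps_in = [f(random.gauss(0.0, 1.0)) for _ in range(self.n_in)]
        eps_out = [f(random.gauss(0.0, 1.0)) for _ in range(self.n_out)]
        self.eps_w = [[eo * ei for ei in eps_in] for eo in eps_out]
        self.eps_b = eps_out

    def __call__(self, x):
        out = []
        for j in range(self.n_out):
            s = self.mu_b[j] + self.sigma * self.eps_b[j]
            for i in range(self.n_in):
                s += (self.mu_w[j][i] + self.sigma * self.eps_w[j][i]) * x[i]
            out.append(s)
        return out

layer = NoisyLinear(4, 2)
x = [1.0, 0.5, -0.5, 2.0]
y1 = layer(x)
layer.reset_noise()   # exploration comes from resampling the noise
y2 = layer(x)
```

The same input produces different outputs after `reset_noise()`, which is what replaces ε-greedy action dithering.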

Reinforcement Learning · Introduced 2000 · 2 papers

Robust Predictable Control

Robust Predictable Control, or RPC, is an RL algorithm for learning policies that uses only a few bits of information. RPC brings together ideas from information bottlenecks, model-based RL, and bits-back coding. The main idea of RPC is that if the agent can accurately predict the future, then the agent will not need to observe as many bits from future observations. Precisely, the agent will learn a latent dynamics model that predicts the next representation using the current representation and action. In addition to predicting the future, the agent can also decrease the number of bits by changing its behavior. States where the dynamics are hard to predict will require more bits, so the agent will prefer visiting states where its learned model can accurately predict the next state.

Reinforcement Learning · Introduced 2000 · 1 paper

Bayesian REX

Bayesian Reward Extrapolation

Bayesian Reward Extrapolation is a Bayesian reward learning algorithm that scales to high-dimensional imitation learning problems by pre-training a low-dimensional feature encoding via self-supervised tasks and then leveraging preferences over demonstrations to perform fast Bayesian inference.

Reinforcement Learning · Introduced 2000 · 1 paper

Flow Normalization

Reinforcement Learning · Introduced 2000 · 1 paper

CoBERL

Contrastive BERT

Contrastive BERT (CoBERL) is a reinforcement learning agent that combines a new contrastive loss with a hybrid LSTM-transformer architecture to tackle the challenge of improving data efficiency for RL. It uses bidirectional masked prediction in combination with a generalization of recent contrastive methods to learn better representations for transformers in RL, without the need for hand-engineered data augmentations. Architecturally, a residual network encodes observations into embeddings, which are fed through a causally masked GTrXL transformer; the transformer computes the predicted masked inputs and passes them, together with the embeddings, to a learnt gate. The output of the gate is passed through a single LSTM layer to produce the values used for computing the RL loss. A contrastive loss is computed using the predicted masked inputs as targets; for this loss, the causal mask of the transformer is not used.

Reinforcement Learning · Introduced 2000 · 1 paper

NoisyNet-A3C

NoisyNet-A3C is a modification of A3C that utilises noisy linear layers for exploration instead of ε-greedy exploration as in the original DQN formulation.

Reinforcement Learning · Introduced 2000 · 1 paper

DeepCubeAI

DeepCubeA + Imagination

DeepCubeAI is an algorithm that learns a discrete world model and employs deep reinforcement learning methods to learn a heuristic function that generalizes over start and goal states. The learned model and the learned heuristic function are then integrated with heuristic search, such as Q* search, to solve sequential decision-making problems. [[paper]](https://rlj.cs.umass.edu/2024/papers/Paper225.html) [[Code]](https://github.com/misaghsoltani/DeepCubeAI) [[PyPI]](https://pypi.org/project/deepcubeai/) [[Slides]](https://cse.sc.edu/foresta/assets/files/Slides--LearningDiscreteWorldModelsforHeuristicSearch.pdf) [[Poster]](https://cse.sc.edu/foresta/assets/files/Poster--LearningDiscreteWorldModelsforHeuristicSearch.pdf)

Key contributions. DeepCubeAI comprises three components:

1. Discrete world model: learns a world model that represents states in a discrete latent space. This tackles two challenges, model degradation and state re-identification: prediction errors less than 0.5 are corrected by rounding, and states are re-identified by comparing binary vectors.
2. Generalizable heuristic function: utilizes a Deep Q-Network (DQN) and hindsight experience replay (HER) to learn a heuristic function that generalizes over start and goal states.
3. Optimized search: integrates the learned model and the learned heuristic function with heuristic search. It uses Q* search, a variant of A* search optimized for DQNs, which enables faster and more memory-efficient planning.

Main results: accurate reconstruction of ground-truth images after thousands of timesteps; 100% success on Rubik's Cube (canonical goal), Sokoban, IceSlider, and DigitJump; 99.9% success on Rubik's Cube with reversed start/goal states; and significant improvement in solving complex planning problems and generalizing to unseen goals.

Reinforcement Learning · Introduced 2000 · 1 paper

Pixel Tracking

Reinforcement Learning · Introduced 2000 · 1 paper

FORK

Forward-Looking Actor

FORK, or Forward-Looking Actor, is a type of actor for actor-critic algorithms. In particular, FORK includes a neural network that forecasts the next state given the current state and action (the system network), and a neural network that forecasts the reward given a state-action pair (the reward network). With the system network and reward network, FORK can forecast the next state and take the value of the next state into account when improving the policy.
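
A toy sketch of the one-step lookahead term this enables in the actor objective; all four functions below are hypothetical scalar stand-ins for the learned networks, not FORK's actual models:

```python
gamma = 0.99

# Scalar stand-ins for the learned networks (hypothetical placeholders):
system_net = lambda s, a: 0.9 * s + 0.1 * a          # forecasts the next state
reward_net = lambda s, a: -(s ** 2) - 0.01 * a ** 2  # forecasts the reward
critic = lambda s, a: -(s ** 2)                      # learned Q approximation
actor = lambda s: -0.5 * s                           # current deterministic policy

def fork_lookahead_objective(s):
    """One-step lookahead term used when improving the policy:
    r(s, a) + gamma * Q(s', a'), with s' forecast by the system network."""
    a = actor(s)
    s_next = system_net(s, a)
    a_next = actor(s_next)
    return reward_net(s, a) + gamma * critic(s_next, a_next)

value = fork_lookahead_objective(2.0)
```

Because `s_next` comes from the forecasting network rather than the environment, this term can be evaluated (and differentiated through, in a real implementation) without extra environment interaction.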

Reinforcement Learning · Introduced 2000 · 1 paper

TayPO

Taylor Expansion Policy Optimization

TayPO, or Taylor Expansion Policy Optimization, refers to a set of algorithms that apply K-th order Taylor expansions for policy optimization. This generalizes prior work, including TRPO as a special case, and can be thought of as unifying ideas from trust-region policy optimization and off-policy corrections. Taylor expansions share high-level similarities with both trust-region policy search and off-policy corrections. To get high-level intuition for these similarities, consider a simple 1D example of Taylor expansions. Given a sufficiently smooth real-valued function f on the real line, the K-th order Taylor expansion of f at x_0 is f_K(x) = f(x_0) + sum_{k=1}^{K} f^(k)(x_0) (x - x_0)^k / k!, where f^(k)(x_0) are the k-th order derivatives of f at x_0. First, a common feature shared by Taylor expansions and trust-region policy search is the inherent notion of a trust-region constraint: for convergence to take place, a constraint of the form |x - x_0| < R is required, with R the radius of convergence. Second, when using the truncation f_K as an approximation to the original function f, Taylor expansions satisfy the requirement of off-policy evaluation: evaluate the target policy with behavior data. Indeed, to evaluate the truncation f_K(x) at any x (target policy), we only require the behavior-policy "data" at x_0 (i.e., the derivatives f^(k)(x_0)).
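
The 1D truncation can be checked numerically (an illustrative sketch, using f = exp so all derivatives at 0 equal 1):

```python
import math

def taylor_truncation(derivs, x0, x):
    """f_K(x) = sum_{k=0}^{K} f^(k)(x0) * (x - x0)**k / k!, with derivs[k] = f^(k)(x0)."""
    return sum(d * (x - x0) ** k / math.factorial(k) for k, d in enumerate(derivs))

# Example: f = exp, whose k-th derivative at x0 = 0 is exp(0) = 1 for every k.
x0, x = 0.0, 0.5
exact = math.exp(x)
approx_k2 = taylor_truncation([1.0] * 3, x0, x)   # K = 2
approx_k5 = taylor_truncation([1.0] * 6, x0, x)   # K = 5
```

Raising K tightens the approximation inside the radius of convergence, mirroring how higher-order corrections tighten the surrogate objective.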

Reinforcement Learning · Introduced 2000 · 1 paper

MushroomRL

MushroomRL is an open-source Python library developed to simplify the process of implementing and running reinforcement learning (RL) experiments. The architecture of MushroomRL is built in such a way that every component of an RL problem is already provided, so most of the time users can focus solely on the implementation of their own algorithms and experiments. MushroomRL comes with a strongly modular architecture that makes it easy to understand how each component is structured and how it interacts with the others; moreover, it provides an exhaustive list of RL methodologies.

Reinforcement Learning · Introduced 2000 · 1 paper

KOVA

Kalman Optimization for Value Approximation

Kalman Optimization for Value Approximation, or KOVA, is a general framework for addressing uncertainties while approximating value-based functions in deep RL domains. KOVA minimizes a regularized objective function that accounts for both parameter and noisy-return uncertainties. It is feasible with non-linear approximation functions such as DNNs and can estimate the value in both on-policy and off-policy settings. It can be incorporated as a policy-evaluation component in policy optimization algorithms.

Reinforcement Learning · Introduced 2000 · 1 paper

NoisyNet-Dueling

NoisyNet-Dueling is a modification of a Dueling Network that utilises noisy linear layers for exploration instead of ε-greedy exploration as in the original Dueling formulation.

Reinforcement Learning · Introduced 2000 · 1 paper

GradientDICE

GradientDICE is a density ratio learning method for estimating the density ratio between the state distribution of the target policy and the sampling distribution in off-policy reinforcement learning. It optimizes a different objective from GenDICE by using the Perron-Frobenius theorem and eliminating GenDICE’s use of divergence, such that nonlinearity in parameterization is not necessary for GradientDICE, which is provably convergent under linear function approximation.

Reinforcement Learning · Introduced 2000 · 1 paper

gSDE

Generalized State-Dependent Exploration

Generalized State-Dependent Exploration, or gSDE, is an exploration method for reinforcement learning that uses more general features and re-samples the noise periodically. State-Dependent Exploration (SDE) is an intermediate solution for exploration that consists in adding noise as a function of the state s to the deterministic action mu(s). At the beginning of an episode, the parameters theta_eps of that exploration function are drawn from a Gaussian distribution, and the resulting action is a = mu(s) + eps(s; theta_eps). This episode-based exploration is smoother and more consistent than unstructured step-based exploration: during one episode, instead of oscillating around a mean value, the action for a given state will be the same. In the case of a linear exploration function eps(s; theta_eps) = theta_eps s, by operations on Gaussian distributions, Rückstieß et al. show that the action element a_j is normally distributed, a_j ~ N(mu_j(s), sigma_hat_j^2), where sigma_hat_j^2 = sum_i (sigma_ij s_i)^2. Because the policy distribution is known, the derivative of the log-likelihood with respect to the variance parameters sigma can be obtained and plugged into the likelihood-ratio gradient estimator, which allows sigma to be adapted during training. SDE is therefore compatible with standard policy gradient methods, while addressing most shortcomings of unstructured exploration. For gSDE, two improvements are suggested: 1. sample the parameters theta_eps of the exploration function every n steps instead of every episode; 2. instead of the state s, use any features, such as the policy features (the last layer before the deterministic output), as input to the noise function.
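
A minimal sketch of gSDE-style sampling, assuming a linear noise function over arbitrary features (all names and dimensions below are illustrative):

```python
import random

random.seed(1)

def sample_exploration_matrix(n_feat, n_act, sigma=0.3):
    # theta_eps ~ N(0, sigma^2), drawn once and kept for the next n steps
    return [[random.gauss(0.0, sigma) for _ in range(n_act)] for _ in range(n_feat)]

def gsde_action(mu, features, theta_eps):
    """a = mu(s) + eps(s; theta_eps), with linear noise eps_j = sum_i theta_ij * z_i."""
    noise = [sum(theta_eps[i][j] * features[i] for i in range(len(features)))
             for j in range(len(mu))]
    return [m + n for m, n in zip(mu, noise)]

mu = [0.2, -0.1]        # deterministic policy output for the current state
z = [1.0, 0.5, -0.25]   # features, e.g. the last policy layer

theta = sample_exploration_matrix(len(z), len(mu))
a1 = gsde_action(mu, z, theta)
a2 = gsde_action(mu, z, theta)   # same state, same noise params: identical action
theta = sample_exploration_matrix(len(z), len(mu))   # resampled every n steps
a3 = gsde_action(mu, z, theta)
```

Holding `theta` fixed makes exploration consistent within the sampling window, while periodic resampling keeps it from collapsing to a single perturbation for the whole run.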

Reinforcement Learning · Introduced 2000 · 1 paper

Protagonist Antagonist Induced Regret Environment Design

Protagonist Antagonist Induced Regret Environment Design, or PAIRED, is an adversarial method for approximate minimax regret to generate environments for reinforcement learning. It introduces an antagonist which is allied with the environment-generating adversary. The primary agent we are trying to train is the protagonist. The environment adversary's goal is to design environments in which the antagonist achieves high reward and the protagonist receives low reward. If the adversary generates unsolvable environments, the antagonist and protagonist would perform the same and the adversary would get a score of zero, but if the adversary finds environments the antagonist solves and the protagonist does not solve, the adversary achieves a positive score. Thus, the environment adversary is incentivized to create challenging but feasible environments, in which the antagonist can outperform the protagonist. Moreover, as the protagonist learns to solve the simple environments, the adversary must generate more complex environments to make the protagonist fail, increasing the complexity of the generated tasks and leading to automatic curriculum generation.

Reinforcement Learning · Introduced 2000 · 1 paper

TbUM

Table Uniformity Method

The table uniformity approach is proposed to solve the problem of CSV dialect determination. The method is based on a consistency measurement over a table Γδ, returned by parsing a CSV file with a candidate dialect ρδ, using the dispersion of records along with the inference of raw data types from fields.

Reinforcement Learning · Introduced 2000 · 1 paper

CILO

Continuous Imitation Learning from Observation

Reinforcement Learning · Introduced 2000 · 1 paper

IGSA

Improved Gravitational Search algorithm

Metaheuristic algorithm

Reinforcement Learning · Introduced 2000 · 1 paper

Ape-X DPG

Ape-X DPG combines DDPG with distributed prioritized experience replay through the Ape-X architecture.

Reinforcement Learning · Introduced 2000 · 1 paper

4D A*

Four-dimensional A-star

The aim of 4D A* is to find the shortest path between two four-dimensional (4D) nodes of a 4D search space (a starting node and a target node), as long as a path exists. It achieves both optimality and completeness: optimality because the returned path is the shortest possible, and completeness because if a solution exists the algorithm is guaranteed to find it.
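
A sketch of A* on a small bounded 4D integer lattice with unit moves and a Manhattan-distance heuristic (the grid bounds and function names are illustrative, not from the method's implementation):

```python
import heapq

def astar_4d(start, goal, blocked=frozenset(), bound=5):
    """A* on a bounded 4D integer lattice with unit moves and a Manhattan heuristic."""
    def h(p):
        return sum(abs(a - b) for a, b in zip(p, goal))

    def neighbors(p):
        for dim in range(4):
            for step in (-1, 1):
                q = list(p)
                q[dim] += step
                q = tuple(q)
                if q not in blocked and all(0 <= c < bound for c in q):
                    yield q

    open_heap = [(h(start), 0, start)]
    g = {start: 0}          # best known cost-to-come
    came = {}
    while open_heap:
        _, gs, cur = heapq.heappop(open_heap)
        if cur == goal:                       # optimal: h is admissible
            path = [cur]
            while cur in came:
                cur = came[cur]
                path.append(cur)
            return path[::-1]
        if gs > g.get(cur, float("inf")):     # skip stale heap entries
            continue
        for nb in neighbors(cur):
            ng = gs + 1
            if ng < g.get(nb, float("inf")):
                g[nb] = ng
                came[nb] = cur
                heapq.heappush(open_heap, (ng + h(nb), ng, nb))
    return None                               # complete: None only if no path exists

path = astar_4d((0, 0, 0, 0), (2, 1, 0, 1))
```

With an admissible heuristic the first time the goal is popped its path is shortest, and exhausting the open set without reaching the goal proves no path exists.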

Reinforcement Learning · Introduced 2000 · 1 paper

Blue River Controls

Blue River Controls is a tool that allows users to train and test reinforcement learning algorithms on real-world hardware. It features a simple OpenAI Gym-based interface that works directly on both simulation and hardware.

Reinforcement Learning · Introduced 2000 · 1 paper

True Online TD Lambda

True Online TD(λ) seeks to approximate the ideal online λ-return algorithm: it inverts this ideal forward-view algorithm to produce an efficient backward-view algorithm using eligibility traces. It uses dutch traces rather than accumulating traces. Source: van Seijen and Sutton
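
A sketch of the true online TD(λ) update with dutch traces for linear value functions, following the Sutton and Barto pseudocode, on a toy two-state chain (the chain MDP and hyperparameters are illustrative):

```python
gamma, lam, alpha = 0.9, 0.8, 0.1
n = 2
w = [0.0] * n               # linear weights; one-hot features make this tabular

def x_vec(s):               # one-hot features; the terminal state is the zero vector
    v = [0.0] * n
    if s is not None:
        v[s] = 1.0
    return v

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Chain MDP: 0 -> 1 -> terminal, reward 1 per step. True values: v(0)=1.9, v(1)=1.
for _ in range(500):
    x = x_vec(0)
    z = [0.0] * n           # dutch eligibility trace
    v_old = 0.0
    for s, r, s2 in [(0, 1.0, 1), (1, 1.0, None)]:
        x2 = x_vec(s2)
        v, v2 = dot(w, x), dot(w, x2)
        delta = r + gamma * v2 - v
        zx = dot(z, x)
        # dutch trace: z <- gamma*lam*z + (1 - alpha*gamma*lam*z.x) x
        z = [gamma * lam * zi + (1.0 - alpha * gamma * lam * zx) * xi
             for zi, xi in zip(z, x)]
        # true online weight update, with the extra v - v_old correction terms
        w = [wi + alpha * (delta + v - v_old) * zi - alpha * (v - v_old) * xi
             for wi, zi, xi in zip(w, z, x)]
        v_old = v2
        x = x2
```

The extra `v - v_old` terms are what distinguish this backward view from ordinary TD(λ); they make it match the online λ-return algorithm exactly in the linear case.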

Reinforcement Learning · Introduced 2000

Replacing Eligibility Trace

In a replacing eligibility trace, each time a state is revisited its trace is reset to 1, regardless of the presence of a prior trace. For the memory vector z_t, the component of the revisited state is set to 1, while all other components decay by γλ. Replacing traces can be seen as a crude approximation to dutch traces, which have largely superseded them: dutch traces perform better than replacing traces and have a clearer theoretical basis. Accumulating traces remain of interest for nonlinear function approximation, where dutch traces are not available. Source: Sutton and Barto, Reinforcement Learning, 2nd Edition
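
A quick sketch contrasting replacing and accumulating traces on a toy visit sequence (illustrative values):

```python
gamma, lam = 0.9, 0.8

def step_trace(z, s, replacing):
    z = [gamma * lam * zi for zi in z]   # decay every component
    if replacing:
        z[s] = 1.0                       # replacing: reset the revisited state to 1
    else:
        z[s] += 1.0                      # accumulating: add 1 on every visit
    return z

z_rep = [0.0, 0.0, 0.0]
z_acc = [0.0, 0.0, 0.0]
for s in [0, 0, 1, 0]:                   # state 0 is revisited repeatedly
    z_rep = step_trace(z_rep, s, replacing=True)
    z_acc = step_trace(z_acc, s, replacing=False)
```

On revisits the replacing trace is capped at 1, while the accumulating trace grows past 1; this is the difference that can make accumulating traces over-weight frequently revisited states.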

Reinforcement Learning · Introduced 2000

Sarsa Lambda

Sarsa(λ) extends eligibility traces to action-value methods. It has the same update rule as TD(λ), but uses the action-value form of the TD error, delta_t = R_{t+1} + gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t), and the action-value form of the eligibility trace, z_t = gamma lambda z_{t-1} + grad q_hat(S_t, A_t, w). Source: Sutton and Barto, Reinforcement Learning, 2nd Edition
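
A tabular Sarsa(λ) sketch with accumulating traces on a toy chain (the environment and hyperparameters below are illustrative):

```python
import random

random.seed(0)

gamma, lam, alpha, eps = 0.9, 0.9, 0.2, 0.1
n_states, n_actions = 3, 2
Q = [[0.0] * n_actions for _ in range(n_states)]

def env_step(s, a):
    # toy chain: action 1 moves right (reward 0); leaving state 1 pays 1 and ends
    if a == 1:
        if s + 1 == n_states - 1:
            return 1.0, None            # terminal
        return 0.0, s + 1
    return 0.0, s                       # action 0 stays put

def policy(s):                          # epsilon-greedy over Q
    if random.random() < eps:
        return random.randrange(n_actions)
    return 0 if Q[s][0] > Q[s][1] else 1

for _ in range(300):
    s, a = 0, policy(0)
    z = [[0.0] * n_actions for _ in range(n_states)]
    steps = 0
    while s is not None and steps < 100:
        r, s2 = env_step(s, a)
        a2 = policy(s2) if s2 is not None else None
        # action-value TD error
        delta = r + (gamma * Q[s2][a2] if s2 is not None else 0.0) - Q[s][a]
        z[s][a] += 1.0                  # accumulating trace for the visited pair
        for i in range(n_states):
            for j in range(n_actions):
                Q[i][j] += alpha * delta * z[i][j]
                z[i][j] *= gamma * lam
        s, a = s2, a2
        steps += 1
```

Each TD error is propagated to every recently visited state-action pair in proportion to its trace, so credit flows back along the trajectory in a single step.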

Reinforcement Learning · Introduced 2000

HFPSO

Hybrid Firefly and Particle Swarm Optimization

Hybrid Firefly and Particle Swarm Optimization (HFPSO) is a metaheuristic optimization algorithm that combines the strong points of firefly and particle swarm optimization. HFPSO tries to determine the start of the local search process properly by checking the previous global best fitness values. MATLAB code for the paper is available.

Reinforcement Learning · Introduced 2000
Page 2 of 2