Papers With Code

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TD Lambda

Reinforcement Learning · Introduced 2000 · 14 papers

Description

$TD(\lambda)$ is a generalisation of $TD(n)$ reinforcement learning algorithms that employs an eligibility trace and $\lambda$-weighted returns. The eligibility trace vector is initialized to zero at the beginning of the episode, is incremented on each time step by the value gradient, and then fades away by $\gamma\lambda$:

$$\mathbf{z}_{-1} = \mathbf{0}$$

$$\mathbf{z}_{t} = \gamma\lambda\mathbf{z}_{t-1} + \nabla\hat{v}\left(S_{t}, \mathbf{w}_{t}\right), \quad 0 \leq t \leq T$$

The eligibility trace keeps track of which components of the weight vector have contributed to recent state valuations. Here $\nabla\hat{v}\left(S_{t}, \mathbf{w}_{t}\right)$ is the gradient of the value estimate, which under linear function approximation is simply the feature vector.
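The trace update above can be sketched in a few lines of NumPy, assuming linear function approximation so that the gradient equals the feature vector (the function name and example values here are illustrative, not from the source):

```python
import numpy as np

def update_trace(z, features, gamma, lam):
    """Decay the trace by gamma * lambda, then add the current gradient.

    Under linear function approximation the gradient of v_hat(S_t, w)
    is simply the feature vector of S_t.
    """
    return gamma * lam * z + features

# Initialise z_{-1} = 0 and accumulate over two time steps
z = np.zeros(3)
z = update_trace(z, np.array([1.0, 0.0, 0.0]), gamma=0.9, lam=0.8)
z = update_trace(z, np.array([0.0, 1.0, 0.0]), gamma=0.9, lam=0.8)
print(z)  # the first component has faded by gamma * lambda = 0.72
```

After two steps the first feature's contribution has decayed by one factor of $\gamma\lambda$, illustrating how credit assigned to older states fades geometrically.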

The TD error for state-value prediction is:

$$\delta_{t} = R_{t+1} + \gamma\hat{v}\left(S_{t+1}, \mathbf{w}_{t}\right) - \hat{v}\left(S_{t}, \mathbf{w}_{t}\right)$$

In $TD(\lambda)$, the weight vector is updated on each step in proportion to the scalar TD error and the vector eligibility trace:

$$\mathbf{w}_{t+1} = \mathbf{w}_{t} + \alpha\delta_{t}\mathbf{z}_{t}$$
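Putting the three updates together, a minimal sketch of semi-gradient $TD(\lambda)$ for state-value prediction might look like the following. The function name, episode encoding, and toy two-state chain are assumptions for illustration; the update rules themselves follow the equations above, with linear (one-hot) features so the gradient is the feature vector:

```python
import numpy as np

def td_lambda_episode(w, episode, alpha, gamma, lam):
    """Run one episode of semi-gradient TD(lambda) state-value prediction.

    `episode` is a list of (features, reward, next_features) transitions;
    next_features is None at the terminal state, where v_hat is 0.
    """
    z = np.zeros_like(w)                 # z_{-1} = 0
    for x, r, x_next in episode:
        v = w @ x
        v_next = 0.0 if x_next is None else w @ x_next
        delta = r + gamma * v_next - v   # TD error delta_t
        z = gamma * lam * z + x          # trace update (gradient = x)
        w = w + alpha * delta * z        # w_{t+1} = w_t + alpha * delta_t * z_t
    return w

# Tiny deterministic two-state chain with one-hot features:
# s0 -> s1 (reward 0), s1 -> terminal (reward 1)
s0, s1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
episode = [(s0, 0.0, s1), (s1, 1.0, None)]
w = np.zeros(2)
for _ in range(200):
    w = td_lambda_episode(w, episode, alpha=0.1, gamma=1.0, lam=0.9)
print(w)  # approaches the true values v(s0) = 1, v(s1) = 1
```

With $\gamma = 1$ the true value of both states is 1 (each trajectory collects a total reward of 1 from either state onward), and repeated episodes drive the weights toward those values.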

Source: Sutton and Barto, Reinforcement Learning, 2nd Edition

Papers Using This Method

On-line Policy Improvement using Monte-Carlo Search (2025-01-09)
Model Predictive Control and Reinforcement Learning: A Unified Framework Based on Dynamic Programming (2024-06-02)
Large Language Models Play StarCraft II: Benchmarks and A Chain of Summarization Approach (2023-12-19)
A Robust and Opponent-Aware League Training Method for StarCraft II (2023-09-21)
AlphaStar Unplugged: Large-Scale Offline Reinforcement Learning (2023-08-07)
On Efficient Reinforcement Learning for Full-length Game of StarCraft II (2022-09-23)
AI in Human-computer Gaming: Techniques, Challenges and Opportunities (2021-11-15)
Search in Imperfect Information Games (2021-11-10)
Rethinking of AlphaStar (2021-08-07)
An Introduction of mini-AlphaStar (2021-04-14)
Deep Reinforcement Learning with Function Properties in Mean Reversion Strategies (2021-01-09)
TStarBot-X: An Open-Sourced and Comprehensive Study for Efficient League Training in StarCraft II Full Game (2020-11-27)
AlphaStar: An Evolutionary Computation Perspective (2019-02-05)
A Hierarchical Reinforcement Learning Method for Persistent Time-Sensitive Tasks (2016-06-20)