Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TayPO

Taylor Expansion Policy Optimization

Reinforcement Learning · Introduced 2020 · 1 paper
Source Paper

Description

TayPO, or Taylor Expansion Policy Optimization, refers to a family of algorithms that apply $k$-th order Taylor expansions to policy optimization. This generalizes prior work, including TRPO as a special case, and can be thought of as unifying trust-region policy search with off-policy corrections. To see the high-level similarity between Taylor expansions and these two ideas, consider a simple 1D example. Given a sufficiently smooth real-valued function $f : \mathbb{R} \rightarrow \mathbb{R}$, the $k$-th order Taylor expansion of $f(x)$ at $x_0$ is

$$f_k(x) = f(x_0) + \sum_{i=1}^{k} \frac{f^{(i)}(x_0)}{i!}\,(x - x_0)^i,$$

where $f^{(i)}(x_0)$ denotes the $i$-th order derivative of $f$ at $x_0$. First, Taylor expansions share with trust-region policy search the inherent notion of a trust region: for the expansion to converge, the constraint $|x - x_0| < R(f, x_0)$ is required, where $R(f, x_0)$ is the radius of convergence. Second, when the truncation is used as an approximation to the original function, $f_k(x) \approx f(x)$, Taylor expansions satisfy the requirement of off-policy evaluation: evaluating the target policy with behavior data. Indeed, to evaluate the truncation $f_k(x)$ at any $x$ (the target policy), we only require "data" at $x_0$ (the behavior policy), namely the derivatives $f^{(i)}(x_0)$.
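To make the 1D analogy concrete, here is a minimal numerical sketch (illustrative only, not code from the TayPO paper; the function `taylor_approx` and the choice $f(x) = e^x$ are assumptions). It evaluates the truncation $f_k(x)$ using only derivatives at $x_0$, and shows the approximation error growing as $x$ moves away from $x_0$:

```python
import math

def taylor_approx(derivs, x0, x):
    """Evaluate the k-th order Taylor truncation f_k(x) around x0,
    where k = len(derivs) - 1 and derivs[i] = f^(i)(x0).

    Only information at x0 is needed -- the "behavior data"
    in the off-policy analogy."""
    return sum(d * (x - x0) ** i / math.factorial(i)
               for i, d in enumerate(derivs))

# Illustrative choice: f(x) = exp(x), whose derivatives at x0 = 0 are all 1.
x0 = 0.0
derivs = [1.0] * 6  # f(0), f'(0), ..., f^(5)(0): a 5th-order truncation

for x in (0.1, 0.5, 1.0, 3.0):
    approx = taylor_approx(derivs, x0, x)
    exact = math.exp(x)
    print(f"x={x:4.1f}  exact={exact:8.4f}  f_5(x)={approx:8.4f}  "
          f"error={abs(exact - approx):.2e}")
```

Near $x_0$ the truncation is essentially exact, while at $x = 3$ the error is of order one: the approximation is only trustworthy in a neighborhood of the expansion point, mirroring the role of the trust-region constraint.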

Papers Using This Method

Taylor Expansion Policy Optimization (2020-03-13)