Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TayPO

Taylor Expansion Policy Optimization

Reinforcement Learning · Introduced 2020 · 1 paper
Source Paper

Description

TayPO, or Taylor Expansion Policy Optimization, refers to a family of algorithms that apply $k$-th order Taylor expansions to policy optimization. This generalizes prior work, including TRPO as a special case, and can be thought of as unifying trust-region policy search with off-policy corrections. To see the high-level similarity between Taylor expansions and these two ideas, consider a simple 1D example. Given a sufficiently smooth real-valued function $f : \mathbb{R} \rightarrow \mathbb{R}$, the $k$-th order Taylor expansion of $f(x)$ at $x_0$ is

$$f_k(x) = f(x_0) + \sum_{i=1}^{k} \frac{f^{(i)}(x_0)}{i!}\,(x - x_0)^i,$$

where $f^{(i)}(x_0)$ denotes the $i$-th order derivative of $f$ at $x_0$. First, Taylor expansions share with trust-region policy search the inherent notion of a trust region: for the expansion to converge, the constraint $|x - x_0| < R(f, x_0)$ is required, where $R(f, x_0)$ is the radius of convergence. Second, when the truncation is used as an approximation to the original function, $f_k(x) \approx f(x)$, Taylor expansions satisfy the requirement of off-policy evaluation: evaluating the target policy with behavior data. Indeed, to evaluate the truncation $f_k(x)$ at any $x$ (the target policy), we only require "data" at $x_0$ (the behavior policy), namely the derivatives $f^{(i)}(x_0)$.
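To make the 1D analogy concrete, here is a minimal numerical sketch (illustrative only, not code from the TayPO paper; the function `taylor_approx` and the choice $f(x) = e^x$ are assumptions). It evaluates the truncation $f_k(x)$ using only derivatives at $x_0$, and shows the approximation error growing as $x$ moves away from $x_0$:

```python
import math

def taylor_approx(derivs, x0, x):
    """Evaluate the k-th order Taylor truncation f_k(x) around x0,
    where k = len(derivs) - 1 and derivs[i] = f^(i)(x0).

    Only information at x0 is needed -- the "behavior data"
    in the off-policy analogy."""
    return sum(d * (x - x0) ** i / math.factorial(i)
               for i, d in enumerate(derivs))

# Illustrative choice: f(x) = exp(x), whose derivatives at x0 = 0 are all 1.
x0 = 0.0
derivs = [1.0] * 6  # f(0), f'(0), ..., f^(5)(0): a 5th-order truncation

for x in (0.1, 0.5, 1.0, 3.0):
    approx = taylor_approx(derivs, x0, x)
    exact = math.exp(x)
    print(f"x={x:4.1f}  exact={exact:8.4f}  f_5(x)={approx:8.4f}  "
          f"error={abs(exact - approx):.2e}")
```

Near $x_0$ the truncation is essentially exact, while at $x = 3$ the error is of order one: the approximation is only trustworthy in a neighborhood of the expansion point, mirroring the role of the trust-region constraint.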

Papers Using This Method

Taylor Expansion Policy Optimization (2020-03-13)