Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


gSDE

Generalized State-Dependent Exploration

Reinforcement Learning · Introduced 2020 · 1 paper
Source Paper

Description

Generalized State-Dependent Exploration, or gSDE, is an exploration method for reinforcement learning that uses more general features and resamples the noise periodically.

State-Dependent Exploration (SDE) is an intermediate solution for exploration that consists in adding noise as a function of the state $\mathbf{s}_t$ to the deterministic action $\mu(\mathbf{s}_t)$. At the beginning of an episode, the parameters $\theta_{\epsilon}$ of that exploration function are drawn from a Gaussian distribution. The resulting action $\mathbf{a}_t$ is as follows:

$$\mathbf{a}_t = \mu\left(\mathbf{s}_t ; \theta_{\mu}\right) + \epsilon\left(\mathbf{s}_t ; \theta_{\epsilon}\right), \quad \theta_{\epsilon} \sim \mathcal{N}\left(0, \sigma^{2}\right)$$

This episode-based exploration is smoother and more consistent than unstructured step-based exploration. Thus, during one episode, instead of oscillating around a mean value, the action $\mathbf{a}$ for a given state $\mathbf{s}$ will be the same.
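A minimal sketch of this episode-based scheme with a linear policy and a linear noise function (NumPy; all dimensions and parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 3, 2

# Per-element exploration standard deviations (the sigma in the formula above).
sigma = 0.5 * np.ones((state_dim, action_dim))
theta_mu = rng.normal(size=(state_dim, action_dim))

# Drawn once, at the beginning of the episode: theta_eps ~ N(0, sigma^2).
theta_eps = rng.normal(scale=sigma)

def sde_action(s):
    # a = mu(s; theta_mu) + eps(s; theta_eps), both linear in s here
    return s @ theta_mu + s @ theta_eps

s = np.array([0.1, -0.2, 0.3])
a1, a2 = sde_action(s), sde_action(s)
assert np.allclose(a1, a2)  # same state -> same action within the episode
```

Because the noise is a deterministic function of the state for the duration of the episode, repeated visits to the same state yield the same action, unlike step-based Gaussian noise.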

In the case of a linear exploration function $\epsilon\left(\mathbf{s} ; \theta_{\epsilon}\right) = \theta_{\epsilon} \mathbf{s}$, by operation on Gaussian distributions, Rückstieß et al. show that the action element $\mathbf{a}_j$ is normally distributed:

$$\pi_{j}\left(\mathbf{a}_{j} \mid \mathbf{s}\right) \sim \mathcal{N}\left(\mu_{j}(\mathbf{s}), \hat{\sigma}_{j}^{2}\right)$$

where $\hat{\sigma}$ is a diagonal matrix with elements $\hat{\sigma}_{j} = \sqrt{\sum_{i}\left(\sigma_{ij} \mathbf{s}_{i}\right)^{2}}$.
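This closed form can be checked empirically by sampling many exploration matrices and measuring the spread of the resulting noise (a NumPy sketch; dimensions and values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
state_dim, action_dim = 4, 2
sigma = rng.uniform(0.1, 1.0, size=(state_dim, action_dim))
s = rng.normal(size=state_dim)

# Closed-form std of the linear noise eps_j = sum_i theta_eps[i, j] * s_i
sigma_hat = np.sqrt(np.sum((sigma * s[:, None]) ** 2, axis=0))

# Empirical check: draw many theta_eps ~ N(0, sigma^2) and compute eps = theta_eps^T s.
n_samples = 100_000
theta = rng.standard_normal((n_samples, state_dim, action_dim)) * sigma
noise = np.einsum("i,nij->nj", s, theta)
empirical_std = noise.std(axis=0)
assert np.allclose(empirical_std, sigma_hat, rtol=0.02)
```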

Because we know the policy distribution, we can obtain the derivative of the log-likelihood $\log \pi(\mathbf{a} \mid \mathbf{s})$ with respect to the variance $\sigma$:

$$\frac{\partial \log \pi(\mathbf{a} \mid \mathbf{s})}{\partial \sigma_{ij}} = \frac{\left(\mathbf{a}_{j}-\mu_{j}\right)^{2} - \hat{\sigma}_{j}^{2}}{\hat{\sigma}_{j}^{3}} \, \frac{\mathbf{s}_{i}^{2} \sigma_{ij}}{\hat{\sigma}_{j}}$$

This can easily be plugged into the likelihood ratio gradient estimator, which makes it possible to adapt $\sigma$ during training. SDE is therefore compatible with standard policy gradient methods, while addressing most shortcomings of unstructured exploration.
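The derivative above can be verified against a finite-difference approximation of the Gaussian log-likelihood (a NumPy sketch; $\mu$ is set to zero for simplicity, all other values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
state_dim, action_dim = 3, 2
sigma = rng.uniform(0.2, 1.0, size=(state_dim, action_dim))
s = rng.normal(size=state_dim)
mu = np.zeros(action_dim)        # deterministic part, zero for simplicity
a = rng.normal(size=action_dim)  # some observed action

def log_pi(sig):
    # log-likelihood of a under the Gaussian with std sigma_hat(sig, s)
    sig_hat = np.sqrt(np.sum((sig * s[:, None]) ** 2, axis=0))
    return np.sum(-0.5 * ((a - mu) / sig_hat) ** 2
                  - np.log(sig_hat) - 0.5 * np.log(2 * np.pi))

# Analytic gradient from the formula above.
sig_hat = np.sqrt(np.sum((sigma * s[:, None]) ** 2, axis=0))
grad = ((a - mu) ** 2 - sig_hat ** 2) / sig_hat ** 3 \
       * (s[:, None] ** 2 * sigma) / sig_hat

# Central finite differences, element by element.
eps = 1e-6
num = np.zeros_like(sigma)
for i in range(state_dim):
    for j in range(action_dim):
        d = np.zeros_like(sigma)
        d[i, j] = eps
        num[i, j] = (log_pi(sigma + d) - log_pi(sigma - d)) / (2 * eps)

assert np.allclose(grad, num, atol=1e-5)
```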

For gSDE, two improvements are suggested:

  1. We sample the parameters $\theta_{\epsilon}$ of the exploration function every $n$ steps instead of every episode.
  2. Instead of the state $\mathbf{s}$, we can in fact use any features. We choose the policy features $\mathbf{z}_{\mu}\left(\mathbf{s} ; \theta_{\mathbf{z}_{\mu}}\right)$ (the last layer before the deterministic output $\mu(\mathbf{s}) = \theta_{\mu} \mathbf{z}_{\mu}\left(\mathbf{s} ; \theta_{\mathbf{z}_{\mu}}\right)$) as input to the noise function: $\epsilon\left(\mathbf{s} ; \theta_{\epsilon}\right) = \theta_{\epsilon} \mathbf{z}_{\mu}(\mathbf{s})$.
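Both modifications can be sketched together (NumPy; the toy feature extractor, dimensions, and resampling interval are all illustrative, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(3)
obs_dim, feature_dim, action_dim = 4, 8, 2
sigma = 0.3 * np.ones((feature_dim, action_dim))  # learnable in practice
theta_mu = rng.normal(size=(feature_dim, action_dim))
W = rng.normal(size=(obs_dim, feature_dim))       # toy feature extractor

def policy_features(s):
    # z_mu(s): last hidden layer of the policy network (a toy tanh layer here)
    return np.tanh(s @ W)

def gsde_action(s, theta_eps):
    z = policy_features(s)
    # mu(s) = theta_mu^T z and eps(s) = theta_eps^T z both act on the features
    return z @ theta_mu + z @ theta_eps

resample_every = 16   # n: redraw the exploration matrix every n steps
s_fixed = np.ones(obs_dim)  # probe state, to make the resampling visible
actions, theta_eps = [], None
for t in range(48):
    if t % resample_every == 0:
        theta_eps = rng.normal(scale=sigma)  # theta_eps ~ N(0, sigma^2)
    actions.append(gsde_action(s_fixed, theta_eps))

# Within a window of n steps, the same state gives the same action;
# after resampling, the exploration offset changes.
assert np.allclose(actions[0], actions[resample_every - 1])
assert not np.allclose(actions[0], actions[resample_every])
```

Resampling every $n$ steps interpolates between per-episode SDE ($n$ equal to the episode length) and unstructured per-step noise ($n = 1$), while using policy features lets the noise dimensionality follow the network rather than the raw observation.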

Papers Using This Method

Smooth Exploration for Robotic Reinforcement Learning (2020-05-12)