Generalized State-Dependent Exploration (gSDE) is an exploration method for reinforcement learning that extends State-Dependent Exploration by using more general features and by re-sampling the noise periodically.
State-Dependent Exploration (SDE) is an intermediate solution for exploration that consists of adding noise, as a function of the state s_t, to the deterministic action μ(s_t). At the beginning of an episode, the parameters θ_ϵ of that exploration function are drawn from a Gaussian distribution. The resulting action a_t is as follows:
a_t = μ(s_t; θ_μ) + ϵ(s_t; θ_ϵ), θ_ϵ ∼ N(0, σ²)
This episode-based exploration is smoother and more consistent than the unstructured step-based exploration. Thus, during one episode, instead of oscillating around a mean value, the action a for a given state s will be the same.
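The episode-based behavior can be sketched in a few lines. This is a minimal NumPy illustration, not the reference implementation: the linear policy `mu`, the weight matrices, and the noise scale `sigma` are all hypothetical choices made only to show the mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 3, 2

# Hypothetical linear deterministic policy and noise scale, for illustration only.
W_mu = rng.normal(size=(state_dim, action_dim))
sigma = 0.5 * np.ones((state_dim, action_dim))

def mu(s):
    return s @ W_mu

# Exploration parameters: drawn once at the start of the episode, then kept fixed.
theta_eps = rng.normal(0.0, sigma)

def sde_action(s):
    # The noise is a deterministic function of the state given theta_eps,
    # so within the episode the same state always yields the same action.
    return mu(s) + s @ theta_eps

s = np.array([0.1, -0.2, 0.3])
a1 = sde_action(s)
a2 = sde_action(s)
```

Here `a1` and `a2` are identical: unlike step-based Gaussian noise, repeating a state within an episode does not produce a different action.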
In the case of a linear exploration function ϵ(s; θ_ϵ) = θ_ϵ s, by the properties of Gaussian distributions, Rückstieß et al. show that the action element a_j is normally distributed:
π_j(a_j ∣ s) ∼ N(μ_j(s), σ̂_j²)
where σ̂ is a diagonal matrix with elements σ̂_j = √(∑_i (σ_ij s_i)²).
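This identity can be checked numerically by drawing many exploration parameter matrices and comparing the empirical standard deviation of θ_ϵᵀs against the closed form. The dimensions, σ values, and state below are arbitrary, chosen only for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
state_dim, action_dim = 4, 2
sigma = rng.uniform(0.1, 1.0, size=(state_dim, action_dim))
s = rng.normal(size=state_dim)

# Draw 200k exploration matrices theta_eps with theta_ij ~ N(0, sigma_ij)
theta = rng.normal(size=(200_000, state_dim, action_dim)) * sigma
# eps(s) = theta_eps^T s for each draw -> shape (200000, action_dim)
samples = np.einsum("i,nij->nj", s, theta)
empirical_std = samples.std(axis=0)

# Closed form: sigma_hat_j = sqrt(sum_i (sigma_ij * s_i)^2)
sigma_hat = np.sqrt(((sigma * s[:, None]) ** 2).sum(axis=0))
```

With enough samples, `empirical_std` matches `sigma_hat` to within sampling error.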
Because the policy distribution is known, the derivative of the log-likelihood log π(a∣s) with respect to the variance parameters σ can be obtained in closed form.
This derivative can then be plugged into the likelihood-ratio gradient estimator, which makes it possible to adapt σ during training. SDE is therefore compatible with standard policy gradient methods, while addressing most shortcomings of unstructured exploration.
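The adaptation of σ can be illustrated without writing out the analytic derivative: below, central finite differences stand in for ∂ log π(a∣s)/∂σ under the Gaussian distribution derived above. All shapes and values are illustrative, and the analytic form from the paper would replace `grad_log_pi_sigma` in practice.

```python
import numpy as np

def log_pi(a, s, sigma, mu_s):
    """Gaussian log-likelihood of a under N(mu(s), diag(sigma_hat^2)),
    with sigma_hat_j^2 = sum_i (sigma_ij * s_i)^2."""
    var = ((sigma * s[:, None]) ** 2).sum(axis=0)
    return -0.5 * np.sum((a - mu_s) ** 2 / var + np.log(2 * np.pi * var))

def grad_log_pi_sigma(a, s, sigma, mu_s, h=1e-6):
    """Finite-difference stand-in for the closed-form derivative w.r.t. sigma."""
    g = np.zeros_like(sigma)
    for idx in np.ndindex(*sigma.shape):
        d = np.zeros_like(sigma)
        d[idx] = h
        g[idx] = (log_pi(a, s, sigma + d, mu_s)
                  - log_pi(a, s, sigma - d, mu_s)) / (2 * h)
    return g

s = np.array([0.5, -1.0, 0.25])
sigma = 0.4 * np.ones((3, 2))
mu_s = np.zeros(2)
g = grad_log_pi_sigma(mu_s, s, sigma, mu_s)  # evaluated at a = mu(s)
# A likelihood-ratio update would then do something like:
#   sigma += learning_rate * advantage * g
```

At a = μ(s) the density is maximal, so enlarging any σ_ij flattens the Gaussian and lowers the log-likelihood; accordingly, every entry of `g` is negative here.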
For gSDE, two improvements are suggested:
We sample the parameters θ_ϵ of the exploration function every n steps instead of every episode.
Instead of the state s, we can in fact use any features. We choose the policy features z_μ(s; θ_zμ) (the last layer before the deterministic output μ(s) = θ_μ z_μ(s; θ_zμ)) as input to the noise function: ϵ(s; θ_ϵ) = θ_ϵ z_μ(s).
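Putting the two changes together gives the following minimal sketch. The layer sizes, the tanh latent, and the resampling interval `n_resample` are all illustrative assumptions, not the paper's exact architecture or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(2)
obs_dim, latent_dim, action_dim = 3, 8, 2
n_resample = 4  # resample exploration parameters every n steps (assumed value)

# Hypothetical two-layer policy: latent z_mu(s), then a linear head mu = theta_mu z.
W_z = rng.normal(size=(obs_dim, latent_dim))
W_mu = rng.normal(size=(latent_dim, action_dim))
sigma = 0.3 * np.ones((latent_dim, action_dim))

def z_mu(s):
    # Last hidden layer of the policy network, reused as noise features.
    return np.tanh(s @ W_z)

theta_eps = None
actions = []
for step in range(12):
    if step % n_resample == 0:
        # gSDE change 1: periodic resampling instead of once per episode.
        theta_eps = rng.normal(0.0, sigma)
    s = rng.normal(size=obs_dim)
    z = z_mu(s)
    # gSDE change 2: the noise depends on the policy features, not the raw state.
    a = z @ W_mu + z @ theta_eps
    actions.append(a)
```

Periodic resampling interpolates between per-episode SDE (large n) and unstructured per-step noise (n = 1), while the learned features let the noise scale with the policy's own representation rather than the raw observation.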