Thus, we can use the Monte Carlo method to obtain the gradient. But the probability density of the trajectory, p(τ;θ), is still unknown, so let's take it one step further: p(τ;θ) is a product of the environment dynamics, which do not depend on θ, and the policy probabilities, so when we take the gradient of its log the dynamics terms vanish and $\nabla_\theta \log p(\tau;\theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$. This yields the following Monte Carlo algorithm (REINFORCE):
1. Sample a trajectory τ from the environment using the policy πθ(a∣s)
2. For each time step t ∈ {0, 1, ..., T} in τ:
2.1. Calculate the return $R \leftarrow \sum_{i=t}^{T} \gamma^{\,i-t} r_i$
2.2. Calculate the gradient estimate $\nabla_\theta J(\theta) = R\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
2.3. Update $\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$
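As a concrete illustration, here is a minimal Python (NumPy) sketch of one such update. The names `policy_grad_log`, `alpha`, and `gamma` are placeholders assumed for the example, not part of the text above; `policy_grad_log(theta, s, a)` is assumed to return $\nabla_\theta \log \pi_\theta(a \mid s)$ as an array with the same shape as theta.

import numpy as np

def reinforce_update(theta, trajectory, policy_grad_log, alpha=0.01, gamma=0.99):
    """One Monte Carlo policy-gradient (REINFORCE) update from a sampled trajectory.

    trajectory: list of (state, action, reward) tuples, one per time step.
    policy_grad_log(theta, s, a): gradient of log pi_theta(a|s) w.r.t. theta.
    """
    T = len(trajectory)
    for t, (s, a, r) in enumerate(trajectory):
        # Step 2.1: return from time step t, R = sum_{i=t}^{T} gamma^(i-t) * r_i
        R = sum(gamma ** (i - t) * trajectory[i][2] for i in range(t, T))
        # Step 2.2: per-step gradient estimate R * grad log pi(a_t | s_t)
        grad = R * policy_grad_log(theta, s, a)
        # Step 2.3: gradient ascent on J(theta)
        theta = theta + alpha * grad
    return theta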
Policy Gradient with Baseline
We have already obtained an unbiased estimator of $\nabla_\theta J(\theta)$, but its variance can get out of control. We can show that the following estimator is also unbiased for any baseline function b(s) that does not depend on the action a.
$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p(\tau;\theta)}\big[ (R(\tau) - b(s))\, \nabla_\theta \log p(\tau;\theta) \big]$
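To see why subtracting b(s) does not introduce bias, note that the baseline term has zero expectation under the policy (shown here for discrete actions; replace the sum by an integral in the continuous case):

$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[ b(s)\, \nabla_\theta \log \pi_\theta(a \mid s) \right] = b(s) \sum_{a} \pi_\theta(a \mid s)\, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} = b(s)\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0$$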
A good candidate for b(s) is $v_\pi(s)$. The return term can then be replaced by an "advantage" function.
$A(t) = \sum_{i=t}^{T} \gamma^{\,i-t} r_i - v_\pi(s_t)$
Intuitively, the advantage function measures how much better the return obtained by taking an action was than the average return from that state. This reduces the variance: if the returns at two steps are 0 and 100, the corresponding advantages might be only 1 and 2.
1. Sample a trajectory τ from the environment using the policy πθ(a∣s)
2. For each time step t ∈ {0, 1, ..., T} in τ:
2.1. Calculate the return $R = \sum_{i=t}^{T} \gamma^{\,i-t} r_i$
2.2. Calculate the advantage $\delta = R - v(s_t; w)$
2.3. Update $w \leftarrow w + \beta\, \delta\, \nabla_w v(s_t; w)$
2.4. Update $\theta \leftarrow \theta + \alpha\, \delta\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
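Below is a minimal Python (NumPy) sketch of this variant, under the same assumptions as the earlier sketch; `value(w, s)` and `value_grad(w, s)` are hypothetical helpers standing in for the baseline $v(s; w)$ and its gradient with respect to w.

import numpy as np

def reinforce_baseline_update(theta, w, trajectory,
                              policy_grad_log, value, value_grad,
                              alpha=0.01, beta=0.01, gamma=0.99):
    """One REINFORCE update with a learned value-function baseline.

    trajectory: list of (state, action, reward) tuples, one per time step.
    """
    T = len(trajectory)
    for t, (s, a, r) in enumerate(trajectory):
        # Step 2.1: Monte Carlo return from time step t
        R = sum(gamma ** (i - t) * trajectory[i][2] for i in range(t, T))
        # Step 2.2: advantage = return minus baseline v(s_t; w)
        delta = R - value(w, s)
        # Step 2.3: move the baseline towards the observed return
        w = w + beta * delta * value_grad(w, s)
        # Step 2.4: policy update weighted by the advantage
        theta = theta + alpha * delta * policy_grad_log(theta, s, a)
    return theta, w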
The Actor-Critic Method
So far, our policy gradient estimator has been unbiased. To reduce the variance further, we now introduce some bias by bootstrapping: we estimate the return at time step t by $r_t + \gamma\, v_w(s_{t+1})$. As a bonus, we no longer need to sample the whole trajectory before performing an update.
The Actor-Critic Algorithm
Initialise θ, w
Repeat until convergence:
    Assign s a start state: s ← s0
    While state s is not terminal:
        Sample action a from the policy πθ(a∣s)
        Perform action a in the environment to obtain r and s′
        Calculate $\delta = r + \gamma\, v_w(s') - v_w(s)$
        Update $w \leftarrow w + \beta\, \delta\, \nabla_w v_w(s)$
        Update $\theta \leftarrow \theta + \alpha\, \delta\, \nabla_\theta \log \pi_\theta(a \mid s)$
        Assign s ← s′
Note that we do not take the derivative of $r + \gamma\, v_w(s')$ with respect to w, because it is treated as the fixed target of our regression problem.
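A minimal Python (NumPy) sketch of one online actor-critic step follows, with the same hypothetical helpers as in the earlier sketches. The bootstrapped target is computed as a plain number, so no gradient flows through it, matching the remark above; the `done` flag is an assumption (not in the pseudo-code above) that follows the common convention of dropping the bootstrap term at a terminal state.

import numpy as np

def actor_critic_step(theta, w, s, a, r, s_next, done,
                      policy_grad_log, value, value_grad,
                      alpha=0.01, beta=0.01, gamma=0.99):
    """One online actor-critic update after observing (s, a, r, s')."""
    # Bootstrapped target r + gamma * v_w(s'): treated as a fixed regression target.
    target = r if done else r + gamma * value(w, s_next)
    # TD error, used as the advantage estimate
    delta = target - value(w, s)
    # Critic update (value function)
    w = w + beta * delta * value_grad(w, s)
    # Actor update (policy)
    theta = theta + alpha * delta * policy_grad_log(theta, s, a)
    return theta, w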
Softmax Function for a Discrete-Action Policy
Let the action space have k discrete actions {a_1, a_2, ..., a_i, ..., a_k}. The stochastic policy gives the probability of the i-th action given the state s, parameterised by θ:

$\pi(a_i \mid s; \theta) = \dfrac{e^{\phi_i(x,\theta)}}{\sum_{j=1}^{k} e^{\phi_j(x,\theta)}}$

where x is a feature vector of the state s, and ϕ(x,θ) is a neural network parameterised by θ that outputs a vector of preferences (logits) over the k actions.
By the chain rule, $\nabla_\theta \pi(a_i \mid s;\theta) = \sum_{j=1}^{k} \frac{\partial \pi_i}{\partial \phi_j}\, \nabla_\theta \phi_j(x,\theta)$ (writing $\pi_i$ for $\pi(a_i \mid s;\theta)$).

For $j = i$, the term is $\dfrac{e^{\phi_i} \sum_l e^{\phi_l} - e^{\phi_i} e^{\phi_j}}{\left(\sum_l e^{\phi_l}\right)^2}\, \nabla_\theta \phi_j(x,\theta) = \pi_i (1 - \pi_j)\, \nabla_\theta \phi_j(x,\theta)$

For $j \neq i$, the term is $\dfrac{0 - e^{\phi_i} e^{\phi_j}}{\left(\sum_l e^{\phi_l}\right)^2}\, \nabla_\theta \phi_j(x,\theta) = \pi_i (0 - \pi_j)\, \nabla_\theta \phi_j(x,\theta)$

Thus, summing over j, $\nabla_\theta \pi(a_i \mid s;\theta) = \pi_i\, (I_i - \pi)^{\top}\, \nabla_\theta \phi(x,\theta)$
where $I_i \in \mathbb{R}^k$ is the vector with 1 at the i-th element and 0 elsewhere, and π is the vector of the k action probabilities.
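Dividing the result above by $\pi_i$ gives the convenient form $\nabla_\theta \log \pi(a_i \mid s;\theta) = (I_i - \pi)^{\top}\, \nabla_\theta \phi(x,\theta)$, i.e. the gradient of the log-probability with respect to the preferences is simply $I_i - \pi$. The following Python (NumPy) sketch implements this and checks it against a finite-difference approximation; the function names and test values are illustrative.

import numpy as np

def softmax(phi):
    z = np.exp(phi - phi.max())        # subtract the max for numerical stability
    return z / z.sum()

def grad_log_pi_wrt_logits(phi, i):
    """Gradient of log pi(a_i) with respect to the preference vector phi.

    From d pi_i / d phi_j = pi_i (delta_ij - pi_j), it follows that
    d log pi_i / d phi = I_i - pi.
    """
    pi = softmax(phi)
    one_hot = np.zeros_like(phi)
    one_hot[i] = 1.0
    return one_hot - pi

# Finite-difference check of the formula (illustrative values)
phi = np.array([0.5, -1.0, 2.0])
i, eps = 0, 1e-6
numeric = np.array([
    (np.log(softmax(phi + eps * e)[i]) - np.log(softmax(phi - eps * e)[i])) / (2 * eps)
    for e in np.eye(len(phi))
])
print(np.allclose(numeric, grad_log_pi_wrt_logits(phi, i), atol=1e-5))  # True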
Gaussian Density Function for a Continuous-Action Policy
Let the action space have k continuous action variables, $a \in \mathbb{R}^k$, $a = [a_1, a_2, ..., a_i, ..., a_k]^{\top}$. The stochastic policy $\pi \in \mathbb{R}$ is a probability density over the action variables, parameterised by θ. The density is a multivariate Gaussian with a mean $\mu \in \mathbb{R}^k$ that varies with the state and a constant covariance $\Sigma \in \mathbb{R}^{k \times k}$.
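As a hedged sketch of how such a policy plugs into the updates above, the Python (NumPy) snippet below assumes a diagonal covariance $\Sigma = \mathrm{diag}(\sigma^2)$ with fixed σ and a mean produced elsewhere by the policy network; for a Gaussian density, $\nabla_\mu \log \pi(a \mid s) = \Sigma^{-1}(a - \mu)$, which is then chained with the gradient of the mean network to obtain $\nabla_\theta \log \pi$.

import numpy as np

def sample_action(mu, sigma):
    """Sample a ~ N(mu, diag(sigma^2)) for a k-dimensional continuous action."""
    return mu + sigma * np.random.randn(len(mu))

def grad_log_pi_wrt_mu(a, mu, sigma):
    """Gradient of log N(a; mu, diag(sigma^2)) with respect to the mean mu.

    For a diagonal covariance, Sigma^{-1} (a - mu) reduces to (a - mu) / sigma^2.
    """
    return (a - mu) / sigma ** 2

mu = np.array([0.2, -0.5])      # mean output by the policy network (illustrative values)
sigma = np.array([0.3, 0.3])    # fixed standard deviations (constant covariance)
a = sample_action(mu, sigma)
print(grad_log_pi_wrt_mu(a, mu, sigma))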