For an event $A$, its Shannon Information Content is defined as $\log \frac{1}{p(A)}$.
Intuitively, it measures the "surprise" or "unlikeliness" of an event or an outcome of a probability distribution. An event that always happens has Shannon Information Content 0, while an event with probability 0 gives infinity.
Shannon Entropy
We can think of Shannon Entropy $H(X)$ as a measure of the "surprise", uncertainty, or randomness of a random variable $X$ or the distribution $p(x)$ associated with it.
$$H(p(x)) = \sum_x p(x) \log \frac{1}{p(x)} = \mathbb{E}_{x \sim p(x)}\left[\log \frac{1}{p(x)}\right]$$
By looking at the equation, we can interpret it as the expectation of how "surprising" the outcomes of the distribution $p(x)$ are, when sampling the outcomes from the distribution $p(x)$ itself.
Intuitively, discrete distributions with more possible outcomes have higher entropy (randomness). A distribution that assigns equal likelihood to each outcome also has higher entropy (uncertainty) than one with biased likelihoods.
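As a quick illustration, here is a minimal sketch (my own helper, using the natural logarithm and NumPy) showing that a uniform distribution has higher entropy than a heavily biased one:

```python
import numpy as np

def shannon_entropy(p):
    """H(p) = sum_x p(x) * log(1 / p(x)); zero-probability outcomes contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log(1.0 / p)))

print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform: log(4) ≈ 1.386
print(shannon_entropy([0.97, 0.01, 0.01, 0.01]))  # biased: ≈ 0.17, much lower
```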
KL Divergence
The KL Divergence measures how much two distributions $p(x)$ and $q(x)$ diverge from each other.
Similar to Shannon Entropy, we can use the following formula to measure the expected "surprise" or "unlikeliness" under a distribution $q(x)$, but with samples taken from another distribution $p(x)$.
$$\mathbb{E}_{x \sim p(x)}\left[\log \frac{1}{q(x)}\right]$$
If $q(x) \equiv p(x)$, this just gives the entropy $H(p)$ of $p(x)$. But it will always be greater (more randomness) than $H(p)$ when $p \neq q$.
The KL Divergence $D_{KL}(p \,\|\, q)$ measures the excess of this "surprise" over the baseline $H(p)$:

$$D_{KL}(p \,\|\, q) = \mathbb{E}_{x \sim p(x)}\left[\log \frac{1}{q(x)}\right] - H(p) = \mathbb{E}_{x \sim p(x)}\left[\log \frac{p(x)}{q(x)}\right]$$
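A small numerical sketch of this relationship, assuming discrete distributions with strictly positive probabilities (the quantity $\mathbb{E}_{x \sim p}[\log \frac{1}{q(x)}]$ is the cross-entropy; the helper names below are mine):

```python
import numpy as np

def entropy(p):
    """H(p) = E_{x~p}[log(1/p(x))]."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.log(1.0 / p)))

def cross_entropy(p, q):
    """E_{x~p}[log(1/q(x))]: expected surprise under q, sampling from p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(1.0 / q)))

def kl_divergence(p, q):
    """D_KL(p || q) = cross_entropy(p, q) - H(p); always >= 0."""
    return cross_entropy(p, q) - entropy(p)

p, q = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
print(kl_divergence(p, q))  # positive excess "surprise"
print(kl_divergence(p, p))  # ~0 (up to floating-point error)
```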
Thus, we can use the Monte Carlo method to obtain the gradient. But the probability density of the trajectory $p(\tau; \theta)$ is still unknown. Let's derive it further: since the environment dynamics do not depend on $\theta$, we get $\nabla_\theta \log p(\tau; \theta) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$, which gives the following algorithm.
1. Sample a trajectory $\tau$ from the environment and $\pi_\theta(a \mid s)$
2. For each time step $t \in \{0, 1, ..., T\}$ in $\tau$:
   2.1. Calculate the return $R \leftarrow \sum_{i=t}^{T} \gamma^{i-t} r_i$
   2.2. Calculate the gradient $\nabla_\theta J(\theta) = R \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
   2.3. Update $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
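Below is a minimal sketch of these steps, assuming a linear softmax policy with preferences $\phi = \theta s$ and a tiny hand-made trajectory standing in for samples from a real environment (all names here are illustrative, not from the text above):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    """grad_theta log pi_theta(a|s) for a linear softmax policy, phi = theta @ s."""
    pi = softmax(theta @ s)
    one_hot = np.zeros(len(pi))
    one_hot[a] = 1.0
    return np.outer(one_hot - pi, s)  # same shape as theta

def reinforce_update(theta, states, actions, rewards, alpha=0.01, gamma=0.99):
    T = len(rewards)
    for t in range(T):
        # Return from time t: R = sum_{i=t}^{T} gamma^(i-t) * r_i
        R = sum(gamma ** (i - t) * rewards[i] for i in range(t, T))
        theta = theta + alpha * R * grad_log_pi(theta, states[t], actions[t])
    return theta

# Toy trajectory: 2 actions, 3 state features
theta = np.zeros((2, 3))
states = [np.array([1.0, 0.0, 0.5]), np.array([0.0, 1.0, 0.5])]
actions = [0, 1]
rewards = [0.0, 1.0]
theta = reinforce_update(theta, states, actions, rewards)
```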
Policy Gradient with Baseline
We have already obtained an unbiased estimator for the mean of $\nabla_\theta J(\theta)$, but its variance can grow out of control. We can prove that the following estimator is also unbiased for any baseline function $b(s)$ that is independent of $a$.
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p(\tau;\theta)}\left[(R(\tau) - b(s)) \, \nabla_\theta \log p(\tau;\theta)\right]$$
A good candidate for $b(s)$ is $v_\pi(s)$. The return term can then be replaced with an "advantage" function.
$$A(t) = \sum_{i=t}^{T} \gamma^{i-t} r_i - v_\pi(s_t)$$
Intuitively, we can think of the advantage function as how much better the return obtained by an action is than the average return from that state. The variance is reduced this way. Say the returns at two steps are 0 and 100; the advantages might then be only 1 and 2.
1. Sample a trajectory $\tau$ from the environment and $\pi_\theta(a \mid s)$
2. For each time step $t \in \{0, 1, ..., T\}$ in $\tau$:
   2.1. Calculate the return $R = \sum_{i=t}^{T} \gamma^{i-t} r_i$
   2.2. Calculate $\delta = R - v(s_t; w)$
   2.3. Update $w \leftarrow w + \beta \delta \nabla_w v(s_t; w)$
   2.4. Update $\theta \leftarrow \theta + \alpha \delta \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
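A sketch of the per-step update with the baseline, assuming a linear value function $v(s; w) = w^\top s$ (so $\nabla_w v(s; w) = s$); `grad_log_pi` stands for any callable returning $\nabla_\theta \log \pi_\theta(a \mid s)$, such as the softmax version sketched earlier:

```python
import numpy as np

def reinforce_with_baseline(theta, w, states, actions, rewards, grad_log_pi,
                            alpha=0.01, beta=0.1, gamma=0.99):
    """One pass over a sampled trajectory, updating both theta and w."""
    T = len(rewards)
    for t in range(T):
        R = sum(gamma ** (i - t) * rewards[i] for i in range(t, T))
        delta = R - w @ states[t]                  # delta = R - v(s_t; w)
        w = w + beta * delta * states[t]           # grad_w v(s_t; w) = s_t for a linear v
        theta = theta + alpha * delta * grad_log_pi(theta, states[t], actions[t])
    return theta, w
```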
The Actor-Critic Method
So far, our policy gradient estimator was unbiased. Now, in order to further reduce the variance, we start to introduce some bias by bootstrapping. We estimate the return at time step $t$ by $r_t + \gamma v_w(s_{t+1})$. And we don't need to sample the whole trajectory to perform an update.
The Actor-Critic Algorithm
Init θ, w
Repeat until convergence:
    Assign $s$ a start state: $s \leftarrow s_0$
    While the state $s$ is not terminal:
        Sample an action $a$ from the policy $\pi_\theta(a \mid s)$
        Perform action $a$ in the environment to get $r$ and $s'$
        Calculate $\delta = r + \gamma v_w(s') - v_w(s)$
        Update $w \leftarrow w + \beta \delta \nabla_w v_w(s)$
        Update $\theta \leftarrow \theta + \alpha \delta \nabla_\theta \log \pi_\theta(a \mid s)$
        Assign $s \leftarrow s'$
Note that we didn't take the derivative of $r + \gamma v_w(s')$ with respect to $w$, because it is the target of our regression problem.
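A sketch of one episode of this loop, assuming a linear critic $v_w(s) = w^\top s$, a linear softmax actor, and an environment object whose `step(a)` returns `(next_state, reward, done)` (a simplified interface assumed just for this sketch):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def actor_critic_episode(env, theta, w, alpha=0.01, beta=0.1, gamma=0.99):
    s = env.reset()
    done = False
    while not done:
        pi = softmax(theta @ s)                    # policy pi_theta(. | s)
        a = np.random.choice(len(pi), p=pi)
        s_next, r, done = env.step(a)
        # Bootstrapped target; the value of a terminal state is taken as 0
        target = r if done else r + gamma * (w @ s_next)
        delta = target - w @ s
        # Critic update: the target is treated as a constant (no gradient through it)
        w = w + beta * delta * s
        # Actor update: grad_theta log pi_theta(a|s) for a linear softmax policy
        one_hot = np.zeros(len(pi))
        one_hot[a] = 1.0
        theta = theta + alpha * delta * np.outer(one_hot - pi, s)
        s = s_next
    return theta, w
```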
Softmax Function for Discrete Actions Policy
Let the action space have $k$ discrete actions $\{a_1, a_2, ..., a_i, ..., a_k\}$. The stochastic policy gives the probability of the $i$th action given the state $s$, parameterised by $\theta$:

$$\pi(a_i \mid s; \theta) = \frac{e^{\phi_i(x,\theta)}}{\sum_{j=1}^{k} e^{\phi_j(x,\theta)}}$$
where $x$ is a feature vector of the state $s$, and $\phi(x,\theta)$ is a neural network parameterised by $\theta$ that outputs a vector representing the preferences of the $k$ actions.
Taking the contribution to the gradient through each preference $\phi_j$ (writing $\Sigma = \sum_{j=1}^{k} e^{\phi_j(x,\theta)}$):

For $j = i$:
$$\frac{\partial \pi(a_i \mid s;\theta)}{\partial \phi_j} \nabla_\theta \phi_j(x,\theta) = \frac{e^{\phi_i}\Sigma - e^{\phi_i}e^{\phi_j}}{\Sigma^2} \nabla_\theta \phi_j(x,\theta) = \pi_i (1 - \pi_j) \nabla_\theta \phi_j(x,\theta)$$

For $j \neq i$:
$$\frac{\partial \pi(a_i \mid s;\theta)}{\partial \phi_j} \nabla_\theta \phi_j(x,\theta) = \frac{0 \cdot \Sigma - e^{\phi_i}e^{\phi_j}}{\Sigma^2} \nabla_\theta \phi_j(x,\theta) = \pi_i (0 - \pi_j) \nabla_\theta \phi_j(x,\theta)$$

Thus, summing over $j$:
$$\nabla_\theta \pi(a_i \mid s;\theta) = \pi_i (I_i - \pi)^\top \nabla_\theta \phi(x,\theta)$$
where $I_i \in \mathbb{R}^k$ is a vector with 1 at the $i$th element and 0 elsewhere. It follows that $\nabla_\theta \log \pi(a_i \mid s;\theta) = (I_i - \pi)^\top \nabla_\theta \phi(x,\theta)$, which is the score term used in the policy gradient updates above.
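To double-check the derivative, here is a quick numerical sketch comparing $\partial \pi_i / \partial \phi_j = \pi_i(\delta_{ij} - \pi_j)$ against central finite differences on a random preference vector (all values are arbitrary):

```python
import numpy as np

def softmax(phi):
    e = np.exp(phi - phi.max())
    return e / e.sum()

rng = np.random.default_rng(0)
phi = rng.normal(size=4)   # action preferences phi(x, theta)
pi = softmax(phi)
i = 2                      # pick the i-th action

analytic = pi[i] * (np.eye(4)[i] - pi)   # pi_i * (I_i - pi)

eps = 1e-6
numeric = np.array([
    (softmax(phi + eps * np.eye(4)[j])[i] - softmax(phi - eps * np.eye(4)[j])[i]) / (2 * eps)
    for j in range(4)
])
print(np.allclose(analytic, numeric, atol=1e-8))  # True
```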
Gaussian Density Function for Continuous Action Policy
Let the action space have $k$ continuous action variables, $a \in \mathbb{R}^k$, $a = [a_1, a_2, ..., a_i, ..., a_k]^\top$. The stochastic policy function $\pi \in \mathbb{R}$ is a probability density over the action variables, parameterised by $\theta$. The probability density function is a multivariate Gaussian distribution with a variable mean $\mu \in \mathbb{R}^k$ and a constant covariance $\Sigma \in \mathbb{R}^{k \times k}$.
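For this Gaussian policy, the score with respect to the mean is $\nabla_\mu \log \pi(a \mid s) = \Sigma^{-1}(a - \mu)$, which is then chained with $\nabla_\theta \mu$ in the policy gradient. A small sketch with placeholder values for $\mu$ and $\Sigma$ (both are illustrative assumptions, not from the text):

```python
import numpy as np

k = 2
mu = np.array([0.5, -0.3])        # mean output by the parameterised policy for state s
Sigma = np.diag([0.1, 0.2])       # constant covariance

a = np.random.multivariate_normal(mu, Sigma)      # sample an action a ~ N(mu, Sigma)
grad_mu_log_pi = np.linalg.solve(Sigma, a - mu)   # Sigma^{-1} (a - mu)
print(a, grad_mu_log_pi)
```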