Monday, June 25, 2018

Shannon Entropy and KL Divergence

Shannon Information Content

For an event A, its Shannon Information Content is defined as:
\log \frac{1}{p(A)}

Intuitively, it measures the “surprise” or “unlikeliness” of an event or of an outcome of a probability distribution. An event that always happens has Shannon Information Content 0, while an event with probability 0 gives infinity.

Shannon Entropy

We can think of Shannon Entropy H(X) as a measurement of the “surprise”, uncertainty, or randomness of a random variable X, or of the distribution p(x) associated with it.

H(p(x)) = \sum_x p(x) \log \frac{1}{p(x)} = E_{x \sim p(x)}\left[\log \frac{1}{p(x)}\right]

Looking at the equation, we can interpret it as the expected “surprise” of the outcomes of the distribution p(x), where the outcomes are sampled from the distribution p(x) itself.

Intuitively, discrete distributions with more possible outcomes will have higher Entropy (more randomness). A distribution that assigns equal likelihood to each outcome will also have higher Entropy (more uncertainty) than a distribution with biased likelihoods.
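As a quick numerical illustration (a minimal NumPy sketch, not from the original post; `entropy` is just an illustrative helper):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H(p) = sum_x p(x) * log(1/p(x)); zero-probability outcomes contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log(1.0 / p)) / np.log(base))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: four equally likely outcomes
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~1.36 bits: same outcomes, biased, so less uncertainty
print(entropy([1.0]))                     # 0.0 bits: an event that always happens
```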

KL Divergence

The KL Divergence measures how much two distributions p(x) and q(x) diverge from each other.

Similar to Shannon Entropy, we can use the following formula to measure the expected “surprise” or “unlikeliness” of a distribution q(x), while taking samples from another distribution p(x).

E_{x \sim p(x)}\left[\log \frac{1}{q(x)}\right]

If q(x) \equiv p(x), this just gives the entropy H of p(x). But it will always be greater (more randomness) than H(p) when p \ne q.

The KL Divergence D_{KL}(p||q) measures the excess of this “surprise” over the baseline of H(p).

D_{KL}(p||q) = E_{x \sim p(x)}\left[\log \frac{1}{q(x)}\right] - E_{x \sim p(x)}\left[\log \frac{1}{p(x)}\right]

= \sum_x p(x) \left(\log \frac{1}{q(x)} - \log \frac{1}{p(x)}\right) = \sum_x p(x) \log \frac{p(x)}{q(x)}
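The same sum can be written directly in code (a hedged NumPy sketch; `kl_divergence` is an illustrative helper that treats terms with p(x) = 0 as contributing nothing and returns infinity when p puts mass where q has none):

```python
import numpy as np

def kl_divergence(p, q, base=2):
    """D_KL(p||q) = sum_x p(x) * log(p(x)/q(x)), summed over the support of p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    support = p > 0                      # outcomes with p(x) = 0 contribute 0 to the sum
    if np.any(q[support] == 0):
        return float('inf')              # p assigns probability to an outcome q says is impossible
    return float(np.sum(p[support] * np.log(p[support] / q[support])) / np.log(base))
```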

Note that D_{KL}(p||q) \ne D_{KL}(q||p).

Suppose we have two experiments:

  1. Flipping a fair coin, where head and tail each have probability 0.5
  2. Flipping a biased coin that always returns head

Let p(x) and q(x) be the distributions of the outcomes of the two experiments respectively.

H(p) = 1; \quad H(q) = 0

E_{x \sim q(x)}\left[\log_2 \frac{1}{p(x)}\right] = 1

because p(head) = 0.5. But,

E_{x \sim p(x)}\left[\log \frac{1}{q(x)}\right] \rightarrow \infty

because q(tail) = 0.
Thus, D_{KL}(q||p) equals 1, but D_{KL}(p||q) explodes to infinity.
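Using the `kl_divergence` sketch from above, the coin example can be checked numerically:

```python
p = [0.5, 0.5]   # fair coin: [head, tail]
q = [1.0, 0.0]   # biased coin that always returns head

print(kl_divergence(q, p))   # 1.0: one extra bit of "surprise"
print(kl_divergence(p, q))   # inf: p puts mass on tail, which q says is impossible
```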

Thursday, June 21, 2018

Policy Based Reinforcement Learning (I)

Pre-requisites

  1. Unsupervised Learning
  2. Value Based Reinforcement Learning

Basic Idea

Symbol Meanings
s_i \in S: the states
a_i \in A: the actions
\tau: a sampled trajectory (s_0, a_0, r_0, .., s_t, a_t, r_t, .., s_T, a_T, r_T)
p(\tau;\theta): the probability density of trajectory \tau, parameterised by \theta
\pi_{\theta}(a|s): the stochastic policy given state s, parameterised by \theta
\gamma: the reward discount rate
R(\tau): the total reward of the trajectory \tau, i.e. \sum_{t=0}^{T} \gamma^{t} r_t

We define an objective function:
J(\theta) = E_{\tau \sim p(\tau;\theta)}[R(\tau)] = \int_\tau R(\tau)\, p(\tau;\theta)\, d\tau

that is, the expected value of R(\tau) over \tau sampled from p(\tau;\theta).

We can find the optimal policy parameter \theta by ascending this gradient:
\triangledown_\theta J(\theta) = \int_\tau R(\tau)\, \triangledown_\theta p(\tau;\theta)\, d\tau

We can show analytically that this gradient is itself an expected value of some quantity.

\triangledown_\theta p(\tau;\theta) = p(\tau;\theta) \frac{\triangledown_\theta p(\tau;\theta)}{p(\tau;\theta)} = p(\tau;\theta)\, \triangledown_\theta \log p(\tau;\theta)

\triangledown_\theta J(\theta) = \int_\tau R(\tau)\, (\triangledown_\theta \log p(\tau;\theta))\, p(\tau;\theta)\, d\tau

= E_{\tau \sim p(\tau;\theta)} [R(\tau)\, \triangledown_\theta \log p(\tau;\theta)]

Thus, we can use the Monte Carlo method to estimate the gradient. But the probability density of the trajectory, p(\tau;\theta), is still unknown. Let’s derive it further.

p(\tau;\theta) = P(s_0) \prod_{t \ge 0} P(s_{t+1}, r_{t+1} | s_t, a_t)\, \pi_{\theta}(a_t | s_t)

\log p(\tau;\theta) = \log P(s_0) + \sum_{t \ge 0} \left[\log P(s_{t+1}, r_{t+1} | s_t, a_t) + \log \pi_{\theta}(a_t | s_t)\right]

\triangledown_\theta \log p(\tau;\theta) = \sum_{t \ge 0} \triangledown_\theta \log \pi_{\theta}(a_t | s_t)

\triangledown_\theta J(\theta) = E_{\tau \sim p(\tau;\theta)} \left[R(\tau) \sum_{t \ge 0} \triangledown_\theta \log \pi_{\theta}(a_t | s_t)\right]

Now, the gradient depends only on the stochastic policy function and the total reward.

Variance Reduction by Causality

As the rewards received before an action have nothing to do with that action, we can rewrite the gradient as:

\triangledown_\theta J(\theta) = E_{\tau \sim p(\tau;\theta)} \left[\sum_{t \ge 0} \triangledown_\theta \log \pi_{\theta}(a_t | s_t) \sum_{t' \ge t} \gamma^{t'-t} r_{t'}\right]

The REINFORCE Algorithm

Repeat until convergence:

  1. Sample a trajectory \tau from the environment and \pi_\theta(a|s)
  2. For each time step t \in \{0, 1, ..., T\} in \tau:
    2.1. Calculate the state return R \leftarrow \sum_{i=t}^T \gamma^{i-t} r_i
    2.2. Calculate the gradient \triangledown_\theta J(\theta) = R\, \triangledown_\theta \log \pi_\theta(a_t|s_t)
    2.3. Update \theta \leftarrow \theta + \alpha \triangledown_\theta J(\theta)
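Below is a self-contained NumPy sketch of this loop on a made-up two-state MDP (the environment, the `step` function, and the linear softmax parameterisation are illustrative assumptions, not part of the original post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP: action 1 in state 0 moves to the terminal state 1 with reward 1.
N_STATES, N_ACTIONS, GAMMA, ALPHA = 2, 2, 0.99, 0.1
theta = np.zeros((N_STATES, N_ACTIONS))            # one preference per (state, action)

def policy(s, theta):
    """Softmax policy pi_theta(a|s) over the action preferences theta[s]."""
    prefs = theta[s] - theta[s].max()
    probs = np.exp(prefs)
    return probs / probs.sum()

def step(s, a):
    """Toy dynamics: returns (next_state, reward, done)."""
    if s == 0 and a == 1:
        return 1, 1.0, True
    return 0, 0.0, False

for episode in range(200):
    # 1. Sample a trajectory tau from the environment and pi_theta(a|s)
    s, trajectory, done, t = 0, [], False, 0
    while not done and t < 20:
        a = rng.choice(N_ACTIONS, p=policy(s, theta))
        s_next, r, done = step(s, a)
        trajectory.append((s, a, r))
        s, t = s_next, t + 1

    # 2. For each time step, compute the return and ascend the gradient
    for t, (s, a, r) in enumerate(trajectory):
        R = sum(GAMMA ** (i - t) * ri for i, (_, _, ri) in enumerate(trajectory) if i >= t)
        grad_log_pi = -policy(s, theta)            # d log pi(a|s) / d theta[s] = I_a - pi for a softmax
        grad_log_pi[a] += 1.0
        theta[s] += ALPHA * R * grad_log_pi        # theta <- theta + alpha * R * grad log pi

print(policy(0, theta))                            # action 1 in state 0 should dominate
```

In practice the policy would be a neural network and the gradient would come from automatic differentiation; the tabular version above only illustrates the update rule.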

Policy Gradient with Baseline

We have already obtained an unbiased estimator for \triangledown_\theta J(\theta), but its variance can be very large. We can prove that the following estimator is also unbiased for any baseline function b(s) that is independent of a.

\triangledown_\theta J(\theta) = E_{\tau \sim p(\tau;\theta)} [(R(\tau) - b(s))\, \triangledown_\theta \log p(\tau;\theta)]

A good candidate for b(s) is v^\pi(s). The reward term can then be replaced with an “advantage” function.

A(t) = \sum_{i=t}^T \gamma^{i-t} r_i - v^\pi(s_t)

Intuitively, we can think of the advantage function as how much better the return obtained after taking an action is than the average return from that state. This reduces the variance. Say the returns at two steps are 0 and 100; the corresponding advantages might be only 1 and 2.

Reference:
[Sutton & Barto RL Book, 2018]
[UC Berkeley CS294, Lectures 4 & 5, Fall 2017]

REINFORCE Algorithm with Baseline

Init \theta, w
Repeat until convergence:

  1. Sample a trajectory \tau from the environment and \pi_\theta(a|s)
  2. For each time step t \in \{0, 1, ..., T\} in \tau:
    2.1. Calculate the state return R = \sum_{i=t}^T \gamma^{i-t} r_i
    2.2. Calculate \delta = R - v(s_t; w)
    2.3. Update w \leftarrow w + \beta \delta \triangledown_w v(s_t; w)
    2.4. Update \theta \leftarrow \theta + \alpha \delta \triangledown_\theta \log \pi_\theta(a_t|s_t)
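Relative to the plain REINFORCE sketch above, only the per-time-step update changes. Here is a minimal sketch of that inner update under the same illustrative assumptions (the `policy` helper and softmax preferences `theta` from the earlier sketch, plus a tabular value function whose weights `w` are one entry per state):

```python
import numpy as np

def reinforce_baseline_updates(trajectory, theta, w, policy, gamma=0.99, alpha=0.1, beta=0.1):
    """One pass over a sampled trajectory of (s, a, r) tuples, updating theta and w in place."""
    for t, (s, a, r) in enumerate(trajectory):
        R = sum(gamma ** (i - t) * ri for i, (_, _, ri) in enumerate(trajectory) if i >= t)
        delta = R - w[s]                       # 2.2  delta = R - v(s_t; w)
        w[s] += beta * delta                   # 2.3  grad_w v(s_t; w) is 1 for a tabular value function
        grad_log_pi = -policy(s, theta)        # 2.4  theta <- theta + alpha * delta * grad log pi
        grad_log_pi[a] += 1.0
        theta[s] += alpha * delta * grad_log_pi
    return theta, w
```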

The Actor-Critic Method

So far, our policy gradient estimator has been unbiased. Now, in order to further reduce the variance, we introduce some bias by bootstrapping: we estimate the return at time step t by r_t + \gamma v_w(s_{t+1}). With this, we no longer need to sample the whole trajectory to perform an update.

The Actor-Critic Algorithm

Init \theta, w
Repeat until convergence:
\quad Assign s a start state: s \leftarrow s_0
\quad While state s is not terminal:
\quad \quad Sample action a from the policy \pi_\theta(a|s)
\quad \quad Perform action a in the environment to get r and s'
\quad \quad Calculate \delta = r + \gamma v_w(s') - v_w(s)
\quad \quad Update w \leftarrow w + \beta \delta \triangledown_w v_w(s)
\quad \quad Update \theta \leftarrow \theta + \alpha \delta \triangledown_\theta \log \pi_\theta(a|s)
\quad \quad Assign s \leftarrow s'

Somehow, we didn’t take the derivative of r+γvw(s)r + \gamma v_w(s') w.r.t. ww, cos it was the target of our regression problem.

Softmax Function for Discrete Actions Policy

Let the action space have k discrete actions \{a_1, a_2, .., a_i, .., a_k\}. The stochastic policy denotes the probability of the i^{th} action given the state s, parameterised by \theta.
\pi(a_i | s; \theta) = \frac{e^{\phi_i(x, \theta)}}{\sum_{j=1}^{k} e^{\phi_j(x, \theta)}}

where x is a feature vector of the state s, and \phi(x, \theta) is a neural network parameterised by \theta that outputs a vector representing the preferences of the k actions.

For j = i,
\triangledown_\theta \pi(a_i | s; \theta) = \frac{e^{\phi_i} \sum_m e^{\phi_m} - e^{\phi_i} e^{\phi_j}}{\left(\sum_m e^{\phi_m}\right)^2} \triangledown_\theta \phi_j(x,\theta) = \pi_i (1 - \pi_j)\, \triangledown_\theta \phi_j(x,\theta)

For j \ne i,
\triangledown_\theta \pi(a_i | s; \theta) = \frac{0 - e^{\phi_i} e^{\phi_j}}{\left(\sum_m e^{\phi_m}\right)^2} \triangledown_\theta \phi_j(x,\theta) = \pi_i (0 - \pi_j)\, \triangledown_\theta \phi_j(x,\theta)

Thus,
\triangledown_\theta \pi(a_i | s; \theta) = \pi_i (\mathbb{I}_{i} - \pi)\, \triangledown_\theta \phi(x,\theta)

where \mathbb{I}_i \in \Re^k is a vector with 1 at the i^{th} element and 0 elsewhere.
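The identity \pi_i(\mathbb{I}_i - \pi) for the softmax derivative can be checked against finite differences (a quick NumPy sketch; the preference vector `phi` stands in for the network output \phi(x, \theta)):

```python
import numpy as np

def softmax(phi):
    z = np.exp(phi - phi.max())
    return z / z.sum()

phi = np.array([0.5, -1.0, 2.0])       # stand-in for the network outputs phi(x, theta)
pi = softmax(phi)
k, eps = len(phi), 1e-6

# Analytic Jacobian: d pi_i / d phi_j = pi_i * (I[i == j] - pi_j)
analytic = pi[:, None] * (np.eye(k) - pi[None, :])

# Central finite differences for each phi_j
numeric = np.zeros((k, k))
for j in range(k):
    bump = np.zeros(k)
    bump[j] = eps
    numeric[:, j] = (softmax(phi + bump) - softmax(phi - bump)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # close to zero: the analytic derivative matches
```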

Reference:
The Softmax function and its derivative

Gaussian Density Function for Continuous Action Policy

Let the action space have k continuous action variables, a \in \Re^k, a = [a_1, a_2, .., a_i, .., a_k]^T. The stochastic policy \pi \in \Re is a probability density over the action variables, parameterised by \theta. The density function is a multivariate Gaussian distribution with a variable mean \mu \in \Re^k and a constant covariance \Sigma \in \Re^{k \times k}.

\pi(a | s; \theta) = \mathcal{N}(a; \mu(x, \theta), \Sigma) = \frac{1}{\sqrt{(2\pi)^k |\Sigma|}} e^{-\frac{1}{2}(a-\mu)^T \Sigma^{-1} (a-\mu)}

\log \pi(a | s; \theta) = -\frac{1}{2}(a-\mu)^T \Sigma^{-1} (a-\mu) + \text{const}

\triangledown_\theta \log \pi(a | s; \theta) = \left(\Sigma^{-1}(a-\mu)\right)^T \triangledown_\theta \mu(x; \theta)

where the mean \mu(x;\theta) is produced by a neural network parameterised by \theta that takes as input the feature vector x extracted from the state s.
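Similarly, the gradient of the Gaussian log-density with respect to the mean, \Sigma^{-1}(a-\mu), can be verified numerically (a NumPy sketch; the fixed diagonal \Sigma and the randomly drawn a and \mu are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3
Sigma = np.diag([0.5, 1.0, 2.0])         # constant covariance
Sigma_inv = np.linalg.inv(Sigma)
a, mu = rng.normal(size=k), rng.normal(size=k)

def log_pi(a, mu):
    """log N(a; mu, Sigma) up to an additive constant."""
    d = a - mu
    return -0.5 * d @ Sigma_inv @ d

# Analytic gradient with respect to mu
analytic = Sigma_inv @ (a - mu)

# Central finite differences
eps, numeric = 1e-6, np.zeros(k)
for i in range(k):
    bump = np.zeros(k)
    bump[i] = eps
    numeric[i] = (log_pi(a, mu + bump) - log_pi(a, mu - bump)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # close to zero: matches the derivation
```

The full policy gradient then chains this with \triangledown_\theta \mu(x;\theta) from the network, as in the formula above.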

Reference:
The Matrix Cookbook
