For an event $A$, its Shannon Information Content is defined as $\log \frac{1}{p(A)}$.
Intuitively, it measures the "surprise" or "unlikeliness" of an event or an outcome of a probability distribution. An event that always happens has Shannon Information Content 0, while an event with probability 0 gives infinity.
Shannon Entropy
We can think of Shannon Entropy $H(X)$ as a measure of the "surprise", uncertainty, or randomness of a random variable $X$ or the distribution $p(x)$ associated with it.
$$H(p(x)) = \sum_x p(x) \log \frac{1}{p(x)} = \mathbb{E}_{x \sim p(x)}\left[\log \frac{1}{p(x)}\right]$$
By looking at the equation, we can interpret it as the expectation of how "surprising" the outcomes of the distribution $p(x)$ are, when sampling the outcomes from the distribution $p(x)$ itself.
Intuitively, discrete distributions with more possible outcomes have higher entropy (randomness). A distribution that assigns equal likelihood to each outcome also has higher entropy (uncertainty) than one with biased likelihoods.
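As a quick illustration, here is a minimal sketch (my own helper, using the natural logarithm and NumPy) showing that a uniform distribution has higher entropy than a heavily biased one:

```python
import numpy as np

def shannon_entropy(p):
    """H(p) = sum_x p(x) * log(1 / p(x)); zero-probability outcomes contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log(1.0 / p)))

print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform: log(4) ≈ 1.386
print(shannon_entropy([0.97, 0.01, 0.01, 0.01]))  # biased: ≈ 0.17, much lower
```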
KL Divergence
The KL Divergence measures how much two distributions $p(x)$ and $q(x)$ diverge from each other.
Similar to Shannon Entropy, we can use the following formula to measure the expected "surprise" or "unlikeliness" under a distribution $q(x)$, but with samples taken from another distribution $p(x)$.
$$\mathbb{E}_{x \sim p(x)}\left[\log \frac{1}{q(x)}\right]$$
If $q(x) \equiv p(x)$, this just gives the entropy $H(p)$ of $p(x)$. But it will always be greater (more randomness) than $H(p)$ when $p \neq q$.
The KL Divergence $D_{KL}(p \,\|\, q)$ measures the excess of this "surprise" over the baseline $H(p)$:

$$D_{KL}(p \,\|\, q) = \mathbb{E}_{x \sim p(x)}\left[\log \frac{1}{q(x)}\right] - H(p) = \mathbb{E}_{x \sim p(x)}\left[\log \frac{p(x)}{q(x)}\right]$$
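A small numerical sketch of this relationship, assuming discrete distributions with strictly positive probabilities (the quantity $\mathbb{E}_{x \sim p}[\log \frac{1}{q(x)}]$ is the cross-entropy; the helper names below are mine):

```python
import numpy as np

def entropy(p):
    """H(p) = E_{x~p}[log(1/p(x))]."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.log(1.0 / p)))

def cross_entropy(p, q):
    """E_{x~p}[log(1/q(x))]: expected surprise under q, sampling from p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(1.0 / q)))

def kl_divergence(p, q):
    """D_KL(p || q) = cross_entropy(p, q) - H(p); always >= 0."""
    return cross_entropy(p, q) - entropy(p)

p, q = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
print(kl_divergence(p, q))  # positive excess "surprise"
print(kl_divergence(p, p))  # ~0 (up to floating-point error)
```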
Thus, we can use the Monte Carlo method to obtain the gradient. But the probability density of the trajectory $p(\tau; \theta)$ is still unknown. Let's derive it further: since the environment dynamics do not depend on $\theta$, we get $\nabla_\theta \log p(\tau; \theta) = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$, which gives the following algorithm.
1. Sample a trajectory $\tau$ from the environment and $\pi_\theta(a \mid s)$
2. For each time step $t \in \{0, 1, ..., T\}$ in $\tau$:
   2.1. Calculate the return $R \leftarrow \sum_{i=t}^{T} \gamma^{i-t} r_i$
   2.2. Calculate the gradient $\nabla_\theta J(\theta) = R \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
   2.3. Update $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
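Below is a minimal sketch of these steps, assuming a linear softmax policy with preferences $\phi = \theta s$ and a tiny hand-made trajectory standing in for samples from a real environment (all names here are illustrative, not from the text above):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, s, a):
    """grad_theta log pi_theta(a|s) for a linear softmax policy, phi = theta @ s."""
    pi = softmax(theta @ s)
    one_hot = np.zeros(len(pi))
    one_hot[a] = 1.0
    return np.outer(one_hot - pi, s)  # same shape as theta

def reinforce_update(theta, states, actions, rewards, alpha=0.01, gamma=0.99):
    T = len(rewards)
    for t in range(T):
        # Return from time t: R = sum_{i=t}^{T} gamma^(i-t) * r_i
        R = sum(gamma ** (i - t) * rewards[i] for i in range(t, T))
        theta = theta + alpha * R * grad_log_pi(theta, states[t], actions[t])
    return theta

# Toy trajectory: 2 actions, 3 state features
theta = np.zeros((2, 3))
states = [np.array([1.0, 0.0, 0.5]), np.array([0.0, 1.0, 0.5])]
actions = [0, 1]
rewards = [0.0, 1.0]
theta = reinforce_update(theta, states, actions, rewards)
```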
Policy Gradient with Baseline
We have already obtained an unbiased estimator for the mean of $\nabla_\theta J(\theta)$, but its variance can grow out of control. We can prove that the following estimator is also unbiased for any baseline function $b(s)$ that is independent of $a$.
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p(\tau;\theta)}\left[(R(\tau) - b(s)) \, \nabla_\theta \log p(\tau;\theta)\right]$$
A good candidate for $b(s)$ is $v_\pi(s)$. The return term can then be replaced with an "advantage" function.
$$A(t) = \sum_{i=t}^{T} \gamma^{i-t} r_i - v_\pi(s_t)$$
Intuitively, we can think of the advantage function as how much better the return obtained by an action is than the average return from that state. The variance is reduced this way. Say the returns at two steps are 0 and 100; the advantages might then be only 1 and 2.
1. Sample a trajectory $\tau$ from the environment and $\pi_\theta(a \mid s)$
2. For each time step $t \in \{0, 1, ..., T\}$ in $\tau$:
   2.1. Calculate the return $R = \sum_{i=t}^{T} \gamma^{i-t} r_i$
   2.2. Calculate $\delta = R - v(s_t; w)$
   2.3. Update $w \leftarrow w + \beta \delta \nabla_w v(s_t; w)$
   2.4. Update $\theta \leftarrow \theta + \alpha \delta \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
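A sketch of the per-step update with the baseline, assuming a linear value function $v(s; w) = w^\top s$ (so $\nabla_w v(s; w) = s$); `grad_log_pi` stands for any callable returning $\nabla_\theta \log \pi_\theta(a \mid s)$, such as the softmax version sketched earlier:

```python
import numpy as np

def reinforce_with_baseline(theta, w, states, actions, rewards, grad_log_pi,
                            alpha=0.01, beta=0.1, gamma=0.99):
    """One pass over a sampled trajectory, updating both theta and w."""
    T = len(rewards)
    for t in range(T):
        R = sum(gamma ** (i - t) * rewards[i] for i in range(t, T))
        delta = R - w @ states[t]                  # delta = R - v(s_t; w)
        w = w + beta * delta * states[t]           # grad_w v(s_t; w) = s_t for a linear v
        theta = theta + alpha * delta * grad_log_pi(theta, states[t], actions[t])
    return theta, w
```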
The Actor-Critic Method
So far, our policy gradient estimator was unbiased. Now, in order to further reduce the variance, we start to introduce some bias by bootstrapping. We estimate the return at time step $t$ by $r_t + \gamma v_w(s_{t+1})$. And we don't need to sample the whole trajectory to perform an update.
The Actor-Critic Algorithm
Init θ, w
Repeat until convergence:
    Assign $s$ a start state: $s \leftarrow s_0$
    While the state $s$ is not terminal:
        Sample an action $a$ from the policy $\pi_\theta(a \mid s)$
        Perform action $a$ in the environment to get $r$ and $s'$
        Calculate $\delta = r + \gamma v_w(s') - v_w(s)$
        Update $w \leftarrow w + \beta \delta \nabla_w v_w(s)$
        Update $\theta \leftarrow \theta + \alpha \delta \nabla_\theta \log \pi_\theta(a \mid s)$
        Assign $s \leftarrow s'$
Note that we didn't take the derivative of $r + \gamma v_w(s')$ with respect to $w$, because it is the target of our regression problem.
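A sketch of one episode of this loop, assuming a linear critic $v_w(s) = w^\top s$, a linear softmax actor, and an environment object whose `step(a)` returns `(next_state, reward, done)` (a simplified interface assumed just for this sketch):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def actor_critic_episode(env, theta, w, alpha=0.01, beta=0.1, gamma=0.99):
    s = env.reset()
    done = False
    while not done:
        pi = softmax(theta @ s)                    # policy pi_theta(. | s)
        a = np.random.choice(len(pi), p=pi)
        s_next, r, done = env.step(a)
        # Bootstrapped target; the value of a terminal state is taken as 0
        target = r if done else r + gamma * (w @ s_next)
        delta = target - w @ s
        # Critic update: the target is treated as a constant (no gradient through it)
        w = w + beta * delta * s
        # Actor update: grad_theta log pi_theta(a|s) for a linear softmax policy
        one_hot = np.zeros(len(pi))
        one_hot[a] = 1.0
        theta = theta + alpha * delta * np.outer(one_hot - pi, s)
        s = s_next
    return theta, w
```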
Softmax Function for Discrete Actions Policy
Let the action space have $k$ discrete actions $\{a_1, a_2, ..., a_i, ..., a_k\}$. The stochastic policy gives the probability of the $i$th action given the state $s$, parameterised by $\theta$:

$$\pi(a_i \mid s; \theta) = \frac{e^{\phi_i(x,\theta)}}{\sum_{j=1}^{k} e^{\phi_j(x,\theta)}}$$
where $x$ is a feature vector of the state $s$, and $\phi(x,\theta)$ is a neural network parameterised by $\theta$ that outputs a vector representing the preferences of the $k$ actions.
Taking the contribution to the gradient through each preference $\phi_j$ (writing $\Sigma = \sum_{j=1}^{k} e^{\phi_j(x,\theta)}$):

For $j = i$:
$$\frac{\partial \pi(a_i \mid s;\theta)}{\partial \phi_j} \nabla_\theta \phi_j(x,\theta) = \frac{e^{\phi_i}\Sigma - e^{\phi_i}e^{\phi_j}}{\Sigma^2} \nabla_\theta \phi_j(x,\theta) = \pi_i (1 - \pi_j) \nabla_\theta \phi_j(x,\theta)$$

For $j \neq i$:
$$\frac{\partial \pi(a_i \mid s;\theta)}{\partial \phi_j} \nabla_\theta \phi_j(x,\theta) = \frac{0 \cdot \Sigma - e^{\phi_i}e^{\phi_j}}{\Sigma^2} \nabla_\theta \phi_j(x,\theta) = \pi_i (0 - \pi_j) \nabla_\theta \phi_j(x,\theta)$$

Thus, summing over $j$:
$$\nabla_\theta \pi(a_i \mid s;\theta) = \pi_i (I_i - \pi)^\top \nabla_\theta \phi(x,\theta)$$
where $I_i \in \mathbb{R}^k$ is a vector with 1 at the $i$th element and 0 elsewhere. It follows that $\nabla_\theta \log \pi(a_i \mid s;\theta) = (I_i - \pi)^\top \nabla_\theta \phi(x,\theta)$, which is the score term used in the policy gradient updates above.
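To double-check the derivative, here is a quick numerical sketch comparing $\partial \pi_i / \partial \phi_j = \pi_i(\delta_{ij} - \pi_j)$ against central finite differences on a random preference vector (all values are arbitrary):

```python
import numpy as np

def softmax(phi):
    e = np.exp(phi - phi.max())
    return e / e.sum()

rng = np.random.default_rng(0)
phi = rng.normal(size=4)   # action preferences phi(x, theta)
pi = softmax(phi)
i = 2                      # pick the i-th action

analytic = pi[i] * (np.eye(4)[i] - pi)   # pi_i * (I_i - pi)

eps = 1e-6
numeric = np.array([
    (softmax(phi + eps * np.eye(4)[j])[i] - softmax(phi - eps * np.eye(4)[j])[i]) / (2 * eps)
    for j in range(4)
])
print(np.allclose(analytic, numeric, atol=1e-8))  # True
```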
Gaussian Density Function for Continuous Action Policy
Let the action space have $k$ continuous action variables, $a \in \mathbb{R}^k$, $a = [a_1, a_2, ..., a_i, ..., a_k]^\top$. The stochastic policy function $\pi \in \mathbb{R}$ is a probability density over the action variables, parameterised by $\theta$. The probability density function is a multivariate Gaussian distribution with a variable mean $\mu \in \mathbb{R}^k$ and a constant covariance $\Sigma \in \mathbb{R}^{k \times k}$.
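For this Gaussian policy, the score with respect to the mean is $\nabla_\mu \log \pi(a \mid s) = \Sigma^{-1}(a - \mu)$, which is then chained with $\nabla_\theta \mu$ in the policy gradient. A small sketch with placeholder values for $\mu$ and $\Sigma$ (both are illustrative assumptions, not from the text):

```python
import numpy as np

k = 2
mu = np.array([0.5, -0.3])        # mean output by the parameterised policy for state s
Sigma = np.diag([0.1, 0.2])       # constant covariance

a = np.random.multivariate_normal(mu, Sigma)      # sample an action a ~ N(mu, Sigma)
grad_mu_log_pi = np.linalg.solve(Sigma, a - mu)   # Sigma^{-1} (a - mu)
print(a, grad_mu_log_pi)
```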