Thus, we can use the Monte Carlo method to obtain the gradient. But the probability density of the trajectory, p(τ;θ), is still unknown, so let's take it one step further: p(τ;θ) is a product of the environment dynamics, which do not depend on θ, and the policy probabilities, so when we take the gradient of its log the dynamics terms vanish and $\nabla_\theta \log p(\tau;\theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$. This yields the following Monte Carlo algorithm (REINFORCE):
1. Sample a trajectory τ from the environment using the policy πθ(a∣s)
2. For each time step t ∈ {0, 1, ..., T} in τ:
2.1. Calculate the return $R \leftarrow \sum_{i=t}^{T} \gamma^{\,i-t} r_i$
2.2. Calculate the gradient estimate $\nabla_\theta J(\theta) = R\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
2.3. Update $\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)$
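As a concrete illustration, here is a minimal Python (NumPy) sketch of one such update. The names `policy_grad_log`, `alpha`, and `gamma` are placeholders assumed for the example, not part of the text above; `policy_grad_log(theta, s, a)` is assumed to return $\nabla_\theta \log \pi_\theta(a \mid s)$ as an array with the same shape as theta.

import numpy as np

def reinforce_update(theta, trajectory, policy_grad_log, alpha=0.01, gamma=0.99):
    """One Monte Carlo policy-gradient (REINFORCE) update from a sampled trajectory.

    trajectory: list of (state, action, reward) tuples, one per time step.
    policy_grad_log(theta, s, a): gradient of log pi_theta(a|s) w.r.t. theta.
    """
    T = len(trajectory)
    for t, (s, a, r) in enumerate(trajectory):
        # Step 2.1: return from time step t, R = sum_{i=t}^{T} gamma^(i-t) * r_i
        R = sum(gamma ** (i - t) * trajectory[i][2] for i in range(t, T))
        # Step 2.2: per-step gradient estimate R * grad log pi(a_t | s_t)
        grad = R * policy_grad_log(theta, s, a)
        # Step 2.3: gradient ascent on J(theta)
        theta = theta + alpha * grad
    return theta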
Policy Gradient with Baseline
We have already obtained an unbiased estimator of $\nabla_\theta J(\theta)$, but its variance can get out of control. We can show that the following estimator is also unbiased for any baseline function b(s) that does not depend on the action a.
$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p(\tau;\theta)}\big[ (R(\tau) - b(s))\, \nabla_\theta \log p(\tau;\theta) \big]$
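To see why subtracting b(s) does not introduce bias, note that the baseline term has zero expectation under the policy (shown here for discrete actions; replace the sum by an integral in the continuous case):

$$\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[ b(s)\, \nabla_\theta \log \pi_\theta(a \mid s) \right] = b(s) \sum_{a} \pi_\theta(a \mid s)\, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} = b(s)\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0$$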
A good candidate for b(s) is $v_\pi(s)$. The return term can then be replaced by an "advantage" function.
$A(t) = \sum_{i=t}^{T} \gamma^{\,i-t} r_i - v_\pi(s_t)$
Intuitively, the advantage function measures how much better the return obtained by taking an action was than the average return from that state. This reduces the variance: if the returns at two steps are 0 and 100, the corresponding advantages might be only 1 and 2.
1. Sample a trajectory τ from the environment using the policy πθ(a∣s)
2. For each time step t ∈ {0, 1, ..., T} in τ:
2.1. Calculate the return $R = \sum_{i=t}^{T} \gamma^{\,i-t} r_i$
2.2. Calculate the advantage $\delta = R - v(s_t; w)$
2.3. Update $w \leftarrow w + \beta\, \delta\, \nabla_w v(s_t; w)$
2.4. Update $\theta \leftarrow \theta + \alpha\, \delta\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
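Below is a minimal Python (NumPy) sketch of this variant, under the same assumptions as the earlier sketch; `value(w, s)` and `value_grad(w, s)` are hypothetical helpers standing in for the baseline $v(s; w)$ and its gradient with respect to w.

import numpy as np

def reinforce_baseline_update(theta, w, trajectory,
                              policy_grad_log, value, value_grad,
                              alpha=0.01, beta=0.01, gamma=0.99):
    """One REINFORCE update with a learned value-function baseline.

    trajectory: list of (state, action, reward) tuples, one per time step.
    """
    T = len(trajectory)
    for t, (s, a, r) in enumerate(trajectory):
        # Step 2.1: Monte Carlo return from time step t
        R = sum(gamma ** (i - t) * trajectory[i][2] for i in range(t, T))
        # Step 2.2: advantage = return minus baseline v(s_t; w)
        delta = R - value(w, s)
        # Step 2.3: move the baseline towards the observed return
        w = w + beta * delta * value_grad(w, s)
        # Step 2.4: policy update weighted by the advantage
        theta = theta + alpha * delta * policy_grad_log(theta, s, a)
    return theta, w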
The Actor-Critic Method
So far, our policy gradient estimator has been unbiased. To reduce the variance further, we now introduce some bias by bootstrapping: we estimate the return at time step t by $r_t + \gamma\, v_w(s_{t+1})$. As a bonus, we no longer need to sample the whole trajectory before performing an update.
The Actor-Critic Algorithm
Initialise θ, w
Repeat until convergence:
    Assign s a start state: s ← s0
    While state s is not terminal:
        Sample action a from the policy πθ(a∣s)
        Perform action a in the environment to obtain r and s′
        Calculate $\delta = r + \gamma\, v_w(s') - v_w(s)$
        Update $w \leftarrow w + \beta\, \delta\, \nabla_w v_w(s)$
        Update $\theta \leftarrow \theta + \alpha\, \delta\, \nabla_\theta \log \pi_\theta(a \mid s)$
        Assign s ← s′
Note that we do not take the derivative of $r + \gamma\, v_w(s')$ with respect to w, because it is treated as the fixed target of our regression problem.
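A minimal Python (NumPy) sketch of one online actor-critic step follows, with the same hypothetical helpers as in the earlier sketches. The bootstrapped target is computed as a plain number, so no gradient flows through it, matching the remark above; the `done` flag is an assumption (not in the pseudo-code above) that follows the common convention of dropping the bootstrap term at a terminal state.

import numpy as np

def actor_critic_step(theta, w, s, a, r, s_next, done,
                      policy_grad_log, value, value_grad,
                      alpha=0.01, beta=0.01, gamma=0.99):
    """One online actor-critic update after observing (s, a, r, s')."""
    # Bootstrapped target r + gamma * v_w(s'): treated as a fixed regression target.
    target = r if done else r + gamma * value(w, s_next)
    # TD error, used as the advantage estimate
    delta = target - value(w, s)
    # Critic update (value function)
    w = w + beta * delta * value_grad(w, s)
    # Actor update (policy)
    theta = theta + alpha * delta * policy_grad_log(theta, s, a)
    return theta, w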
Softmax Function for a Discrete-Action Policy
Let the action space have k discrete actions {a_1, a_2, ..., a_i, ..., a_k}. The stochastic policy gives the probability of the i-th action given the state s, parameterised by θ:

$\pi(a_i \mid s; \theta) = \dfrac{e^{\phi_i(x,\theta)}}{\sum_{j=1}^{k} e^{\phi_j(x,\theta)}}$

where x is a feature vector of the state s, and ϕ(x,θ) is a neural network parameterised by θ that outputs a vector of preferences (logits) over the k actions.
By the chain rule, $\nabla_\theta \pi(a_i \mid s;\theta) = \sum_{j=1}^{k} \frac{\partial \pi_i}{\partial \phi_j}\, \nabla_\theta \phi_j(x,\theta)$ (writing $\pi_i$ for $\pi(a_i \mid s;\theta)$).

For $j = i$, the term is $\dfrac{e^{\phi_i} \sum_l e^{\phi_l} - e^{\phi_i} e^{\phi_j}}{\left(\sum_l e^{\phi_l}\right)^2}\, \nabla_\theta \phi_j(x,\theta) = \pi_i (1 - \pi_j)\, \nabla_\theta \phi_j(x,\theta)$

For $j \neq i$, the term is $\dfrac{0 - e^{\phi_i} e^{\phi_j}}{\left(\sum_l e^{\phi_l}\right)^2}\, \nabla_\theta \phi_j(x,\theta) = \pi_i (0 - \pi_j)\, \nabla_\theta \phi_j(x,\theta)$

Thus, summing over j, $\nabla_\theta \pi(a_i \mid s;\theta) = \pi_i\, (I_i - \pi)^{\top}\, \nabla_\theta \phi(x,\theta)$
where $I_i \in \mathbb{R}^k$ is the vector with 1 at the i-th element and 0 elsewhere, and π is the vector of the k action probabilities.
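Dividing the result above by $\pi_i$ gives the convenient form $\nabla_\theta \log \pi(a_i \mid s;\theta) = (I_i - \pi)^{\top}\, \nabla_\theta \phi(x,\theta)$, i.e. the gradient of the log-probability with respect to the preferences is simply $I_i - \pi$. The following Python (NumPy) sketch implements this and checks it against a finite-difference approximation; the function names and test values are illustrative.

import numpy as np

def softmax(phi):
    z = np.exp(phi - phi.max())        # subtract the max for numerical stability
    return z / z.sum()

def grad_log_pi_wrt_logits(phi, i):
    """Gradient of log pi(a_i) with respect to the preference vector phi.

    From d pi_i / d phi_j = pi_i (delta_ij - pi_j), it follows that
    d log pi_i / d phi = I_i - pi.
    """
    pi = softmax(phi)
    one_hot = np.zeros_like(phi)
    one_hot[i] = 1.0
    return one_hot - pi

# Finite-difference check of the formula (illustrative values)
phi = np.array([0.5, -1.0, 2.0])
i, eps = 0, 1e-6
numeric = np.array([
    (np.log(softmax(phi + eps * e)[i]) - np.log(softmax(phi - eps * e)[i])) / (2 * eps)
    for e in np.eye(len(phi))
])
print(np.allclose(numeric, grad_log_pi_wrt_logits(phi, i), atol=1e-5))  # True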
Gaussian Density Function for a Continuous-Action Policy
Let the action space have k continuous action variables, $a \in \mathbb{R}^k$, $a = [a_1, a_2, ..., a_i, ..., a_k]^{\top}$. The stochastic policy $\pi \in \mathbb{R}$ is a probability density over the action variables, parameterised by θ. The density is a multivariate Gaussian with a mean $\mu \in \mathbb{R}^k$ that varies with the state and a constant covariance $\Sigma \in \mathbb{R}^{k \times k}$.
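As a hedged sketch of how such a policy plugs into the updates above, the Python (NumPy) snippet below assumes a diagonal covariance $\Sigma = \mathrm{diag}(\sigma^2)$ with fixed σ and a mean produced elsewhere by the policy network; for a Gaussian density, $\nabla_\mu \log \pi(a \mid s) = \Sigma^{-1}(a - \mu)$, which is then chained with the gradient of the mean network to obtain $\nabla_\theta \log \pi$.

import numpy as np

def sample_action(mu, sigma):
    """Sample a ~ N(mu, diag(sigma^2)) for a k-dimensional continuous action."""
    return mu + sigma * np.random.randn(len(mu))

def grad_log_pi_wrt_mu(a, mu, sigma):
    """Gradient of log N(a; mu, diag(sigma^2)) with respect to the mean mu.

    For a diagonal covariance, Sigma^{-1} (a - mu) reduces to (a - mu) / sigma^2.
    """
    return (a - mu) / sigma ** 2

mu = np.array([0.2, -0.5])      # mean output by the policy network (illustrative values)
sigma = np.array([0.3, 0.3])    # fixed standard deviations (constant covariance)
a = sample_action(mu, sigma)
print(grad_log_pi_wrt_mu(a, mu, sigma))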