Monday, June 25, 2018

Shannon Entropy and KL Divergence


Shannon Information Content

For an event A, its Shannon Information Content is defined as:

\log \frac{1}{p(A)}

Intuitively, it measures the “surprise” or “unlikeliness” of an event or an outcome of a probability distribution. An event that always happens has Shannon Information Content 0, while an event with probability 0 gives infinity.
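As a quick numerical illustration, here is a minimal Python sketch (the helper name information_content is mine, not from the post):

import math

def information_content(p, base=2):
    # Shannon Information Content log(1/p) of an event with probability p.
    return math.log(1.0 / p, base)

print(information_content(1.0))   # 0.0   -> an event that always happens carries no surprise
print(information_content(0.5))   # 1.0   -> one bit of surprise
print(information_content(0.01))  # ~6.64 -> rare events are very "surprising"
# p = 0 would diverge to infinity (and raise ZeroDivisionError here).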

Shannon Entropy

We can think of Shannon Entropy H(X) as a measurement of the “surprise”, uncertainty, or randomness of a random variable X, or of the distribution p(x) associated with it.

H(p(x)) = \sum_x p(x) \log \frac{1}{p(x)} = E_{x \sim p(x)}[\log \frac{1}{p(x)}]

Looking at the equation, we can interpret it as the expectation of how “surprising” the outcomes of the distribution p(x) are, when sampling the outcomes from the distribution p(x) itself.

Intuitively, discrete distributions with more possible outcomes will have higher entropy (randomness). A distribution whose outcomes are equally likely will also have higher entropy (uncertainty) than a distribution biased toward some outcomes.
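A minimal sketch illustrating both points (base-2 logarithms, so entropy is in bits; the entropy helper below is my own naming):

import math

def entropy(dist, base=2):
    # H(p) = sum_x p(x) * log(1/p(x)); outcomes with p(x) = 0 contribute nothing.
    return sum(p * math.log(1.0 / p, base) for p in dist if p > 0)

print(entropy([0.5, 0.5]))   # 1.0   -> fair coin
print(entropy([0.25] * 4))   # 2.0   -> more outcomes, more randomness
print(entropy([0.9, 0.1]))   # ~0.47 -> biased coin, less uncertainty than the fair one
print(entropy([1.0]))        # 0.0   -> a certain outcome has no entropy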

KL Divergence

The KL Divergence measures how much two distributions p(x) and q(x) diverge from each other.

Similar to Shannon Entropy, we can use the following formula to measure the expectation of “surprise” or “unlikeliness” of a distribution q(x), but taking samples from another distribution p(x).

E_{x \sim p(x)}[\log \frac{1}{q(x)}]

If q(x) \equiv p(x), this just gives the entropy H of p(x). But it will always be greater (more randomness) than H(p) when p \ne q.
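This expectation is commonly known as the cross-entropy H(p, q). A minimal sketch (helper name and example distributions are mine) showing it equals H(p) when q = p and exceeds it otherwise:

import math

def cross_entropy(p, q, base=2):
    # H(p, q) = E_{x ~ p}[log 1/q(x)]: expected surprise of q, sampling from p.
    return sum(pi * math.log(1.0 / qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(cross_entropy(p, p))  # 1.0   -> equal to H(p) when q == p
print(cross_entropy(p, q))  # ~1.74 -> strictly greater than H(p) = 1 when q != p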

The KL Divergence D_{KL}(p||q) measures the excess of this “surprise” over the baseline H(p).

D_{KL}(p||q) = E_{x \sim p(x)}[\log \frac{1}{q(x)}] - E_{x \sim p(x)}[\log \frac{1}{p(x)}]

= \sum_x p(x) (\log \frac{1}{q(x)} - \log \frac{1}{p(x)}) = \sum_x p(x) \log \frac{p(x)}{q(x)}

Note that D_{KL}(p||q) \ne D_{KL}(q||p) in general.
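Putting the two expectations together, here is a sketch of D_{KL}(p||q) (my own helper; the example distributions are arbitrary) that also demonstrates the asymmetry:

import math

def kl_divergence(p, q, base=2):
    # D_KL(p||q) = sum_x p(x) * log(p(x) / q(x)); infinite if q(x) = 0 where p(x) > 0.
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue            # 0 * log(0 / q) is taken as 0
        if qi == 0:
            return math.inf     # p puts mass where q puts none
        total += pi * math.log(pi / qi, base)
    return total

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # ~0.74
print(kl_divergence(q, p))  # ~0.53 -> a different value: KL Divergence is not symmetric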

Suppose we have two experiments:

  1. Flipping a fair coin, where head and tail each come up with probability one half
  2. Flipping a biased coin that always returns head

Let p(x) and q(x) be the distributions of the outcomes of the two experiments, respectively.

H(p) = 1; H(q) = 0 (using base-2 logarithms, so entropy is measured in bits)

E_{x \sim q(x)}[\log_2 \frac{1}{p(x)}] = 1

because p(head) = 0.5. But,

E_{x \sim p(x)}[\log \frac{1}{q(x)}] \rightarrow \infty

because q(tail) = 0.
Thus, D_{KL}(q||p) equals 1, but D_{KL}(p||q) explodes to infinity.
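These numbers can be checked directly with base-2 logarithms (a self-contained sketch; the dictionaries simply encode the two coins above):

import math

p = {"head": 0.5, "tail": 0.5}   # fair coin
q = {"head": 1.0, "tail": 0.0}   # biased coin that always returns head

# D_KL(q||p): only outcomes with q(x) > 0 contribute
d_qp = sum(q[x] * math.log2(q[x] / p[x]) for x in q if q[x] > 0)
print(d_qp)   # 1.0

# D_KL(p||q): p(tail) > 0 but q(tail) = 0, so the divergence is infinite
if any(p[x] > 0 and q[x] == 0 for x in p):
    d_pq = math.inf
else:
    d_pq = sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)
print(d_pq)   # inf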

