Monday, June 25, 2018

Shannon Entropy and KL Divergence


Shannon Information Content

For an event A, its Shannon Information Content is defined as:

\log \frac{1}{p(A)}

Intuitively, it measures the “surprise” or “unlikeliness” of an event or an outcome of a probability distribution. An event that always happens has Shannon Information Content 0, while an event with probability 0 gives infinity.
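As a quick numerical illustration, here is a minimal Python sketch (the helper name information_content is mine, not from the post):

import math

def information_content(p, base=2):
    # Shannon Information Content log(1/p) of an event with probability p.
    return math.log(1.0 / p, base)

print(information_content(1.0))   # 0.0   -> an event that always happens carries no surprise
print(information_content(0.5))   # 1.0   -> one bit of surprise
print(information_content(0.01))  # ~6.64 -> rare events are very "surprising"
# p = 0 would diverge to infinity (and raise ZeroDivisionError here).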

Shannon Entropy

We can think of Shannon Entropy H(X) as a measurement of the “surprise”, uncertainty, or randomness of a random variable X, or of the distribution p(x) associated with it.

H(p(x)) = \sum_x p(x) \log \frac{1}{p(x)} = E_{x \sim p(x)}[\log \frac{1}{p(x)}]

Looking at the equation, we can interpret it as the expectation of how “surprising” the outcomes of the distribution p(x) are, when sampling the outcomes from the distribution p(x) itself.

Intuitively, discrete distributions with more possible outcomes will have higher entropy (randomness). A distribution whose outcomes are equally likely will also have higher entropy (uncertainty) than a distribution biased toward some outcomes.
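A minimal sketch illustrating both points (base-2 logarithms, so entropy is in bits; the entropy helper below is my own naming):

import math

def entropy(dist, base=2):
    # H(p) = sum_x p(x) * log(1/p(x)); outcomes with p(x) = 0 contribute nothing.
    return sum(p * math.log(1.0 / p, base) for p in dist if p > 0)

print(entropy([0.5, 0.5]))   # 1.0   -> fair coin
print(entropy([0.25] * 4))   # 2.0   -> more outcomes, more randomness
print(entropy([0.9, 0.1]))   # ~0.47 -> biased coin, less uncertainty than the fair one
print(entropy([1.0]))        # 0.0   -> a certain outcome has no entropy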

KL Divergence

The KL Divergence measures how much two distributions p(x) and q(x) diverge from each other.

Similar to Shannon Entropy, we can use the following formula to measure the expectation of “surprise” or “unlikeliness” of a distribution q(x), but taking samples from another distribution p(x).

E_{x \sim p(x)}[\log \frac{1}{q(x)}]

If q(x) \equiv p(x), this just gives the entropy H of p(x). But it will always be greater (more randomness) than H(p) when p \ne q.
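This expectation is commonly known as the cross-entropy H(p, q). A minimal sketch (helper name and example distributions are mine) showing it equals H(p) when q = p and exceeds it otherwise:

import math

def cross_entropy(p, q, base=2):
    # H(p, q) = E_{x ~ p}[log 1/q(x)]: expected surprise of q, sampling from p.
    return sum(pi * math.log(1.0 / qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(cross_entropy(p, p))  # 1.0   -> equal to H(p) when q == p
print(cross_entropy(p, q))  # ~1.74 -> strictly greater than H(p) = 1 when q != p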

The KL Divergence D_{KL}(p||q) measures the excess of this “surprise” over the baseline H(p).

D_{KL}(p||q) = E_{x \sim p(x)}[\log \frac{1}{q(x)}] - E_{x \sim p(x)}[\log \frac{1}{p(x)}]

= \sum_x p(x) (\log \frac{1}{q(x)} - \log \frac{1}{p(x)}) = \sum_x p(x) \log \frac{p(x)}{q(x)}

Note that D_{KL}(p||q) \ne D_{KL}(q||p) in general.
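Putting the two expectations together, here is a sketch of D_{KL}(p||q) (my own helper; the example distributions are arbitrary) that also demonstrates the asymmetry:

import math

def kl_divergence(p, q, base=2):
    # D_KL(p||q) = sum_x p(x) * log(p(x) / q(x)); infinite if q(x) = 0 where p(x) > 0.
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue            # 0 * log(0 / q) is taken as 0
        if qi == 0:
            return math.inf     # p puts mass where q puts none
        total += pi * math.log(pi / qi, base)
    return total

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # ~0.74
print(kl_divergence(q, p))  # ~0.53 -> a different value: KL Divergence is not symmetric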

Suppose we have two experiments:

  1. Flipping a fair coin, where head and tail each come up with probability one half
  2. Flipping a biased coin that always returns head

Let p(x) and q(x) be the distributions of the outcomes of the two experiments, respectively.

H(p) = 1; H(q) = 0 (using base-2 logarithms, so entropy is measured in bits)

E_{x \sim q(x)}[\log_2 \frac{1}{p(x)}] = 1

because p(head) = 0.5. But,

E_{x \sim p(x)}[\log \frac{1}{q(x)}] \rightarrow \infty

because q(tail) = 0.
Thus, D_{KL}(q||p) equals 1, but D_{KL}(p||q) explodes to infinity.
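These numbers can be checked directly with base-2 logarithms (a self-contained sketch; the dictionaries simply encode the two coins above):

import math

p = {"head": 0.5, "tail": 0.5}   # fair coin
q = {"head": 1.0, "tail": 0.0}   # biased coin that always returns head

# D_KL(q||p): only outcomes with q(x) > 0 contribute
d_qp = sum(q[x] * math.log2(q[x] / p[x]) for x in q if q[x] > 0)
print(d_qp)   # 1.0

# D_KL(p||q): p(tail) > 0 but q(tail) = 0, so the divergence is infinite
if any(p[x] > 0 and q[x] == 0 for x in p):
    d_pq = math.inf
else:
    d_pq = sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)
print(d_pq)   # inf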

