Why is KL Divergence Positive?

Notebook authored by Prof. Ani Adhikari, with minor modifications for Data 100 by Prof. F. Perez.

In [1]:
import numpy as np

import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

from ipywidgets import interact

Preliminary: Jensen's Inequality

Let $g$ be a convex function. Then $E(g(X)) \ge g(E(X))$.

Because $g$ is convex, its graph lies below the secant line at every point between the two points where the secant meets the curve.

In [4]:
x0, x1 = 0.05, 3

@interact(a=(2*x0, 1), b=(1.1, 0.95*x1))
def secplot(a, b):

    # Compute -log(x)
    x = np.linspace(x0, x1, 200)
    g = -np.log(x)

    # Compute secant through a,b
    ga, gb = -np.log(a), -np.log(b)
    m = (gb-ga)/(b-a)  # slope
    k = ga - m*a  # intercept

    secant = m*x + k

    # Plot all
    plt.figure(figsize=(10, 7))
    plt.plot(x, g, lw=2, color='darkblue', label=r'$g(X) = -\log(X)$')
    plt.plot(x, secant, lw=2, color='red', label=f'Secant through $(a,b) = ({a:.1f},{b:.1f})$')
    plt.scatter([a, b], [ga, gb], s=100, color='orange', edgecolors='black', linewidths=2)
    plt.legend()
    plt.ylim(-1.5, 3.5)
    plt.xlabel('x')
    plt.ylabel('y');

As we saw before, the secant is a linear interpolant: its value at the point $ta + (1-t)b$, for $t \in (0,1)$, is

$$ t g(a) + (1-t) g(b). $$

The value of the function itself at that same point is

$$ g(t a + (1-t)b), $$

so our convexity condition is that

$$t g(a) + (1-t) g(b) \ge g(t a + (1-t)b)$$

The secant's value at $ta + (1-t)b$ is a weighted average of the function's values at $a$ and $b$, which makes the convexity condition a statement of Jensen's inequality for a distribution that puts probability $t$ on $a$ and $1-t$ on $b$. If we generalize this to a weighted average over all the points in the $(a,b)$ interval, weighted by a probability distribution, we end up with the full version of Jensen's inequality stated as

$$ E[g(X)] \ge g(E[X]). $$
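
As a quick numerical sanity check (not part of the original notebook), the cell below verifies the two-point inequality for $g(x) = -\log(x)$ on a grid of $t$ values, and compares a Monte Carlo estimate of $E[g(X)]$ with $g(E[X])$ for a uniformly distributed $X$. The endpoints, sample size, and seed are arbitrary illustrative choices.

In [ ]:
# Sanity check of Jensen's inequality for the convex function g(x) = -log(x).
def g(x):
    return -np.log(x)

# Two-point version: t*g(a) + (1-t)*g(b) >= g(t*a + (1-t)*b) for t in (0, 1)
a, b = 0.3, 2.5
t = np.linspace(0.01, 0.99, 99)
print("Two-point inequality holds everywhere:",
      np.all(t*g(a) + (1-t)*g(b) >= g(t*a + (1-t)*b)))

# Full version: E[g(X)] >= g(E[X]), estimated by Monte Carlo for X ~ Uniform(0.05, 3)
rng = np.random.default_rng(42)
x = rng.uniform(0.05, 3, size=100_000)
print("E[g(X)] =", g(x).mean(), "   g(E[X]) =", g(x.mean()))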

The Sign of the Kullback-Leibler Divergence

I'll work in the discrete case. In the continuous case replace probabilities by densities and sums by integrals.

$$
\begin{align*}
D_{KL}(p \,\|\, q) ~&=~ E_{p}\Big(\log \frac{p(X)}{q(X)}\Big) \\
&=~ E_{p}\Big(-\log \frac{q(X)}{p(X)}\Big) \\
&\ge~ -\log \Big( E_{p}\Big( \frac{q(X)}{p(X)} \Big) \Big) ~~~ \text{by Jensen, because } g(x) = -\log(x) \text{ is a convex function} \\
&=~ -\log \Big( \sum_{\text{all } x} \frac{q(x)}{p(x)} \cdot p(x) \Big) \\
&=~ -\log \Big( \sum_{\text{all } x} q(x) \Big) \\
&=~ -\log(1) \\
&=~ 0
\end{align*}
$$
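
To make the conclusion concrete, the cell below (an added check, not from the original notebook) computes $D_{KL}(p \,\|\, q)$ directly for two small, arbitrarily chosen discrete distributions on the same support, and confirms that it is positive, and exactly zero when $p = q$.

In [ ]:
# Direct check that D_KL(p || q) >= 0, with equality when p == q.
# p and q are arbitrary illustrative distributions on the same finite support,
# with all probabilities strictly positive so the log-ratio is defined.
def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

print("D_KL(p || q) =", kl_divergence(p, q))   # positive, since p != q
print("D_KL(p || p) =", kl_divergence(p, p))   # exactly 0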