Notebook authored by Prof. Ani Adhikari, with minor modifications for Data 100 by Prof. F. Perez.
import numpy as np
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from ipywidgets import interact
Let $g$ be a convex function. Then $E(g(X)) \ge g(E(X))$.
Because $g$ is convex, its graph lies on or below the secant line between any two points $a$ and $b$ where the secant meets the graph.
x0, x1 = 0.05, 3
@interact(a=(2*x0, 1), b=(1.1, 0.95*x1))
def secplot(a, b):
    # Compute -log(x) on a grid
    x = np.linspace(x0, x1, 200)
    g = -np.log(x)
    # Compute the secant through (a, g(a)) and (b, g(b))
    ga, gb = -np.log(a), -np.log(b)
    m = (gb - ga)/(b - a)  # slope
    k = ga - m*a  # intercept
    secant = m*x + k
    # Plot the curve, the secant, and the two points where they meet
    plt.figure(figsize=(10, 7))
    plt.plot(x, g, lw=2, color='darkblue', label=r'$g(X) = -\log(X)$')
    plt.plot(x, secant, lw=2, color='red', label=f'Secant through $(a,b) = ({a:.1f},{b:.1f})$')
    plt.scatter([a, b], [ga, gb], s=100, color='orange', edgecolors='black', linewidths=2)
    plt.legend()
    plt.ylim(-1.5, 3.5)
    plt.xlabel('x')
    plt.ylabel('y');
As we saw before, the secant is a linear interpolant that can be written as
$$ t g(a) + (1-t) g(b) $$
for $t \in (0,1)$. The value of the function at the corresponding point between $a$ and $b$ is
$$ g(t a + (1-t)b) $$
so our convexity condition is that
$$t g(a) + (1-t) g(b) \ge g(t a + (1-t)b)$$
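The two-point condition can be checked numerically for $g(x) = -\log(x)$. The sketch below is only an illustration: the endpoints `a` and `b` and the grid of weights `t` are assumed values, not part of the original notebook.

```python
import numpy as np

def g(x):
    """The convex function used in the plot above: g(x) = -log(x)."""
    return -np.log(x)

a, b = 0.5, 2.0              # illustrative endpoints (assumed values)
t = np.linspace(0, 1, 101)   # interpolation weights, including the endpoints

secant_vals = t * g(a) + (1 - t) * g(b)   # t g(a) + (1-t) g(b)
curve_vals = g(t * a + (1 - t) * b)       # g(t a + (1-t) b)

# Convexity says the secant is never below the curve between a and b
gap = secant_vals - curve_vals
print("secant >= curve everywhere:", bool(np.all(gap >= -1e-12)))
print("largest gap:", gap.max())
```

The gap is zero at the endpoints, where the secant touches the graph, and positive in between.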
The secant is a weighted average of the function's values at $a$ and $b$, which makes the convexity condition a statement of Jensen's inequality for two points. If we generalize this to a weighted average over all the points in the $(a,b)$ interval, weighted by a probability distribution, we end up with the full version of Jensen's inequality, stated as
$$ E[g(X)] \ge g(E[X]). $$
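As a quick sanity check, the inequality can be evaluated for a small discrete distribution. This is a minimal sketch: the support values `x` and the randomly drawn probabilities `p` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete random variable X taking five positive values
x = np.array([0.2, 0.5, 1.0, 1.5, 2.5])
p = rng.dirichlet(np.ones(len(x)))   # random probabilities that sum to 1

def g(x):
    """Convex function g(x) = -log(x)."""
    return -np.log(x)

lhs = np.sum(p * g(x))   # E[g(X)]
rhs = g(np.sum(p * x))   # g(E[X])

print(f"E[g(X)] = {lhs:.4f}")
print(f"g(E[X]) = {rhs:.4f}")
print("Jensen holds:", bool(lhs >= rhs))
```

Changing the seed or the support values should leave the inequality intact, as long as every value in `x` is positive so that the logarithm is defined.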
I'll work in the discrete case. In the continuous case replace probabilities by densities and sums by integrals.
$$ \begin{align*} D_{KL}(p || q) ~ &= ~ E_{p}\big{(}\log\big{(} \frac{p(X)}{q(X)} \big{)} \big{)} \\ &= ~ E_{p}\big{(}-\log\big{(} \frac{q(X)}{p(X)} \big{)} \big{)} \\ & \ge ~ -\log \big{(} E_{p}\big{(} \frac{q(X)}{p(X)} \big{)} \big{)} ~~~ \text{by Jensen, because } g(x) = -\log(x) \text{ is a convex function} \\ &= ~ -\log \big{(} \sum_{\text{all } x} \frac{q(x)}{p(x)} \cdot p(x) \big{)} \\ &= ~ -\log \big{(} \sum_{\text{all } x} q(x) \big{)} \\ &= ~ -\log(1) \\ &= ~ 0 \end{align*} $$
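The conclusion that $D_{KL}(p || q) \ge 0$ can also be checked numerically in the discrete case. The helper `kl_divergence` below is a hypothetical name introduced for this sketch, and the two distributions are random draws used purely as an illustration. It assumes $q(x) > 0$ wherever $p(x) > 0$.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions given as probability arrays.

    Assumes q > 0 wherever p > 0; outcomes with p == 0 contribute nothing.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0
    return np.sum(p[support] * np.log(p[support] / q[support]))

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(6))   # two random distributions on six outcomes
q = rng.dirichlet(np.ones(6))

print("D_KL(p || q) =", kl_divergence(p, q))   # nonnegative
print("D_KL(q || p) =", kl_divergence(q, p))   # nonnegative, generally different
print("D_KL(p || p) =", kl_divergence(p, p))   # zero when the distributions match
```

The divergence is not symmetric in its arguments, but both directions are nonnegative, and it vanishes when the two distributions coincide.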