KL divergence
The Kullback-Leibler (KL) divergence comes up a lot in ML–like a lot. Most recently, I learned it’s used to compute the Inception Score.
what is KL divergence?
Currently, my best answer is, “it’s a way to measure how similar two distributions are.” But similar can mean many things, and more importantly, I don’t know why it’s supposed to measure similarity. Let’s get to the bottom of this1.
KL divergence is defined as:
\[D_\text{KL}(P \mid \mid Q) = \sum_{x \in \mathcal{X}}P(x)\log \frac{P(x)}{Q(x)}\]

Let’s re-write it and hope we end up with something more digestible.
From log rules, we can re-write \(\log \frac{P(x)}{Q(x)}\) as \(\log P(x) - \log Q(x)\):
\[= \sum_{x \in \mathcal{X}} P(x)[\log P(x) - \log Q(x)]\]

Expanding it further, we get:
\[= \sum_{x \in \mathcal{X}} P(x)\log P(x) -\sum_{x \in \mathcal{X}}P(x)\log Q(x)\]

entropy
Let’s take a detour for a minute to talk about entropy. The Shannon entropy is defined as:
\[H(X) = - \sum_{i} P(x_i)\log P(x_i)\]

Again, I don’t have a good sense of what this formula represents2. So let’s start from the ground up. In information theory, entropy is often referred to as a measure of surprise.
What does it mean to be “surprised?” Intuitively, we are surprised when an unlikely event occurs. Given the probability of an event, \(P(E)\), surprise should be large if \(P(E)\) is small, and vice versa: think winning the lottery vs. losing the lottery.
Mathematically, surprisal is defined as3:
\[I(E) = \log \left( \frac{1}{P(E)} \right) = -\log P(E)\]

For example, when \(P(E)\) is 1, then \(I(E) = \log 1 = 0\). When a guaranteed event occurs, there is no surprise. If \(P(E)=0\), then the log, and subsequently the surprise, is undefined: we can’t measure how surprised we would be by something that will never happen.
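To make this concrete, here is a tiny Python sketch of the surprisal function (the name `surprisal` and the example probabilities are my own choices, not anything from a library):

```python
import math

def surprisal(p):
    """Surprisal (information content) of an event with probability p, in bits."""
    return math.log(1 / p, 2)

print(surprisal(1.0))   # 0.0   -> a guaranteed event carries no surprise
print(surprisal(0.5))   # 1.0   -> a fair coin flip carries 1 bit of surprise
print(surprisal(0.01))  # ~6.64 -> rare events are very surprising
```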
So we can say:
\[H(X) = \sum_{i} P(x_i)I(x_i)\]

where \(I(x_i) = -\log P(x_i)\) is the surprisal for each outcome.
In this form, it’s clear that entropy is the expectation of the surprisal function for a discrete random variable.
\[H(X) = \mathbb{E}_{x \sim P}[I(x)]\]

In other words, the entropy is the average surprise we experience when sampling from the random variable \(X\).
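If that reading is right, averaging the surprisal over many samples drawn from \(P\) should land close to the closed-form entropy. A quick sketch to check this (the three-outcome toy distribution is arbitrary):

```python
import math
import random

# Arbitrary toy distribution P over three outcomes.
P = {"a": 0.5, "b": 0.25, "c": 0.25}

def surprisal(p):
    return math.log(1 / p, 2)  # in bits

# Closed-form entropy: the expected surprisal under P.
entropy = sum(p * surprisal(p) for p in P.values())

# Empirical average surprise over many samples drawn from P.
samples = random.choices(list(P), weights=list(P.values()), k=100_000)
avg_surprise = sum(surprisal(P[x]) for x in samples) / len(samples)

print(entropy)       # 1.5 bits
print(avg_surprise)  # ~1.5, i.e. close to the entropy
```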
For a simple \(P\), we can easily calculate the entropy. Imagine we are flipping a coin, and let’s compare the entropy of a fair coin and a weighted coin.
| Coin | P(Heads) | P(Tails) | Entropy4 |
| --- | --- | --- | --- |
| Fair | 0.5 | 0.5 | 1 bit |
| Weighted | 0.9 | 0.1 | 0.47 bits |
A fair coin has the highest possible entropy: 1 bit. The 90/10 weighted coin has a lower entropy because most flips come up heads, and since heads are not surprising, on average we will be less surprised.
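The same numbers fall out of a few lines of Python (a minimal sketch; the helper name `entropy` is mine):

```python
import math

def entropy(dist):
    """Shannon entropy, in bits, of a distribution given as a list of probabilities."""
    return -sum(p * math.log(p, 2) for p in dist if p > 0)

print(entropy([0.5, 0.5]))  # 1.0   -> fair coin
print(entropy([0.9, 0.1]))  # ~0.47 -> 90/10 weighted coin
```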
cross-entropy
In the example above, we knew the true distribution of the coin, \(P\). But what if we picked up a coin on the floor and didn’t know if it was fair or not?
We can make a guess, say a 60/40 coin, and calculate the expected surprise under that assumption. This time we call it cross-entropy: the average surprise when we sample from \(P\), but measure surprise using a different distribution \(Q\).
\[H(P,Q) = \mathbb{E}_{x \sim P}[-\log Q(x)] = -\sum_{i} P(x_i)\log Q(x_i)\]

Let’s write out a specific example. Say we have a fair coin, but we believe it is actually weighted, 60/40 heads to tails. The cross-entropy (still using log base 2) is then:
\[H(P,Q) = -(0.5 \log 0.6 + 0.5 \log 0.4) \approx 1.03\]

What if we believe an even more incorrect distribution, say 90/10?
\[H(P,Q) = -(0.5 \log 0.9 + 0.5 \log 0.1) \approx 1.74\]

And what if our belief matches reality, i.e. we also believe it is a fair coin?
\[H(P,Q) = -(0.5 \log 0.5 + 0.5 \log 0.5) = 1 = H(P)\]

Interestingly, it seems that \(H(P,Q) \geq H(P)\), with equality only when \(P=Q\). So we can say:
\[H(P,Q) = H(P) + C\]

Cross-entropy is the sum of the inherent surprise originating from \(P\) and an error term, \(C\). This error term is the additional surprise incurred by assuming \(Q\) when in reality \(P\) is true.
Or in other words, \(C\) is the cost of believing \(Q\) when \(P\) is actually true. So when \(Q=P\), \(C=0\).
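Here is a small sketch that reproduces the cross-entropy numbers above and the extra cost \(C\) for each belief (again in log base 2; `cross_entropy` is just my own helper):

```python
import math

def cross_entropy(p, q):
    """Average surprise, in bits, when outcomes follow p but surprise is measured with q."""
    return -sum(pi * math.log(qi, 2) for pi, qi in zip(p, q))

fair = [0.5, 0.5]  # the true distribution P
for belief in ([0.6, 0.4], [0.9, 0.1], [0.5, 0.5]):
    h_pq = cross_entropy(fair, belief)
    extra = h_pq - cross_entropy(fair, fair)  # C = H(P,Q) - H(P)
    print(belief, round(h_pq, 2), round(extra, 2))

# [0.6, 0.4] 1.03 0.03
# [0.9, 0.1] 1.74 0.74
# [0.5, 0.5] 1.0 0.0
```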
If we solve for this error term, we get:
\[C = H(P,Q) - H(P)\]

connecting the dots
At the beginning of this post, we showed that KL divergence can be written as:
\[= \sum_{x \in \mathcal{X}} P(x)\log P(x) -\sum_{x \in \mathcal{X}}P(x)\log Q(x)\]

The first sum is \(-H(P)\), and the second term, \(-\sum_{x}P(x)\log Q(x)\), is exactly \(H(P,Q)\). So after rearranging, this is equivalent to:
\[= H(P,Q) - H(P)\]

This means that
\[D_\text{KL}(P \mid \mid Q) = H(P,Q) - H(P) = C\]

So the KL divergence is just measuring the extra surprise or cost of believing \(Q\) when \(P\) is true.
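As a final check, computing the KL divergence directly from the definition and as \(H(P,Q) - H(P)\) gives the same number. A minimal sketch for the fair coin with a 90/10 belief (helper names are my own):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) computed directly from the definition, in bits."""
    return sum(pi * math.log(pi / qi, 2) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi, 2) for pi, qi in zip(p, q))

P = [0.5, 0.5]  # the true (fair) coin
Q = [0.9, 0.1]  # our incorrect belief

print(kl_divergence(P, Q))                        # ~0.74
print(cross_entropy(P, Q) - cross_entropy(P, P))  # same value: H(P,Q) - H(P)
```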
recap
To summarize, both KL divergence and entropy are built on the idea of surprise. Luckily, surprise is intuitive: rare events are more surprising.
Entropy quantifies how surprised we are, on average, by a random variable. When we make an assumption about the behavior of a random variable, we may be wrong, and this can lead us to be surprised when we shouldn’t be.
For example, if we assume the coin is biased 90/10 but it turns out it’s fair, we end up seeing a lot more tails than we expect, and thus we are more surprised.
Cross-entropy combines two things: the inherent surprise of the true distribution and the extra surprise caused by believing a wrong model. KL divergence simply measures the extra surprise.
final thoughts
Some questions I still want to dig into:

- why use KL divergence to compare distributions?
- why is KL divergence asymmetric?
- using KL divergence as a loss function
1. For reference, I’m going off the Wikipedia page.
2. Also going off the Wikipedia page for entropy.
3. Surprisal is also called information content, which is why we use the notation \(I(E)\).
4. Using log base 2, which gives us the standard unit for entropy, the bit.