
fréchet inception distance

I recently watched MIT’s Introduction to Flow Matching and Diffusion Models course[1]. It’s made me interested in generative models and their applications in protein engineering. But before diving into generative models for proteins, I wanted to start off with something simpler, and luckily one of the course assignments is training a flow matching model on MNIST.

After training this model, I was super excited and showed my PI my generated digits. I was also showing him how, with classifier-free guidance, you can change the guidance value to get “better” images. He asked me if there’s a quantitative way to measure how much “better” the images are, and I didn’t have a great answer[2]. I just assumed you go off of vibes.

But after reading some more image generation papers like Imagen[3], I realized that Fréchet inception distance (FID)[4] existed and was used to assess image quality (ah, of course, just evaluating on vibes is not enough). So, let’s figure out what it is and how it works.

inception score

FID was introduced in 2017[5] to address the limitations of the Inception score (IS). IS is based on the outputs of a pretrained Inception v3 model. Let me work through the relevant passages from the paper that introduced it.

We apply the Inception model to every generated image to get the conditional label distribution \(p(y\mid \mathbf{x})\). Images that contain meaningful objects should have a conditional label distribution \(p(y\mid \mathbf{x})\) with low entropy.

What is this saying? \(\mathbf{x}\) is the generated image, and \(y\) is the class label. I know classification models will output logits, and if we softmax those logits, we’ll get a vector where each element corresponds to a probability for a given class, and the sum of the vector is 1: this is our \(p(y\mid \mathbf{x})\).

So what does it mean for \(p(y\mid \mathbf{x})\) to have low entropy? From physics, low entropy refers to an organized state. Here, I’m guessing it means we have a clean peak in probability for one class. For example, if we had 3 classes (cat, dog, bird), a low entropy label distribution would be [0, 1, 0]. A high entropy distribution would be [0.33, 0.33, 0.33] – this might happen because the image looks blurry and Inception can’t easily classify the image, and therefore can’t assign a clear label.
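
To make the entropy idea concrete, here’s a tiny numpy sketch (the class probabilities are made up) that computes the Shannon entropy of those two example label distributions:

```python
import numpy as np

def entropy(p):
    # Shannon entropy (in nats) of a discrete distribution; skip zero entries to avoid log(0)
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

confident = [0.0, 1.0, 0.0]       # clean peak on one class
uncertain = [0.33, 0.34, 0.33]    # blurry image, Inception can't decide

print(entropy(confident))  # 0.0  -> low entropy
print(entropy(uncertain))  # ~1.10, close to log(3) -> high entropy
```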

Moreover, we expect the model to generate varied images, so the marginal \(\int p(y \mid \mathbf{x}=G(z)) dz\) should have high entropy. Combining these two requirements, the metric that we propose is: \(\exp(\mathbb{E}_\mathbf{x} \text{KL}(p(y \mid \mathbf{x})\mid\mid p(y)))\), where we exponentiate results so the values are easier to compare.

From the GAN paper, \(G\) is the generator, so \(\mathbf{x}=G(z)\) just refers to the generated images. From the flow matching lectures, I know when the word marginal appears, it’s related to looking at the behavior of the entire dataset. The marginal is \(p(y) = \int p(y \mid \mathbf{x}=G(z))p(z)dz\)[6].

Using Monte Carlo integration, we can estimate the marginal distribution \(\hat{p}(y)\).

\[\hat{p}(y) = \frac{1}{N} \sum_{i=1}^{N} p(y \mid \mathbf{x}_i),\quad \mathbf{x}_i = G(z_i)\]

Simply put, we average the label distribution \(p(y \mid \mathbf{x})\) from many generated images. If \(\hat{p}(y)\) has high entropy, it means on average, the marginal label distribution is spread out and no single class dominates. This suggests the generated images are diverse.
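
In code, this estimate is just a row-average over the per-image softmax outputs. A minimal sketch, where `probs` stands in for the \(N \times K\) matrix of \(p(y \mid \mathbf{x}_i)\) you’d get from running Inception on \(N\) generated images (here it’s random placeholder data, not real model output):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for Inception softmax outputs on N generated images (N x K classes).
N, K = 5000, 1000
logits = rng.normal(size=(N, K))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Monte Carlo estimate of the marginal label distribution p_hat(y): average over images.
p_hat_y = probs.mean(axis=0)   # shape (K,), still sums to 1
```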

Almost done–now we calculate \(\text{KL}(p(y \mid \mathbf{x})\mid\mid \hat{p}(y))\). This is the KL divergence between the conditional label distribution and the marginal label distribution. I’ll save the discussion on KL divergence for later–for now, think of it as measuring how different the conditional is from the marginal.

A high KL divergence occurs when each individual image can be confidently classified (low entropy conditional distribution) but on average the generated samples are diverse (high entropy marginal distribution). To get the expectation, we average the KL divergence across the generated samples, and then lastly, exponentiate this average to get the Inception score.
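
Putting the pieces together, here’s a sketch of the whole Inception score computed from that same \(N \times K\) matrix of class probabilities. This mirrors the formula above rather than any reference implementation (real implementations typically also split the samples into chunks and report a mean and standard deviation across splits):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, K) array, where row i is p(y | x_i) from Inception on generated image i."""
    p_hat_y = probs.mean(axis=0)                     # marginal label distribution
    # KL(p(y|x_i) || p_hat(y)) for each image, summed over classes
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_hat_y + eps)), axis=1)
    return float(np.exp(kl.mean()))                  # exponentiate the average KL
```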

FID

The Inception score has the disadvantage that it does not use the statistics of real world samples and compare it to the statistics of synthetic samples.

This is the limitation that FID tries to address.

Let \(p(.)\) be the distribution of model samples and \(p_w(.)\) the distribution of the samples from real world. The equality \(p(.)=p_w(.)\) holds except for a non-measurable set if and only if \(\int p(.) f(x) dx=\int p_w(.) f(x) dx\) for a basis \(f(.)\) spanning the function space in which \(p(.)\) and \(p_w(.)\) live.

Let’s break this down into smaller components. While not always true, in general, we want our generative model to make things that are realistic. Thus, we want to measure how much our distribution of generative samples matches the distribution of real samples. Here, we want to define what it means for the two distributions to be the same.

Instead of comparing the distributions directly, which is non-trivial, we compare them indirectly through their expectations; more specifically, through the expected values of functions \(f(x)\) that each measure some property of the data.

You can imagine a scenario where a single \(f(x)\) gives the same value for two clearly different distributions, for example the mean of \(x\). Thus another requirement is that the set of functions \(f(x)\) spans the function space in which \(p(.)\) and \(p_w(.)\) live, meaning it must be expressive enough to capture any difference between the two distributions.
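
A quick numerical illustration of why one function isn’t enough: a Gaussian and a uniform distribution can share the same mean, so \(f(x) = x\) alone can’t separate them, but the second moment already can.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=100_000)    # Gaussian, mean 0, variance 1
b = rng.uniform(low=-5.0, high=5.0, size=100_000)   # uniform, mean 0, variance ~8.3

print(a.mean(), b.mean())   # both ~0: the mean can't tell them apart
print(a.var(),  b.var())    # ~1.0 vs ~8.3: the second moment separates them
```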

These equalities of expectations are used to describe distributions by moments or cumulants, where \(f(x)\) are polynomials of the data \(x\). We replace \(x\) by the coding layer of an Inception model in order to obtain vision-relevant features and consider polynomials of the coding unit functions. For practical reasons we only consider the first two polynomials, that is, the first two moments: mean and covariance. The Gaussian is the maximum entropy distribution for given mean and covariance, therefore we assume the coding units to follow a multidimensional Gaussian.

The moments of the distribution are one possible choice for the set of functions \(f(x)\). The first moment is \(\mathbb{E}[x]\), the second moment is \(\mathbb{E}[x^2]\), and so on. Just as using more terms improves a Taylor series approximation, using more moments gives a more complete comparison of two distributions. The more moments two distributions share, the more confident we can be that they are similar.

In FID, we assume that the first two polynomials sufficiently capture the properties of our data. These two moments correspond to the mean and covariance, respectively.

Lastly, instead of computing the moments of \(x\), we compute the moments of the embedding \(z = h(x)\), derived from some pre-trained model \(h\), in this case Inception.
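
As a rough sketch of what “the coding layer of an Inception model” means in practice, here’s one way to pull pooled 2048-dimensional features out of torchvision’s Inception v3 by swapping its classification head for an identity. This assumes a recent torchvision; reference FID implementations use a specific pool3 layer and specific preprocessing, so treat this only as an illustration of the idea:

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Pretrained Inception v3; replacing the final fc layer with Identity makes the
# forward pass return the 2048-d pooled features instead of class logits.
model = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
model.fc = nn.Identity()
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((299, 299)),   # Inception v3 expects 299x299 inputs
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # standard ImageNet normalization
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(pil_images):
    """List of PIL images -> (B, 2048) embedding matrix."""
    batch = torch.stack([preprocess(img) for img in pil_images])
    return model(batch)
```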

The Gaussian is the maximum entropy distribution for given mean and covariance, therefore we assume the coding units to follow a multidimensional Gaussian. The difference of two Gaussians is measured by the Fréchet distance also known as Wasserstein-2 distance. The Fréchet distance \(d(.,.)\) between the Gaussian with mean and covariance \((\boldsymbol{m}, \boldsymbol{C})\) obtained from \(p(.)\) and the Gaussian \((\boldsymbol{m}_w,\boldsymbol{C}_w)\) obtained from \(p_w(.)\) is called the “Fréchet Inception Distance”

Given a mean and covariance, the Gaussian is the most unbiased choice when no other information is available. A convenient property of Gaussians is that they are fully characterized by the first two moments.

We approximate the distributions of embeddings of generated images \((z)\) and embeddings of real images \((z_w)\) as a multivariate Gaussian. We can then compare the two distributions using the Fréchet distance, yielding the FID:

\[\|\boldsymbol{m}-\boldsymbol{m}_w\|_2^2+ \mathrm{Tr} \bigl(\boldsymbol{C}+\boldsymbol{C}_w-2(\boldsymbol{C}\boldsymbol{C}_w)^{1/2}\bigr)\]
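
Given embedding matrices for real and generated images, the computation is short; the matrix square root is the only slightly fiddly part. A minimal sketch using scipy, glossing over the numerical safeguards real implementations add around `sqrtm`:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_emb, gen_emb):
    """real_emb, gen_emb: (N, D) arrays of Inception embeddings."""
    m_w, m = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    C_w = np.cov(real_emb, rowvar=False)
    C = np.cov(gen_emb, rowvar=False)

    covmean = sqrtm(C @ C_w)          # matrix square root of C @ C_w
    if np.iscomplexobj(covmean):      # small imaginary parts can show up numerically
        covmean = covmean.real

    return float(np.sum((m - m_w) ** 2) + np.trace(C + C_w - 2.0 * covmean))
```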

So that was a lot of thinking just to get to a rather simple conclusion–all you have to do is compute the mean and covariance of the Inception embeddings for your real and generated samples and plug them into this equation. But you know what, I enjoyed learning about the story of how we got FID, starting with the Inception score–it’s about the journey, not the destination, as they say.

And more importantly, it got me thinking about how to evaluate generated biomolecules[7]. I didn’t realize it before, but although generative models get a lot of credit, evaluating these models’ outputs is equally as important and interesting.


  1. This was a great series of lectures. The accompanying lecture notes are written very clearly. It was also very fun to go back and forth with 4o while referencing the lecture notes. 

  2. Maybe this was covered in the lecture series, but I couldn’t think of an answer on the spot. 

  3. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding 

  4. Also I just realized the dWJS paper’s distributional conformity score is motivated by FID!! This makes me think it will be useful to study some image generation literature for protein generation. 

  5. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium 

  6. A realization I had is that the expectation is a marginalization: \(\mathbb{E}_{x\sim p(x)}[f(x)] = \int f(x)p(x)dx\). So another way to write \(\int p(y \mid \mathbf{x}=G(z))p(z)dz\) is: \(\mathbb{E}_{z \sim p(z)}[p(y \mid \mathbf{x}=G(z))]\) 

  7. If Inception embeddings are used for FID and measuring image quality, what about for proteins? Can’t we use ESM embeddings to do something similar, like Frechet ESM Distance? So I searched “Frechet ESM Distance” and lo and behold, it’s been used before! See here. And then after some more digging, we also see Frechet ProtT5 and ProteinMPNN Distances