Evaluation metrics for generative image models
Training generative image models is challenging, but evaluating them properly can be even harder. The most naive approach is human expert judgment. However, it is expensive, time-consuming, and prone to bias: human judgments depend on the task setup, the annotators' motivation, and the feedback they receive about their mistakes. For a long time there was no objective function that could be used to evaluate generated images, so automatic alternatives have been developed to measure the quality, diversity, and fidelity of generated images. Among the most commonly used are the Inception Score (IS) and the Fréchet inception distance (FID).
Inception Score
The Inception Score was initially proposed in the 2016 paper Improved Techniques for Training GANs. By design, it measures two factors:
- Image fidelity. Each image has to contain a clear object.
- Diversity. The model should be able to generate many different object classes, ideally following a uniform distribution. Each class should be equally likely.
The IS is computed using the following function:
IS = exp( E_x [ D_KL( p(y|x) || p(y) ) ] )
where the expectation is taken over the generated images x.
Here, p(y|x) is the conditional label distribution. The authors of the paper use the Inception network to predict the label of each generated image. If the image contains a clear object, the Inception model should return a high probability for one of the classes, and the conditional entropy will be low, H(y|x) ≈ 0.
The other part of the equation, p(y), is the marginal distribution over classes predicted by the Inception model across all generated images. Ideally, every class should be equally likely, so the entropy H(y) should be as high as possible (it is maximal for a uniform distribution).
KL is the Kullback–Leibler divergence. It measures how different two distributions are from each other: a large value indicates that they are very dissimilar. The divergence between p(y|x) and p(y) is large exactly when p(y|x) is peaked (fidelity) and p(y) is flat (diversity), so it captures both constraints at once.
Figure 1. The KL divergence measures the dissimilarity between distributions: D_KL(P||Q) = Σ_x P(x) ln(P(x)/Q(x)). Note that D_KL is not symmetric: D_KL(P||Q) ≠ D_KL(Q||P). Source
The exponential function was introduced only for aesthetic purposes, so that the values are easier to compare. The higher the IS, the better.
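Below is a minimal sketch of this computation in NumPy, assuming the softmax outputs p(y|x) of a pre-trained Inception network have already been collected into an array. The ten-split averaging mirrors common practice from the original paper, and the random `fake_probs` at the end are only a stand-in for real model outputs.

```python
import numpy as np

def inception_score(probs: np.ndarray, n_splits: int = 10, eps: float = 1e-16):
    """Minimal Inception Score sketch.

    probs: array of shape (N, num_classes) with softmax outputs p(y|x)
           from a pre-trained Inception network, one row per generated image.
    Returns the mean and standard deviation of IS over n_splits splits.
    """
    scores = []
    for chunk in np.array_split(probs, n_splits):
        p_y = chunk.mean(axis=0, keepdims=True)                  # marginal p(y) within the split
        kl = chunk * (np.log(chunk + eps) - np.log(p_y + eps))   # KL(p(y|x) || p(y)) per image
        scores.append(np.exp(kl.sum(axis=1).mean()))             # exp of the average KL
    return float(np.mean(scores)), float(np.std(scores))

# Hypothetical usage: 5000 fake "softmax" rows over 1000 classes.
rng = np.random.default_rng(0)
fake_probs = rng.dirichlet(np.ones(1000), size=5000)
print(inception_score(fake_probs))
```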
Inception Score drawbacks
Inception Score, however, has some drawbacks. These were pointed out in the 2018 paper A Note on the Inception Score. Here are the main ones:
- Low generalization ability. The IS does not measure image diversity within a class. A generative model can produce identical images within a class and the score will still be high, so the metric does not penalize a model for memorizing only a small subset of the training data.
- Sensitivity to model weights. Even small differences in weight values can result in large variance in evaluation results.
For example, the mean Inception Score differs by 3.5% for ImageNet validation images and by 11.5% for CIFAR validation images, depending on whether a Keras or a Torch implementation of the Inception network is used.
- Directly optimizing a generative model for the Inception Score teaches it to generate adversarial samples instead of higher-quality images.
Figure 2. Adversarial samples generated by a diffusion model trained with an IS objective. The samples achieve an Inception Score of 900.15, while the maximum possible IS is 1000. Source
- The IS is best suited for generative models trained on the same dataset as the Inception classification model (the original Inception model was trained on ImageNet). Applying IS to models trained, for example, on CIFAR-10 can give misleading results, because the 1,000 ImageNet classes predicted by the Inception model do not align with the 10 classes present in CIFAR-10. This misalignment leads to an incorrect estimation of p(y|x) and of p(y), the marginal class distribution across the set of generated images.
Fréchet inception distance
Figure 3. Visualization of the FID computation. The FID is the distance between two normal distributions whose mean and covariance are estimated from the first and second moments of the Inception features of real and generated images.
Because of the Inception Score's drawbacks, the Fréchet inception distance was proposed for evaluating generative image models. It captures the similarity of generated images to real ones better than the Inception Score. Like IS, FID uses a pre-trained Inception-v3 model; however, instead of the output probabilities, FID uses the activations of the last pooling layer (the one just before the output classification layer) as a coding layer. Typically 50k images are propagated through the Inception model, and moments are computed on the vectors from that coding layer: the first moment (mean) and the second moment (covariance matrix). These statistics are calculated separately for a collection of real images and a collection of generated ("fake") images, and the distance between the two resulting distributions is measured with the Fréchet distance:
FID = ||mu_r − mu_f||^2 + Tr( C_r + C_f − 2·(C_r·C_f)^(1/2) )
where mu_r and mu_f are the means of the real and fake features respectively, C_r and C_f are the covariance matrices of the real and fake features respectively, Tr is the trace operation from linear algebra, and || ||^2 is the sum of squared differences. The lower the distance (FID score), the better the image quality.
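As an illustration, here is a rough NumPy/SciPy sketch of the distance itself, assuming the Inception-v3 pool features of real and generated images have already been extracted. The toy 64-dimensional features at the end are only placeholders (real FID uses 2048-dimensional Inception features and ~50k images per set).

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    """FID between two sets of (N, d) feature vectors, e.g. Inception-v3
    pool features of real and generated images."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    c_r = np.cov(real_feats, rowvar=False)
    c_f = np.cov(fake_feats, rowvar=False)

    # Matrix square root of C_r @ C_f; tiny imaginary parts caused by
    # numerical error are discarded.
    covmean, _ = linalg.sqrtm(c_r @ c_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(c_r + c_f - 2.0 * covmean))

# Hypothetical usage with random 64-dimensional "features".
rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 64))
fake = rng.normal(loc=0.1, size=(1000, 64))
print(frechet_distance(real, fake))
```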
Figure 4. The lower the FID score, the better the image quality and fidelity. Source
In Generating images with sparse representations, an improved version of FID, called sFID, was proposed. Instead of the last pooling layer, it uses intermediate spatial features of the Inception network. sFID correlates better with human judgment of image quality: it is sensitive to realistic textures and spatial structure and captures diversity in a way closer to human perception.
Even though sFID is by far the best automatic metric for evaluating generative image models, the IS is still widely used alongside it, because IS better captures an image's fidelity to the requested condition.
FID score drawbacks
Recently, a paper was published about FID drawbacks: Rethinking FID: Towards a Better Evaluation Metric for Image Generation. The authors show that FID struggles to capture gradual image quality improvements and does not reflect image distortions. They identify the following issues that should be addressed to improve on the FID score:
- The Inception model produces weak image embeddings. CLIP is trained on roughly 400x more samples and, instead of a classification task, it is trained to align images with their textual descriptions. As a result, CLIP produces much richer embeddings than the Inception model.
- FID assumes that the Inception features follow a multivariate normal distribution, which might not always be true.
- The FID score is biased. FID computed on a finite sample set is not the true value of the score; it depends on the number of samples used for the computation. Details can be found in the paper Effectively Unbiased FID and Inception Score and where to find them (a rough sketch of the proposed correction follows this list).
- FID is sample-inefficient. It requires a large number of samples to reliably estimate the d × d covariance matrix of the high-dimensional feature vectors.
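To make the bias issue more concrete, here is a rough sketch in the spirit of the extrapolation idea from Effectively Unbiased FID and Inception Score and where to find them: FID is computed on subsets of increasing size, fitted as a linear function of 1/N, and the intercept (the limit 1/N → 0) is taken as the bias-corrected estimate. The function name, subset sizes, and the choice to subsample only the generated features are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def fid_infinity(real_feats, fake_feats, fid_fn,
                 sizes=(2000, 5000, 10000, 20000, 50000)):
    """Sketch of a bias-corrected FID: fit FID(N) ~ a * (1/N) + b over
    several subset sizes N and return the intercept b as the estimate
    for an infinite sample. `fid_fn` is any FID implementation taking
    two (N, d) feature arrays; fake_feats must contain at least max(sizes) rows.
    """
    rng = np.random.default_rng(0)
    inv_n, fids = [], []
    for n in sizes:
        idx = rng.choice(len(fake_feats), size=n, replace=False)
        fids.append(fid_fn(real_feats, fake_feats[idx]))
        inv_n.append(1.0 / n)
    slope, intercept = np.polyfit(inv_n, fids, deg=1)  # linear fit in 1/N
    return float(intercept)
```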
As a response to those drawbacks, the authors of the paper propose CLIP-Maximum Mean Discrepancy (CMMD).
CLIP-Maximum Mean Discrepancy
CMMD is an improved version of the FID score. Instead of Inception embeddings, it uses the much richer CLIP embeddings, and it replaces the Fréchet distance with the Maximum Mean Discrepancy (MMD). MMD was designed to test whether two samples come from the same distribution. Compared to the FID score, MMD, when used with a characteristic kernel, makes no assumptions about the data distribution. Moreover, it admits an unbiased estimator and is sample-efficient when working with high-dimensional data.
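For intuition, here is a minimal sketch of an unbiased squared-MMD estimator with a Gaussian RBF kernel over image embeddings. For CMMD, the inputs would be CLIP embeddings of real and generated images; the bandwidth value here is an illustrative choice, not the exact configuration from the CMMD paper.

```python
import numpy as np

def mmd2_rbf(x: np.ndarray, y: np.ndarray, sigma: float = 10.0) -> float:
    """Unbiased squared MMD between samples x (m, d) and y (n, d) using a
    Gaussian RBF kernel k(a, b) = exp(-||a - b||^2 / (2 * sigma^2))."""
    def sq_dists(a, b):
        # Pairwise squared Euclidean distances between rows of a and b.
        return (
            np.sum(a**2, axis=1)[:, None]
            + np.sum(b**2, axis=1)[None, :]
            - 2.0 * a @ b.T
        )

    gamma = 1.0 / (2.0 * sigma**2)
    k_xx = np.exp(-gamma * sq_dists(x, x))
    k_yy = np.exp(-gamma * sq_dists(y, y))
    k_xy = np.exp(-gamma * sq_dists(x, y))

    m, n = len(x), len(y)
    # Unbiased estimator: drop the diagonal terms of the within-set kernels.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    term_xy = k_xy.mean()
    return float(term_xx + term_yy - 2.0 * term_xy)
```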
Figure 5. The authors of CMMD show that the FID score may fail to capture gradual improvements in image quality. The quality of the Pantheon image is gradually improved from left to right; however, as shown in the bottom plot, the FID score increases monotonically, wrongly suggesting the quality is degrading. The newly proposed CMMD metric, on the other hand, successfully captures the gradual improvement. Source
Summary
In this article, we learned about the two main metrics for evaluating image generative models: the Inception Score and the Fréchet inception distance. The Inception Score fails to capture image diversity within a single class and should only be used when the Inception model was trained on the same dataset as the generative model. FID, on the other hand, solves the diversity problem of IS, but it is less suitable for evaluating fidelity to the requested condition and wrongly assumes that image features follow a normal distribution. The newly proposed CMMD score addresses FID's drawbacks by using the Maximum Mean Discrepancy, which makes no assumptions about the data distribution, and it further improves on FID by using the much richer CLIP embeddings instead of Inception embeddings. If you would like to learn more about Stable Diffusion, please check out our other blog posts about classifier-free diffusion model guidance and ControlNet.