Classifier-free diffusion model guidance
Image generation can be unconditional, or conditioned on class embeddings or free-form text. For conditional image generation, we may want to control how strongly the condition is applied. The guidance scale in diffusion models plays a role similar to the temperature in Large Language Models: it controls the trade-off between creativity and predictability. The higher the guidance scale, the more closely the model follows the caption; the lower the scale, the more random the output image. Note that the direction is reversed compared to temperature: a high guidance scale corresponds to a low temperature.
Figure 1. Results of the Stable Diffusion model for different guidance scales with the same prompt. Prompt: "yellow dog chasing a cat in the rain". Generated using huggingface.co
In Large Language Models, temperature flattens or sharpens the probability distribution over the tokens to be sampled. Diffusion models have no equally direct mechanism. Naive solutions, such as scaling the model score vectors or decreasing the amount of Gaussian noise added during diffusion sampling, don't give good results. Therefore, different approaches to generating "low temperature" samples from a diffusion model have been proposed. The first one was classifier guidance. It adds a classification network alongside the diffusion model and uses it to control how strongly the output image matches the target class. This method, however, can't generalize to unseen classes and adds extra training overhead. A better approach, classifier-free guidance, was therefore proposed. It neither requires an additional network nor limits the model to seen classes.
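To make the temperature analogy concrete, here is a minimal sketch of temperature-scaled sampling probabilities. The logits are made-up numbers and the function names are only illustrative; the point is that dividing the logits by a temperature below 1 sharpens the distribution, while a temperature above 1 flattens it.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; the temperature rescales the logits first."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()          # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return float(-(probs * np.log(probs)).sum())

logits = [2.0, 1.0, 0.1]            # made-up scores for three tokens
sharp = softmax_with_temperature(logits, 0.5)   # low temperature: predictable
plain = softmax_with_temperature(logits, 1.0)   # unmodified distribution
flat = softmax_with_temperature(logits, 2.0)    # high temperature: more random
```

Comparing `entropy(sharp)`, `entropy(plain)` and `entropy(flat)` shows the distribution getting flatter as the temperature grows.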
Background
At a high level, diffusion models produce an image out of noise. The model gradually removes noise from an initial noisy image until the target high-quality image is obtained. The smaller the denoising steps (that is, the more steps are taken), the better the output image.
Sampling goes from time T to 0. It starts with Gaussian noise x_T and produces gradually less noisy samples x_{T−1}, x_{T−2}, ... until reaching a final sample x_0; the subscript t denotes the timestep. x_t can be thought of as a mixture of a signal x_0 with some noise, where the signal-to-noise ratio is determined by the timestep t.
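The "mixture of signal and noise" view can be sketched in a few lines. This is only an illustration under assumptions not stated in the article: a linear noise schedule with 1000 steps, a variance-preserving mixing rule, and a random array standing in for an image. The names `alpha_bar`, `noisy_sample` and `snr` are made up for this example.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)       # share of signal variance left at each step

def noisy_sample(x0, t, rng):
    """Mix the clean signal x0 with Gaussian noise for the 0-based timestep index t."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def snr(t):
    """Signal-to-noise ratio of x_t under the mixing rule above."""
    return alpha_bar[t] / (1.0 - alpha_bar[t])

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))          # stand-in for an image
x_early, _ = noisy_sample(x0, 10, rng)    # mostly signal
x_late, _ = noisy_sample(x0, T - 1, rng)  # almost pure noise
```

Under this schedule the signal-to-noise ratio falls monotonically with t, so the timestep alone determines how much of x_0 is still visible in x_t.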
Figure 2. The visualization of the diffusion process. Each step removes a bit of noise from the previous image until the high-quality image is obtained. Source: wikipedia: Diffusion_model
To train the diffusion model, we sample an image x_0, a timestep t, and Gaussian noise ε. With those values, we can compute the noisy sample x_t at timestep t. The diffusion model learns to predict the noise contained in x_t: ε_θ(x_t, t) is the network's estimate of the noise ε that was added to obtain x_t. By removing the predicted noise ε_θ(x_t, t) from x_t, we get x_{t−1}. The network is trained with a mean-squared error objective between the true noise and the predicted noise, ||ε_θ(x_t, t) − ε||².
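The training recipe above can be sketched as a single loss evaluation. This is a toy illustration, not a real training loop: the linear schedule, the mixing rule, and the two stand-in "models" (`zero_predictor` and `oracle_predictor`) are assumptions made for the example, and the squared error is averaged over pixels rather than summed.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)

def training_loss(predict_noise, x0, rng):
    """One Monte Carlo estimate of the noise-prediction MSE for a clean sample x0."""
    t = int(rng.integers(0, T))                      # random timestep index
    eps = rng.standard_normal(x0.shape)              # true noise
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    eps_hat = predict_noise(x_t, t)                  # a neural network in a real setup
    return float(np.mean((eps_hat - eps) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))                     # stand-in for an image

def zero_predictor(x_t, t):
    """Untrained stand-in that always predicts 'no noise'."""
    return np.zeros_like(x_t)

def oracle_predictor(x_t, t):
    """Cheating stand-in that knows x0 and inverts the noising formula."""
    return (x_t - np.sqrt(alpha_bar[t]) * x0) / np.sqrt(1.0 - alpha_bar[t])

loss_zero = training_loss(zero_predictor, x0, rng)      # roughly the variance of the noise
loss_oracle = training_loss(oracle_predictor, x0, rng)  # essentially zero
```

A real model sits between these two extremes, and gradient descent on this loss moves it toward the oracle.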
Classifier-free guidance training
As described above, the basic diffusion model produces unconditional images. This means that the generated image is random and we have no influence on what it will be. To condition the generated image on user input, Alex Nichol and Prafulla Dhariwal (2021) proposed adding a class embedding v_i to the timestep embedding e_t and passing this embedding to residual blocks throughout the model. In current diffusion models, attention layers take over the role of those residual blocks, and instead of class embeddings, the model is conditioned on text using text embeddings.
Figure 3. Evolution of diffusion models. 1. Diffusion model without class embedding. 2. Diffusion model with class embedding. 3. Classifier-guided diffusion model. 4. Classifier-free diffusion model.
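A rough sketch of the class-conditioning idea: the class embedding v_i is simply added to the timestep embedding e_t before the result is handed to the network blocks. The sinusoidal form of the timestep embedding, the embedding size, and the random `class_table` (learned in a real model) are assumptions for illustration only.

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal embedding e_t of an integer timestep (dim must be even)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

rng = np.random.default_rng(0)
num_classes, dim = 10, 16
class_table = rng.standard_normal((num_classes, dim))  # learned in a real model; random here

def conditioning_vector(t, class_id):
    """e_t + v_i: the vector that the residual blocks would receive."""
    return timestep_embedding(t, dim) + class_table[class_id]

cond_a = conditioning_vector(t=500, class_id=3)
cond_b = conditioning_vector(t=500, class_id=7)
```

At the same timestep, the two vectors differ exactly by the difference of the two class embeddings, which is the only place where the class enters the network in this scheme.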
It turned out that class-conditioned diffusion models suffer from low fidelity: generated images are not very class-consistent. To further improve conditional diffusion models, Prafulla Dhariwal and Alex Nichol (2021) proposed classifier guidance. They added a classification network alongside the diffusion model and used the gradient of the class prediction during the denoising process to guide the model toward the desired class. Although the method worked well and traded diversity for fidelity, it had some drawbacks. The main drawback was that the model could not generalize to classes not seen during training. Another problem was that the classifier had to be trained to classify noisy samples, so it was not possible to plug in a pre-trained general-purpose classifier.
Figure 4. Fréchet inception distance (FID) and Inception score (IS) over guidance strengths. Each curve represents a model with unconditional training probability p_uncond. Each point represents a different guidance scale. We can see that stronger guidance results in higher FID and IS scores. The lower the FID score the better, and the higher the IS score the better, so the plot clearly shows that guidance strength trades off FID for IS. FID penalizes lack of variety. IS, on the other hand, measures how well a model captures the full ImageNet class distribution while producing individual samples that have low entropy for a single class. One drawback of this metric is that it does not reward covering the whole distribution or capturing diversity within a class, and models that memorize a small subset of the full dataset will still have high IS. Source: arxiv.org
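As a rough illustration of classifier guidance, the sketch below shifts a noise prediction along the gradient of a classifier's log-probability for the target class. Everything here is a toy assumption: the classifier is a linear softmax model with random weights that ignores the timestep, the "image" has four dimensions, and `sigma_t` stands for the noise level of x_t. A real setup would use a neural classifier trained on noisy images and automatic differentiation.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def class_log_prob(x, y, W, b):
    """log p(y | x) for a toy linear classifier with logits W @ x + b."""
    return log_softmax(W @ x + b)[y]

def class_log_prob_grad(x, y, W, b):
    """Analytic gradient of log p(y | x) with respect to x."""
    probs = np.exp(log_softmax(W @ x + b))
    return W[y] - probs @ W

def classifier_guided_eps(eps, x_t, y, W, b, sigma_t, scale):
    """Shift the noise prediction against the classifier gradient, scaled by `scale`."""
    return eps - scale * sigma_t * class_log_prob_grad(x_t, y, W, b)

rng = np.random.default_rng(0)
W, b = rng.standard_normal((3, 4)), rng.standard_normal(3)   # 3 classes, 4-dim "image"
x_t = rng.standard_normal(4)
eps = rng.standard_normal(4)        # stand-in for the diffusion model's output
eps_guided = classifier_guided_eps(eps, x_t, y=1, W=W, b=b, sigma_t=0.5, scale=2.0)
```

With `scale=0.0` the prediction is left untouched; larger values push the denoising direction harder toward images the classifier assigns to class `y`.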
In response to those problems, Jonathan Ho and Tim Salimans proposed classifier-free diffusion guidance. They showed that jointly training a conditional and an unconditional diffusion model and combining the resulting conditional and unconditional scores allows for a trade-off between sample quality and diversity. In detail, the noise prediction at timestep t, ε_θ(x_t, t, c), is trained both with and without the class condition. For the unconditional model, a null token ∅ is used as the class identifier c, giving ε_θ(x_t, t, c = ∅), written ε_θ(x_t, t) for short. The conditional and unconditional models can be trained jointly as a single network by randomly setting c to ∅ with probability p_uncond. At inference, the guidance parameter w controls the strength of the condition:
ε_overall(x_t, t, c) = (1 + w) ε_θ(x_t, t, c) − w ε_θ(x_t, t)
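We can interpret the equation as follows. Rearranged, the right-hand side is the conditional prediction plus w times the difference between the conditional and the unconditional prediction, so the conditional prediction is pushed further along the direction that separates it from the unconditional one. In other words, the generic diversity of the image distribution is subtracted from the class-specific distribution, which further enforces the particular class. With w = 0, the plain conditional model is recovered.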
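Both halves of classifier-free guidance fit in a short sketch: dropping the condition at random during training, and combining the two predictions at inference following the guidance equation above. The null-token id `NULL_CLASS`, the function names, and the numeric model outputs are hypothetical placeholders; in practice both predictions come from the same network, called once with the condition and once with the null token.

```python
import numpy as np

NULL_CLASS = -1   # hypothetical id standing in for the null token

def drop_condition(class_ids, p_uncond, rng):
    """Training side: replace each label by the null token with probability p_uncond."""
    class_ids = np.asarray(class_ids)
    drop = rng.random(class_ids.shape) < p_uncond
    return np.where(drop, NULL_CLASS, class_ids)

def guided_eps(eps_cond, eps_uncond, w):
    """Inference side: (1 + w) * conditional prediction - w * unconditional prediction."""
    return (1.0 + w) * eps_cond - w * eps_uncond

rng = np.random.default_rng(0)
labels = drop_condition(np.full(10_000, 3), p_uncond=0.1, rng=rng)

eps_cond = np.array([0.5, -1.0, 2.0])     # made-up model outputs for one x_t
eps_uncond = np.array([0.0, -0.5, 1.0])
eps_w0 = guided_eps(eps_cond, eps_uncond, w=0.0)   # plain conditional prediction
eps_w3 = guided_eps(eps_cond, eps_uncond, w=3.0)   # pushed away from the unconditional one
```

Because only the inference-time combination depends on w, the guidance strength can be changed freely without retraining the model.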
Figure 5. The effect of guidance strength on a mixture of three Gaussians. Each Gaussian represents a separate class. With no guidance, there is some overlap between classes, so sampling from one class can produce samples that belong to other classes. Increasing the guidance strength separates the classes better and therefore increases fidelity. Source: arxiv.org
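The effect in the figure can be reproduced numerically in one dimension. The sketch below is a simplification under stated assumptions: three unit-variance Gaussians at made-up positions, equal class priors, and a single noise level, where combining the scores with weight w corresponds to a density proportional to p(x|c)^(1+w) / p(x)^w. The helper `class_consistency` is an illustrative measure, not a metric from the paper.

```python
import numpy as np

def gaussian_pdf(x, mean, std=1.0):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]
means = [-2.0, 0.0, 2.0]                                  # one Gaussian per class (assumed values)
p_x_given_c = np.stack([gaussian_pdf(x, m) for m in means])
p_x = p_x_given_c.mean(axis=0)                            # mixture with equal class priors
posterior = p_x_given_c / (3.0 * p_x)                     # p(c | x) by Bayes' rule

def guided_density(c, w):
    """Density proportional to p(x|c)^(1+w) / p(x)^w, normalised on the grid."""
    unnorm = p_x_given_c[c] ** (1.0 + w) / p_x ** w
    return unnorm / (unnorm.sum() * dx)

def class_consistency(c, w):
    """Average p(c | x) over samples drawn from the guided density for class c."""
    return float((guided_density(c, w) * posterior[c]).sum() * dx)

scores = [class_consistency(c=1, w=w) for w in (0.0, 1.0, 4.0)]
```

The values in `scores` grow with w: stronger guidance concentrates the samples where the middle class is least likely to be confused with its neighbours, at the cost of covering less of that class.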
Summary
In the above article, we learned why and how to perform classifier-free guidance in diffusion models. When generating images, we want to be able to control what is generated. For that purpose, we should use conditional diffusion models. However, simple conditional diffusion models suffer from a lack of fidelity and class consistency, so we need to sacrifice a bit of diversity for fidelity. To control this diversity vs fidelity trade-off, classifier guidance was proposed first, followed by its improved successor, classifier-free guidance. Diffusion models can be conditioned using other signals as well, for example, other images. This kind of control, however, relies on other mechanisms. Control images directly affect convolutional layers and therefore don't need classifier or classifier-free diffusion guidance. If you want to learn more about controlling diffusion models using other images, check out my great blog post about ControlNet. Also, if you are interested in diffusion models, you might find it worth checking out the blog posts about diffusion model evaluation metrics and about speeding up diffusion model training using min-SNR.
Reviewed by Rafał Pytel