Interior Design with Stable Diffusion
How can Stable Diffusion help each of us redesign our apartment? In this article, I will give a short introduction to what Stable Diffusion is and how it works. Then I will explain how we can control the shapes it generates. Finally, I will show a toy project that uses the technology to create room design ideas.
Stable Diffusion
Stable Diffusion is a generative text-to-image neural network that can create photo-realistic images given a text prompt and noise. Released in August 2022, the original model has 859 million parameters and an input image size of 512x512 pixels. Stable Diffusion is much more stable to train than GANs and produces images of comparable or higher quality. The architecture follows the U-Net structure with skip connections and incorporates text embeddings through cross-attention layers. The text embeddings give the user control over the model, so it can be asked for a specific image. The image is obtained in a recurrent manner: the network predicts a small portion of the noise in the image and subtracts it from the input, and the result is fed back in as the next input. The loop repeats until the final, high-quality image is obtained. In the original formulation there are about 1,000 of these denoising steps, although modern samplers get good results with far fewer.
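To make that loop concrete, here is a minimal text-to-image sketch using the Hugging Face diffusers library. The model id, prompt, and sampler settings are my own illustrative choices, not necessarily the exact setup used in this article.

```python
# Minimal text-to-image sketch with diffusers (assumed environment: torch and
# diffusers installed, a CUDA GPU available; the checkpoint below is one public
# Stable Diffusion v1.5 model, used here only as an example).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# num_inference_steps is the number of denoising iterations; modern samplers
# need far fewer than the 1,000 steps used during training.
image = pipe(
    "a cozy scandinavian bedroom, interior design",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("bedroom.png")
```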
Table 1. Examples of Stable Diffusion outputs: given a user-specified prompt, the model generates an image. Images were generated using the official Stable Diffusion demo.
By design, Stable Diffusion is controlled only by the text prompt. However, the architecture is not limited to that: with slight modifications, it can be controlled with user drawings, segmentation masks, or edges. In this article, I focus on controlling Stable Diffusion with lines, as this allows the generation of many different interior design ideas from a single input image of a room. The challenge with controlling the model through modalities other than text is the training time and the resources required. The original Stable Diffusion model was trained on about 2 billion images using 256 Nvidia A100 GPUs for roughly 150,000 GPU-hours, which corresponds to about $600,000 worth of compute at market prices. Modifying the architecture and training it from scratch is therefore an expensive task, out of reach for most people. This is where ControlNet comes in. Its authors proposed an architecture that efficiently tunes the original Stable Diffusion model: training is feasible on as little as a single RTX 3090, and good results can already be obtained after 16 hours of training on a dataset of 50k images.
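As a preview of what such conditioning looks like in practice, below is a hedged sketch of edge-conditioned generation using diffusers' ControlNet support. The checkpoint names are public examples and stand in for the model trained later in this article; the file paths are illustrative.

```python
# Edge-conditioned generation sketch with diffusers' ControlNet pipeline
# (assumed: torch, diffusers, a CUDA GPU; checkpoints are public stand-ins).
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The condition image is a black canvas with white lines marking the room layout.
condition = load_image("room_edges.png")
image = pipe("interior design", image=condition, num_inference_steps=30).images[0]
image.save("room_redesign.png")
```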
ControlNet
ControlNet builds on top of the Stable Diffusion model. As presented in Figure 1, the U-Net with its text and time encoders is an exact copy of the original Stable Diffusion model. It shares the original weights, which remain frozen during training. On top of that, ControlNet adds a trainable encoder, a copy of the Stable Diffusion U-Net encoder, which is responsible for the spatial control of the output image. Its input is the noise together with the condition image, here edges representing contours of the target image. Finally, ControlNet adds zero-convolution layers that connect the control encoder to the Stable Diffusion U-Net. The name comes from their weight initialization: the weights start at zero. As the authors show in their paper, zero initialization lets the control encoder's influence on the U-Net grow smoothly. At the first iteration, the control encoder has no impact on the output at all; over the course of training, its influence gradually increases, and the output becomes more and more aligned with the condition shape.
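A minimal PyTorch sketch of the zero-convolution idea, assuming a 1x1 convolution as described in the paper; the channel size and tensors below are illustrative, not taken from the actual model code.

```python
# "Zero convolution": a 1x1 convolution whose weights and bias start at zero,
# so the trainable control encoder contributes nothing at the first iteration
# and its influence grows gradually during training.
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

# Frozen U-Net features plus the zero-convolved control encoder features:
# at initialization the sum equals the frozen features exactly.
frozen_features = torch.randn(1, 320, 64, 64)
control_features = torch.randn(1, 320, 64, 64)
out = frozen_features + zero_conv(320)(control_features)
assert torch.allclose(out, frozen_features)
```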
Figure 1. ControlNet architecture.
ControlNet for Interior Design
For the purpose of this project, I connected multiple interior design datasets:
- IKEA Interior design dataset (298 images)
- Indoor Scene Recognition dataset (2,809 images)
- House Rooms dataset (5,250 images)
- GeoSynth: A Photorealistic Synthetic Indoor Dataset for Scene Understanding (170 images)
In total, a dataset of 8,527 images was created. The authors of the ControlNet paper suggest using at least 50k images; however, I had a hard time finding more interior design images of sufficient quality.
Some images in the dataset have a resolution of 256x256 pixels, while the original Stable Diffusion model was trained on 512x512 images. To keep the original architecture unchanged, I decided to upscale all images and center-crop them to 512x512.
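A small preprocessing sketch of that step, assuming Pillow is available; the resampling filter and file handling are my own choices, not necessarily the ones used in the project.

```python
# Upscale the shorter side to 512 pixels, then center-crop to 512x512.
from PIL import Image

def to_512(path: str) -> Image.Image:
    img = Image.open(path).convert("RGB")
    # Scale so that the shorter side becomes 512 while keeping the aspect ratio.
    scale = 512 / min(img.size)
    img = img.resize(
        (round(img.width * scale), round(img.height * scale)),
        Image.Resampling.BICUBIC,
    )
    # Center crop to exactly 512x512.
    left = (img.width - 512) // 2
    top = (img.height - 512) // 2
    return img.crop((left, top, left + 512, top + 512))
```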
The dataset for ControlNet training consists of original images, condition images (lines, edges, masks), and text prompts. For each original image, I created a condition image using the Hough transform. Initially, I experimented with a Canny edge detector, but the resulting condition images had too many contours, and the model had a hard time coming up with new designs matching so many unstructured edges. The Hough transform outputs only straight lines, which are a perfect fit for mapping interiors.
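A hedged sketch of building such a line-based condition image with OpenCV's probabilistic Hough transform; the threshold values are illustrative and would need tuning per dataset.

```python
# Build a condition image containing only straight lines detected in the photo.
import cv2
import numpy as np

def hough_condition(path: str) -> np.ndarray:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Edge map first, then keep only straight line segments from it.
    edges = cv2.Canny(img, 100, 200)
    lines = cv2.HoughLinesP(
        edges, 1, np.pi / 180, threshold=80,
        minLineLength=40, maxLineGap=10,
    )
    # Draw the detected lines in white on a black canvas.
    condition = np.zeros_like(img)
    if lines is not None:
        for x1, y1, x2, y2 in lines[:, 0]:
            cv2.line(condition, (x1, y1), (x2, y2), 255, 2)
    return condition
```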
Finally, I had to create a text prompt for each image. One option is to simply use an empty prompt “”. However, as the model was originally trained with prompts, the results might be poor, so I decided to use a default prompt, “interior design”, for every image. The best option would be to use an automatic image-captioning network, such as Bootstrapping Language-Image Pre-training (BLIP), to create a customized text prompt for each image. I leave this approach for future work.
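For reference, a hedged sketch of what that future-work captioning step could look like with a public BLIP checkpoint from Hugging Face transformers; the model id and file name are illustrative assumptions.

```python
# Automatic captioning sketch with BLIP (assumed: transformers, torch, Pillow).
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("bedroom.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(out[0], skip_special_tokens=True)
print(caption)  # e.g. "a bedroom with a bed and a large window"
```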
Figure 2. Bedroom design ideas generated by the ControlNet trained on RTX 3090, batch size = 4, for 48 hours.
Figure 3. Bedroom design ideas generated by the ControlNet trained on RTX 3090, batch size = 4, for 48 hours.
Figure 4. Living room design ideas generated by the ControlNet trained on RTX 3090, batch size = 4, for 48 hours.
Figure 5. Bedroom design ideas generated by the ControlNet trained on RTX 3090, batch size = 4, for 48 hours.
Summary
As we can see in Figures 2-5, even with a small dataset, ControlNet gives decent results. The control encoder correctly forces the network to output images of a similar shape. The extensive knowledge of the visual world encoded in the pre-trained Stable Diffusion model allows us to obtain high-quality images right from the beginning of training. Training only the ControlNet encoder, without modifying the weights of the Stable Diffusion U-Net, lets the model learn just the new image condition without overwriting the knowledge it already possesses.
In order to improve the results, it would be necessary to create a larger dataset with higher-quality images. For better text prompts, one could use the BLIP network; customizing the prompt for each image should further improve the final results. Furthermore, the authors suggest that training benefits from larger batch sizes, whereas simply training for longer does not improve results. That approach, however, would require a GPU cluster, gradient accumulation, or a GPU with more than 24 GB of VRAM.
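For completeness, a hedged sketch of gradient accumulation in plain PyTorch, one way to emulate a larger effective batch on a single 24 GB GPU; the model, dataloader, and loss computation are placeholders standing in for the actual ControlNet training loop.

```python
# Gradient accumulation: sum gradients over several small batches before each
# optimizer step, so the effective batch size is batch_size * accumulation_steps.
def train_epoch(model, dataloader, optimizer, accumulation_steps: int = 8):
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):
        # Placeholder: assume the model returns the diffusion training loss.
        loss = model(**batch) / accumulation_steps
        loss.backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```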