Fine Tuning Stable Diffusion XL (SDXL) with Porsche 911 Dakar

AI-based image generation programs are getting really good, and also really popular. When it comes to generating images of landscapes, cartoons, and art, it’s easy to see why. Programs such as OpenAI’s DALL-E 3, Midjourney, Adobe Firefly and Stability AI’s Stable Diffusion XL (SDXL) model can conjure breathtaking high-quality photos all from a simple text-based prompt. 

But what about the situation when you need to generate photorealistic images of a specific product? We don’t want the image generation model to take any creative license with the design, as people would immediately notice any deviations in the product’s look.  

As an example, if we ask the model to create images of a widely recognised object such as the latest iPhone, it will do a decent job of generating content; however, some rendering inaccuracies appear on closer inspection.


“Hand holding an iPhone X” - vanilla SDXL with no training

If you were to glance at this image for half a second, you would immediately be able to identify the device. It doesn’t take long, however, to notice some glaring oddities. For one, the rear of the iPhone appears where the front screen should be. More noticeably, there’s something anatomically incorrect about the way those fingers are gripping the device!

This is where fine-tuning an SDXL model comes in handy. To minimise any design discrepancies between the physical product and the images the model generates, we can train the model by feeding it lots of high-quality images of the subject we need to generate content for.

In the next example, let’s say that we are trying to generate new promotional material for the new Porsche 911 Dakar. The Dakar is what happens when you take Porsche’s beloved 911 sports car and turn it into an off-roader. It maintains that classic and sleek look from the original 911, and gives it a sturdier remodel. Let’s see how we can train SDXL to do this for us. 
 

Without Fine-Tuning 

Before we get started, let’s get a feel for how well SDXL does without any training images to work from. I prompted SDXL with “A Porsche 911 Dakar driving down a highway”, and this is what it generated:
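For reference, here is a minimal sketch of how such an un-tuned generation might be run with Hugging Face’s diffusers library and the public SDXL base checkpoint. The helper names (`build_prompt`, `generate`) are our own, not from any official tool:

```python
def build_prompt(subject: str, scene: str) -> str:
    """Compose a simple text prompt from a subject and a scene."""
    return f"{subject} {scene}"


def generate(prompt: str, output_path: str = "dakar.png") -> None:
    """Generate one image with the SDXL base model (sketch)."""
    # Heavy imports stay inside the function so build_prompt can be
    # used without pulling in torch or downloading model weights.
    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
    )
    pipe.to("cuda")  # assumes a CUDA-capable GPU is available
    image = pipe(prompt=prompt).images[0]
    image.save(output_path)


# Example usage (requires a GPU and downloads several GB of weights):
# generate(build_prompt("A Porsche 911 Dakar", "driving down a highway"))
```

This is a sketch under the assumption of a local GPU; hosted inference services expose the same model behind an API instead.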

“A Porsche 911 Dakar driving down a highway” - vanilla SDXL with no training

Not a bad image. The shape of the 911 is good, but it’s not a Dakar model (it looks more like a GT3), and it glaringly jumbles any text on the vehicle. This is not surprising: there are many models of 911, and because the Dakar is a very recent (2023) model, Stable Diffusion almost certainly has not been pre-trained on that specific variant.

What happens if we give it better data to work with?  

With Fine tuning  

After saving a dataset of 100+ high-quality (publicly available) images of the 911 Dakar, we labelled each image with a detailed description of the photo. When writing descriptions, it’s important not to reference the subject of the training data by its name (in this case, “Porsche 911 Dakar”). Instead, we reference the subject as “TOK”. This separates the model’s pre-trained understanding of what a “Porsche 911 Dakar” might look like from what we are telling it the subject should look like. Referencing the training subject as “TOK” forces the model to analyse the subject as if it’s something it hasn’t seen before.
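The caption-preparation step above can be automated. The sketch below assumes each image already has a plain-text description and swaps every mention of the real subject name for the placeholder token before writing fine-tuning metadata; the function names and metadata layout are illustrative, not from an official tool:

```python
# Illustrative caption preparation for fine-tuning: replace the real
# subject name with the placeholder token "TOK" so the model treats
# the subject as something it has never seen before.

SUBJECT = "Porsche 911 Dakar"
TOKEN = "TOK"


def tokenise_caption(caption: str) -> str:
    """Swap the real subject name for the placeholder token."""
    return caption.replace(SUBJECT, TOKEN)


def build_metadata(captions: dict[str, str]) -> list[dict[str, str]]:
    """Map image filenames to tokenised captions, one record per image
    (the shape of a typical JSON-lines metadata file)."""
    return [
        {"file_name": name, "text": tokenise_caption(text)}
        for name, text in sorted(captions.items())
    ]
```

Each record can then be written out as one line of a `metadata.jsonl` file alongside the training images, which is a common input format for diffusion fine-tuning scripts.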

Here are some of the examples we were able to generate: 

While not entirely perfect, it seems to do a good job of generating images that are more consistent with the look of the 911 Dakar, but it still falls short of accurately rendering any logo or license plate text. What if we altered our training dataset to only include images that don’t have any license plates? Let’s see what happens: 
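The filtering step just described can be sketched as follows, under the assumption that each caption notes whether a plate is visible (a production pipeline might use an object detector instead); the marker list and function names are our own:

```python
# Illustrative dataset filtering: drop any training image whose
# caption mentions a visible license plate.

PLATE_MARKERS = ("license plate", "number plate")


def has_plate(caption: str) -> bool:
    """Heuristic check: does the caption mention a visible plate?"""
    lowered = caption.lower()
    return any(marker in lowered for marker in PLATE_MARKERS)


def filter_dataset(captions: dict[str, str]) -> dict[str, str]:
    """Keep only the images whose captions do not mention a plate."""
    return {
        name: text for name, text in captions.items() if not has_plate(text)
    }
```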

Much better! There are still some slight deviations between these generated images and the real 911 Dakar (for example, it refuses to include the red tow hook), and it still struggles to draw the Porsche logo. Nonetheless, the results are impressive given that the filtered training dataset contained only 39 images. Imagine the accuracy if we had hundreds or even thousands of images.

With enough patience, clever prompting, and a high-quality dataset, it’s possible to generate incredibly realistic photos of specific products. SDXL is still relatively new (it was only released mid-2023) and will continue to improve, and other apps (such as Adobe Firefly) will soon be releasing fine-tuning capabilities too.
