StableDiffusion: Images from text input

David Lee
23.9.2022
Translation: machine translated

Image generation with artificial intelligence is making progress. StableDiffusion doesn't work miracles, but it is full of surprises. And you can try it out for yourself, just as I did.

StableDiffusion is an image generator: you type in some text, and the artificial intelligence (AI) generates an image to go with it. Other AI generators, such as Dall-E 2, work the same way. However, while Dall-E 2 is currently only available to selected people and only for a fee, StableDiffusion can be used by anyone for free. DiffusionBee for the Mac makes things especially easy: the normally rather complicated installation is reduced to a simple drag and drop into the Applications folder.

Different every time

I start by typing in "cheesy giraffe skiing in the Swiss mountains wearing headphones". Text input works best in English because the data used to train StableDiffusion is mainly in English.

Every time StableDiffusion generates an image, something different comes out, even with the same text and settings. The "Guidance" parameter lets you specify how closely the AI should stick to the text prompt. By default, it is set almost to the maximum value, but even then the results vary greatly.
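
For the curious, the idea behind a guidance setting and the run-to-run variation can be sketched in a few lines of Python. This is a toy illustration, not the actual code of StableDiffusion or DiffusionBee. It assumes that "Guidance" corresponds to the scale used in classifier-free guidance, and that the variation comes from random starting noise, which a fixed seed would make repeatable. The function names and numbers are made up for the example.

```python
import random


def guided_prediction(uncond, cond, guidance_scale):
    """Classifier-free guidance on toy vectors: start from the prediction
    made without the text and move towards the prediction made with the
    text, scaled by guidance_scale."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


def starting_noise(seed, size=4):
    """Toy stand-in for the random noise an image is generated from."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


# Hypothetical toy numbers, not real model outputs.
uncond = [0.0, 0.0, 0.0]   # prediction that ignores the text
cond = [1.0, -1.0, 0.5]    # prediction that takes the text into account

# A scale of 1.0 reproduces the text-conditioned prediction unchanged.
print(guided_prediction(uncond, cond, 1.0))
# A higher scale pushes the result further in the direction of the text.
print(guided_prediction(uncond, cond, 7.5))

# The same seed gives the same starting noise; a different seed does not.
print(starting_noise(1) == starting_noise(1))
print(starting_noise(1) == starting_noise(2))
```

In this sketch, a high guidance value only pulls harder towards the text. It does not remove the randomness of the starting noise, which fits the observation that results still vary greatly.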

Wild mixtures yield nonsense

The giraffe example brings together things that normally don't belong together. Such text inputs are of course very appealing, but they are also very difficult for the AI: there are no photos, and probably not even drawings, of such a scene, and the AI is trained on real pictures.

The problem is also evident in the text "John Oliver marries a cabbage".

The elements mentioned in the text do appear in the pictures, but not in the form described. Nowhere does John Oliver marry a cabbage. Why do I even come up with such nonsense? Because in a John Oliver video, someone tried the same thing with Dall-E 2. Dall-E 2 fails just like StableDiffusion.

Because the AI needs real source material for good results, John Oliver married a cabbage specifically for that purpose. After all, one helps where one can.

Detention for the AI: the real-life template.

What works well and what doesn't

You've probably noticed John Oliver's grotesque eyes. Eyes often turn out askew, and human bodies are sometimes grotesquely distorted. StableDiffusion also has difficulty drawing straight lines.

Three attempts at "large building with straight geometry".

This is more distracting in photorealistic images than in paintings. In any case, StableDiffusion's strengths seem to lie in the realm of fantasy images. The site arthub.ai gives a good impression of this.

Here are some pictures to go with the text "a beautiful castle beside a waterfall in the woods, fantasy painting".

In six out of ten attempts, StableDiffusion painted two castles: the AI does not strictly distinguish between singular and plural. This can be quite confusing. It's clear to any human that a John Oliver wedding typically involves only one John Oliver. An AI like StableDiffusion or Dall-E is not aware of anything; it has no background knowledge with which to interpret inputs correctly. Accordingly, it creates images of two John Olivers marrying each other.

The AI doesn't understand what it is painting.

StableDiffusion also struggles with vague, abstract terms. The most inappropriate image in my whole experiment, which covered several hundred images, was the one for "Happiness": it expresses pretty much the opposite.

Happiness according to StableDiffusion.

StableDiffusion was trained with LAION-5B, a database of 5.85 billion text-image pairs that can be searched online. A search for "giraffe" mostly returns drawings or photos of toys rather than photos of real giraffes. This is the case for many terms and is a possible explanation for why StableDiffusion does not handle photorealistic images so well. The training material also contains many memes and other pictures with text, which is why StableDiffusion likes to imitate lettering without really being able to write.

Image for "average online commenter raging and hating on everything".

Top 20: The best song title illustrations

StableDiffusion is addictive. The appeal is that you never know what will come out. Because you have to wait anywhere from a few seconds to a few minutes for each picture, the suspense builds. At some point I had the idea of entering song titles. While waiting for one picture, I would think of several more titles that I really wanted to try. Once you've started, it's hard to stop. Anyway, here are my personal top 20:

20: Dr Funkenstein (George Clinton)

19: Dancing Queen (ABBA)

18: Cosmic Girl (Jamiroquai)

17: Breakfast in America (Supertramp)

16: Shelter From The Storm (Bob Dylan)

15: Yellow River (Christie)

14: Jailhouse Rock (Elvis Presley)

13: Diamonds on the Soles of Her Shoes (Paul Simon)

12: Sexy Motherfucker (Prince)

11: Shine On You Crazy Diamond (Pink Floyd)

10: Material Girl (Madonna)

9: Kiss My Ass (Wolfgang Amadeus Mozart)

8: Sex Machine (James Brown)

7: I Am the Walrus (Beatles)

6: Bad Guy (Billie Eilish)

5: Sultans of Swing (Dire Straits)

4: The Boy in the Bubble (Paul Simon)

3: Highway to Hell (AC/DC)

2: Lucy In The Sky With Diamonds (Beatles)

1: Shiny Happy People (R.E.M.)

My interest in IT and writing landed me in tech journalism early on (2000). I want to know how we can use technology without being used. Outside of the office, I’m a keen musician who makes up for lacking talent with excessive enthusiasm.

