
DALL-E output is not amazing yet

Published by marco on

The post Now add a walrus: Prompt engineering in DALL-E 3 by Simon Willison is a story about someone gaslighting himself into believing that LLMs work better than they do.

Case study: pelicans and walruses

Willison prompts “A super posh pelican with a monocle watching the Monaco F1” and gets the following results.

So far, so good. It’s really wonderful that you can get something that’s not completely random garbage. However, the bird is only watching the race in the top-right picture. In the first and fourth, it’s definitely facing the fourth wall. It seems to be posh in all of the pictures, to one degree or another—indicated by a monocle or a bowtie or both. The first prompt asks for a “Photo”, but that doesn’t look like a photo. Still, cars, coastline, pelican. OK.

Then he says “More like the first one please”:

Look, I get what he’s done here. He’s trying to show how cool it is that you can make a “conversation” out of this by implicitly referencing an image that was in the response to a prior question. This didn’t work at all less than a year ago, but now billions of dollars, thousands of developers, and millions of GPUs have made it possible. Kudos.

I guess the LLM interpreted that it should stick with the monocle, because the bowties are gone now. Willison is over the moon about how he thinks that it really got what he meant, but … the three new pictures look a lot more like the second picture than the first one (which features the whole pelican). It’s still doing reasonably well but, if a human had produced this, you’d be pretty annoyed that it was wasting your time. It didn’t understand what you wanted and just made more pictures, but not “more pictures like the first one.”

Next up is “Add a walrus.”

In response, he writes that “[t]hat second one is amazing. [emphasis in original]” Does he mean the one where the walrus is photo-bombed into the foreground? That’s not really amazing, is it? The walrus isn’t watching, but neither is the pelican—but he didn’t ask it to make the walrus “watch”, just to “add” one, which is, I guess, exactly what the LLM gave him. The last one looks nice, but they’re not watching the race at all (just “attending”?), and the background contains speedboats instead of F1 cars. In the third one, the F1 car is in the water, but that’s OK, I guess?

Lowering your expectations

He continues playing with it, and being amazed at how it manages to kind of respond to his input, but shouldn’t we expect better? Maybe he’s amazed that it works at all, but we’ve got to get a bit more critical of this stuff—otherwise, it will continue to just generate mediocre images that only vaguely fulfill the requirements.

It’s the difference between asking a child, an apprentice, or a professional painter for a picture of a tree. You wouldn’t be at all satisfied with child-level output from an apprentice, nor with apprentice-level output from a professional. I suppose my expectations are higher.

Missing fidelity

I completely agree that the LLM is able to respond to commands, but it’s not useful yet because it’s not able to make a finished product for you. You would have to tweak it to fix it.

And here’s the crucial difference between image-generation and text- or code-generation: it’s really, really hard to tweak the rendered image. Even if you knew how to use vector- or photo-manipulation tools, DALL-E is delivering a completed product, not the source that you would need in order to tweak it further. There are no layers in there. There are no masks. It’s just pixels. It’s a dead-end.

With text, on the other hand, we at least have the possibility of refining it in an editor. The finished product is itself editable at a fine-grained level. It’s entirely possible that you won’t be able to refine the product because you either don’t understand the language in which it’s written or perhaps because you couldn’t have done better yourself (which kind of amounts to the same thing).

I tend to think of code the same way I think of text: for a large number of languages, I can refine it better than the LLM could. If it’s a language or runtime library I’m not familiar with, or not well-versed in, then I may not be able to “fix it up,” either.

This is the situation that most people find themselves in with code, and in which we all find ourselves with images. Even graphics artists can’t manipulate the output of an image generator, whereas text or code output could conceivably be improved by somebody, even if it’s not the person who prompted the LLM.