DALL·E 2: The AI agent that creates stunning images from text prompts


Dewni De Silva

5th July 2022

A few years ago, would you have believed that an AI agent could create realistic pictures from nothing more than a text prompt? DALL·E 2, the latest breakthrough in generative AI, is the talk of the town because it lets people generate realistic-looking images within a matter of minutes.

Animal helicopter chimeras. Credit: Aditya Ramesh, co-creator of DALL·E 2

Have you ever wanted to paint a portrait of your dog in an abstract style but just didn't have the time? Or maybe you lacked the painting skills that abstract art demands, which are pretty hard to come by. No problem: DALL·E 2 is here to help you generate some cool art within a matter of minutes based on your text prompts.

Text-to-image generation was one of the most active and exciting AI fields of 2021. In January 2021, OpenAI introduced the original DALL·E, a 12-billion parameter version of the company’s GPT-3 transformer language model designed to generate images using text captions as prompts. An instant hit in the AI community, DALL·E’s stunning performance also attracted widespread mainstream media coverage. In November 2021, tech giant NVIDIA released the GAN-based GauGAN2, whose name takes inspiration from French Post-Impressionist painter Paul Gauguin, much as DALL·E’s nods to Surrealist artist Salvador Dalí. DALL·E 2, the successor to DALL·E, followed in April 2022.

In addition, OpenAI researchers presented GLIDE (Guided Language-to-Image Diffusion for Generation and Editing), a diffusion model that achieves performance competitive with the original DALL·E while using less than one-third of the parameters.

In the real world, while most images can be relatively easily described in words, generating images from text inputs requires specialized skills and many hours of labor. Enabling an AI agent to automatically generate photorealistic images from natural language prompts not only empowers humans with the ability to create rich and diverse visual content with unprecedented ease but also enables easier iterative refinement and fine-grained control of the generated images.

How was DALL·E built?

At its core, DALL·E uses the neural network architecture that’s responsible for tons of recent advances in ML: the Transformer. Introduced in 2017, the Transformer is an easy-to-parallelize type of neural network that can be scaled up and trained on huge datasets.

What makes DALL·E unique, though, is that it was trained on sequences that were a combination of words and pixels. But how exactly does the model understand the text we send it and generate an image out of it?

It starts by using OpenAI’s CLIP model to encode both text and images into the same domain: a condensed representation called a latent vector. It then takes this encoding and uses a generator, also called a decoder, to produce a new image that means the same thing as the text, since both map to the same region of the latent space.

So DALL·E 2 has two steps: CLIP encodes the information, and the new decoder model takes this encoded information and generates an image from it. These two separate steps are also why we can generate multiple variations of the same image. We can randomly change the encoded information just a little, moving it a tiny bit in the latent space, and it will still represent the same sentence while having different values, which yields a different image representing the same text.
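To make the "move a tiny bit in the latent space" idea concrete, here is a minimal sketch in NumPy. It is a toy illustration, not DALL·E 2's actual code: the 512-dimensional random vector stands in for a real encoded prompt, and `make_variations` and `noise_scale` are hypothetical names chosen for this example.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def make_variations(latent, n_variations=4, noise_scale=0.1, seed=0):
    """Return slightly perturbed unit-length copies of a latent vector.

    Hypothetical helper: each copy is the latent plus a random offset
    whose length is `noise_scale`, then re-normalized.
    """
    rng = np.random.default_rng(seed)
    latent = normalize(latent)
    variations = []
    for _ in range(n_variations):
        offset = noise_scale * normalize(rng.normal(size=latent.shape))
        variations.append(normalize(latent + offset))
    return np.stack(variations)

# Assumption: a random vector stands in for the encoding of a text prompt.
text_latent = normalize(np.random.default_rng(42).normal(size=512))
variations = make_variations(text_latent)

# All vectors are unit length, so the dot product is the cosine similarity.
similarities = variations @ text_latent
print(similarities)
```

Every perturbed vector stays very close to the original in direction, yet no two are identical. In a real system, each of them would be handed to the decoder to produce a different image for the same prompt.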

Overview of the decoder model. Credit: OpenAI

As we see here, the model initially takes a text input and encodes it. The top of the figure shows the first step of the training process, where we also feed the encoder an image and encode it using CLIP so that image and text pairs are encoded similarly, following the CLIP objective. Then, to generate a new image, we switch to the bottom section, where the CLIP text encoding is transformed into an image-ready encoding. This transformation is done using a diffusion mechanism, which is very similar to the diffusion model used for the final step. Finally, we take the newly created image encoding and decode it into a new image using a diffusion decoder!
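The "CLIP objective" mentioned above is a contrastive one: matching text and image pairs should score higher than mismatched pairs. The sketch below is a simplified NumPy version of that idea, not OpenAI's training code. The random embeddings, the function name `clip_style_loss`, and the temperature of 0.07 are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x):
    """Normalize each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric cross-entropy over a text-by-image similarity matrix.

    Row i of text_emb is assumed to describe row i of image_emb.
    """
    text_emb = l2_normalize(text_emb)
    image_emb = l2_normalize(image_emb)
    logits = text_emb @ image_emb.T / temperature  # shape (N, N)
    diag = np.arange(len(logits))
    text_to_image = -log_softmax(logits, axis=1)[diag, diag].mean()
    image_to_text = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (text_to_image + image_to_text) / 2

# Assumption: random vectors stand in for real encoder outputs.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 64))
aligned_text = image_emb + 0.1 * rng.normal(size=(8, 64))  # captions close to their images
shuffled_text = aligned_text[::-1]                         # every caption paired with the wrong image

aligned_loss = clip_style_loss(aligned_text, image_emb)
shuffled_loss = clip_style_loss(shuffled_text, image_emb)
print(aligned_loss, shuffled_loss)
```

The loss is low when each caption sits near its own image in the shared space and high when the pairs are scrambled, which is the pressure that pulls matching text and image encodings together during training.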

DALL·E 2’s ability to create variations

  • Syntactic and semantic variations

DALL·E 2 is a versatile model that can go beyond sentence-to-image generations. Because OpenAI is leveraging CLIP’s powerful embeddings, they can play with the generative process by making variations of outputs for a given input. This gives us a glimpse of CLIP’s “mental” imagery: what it considers essential in the input (and keeps constant across images) and what it considers replaceable (and changes across images). DALL·E 2 tends to preserve semantic information, as well as stylistic elements.

Variations of “The Persistence of Memory” by Salvador Dalí and OpenAI’s logo. Credit: OpenAI

From the Dalí example, we can see how DALL·E 2 preserves the objects (the clocks and the trees), the background (the sky and the desert), the style, and the colors. However, it doesn’t preserve the location and number of clocks or trees. This gives us a hint of what DALL·E 2 has learned to prioritize and what it hasn’t. The same happens with OpenAI’s logo. The patterns are similar and the symbol is circular/hexagonal, but neither the colors nor the patterns are always in the same place.

DALL·E 2 can also create visual changes in the output image that correspond to syntactic or semantic changes in the input sentence. It seems to be able to adequately encode syntactic elements as separate from one another. From the sentence “an astronaut riding a horse in a photorealistic style” DALL·E 2 generates these:

“An astronaut riding a horse in a photorealistic style.” Credit: OpenAI

By swapping the phrase “riding a horse” for “lounging in a tropical resort in space,” it now generates these:

“An astronaut lounging in a tropical resort in space in a photorealistic style.” Credit: OpenAI

This is one of the core features of DALL·E 2. You can input complex sentences, even ones with several complement clauses, and it seems to be able to generate coherent images that combine all the different elements into a semantically cohesive whole.

  • Inpainting - Where DALL·E can repair or restore (a painting) by repainting obliterated areas.

DALL·E 2 can also make edits to already existing images, a form of automated inpainting. In the next examples, the original image is on the left, and in the center and on the right are modified images with an object inpainted at different locations.

DALL·E 2 manages to adapt the added object to the style already present in that part of the image (for example, the corgi copies the style of the painting in the second image, while it has a photorealistic look in the third).

A corgi was added in different locations in the second and third images. DALL·E 2 matches the style of the corgi to the style of the background location. Credit: OpenAI

It also changes textures and reflections to account for the presence of the new object. This may suggest DALL·E 2 has some sort of causal reasoning (for example, because the flamingo is sitting in the pool, there should be a reflection in the water that wasn’t there previously).

A flamingo was added in different locations in the second and third images. DALL·E 2 updates reflections according to the new position of the flamingo. Credit: OpenAI

  • Text Diffs

DALL·E 2 has another cool ability: interpolation. Using a technique called text diffs, DALL·E 2 can transform one image into another. Below are Van Gogh’s The Starry Night and a picture of two dogs. It’s interesting how all the intermediate stages are still semantically meaningful and coherent, and how the colors and styles get mixed.

DALL·E 2 combines Van Gogh’s The Starry Night and a picture of two dogs. Credit: OpenAI
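One common way to move smoothly between two embeddings is spherical linear interpolation (slerp). The sketch below is only an illustration of that idea under stated assumptions: the two random vectors stand in for the encodings of the two pictures, and the `slerp` helper is written for this example rather than taken from DALL·E 2.

```python
import numpy as np

def slerp(start, end, t):
    """Spherical linear interpolation between two vectors, for t in [0, 1].

    Both inputs are normalized first, so the result is a unit vector.
    Exactly opposite vectors are not handled, since the path between
    them is ambiguous.
    """
    start = start / np.linalg.norm(start)
    end = end / np.linalg.norm(end)
    angle = np.arccos(np.clip(np.dot(start, end), -1.0, 1.0))
    if np.isclose(angle, 0.0):
        return start  # the two directions coincide; nothing to interpolate
    weight_start = np.sin((1 - t) * angle) / np.sin(angle)
    weight_end = np.sin(t * angle) / np.sin(angle)
    return weight_start * start + weight_end * end

# Assumption: random vectors stand in for the two image encodings.
rng = np.random.default_rng(1)
embedding_a = rng.normal(size=512)
embedding_b = rng.normal(size=512)

# Five evenly spaced steps from the first encoding to the second.
path = np.stack([slerp(embedding_a, embedding_b, t) for t in np.linspace(0.0, 1.0, 5)])
print(np.linalg.norm(path, axis=1))
```

Each row of `path` is a valid unit-length point between the two encodings. Decoding every row in turn would give a sequence of intermediate images like the one shown above.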

Concerns about DALL·E 2

After exploring the bright side of DALL·E 2, it’s time to talk about the other side of the coin: where DALL·E 2 struggles, what tasks it can’t solve, and what problems, harms, and risks it can give rise to.

  • Social aspects

As you may know by now, all language models of this size and larger are susceptible to bias, toxicity, stereotypes, and other behaviors that can harm or offend people, especially minorities who already face discrimination. Companies are getting more transparent about it, mainly due to pressure from AI ethics groups and from regulatory institutions that are now starting to catch up with technological progress.

  • Biases and stereotypes

DALL·E 2 tends to depict people and environments as white/western when the prompt is unspecific. It also engages in gender stereotypes (e.g. flight attendant=woman, builder=man). For example, when prompted with the following occupations, this is what the model outputs:

“A flight attendant.” Credit: OpenAI

“A builder.” Credit: OpenAI

This is what’s called representational bias. It occurs when models like DALL·E 2 or GPT-3 reinforce stereotypes present in the dataset, which reflects societal biases in one form or another (e.g. race, gender, nationality).

  • Harassment and bullying

This section refers to what we already know from deepfake technology. Deepfakes use GANs, a different deep learning technique from the one DALL·E 2 uses, but the problem is similar. People could use inpainting to add or remove objects or people, although OpenAI’s content policy prohibits it, and then threaten or harass others.

  • Explicit content

OpenAI’s violence content policy wouldn’t allow a prompt such as “a dead horse in a pool of blood,” but users could easily create a “visual synonym” with the prompt “A photo of a horse sleeping in a pool of red liquid,” as shown below. This could also happen unintentionally, producing what OpenAI calls “spurious content.”

“A photo of a horse sleeping in a pool of red liquid.” Credit: OpenAI

  • Disinformation

We tend to think of language models that generate text when thinking about misinformation, but visual deep learning technology can easily be used for “information operations and disinformation campaigns,” as OpenAI recognizes.

While deepfakes may work better for faces, DALL·E 2 could create believable scenarios of diverse nature. For instance, anyone could prompt DALL·E 2 to create images of burning buildings or people peacefully talking or walking with a famous building in the background. This could be used to mislead and misinform people about what’s truly happening at those places.

Smoke is inpainted in an image of the White House. Credit: OpenAI

  • Spelling

DALL·E 2 is great at drawing but horrible at spelling words. The reason may be that DALL·E 2 doesn’t encode spelling information from the text present in images in the dataset. If something isn’t represented in CLIP embeddings, DALL·E 2 can’t draw it correctly. When prompted with “a sign that says deep learning” DALL·E 2 outputs these:

“A sign that says deep learning.” Credit: OpenAI

However, it’s possible that if DALL·E 2 were trained to encode the words in the images, it’d be way better at this task.

Wrapping up

DALL·E 2 is a powerful, versatile creative tool, without a doubt. Multiskilled AI agents that can view the world and work with concepts across multiple modalities—like language and vision—are a step toward more generalized back-bone models. DALL·E 2 is one of the best examples yet.

So what could DALL·E 2 mean for the future of creativity? DALL·E 2 has the capacity to disrupt the work of designers, artists, photographers, and visual content creators. But it's not all bad. This tool might prove great for marketing teams, helping them find or create authentic images for blog posts, websites, ads, and other content, not to mention generating visual ideas and variations for logos and brand collateral.

If you’re a stock photo business, however, DALL·E 2 might be your worst enemy. Stock photos already have a reputation for being expensive and inauthentic, but they have been a necessity for many content creators. That changes the moment DALL·E 2 becomes available for commercial use. What justification would there be to pay for a stock photo license in a world where DALL·E 2 can create any image you want?

There’s no doubt the boundaries of AI’s role in creative endeavors will be pushed even further. DALL·E 2 is a milestone that will most likely be surpassed by an even more capable generative model in the future. And while it will never replace the human soul of creativity, generative AI models like DALL·E 2 can pave the way for a much more inclusive and authentic content creation process for creative teams and individuals.