DALL-E: A Stepping Stone to Artificial General Intelligence


Last January, OpenAI published its latest model, DALL-E, a 12-billion-parameter version of the natural language processing model GPT-3 that generates images rather than text.

Recently, DALL-E’s creators have signaled that they will soon open access to DALL-E’s API. This release will likely produce a range of innovative applications. But even before that, DALL-E is already proving to be a stepping stone in AI research. Its novelty lies in the way it was trained, with both text and vision stimuli, which unlocks promising new directions of research in a re-emerging field called multimodal AI.

Advanced neural network architecture 

DALL-E is built on the transformer architecture, an easy-to-parallelize type of neural network that can scale up and be trained on enormous datasets. The model receives both the text and the image as a single stream of tokens and is trained using maximum likelihood to generate those tokens one after another. The image tokens are not raw pixels: each one is a discrete code that stands for a small patch of the image. To train it, OpenAI created a dataset of 250 million image-text pairs collected from the internet.
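The single-stream idea can be illustrated with a minimal Python sketch. It is not OpenAI's code. The vocabulary and sequence sizes are assumptions taken from OpenAI's published description of DALL-E rather than from this article, and `PAD_TOKEN`, `build_stream`, `negative_log_likelihood`, and `uniform_model` are hypothetical names. A uniform placeholder stands in for the transformer, so the sketch only shows how the sequence is assembled and how the maximum-likelihood objective is computed over it.

```python
import math

# Sizes below are assumptions taken from OpenAI's published description of
# DALL-E, not from this article.
TEXT_VOCAB = 16384     # number of distinct text tokens
IMAGE_VOCAB = 8192     # number of distinct image codes
MAX_TEXT_TOKENS = 256  # caption length after padding
IMAGE_TOKENS = 1024    # a 32 x 32 grid of image codes
PAD_TOKEN = 0          # hypothetical padding id for short captions


def build_stream(text_tokens, image_tokens):
    """Join caption tokens and image tokens into one sequence."""
    if len(text_tokens) > MAX_TEXT_TOKENS:
        raise ValueError("caption is too long")
    if len(image_tokens) != IMAGE_TOKENS:
        raise ValueError("expected exactly 1024 image tokens")
    padded = list(text_tokens) + [PAD_TOKEN] * (MAX_TEXT_TOKENS - len(text_tokens))
    # Shift image codes so they do not collide with text token ids.
    shifted = [TEXT_VOCAB + code for code in image_tokens]
    return padded + shifted


def negative_log_likelihood(stream, next_token_prob):
    """Sum of -log p(token | previous tokens) over the whole stream."""
    total = 0.0
    for i, token in enumerate(stream):
        total -= math.log(next_token_prob(stream[:i], token))
    return total


def uniform_model(prefix, token):
    """Placeholder for the transformer: every token is equally likely."""
    return 1.0 / (TEXT_VOCAB + IMAGE_VOCAB)


caption = [5, 17, 42]                                   # made-up token ids
image = [i % IMAGE_VOCAB for i in range(IMAGE_TOKENS)]  # made-up image codes
stream = build_stream(caption, image)
nll = negative_log_likelihood(stream, uniform_model)
print(len(stream))  # 1280
print(nll)
```

Training by maximum likelihood means adjusting the model so that this sum becomes as small as possible across the dataset. A real model would replace `uniform_model` with a network that predicts each token from the tokens before it.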

The model demonstrated impressive capabilities. For example, based on the following textual prompt, it was able to control the viewpoint of a scene and the 3D style in which it was rendered:

Source: OpenAI Blog

In another example, it generated objects that combine a variety of unrelated ideas while respecting the shape and form of the object being designed:

Source: OpenAI Blog

These images are entirely synthetic. Unfortunately, without a standardized benchmark to measure its performance, it’s hard to quantify how well this model performs compared with earlier GANs and future image generation models.

Not there, yet

For each textual prompt, DALL-E generates 512 candidate images. These candidates are then ranked by another OpenAI model called CLIP, and the highest-ranked ones are kept. In short, CLIP scores how well an image matches a piece of text, which lets it label images based on their content. It also picks up on text that appears inside the image, much as humans do. Hence, if it sees an image of an apple with the written word ‘Orange,’ it might label it as an orange even though it’s an apple.
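The ranking step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not CLIP itself: the embeddings are hand-written toy vectors standing in for the output of learned text and image encoders, cosine similarity is assumed as the matching score, and `cosine_similarity` and `rerank` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rerank(text_embedding, image_embeddings, top_k):
    """Return the indices of the top_k images that best match the text."""
    scored = [
        (cosine_similarity(text_embedding, emb), index)
        for index, emb in enumerate(image_embeddings)
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [index for _, index in scored[:top_k]]


# Toy stand-ins for encoder outputs (assumed values, for illustration only).
prompt_embedding = [1.0, 0.0]
candidate_embeddings = [
    [0.0, 1.0],   # unrelated to the prompt
    [1.0, 0.1],   # close match
    [0.5, 0.5],   # partial match
    [-1.0, 0.0],  # opposite of the prompt
]
best = rerank(prompt_embedding, candidate_embeddings, top_k=2)
print(best)  # [1, 2]
```

Because the score depends only on how close the two embeddings are, anything that pulls an image embedding toward the prompt raises its rank. That includes text that happens to appear inside the image.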

For the following prompt of a “bench in the shape of a peach,” DALL-E generated a picture of a bench with the word peach (top-left corner):

Source: OpenAI Blog

This shows that multimodal models such as DALL-E are still prone to bias and typographic attacks. Both topics were explored in OpenAI’s follow-up blog post, which described the promises and shortcomings of this technology.

Human-like intelligence

Today there is a shared understanding in the AI community that using narrow AI will not get us to human-like performance across different domains. For example, even the state-of-the-art deep learning model for early-stage cancer detection (vision) is limited in its performance when it is missing a patient’s charts (text) from her electronic health records.

This view is becoming increasingly popular, with O’Reilly’s recent Radar Trends report marking multimodality as the next step in AI and other domain experts such as Jeff Dean (Google AI SVP) sharing a similar view.

On the promise of combining language and vision, OpenAI’s Chief Scientist Ilya Sutskever stated that in 2021 OpenAI would strive to build and expose models to new stimuli: “Text alone can express a great deal of information about the world, but it’s incomplete because we live in a visual world as well.” He added, “This ability to process text and images together should make models smarter. Humans are exposed to not only what they read but also what they see and hear. If you can expose models to data similar to those absorbed by humans, they should learn concepts in a way that’s more similar to humans.”

Multimodality has been explored in the past and has picked up interest again in the last several years, with promising results such as a recent paper from Facebook’s AI Research lab on automatic speech recognition (ASR), which showcased major progress by combining audio and text.

Final thoughts

Less than a year after GPT-3’s release, OpenAI is about to release DALL-E as its next state-of-the-art API. Together with the GPT and CLIP models, this brings OpenAI one step closer to its promise of building sustainable and safe artificial general intelligence (AGI). It also opens new lines of research that could surpass previous levels of AI performance.

One thing is certain: once DALL-E is released, we can all expect our Twitter feeds to fill with mind-blowing applications, this time with artificially generated images rather than artificially generated text.

Sahar Mor

About Sahar Mor

Sahar Mor has 12 years of engineering and product management experience, both focused on products with AI at their core. He previously worked as an engineering manager at early-stage startups and in the elite Israeli intelligence unit 8200. Currently, he is the founder of AirPaper, a document intelligence API powered by GPT-3. He was a founding Product Manager at Zeitgold, a B2B AI accounting software company. After Zeitgold, he joined Levity.ai, a No-Code AutoML platform providing models for image, document, and text tasks, as a founding PM/engineer.
