
DALL-E: A Stepping Stone to Artificial General Intelligence



DALL-E is already proving to be a stepping stone in AI research. Its novelty lies in the way it was trained – with both text and vision stimuli – unlocking promising new directions of research in a re-emerging field called multimodal AI.

Written by Sahar Mor
May 27, 2021

Last January, OpenAI published its latest model, DALL-E: a 12-billion-parameter version of the natural language processing model GPT-3 that generates images rather than text.

Recently, DALL-E’s creators have been signaling that they will soon open access to DALL-E’s API. This API release will undeniably produce various innovative applications. But even before that, DALL-E is already proving to be a stepping stone in AI research. Its novelty lies in the way it was trained – with both text and vision stimuli – unlocking promising new directions of research in a re-emerging field called multimodal AI.

Advanced neural network architecture 

DALL-E is built on the transformer architecture, an easy-to-parallelize type of neural network that can scale up and be trained on enormous datasets. The model receives both the text and the image as a single stream of tokens and is trained using maximum likelihood to generate all of the subsequent tokens, one after another. To train it, OpenAI created a dataset of 400 million image-text pairs of unlabeled data collected from the internet.
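The training objective above can be sketched in a few lines. This is a hypothetical toy illustration, not OpenAI's code: a caption and its image are concatenated into one token stream, and the loss is the average negative log-likelihood of each token given everything before it (the `VOCAB` size, token ids, and random "model" logits below are all made up for demonstration).

```python
import numpy as np

VOCAB = 16  # tiny joint vocabulary of text + image tokens (illustrative)

def next_token_loss(logits, stream):
    """Average negative log-likelihood of each next token given its prefix.

    logits: (len(stream) - 1, VOCAB) scores predicting tokens 1..end
    stream: 1-D array of token ids (text tokens followed by image tokens)
    """
    targets = stream[1:]
    # numerically stable softmax over the vocabulary
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    # probability the model assigned to each true next token
    return -np.log(probs[np.arange(len(targets)), targets]).mean()

# caption tokens (e.g., "a red apple") followed by image tokens
text_tokens = np.array([3, 7, 1])
image_tokens = np.array([9, 9, 12, 4])
stream = np.concatenate([text_tokens, image_tokens])

# stand-in for a transformer's output: random next-token scores
rng = np.random.default_rng(0)
logits = rng.normal(size=(len(stream) - 1, VOCAB))
loss = next_token_loss(logits, stream)
print(round(loss, 3))
```

Minimizing this loss over millions of caption-image pairs is what teaches the model to continue a text prefix with plausible image tokens.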

The model demonstrated impressive capabilities. For example, based on the following textual prompt, it was able to control the viewpoint of a scene and the 3D style in which it was rendered:

Source: OpenAI Blog

In another example, it generated objects synthesized from a variety of unrelated ideas while respecting the shape and form of the object being designed:

Source: OpenAI Blog

These images are 100% synthetic. Unfortunately, given the lack of a standardized benchmark to measure its performance, it’s hard to quantify how successful this model is in comparison to previous GANs and future image generation models.


Not there, yet

DALL-E generates 512 candidate images for each textual prompt. These results are then ranked by another OpenAI model called CLIP. In short, CLIP can automatically describe images based on their content. It also takes into account textual features, similar to how we humans do. Hence, if it sees an image of an apple with the written word ‘Orange,’ it might label it as an orange even though it’s an apple.
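The generate-then-rerank step can be sketched as follows. This is a toy stand-in, not the real CLIP: `embed` here is a fake deterministic "encoder" that maps any string to a unit vector, while real CLIP uses learned image and text encoders. The reranking logic itself (score every candidate against the prompt by cosine similarity, keep the best) mirrors how CLIP is used to filter DALL-E's outputs.

```python
import zlib
import numpy as np

def embed(x, dim=4):
    """Toy stand-in for a CLIP encoder: hash any string to a unit vector.

    Real CLIP uses separate learned image and text encoders that project
    into a shared embedding space; this fake is just for the rerank demo.
    """
    rng = np.random.default_rng(zlib.crc32(x.encode()))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def rerank(prompt, candidates):
    """Sort candidate images by cosine similarity to the prompt, best first."""
    t = embed(prompt)
    scored = [(float(embed(c) @ t), c) for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)]

# DALL-E would propose many candidates; CLIP-style scoring picks the winners
candidates = ["candidate_1", "candidate_2", "candidate_3"]
ranked = rerank("a bench in the shape of a peach", candidates)
print(ranked[0])  # highest-scoring candidate under the toy encoder
```

Because both encoders map into one shared space, the same cosine-similarity trick also lets CLIP act as a zero-shot classifier, which is exactly why written words inside an image can sway its judgment.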

For the following prompt of a “bench in the shape of a peach,” DALL-E generated a picture of a bench with the word peach (top-left corner):

Source: OpenAI Blog

This shows that multimodal models such as DALL-E are still prone to bias and typographic attacks. Both topics were explored in OpenAI’s follow-up blog post, which described the promises and shortcomings of this technology.


Human-like intelligence

Today there is a shared understanding in the AI community that using narrow AI will not get us to human-like performance across different domains. For example, even the state-of-the-art deep learning model for early-stage cancer detection (vision) is limited in its performance when it is missing a patient’s charts (text) from her electronic health records.

This perception is becoming increasingly popular, with O’Reilly’s recent Radar Trends report marking multimodality as the next step in AI, and other domain experts such as Jeff Dean (Google AI SVP) sharing a similar view.

On the promise of combining language and vision, OpenAI’s Chief Scientist Ilya Sutskever stated that in 2021 OpenAI would strive to build models and expose them to new stimuli: “Text alone can express a great deal of information about the world, but it’s incomplete because we live in a visual world as well.” He added, “This ability to process text and images together should make models smarter. Humans are exposed to not only what they read but also what they see and hear. If you can expose models to data similar to those absorbed by humans, they should learn concepts in a way that’s more similar to humans.”

Multimodality has been explored in the past and has been picking up interest once again in the last several years, with promising results such as Facebook AI Research’s recent paper in the field of automatic speech recognition (ASR), which showcased major progress by combining audio and text.


Final thoughts

Less than a year since GPT-3’s release, OpenAI is about to release DALL-E as its next state-of-the-art API. This, along with the GPT and CLIP models, brings OpenAI one step closer to its promise of building sustainable and safe artificial general intelligence (AGI). It also opens endless new streams of research surpassing previous levels of AI performance.

One thing is certain: once released, we can all expect our Twitter feeds to be filled with mind-blowing applications, this time with artificially generated images rather than artificially generated text.

Sahar Mor

Sahar Mor has 12 years of engineering and product management experience, both focused on products with AI at their core. He previously worked as an engineering manager in early-stage startups and at the elite Israeli intelligence unit 8200. Currently, he is the founder of AirPaper, a document intelligence API powered by GPT-3. He was a founding Product Manager at Zeitgold, a B2B AI accounting software company. After Zeitgold, he joined Levity.ai, a no-code AutoML platform providing models for image, document, and text tasks, as a founding PM/engineer.
