A Primer Into Generative Text-to-Image Systems


Text-to-image systems are the hottest AI trend of 2022, and we expect them to make commercial inroads in the next year as more become publicly available.

If you have seen a TikTok, Twitter post or Reddit thread that claims “AI made this image”, it’s time to start believing them. Text-to-image generation is the hottest AI trend in 2022, with organizations such as OpenAI and Midjourney receiving funding even in a market downturn.

Text-to-image generators are one of the first consumer products in which AI is front and center of the experience. OpenAI’s launch of DALL-E, alongside Midjourney and the open-source Stable Diffusion, has created a huge amount of buzz for the technology, which is considered by some to be AI’s first “killer app”.



Progress in automated image captioning in 2015 and 2016 was the first major breakthrough in the development of text-to-image generators. With it, artificial intelligence models were able to describe images in natural language. That process can then be run in reverse, so an AI can take the sentence “a man eating pizza on the sidewalk” and produce an image that includes all of those elements.

In the next few years, text-to-image generators started to reach the periphery of conversation. Images developed by AI models were sold at auction houses such as Christie’s and Sotheby’s, in one case for hundreds of thousands of dollars, and there were several marketing campaigns by OpenAI, Google, and others demonstrating the progress of these systems in generating realistic and unique imagery.

However, it was not until 2021 that the scope of this progress was fully recognized, with the introduction of DALL-E, which was demoed to various journalists and industry professionals. Even though it would be another year before OpenAI made it available to the world, the buzz about how sophisticated and futuristic the technology felt pushed it to the forefront of discussion.

The generators and their creators

DALL-E, developed by OpenAI, is the best known of the text-to-image generators. It was built using a modified version of GPT-3 that generates images instead of text. GPT-3 was already considered the leading text generator, with testers often unable to tell whether a passage was written by an AI or a human. OpenAI announced DALL-E 2 in April 2022, and in September 2022 made the program available to anyone who signed up.

Google has also been building a text-to-image generator known as Imagen, developed by the Google Brain team. Like OpenAI, it has launched an app that lets users create unique images using the AI, but it has restricted user creativity to a few functions. Users can currently build either isometric cities or monsters. Part of the reason for this restriction is Google’s focus on avoiding the controversies that could spring up if users were able to design anything they want, although OpenAI has also implemented a few safeguards to avoid this.

Instead of images, Facebook’s parent company Meta is focusing its energy on a text-to-video generator called Make-A-Video. The implications of this are even further reaching than those of image generation, as video spreads misinformation more effectively and is the dominant form of media on the web. Meta is also not making the system available to the wider public.

The resources needed to create a highly sophisticated text-to-image generator are massive, and the work requires a lot of AI research. The biggest tech companies seem poised to lead the pack simply because of the scale needed to make a leading-edge product. However, in the text-to-image space, a few smaller players are making meaningful leaps in the sophistication of their technology and potentially challenging the leaders.

One of these is Midjourney, an independent research lab founded by David Holz, the co-founder of Leap Motion. The app is currently only accessible through Discord, an online messaging service, where users create images by direct messaging a bot, which returns an image. Midjourney runs the service on a freemium model: it is free for most users, but there is a paid tier for those who want faster access, more capacity, and access to the newest features.

Stability AI is the other independent lab aiming to take on the biggest tech companies. Its model, Stable Diffusion, has been built as an open-source application, able to run on most consumer hardware with a decent GPU. Open-source libraries give everyone access to the technology, which also makes it far more susceptible to misuse by bad actors. Stability AI believes that by working with the community, it can push best practices for generative systems.

There are more tools out there, including those in private research labs that we won’t know about until they make a public announcement. As the technology gains a larger audience, expect more technology companies to look into building applications or features in apps utilizing it. 

The economics of image generators 

Unlike some artificial intelligence technologies, which make their money by being incorporated into other products such as analytical tools, speech recognition, or customer support, text-to-image generators look to make their income through premium features. Midjourney offers a paid-for service that enables faster access and more capacity. OpenAI’s DALL-E API lets businesses pay per image generated, with a higher cost for higher-resolution images.

It’s easy to see how these services could become profitable, and Midjourney reportedly already is. An app that uses DALL-E as its underlying image generator could pay millions to the service, and then shift the cost to the end user, who would pay for each image generated or pay a subscription fee for a certain number of generated images a month. This is similar to the business model already employed by Getty and other stock image and editorial photography companies.
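
To make the pass-through model above concrete, here is a minimal sketch in Python. Every name and figure in it is hypothetical: `PricingPlan`, the per-image API cost, and the markup are illustrative assumptions, not actual OpenAI or Midjourney prices.

```python
from dataclasses import dataclass


@dataclass
class PricingPlan:
    """Hypothetical pricing for an app that resells generated images."""

    api_cost_per_image: float  # what the app pays the underlying generator (assumed)
    markup: float              # fractional markup the app adds, e.g. 0.5 for 50%

    def price_per_image(self) -> float:
        # Pay-per-image model: the API cost is passed on with a markup.
        return self.api_cost_per_image * (1 + self.markup)

    def subscription_price(self, images_per_month: int) -> float:
        # Subscription model: a monthly fee covering a fixed image allowance.
        return self.price_per_image() * images_per_month

    def monthly_margin(self, images_generated: int) -> float:
        # What the app keeps after paying the generator for every image.
        return (self.price_per_image() - self.api_cost_per_image) * images_generated


# Illustrative numbers only.
plan = PricingPlan(api_cost_per_image=0.02, markup=0.5)
print(f"Price per image: ${plan.price_per_image():.3f}")
print(f"100-image subscription: ${plan.subscription_price(100):.2f}")
print(f"Margin on 1,000 images: ${plan.monthly_margin(1000):.2f}")
```

Under these assumptions the app’s margin scales directly with usage, which is why the per-image and subscription models both resemble stock image licensing.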


As generative systems expand to video and other forms of media, there is room to charge even more for access. Google recently published a research post on Infinite Nature, which generates an animated 3D flythrough from a single still image and could be used by video game developers and animation producers.

Ethical considerations of image generators 

The future looks bright, but as with most artificial intelligence technologies, businesses need to look at the ethics of releasing this technology to the public and the effect it could have on people. As we saw with self-driving vehicles, which were predicted to be road ready by 2020, there are a lot of ethical considerations and edge cases that need to be figured out before the public can access it.

Image generators are not 3,000-pound vehicles, however, so while researchers still need to look into every possible angle for misuse, there is a far lower risk of the technology leading to death. Most ethical concerns about text-to-image generators have been on the commercial side, with designers, visual media copyright holders, and others in the industry worried that these systems will lead to a decline in their workload and value.

Such a decline is to be expected, especially in areas such as vector image creation and infographic design. Trained on libraries of billions of images, these systems will almost certainly have drawn on accurate images of places, people, and events, although Getty and others are fighting against the use of their copyrighted material without a license fee first being paid.

Designers and artists may learn to love these generative systems more than the average user, however, and some of the research is being helped along by artists. Expanding their creativity through AI could be a way to break out of writer’s, or painter’s, block.

The future of generative systems 

Generative systems, with text-to-image at the forefront, have a bright future. Unlike a lot of AI technologies, they are not five or ten years away from becoming commercially accessible, but one or two. Google, Meta, and a few others are still taking baby steps into the commercial world, clearly focused on ensuring they don’t face a PR scandal, but it won’t be long until this technology is available on its own or incorporated into other products, such as Google Search.

Text-to-video is the next layer of this generative tech, but it may take longer to become available because of the many implications of generative video. Expect Meta and others to take longer to iron out edge cases in which these systems can be used to create misinformation or racist content, and even longer to commercialize them.

Looking further ahead, there could be a time when entire worlds are built off the back of generative systems. Imagine asking a system to build a Tolkien-style world with characters from the Agatha Christie universe, or New York with Japanese-style architecture. Embed this into a virtual reality system, and you are one step closer to the promised “metaverse” of Mark Zuckerberg’s dreams.

David Curry

About David Curry

David is a technology writer with several years’ experience covering all aspects of IoT, from technology to networks to security.
