Is Gato the Generalist AI We’ve Been Waiting For (or Fearing)?


Has Gato taken us one big step closer to the “holy grail” of artificial intelligence: an artificial general intelligence (AGI), which can make decisions and return results on hundreds, or thousands, of different tasks?

That’s the goal of the Gato project from DeepMind, which builds on what the AI industry has learned from specialist AIs like GPT-3 or the Go-playing AlphaGo. These applications are trained on a specific type of data, using deep neural networks with up to billions of parameters to understand context-specific inputs, which makes them enormously effective in a narrow domain.

DeepMind says Gato “works as a multi-modal, multi-task, multi-embodiment generalist policy,” which can perform 604 varied and complex tasks, including playing Atari games, chatting with people, captioning images, and stacking blocks with a robotic arm. The researchers are the first to admit that Gato isn’t an expert in any of these tasks, a big contrast to AlphaGo and AlphaZero, which outplay the best human Go players in the world, but the breakthrough is in the unprecedented diversity of its abilities.

Gato: A different kind of ‘transformer’

Like a large language model, Gato is built on a transformer neural network architecture. Unlike specialist AIs, it uses a single neural network, with the same weights, across all its possible tasks. A transformer model learns context and meaning by tracking the relationships between pieces of data in a sequence of “tokens.”

It’s like learning the context and meaning of a sentence by analyzing the relationships among its words in sequence, or piecing together a complex motion by viewing a sequence of images. It might seem like an obvious solution to training AIs, but transformer models are a relatively new approach.
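To make "tracking the relationships between tokens" a little more concrete, here is a minimal sketch of self-attention, the core operation in a transformer, in plain Python. It is a simplification and not DeepMind's code: real transformers first apply learned query, key, and value projections, which this toy version skips, and the three example vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw similarity scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Mix each token vector with the others, weighted by similarity.

    Assumption: the raw vectors act as queries, keys, and values at once.
    A real transformer learns separate projections for each role.
    """
    dim = len(tokens[0])
    mixed = []
    for query in tokens:
        # Scaled dot product of this token with every token in the sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        # Weighted average of all token vectors, one component at a time.
        mixed.append([
            sum(w * tok[i] for w, tok in zip(weights, tokens))
            for i in range(dim)
        ])
    return mixed

# Three toy 2-d "token" vectors; the first two point in similar directions.
sequence = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
contextual = self_attention(sequence)
```

Each output vector is a blend of the whole sequence, so the first token ends up leaning toward its near-twin rather than toward the unrelated third token. That blending is what lets the model relate each token to its context.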

The team first collected many different types of input data and turned them into flat sequences of tokens, which were then batched and processed by the transformer. This training structure allows Gato to ingest any data that can be converted into a flat sequence, including Atari images paired with discrete or continuous actions, text, proprioception, images, questions, or any combination of the above.
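As a rough illustration of what "turning data into a flat sequence of tokens" can look like, the sketch below maps words, continuous sensor readings, and discrete actions into one list of integers. Every constant, function name, and ID range here is a hypothetical choice made for the example; the article does not describe Gato's actual tokenization scheme.

```python
# All constants are illustrative assumptions, not values from the article.
TEXT_VOCAB_SIZE = 1000  # hypothetical size of the text vocabulary
NUM_BINS = 256          # hypothetical number of bins for continuous values

def tokenize_text(words, vocab):
    """Map each word to an integer ID from a toy vocabulary."""
    return [vocab[word] for word in words]

def tokenize_continuous(values, low=-1.0, high=1.0):
    """Bucket continuous readings into bins, offset past the text IDs."""
    tokens = []
    for value in values:
        clipped = min(max(value, low), high)
        bin_index = int((clipped - low) / (high - low) * (NUM_BINS - 1))
        tokens.append(TEXT_VOCAB_SIZE + bin_index)
    return tokens

def tokenize_discrete(actions):
    """Give discrete actions (e.g. button presses) their own ID range."""
    return [TEXT_VOCAB_SIZE + NUM_BINS + action for action in actions]

def flatten_episode(words, vocab, sensor_readings, actions):
    """Concatenate every modality into one flat token sequence."""
    return (
        tokenize_text(words, vocab)
        + tokenize_continuous(sensor_readings)
        + tokenize_discrete(actions)
    )

toy_vocab = {"stack": 0, "the": 1, "red": 2, "block": 3}
sequence = flatten_episode(
    ["stack", "the", "red", "block"], toy_vocab, [-1.0, 0.0, 1.0], [2]
)
print(sequence)  # [0, 1, 2, 3, 1000, 1127, 1255, 1258]
```

Because each modality lives in its own, non-overlapping ID range, a single model can consume the combined list without needing a separate input pathway per data type.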

DeepMind says that Gato can decide “based on its context whether to output text, joint torques, button presses, or other tokens.”
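One way to picture this: the model only ever emits token IDs, and the meaning of each ID depends on which range it falls in. The sketch below decodes a token into text, a joint torque, or a button press. The ranges and names are hypothetical, chosen for the example, and are not taken from DeepMind's description of Gato.

```python
# Hypothetical ID ranges; the article does not specify Gato's token layout.
TEXT_RANGE = range(0, 1000)
TORQUE_RANGE = range(1000, 1256)
BUTTON_RANGE = range(1256, 1274)

def decode_token(token_id):
    """Interpret a single output token according to its ID range."""
    if token_id in TEXT_RANGE:
        return ("text", token_id)
    if token_id in TORQUE_RANGE:
        # Map the bin index back to a continuous value in [-1, 1].
        bin_index = token_id - TORQUE_RANGE.start
        value = -1.0 + 2.0 * bin_index / (len(TORQUE_RANGE) - 1)
        return ("joint_torque", value)
    if token_id in BUTTON_RANGE:
        return ("button_press", token_id - BUTTON_RANGE.start)
    raise ValueError(f"token {token_id} is outside every known range")

for token in (3, 1255, 1258):
    print(decode_token(token))
```

In this picture, "deciding what to output" is nothing more exotic than predicting the next token; the surrounding context is what steers the prediction toward a word, a torque, or a button.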

And beyond being capable of 604 tasks, an already impressive figure, the DeepMind team also reports that Gato reaches at least 50% of an expert’s score on more than 450 of them.

It’s doing all this complex work at a far smaller scale than the specialist AIs. Gato has about 1.2 billion parameters, far fewer than the hundreds of billions of parameters in language models like GPT-3.

A big step toward AGI?

While some AI experts remain skeptical that artificial general intelligence is even possible, as a whole, the industry seems bullish about the potential of Gato and other transformer-based AI systems.

Metaculus, a website for community-driven forecasts on a range of technology issues, dramatically moved up its prediction for when weakly general AI will be publicly known, from June 2033 to January 2027. By its definition, such a system succeeds when it’s integrated enough that it can explain its reasoning on an SAT problem or verbally report progress and identify objects while playing a video game. Given that Gato can both play Atari games and return text output, it’s reasonable to assume it could be capable of doing both simultaneously in a few short years.

But Gato’s transformer-based architecture means it faces some tough headwinds. Computer scientist Atlas Wang of the University of Texas at Austin told Quanta Magazine a few months ago that transformers require more computational power in the pre-training phase, which means more work up front before their results become viable.

Transformers also have shorter “memories” about the data they’re being trained on. The longer the sequence of tokens, the more likely Gato is to lose the plot, so to speak, on what it’s observing, learning, and trying to accomplish. For now, Gato works on simple processes, like captioning an image, but it will struggle to write more than a single coherent paragraph. Gato’s current context window of 1,024 tokens limits its observation and decision-making capabilities.
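The effect of a fixed context window can be sketched in a few lines of Python. The example assumes the simplest possible policy, dropping the oldest tokens once the limit is reached; the article does not say how Gato itself handles sequences longer than its window, and the function name is made up for illustration.

```python
CONTEXT_WINDOW = 1024  # token limit cited in the article

def visible_context(history, window=CONTEXT_WINDOW):
    """Return only the most recent tokens that fit in the window.

    Assumption: older tokens are simply discarded.
    """
    if window <= 0:
        raise ValueError("window must be a positive number of tokens")
    return history[-window:]

# A toy episode of 1,500 tokens: the first 476 fall out of view.
history = list(range(1500))
context = visible_context(history)
print(len(context), context[0], context[-1])  # 1024 476 1499
```

Anything that happened before the window starts, such as an instruction given early in a long episode, is invisible to the model when it chooses its next output.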

The DeepMind team also recognizes that risk mitigation for AGI isn’t nearly as well defined as it is for specialist AIs. AGIs are more likely to be anthropomorphized by their users, who might treat them like living creatures and put misplaced trust in their output. Gato’s knowledge about one domain might also create “unexpected or undesired outcomes” when transferred to another context.

But the team is also looking forward to improved performance via better hardware and network architectures, plus larger-scale computing power to handle more parameters, which would let them train bigger versions of Gato that still make decisions in real time. AI has transformed enormously in the last five years (GPT-2, GPT-3, and AlphaZero have all been released since 2017), making that 2027 prediction feel more like a done deal.

Joel Hans

About Joel Hans

Joel Hans is a copywriter and technical content creator for open source, B2B, and SaaS companies at Commit Copy, bringing experience in infrastructure monitoring, time-series databases, blockchain, streaming analytics, and more. Find him on Twitter @joelhans.
