
NUS Open-Sources Multi-Modal Language Model, NExT-GPT



NExT-GPT’s release offers developers a powerful multi-modal language model that can handle diverse inputs and outputs, paving the way for more sophisticated AI applications across different media types.

Nov 29, 2023

The NExT Research Center at the National University of Singapore (NUS) has unveiled NExT-GPT, an open-source multi-modal large language model (LLM) designed to process text, images, videos, and audio interchangeably. The model can accept various types of input and generate responses in different formats, making it a versatile AI agent.

Multi-modal capabilities

NExT-GPT offers a chat-based interface that enables users to input text, images, videos, or audio files. The model can understand and respond to these inputs, answering questions or generating content accordingly. This multi-modal AI system combines pre-trained encoders and decoders, including Vicuna and Stable Diffusion, with trainable neural network layers in between. These intermediary layers are trained using a novel technique developed by the NExT team called Modality-switching Instruction Tuning (MosIT).



Architecture and training

NExT-GPT’s architecture has three tiers: an encoding stage with linear projections, a Vicuna LLM core responsible for generating tokens (including signals for output modalities), and a decoding stage with modality-specific transformer layers and decoders. Notably, most of the model’s parameters, including encoders, decoders, and the Vicuna model, remain frozen during training, with only about 1% being updated. This approach helps reduce training costs while maintaining performance.
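The parameter-efficiency claim can be illustrated with a back-of-the-envelope sketch. The component names and sizes below are illustrative assumptions, not NExT-GPT's actual figures; the point is simply that when the encoders, decoders, and LLM core are frozen, the trainable projection layers between tiers account for only a small fraction of the total.

```python
# Hypothetical parameter budget illustrating a frozen-backbone training
# setup like NExT-GPT's. All component sizes are illustrative assumptions.
components = {
    # name: (parameter count, trainable?)
    "image_encoder":      (1_000_000_000, False),  # frozen encoder
    "audio_encoder":      (300_000_000,   False),  # frozen encoder
    "video_encoder":      (1_000_000_000, False),  # frozen encoder
    "vicuna_llm":         (7_000_000_000, False),  # frozen LLM core
    "image_decoder":      (900_000_000,   False),  # frozen decoder (e.g. diffusion)
    "input_projections":  (30_000_000,    True),   # trainable linear projections
    "output_projections": (70_000_000,    True),   # trainable modality layers
}

total = sum(n for n, _ in components.values())
trainable = sum(n for n, is_trainable in components.values() if is_trainable)
fraction = trainable / total
print(f"trainable fraction: {fraction:.1%}")  # about 1% under these assumptions
```

Freezing the large pre-trained components means gradients and optimizer state are only kept for the small projection layers, which is what keeps training costs low.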

The model was trained via instruction tuning on a dataset of roughly 5,000 example dialogues between human users and chatbots, covering scenarios that involve multiple modalities in both input and output.
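The article does not specify the exact schema of the MosIT training data, but a plausible record for one such multi-modal dialogue might look like the following; the field names and file paths are hypothetical.

```python
# A hypothetical MosIT-style training record; the actual dataset schema
# used by the NExT team is not given in the article.
example_dialogue = {
    "turns": [
        {"role": "user",
         "content": "Here is a photo of my dog. Can you make a short "
                    "video of it running?",
         "inputs": [{"modality": "image", "path": "dog.jpg"}]},
        {"role": "assistant",
         "content": "Sure, here is a video of the dog running. <VID>",
         "outputs": [{"modality": "video",
                      "caption": "a dog running in a park"}]},
    ]
}

# Instruction tuning would iterate over ~5,000 such dialogues, updating
# only the projection layers between the encoders, the LLM, and the decoders.
modalities = {io["modality"]
              for turn in example_dialogue["turns"]
              for io in turn.get("inputs", []) + turn.get("outputs", [])}
print(sorted(modalities))  # ['image', 'video']
```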


Performance and evaluation

NExT-GPT was evaluated on various multi-modal generation benchmarks, demonstrating competitive results compared to baseline models. Human judges also rated the model’s output in different scenarios, with image generation scenarios receiving higher scores than video and audio.

The model’s unique feature is its ability to generate modality-signaling tokens when users request specific types of content, such as images, videos, or sounds. These tokens were pre-defined and included in the vocabulary of the LLM during training.
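The routing behavior these signal tokens enable can be sketched as follows. The token strings and the dispatch logic here are hypothetical, chosen only to show how special tokens in the LLM's generated output could trigger the appropriate modality-specific decoder downstream.

```python
# Hypothetical modality-signal tokens; NExT-GPT's actual token strings
# and decoder interfaces are not given in the article.
SIGNAL_TOKENS = {"<IMG>": "image", "<VID>": "video", "<AUD>": "audio"}

def route_output(generated_tokens):
    """Split LLM output into plain text and the modalities it signals.

    Signal tokens are pre-defined entries in the LLM's vocabulary; when
    one appears in the output, the corresponding decoder would be
    invoked by the decoding stage.
    """
    text_parts, requested = [], []
    for tok in generated_tokens:
        if tok in SIGNAL_TOKENS:
            requested.append(SIGNAL_TOKENS[tok])
        else:
            text_parts.append(tok)
    return " ".join(text_parts), requested

text, modalities = route_output(
    ["Here", "is", "the", "sunset", "you", "asked", "for:", "<IMG>"])
print(text)        # Here is the sunset you asked for:
print(modalities)  # ['image']
```

Because the signal tokens live in the vocabulary itself, the LLM can learn during instruction tuning when a user request calls for a non-text output.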

NExT-GPT’s open-source release is a significant contribution to multi-modal AI, enabling developers to build applications that seamlessly integrate text, images, videos, and audio. Potential use cases span content generation, multimedia analysis, and virtual assistants that can understand and respond to user requests in whatever format users prefer.

Elizabeth Wallace

Elizabeth Wallace is a Nashville-based freelance writer with a soft spot for data science and AI and a background in linguistics. She spent 13 years teaching language in higher ed and now helps startups and other organizations explain, clearly, what it is they do.
