
NUS Open-Sources Multi-Modal Language Model, NExT-GPT


NExT-GPT’s release offers developers a powerful multi-modal language model that can handle diverse inputs and outputs, paving the way for more sophisticated AI applications across different media types.

Nov 29, 2023

The NExT Research Center at the National University of Singapore (NUS) has unveiled NExT-GPT, an open-source multi-modal large language model (LLM) that accepts any combination of text, images, videos, and audio as input and can generate responses in any of those formats, making it a versatile AI agent.

Multi-modal capabilities

NExT-GPT offers a chat-based interface that lets users submit text, images, videos, or audio files. The model can understand and respond to these inputs, answering questions or generating content accordingly. The system combines pre-trained components, such as the Vicuna LLM and the Stable Diffusion image decoder, with trainable neural network layers in between. These intermediary layers are trained using a technique developed by the NExT team called Modality-switching Instruction Tuning (MosIT).

See also: How to Attract LLM Developers Amidst the AI Boom


Architecture and training

NExT-GPT’s architecture has three tiers: an encoding stage with linear projections, a Vicuna LLM core responsible for generating tokens (including signals for output modalities), and a decoding stage with modality-specific transformer layers and decoders. Notably, most of the model’s parameters, including encoders, decoders, and the Vicuna model, remain frozen during training, with only about 1% being updated. This approach helps reduce training costs while maintaining performance.
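The split between frozen and trainable parameters described above can be sketched in PyTorch. The components and dimensions below are illustrative stand-ins (simple linear layers), not the actual encoders or the Vicuna model; the point is that only the small projection layers receive gradient updates.

```python
import torch.nn as nn

# Hypothetical sketch of NExT-GPT's parameter-efficient training setup:
# frozen pre-trained components with small trainable projection layers.
# Module choices and dimensions here are stand-ins, not the real models.

def freeze(module: nn.Module) -> None:
    """Mark all of a module's parameters as frozen (no gradient updates)."""
    for p in module.parameters():
        p.requires_grad = False

# Stand-ins for the pre-trained, frozen components.
encoder = nn.Linear(512, 512)    # e.g., a modality encoder
llm = nn.Linear(4096, 4096)      # e.g., the Vicuna core
freeze(encoder)
freeze(llm)

# The trainable bridge: a linear projection from encoder space to LLM space.
input_proj = nn.Linear(512, 4096)

trainable = sum(p.numel() for p in input_proj.parameters())
frozen = sum(p.numel() for m in (encoder, llm) for p in m.parameters())
print(f"trainable fraction: {trainable / (trainable + frozen):.2%}")
```

With realistically sized encoders and a multi-billion-parameter LLM core, this trainable fraction shrinks to the roughly 1% figure cited above, which is what keeps training costs low.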

The model was instruction-tuned on a dataset of roughly 5,000 example dialogues between human users and chatbots, covering scenarios that involve multiple modalities in both input and output.
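A training dialogue that mixes modalities in both directions might be represented roughly as follows. The field names and placeholder markers here are assumptions for illustration, not the MosIT dataset's actual schema.

```python
# Hypothetical shape of one multi-modal training dialogue; keys and
# markers like "<image_0>" are illustrative, not the real schema.
example = {
    "turns": [
        {"role": "user",
         "input_modalities": ["text", "image"],
         "content": "What instrument is in this photo? <image_0>"},
        {"role": "assistant",
         "output_modalities": ["text", "audio"],
         "content": "It is a cello. Here is how it sounds: <audio_0>"},
    ]
}

def modalities_used(dialogue: dict) -> set:
    """Collect every modality that appears anywhere in a dialogue."""
    mods = set()
    for turn in dialogue["turns"]:
        mods.update(turn.get("input_modalities", []))
        mods.update(turn.get("output_modalities", []))
    return mods

print(sorted(modalities_used(example)))  # ['audio', 'image', 'text']
```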


Performance and evaluation

NExT-GPT was evaluated on various multi-modal generation benchmarks, demonstrating competitive results compared to baseline models. Human judges also rated the model’s output in different scenarios, with image generation scenarios receiving higher scores than video and audio.

A distinctive feature of the model is its ability to generate modality-signaling tokens when users request specific types of content, such as images, videos, or sounds. These tokens were pre-defined and added to the LLM's vocabulary during training.
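One way such signal tokens can work in practice is sketched below. The token strings, toy vocabulary, and routing logic are hypothetical, but they illustrate how a generated marker can hand the tokens that follow it to a modality-specific decoder.

```python
# Hedged sketch of modality-signaling tokens. The token strings and
# routing scheme are illustrative, not NExT-GPT's actual ones.

SIGNAL_TOKENS = ["<IMG>", "<VID>", "<AUD>"]  # hypothetical markers

# Extend the LLM's vocabulary with the signal tokens before training.
vocab = {"hello": 0, "world": 1}
for tok in SIGNAL_TOKENS:
    vocab[tok] = len(vocab)

def route_output(generated_tokens: list) -> list:
    """Dispatch the tokens after each signal token to the matching decoder."""
    jobs = []
    for i, tok in enumerate(generated_tokens):
        if tok in SIGNAL_TOKENS:
            jobs.append((tok, generated_tokens[i + 1:]))
    return jobs

# A response asking the image decoder to render the trailing tokens:
jobs = route_output(["hello", "<IMG>", "a", "sunset"])
print(jobs)  # [('<IMG>', ['a', 'sunset'])]
```

In the real system the tokens following a signal marker would be projected into the conditioning space of the corresponding decoder (e.g., an image diffusion model) rather than handled as plain strings.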

NExT-GPT's open-source release is a significant contribution to multi-modal AI, giving researchers and developers a model that handles diverse inputs and outputs and enabling applications that seamlessly integrate text, images, videos, and audio. Potential use cases span content generation, multimedia analysis, and virtual assistants capable of understanding and responding to user requests in their preferred formats.

Elizabeth Wallace

Elizabeth Wallace is a Nashville-based freelance writer with a soft spot for data science and AI and a background in linguistics. She spent 13 years teaching language in higher ed and now helps startups and other organizations clearly explain what they do.
