
The Coming Shift from Bigger AI Models to Smaller, Faster Ones

The future of AI won’t be built using a single giant brain, but rather a network of distributed micro-intelligences based on smaller AI models all working together at machine speed.

Written By
Jeff Kuo
Dec 22, 2025

Bigger isn’t always better, especially when it comes to AI models. Each new generation is larger, more capable, and more resource-intensive, and that added scale delivers enhanced reasoning, summarization, and even code generation capabilities. But the size and scalability of generative AI models have their limits.

Larger models are designed to work best with open-ended problems, which are, by nature, often encountered in chat. However, when an AI-powered product, such as a CRM system, uses AI models, the problem the product is solving is actually fixed and highly structured. That is a substantial departure from the original chat format, in which the model must define the problem and work out the steps to a solution on its own.

As we look toward 2026, we can expect to see more nimble system designs. As AI transitions from research to production, particularly in enterprise ecosystems, the limitations of LLMs are beginning to show. Latency, cost, and lack of control make it difficult to harness LLMs for fixed business workflows. Using LLMs to address routine business issues is like using a sledgehammer to crack a nut – you don’t need that much AI processing power.

Why Smaller AI Models Are Better

Let’s take AI-powered customer support for e-commerce, one of the most popular business use cases of GenAI. When implementing an AI customer support agent, the first instinct would be to deploy a large thinking model like GPT-5 Thinking or Sonnet 4.5 to handle the full customer inquiry, since these thinking models are supposedly powerful enough to do everything: understanding customer tone, interpreting requests, generating empathetic responses, checking inventory, processing returns, and escalating complex issues. However, when this is actually implemented, several key issues emerge:

1. The response is slow. Larger thinking models are often slower than smaller models. This may be a smaller problem for email support, but a very big issue for chat support.

2. It’s expensive. Larger models may cost 10 times as much as smaller models to process the exact same input.

3. It’s inconsistent. A larger model may correctly answer customer inquiries 90% of the time, but it’s very difficult to improve on the last 10% since we have so little control over “how” the model thinks.

The next wave of AI systems will prioritize architecture over scale. It’s time to adopt smaller, faster, more specialized AI models engineered to work together as modular components to address specific business problems.

The Bigger Brain Fallacy

For the past five years, developers have been focused on optimizing “thinking” AI models that can handle open-ended reasoning using conversational language. These LLMs are great for free-form tasks, such as ideation, creative writing, and complex logic. They are less well-suited for structured, rules-based applications, such as CRM, ERP, and e-commerce, yet organizations keep adapting LLMs for rules-based workflows.

The problem space for many business issues is well-defined within a specific workflow. LLMs are ideal for freeform reasoning, but in these settings the task is usually already clearly defined; the model does not need to reason its way to a solution path. The job is to execute that path efficiently and predictably, with consideration for constraints like cost and latency.

For interactive systems that handle routine customer issues, businesses need predictability and consistency, not opaque AI geniuses.

See also: Why Training LLMs on Company-Owned Data Is the Key to Scaling AI

Modular Means More Efficiency

Rather than adopting behemoth AI models, it makes better sense to break the problem into a sequence of narrower AI tasks, each handled by a specific, lightweight AI model. Each of these smaller models performs a discrete, well-defined function. Together, they can be assembled into a composable workflow that outperforms a single large LLM on these well-defined tasks. Assembling a swarm of task-specific models optimizes speed, cost, and reliability.

For example, we already have a clear set of rules on how customer inquiries should be processed. Here is a high-level overview of how small models can divide and conquer (a minimal code sketch of the resulting pipeline follows the list):

1) Intent classification – Start with an intent classifier built on a tiny model. Its only job is to read the customer message and identify what the customer wants, whether that is a refund, order tracking, product info, etc.

2) Policy enforcement – Based on the classified intent, run a predefined SOP for that category. Say the customer is asking for a refund; a small model can first check store return policies and then accept or reject the request, ask for more information, or escalate and route to human support.

3) Data interaction – If the refund is accepted, run a model to generate an action to check and update customer order data in the database.

4) Response generation – Based on the result of the updated order, the AI drafts a response using a small model, or a simple templated reply is sent to the customer without using AI at all.
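The sketch below shows how such a pipeline might be wired together in Python. It is illustrative only: the component names, intents, and policy outcomes are assumptions, and each callable stands in for whatever small model, rules engine, or template you actually use.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Order:
    order_id: str
    refundable: bool

def handle_inquiry(
    message: str,
    classify_intent: Callable[[str], str],   # step 1: tiny intent classifier
    check_policy: Callable[[str], str],      # step 2: small policy-check model
    lookup_order: Callable[[str], Order],    # step 3: database access, no AI required
    draft_reply: Callable[[str], str],       # step 4: small generation model or template
) -> str:
    intent = classify_intent(message)
    if intent != "refund":
        return draft_reply(f"route:{intent}")   # other intents follow their own SOPs

    decision = check_policy(message)
    if decision == "escalate":
        return "Your request has been passed to a human agent."

    order = lookup_order(message)
    if decision == "accept" and order.refundable:
        return draft_reply(f"refund_approved:{order.order_id}")
    return draft_reply(f"refund_denied:{order.order_id}")

# Usage with stand-in components; each lambda could be a fine-tuned small model.
reply = handle_inquiry(
    "I want a refund for order 123",
    classify_intent=lambda m: "refund",
    check_policy=lambda m: "accept",
    lookup_order=lambda m: Order(order_id="123", refundable=True),
    draft_reply=lambda key: f"[templated reply for {key}]",
)

Because each stage sits behind an ordinary function boundary, any one of them can be replaced by a cheaper model, a rules engine, or a plain template without touching the rest of the workflow.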

While there are multiple model calls, each one is smaller, faster, and cheaper than using a single LLM. This approach could reduce processing time by 70% and cut costs by over 50%. The simpler the query, the shorter the time and the lower the cost. It’s also easier to debug. Since each function has a specific responsibility, developers can observe and test outcomes. Each component can be individually benchmarked to identify the weak points.
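Because each stage has a single responsibility, it can also be benchmarked on its own. Below is a rough sketch of per-component evaluation, using hypothetical test cases and a stub classifier standing in for a real small model.

import time

def benchmark_component(component, labeled_cases):
    # Score one pipeline stage against labeled examples and time each call.
    correct, latencies = 0, []
    for text, expected in labeled_cases:
        start = time.perf_counter()
        prediction = component(text)
        latencies.append(time.perf_counter() - start)
        correct += int(prediction == expected)
    return {
        "accuracy": correct / len(labeled_cases),
        "avg_latency_ms": 1000 * sum(latencies) / len(latencies),
    }

# Illustrative cases for the intent classifier only, independent of the rest of the workflow.
cases = [
    ("Where is my package?", "order_tracking"),
    ("I want my money back", "refund"),
]
stub_classifier = lambda text: "refund" if "money back" in text else "order_tracking"
print(benchmark_component(stub_classifier, cases))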

In most cases, the accuracy of this swarm-of-smaller-models approach is better than that of a single large thinking model, because each smaller model is asked to do one much simpler, more specific job, which leaves far less room for hallucination. Each task also has many fewer output degrees of freedom and a clearer success criterion, which reduces the number of ways that things can go wrong.

See also: Navigating the AI Landscape: Why Multiple LLMs Are Your Best Bet

A Return to Classic Software Principles

Using a modular approach may seem familiar. Rather than treating AI systems as black boxes, this marks a return to classic software engineering, where developers can create transparent and measurable elements.

In this architecture, each model behaves like a microservice. Observable metrics such as latency, cost per token, and accuracy are tracked at every stage. Classifiers or text generators can be swapped out without having to retrain the entire system. Workflows can be reconfigured based on user context or business logic.
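As a minimal sketch of that microservice-style framing, the wrapper below tracks per-stage calls, latency, errors, and a rough cost estimate, and lets the underlying model be swapped without changing callers. The class and its token estimate are illustrative assumptions, not any particular framework’s API.

import time
from typing import Callable

class ModelStage:
    # Wraps one model component so latency, cost, and errors are observable
    # per stage, and the underlying model can be swapped out at any time.
    def __init__(self, name: str, model_fn: Callable[[str], str], cost_per_1k_tokens: float):
        self.name = name
        self.model_fn = model_fn
        self.cost_per_1k_tokens = cost_per_1k_tokens
        self.metrics = {"calls": 0, "errors": 0, "total_latency_s": 0.0, "est_cost": 0.0}

    def __call__(self, text: str) -> str:
        start = time.perf_counter()
        self.metrics["calls"] += 1
        try:
            result = self.model_fn(text)
        except Exception:
            self.metrics["errors"] += 1
            raise
        finally:
            self.metrics["total_latency_s"] += time.perf_counter() - start
        # Rough word count stands in for tokens when estimating cost.
        self.metrics["est_cost"] += len(text.split()) / 1000 * self.cost_per_1k_tokens
        return result

    def swap_model(self, new_fn: Callable[[str], str]) -> None:
        # Replace the classifier or generator in place; metrics keep accumulating.
        self.model_fn = new_fn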

This modular approach aligns AI with modern DevOps practices. Deployment pipelines can be extended to include model components. Monitoring tools can log model-level performance, error rates, and drift. The result is AI development as an iterative engineering approach rather than building a black box. The resulting systems are not only faster and more predictable but also easier to maintain at scale.

The use cases of the largest AI adopters are mostly well suited to this swarm-of-smaller-models approach. The top 30 OpenAI customers have already used more than 1 trillion AI tokens. For most of these companies, AI usage is well-defined, so they would likely benefit from a swarm of small models.

Duolingo is one of the companies on that top-30 list. It uses AI for language learning, which doesn’t require much critical thinking. What it does need is a consistent way to generate responses in multiple languages. A swarm of small models handling structured, repeatable tasks is all that’s needed.

Generative AI was designed to address the bigger challenge of natural language processing (NLP). Most AI applications take advantage of that capability, but in 2026 we can expect the focus to shift from AI model size to system design. The most advanced products will be defined by their architecture rather than their parameter counts. The key to success is intelligently and efficiently orchestrating specialized models to address specific business outcomes.

AI is entering the DevOps era. The future won’t be built using a single giant brain, but a network of distributed micro-intelligences working together at machine speed.
