
The Coming Shift from Bigger AI Models to Smaller, Faster Ones

The future of AI won’t be built using a single giant brain, but rather a network of distributed micro-intelligences based on smaller AI models all working together at machine speed.

Written By
Jeff Kuo
Dec 22, 2025

Bigger isn’t always better, especially when it comes to AI models. Each new generation is larger, more capable, and more resource-intensive, and that added scale delivers enhanced reasoning, summarization, and even code generation capabilities. But the size and scalability of generative AI models have their limits.

Larger models are designed to work best with open-ended problems, which are, by nature, often encountered in chat. However, when an AI-powered product, such as a CRM system, uses AI models, the problem the product is solving is actually fixed and highly structured. That is a substantial departure from the original chat format, in which the model must define the problem and work out the steps to a solution on its own.

As we look toward 2026, we can expect to see more nimble system designs. As AI transitions from research to production, particularly in enterprise ecosystems, the limitations of LLMs are beginning to show. Latency, cost, and lack of control make it difficult to harness LLMs for fixed business workflows. Using LLMs to address routine business issues is like using a sledgehammer to crack a nut – you don’t need that much AI processing power.

Why Smaller AI Models Are Better

Let’s take AI-powered customer support for e-commerce, one of the most popular business use cases of GenAI. When implementing an AI customer support agent, the first instinct would be to deploy a large thinking model like GPT-5 Thinking or Sonnet 4.5 to handle the full customer inquiry, since these thinking models are supposedly powerful enough to do everything: understanding customer tone, interpreting requests, generating empathetic responses, checking inventory, processing returns, and escalating complex issues. However, when this is actually implemented, several key issues emerge:

1. The response is slow. Larger thinking models are often slower than smaller models. This may be a smaller problem for email support, but a very big issue for chat support.

2. It’s expensive. Larger models may cost 10 times as much as smaller models to process the exact same input.

3. It’s inconsistent. A larger model may correctly answer customer inquiries 90% of the time, but it’s very difficult to improve on the last 10% since we have so little control over “how” the model thinks.

The next wave of AI systems will prioritize architecture over scale. It’s time to adopt smaller, faster, more specialized AI models engineered to work together as modular components to address specific business problems.

The Bigger Brain Fallacy

For the past five years, developers have been focused on optimizing “thinking” AI models that can handle open-ended reasoning using conversational language. These LLMs are great for free-form tasks, such as ideation, creative writing, and complex logic. They are less well-suited for structured, rules-based applications, such as CRM, ERP, and e-commerce, yet organizations keep adapting LLMs for rules-based workflows.

The problem space for many business issues is well-defined within a specific workflow. LLMs are ideal for freeform reasoning, but in these settings the task is usually already clearly defined; the model does not need to reason its way to a solution path. The job is to execute that path efficiently and predictably, with consideration for constraints like cost and latency.

For interactive systems that handle routine customer issues, businesses need predictability and consistency, not opaque AI geniuses.

See also: Why Training LLMs on Company-Owned Data Is the Key to Scaling AI

Modular Means More Efficiency

Rather than adopting behemoth AI models, it makes better sense to break the problem into a sequence of narrower AI tasks, each handled by a specific, lightweight AI model. Each of these smaller models performs a discrete, well-defined function. Together, they can be assembled into a composable workflow that outperforms a single large LLM on these well-defined tasks. Assembling a swarm of task-specific models optimizes speed, cost, and reliability.

For example, we already have a clear set of rules on how customer inquiries should be processed. Here is a high-level overview of how small models can divide and conquer (a minimal code sketch of the resulting pipeline follows the list):

1) Intent classification – Start with an intent classifier built on a tiny model. Its only job is to read the customer message and identify what the customer wants, whether that is a refund, order tracking, product info, etc.

2) Policy enforcement – Based on the classified intent, run a predefined SOP for that category. Say the customer is asking for a refund; a small model can first check store return policies and then accept or reject the request, ask for more information, or escalate and route to human support.

3) Data interaction – If the refund is accepted, run a model to generate an action to check and update customer order data in the database.

4) Response generation – Based on the result of the updated order, the AI drafts a response using a small model, or a simple templated reply is sent to the customer without using AI at all.
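The sketch below shows how such a pipeline might be wired together in Python. It is illustrative only: the component names, intents, and policy outcomes are assumptions, and each callable stands in for whatever small model, rules engine, or template you actually use.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Order:
    order_id: str
    refundable: bool

def handle_inquiry(
    message: str,
    classify_intent: Callable[[str], str],   # step 1: tiny intent classifier
    check_policy: Callable[[str], str],      # step 2: small policy-check model
    lookup_order: Callable[[str], Order],    # step 3: database access, no AI required
    draft_reply: Callable[[str], str],       # step 4: small generation model or template
) -> str:
    intent = classify_intent(message)
    if intent != "refund":
        return draft_reply(f"route:{intent}")   # other intents follow their own SOPs

    decision = check_policy(message)
    if decision == "escalate":
        return "Your request has been passed to a human agent."

    order = lookup_order(message)
    if decision == "accept" and order.refundable:
        return draft_reply(f"refund_approved:{order.order_id}")
    return draft_reply(f"refund_denied:{order.order_id}")

# Usage with stand-in components; each lambda could be a fine-tuned small model.
reply = handle_inquiry(
    "I want a refund for order 123",
    classify_intent=lambda m: "refund",
    check_policy=lambda m: "accept",
    lookup_order=lambda m: Order(order_id="123", refundable=True),
    draft_reply=lambda key: f"[templated reply for {key}]",
)

Because each stage sits behind an ordinary function boundary, any one of them can be replaced by a cheaper model, a rules engine, or a plain template without touching the rest of the workflow.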

While there are multiple model calls, each one is smaller, faster, and cheaper than using a single LLM. This approach could reduce processing time by 70% and cut costs by over 50%. The simpler the query, the shorter the time and the lower the cost. It’s also easier to debug. Since each function has a specific responsibility, developers can observe and test outcomes. Each component can be individually benchmarked to identify the weak points.
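Because each stage has a single responsibility, it can also be benchmarked on its own. Below is a rough sketch of per-component evaluation, using hypothetical test cases and a stub classifier standing in for a real small model.

import time

def benchmark_component(component, labeled_cases):
    # Score one pipeline stage against labeled examples and time each call.
    correct, latencies = 0, []
    for text, expected in labeled_cases:
        start = time.perf_counter()
        prediction = component(text)
        latencies.append(time.perf_counter() - start)
        correct += int(prediction == expected)
    return {
        "accuracy": correct / len(labeled_cases),
        "avg_latency_ms": 1000 * sum(latencies) / len(latencies),
    }

# Illustrative cases for the intent classifier only, independent of the rest of the workflow.
cases = [
    ("Where is my package?", "order_tracking"),
    ("I want my money back", "refund"),
]
stub_classifier = lambda text: "refund" if "money back" in text else "order_tracking"
print(benchmark_component(stub_classifier, cases))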

In most cases, the accuracy of this swarm-of-smaller-models approach is better than that of a single large thinking model, because each smaller model is asked to do one much simpler, more specific job, which leaves far less room for hallucination. Each task also has many fewer output degrees of freedom and a clearer success criterion, which reduces the number of ways that things can go wrong.

See also: Navigating the AI Landscape: Why Multiple LLMs Are Your Best Bet

A Return to Classic Software Principles

Using a modular approach may seem familiar. Rather than treating AI systems as black boxes, this marks a return to classic software engineering, where developers can create transparent and measurable elements.

In this architecture, each model behaves like a microservice. Observable metrics such as latency, cost per token, and accuracy are tracked at every stage. Classifiers or text generators can be swapped out without having to retrain the entire system. Workflows can be reconfigured based on user context or business logic.
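As a minimal sketch of that microservice-style framing, the wrapper below tracks per-stage calls, latency, errors, and a rough cost estimate, and lets the underlying model be swapped without changing callers. The class and its token estimate are illustrative assumptions, not any particular framework’s API.

import time
from typing import Callable

class ModelStage:
    # Wraps one model component so latency, cost, and errors are observable
    # per stage, and the underlying model can be swapped out at any time.
    def __init__(self, name: str, model_fn: Callable[[str], str], cost_per_1k_tokens: float):
        self.name = name
        self.model_fn = model_fn
        self.cost_per_1k_tokens = cost_per_1k_tokens
        self.metrics = {"calls": 0, "errors": 0, "total_latency_s": 0.0, "est_cost": 0.0}

    def __call__(self, text: str) -> str:
        start = time.perf_counter()
        self.metrics["calls"] += 1
        try:
            result = self.model_fn(text)
        except Exception:
            self.metrics["errors"] += 1
            raise
        finally:
            self.metrics["total_latency_s"] += time.perf_counter() - start
        # Rough word count stands in for tokens when estimating cost.
        self.metrics["est_cost"] += len(text.split()) / 1000 * self.cost_per_1k_tokens
        return result

    def swap_model(self, new_fn: Callable[[str], str]) -> None:
        # Replace the classifier or generator in place; metrics keep accumulating.
        self.model_fn = new_fn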

This modular approach aligns AI with modern DevOps practices. Deployment pipelines can be extended to include model components. Monitoring tools can log model-level performance, error rates, and drift. The result is AI development as an iterative engineering approach rather than building a black box. The resulting systems are not only faster and more predictable but also easier to maintain at scale.

The use cases of the largest AI adopters are mostly well suited to this swarm-of-smaller-models approach. The top 30 OpenAI customers have already used more than 1 trillion AI tokens. For most of these companies, AI usage is well-defined, so they would likely benefit from a swarm of small models.

Duolingo is one of the companies on that top-30 list. It uses AI for language learning, which doesn’t require much critical thinking. What it does need is a consistent way to generate responses in multiple languages. A swarm of small models handling structured, repeatable tasks is all that’s needed.

Generative AI was designed to address the bigger challenge of natural language processing (NLP). Most AI applications take advantage of that capability, but in 2026 we can expect the focus to shift from AI model size to system design. The most advanced products will be defined by their architecture rather than their parameter counts. The key to success is intelligently and efficiently orchestrating specialized models to address specific business outcomes.

AI is entering the DevOps era. The future won’t be built using a single giant brain, but a network of distributed micro-intelligences working together at machine speed.
