Unlocking AI’s Potential with a Data-centric Approach


Weaving elements of data-centric and model-centric AI approaches together provides a balanced framework for developing robust AI systems.

AI has advanced steadily over the past 10 years, and each new development brings more complexity. AI systems can introduce risks like biases and hallucinations. To help address these challenges, a paradigm shift is emerging: from model-centric AI to data-centric AI.

As with most technological transformations, data is what holds the keys to the kingdom. Data propels machine learning (ML) and natural language processing innovations forward. Think of it like sunlight and plant growth: sunlight provides the energy that plants need to grow strong and reach their full potential, just as data does for AI projects. Without sufficient and diverse data, ML algorithms cannot effectively learn and evolve.

Leaning into a data-centric approach to building and evolving AI models can improve overall accuracy and give businesses a competitive advantage. Before we dive into the benefits of this method, let’s understand why a model-centric AI approach can steer development teams in the wrong direction.

The Problem with Model-centric AI

Model-centric AI focuses primarily on building and fine-tuning AI algorithms to maximize performance. The emphasis is on optimizing the model’s architecture and parameters, often overlooking the quality and relevance of the data used for training. This can result in models that lack robustness when faced with new data.

Let’s break down the flaws in a couple of real-world scenarios:

Consider a healthcare diagnostic scenario in which a model-centric approach develops an intricate neural network to diagnose diseases from medical images. If the training data is limited or not diverse (e.g., mostly images from a particular demographic or from certain types of equipment), the model may perform poorly in real-world settings where patient demographics and equipment vary. Additionally, as new viruses and diseases are discovered, the model must be updated with those findings, or it will fail to detect conditions that might turn out to be fatal.

Suppose a meteorological organization develops an ML model for predicting weather patterns based on historical climate data, satellite imagery, and atmospheric readings. With a model-centric approach, the algorithm is tuned to predict weather conditions from the data that is already available. At first, the model is accurate. But as time goes on, climate patterns evolve, new data becomes available, and the system’s performance declines. Because sudden drops in temperature and other abrupt climate changes weren’t present in the training datasets, the model has limited adaptability and can’t keep up with them.
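The decline described in the weather scenario is often called data drift. As a minimal sketch of how a team might notice it, the Python below compares recent readings with the training data using a standardized shift in the mean. The function names, the threshold of 3.0, and the choice of statistic are illustrative assumptions, not a method described in this article; production systems typically use richer tests.

```python
from math import sqrt
from statistics import mean, stdev


def mean_shift_score(training, recent):
    """Return how far the recent mean sits from the training mean,
    measured in standard errors (assumed, simplified drift statistic)."""
    if len(training) < 2 or not recent:
        raise ValueError("need at least 2 training values and 1 recent value")
    spread = stdev(training)
    difference = abs(mean(recent) - mean(training))
    if spread == 0:
        # Training data had no variation: any difference counts as a shift.
        return 0.0 if difference == 0 else float("inf")
    standard_error = spread / sqrt(len(recent))
    return difference / standard_error


def has_drifted(training, recent, threshold=3.0):
    """Flag drift when the score exceeds a hypothetical threshold."""
    return mean_shift_score(training, recent) > threshold


# Hypothetical temperature readings in degrees Celsius.
training_temps = [10, 12, 11, 13, 9, 10, 12, 11]
similar_week = [11, 10, 12, 11]
cold_snap = [2, 1, 3, 2]

print(has_drifted(training_temps, similar_week))
print(has_drifted(training_temps, cold_snap))
```

A check like this does not fix the model. It only signals that the incoming data no longer resembles the training data, which is the point at which refreshing the dataset becomes worthwhile.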

While model-centric approaches have their benefits, they are not without drawbacks, especially when it comes to data quality, fairness, transparency, and adaptability.

Data-centric AI Reigns Supreme

Unlike a model-centric AI approach, data-centric AI prioritizes the quality and relevance of the data used to train and deploy the model. It includes data cleaning, augmentation, and ensuring the data is representative of the real-world scenarios where the model will be deployed.
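To make the data cleaning step concrete, here is a small Python sketch, assuming records are stored as dictionaries. The function name `clean_records`, the field names, and the rule that two records are duplicates when their required fields match are all assumptions made for illustration.

```python
def clean_records(records, required_fields):
    """Drop records with missing required fields, then drop records whose
    required fields duplicate an earlier record. Field values must be hashable."""
    seen = set()
    cleaned = []
    for record in records:
        # Treat None and the empty string as missing values.
        if any(record.get(field) in (None, "") for field in required_fields):
            continue
        key = tuple(record[field] for field in required_fields)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


# Hypothetical imaging records.
raw_records = [
    {"scan_id": 1, "label": "normal"},
    {"scan_id": 1, "label": "normal"},     # duplicate
    {"scan_id": 2, "label": None},         # missing label
    {"scan_id": 3, "label": ""},           # empty label
    {"scan_id": 4, "label": "pneumonia"},
]

cleaned_records = clean_records(raw_records, ["scan_id", "label"])
print([record["scan_id"] for record in cleaned_records])
```

Real pipelines usually add domain-specific validation on top of this, such as checking value ranges or image metadata, but the principle is the same: errors are removed from the data before any model is trained on it.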

As AI models mature, diversify, and expand in complexity, their datasets should evolve alongside them to enhance quality and accuracy. This brings models closer to the data rather than shuttling data to the models. Overall, the data-centric AI approach offers a holistic framework for developing AI systems that are increasingly accurate, fair, transparent, and adaptable, ultimately leading to more responsible and impactful applications of AI.

Key benefits of data-centric AI include:

  • Enhanced data quality: Data-centric AI ensures the quality of the data by comprehensively collecting data from diverse sources. It also involves meticulous data cleaning to remove errors and inconsistencies and data augmentation techniques to enrich datasets with relevant information.
  • Mitigated hallucinations: Ensuring the accuracy and reliability of the dataset and addressing biases, including skewed representations of demographics, reduces the risk of hallucinations or incorrect outputs. Comprehensive training procedures enhance the model’s ability to generalize to unseen data and adapt to real-world variability. Continual monitoring and updating of AI models with real-time data helps maintain accuracy and relevance, mitigating the potential for hallucinations over time.
  • Transparency and accountability: By emphasizing documentation of data sources, preprocessing steps, and model training procedures, the AI model and its creators remain transparent and are held accountable. Data-centric approaches also typically incorporate tools for bias detection and mitigation, allowing stakeholders to identify and address potential sources of bias or discrimination. If this approach had been used for the healthcare diagnostic tool described earlier, the patient dataset and the disease information would have been more accurate. The required transparency would also have promoted accountability for the tool’s data sources, allowing stakeholders to scrutinize the AI system and assess its suitability, leading to a more responsible and ethical deployment.
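One simple form of the bias detection mentioned above is checking whether each group appears in the dataset in roughly the proportion expected. The Python sketch below illustrates the idea. The function names, the `age_group` field, the expected shares, and the 0.05 tolerance are hypothetical and would need to come from the actual deployment population.

```python
from collections import Counter


def group_shares(records, field):
    """Return each group's share of the records for the given field."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def underrepresented(records, field, expected_shares, tolerance=0.05):
    """List groups whose share falls more than `tolerance` below the
    expected share. Groups missing from the data count as a share of 0."""
    actual = group_shares(records, field)
    return sorted(
        group
        for group, expected in expected_shares.items()
        if actual.get(group, 0.0) < expected - tolerance
    )


# Hypothetical dataset: 8 records from one age group, 2 from another.
patients = [{"age_group": "18-40"}] * 8 + [{"age_group": "41-65"}] * 2
expected = {"18-40": 0.5, "41-65": 0.3, "65+": 0.2}

flagged_groups = underrepresented(patients, "age_group", expected)
print(flagged_groups)
```

Representation is only one facet of bias. A dataset can be balanced by group and still contain biased labels, so a check like this is a starting point rather than a guarantee of fairness.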

Data-centric AI in the Real World

As an example of a data-centric AI model, we can look at a healthcare organization that is developing an AI system for diagnosing patients’ medical conditions using imaging scans, such as X-rays or MRI images. With a data-centric approach, the organization prioritizes the quality, diversity, and representativeness of the data used to train the AI model. As a result, the dataset is comprehensive, with a wide range of medical imaging scans from various patient populations, including diverse demographics, age groups, and medical conditions. The development team employs techniques for detecting and mitigating biases in the dataset. Even after it’s up and running, the model undergoes continual monitoring and updating with fresh data to maintain its accuracy. The data-centric AI approach enables the healthcare organization to develop a highly accurate and reliable diagnostic tool that enhances patient care and clinical decision-making.

However, even with the long list of advantages to data-centric AI, combining it with elements of a model-centric approach can lead to even more robust and effective AI systems.

Better Together: Data-centric and Model-centric AI

Weaving elements of data-centric and model-centric AI approaches together provides a balanced framework for developing robust AI systems. This way, models are trained on high-quality, diverse data (thanks to data-centric AI), while their internal architecture is optimized for performance (thanks to model-centric AI). By combining these approaches, organizations can mitigate the limitations of each, improve performance, and maintain flexibility for a wide array of use cases.

Rahul Pradhan

About Rahul Pradhan

Rahul Pradhan is VP of Product and Strategy at Couchbase, provider of a leading modern database for enterprise applications that 30% of the Fortune 100 depend on. Rahul has over 20 years of experience leading and managing both engineering and product teams focusing on databases, storage, networking, and security technologies in the cloud. Before Couchbase, he led the Product Management and Business Strategy team for Dell EMC's Emerging Technologies and Midrange Storage Divisions to bring all flash NVMe, Cloud, and SDS products to market.
