How Wide Data, Not Big Data, Provides More Prescriptive Recommendations

To use their data effectively for predictive AI, companies need a variety of data, and especially wide data.

In today’s fast-paced digital world, the use of data is constantly evolving to help companies draw insights from massive repositories of both structured and unstructured data. However, while big data can provide business-critical analytics, it mostly serves to show what has happened. Past tense. Increasingly, wide data must be taken into account when performing predictive and prescriptive analytics.

That is where artificial intelligence can help, and where the data needs of AI diverge from those of big data. Breaking it down, big data is defined by three factors: volume, velocity, and variety. Volume refers to the size of the data available, while velocity refers to the speed at which data arrives and is processed. Variety, the third factor, refers to the range of data types and sources.

However, for companies to use their data effectively for predictive AI, what they need most is variety. With the rise in the adoption of AI across sectors, the ability to access diverse sets of data is paramount and a catalyst for AI algorithms. In other words, keep the data from being too vanilla and spice it up with much more variety.

We call this data with variety “wide data,” which is sourced from an organization’s internal, external, structured, and unstructured data. This is crucial because, in the globalized economy, business performance depends on many parameters.

To see wide data at work, consider two manufacturing plants designing products in different parts of the United States. The geographic location of each plant affects production, especially when there are natural events such as snowstorms. Combining weather and several other disparate external factors with internal data to feed the AI algorithms will result in more accurate predictions about inventory, supply chain, and demand for each manufacturing organization. A variety of data exposes more correlations and hence gives AI algorithms more to learn from, which leads to more accurate results.
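
As a minimal sketch of what combining the two kinds of data can look like, the Python below attaches external weather columns to internal production rows that share a plant and a date. The plant names, dates, figures, and the widen helper are all hypothetical, made up only to illustrate the idea.

```python
# Hypothetical internal records: daily output per plant (made-up numbers).
production = [
    {"plant": "north", "date": "2024-01-10", "units": 480},
    {"plant": "north", "date": "2024-01-11", "units": 310},
    {"plant": "south", "date": "2024-01-10", "units": 505},
    {"plant": "south", "date": "2024-01-11", "units": 498},
]

# Hypothetical external records: snowfall at each plant's location.
weather = [
    {"plant": "north", "date": "2024-01-11", "snow_cm": 22.0},
    {"plant": "south", "date": "2024-01-11", "snow_cm": 0.0},
]


def widen(internal_rows, external_rows, keys=("plant", "date")):
    """Left-join external columns onto internal rows using the shared keys."""
    lookup = {tuple(r[k] for k in keys): r for r in external_rows}
    # Every external column that is not a join key becomes a new column.
    extra_cols = sorted({c for r in external_rows for c in r if c not in keys})
    widened = []
    for row in internal_rows:
        match = lookup.get(tuple(row[k] for k in keys), {})
        # Rows with no external match get None so all rows share the same columns.
        widened.append({**row, **{c: match.get(c) for c in extra_cols}})
    return widened


wide_rows = widen(production, weather)
```

Each row keeps its internal columns and gains one column per external attribute, so adding another external source makes the table wider without adding rows.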

Why not volume?

Volume doesn’t necessarily mean better learning for the algorithm, because more rows of the same kind of data do not add different variables. While having a lot of data is theoretically important for AI applications, size matters less than variety for efficient algorithms.

Take, for example, a project we were working on that predicted outcomes for cancer patients. There were not that many patients to study, which meant only 150 rows of data. That is a relatively small amount of data to sift through, and it was not considered big data. This raised the question: was there enough data for an AI algorithm to learn and predict outcomes for cancer patients? The answer, in this case, was yes. While there were only 150 rows, the biometric, biosensor, and symptoms data constituted thousands of columns, making it wide data.
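
The shape of such a table can be checked with a few lines of Python. This is only a sketch: the 2,000-column figure is a placeholder for "thousands of columns," the values are dummy zeros rather than patient data, and treating a table as wide when it has more columns than rows is a rule of thumb assumed here, not a formal definition.

```python
def describe_shape(rows):
    """Return row count, distinct column count, and a simple 'wide' flag."""
    columns = set()
    for row in rows:
        columns.update(row)
    n_rows, n_cols = len(rows), len(columns)
    # Assumed rule of thumb: call a table wide when columns outnumber rows.
    return {"rows": n_rows, "columns": n_cols, "wide": n_cols > n_rows}


# Placeholder table: 150 rows, 2,000 hypothetical feature columns of zeros.
table = [{f"feature_{i}": 0.0 for i in range(2000)} for _ in range(150)]
shape = describe_shape(table)
```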

The point is that, for AI applications, the variety of the data is more important than its volume.

Types of wide data

As mentioned, several types of data combine to make wide data:

Internal structured data: Data that sits in software applications such as ERP, CRM, or financial systems.
Internal unstructured data: Documents, images, reports, diagrams, and graphs held inside the organization.
External structured data: Data from external sources, such as weather, socioeconomic, census, and stock exchange data.
External unstructured data: News, images, videos, and more from outside the organization’s firewalls.

The CUPP framework: Providing a data strategy

For the successful adoption of AI, it is important to have sound AI practices and a guiding framework in place. One such framework is CUPP, which stands for collection, unification, processing, and presentation. Working through these four steps is a typical way for organizations to start their AI journeys.

Traditional enterprises that may not have had a data strategy or data platforms in place will benefit from adopting a framework like CUPP. A framework is also important when it comes to ensuring veracity, or data accuracy. Organizations require clean, quality data to reach the desired outcome, and attention to veracity speeds up that process considerably.

Here are the basic steps that organizations undergo using the CUPP framework:

1. Start with a data census

Use a data census methodology to discover the data assets that an organization has, both internally and externally.

Part of this planning process is to locate sources of structured and unstructured data. Many organizations may be surprised at how much unstructured data they have, and doing a data census allows them to take stock of all their data assets.
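
A census can start as something as simple as a tally. The sketch below counts a hypothetical inventory of assets by origin and structure and reports the share that is unstructured. The asset names and the census function are illustrative assumptions, not part of any particular census methodology.

```python
from collections import Counter

# Hypothetical inventory gathered during a census: (asset, origin, structure).
inventory = [
    ("ERP orders", "internal", "structured"),
    ("CRM accounts", "internal", "structured"),
    ("Inspection reports", "internal", "unstructured"),
    ("Engineering diagrams", "internal", "unstructured"),
    ("Weather feed", "external", "structured"),
    ("Industry news", "external", "unstructured"),
]


def census(assets):
    """Count assets per (origin, structure) pair and the unstructured share."""
    counts = Counter((origin, structure) for _, origin, structure in assets)
    unstructured = sum(n for (_, s), n in counts.items() if s == "unstructured")
    share = unstructured / len(assets) if assets else 0.0
    return counts, share


counts, unstructured_share = census(inventory)
```

In this made-up inventory, half of the assets are unstructured, which is the kind of finding that tends to surprise organizations taking stock for the first time.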

2. Know your data assets

Before modeling, it is important to be aware of what an organization already has. This helps to develop AI best practices, but it requires a bit of patience. This awareness serves as the foundation for adopting AI applications and for building the ROI case for them.

Companies also need to investigate a collection and unification strategy for their internal and external, structured and unstructured data.

3. Use machine learning and natural language processing to convert and make sense of unstructured content

An organization can take the unstructured data it has sourced and use natural language processing to convert it into structured content for training its models.
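
To make the conversion concrete, here is a deliberately simple sketch that turns free-text notes into rows of term counts. It uses a basic bag-of-words count as a stand-in for natural language processing, which in practice goes much further. The sample notes, the vocabulary, and the function names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical unstructured notes, such as shipping or maintenance logs.
notes = [
    "Snowstorm delayed the inbound shipment of parts.",
    "Shipment arrived on time; no delay reported.",
]


def tokenize(text):
    """Lowercase the text and keep only alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def to_rows(documents, vocabulary):
    """Turn each document into a structured row of term counts."""
    rows = []
    for doc in documents:
        term_counts = Counter(tokenize(doc))
        # One column per vocabulary term; absent terms count as zero.
        rows.append({term: term_counts[term] for term in vocabulary})
    return rows


vocabulary = ["snowstorm", "shipment", "delay", "delayed"]
rows = to_rows(notes, vocabulary)
```

Each note becomes one row with a fixed set of columns, so the result can sit alongside structured data. Note that "delay" and "delayed" are counted separately here; a real pipeline would normally normalize such word forms.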

Organizations should see unstructured content as an opportunity rather than something to fear, because several technologies are now available for making meaningful use of that content.

Key takeaways about the need to use wide data

Wide data can kick-start an AI journey much more rapidly, and it is crucial for helping organizations contextualize the insights from a variety of small and large, structured and unstructured data sources. As technologies evolve and develop, no enterprise will be untouched by data. Data strategies need to be built around obtaining and analyzing a variety of data.