How to Merge Machine Learning and Data Prep

A new crop of data preparation tools, powered by machine learning, is changing the way that businesses prepare their data for optimal use. Here’s what you need to know.

At analytics-capable companies, just about every business department is asking for more data to help it make better decisions. But many of these same companies are also struggling with one common problem: It simply takes too long to transform raw data into meaningful sets that analysts, data scientists, and business users can actually use to their advantage. In fact, according to a report from Forrester, 59 percent of these decision-makers say that data prep is the key bottleneck they must overcome to achieve better business intelligence through analytics.

One of the primary problems is that data preparation is a time-consuming, often-manual process that relies heavily on human labor. In that same Forrester study, 42 percent of analysts claimed to spend a stunning 40 percent of their time fixing and validating data before they can use it. When they skip that due diligence, they might create their own half-baked solutions, which don’t provide the full picture, and bad data drives well-intentioned professionals to make bad decisions.

But, according to Forrester, a new crop of data preparation tools, powered by machine learning, is changing the way that businesses prepare their data for optimal use. On top of that, these tools are trying to accommodate the specific needs of certain roles, rather than simply trying to be everything for everyone.

Applying AI to make data preparation faster

A number of data preparation tools, such as IBM Watson Analytics, Trifacta, and Datameer, now include machine learning algorithms that help minimize the manual processes that otherwise bog down the data stream. Machine learning excels at observing human behavior, particularly the more repetitive and time-consuming tasks, and figuring out how to repeat it, as a form of business process optimization.

Forrester found that these tools can quickly suggest data types and different structures that could be more useful to analysts at the end of the pipeline.
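To make the idea of suggesting data types concrete, here is a minimal sketch in Python. It is a simple rule-based stand-in, not the machine learning these products actually use, and the function name `suggest_type` and the single `YYYY-MM-DD` date format are assumptions chosen for illustration.

```python
from datetime import datetime


def suggest_type(values):
    """Suggest a column type from raw string values (hypothetical helper)."""
    # Ignore missing and blank entries so they do not block a suggestion.
    cleaned = [v.strip() for v in values if v is not None and v.strip()]
    if not cleaned:
        return "unknown"

    def all_parse(parser):
        # True only if every remaining value parses without error.
        for v in cleaned:
            try:
                parser(v)
            except ValueError:
                return False
        return True

    # Try the most specific type first, then fall back to broader ones.
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    # Assumption: dates arrive in ISO-style YYYY-MM-DD form.
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "text"


if __name__ == "__main__":
    print(suggest_type(["2020-01-05", "2021-12-31"]))
    print(suggest_type(["1.5", "2", ""]))
```

A real tool would likely weigh many more formats and learn from which suggestions users accept or reject; the sketch only shows the shape of the suggestion step.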

These data preparation tools also watch how data scientists work in order to understand, and then predict, the types of transformations they require. By helping to predict where and when data scientists or other analysts will need these transformations, machine learning can reduce manual effort and make employees more productive.

Finally, machine learning-enabled data prep tools help suggest new ways to link or clean data sets. Machine learning algorithms excel at correlating data and recognizing patterns, and can help introduce new ways of looking at the data while also reducing the time to prepare it.
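One simple way to picture how a tool might suggest links between data sets is to score how much the values in two columns overlap. The sketch below uses Jaccard similarity for that; it is an illustrative heuristic, not a description of how any named product works, and `suggest_links`, the table layout, and the `0.5` threshold are all assumptions.

```python
def suggest_links(left, right, threshold=0.5):
    """Suggest column pairs that could link two tables (hypothetical helper).

    left, right: dicts mapping a column name to a list of its values.
    Returns (left_column, right_column, score) tuples, best match first.
    """
    suggestions = []
    for lname, lvals in left.items():
        lset = set(lvals)
        for rname, rvals in right.items():
            rset = set(rvals)
            union = lset | rset
            if not union:
                continue  # both columns empty, nothing to compare
            # Jaccard similarity: shared values divided by all distinct values.
            score = len(lset & rset) / len(union)
            if score >= threshold:
                suggestions.append((lname, rname, score))
    return sorted(suggestions, key=lambda s: s[2], reverse=True)


if __name__ == "__main__":
    # Made-up sample tables, for illustration only.
    orders = {"customer_id": [1, 2, 3, 4], "city": ["A", "B"]}
    accounts = {"cust": [2, 3, 4, 5], "region": ["X"]}
    for left_col, right_col, score in suggest_links(orders, accounts):
        print(left_col, right_col, round(score, 2))
```

Value overlap is only one possible signal; a learning-based tool could combine it with column names, data types, and past user choices.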

Consider an analytics power user, whom the Forrester research defines as someone who is “tech and data savvy” and prefers to work with data independently. Such a user would benefit from machine learning that learns from how they work and reinforces those efforts through individualized training. Forrester posits Paxata as one of the primary players in this space.


Other business roles, such as data governance managers or data scientists, require different capabilities, making the ultimate choice of which data prep tool to invest in a difficult one.

The roadmap to better data prep

Instead of focusing too heavily on the specific feature set of one data preparation tool over another, Forrester’s researchers recommend focusing on finding the tool that will help improve collaboration across the spectrum of those who need to work with prepared data. Any additional or unique features can be considered as value-adds on top of stronger collaboration. What else should organizations look for? Here are three areas of focus:

Focus on self-service: An application that lets business analysts serve their own data needs will accelerate their productivity and improve their contextual understanding of the meaning behind data, which eliminates the sensation of “panning for gold.” Plus, serving more users with trustworthy data, even those without years of coding experience and data science fundamentals, will only empower the organization.

Prioritize visibility of data preparation: Some tools are better than others at promoting transparency into how the process happens. If the tool and its processes feel opaque, organizations will simply shift the bottleneck of data preparation away from overworked data scientists and toward a SaaS application that no one understands.

Look for strong access control: Some organizations deal with sensitive data held under strict regulations, and data preparation tools should be intelligent enough to enforce strong governance in the way that end users access data, while not slowing them down. Access control capabilities should be both intuitive and strong.

In Forrester’s study, 65 percent of respondents indicated that at least half of their business intelligence applications were homegrown. Sometimes those tools can be remarkably effective, but in an era where users need to interact with often-sensitive data faster than ever, without help from IT, dedicated data prep tools with built-in machine learning are an obvious, and rather inevitable, choice moving forward.
