Time Consumed by Data Prep: Is This a Bad Thing?


Data professionals are spending too much time on data prep, but the quality assurance it provides ensures projects are working with clean data sets.

To have a responsive, responsible and accurate artificial intelligence or analytics system, one needs data. The catch is that data scientists and analysts are forced to spend more time on data prep than on creating models and making them valuable to their businesses. This suggests a need for more data engineers and database administrators to handle much of the front-end work that goes into supporting data-driven applications. It also means a high degree of teamwork is needed to make data analytics practical.

Ask any data scientist or analyst about the level of support they need to do the jobs they were hired to do. SAS did exactly that in its recent study of 277 data managers and scientists, which finds that data professionals are spending too much time on data preparation and not enough on model creation. Respondents spend more of their time (58%) than they would prefer gathering, exploring, managing and cleaning data.

A typical data science project involves a variety of activities, almost always beginning with preparing data. On average, 11% of data scientists’ or analysts’ time is spent creating computer models. The question is: is this enough?

Data prep may be onerous and take time away from working on business issues, but it’s necessary, the SAS study’s authors point out. “Regardless of your level in the organization, data management will probably take a large share of your time, even with the development of low code/no code tools and AI and machine learning algorithms being written for it,” they write. “The likely reason is that the data you have and how you decide what’s relevant is probably specific to your industry and organization. As is the case for how you approach your model-building, knowing which data is relevant and why has a lot to do with the issues you are trying to solve.”

Data scientist and Data Science Bootcamp Leader Patrick Butler agrees, noting that the whole front-end process of managing and cleaning data “is an intrinsic part of the modeling process.” Without it, “all the modeling that follows is truly just math.” Quality assurance of incoming data up front is essential for ensuring that models are trained on clean data sets.

About Joe McKendrick

Joe McKendrick is RTInsights Industry Editor and an industry analyst focusing on artificial intelligence, digital, cloud and Big Data topics. His work also appears in Forbes and Harvard Business Review. Over the last three years, he served as co-chair for the AI Summit in New York, as well as on the organizing committee for IEEE's International Conferences on Edge Computing. Follow him on Twitter @joemckendrick.
