The Evolution, Misconceptions, and Reality of AutoML


AutoML makes AI more accessible by automating complex manual data science processes. But there are caveats to its use. Here are the top five myths and realities about AutoML.

As businesses around the world expand AI and machine learning (ML) efforts, one key challenge has been finding AI talent. A majority of medium and large businesses are discovering that data science teams are expensive to hire, especially during challenging economic times.

Data scientists typically work for a handful of Fortune 500 companies, and hiring them is beyond the reach of most businesses. Data science projects require an interdisciplinary team of data scientists, ML engineers, software architects, BI analysts, and subject matter experts. Given the variety of data science projects and the substantial amount of data manipulation required, building and retaining this type of team is an arduous task. How can modern enterprises overcome these challenges and scale their data science operations? How can businesses with a strong BI practice leverage recent advances in technology to expand predictive analytics?

See also: 3 Inconvenient Truths about AI and ML

That’s where automated machine learning (AutoML) comes into play. AutoML solutions make AI more accessible by automating complex manual data science processes. By empowering citizen data scientists with advanced analytical tools, companies can reap the benefits of emerging technology. These citizen data scientists can bridge the skill gap, address the labor shortage, and enable companies to leverage the resources they already have. Business analytics and BI professionals can use AutoML platforms to inject predictive and prescriptive analytics into their work, uncover deep insights, enable business leaders to make decisions with real-time analytics, and optimize business performance based on data.

The focus of first-generation AutoML platforms, aka AutoML 1.0, has been on building and validating models automatically. These traditional platforms automate only the machine learning component of the process. While useful, AutoML 1.0 platforms have no impact on the most labor-intensive and challenging parts of the data science process: data preparation and feature engineering. Next-generation platforms, aka AutoML 2.0, provide end-to-end automation. They can do much more, from data preparation and feature engineering to building and deploying models in production. These new platforms are helping development teams reduce the time required to build and deploy ML models from months to days. AutoML 2.0 platforms address hundreds of use cases and dramatically accelerate enterprise AI initiatives by making AI/ML development accessible to BI developers and data engineers, while also accelerating the work of data scientists.

Every new technology, especially in its early days, comes with its share of misconceptions, fallacies, and ambiguity. Here are the top five myths and realities about AutoML:

  • Conflating Feature Selection with Feature Generation: Feature engineering (FE) can mean many different things, such as manually crafting features, selecting features, and extracting features. FE is the most iterative, time-consuming, and resource-intensive part of the data science process, and it demands interdisciplinary expertise: technical knowledge but, more importantly, domain knowledge. The data science team builds features by working with domain experts, testing hypotheses, building and evaluating ML models, and repeating the process until the results are acceptable to the business. In a true sense, FE involves exploring, generating, and selecting the best features from relational, transactional, temporal, geo-locational, or text data across multiple tables. Traditional AutoML platforms require data science teams to generate features manually, a time-consuming process that demands deep domain knowledge. AutoML 2.0 platforms provide AI-powered FE that enables any user to automatically build the right features, test hypotheses, and iterate rapidly. FE automation solves the biggest pain point in data science.
  • Underestimating the importance of Data Preparation: Data is spread across multiple databases in multiple formats that are not suitable for analytics. Many companies lack data infrastructure or do not have data of sufficient volume or quality. Data quality and data management issues are critical, given how heavily AI and ML projects rely on good data. Traditionally, however, the approach companies take to solve data issues requires months of effort, and this upfront investment often causes projects to be abandoned after only a few months of work. A typical enterprise data architecture includes master data preparation tools designed for data cleansing, formatting, and standardization before the data is stored in data lakes and data marts for further analysis. This processed data still requires manipulation specific to AI/ML pipelines, including additional table joins and further prep and cleansing. Traditional AutoML platforms require data engineers to write SQL code and perform manual joins to complete these remaining tasks. AutoML 2.0 platforms, on the other hand, perform automatic data pre-processing (profiling, cleansing, missing-value imputation, and outlier filtering) and discover complex relationships between tables, creating a single flat file ready for ML consumption.
  • Assuming Model Accuracy trumps everything: There is a perception that model accuracy matters more than feature transparency and explainability. Expectations between technical and business teams are often disjointed: data science teams focus on model accuracy, whereas business teams place high importance on metrics such as business insights, financial benefit, and the interpretability of the models produced. This misalignment results in data science project failures because the teams are measuring completely different things. Traditional data science initiatives also tend to use black-box models that are hard to interpret, lack accountability, and are hence difficult to scale. ML platforms and data scientists who take the black-box approach end up creating complex features based on non-linear mathematical transformations, features that cannot be logically explained. Incorporating such features leads to a lack of trust and resistance from business stakeholders and, ultimately, project failure. White-box models (WBMs), by contrast, provide clear explanations of how they behave, how they produce predictions, and which variables influenced the model. WBMs are preferred in many enterprise use cases because of their transparent inner workings and easily interpretable behavior. In heavily regulated industries such as financial services, insurance, and healthcare, feature explainability is critical.
  • A data science background is mandatory for AutoML: A common myth among business intelligence (BI) and data professionals is that AutoML is not meant for BI teams and requires a background in algorithms and ML. AutoML 1.0 platforms were indeed cumbersome, with poor user experiences for BI developers and challenging workflows, and even today, many AutoML platforms are geared toward data scientists and require a strong ML background. AutoML 2.0 platforms, on the other hand, offer end-to-end automation with a drag-and-drop interface and enable anyone to run predictive models with a few simple clicks. These new platforms have unleashed a revolution by empowering citizen data scientists (BI analysts, data engineers, and business users) to embark on data science projects without requiring data scientists. AutoML 2.0 is the secret weapon the BI community can leverage to build powerful predictive analytics solutions in days instead of the months typically associated with Augmented Analytics.
  • Not thinking about ML Operationalization and deployment scenarios: It is important to understand the ML operationalization process and think through the deployment options, whether cloud, on-premises, or at the edge. Unless you deploy ML models in production, you will not capture the value of ML. Is a process developed on your AutoML platform immediately available for production through a prediction API? Does the platform provide endpoints to run and control the developed pipeline, and can it be integrated with other systems via a single API call? What kind of latency will your application require? Does the AutoML platform support real-time analytics, and can you deploy a containerized ML model as a real-time prediction service? Automation makes enterprise-level, end-to-end data science operationalization possible with minimum effort and maximum impact, empowering enterprise data science and software/IT teams to operationalize complex data science projects and deliver continued business value. ML operationalization is an emerging frontier, and businesses need to plan for deployment in production environments from the start.
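To make the feature generation and selection described above concrete, here is a minimal sketch in Python using pandas. The customer transactions and churn labels are invented for illustration; the idea is simply that candidate features are generated by aggregating a transactional table, then ranked against the target automatically rather than hand-crafted one by one.

```python
import pandas as pd

# Hypothetical transactional data: one row per purchase.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 6],
    "amount": [10.0, 25.0, 5.0, 7.5, 3.0, 100.0, 80.0, 60.0, 90.0, 8.0, 120.0],
})
# Made-up target: 1 = customer churned.
churn = pd.Series({1: 1, 2: 1, 3: 0, 4: 0, 5: 1, 6: 0})

# Step 1: generate candidate features by aggregating the transaction table.
features = tx.groupby("customer_id")["amount"].agg(["sum", "mean", "count", "max"])

# Step 2: rank candidates by absolute correlation with the target
# and keep the strongest ones.
scores = features.corrwith(churn.loc[features.index]).abs().sort_values(ascending=False)
selected = scores.head(2).index.tolist()
print(selected)
```

A real AutoML 2.0 platform explores far richer transformations (temporal windows, multi-table joins, text features), but the generate-then-select loop is the same shape.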
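The data preparation steps named above (missing-value imputation, outlier filtering, and joining tables into a single flat file) can also be sketched in a few lines of pandas. The tables and thresholds here are illustrative, not any particular platform's behavior:

```python
import pandas as pd

# Hypothetical source tables spread across systems.
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "region": ["east", "west", None]})
orders = pd.DataFrame({"customer_id": [1, 1, 2, 3],
                       "order_value": [20.0, None, 5000.0, 30.0]})

# Missing-value imputation: mode for categoricals, median for numerics.
customers["region"] = customers["region"].fillna(customers["region"].mode()[0])
orders["order_value"] = orders["order_value"].fillna(orders["order_value"].median())

# Outlier filtering: cap extreme values at the 95th percentile.
cap = orders["order_value"].quantile(0.95)
orders["order_value"] = orders["order_value"].clip(upper=cap)

# Join the tables into a single flat table ready for ML consumption.
flat = orders.merge(customers, on="customer_id", how="left")
print(flat.shape)  # → (4, 3)
```

Automating exactly this kind of profiling, imputation, and joining across many tables is what separates AutoML 2.0 from model-only automation.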
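The white-box point above is easiest to see in code. Here is a small sketch with toy churn data (invented for illustration) in which an interpretable linear model's behavior can be read directly from its coefficients:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy churn dataset, invented for illustration.
X = pd.DataFrame({
    "months_active": [1, 3, 24, 36, 2, 48],
    "support_tickets": [5, 4, 0, 1, 6, 0],
})
y = [1, 1, 0, 0, 1, 0]  # 1 = churned

# A white-box model: each coefficient is a readable statement about
# whether a feature pushes the churn prediction up or down.
model = LogisticRegression().fit(X, y)
for name, coef in zip(X.columns, model.coef_[0]):
    print(f"{name}: {coef:+.3f}")
```

Here the sign of each coefficient (tenure lowers churn risk, support tickets raise it) is exactly the kind of statement a business stakeholder or a regulator can check, which a deep stack of non-linear transformations cannot offer.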
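The deployment questions in the last bullet boil down to: once trained, can the model be serialized as an artifact and served behind a low-latency prediction call? A minimal sketch follows; the feature names and the `predict` function are hypothetical stand-ins, and a real service would wrap this in an HTTP endpoint inside a container:

```python
import io
import joblib
from sklearn.linear_model import LogisticRegression

# Train a trivial model as a stand-in for an AutoML pipeline's output
# (features: months_active, support_tickets; target: churned).
model = LogisticRegression().fit([[1, 5], [3, 4], [24, 0], [36, 1]], [1, 1, 0, 0])

# Serialize the model as a deployment artifact (here to memory; in practice
# to disk or object storage, then loaded inside the serving container).
artifact = io.BytesIO()
joblib.dump(model, artifact)
artifact.seek(0)
served_model = joblib.load(artifact)

def predict(payload: dict) -> int:
    """One prediction call, as an HTTP endpoint handler would expose it."""
    row = [[payload["months_active"], payload["support_tickets"]]]
    return int(served_model.predict(row)[0])

print(predict({"months_active": 2, "support_tickets": 6}))
```

Latency, versioning of the artifact, and containerization all attach to this seam between training and serving, which is why they belong in the plan from day one.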

AutoML 2.0 platforms, the second generation of automated machine learning platforms, provide AI-focused data preparation, feature engineering automation, ML automation, and automated production deployment, and they are the future of data science for data-driven enterprises. Full-cycle data science automation covers the entire data science process, addressing not only scalability but also enabling faster delivery of insights (from months to days).

Through full-cycle data science automation, enterprises don’t have to invest in as many skilled data scientists or teams of engineers for each project. AutoML 2.0 also empowers so-called “citizen” data scientists, bringing AI to the masses. Interpretable features help organizations stay accountable for their data-driven decisions and meet regulatory compliance requirements. With WBMs, data science is actionable, explainable, and accountable, allowing domain experts to interpret models more quickly and increasing the effectiveness and efficiency of the process. This “democratization” of AI provides a unique opportunity for enterprises of all sizes to integrate machine learning into business applications with the shortest time-to-market.


About Ryohei Fujimaki

Ryohei Fujimaki, Ph.D., is the Founder and CEO of dotData, a leader in full-cycle data science automation and operationalization for the enterprise. Prior to founding dotData, he was the youngest research fellow ever in NEC Corporation’s 119-year history, a title bestowed on only six individuals among more than 1,000 researchers. During his tenure at NEC, Ryohei was heavily involved in developing cutting-edge data science solutions with NEC’s global business clients and was instrumental in the successful delivery of several high-profile analytical solutions that are now widely used in industry. Ryohei received his Ph.D. from the University of Tokyo in the field of machine learning and artificial intelligence.
