Why Can’t ML Engineers Do Their Day Jobs? Hint: Crappy Data


Bogging down ML engineers with poor-quality data that requires extensive manual processing hurts product quality and slows the delivery of new features.

Companies are investing heavily in hiring machine learning talent. They are building a high-powered ‘engine’ to propel their business and create differentiated products. However, the C-suite is now realizing that without an efficient process for collecting the right data to train ML models, their ML engineers are starved of ‘fuel.’

With the use of ML growing at an incredible pace, CIOs face the challenge of ensuring the quality of their automation while keeping pace with competitors. As in most traditional business areas, a key part of this is building a strong base of employees with the right expertise. While it’s nothing new for organizations to have a team of data scientists and data engineers, ML engineers are becoming a more common and sought-after part of these teams. The issue, however, is that once CIOs find and hire ML engineers, their expertise isn’t being used to its full extent.

Today’s ML workflows require massive amounts of high-quality, human-labeled data, which is costly and time-consuming to capture. ML engineers are tasked with designing, coding, training, testing, and deploying computer vision models by working with large amounts of unstructured and structured data. But the data isn’t always as “clean” or actionable as one would hope or assume. This is where the breakdown happens. So, where did we go wrong, and how can we get ML engineers back to their day jobs, adding real value to the business?

Dated data collection approach leads to distractions

Data collection is especially complicated when dealing with computer vision applications. With autonomous cars, robotics, and consumer devices, building and deploying the hardware needed to acquire image data results in iteration cycles of months or even years. Once the data is acquired, it is labeled by hand. This process is prone to error and bias, and human labelers are fundamentally limited in their ability to classify specific attributes. For instance, the exact 3D position of objects, which many applications like autonomous robots and smart home devices require, cannot be labeled by humans. Consider training an autonomous vehicle to recognize stop signs: humans are simply not capable of labeling the exact 3D position of every tree or fence that might interfere with how the vehicle perceives the sign. Such gaps could affect the car’s ability to decide whether to stop and could result in an accident.

The current, dated process leaves ML engineers ‘stuck’ with low-quality and limited training data, which in situations like the stop sign example above is simply unacceptable. Additionally, the model development process often reveals limitations in the training data, requiring new data or updates to the hardware, which means another round of data acquisition and a long iteration cycle. All the while, ML engineers are left waiting instead of tackling the more strategic work that translates into business-level benefit.

An organization’s ability to compete and innovate is a function of how fast it can learn and iterate. Organizations that are nimble and have streamlined data acquisition and ML development processes will ultimately create high-performing models that are better targeted to their customers and product use cases.

Unclogging the ML development bottlenecks

Thankfully, solutions aren’t too far off. Several emerging technologies are being developed to address today’s ML development bottlenecks. One is synthetic data: computer-generated data that mimics real-world phenomena. This approach shows promise in its ability to disrupt the traditional model development process and offer a new paradigm for building computer vision systems. By merging technologies from the visual effects industry with generative neural networks, it is now possible to programmatically create vast amounts of photorealistic image data to train computer vision ML systems. Since the data is generated, information about every pixel is explicitly known, and an expanded set of labels is generated automatically. This enables systems to be built and tested virtually and allows ML engineers to iterate orders of magnitude more quickly, since training data can be created on demand.
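To make the point about "free" labels concrete, here is a minimal sketch of the idea in plain Python. It is a toy, not a production renderer and not any vendor's pipeline: the `render_scene` function, the `Label` class, the square "objects," and the pinhole camera parameters are all hypothetical names and assumptions chosen for illustration. Because the script places every object itself, it can emit a per-pixel object mask, a depth map, a bounding box, and an exact 3D position for each object without any human labeling.

```python
import random
from dataclasses import dataclass


@dataclass
class Label:
    """Ground truth for one generated object (hypothetical structure)."""
    object_id: int
    position_3d: tuple  # (x, y, z) in metres, in the camera frame
    bbox: tuple         # (x_min, y_min, x_max, y_max) in pixels, inclusive


def render_scene(num_objects=3, width=64, height=48, focal_px=40.0, seed=0):
    """Generate a toy scene of squares and return (mask, depth, labels).

    Assumptions for this sketch: an ideal pinhole camera at the origin,
    objects are fronto-parallel squares with a 0.3 m half-side, and the
    sampling ranges below keep every object fully inside the image.
    """
    rng = random.Random(seed)
    # mask[v][u] is 0 for background, otherwise the id of the visible object.
    mask = [[0] * width for _ in range(height)]
    # depth[v][u] is the distance of the nearest surface (inf for background).
    depth = [[float("inf")] * width for _ in range(height)]
    labels = []
    half_side = 0.3

    for object_id in range(1, num_objects + 1):
        # The generator chooses the 3D position, so it is known exactly.
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-0.5, 0.5)
        z = rng.uniform(2.0, 6.0)

        # Pinhole projection of the square's centre and half-side to pixels.
        u_centre = width / 2 + focal_px * x / z
        v_centre = height / 2 + focal_px * y / z
        radius = focal_px * half_side / z

        x_min, x_max = int(u_centre - radius), int(u_centre + radius)
        y_min, y_max = int(v_centre - radius), int(v_centre + radius)

        # Depth test: nearer objects occlude farther ones, pixel by pixel.
        for v in range(y_min, y_max + 1):
            for u in range(x_min, x_max + 1):
                if z < depth[v][u]:
                    depth[v][u] = z
                    mask[v][u] = object_id

        labels.append(Label(object_id, (x, y, z), (x_min, y_min, x_max, y_max)))

    return mask, depth, labels


if __name__ == "__main__":
    mask, depth, labels = render_scene(seed=1)
    for label in labels:
        visible = sum(row.count(label.object_id) for row in mask)
        print(label, "visible pixels:", visible)
```

A real synthetic data pipeline would replace the squares with photorealistic rendering, but the labeling principle is the same: occlusion, depth, and 3D position fall out of the generation step rather than being estimated by a person afterward.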

In fact, the ability to simulate and train AI systems has already changed the way industrial robotic systems are built. Today, leading companies can train thousands of virtual robot arms in the cloud in machine time.

Unsupervised deep learning techniques also hold a lot of promise, as they remove human labeling from the ML development process. In addition to streamlining development, these approaches can potentially reduce human biases. IBM recently reported that there are more than 180 human biases, all of which are at risk of entering our AI systems. When these biases cause algorithms to get predictions wrong, organizations find themselves unknowingly making discriminatory and inaccurate decisions. Take, for example, the recent Amazon recruiting model that made headlines for showing bias against women, or Google’s hate speech detection algorithm that discriminated against people of color.

Although promising, current deep learning approaches require tremendous computational resources that are available only to the largest companies, those with the talent, resources, and funds to properly leverage the power of AI to automate decision-making. As the techniques evolve and computational needs fall in the years to come, tools like unsupervised deep learning will hopefully enter the mainstream and make high-quality ML more widespread, reducing the number of cases where biased AI causes harm.

Another technology gaining traction among enterprise companies is machine learning operations (MLOps) tools, such as DataRobot, Neu.ro, and Paperspace. These tools streamline data preparation, model training, and model deployment. With proper implementation, they free ML engineers from more mundane tasks and allow them to focus on driving model performance.

What can ML engineers do with all this free time?

By introducing these new technologies into computer vision development, organizations can free up their highly trained and skilled ML engineers, who should be focusing on driving model performance and developing new models. By bogging down ML engineers with poor-quality data that requires extensive manual processing, businesses are unknowingly hurting the quality of their products and slowing the delivery of new features.

Similar to how CAD systems are used in other engineering disciplines, it will soon be possible for ML engineers to simulate complex systems and train models virtually. ML engineers working on smartphone cameras, autonomous vehicles, and complex robotics will begin to create new capabilities at an increasingly rapid rate. Companies adopting these technologies will better leverage their core ML talent and outperform their peers.

Yashar Behzadi

About Yashar Behzadi

Yashar Behzadi is the CEO of Synthesis AI. He is an experienced entrepreneur who has built transformative businesses in AI, medical technology, and IoT markets. He has spent the last 14 years in Silicon Valley building and scaling data-centric technology companies. His work at Proteus Digital Health was recognized by Wired as one of the top 10 technological breakthroughs of 2008 and as a Technology Pioneer by the World Economic Forum. Yashar has over 30 patents and patents pending and a Ph.D. in Bioengineering from UCSD.
