The demand for AI-enabled applications that deliver increasingly refined results is driving the need for high-quality annotated data to train AI models.
Many continuous intelligence (CI) applications need trained AI models to work. An autonomous vehicle relies on sample data sets that help it differentiate objects and identify road markings and traffic signs. Similarly, an automated video surveillance system needs a data set to learn how to distinguish between a raccoon and an intruder. If the training data is low quality, the resulting AI models will perform poorly.
Meeting that demand is challenging. It is one thing to classify cat images on social media; building a high-quality dataset for facial recognition or autonomous vehicles is far more complex.
In the past, facial recognition used only a handful of points on a human face. Today, facial key-point labeling can involve more than 200 points, with dozens devoted to precisely defining each eyebrow, the lips, the jawline, and more. Such detail is needed to train AI models to determine more than simple attributes like whether the person is male or female. Models now might also be used to estimate race, age, and emotions.
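To make the idea of key-point labeling concrete, here is a minimal sketch of what one annotation record might look like. The region names, point counts, and field names are illustrative assumptions, not any real labeling spec:

```python
from dataclasses import dataclass, field

@dataclass
class KeyPoint:
    name: str             # e.g. "left_eyebrow_02"
    x: float              # pixel coordinates in the source image
    y: float
    visible: bool = True  # occluded points can still be labeled

@dataclass
class FaceAnnotation:
    image_id: str
    points: list = field(default_factory=list)

    def add_region(self, region: str, coords):
        """Label one facial region (eyebrow, lip, jawline, ...) as a
        numbered run of key points."""
        for i, (x, y) in enumerate(coords):
            self.points.append(KeyPoint(f"{region}_{i:02d}", x, y))

# Dozens of points per region quickly add up to 200+ per face.
face = FaceAnnotation(image_id="frame_0001")
face.add_region("left_eyebrow", [(102.5, 88.0), (110.1, 84.2), (118.4, 83.0)])
face.add_region("jawline", [(80.0, 150.0), (95.0, 170.0), (120.0, 182.0)])
print(len(face.points))  # 6 points labeled so far
```

A production format would add attributes per face (pose, lighting, expression) and per point (labeler ID, confidence), but the basic shape is the same: many named coordinates per image.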
One indication of the need for such data comes from China. There, the data service company Testin set up shop in Hengdian World Studios, also known as “Chinawood,” the largest film studio in Asia. Instead of making motion pictures as other tenants of the facility do, Testin photographs and films actors performing facial expressions depicting laughing, crying, anger, and more. The images and videos are then used in facial key-point labeling for Chinese AI companies.
Self-driving Cars Need Data, Too
The quest for data to train autonomous systems is also booming. To get a sense of the complexity and level of detail needed for autonomous vehicles, consider the Waymo Open Dataset. The dataset includes high-resolution sensor data collected by Waymo self-driving cars in a wide variety of conditions. The data can be used by companies trying to train AI driving algorithms. This public database includes roughly 3,000 driving scenes, 16.7 hours of video data, 600,000 frames, and approximately 25 million 3D bounding boxes and 22 million 2D bounding boxes. (The most impressive thing about these numbers is that they represent just a tiny fraction of Waymo’s private autonomous driving database.)
A typical high-quality self-driving dataset includes large volumes of metadata and annotations, such as:
- pixel-wise semantic annotation
- 3D semantic annotation
- pixel-wise object instance annotation
- fine-grained road segmentation
- moving object trajectory
- high-precision GPS data
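The 2D and 3D bounding boxes counted above can be represented quite simply. This is a hedged sketch of what such records might look like; the field names are illustrative, not the Waymo Open Dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Box2D:
    label: str   # e.g. "vehicle", "pedestrian", "traffic_sign"
    x: float     # top-left corner in the camera image, pixels
    y: float
    width: float
    height: float

@dataclass
class Box3D:
    label: str
    cx: float       # box center in the vehicle frame, meters
    cy: float
    cz: float
    length: float
    width: float
    height: float
    heading: float  # yaw angle of the box, radians

def area(box: Box2D) -> float:
    """Pixel area of a 2D box -- handy for filtering tiny detections."""
    return box.width * box.height

car = Box2D("vehicle", x=412.0, y=250.0, width=96.0, height=64.0)
print(area(car))  # 6144.0
```

Multiply a record like this by tens of millions of boxes across hundreds of thousands of frames and the scale of the annotation effort becomes clear.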
The Role of Data Annotators
Businesses that want to build CI applications that use AI need high-quality data to train the AI models. That need has created a new market for data annotation services. The companies offering these services deliver greater value than a public crowdsourcing service might: this new breed of company employs highly trained data labelers, and many develop their own advanced annotation tools.
These new data labeling companies differentiate themselves from traditional crowdsourcing platforms that offer labeling services. Companies in this category often tout their offerings as managed data labeling services: they deliver domain-specific labeled data that undergoes quality control.
If funding is a measure of the need or value of these new companies, the services they provide are indeed in great demand. Earlier this year, Scale AI closed $100 million in funding, bringing its valuation above the $1 billion mark. And last month, CloudFactory announced it has raised an additional $65 million in venture funding, bringing its total funding to $78 million.
Why the high level of investment? The human insight such annotation companies provide helps minimize labeling bias and yields training data that is more precise and more accurate. That, in turn, leads to more resilient and reliable AI systems.
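One common quality-control technique behind such claims is to have several annotators label each item, aggregate by majority vote, and flag low-agreement items for expert review. The sketch below is a generic illustration of that idea, assuming a simple agreement threshold; it is not any specific vendor's pipeline:

```python
from collections import Counter

def aggregate(labels, min_agreement=0.66):
    """Majority-vote a list of annotator labels for one item.

    Returns (winning_label, agreement_ratio, needs_review), where
    needs_review is True when agreement falls below the threshold.
    """
    counts = Counter(labels)
    winner, votes = counts.most_common(1)[0]
    agreement = votes / len(labels)
    return winner, agreement, agreement < min_agreement

# Four annotators labeled the same surveillance frame:
print(aggregate(["raccoon", "raccoon", "raccoon", "intruder"]))
# -> ('raccoon', 0.75, False): strong consensus, no review needed.

# A 50/50 split would fall below the threshold and be routed to review.
```

Measuring agreement this way is also how a managed service can quantify labeler accuracy over time, which a one-shot public crowdsourcing task typically cannot.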