Innovative tools and data annotation service companies help produce much higher quality data to train AI models in much shorter time frames.
Data labeling and annotation are critical in training the machine learning (ML) models and artificial intelligence (AI) algorithms used in continuous intelligence (CI) applications. Cloud Annotations, a new data annotation tool from IBM, gives businesses yet another option to help with this time-consuming yet vital task.
The tool, released on GitHub, is a fast, easy, and collaborative open-source image annotation tool for teams and individuals. It uses AI to automate data annotation, reducing many manual steps such as drawing outlines around objects.
Cloud Annotations supports uploading both photos and videos, though there are a few limitations to consider. IBM offers best practices to help businesses get the best results from the tool, including:
- Object Type: The model is optimized for photographs of objects in the real world. It is unlikely to work well for x-rays, hand drawings, scanned documents, receipts, etc.
- Object Environment: The training data should be as close as possible to the data on which predictions are to be made. For example, if your use case involves blurry and low-resolution images (such as from a security camera), your training data should be composed of blurry, low-resolution images. In general, you should also consider providing multiple angles, resolutions, and backgrounds for your training images.
- Difficulty: The model generally can’t predict labels that humans can’t assign. So, if a human can’t be trained to assign labels by looking at the image for 1-2 seconds, the model likely can’t be trained to do it either.
- Label Count: We recommend at least 50 labels per object category for a usable model, but using hundreds or thousands would provide better results.
- Image Dimensions: The model resizes the image to 300×300 pixels, so keep that in mind when training the model with images where one dimension is much longer than the other.
- Object Size: The object of interest’s size should be at least ~5% of the image area to be detected. For example, on the resized 300×300 pixel image, the object should cover ~60×60 pixels.
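The numeric guidance in this list (label count, image dimensions, object size) lends itself to a quick check before uploading a dataset. The following is a minimal Python sketch, not part of Cloud Annotations itself. The function names and the width/height box format are hypothetical, and the sketch assumes the resize squashes the image to a square without preserving aspect ratio. Only the thresholds come from the guidance.

```python
from collections import Counter

# Thresholds taken from the guidance above.
MIN_LABELS_PER_CATEGORY = 50   # at least 50 labels per object category
MIN_AREA_FRACTION = 0.05       # object should cover roughly 5% of the image
MODEL_INPUT_SIDE = 300         # images are resized to 300x300 pixels


def area_fraction(box_w, box_h, image_w, image_h):
    """Fraction of the image area covered by a bounding box.

    This fraction does not change when the image is resized, because
    the box is scaled by the same factors as the image.
    """
    return (box_w * box_h) / (image_w * image_h)


def resized_box(box_w, box_h, image_w, image_h, side=MODEL_INPUT_SIDE):
    """Box size in pixels after the image is squashed to side x side.

    Assumption: the resize ignores aspect ratio, so a very wide image
    compresses objects horizontally.
    """
    return (box_w * side / image_w, box_h * side / image_h)


def underrepresented(labels, minimum=MIN_LABELS_PER_CATEGORY):
    """Return {category: count} for categories with fewer than `minimum` labels."""
    counts = Counter(labels)
    return {name: n for name, n in counts.items() if n < minimum}


if __name__ == "__main__":
    # A 400x100 box in a wide 1200x300 image becomes square after resizing.
    print(resized_box(400, 100, 1200, 300))
    print(area_fraction(400, 100, 1200, 300) >= MIN_AREA_FRACTION)
    print(underrepresented(["cat"] * 60 + ["dog"] * 10))
```

Note that a 60×60 box on a 300×300 image covers 4% of the area, so the two figures in the guidance are best read as approximations rather than exact cutoffs.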
Expanding the Market
Cloud Annotations is the latest tool designed to help with data annotation for items used in ML and AI training. Other tools that offer help in this area include Intel’s Computer Vision Annotation Tool (CVAT) and Google’s Fluid Annotation.
The Computer Vision Annotation Tool (CVAT) is an open-source tool for annotating digital images and videos. The main function of the application is to provide users with convenient annotation instruments. For that purpose, Intel designed CVAT as a versatile service with many features.
CVAT is a browser-based application for both individuals and teams that supports different work scenarios. The main tasks of supervised machine learning can be divided into three groups:
- Object detection
- Image classification
- Image segmentation
CVAT lets users annotate data for each of these cases.
Fluid Annotation first runs an image through a pre-trained semantic segmentation model (Mask R-CNN). This generates around 1000 image segments with their class labels and confidence scores. The segments with the highest confidence scores are used to initialize the labeling, which is presented to the annotator. Afterward, the annotator can: (1) change the label of an existing segment, choosing from a shortlist generated by the machine; (2) add a segment to cover a missing object, scrolling through the most likely pre-generated segments identified by the machine and selecting the best one; (3) remove an existing segment; and (4) change the depth order of overlapping segments.
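The workflow described above can be illustrated with a small Python sketch. This is not Google's implementation. The `Segment` class, the function names, the `keep` parameter, and the convention that the last list item is front-most are all assumptions made for illustration. The sketch only shows the shape of the idea: rank machine-generated segments by confidence, start from the top ones, and let the annotator apply the four edit actions.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    segment_id: int
    label: str
    confidence: float


def initialize_labeling(segments, keep):
    """Split segments into an initial labeling and a pool of alternatives.

    The `keep` highest-confidence segments form the initial labeling;
    the rest remain available for the annotator to add later.
    """
    ranked = sorted(segments, key=lambda s: s.confidence, reverse=True)
    return ranked[:keep], ranked[keep:]


def _pop_by_id(items, segment_id):
    """Remove and return the segment with the given id."""
    for index, item in enumerate(items):
        if item.segment_id == segment_id:
            return items.pop(index)
    raise KeyError(segment_id)


def change_label(segment, new_label):
    """Action 1: relabel an existing segment."""
    segment.label = new_label


def add_segment(labeling, pool, segment_id):
    """Action 2: move a pre-generated segment from the pool into the labeling."""
    labeling.append(_pop_by_id(pool, segment_id))


def remove_segment(labeling, segment_id):
    """Action 3: drop a segment from the labeling and return it."""
    return _pop_by_id(labeling, segment_id)


def bring_to_front(labeling, segment_id):
    """Action 4: change depth order (last item is treated as front-most)."""
    labeling.append(_pop_by_id(labeling, segment_id))


if __name__ == "__main__":
    candidates = [
        Segment(1, "dog", 0.9),
        Segment(2, "cat", 0.4),
        Segment(3, "sky", 0.8),
    ]
    labeling, pool = initialize_labeling(candidates, keep=2)
    add_segment(labeling, pool, 2)
    bring_to_front(labeling, 1)
    print([s.segment_id for s in labeling])
```

Keeping the unselected segments in a pool mirrors the description: the annotator chooses among machine proposals instead of drawing outlines by hand.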
Meeting Market Demands
Businesses that want to build CI applications that use AI need high-quality data to train the AI models. That need has created a new market for data annotation tools and services. Complementing tools like IBM’s Cloud Annotations, a booming industry has emerged, composed of companies that specialize in speedy and highly accurate data annotation services. Some of the companies in this market deliver domain-specific labeled data.
The companies that provide such services offer greater value than a public crowdsourcing service might. This new breed of companies uses highly trained data labelers, and many develop their own advanced annotation tools. Many of those tools are AI-based and work on their own or in tandem with a human operator.
One example of such a company is Samasource, which uses a secured cloud annotation platform, SamaHub, to manage the annotation lifecycle. This includes image upload, annotation, data sampling and QA, data delivery, and overall collaboration.
Taken together, these innovative tools and data annotation service companies help produce much higher quality data to train AI models in much shorter time frames. Faster access to high-quality annotated data can only help businesses build more resilient and reliable AI and CI systems and applications.