The State of AI: Better Computer Vision, Faster NLP


AI has become part of real-life scenarios, including managing national electric grids, supermarket warehousing optimization, drug discovery, and healthcare.

Artificial intelligence (AI) is making strides in image recognition and natural language processing. In addition, AI startups have attracted record funding this year, and there have been IPOs for data infrastructure and cybersecurity companies that help enterprises retool for the AI-first era.

These are some of the observations in Nathan Benaich and Ian Hogarth’s fourth annual, densely packed “State of AI” report, which reviews developments in the field over the past year. They report that AI has become part of real-life scenarios, including managing national electric grids, automated supermarket warehousing optimization, drug discovery, and healthcare.

The report also tracked the following developments:

Self-supervision is taking over computer vision: The report’s authors point to Facebook AI’s introduction of SEER, a self-supervised model pre-trained on a billion Instagram images that achieves 84.2% accuracy on ImageNet, comfortably surpassing all existing self-supervised models. SEER is also “a good few-shot learner,” they relate, noting that “it still achieves 77.9% accuracy on ImageNet when trained with only 10% of the dataset. It also outperforms supervised methods on other tasks like object detection and segmentation.”

Transformers extend into efficient self-attention-based architectures. In addition, Benaich and Hogarth document the rise of “transformers,” neural network-based deep-learning architectures built around self-attention, as a key part of AI. Transformers have emerged as a general-purpose architecture for machine learning, increasingly applied to natural language processing (NLP) and computer vision. “DeepMind’s Perceiver is one such architecture,” they observe.
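
For readers unfamiliar with self-attention, the following is a minimal sketch of the core operation, assuming the simplest possible setup: each token's vector serves as its own query, key, and value, with no learned projections. It is a generic illustration, not the Perceiver's design or any code from the report, and the function names are hypothetical.

```python
import math


def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention over a list of equal-length vectors.

    Assumption for illustration: queries, keys, and values are the raw
    token vectors themselves (real transformers learn separate projections).
    """
    d = len(tokens[0])
    scale = math.sqrt(d)
    output = []
    for q in tokens:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in tokens]
        weights = softmax(scores)
        # Each output vector is a weighted average of all token vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)]
        output.append(mixed)
    return output


if __name__ == "__main__":
    for row in self_attention([[1.0, 0.0], [0.0, 1.0]]):
        print(row)
```

Because every token is compared with every other token, the cost grows with the square of the input length, which is the motivation for the more efficient attention-based variants mentioned above.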

“Textless” natural language processing emerges. Textless NLP is based on Generative Spoken Language Modeling (GSLM), which enables the “task of learning speech representations directly from raw audio without any labels or text.”

Less is more: watching a few clips is enough to learn how to caption a video. To solve video-and-language (V&L) tasks like video captioning, a new program called ClipBERT “only uses a few sparsely sampled short clips,” according to Benaich and Hogarth. “It still outperforms existing methods that exploit full-length videos.” At the same time, they note, “a natural improvement of this process would be end-to-end learning of vision and text encoders. But due to the length of the video clips, this is usually computationally unaffordable.”
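
To make the idea of sparse sampling concrete, here is a minimal sketch of picking a few short clips from a long video. It assumes one randomly placed clip per equal-sized segment of the video; that strategy, the function name `sample_sparse_clips`, and its parameters are illustrative assumptions, not ClipBERT's actual procedure.

```python
import random


def sample_sparse_clips(num_frames, num_clips=2, clip_len=16, seed=None):
    """Return (start, end) frame ranges for a few short clips.

    Hypothetical strategy: split the video into `num_clips` equal segments
    and place one clip of `clip_len` frames at a random position in each.
    """
    if num_frames < num_clips * clip_len:
        raise ValueError("video too short for the requested clips")
    rng = random.Random(seed)
    segment = num_frames // num_clips
    clips = []
    for i in range(num_clips):
        seg_start = i * segment
        latest_start = seg_start + segment - clip_len
        start = rng.randint(seg_start, latest_start)  # inclusive bounds
        clips.append((start, start + clip_len))  # end index is exclusive
    return clips


if __name__ == "__main__":
    print(sample_sparse_clips(num_frames=9000, num_clips=4, clip_len=16, seed=0))
```

With these example numbers, a 9,000-frame video is represented by 4 clips of 16 frames each, or 64 frames in total, which is under 1% of the footage a full-length method would process.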



About Joe McKendrick

Joe McKendrick is RTInsights Industry Editor and an industry analyst focusing on artificial intelligence, digital, cloud, and Big Data topics. His work also appears in Forbes and Harvard Business Review. Over the last three years, he served as co-chair for the AI Summit in New York, as well as on the organizing committee for IEEE's International Conferences on Edge Computing. Follow him on Twitter @joemckendrick.
