Charting a New Course of Neural Networks with Transformers


State-of-the-art machine learning and artificial intelligence (AI) systems have advanced significantly in recent years, alongside growing interest in and demand for the technology. The general hype around AI has fluctuated with media cycles and new product developments. The buzz of implementing AI for its own sake is wearing off as companies strive to demonstrate its positive impact on business, emphasizing AI’s ability to augment people rather than replace them.

Emerging now is the concept of transformer-based models. There is speculation surrounding whether transformers, which have gained considerable traction in natural language processing (NLP), will be positioned to “take over” AI, leaving many to wonder what this approach can achieve and how it could transform the pace and direction of technology.

Understanding Transformer-Based Models

A “transformer model” is a neural network architecture built from transformer layers. These layers can model long-range sequential dependencies and are well suited to modern computing hardware, which reduces the time needed to train models.

Until recently, when machine learning scientists and developers were tasked with modeling sequences, such as translating a series of English words into Spanish, recurrent neural network (RNN) layers were the “go-to” approach for encoding the input sequence and decoding the output sequence. Although RNNs have proven useful for identifying relationships between elements in sequences, some limitations confine their potential, such as how far back in the sequence they can remember. Thanks to research by Google and the University of Toronto, published in the paper “Attention Is All You Need,” the concept of self-attention has become a key ingredient in the transformer layers used in modern neural network models.

Self-attention involves several projections of the input data. The input is often a sequence of words (represented as numerical vectors), each of which is projected into three new vectors: a “key,” a “query,” and a “value.” For a given word, say in language processing, its “query” is compared with the “key” of every word in the sequence. Each comparison yields a score that tells the model how relevant that other word is to the present word for the task at hand. Repeating this operation for every word in the sequence produces a scoring heatmap of the relevance of different word combinations. This heatmap is then used to control how much of each word’s “value” passes through to the next layer of the network.
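To make the query, key, and value steps more concrete, here is a minimal NumPy sketch of a single self-attention head. It assumes the scaled dot-product form with a softmax from the “Attention Is All You Need” paper. The projection matrices are random stand-ins for learned weights, and the names (self_attention, w_q, w_k, w_v) and sizes are illustrative choices, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence x of shape (seq_len, d_model)."""
    q = x @ w_q  # one "query" vector per word
    k = x @ w_k  # one "key" vector per word
    v = x @ w_v  # one "value" vector per word
    # Compare every query with every key; scale by sqrt(key size) as in the paper.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Each row becomes a set of relevance weights that sum to 1 (the "heatmap").
    weights = softmax(scores, axis=-1)
    # Each output is a relevance-weighted mix of the value vectors.
    return weights @ v, weights

# Assumed toy sizes: 5 words, 8-dimensional inputs, 4-dimensional projections.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(d_model, d_head))
w_k = rng.normal(size=(d_model, d_head))
w_v = rng.normal(size=(d_model, d_head))

output, weights = self_attention(x, w_q, w_k, w_v)
print(weights.shape)  # one row of relevance weights per word
```

In a trained model the three projection matrices are learned, and several such heads typically run in parallel. This sketch leaves both of those out.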

Consider machine translation from German into English. In a future-tense German sentence using the form “ich werde” (“I will”), the main verb is sent to the end of the sentence, which does not occur in English. The self-attention mechanism within a transformer-based machine translation model is much better able to exploit the dependency between such far-apart pairs of words than RNN-based models are.

Encoder-Only and Encoder-Decoder Models

Transformer models are primarily used in two configurations: encoder-only models and encoder-decoder models.

An encoder-only model is best suited to cases where a developer wishes only to encode the data into a compact list of numbers, often referred to as embeddings, that can be fed into downstream models for tasks like detecting sentiment. Today, “Bidirectional Encoder Representations from Transformers” (BERT) is the most well-known encoder-based transformer model across the industry. BERT is trained on vast volumes of unlabeled text data, where the training objective is to predict words that have been masked out of the input sequence. Such encoder-based models can be combined with further layers for classification purposes.
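The masked-word training objective can be illustrated with a small sketch that prepares one training example. This is a simplification under stated assumptions: it hides whole words at random and records the originals as prediction targets. The function name mask_tokens and the default masking probability are illustrative, and BERT’s actual procedure works on sub-word tokens and has extra replacement rules that are omitted here.

```python
import random

MASK = "[MASK]"  # placeholder token standing in for a hidden word

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide some tokens at random and return (masked_tokens, targets).

    targets maps each masked position to the original word the model
    would be trained to predict.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets

sentence = "the film was far better than the reviews suggested".split()
masked_sentence, targets = mask_tokens(sentence, mask_prob=0.3)
print(masked_sentence)
print(targets)
```

Because the targets come from the text itself, no human labeling is needed, which is why such models can be trained on large volumes of unlabeled text.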

Alternatively, there are encoder-decoder models. These architectures are common in modern machine learning for specific use cases like machine translation, for example translating French into English. The French words are fed into the encoder, a sequence of transformer layers that encodes the text into a compact numerical form. The decoder part of the model then takes the output of the encoder and sends it through another sequence of transformer layers to generate the English text, word by word, until it predicts the end of the sentence.
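The word-by-word generation loop can be sketched as follows. Everything model-related here is a hypothetical stand-in: encode simply passes the source words through, and decode_step looks words up in a tiny hand-written lexicon, where a real system would run transformer layers. Only the control flow (encode once, then generate until an end-of-sentence token is predicted) reflects the description above.

```python
BOS = "<bos>"  # start-of-sentence marker
EOS = "<eos>"  # end-of-sentence marker

# Hypothetical toy lexicon standing in for a trained translation model.
TOY_LEXICON = {"le": "the", "chat": "cat", "dort": "sleeps"}

def encode(source_tokens):
    """Stand-in encoder. A real encoder would return numerical vectors."""
    return list(source_tokens)

def decode_step(memory, generated):
    """Stand-in decoder: predict the next word from the encoder output
    and the words generated so far."""
    position = len(generated) - 1  # generated always starts with BOS
    if position >= len(memory):
        return EOS
    return TOY_LEXICON.get(memory[position], "<unk>")

def translate(source_tokens, max_len=20):
    memory = encode(source_tokens)  # the encoder runs once
    generated = [BOS]
    while len(generated) <= max_len:  # guard against never predicting EOS
        next_token = decode_step(memory, generated)
        if next_token == EOS:
            break
        generated.append(next_token)
    return generated[1:]  # drop the BOS marker

translation = translate(["le", "chat", "dort"])
print(translation)
```

The max_len guard is a design choice in this sketch: a decoder is not guaranteed to predict the end of the sentence, so generation loops usually carry some upper bound.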

Will Transformer-Based Models Replace Modern Machine Learning?

Transformer-based models have begun to have a significant impact outside of NLP. The so-called Vision Transformer (ViT) has achieved impressive accuracy at image classification tasks by treating small patches of an image as elements of a sequence, like the words of a sentence. Transformers have also been extended to video understanding with approaches like TimeSformer, which uses self-attention to model relationships between patches within a frame and patches in other frames in the sequence. Transformer-based models are now even starting to drive improvements in accuracy in speech emotion recognition.
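To show what treating image patches as a sequence can look like in practice, here is a minimal NumPy sketch that cuts an image into non-overlapping square patches and flattens each into a vector. The function name image_to_patches and the 32x32 image with 8x8 patches are assumptions for illustration. The actual ViT also applies a learned projection and adds position information, both omitted here.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an image of shape (height, width, channels) into a sequence of
    flattened, non-overlapping square patches."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # Separate the patch-grid axes from the within-patch axes.
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    # Reorder to (grid row, grid column, row in patch, column in patch, channel).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One flattened vector per patch, in row-major patch order.
    return patches.reshape(rows * cols, patch_size * patch_size * channels)

# Assumed toy input: a 32x32 RGB image cut into 8x8 patches.
image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
patches = image_to_patches(image, patch_size=8)
print(patches.shape)  # (16, 192): 16 patches, each 8 * 8 * 3 values
```

The resulting rows play the same role as word vectors in a sentence, so the same self-attention machinery can be applied to them.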

Despite their potential, transformer-based models have shortcomings. The self-attention part of transformer-based models has computational complexity that is quadratic in the length of the input sequence, which makes processing long sequences extremely costly and makes the models poorly suited to real-time processing. Imagine the challenges this poses, for instance, with processing arbitrarily long telephony conversations.
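A quick back-of-the-envelope calculation shows what quadratic growth means. Self-attention scores every element of the sequence against every other, so a sequence of n elements needs n x n scores. The sketch below counts those scores and estimates the memory for one score matrix, assuming 4-byte (32-bit) floating point values. The helper names are illustrative, and the figures are per attention head and per layer.

```python
def attention_score_count(seq_len):
    """Number of pairwise relevance scores for a sequence of seq_len elements."""
    return seq_len * seq_len

def score_matrix_bytes(seq_len, bytes_per_value=4):
    """Memory for one score matrix, assuming 4-byte floats by default."""
    return attention_score_count(seq_len) * bytes_per_value

for seq_len in (1_000, 2_000, 4_000):
    scores = attention_score_count(seq_len)
    megabytes = score_matrix_bytes(seq_len) / 1_000_000
    print(f"{seq_len:>5} elements -> {scores:>12,} scores, {megabytes:,.0f} MB")
```

Doubling the sequence length quadruples the number of scores, which is why long inputs such as extended conversations become expensive so quickly.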

At the same time, concerns like computational load and suitability for real-time streaming processing are increasingly active areas of research within this now highly vibrant scientific community.

The field continues to be incredibly dynamic in developing new techniques and processes to overcome any shortcomings. Previously, for example, fully connected, deep feedforward networks could not effectively capture important features in images, which led to the widespread adoption of convolutional neural networks. With ongoing research, transformer functionality will continue to evolve as a key ingredient in modern machine learning models.

The Future of Transformer-Based Models

The interest in transformers is justified by widespread improvements in accuracy. These advances will likely be less impactful in enabling new commercial applications. Instead, they will help existing and emerging applications get better at what they are designed to achieve, increasing accuracy and adoption. As with many of today’s promising technologies, leaning into the strengths of each creates the best possible outcomes, similar to finding human-machine symbiosis in a workplace.

As we look ahead, it remains evident that organizations must not get lost in the hype of emerging technologies and overestimate their potential. The most promising path forward will be one where organizations find synergies across processes, technologies, and business goals to shape a cohesive and connected tech stack.

Dr. John Kane

About Dr. John Kane

Dr. John Kane is the Head of Signal Processing and Machine Learning at Cogito, the leader in AI Coaching Systems for the enterprise. He has over a decade of expertise in speech science and technology. At Cogito, he leads the research and development of machine learning algorithms to enable the real-time processing of audio, speech, and other behavioral signals. John is an active member of the speech research community, contributing as a reviewer for leading journals and conferences in the space and maintaining open source speech processing tools.
