Using Smaller ML Models To Train Large Language Models

A new research project at MIT has developed a way to use smaller, previously trained machine learning models to help train large language models.

Large language models, which OpenAI, Google, and others have used to build chatbots such as Bard, BlenderBot, and ChatGPT, are enormous endeavors. They contain billions of parameters and are trained on vast datasets that take a lot of time to source, arrange, and feed into the model.

Often, at the start of a new project, developers will begin the process anew, sourcing billions of pieces of data to feed into a new large language model (LLM). This is time-consuming and drives up the cost of developing a model, while also harming the environment, since computers must run for weeks or months to train it.

Researchers at MIT have developed a way for creators of these LLMs to integrate old models into new development, through a method called Linear Growth Operator (LiGO).

This method uses a smaller model, which may have only millions of parameters, to give a much larger language model a head start. It encodes the knowledge the smaller model learned during its own training and passes it on to the LLM, which can reduce computational cost by up to 50 percent.

“It’s been estimated that training models at the scale of what ChatGPT is hypothesized to run on could take millions of dollars, just for a single training run. Can we improve the efficiency of these training methods, so we can still get good models in less time and for less money? We propose to do this by leveraging smaller language models that have previously been trained,” said Yoon Kim, assistant professor in MIT’s Department of Electrical Engineering and Computer Science and co-author of the paper.

The LiGO method can be used by developers working on both vision and language models, often improving performance as well as lowering computational costs. It expands the width and depth of the smaller model by learning a linear map, an operation that transforms a set of input values (here, the smaller model’s parameters) into a set of output values (the larger model’s parameters).
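To make the idea of growing a model's width and depth with a linear map more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the researchers' implementation: the function names (`grow_width`, `grow_depth`, `identity_like`) are hypothetical, the layers are plain square weight matrices rather than full transformer blocks, and the expansion matrices are fixed by hand, whereas a growth operator of this kind would learn them.

```python
import numpy as np

def grow_width(w_small, expand_out, expand_in):
    """Map a (d_out, d_in) weight matrix to (D_out, D_in) using two linear maps."""
    return expand_out @ w_small @ expand_in.T

def grow_depth(layers, mix):
    """Build each new layer as a weighted sum of the small model's layers.

    mix has shape (num_large_layers, num_small_layers).
    """
    stacked = np.stack(layers)  # (num_small_layers, d_out, d_in)
    return list(np.einsum("ij,jab->iab", mix, stacked))

def identity_like(big, small):
    """Hand-picked expansion matrix (an assumption): copy the small
    dimensions and pad the remaining rows with zeros."""
    m = np.zeros((big, small))
    m[:small, :small] = np.eye(small)
    return m

rng = np.random.default_rng(0)

# Hypothetical "small model": two layers, each a 4x4 weight matrix.
small_layers = [rng.normal(size=(4, 4)) for _ in range(2)]

# Depth: grow 2 layers into 3; the third is an average of the first two.
mix = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.5, 0.5]])
deep_layers = grow_depth(small_layers, mix)

# Width: grow every 4x4 matrix into a 6x6 matrix.
expand = identity_like(6, 4)
large_layers = [grow_width(w, expand, expand) for w in deep_layers]

print(len(large_layers), large_layers[0].shape)
```

With these hand-picked matrices, the top-left 4x4 block of each grown layer is the corresponding small-model matrix (or a blend of them), so the larger model starts from the smaller model's knowledge instead of from random values.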

LLMs have continued to increase in size over the past half decade. Google’s BERT, one of the first notable LLMs to use the transformer architecture, had 340 million parameters in 2018. By 2020, OpenAI’s GPT-3 had 175 billion parameters, and Google has since trained GLaM with 1.2 trillion. OpenAI’s GPT-4 is estimated to have over 1.5 trillion parameters, although OpenAI has not confirmed that figure.

Finding ways to train these LLMs more efficiently is imperative, especially for developers who do not have the resources or capacity to compete with OpenAI (backed by Microsoft) and Google.

Speaking on the subject of ever-increasing resource needs for LLMs, Kim said: “This has led to an arms race of companies trying to train larger and larger transformers on larger and larger datasets. More so than other architectures, it seems that transformer networks get much better with scaling. We’re just not exactly sure why this is the case.”

About David Curry
