A new approach can make LLM training more efficient | MIT News

Reasoning large language models (LLMs) are designed to solve complex problems by breaking them down into a series of smaller steps. These powerful models excel at challenging tasks like advanced coding and multistep planning.
But developing reasoning models takes an enormous amount of computation and energy, partly because of inefficiencies in the training process. While a few hard-working processors grind away on complex queries, others in the cluster sit idle.
Researchers from MIT and elsewhere have found a way to harness this idle time to significantly speed up the training of a reasoning model.
Their new method uses the downtime to train a small, fast model that drafts predictions of the large reasoning LLM's outputs, which the large model then verifies. This reduces the amount of work the reasoning model has to do itself, speeding up the training process.
The key to this scheme is its ability to train and deploy the small model dynamically, so it kicks in only when certain processors would otherwise be idle. By using computing resources that would otherwise go to waste, the method speeds up training without adding cost.
When tested on multiple reasoning LLMs, the method roughly doubled training speed while maintaining each model's accuracy. This could cut the cost and improve the energy efficiency of developing advanced LLMs for applications such as predicting financial trends or detecting hazards in power grids.
“People are chasing models that can handle complex tasks. But if that is the goal of model development, we also need to prioritize efficiency. We found a lossless solution to this problem and built a full-stack system that delivers a remarkable speedup,” says Qinghao Hu, an MIT postdoc and co-lead author of a paper on the method.
He is joined on the paper by co-lead author Shang Yang, a graduate student in electrical engineering and computer science (EECS); Junxian Guo, an EECS graduate student; senior author Song Han, an associate professor in EECS, a member of the Research Laboratory of Electronics, and a distinguished scientist at NVIDIA; as well as others at NVIDIA, ETH Zurich, the MIT-IBM Watson AI Lab, and the University of Massachusetts at Amherst. The research will be presented at the ACM International Conference on Architectural Support for Programming Languages and Operating Systems.
Training bottleneck
Engineers want reasoning LLMs to identify and correct errors in their chain of thought. This ability lets them tackle complex questions that would stump a regular LLM.
To teach them this skill, developers train reasoning LLMs using a technique called reinforcement learning (RL). The model generates multiple possible answers to a question, receives a reward for the best candidate, and is updated based on that answer. These steps are repeated thousands of times as the model learns.
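That loop can be sketched in a few lines of Python. Everything here is an illustrative stand-in, not the paper's actual training code: the generator, the toy reward, and the function names are all hypothetical.

```python
def generate_answers(prompt, n):
    """Stand-in for the reasoning model sampling n candidate answers
    to the same question (the rollout phase of RL training)."""
    return [f"{prompt}-answer-{i}" for i in range(n)]

def reward(answer):
    """Toy stand-in for a reward signal, e.g. a checker that scores answers.
    Here: a deterministic pseudo-score derived from the text itself."""
    return sum(map(ord, answer)) % 97

def rl_step(prompt, n=4):
    """One simplified RL iteration: generate n answers, reward the best,
    and return it as the target the model would be updated toward."""
    candidates = generate_answers(prompt, n)
    best = max(candidates, key=reward)
    # In real training, a gradient update would nudge the model toward
    # `best`; this whole step then repeats thousands of times.
    return best
```

In practice the generation step dominates, which is exactly the inefficiency the researchers set out to attack.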
But the researchers found that the process of generating multiple responses, known as rollout, can consume about 85 percent of the processing time required for RL training.
“Updating the model — which is the real ‘training’ part — takes relatively little time,” Hu said.
This bottleneck occurs because standard RL algorithms require all processors in the training cluster to complete their answers before moving on to the next step. Since some processors may be working on very long responses, others that generate short responses are left waiting for them to finish.
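A back-of-the-envelope calculation shows how much worker time this synchronization can waste. The response times below are made up for illustration:

```python
def idle_fraction(response_times):
    """With synchronous rollout, every worker is held until the slowest
    one finishes. Returns the fraction of total worker-time spent idle."""
    slowest = max(response_times)
    busy = sum(response_times)
    total = slowest * len(response_times)  # each worker waits until `slowest`
    return 1 - busy / total

# Four workers: one long reasoning trace, three short ones.
# Here, more than two-thirds of the cluster's time is spent waiting.
print(idle_fraction([100, 10, 12, 8]))
```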
“Our goal was to turn this idle time into a speedup, without adding expense,” Hu adds.
They looked to an existing technique, called speculative decoding, to speed things up. Speculative decoding involves training a small model, called a drafter, to quickly predict the future outputs of a larger model.
The large model verifies the drafter's predictions, accepting the tokens it agrees with and correcting the first one it does not.
Because the large model can verify all of the drafter's guesses at once, rather than generating each output one at a time, this speeds up the process.
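In its simplest greedy form, one draft-and-verify round can be sketched like this. This is a textbook-style simplification of speculative decoding, not the paper's implementation; `draft_next` and `target_next` are hypothetical single-token predictors.

```python
def speculative_step(draft_next, target_next, context, k=4):
    """One round of greedy speculative decoding: the small drafter proposes
    k tokens, the large model checks them, and we keep the longest agreeing
    prefix plus one token from the large (target) model."""
    # The drafter cheaply guesses k tokens ahead.
    proposed = []
    ctx = list(context)
    for _ in range(k):
        token = draft_next(ctx)
        proposed.append(token)
        ctx.append(token)
    # The target checks the guesses (in practice, all in one batched pass).
    accepted = []
    ctx = list(context)
    for token in proposed:
        if target_next(ctx) == token:
            accepted.append(token)
            ctx.append(token)
        else:
            break
    # On the first mismatch (or after k accepts), the target emits its own token,
    # so every round makes progress even if the drafter is wrong.
    accepted.append(target_next(ctx))
    return accepted
```

When the drafter is accurate, each expensive call to the large model yields several tokens instead of one.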
A flexible solution
But in speculative decoding, the draft model is usually trained only once and then remains static. That makes the approach unworkable in reinforcement learning, where the reasoning model is updated thousands of times during training.
A fixed drafter would quickly become stale and useless after just a few steps.
To overcome this problem, the researchers created a new system, called TLT, for taming this long tail of slow responses.
The first component of TLT is an adaptive drafter trainer, which uses the free time on idle processors to train the drafter model on the fly, keeping it aligned with the target model without consuming additional computational resources.
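The scheduling idea can be caricatured in a few lines. This sketch assumes, purely for illustration, that each drafter-training step takes one time unit; the real system's scheduler is far more involved.

```python
def free_training_steps(finish_times):
    """Toy scheduler: once a worker finishes its rollouts, fill its remaining
    time (until the slowest worker is done) with drafter-training steps.
    Assumes one time unit per training step; illustrative only."""
    deadline = max(finish_times)
    # Whole training steps each early-finishing worker can squeeze in
    # before the synchronization barrier would have released it anyway.
    return [int(deadline - t) for t in finish_times]

# The straggler (100) gets no drafter work; the fast workers get plenty.
print(free_training_steps([100, 10, 12, 8]))  # → [0, 90, 88, 92]
```

The point of the sketch: the drafter's training budget comes entirely out of time that would otherwise be wasted at the barrier.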
The second component, an adaptive rollout engine, manages speculative decoding, automatically selecting the optimal strategy for each new batch of inputs. The engine adjusts the speculative decoding configuration based on characteristics of the training workload, such as how many tokens the draft model proposes and how many the target model accepts during verification.
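One knob such an engine must tune is how many tokens the drafter should propose per round. The sketch below uses the standard textbook analysis under a simple independent-acceptance cost model; the actual engine's policy is assumed to be more sophisticated, and the cost numbers are hypothetical.

```python
def expected_tokens(p, k):
    """Expected tokens emitted per verification round when each drafted
    token is accepted independently with probability p: the accepted
    prefix plus one guaranteed token from the target model."""
    return sum(p ** i for i in range(k + 1))  # 1 + p + p^2 + ... + p^k

def best_draft_length(p, draft_cost, verify_cost, max_k=16):
    """Pick the draft length k maximizing tokens per unit time under a
    simple cost model: k cheap drafter steps plus one target verification."""
    def throughput(k):
        return expected_tokens(p, k) / (k * draft_cost + verify_cost)
    return max(range(1, max_k + 1), key=throughput)
```

The qualitative behavior matches intuition: when the drafter agrees with the target often, longer drafts pay off; when it is frequently wrong, short drafts waste less verification work.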
In addition, the researchers designed the draft model to be lightweight so it can be trained quickly. TLT also reuses some parts of the reasoning model's training pipeline to train the drafter, yielding additional speed gains.
“As soon as other processors finish their short queries and have nothing to do, we quickly switch them to training the draft model, using the same data from the rollout process. Dynamic speculative decoding is our key technique; these advantages would not exist without it,” Hu says.
They tested TLT across multiple reasoning LLMs trained on real-world datasets. The system accelerated training by between 70 and 210 percent while maintaining each model's accuracy.
As an added bonus, the small draft model comes out of the process ready to be deployed for efficient inference, essentially for free.
In the future, the researchers want to integrate TLT into additional training and fine-tuning frameworks, and to identify other reinforcement learning applications that could be accelerated with this method.
“As reasoning continues to be a major workload driving the demand for computing, Qinghao's TLT is an excellent approach to addressing the computational bottleneck of training these reasoning models. I think this method will be very useful in the pursuit of sustainable AI computing,” Han says.
This work was funded by the MIT-IBM Watson AI Lab, the MIT AI Hardware Program, the MIT Amazon Science Hub, Hyundai Motor Company, and the National Science Foundation.



