Study: Can platforms that rank the latest LLMs be trusted? | MIT News

A company that wants to use a large language model (LLM) to summarize sales reports or triage customer inquiries can choose from hundreds of models, each with slightly different capabilities.

To narrow down the choices, companies often turn to LLM ranking platforms, which aggregate crowdsourced user feedback on model interactions to rank the latest LLMs by how well they perform at specific tasks.

But MIT researchers found that a small number of user interactions can skew the results, potentially leading someone to mistakenly conclude that a particular LLM is the right choice for their use case. Their research shows that removing a tiny fraction of a crowdsourced data set can change which models rank best.

They developed a fast way to audit ranking platforms and determine whether they suffer from this problem. The method identifies the individual votes most likely to skew the results, so users can inspect these highly influential votes.

The researchers say the work underscores the need for more robust techniques for evaluating model quality. Although mitigation was not the focus of this study, they offer suggestions that could improve the robustness of these platforms, such as collecting more detailed feedback from which to build the rankings.

The study also offers a word of caution to users who rely on such rankings when making decisions about LLMs, decisions that can have far-reaching and costly implications for a business or organization.

“We were surprised that these leaderboards are so sensitive to this problem. If the top-ranked LLM ends up depending on only two or three user votes out of tens of thousands, then one cannot assume that the top-ranked LLM will consistently outperform the other LLMs in practice,” says Tamara Broderick, an associate professor in the Department of Electrical Engineering and Computer Science (EECS); a member of the Laboratory for Information and Decision Systems (LIDS), the Institute for Data, Systems, and Society, and the Computer Science and Artificial Intelligence Laboratory (CSAIL); and the senior author of the study.

She is joined on the paper by co-lead authors Jenny Huang and Yunyi Shen, both EECS graduate students, and by Dennis Wei, a senior research scientist at IBM Research. The research will be presented at the International Conference on Learning Representations.

Dropping data

Although there are many kinds of LLM ranking platforms, the most popular variant asks users to submit a prompt to two models and choose which LLM gives the better response.

The platforms aggregate the results of these pairwise comparisons into rankings that show which LLMs perform best at certain tasks, such as coding or vision.
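To make that aggregation step concrete, here is a minimal sketch that turns pairwise votes into a ranking by fitting a Bradley-Terry model, a common choice for crowdsourced arenas. The model names and votes are hypothetical, and the article does not specify exactly how each platform computes its rankings, so treat this as an illustrative stand-in rather than any platform’s actual pipeline.

```python
import numpy as np

def fit_bradley_terry(votes, models, n_iter=200, lr=0.1):
    """Fit Bradley-Terry scores to pairwise votes by gradient ascent.

    votes: list of (winner, loser) model-name pairs.
    Returns a dict mapping each model name to a score (higher is better).
    """
    idx = {m: i for i, m in enumerate(models)}
    theta = np.zeros(len(models))  # one latent skill score per model
    for _ in range(n_iter):
        grad = np.zeros_like(theta)
        for winner, loser in votes:
            w, l = idx[winner], idx[loser]
            p = 1.0 / (1.0 + np.exp(theta[l] - theta[w]))  # P(winner beats loser)
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        theta += lr * grad / len(votes)
        theta -= theta.mean()  # scores are identified only up to an additive constant
    return dict(zip(models, theta))

# Hypothetical crowdsourced votes: (model the user preferred, model it beat).
votes = [
    ("model-A", "model-B"),
    ("model-A", "model-C"),
    ("model-A", "model-B"),
    ("model-B", "model-C"),
    ("model-B", "model-C"),
    ("model-C", "model-B"),
]
scores = fit_bradley_terry(votes, ["model-A", "model-B", "model-C"])
print(sorted(scores, key=scores.get, reverse=True))  # ['model-A', 'model-B', 'model-C']
```

In a formulation like this, each model gets a single latent score, so a handful of decisive votes can be enough to swap which score ends up highest, which is exactly the kind of sensitivity the researchers probe.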

By choosing the top-ranked LLM, a user is implicitly expecting that ranking to generalize, meaning it should carry over to their own similar, but not identical, application and to new data.

The MIT researchers had previously studied the effects of data dropping in other fields, such as economics. That work revealed cases where removing a small fraction of the data could change a study’s results, indicating that those conclusions might not hold beyond the sample studied.

The researchers wanted to see whether the same analysis could be applied to LLM ranking platforms.

“At the end of the day, the user wants to know that they are choosing the best LLM. If only a few votes are driving that outcome, it suggests the ranking may not be the be-all and end-all,” Broderick says.

But checking this kind of data-dropping sensitivity by brute force is infeasible. For example, one ranking they tested had more than 57,000 votes. Exhaustively testing a 0.1 percent drop would mean removing every possible subset of 57 of the 57,000 votes (there are more than 10^194 such subsets) and recomputing the ranking each time.
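The subset count quoted above can be sanity-checked with a few lines; this is just arithmetic on the figures in the article, not part of the researchers’ method.

```python
from math import lgamma, log

def log10_binomial(n, k):
    """log10 of the binomial coefficient C(n, k), via the log-gamma function."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(10)

# Number of distinct ways to drop 0.1 percent (57 votes) of 57,000 votes.
print(log10_binomial(57_000, 57))  # roughly 194.5, i.e. more than 10^194 subsets
```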

Instead, the researchers developed an efficient approximation method, building on their previous work and adapting it to LLM ranking platforms.

“Although we have theory showing that the approximation works under certain assumptions, the user doesn’t have to take that on trust. Our method flags the potentially problematic data points at the end, so they can just drop those data points, run the analysis again, and check whether the ranking changes,” she says.
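As an illustration of that drop-and-refit check, the sketch below flags votes using a crude per-vote influence proxy on the Bradley-Terry fit from the earlier snippet, drops them, refits, and reports whether the leader changes. It reuses `fit_bradley_terry` and the hypothetical `votes` defined above and is an assumption-laden stand-in, not the researchers’ approximation algorithm.

```python
import numpy as np

def check_top_rank_stability(votes, models, n_drop=3):
    """Flag the votes that most widen the fitted gap between the top two models,
    drop them, refit, and report whether the leader changes.

    Relies on fit_bradley_terry from the earlier sketch. The influence score
    below is a simple per-vote gradient proxy, not the study's method.
    """
    scores = fit_bradley_terry(votes, models)
    ranked = sorted(scores, key=scores.get, reverse=True)
    leader, runner = ranked[0], ranked[1]

    def influence(vote):
        winner, loser = vote
        p_win = 1.0 / (1.0 + np.exp(scores[loser] - scores[winner]))
        push = 0.0  # how much this vote pushes (score[leader] - score[runner]) upward
        if winner == leader: push += 1.0 - p_win
        if loser == leader:  push -= 1.0 - p_win
        if winner == runner: push -= 1.0 - p_win
        if loser == runner:  push += 1.0 - p_win
        return push

    # Drop the n_drop votes that most favor the current leader over the runner-up.
    flagged = sorted(range(len(votes)), key=lambda i: influence(votes[i]), reverse=True)[:n_drop]
    flagged_set = set(flagged)
    remaining = [v for i, v in enumerate(votes) if i not in flagged_set]
    new_scores = fit_bradley_terry(remaining, models)
    new_leader = max(new_scores, key=new_scores.get)
    return new_leader == leader, flagged

# Hypothetical check on the votes from the previous sketch:
stable, flagged = check_top_rank_stability(votes, ["model-A", "model-B", "model-C"], n_drop=2)
print(stable, flagged)
```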

Surprisingly sensitive

When the researchers applied their method to popular crowdsourced ranking platforms, they were surprised by how few data points they needed to drop to change which LLM came out on top. In one case, removing just two votes out of more than 57,000, about 0.0035 percent, changed which model ranked best.

A different ranking platform, which uses expert annotators and higher-quality data, was more robust: there, it took dropping 83 of 2,575 evaluations (about 3 percent) to change the top-ranked models.

Their analysis also revealed that many of the influential votes appear to be the result of user error. In some cases, one LLM seemed to give a clearly better answer, but the user voted for the other model instead, Broderick says.

“We can’t know what was going through the user’s mind at the time, but maybe they clicked the wrong button, weren’t paying attention, or honestly didn’t know which answer was better. The key point is that you don’t want noise, user error, or other artifacts to determine which LLM ranks higher,” she adds.

The researchers suggest that collecting richer feedback from users, such as a confidence level for each vote, could help mitigate this problem. Ranking platforms could also use human moderators to vet crowdsourced responses.

For their part, the researchers want to continue studying generalization in other settings while developing better approximation methods that can capture more forms of instability.

“The work of Broderick and her students shows how one can obtain valid estimates of the influence of specific data points on downstream conclusions, even though exact calculations are out of reach given the size of modern machine-learning models and data sets,” says Jessica Hullman, the Ginni Rometty Professor of Computer Science at Northwestern University, who was not involved in the work. “The recent work offers a glimpse into how strongly data-dependent the frequently used, but also quite fragile, methods for eliciting human preferences and using them to update a model can be. Seeing how few preferences it can take to change the behavior of a tuned model may inspire more thoughtful ways of collecting this data.”

This research was funded, in part, by the Office of Naval Research, the MIT-IBM Watson AI Lab, the National Science Foundation, Amazon, and a CSAIL seed award.
