A better way to identify overconfident language models | MIT News

Large language models (LLMs) can produce confident but incorrect answers, so researchers have developed uncertainty-quantification methods to assess how reliable a model's predictions are. One popular approach involves asking the model the same question multiple times and checking whether it produces the same response.
But this approach measures consistency, and even the most consistent LLM can be confidently wrong. Overconfidence can mislead users about the accuracy of a prediction, which can have harmful consequences in high-stakes settings such as health care or finance.
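The consistency check described above can be sketched in a few lines. This is a toy version: the hard-coded list of answers stands in for repeated queries to a real LLM at nonzero sampling temperature, and the function name is illustrative.

```python
from collections import Counter

def consistency_confidence(answers):
    """Score confidence as the fraction of sampled answers that agree
    with the most common answer.  A stand-in for consistency-based
    (aleatoric) uncertainty estimation."""
    counts = Counter(answers)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(answers)

# Five hypothetical samples from one model for the same question:
samples = ["Paris", "Paris", "Paris", "Lyon", "Paris"]
print(consistency_confidence(samples))  # 0.8
```

A high score here only means the model answers the same way each time; as the article notes, it says nothing about whether that repeated answer is actually correct.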
To address this shortcoming, MIT researchers have introduced a new method for measuring a different type of uncertainty that reliably identifies confident but incorrect LLM responses.
Their approach involves comparing a target model's response to responses from a group of similar LLMs. They found that measuring disagreement among these models captures this type of uncertainty more accurately than conventional methods.
They combined their approach with a standard measure of the model's self-consistency to create a comprehensive uncertainty metric, and tested it on 10 real-world tasks, such as question answering and mathematical reasoning. The combined metric outperformed other measures at identifying unreliable predictions.
“These models are deployed in many different ways because they are cost effective, but if your estimate of uncertainty depends only on the output of one model, there is really nothing to check it against. We went back to basics to understand the limitations of current methods, and used those as a starting point to design a principled method that can improve on them,” said Hamidieh, an electrical engineering and computer science (EECS) graduate student at MIT and lead author of a paper on this approach.
Hamidieh is joined on the paper by Veronika Thost, a research scientist at the MIT-IBM Watson AI Lab; Walter Gerych, a former MIT postdoc who is now an assistant professor at Worcester Polytechnic Institute; Mikhail Yurochkin, a staff research scientist at the MIT-IBM Watson AI Lab; and senior author Marzyeh Ghassemi, an associate professor in EECS and a member of the Institute for Medical Engineering and Science and the Laboratory for Information and Decision Systems.
Understanding overconfidence
Many popular methods of quantifying uncertainty involve asking the model to score its own confidence or checking the consistency of its responses to the same prompt. These methods estimate aleatoric uncertainty, which reflects how internally confident the model is in its predictions.
However, LLMs can be confident even when they are completely wrong. Research has shown that epistemic uncertainty, or uncertainty about whether one is using the correct model, can be a better way to assess true uncertainty when the model is overconfident.
The MIT researchers measured epistemic uncertainty by quantifying disagreement among a group of similar LLMs.
“If I ask ChatGPT the same question many times and it gives me the same answer over and over again, that does not mean that the answer is really correct. If I switch to Claude or Gemini and ask them the same question, and get a different answer, that will give me a feeling of epistemic uncertainty,” explained Hamidieh.
Epistemic uncertainty attempts to capture how far the target model differs from the ideal model for a task. But since it is impossible to construct that perfect model, researchers rely on approximations that often rest on faulty assumptions.
To improve uncertainty quantification, the MIT researchers needed a more accurate way to estimate epistemic uncertainty.
How to combine
The method they developed involves measuring disagreement between a target model and a set of models of similar size and architecture. They found that comparing semantic similarity, or how close the answers’ meanings are, provides a better measure of epistemic uncertainty.
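A minimal sketch of this kind of disagreement score, under loose assumptions: token overlap stands in for real semantic similarity (an actual implementation would compare embedding vectors of the answers), and the function names and example answers are hypothetical.

```python
def token_overlap(a: str, b: str) -> float:
    """Toy stand-in for semantic similarity: Jaccard overlap of word sets.
    A real system would use embedding-based similarity instead."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def epistemic_disagreement(target_answer: str, reference_answers: list) -> float:
    """Average dissimilarity between the target model's answer and answers
    from a set of comparable models; higher values suggest higher
    epistemic uncertainty."""
    sims = [token_overlap(target_answer, r) for r in reference_answers]
    return 1.0 - sum(sims) / len(sims)

# Hypothetical answers to the same question from different models:
target = "the capital of france is paris"
others = ["the capital of france is paris",
          "paris is the capital of france",
          "the capital of france is lyon"]
print(round(epistemic_disagreement(target, others), 2))  # 0.1
```

When the reference models mostly agree with the target, the score stays near zero; a lone dissenting model raises it only modestly, while broad disagreement would push it toward one.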
To achieve an accurate estimate, the researchers needed a set of LLMs that produced a variety of responses, did not differ too much from the target model, and could be weighted based on reliability.
“We found that the easiest way to satisfy all these properties is to take models trained by different companies. We tried many different complicated methods, but this very simple method ended up working better,” said Hamidieh.
Once they had developed this method of measuring epistemic uncertainty, they combined it with a more conventional method that measures aleatoric uncertainty. The resulting total uncertainty (TU) metric provided the most accurate indication of whether a model's confidence could be trusted.
“Total uncertainty depends on the uncertainty in the given data and on how close our model is to the correct model. That’s why summing these two uncertainty metrics gives us a very good estimate,” Hamidieh said.
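Following the summation described in the quote, the combined score and a reliability flag might look like the sketch below; the threshold value and function names are purely illustrative, not from the paper.

```python
def total_uncertainty(aleatoric: float, epistemic: float) -> float:
    """Combine the two uncertainty components by summation,
    as the article describes."""
    return aleatoric + epistemic

def flag_unreliable(aleatoric: float, epistemic: float,
                    threshold: float = 0.5) -> bool:
    """Flag a response whose total uncertainty exceeds a chosen threshold.
    The 0.5 default is a hypothetical value for illustration."""
    return total_uncertainty(aleatoric, epistemic) > threshold

# A response can look safe on aleatoric uncertainty alone, yet be
# flagged once the epistemic component is added:
print(flag_unreliable(0.1, 0.05))  # False
print(flag_unreliable(0.1, 0.6))   # True
```

This is exactly the failure mode the article highlights: a consistent (low aleatoric) but wrong answer only gets caught when cross-model disagreement contributes to the total.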
TU can effectively identify cases where an LLM is confidently wrong, since epistemic uncertainty can flag incorrect responses that aleatoric uncertainty would miss. It could also allow researchers to reinforce an LLM’s correct responses during training, which may improve performance.
They tested TU using multiple LLMs on 10 common tasks, such as question answering, summarization, translation, and mathematical reasoning. Their method identified unreliable predictions more effectively than either estimate alone.
Estimating epistemic uncertainty often requires fewer queries than estimating aleatoric uncertainty, which can reduce computational cost and save energy.
Their experiments also revealed that epistemic uncertainty works best in tasks with a single correct answer, such as answering factual questions, but may not work well in open-ended tasks.
In the future, the researchers may adapt their method to improve its performance on open-ended questions. They could also build on this work by examining other forms of aleatoric uncertainty.
This work was funded, in part, by the MIT-IBM Watson AI Lab.



