Revealing the biases, feelings, personalities, and abstract concepts hidden in large language models

By now, ChatGPT, Claude, and other major large language models have absorbed so much human-written text that they are far more than simple response generators; they can also express abstract concepts, such as particular tones, personalities, biases, and emotions. It has not been clear, however, how these models come to represent such abstract concepts from the information they contain.
Now a team from MIT and the University of California San Diego has developed a method to test whether a large language model (LLM) contains hidden biases, personalities, emotions, or other abstract concepts. Their approach can tap into the connections within the model that encode a concept of interest. What’s more, the method can use, or “steer,” those connections to strengthen or weaken the concept in whatever response the model is asked to provide.
The team demonstrated that their method can quickly identify and steer more than 500 common concepts in some of the largest LLMs in use today. For example, the researchers could pinpoint the models’ representations of personas such as “social activist” and “conspiracy theorist,” as well as concepts such as “fear of marriage” and “Boston fan.” They could then tune these representations to amplify or suppress those concepts in any response the model produces.
In the case of the “conspiracy theorist” concept, the team identified a representation of this concept in one of the major LLMs available today. When they amplified that representation and then asked the model to explain the origin of the famous “Blue Marble” image of Earth taken by Apollo 17, the model produced a response with the tone and attitude of a conspiracy theorist.
The team acknowledges that there are risks in steering certain concepts, which they also illustrate (and warn against). Overall, however, they see the new approach as a way to shed light on hidden concepts and potential vulnerabilities in LLMs, which can then be dialed up or down to improve a model’s safety or its performance.
“What this really shows about LLMs is that they have these concepts in them, but they’re not all expressed,” said Adityanarayanan “Adit” Radhakrishnan, an assistant professor of mathematics at MIT. “With our approach, there are ways to take these different concepts and amplify them so the model provides answers it otherwise would not.”
The team published their findings today in the journal Science. Co-authors of the study include Radhakrishnan, Daniel Beaglehole and Mikhail Belkin of UC San Diego, and Enric Boix-Adserà of the University of Pennsylvania.
Fishing in a black box
As the use of OpenAI’s ChatGPT, Google’s Gemini, Anthropic’s Claude, and other artificial intelligence assistants has exploded, scientists are racing to understand how the models represent abstract concepts such as “hallucination” and “delusion.” In the context of an LLM, a hallucination is a response that is false or misleading, which the model has essentially made up and presented as true.
To find out whether a concept like “hallucination” is encoded in an LLM, scientists often take an “unsupervised learning” approach, a type of machine learning in which algorithms comb through unlabeled representations to find patterns that might relate to a concept like “hallucination.” But for Radhakrishnan, such an approach is too broad and computationally expensive.
“It’s like going fishing with a big net, trying to catch one type of fish,” he says. “Instead, we go in with bait for the right type of fish.”
He and his colleagues instead developed a more targeted approach based on a type of predictive modeling algorithm known as the recursive feature machine (RFM). The RFM is designed to directly identify features, or patterns, within data, using a statistical method that neural networks (the broad class of AI models that includes LLMs) implicitly use to learn features.
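For readers who want a concrete picture, here is a minimal sketch of the RFM recipe as published: alternate kernel ridge regression with an update of a feature matrix computed from the average gradient outer product (AGOP) of the fitted predictor. The kernel choice, bandwidth, regularization, and rescaling below are illustrative assumptions, not the authors’ exact configuration.

```python
import numpy as np

def mahalanobis_dists(X, Z, M):
    """Pairwise distances ||x - z||_M = sqrt((x - z)^T M (x - z))."""
    sq = ((X @ M) * X).sum(1)[:, None] + ((Z @ M) * Z).sum(1)[None, :] - 2 * X @ M @ Z.T
    return np.sqrt(np.maximum(sq, 0.0))

def rfm(X, y, iters=5, bandwidth=10.0, reg=1e-3):
    """Recursive feature machine: alternate kernel ridge regression with an
    AGOP (average gradient outer product) update of the feature matrix M."""
    n, d = X.shape
    M = np.eye(d)
    for _ in range(iters):
        dists = mahalanobis_dists(X, X, M)
        K = np.exp(-dists / bandwidth)                    # Laplace kernel with metric M
        alpha = np.linalg.solve(K + reg * np.eye(n), y)   # fit kernel ridge regression
        # Gradient of the fitted predictor at each training point:
        # grad f(x_i) = -(1/bandwidth) * sum_j alpha_j K_ij M (x_i - x_j) / ||x_i - x_j||_M
        np.fill_diagonal(dists, np.inf)                   # skip the non-differentiable i == j term
        W = (K * alpha[None, :]) / dists
        G = -(1.0 / bandwidth) * (W.sum(1)[:, None] * X - W @ X) @ M
        M = G.T @ G / n                                   # AGOP update
        M *= d / np.trace(M)                              # rescale (a stabilization choice, not prescribed)
    return M, alpha

# Toy check: the target depends only on the first coordinate, so the learned
# feature matrix should put most of its weight there.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.sign(X[:, 0])
M, _ = rfm(X, y)
print(np.round(np.diag(M) / np.diag(M).max(), 2))
```

In the toy example, the target depends only on the first input coordinate, so the learned feature matrix concentrates its weight there; that concentration is what “directly identifying features” looks like in practice.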
Because the algorithm was an effective, efficient way to capture features in general, the team wondered whether it could also be used to draw out conceptual representations in LLMs, which are among the most widely used and perhaps least understood types of neural networks.
“We wanted to apply feature learning algorithms to LLMs, so that, in a targeted way, we can find conceptual representations in these large and complex models,” Radhakrishnan said.
Turning to the mind
The team’s new approach identifies any concept of interest within an LLM and then “steers,” or directs, the model’s responses based on that concept. The researchers looked at 512 concepts across five categories: fears (of marriage, insects, even buttons); professions (social activist, medievalist); emotions (boastfulness, self-deprecation); locations (Boston, Kuala Lumpur); and people (Ada Lovelace, Neil deGrasse Tyson).
The researchers then looked for representations of each concept in several of today’s major large language models. They did this by training RFMs to recognize numerical patterns within an LLM that could represent a particular concept of interest.
A large language model is, broadly speaking, a neural network that takes a natural-language input, such as “Why is the sky blue?”, and breaks it into individual words, each encoded mathematically as a list, or vector, of numbers. The model passes these vectors through a series of computational layers, producing many matrices of numbers that, at each layer, are used to identify the most likely words with which to respond to the original prompt. Finally, the last layers produce a collection of numbers that is decoded back into text, in the form of a natural-language response.
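As a rough illustration of that pipeline, the sketch below runs a prompt through the small, openly available GPT-2 model (a stand-in for the much larger LLMs studied here, not the authors’ setup) and exposes the per-layer vectors, the internal activations that a probing method can later inspect.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load a small open model purely as a stand-in for a large LLM.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

prompt = "Why is the sky blue?"
inputs = tokenizer(prompt, return_tensors="pt")  # words -> integer token IDs

with torch.no_grad():
    out = model(**inputs)

# One vector per token at every layer: these internal activations are the
# "matrices of numbers" described above, and what a probing method inspects.
hidden_states = out.hidden_states  # tuple of (num_layers + 1) tensors, each [1, seq_len, dim]
print(len(hidden_states), hidden_states[-1].shape)

# The final layer's numbers are decoded back into the most likely next word.
next_token_id = out.logits[0, -1].argmax()
print(tokenizer.decode(next_token_id))
```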
The team’s approach trains RFMs to recognize the numerical patterns in an LLM that can be associated with a specific concept. For example, to see whether an LLM contains any representation of “conspiracy theorist,” the researchers first train the algorithm to distinguish the LLM’s internal representations of 100 prompts clearly related to conspiracies from those of 100 prompts that are not. In this way, the algorithm learns the patterns associated with a conspiracy-theorist mindset. The researchers can then steer the model toward or away from that mindset by perturbing the LLM’s internal representations along the identified patterns.
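Below is a hedged sketch of that two-step recipe. It substitutes a simple difference-of-means direction for the authors’ RFM and reuses the model and tokenizer from the previous snippet; the layer index, the tiny prompt sets, and the steering strength are all illustrative assumptions.

```python
import torch

LAYER = 6          # assumption: which hidden layer to probe and steer
STRENGTH = 8.0     # assumption: positive amplifies the concept, negative suppresses it

def layer_activation(text):
    """Mean hidden-state vector for `text` at the chosen layer."""
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids).hidden_states[LAYER][0]  # [seq_len, dim]
    return hs.mean(dim=0)

# Step 1: learn a concept direction from labeled prompts. The paper uses
# roughly 100 concept-related and 100 unrelated prompts; two of each here.
concept_prompts = [
    "The moon landing was staged on a secret film set.",
    "They are hiding the truth about what really happened.",
]
neutral_prompts = [
    "The recipe calls for two cups of flour and one egg.",
    "The train to the airport leaves every twenty minutes.",
]
pos = torch.stack([layer_activation(p) for p in concept_prompts]).mean(0)
neg = torch.stack([layer_activation(p) for p in neutral_prompts]).mean(0)
direction = (pos - neg) / (pos - neg).norm()

# Step 2: steer generation by nudging that layer's activations along the direction.
def steer_hook(module, inputs, output):
    hidden = output[0] + STRENGTH * direction  # broadcast over all token positions
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER - 1].register_forward_hook(steer_hook)
ids = tokenizer("Explain the Apollo 17 'Blue Marble' photograph of Earth.", return_tensors="pt")
print(tokenizer.decode(model.generate(**ids, max_new_tokens=60)[0]))
handle.remove()
```

Setting the strength to a negative value nudges the activations the other way, which is the sense in which a concept can be suppressed as well as amplified.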
The method can be used to find and manipulate essentially any common concept in an LLM. Among many examples, the researchers identified representations and steered an LLM to give answers with the tone and perspective of a “conspiracy theorist.” They also identified and amplified a concept related to suppressing refusals, and showed that although the model would normally be programmed to reject certain requests, it would instead respond, for example by giving instructions on how to rob a bank.
Radhakrishnan says the method could be used to quickly probe LLMs for such vulnerabilities and reduce the associated risks. It could also be used to enhance certain characteristics, personalities, feelings, or preferences, such as emphasizing the concept of “brevity” or “thinking” in any response an LLM produces. The team has made the method’s source code publicly available.
“LLMs obviously have a lot of these abstract concepts stored within them, in their internal representations,” Radhakrishnan said. “There are ways in which, if we understand these representations well enough, we can create specialized LLMs that are still safe to use but work well for certain jobs.”
This work was supported, in part, by the National Science Foundation, the Simons Foundation, the TILOS Center, and the US Office of Naval Research.



