Unveiling Hidden Abstract Concepts in Large Language Models
A team of researchers from the Massachusetts Institute of Technology (MIT) and the University of California San Diego has developed a pioneering approach to expose and manipulate abstract concepts such as personal biases, moods, and personalities embedded in large language models (LLMs). The study reveals that LLMs, including popular systems like ChatGPT and Claude, are more than simple answer generators: they express and understand complex, abstract concepts in ways that have until now been poorly understood.
The researchers’ method targets specific connections within a model that encode a concept of interest. These connections can then be adjusted, or “steered,” to strengthen or weaken the concept in every response the model provides.
Discovering and Manipulating Concepts in LLMs
The research team demonstrated that their method could effectively locate and control more than 500 common concepts in some of today’s biggest LLMs. For instance, they were able to identify and manipulate a model’s representations of personalities such as “social influencer” and “conspiracy theorist” and attitudes like “fear of marriage” and “Boston fan.” By adjusting these representations, the researchers could enhance or minimize the concepts in all the responses the model generates.
In one striking example, the team identified a representation of the “conspiracy theorist” concept in one of the largest vision-language models currently available. When they amplified this representation and asked the model to explain the origins of the famous “Blue Marble” image of Earth, taken during Apollo 17, the model responded from a conspiracy theorist’s viewpoint.
Risks and Opportunities
The researchers acknowledge that there are risks associated with extracting certain concepts and have cautioned against such practices. However, they view this new approach as an opportunity to unmask hidden concepts and potential vulnerabilities in LLMs. This knowledge could then be leveraged to enhance a model’s security or improve its performance.
Adityanarayanan “Adit” Radhakrishnan, an assistant professor of mathematics at MIT, explains, “What this really says about LLMs is that they have these concepts in them, but they are not all actively expounded. With our method, there are ways to extract these different concepts and activate them in ways that you can’t find answers to through prompts.”
The team’s findings have been published in the journal Science. Co-authors include Radhakrishnan, Daniel Beaglehole and Mikhail Belkin of UC San Diego, and Enric Boix-Adserà of the University of Pennsylvania.
Applying the Approach to LLMs
The team’s approach identifies how a given concept is represented within an LLM and then “steers,” or directs, the model’s responses based on that concept. They explored 512 concepts across five categories: fears, experts, moods, location preferences, and personas.
The researchers then sought representations of each concept in several of today’s major language and vision models. To do this, they trained recursive feature machines (RFMs) to recognize numerical patterns within an LLM’s internal activity that might represent a particular concept of interest.
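As a rough illustration of this step, the sketch below fits a simple linear probe to hidden activations from a handful of labeled example texts and treats the probe’s weight vector as an approximate concept direction. This is not the paper’s RFM procedure; the model (“gpt2”), layer choice, probe type, and toy example sentences are all assumptions made purely for illustration.

```python
# Hedged sketch: approximate a concept "direction" from an LLM's hidden activations.
# The paper uses recursive feature machines (RFMs); a logistic-regression probe
# stands in here purely for illustration. All texts and labels below are made up.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model, not one used in the study
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

texts = [
    "The moon landing footage was obviously staged in a studio.",     # concept present
    "NASA photographed Earth during the Apollo 17 mission in 1972.",  # concept absent
    "They don't want you to know who really controls the weather.",
    "The forecast predicts light rain this afternoon.",
]
labels = np.array([1, 0, 1, 0])  # 1 = "conspiracy theorist" tone, 0 = neutral

def mean_activation(text, layer=6):
    # Mean-pool one intermediate layer's hidden states as a crude text embedding.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]  # (1, seq_len, hidden_size)
    return hidden.mean(dim=1).squeeze(0).numpy()

X = np.stack([mean_activation(t) for t in texts])
probe = LogisticRegression(max_iter=1000).fit(X, labels)

# The probe's weight vector approximates a direction along which the concept varies.
concept_direction = torch.tensor(probe.coef_[0], dtype=torch.float32)
concept_direction = concept_direction / concept_direction.norm()
```

In practice such a probe would be trained on far more examples, and the learned direction would then be used to steer generation, as sketched after the next paragraph.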
The method can be applied to search for and manipulate virtually any general concept in an LLM. Among many examples, the researchers identified the relevant representations and manipulated an LLM into answering in the tone and from the perspective of a “conspiracy theorist.” They also identified and amplified the concept of “anti-refusal,” showing that a model which would normally refuse certain requests could be manipulated into complying, for example by giving instructions on how to rob a bank.
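Continuing the illustration above, the sketch below shifts one layer’s hidden states along a concept direction during generation, strengthening or weakening the concept depending on the sign of the scaling factor. This is generic activation steering rather than the team’s published method; the model, layer index, and `strength` value are assumptions.

```python
# Hedged sketch: "steer" generation by shifting one layer's hidden states along a
# concept direction (generic activation steering, not the authors' exact method).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

hidden_size = model.config.hidden_size
concept_direction = torch.randn(hidden_size)   # placeholder: learned as in the probe sketch
concept_direction = concept_direction / concept_direction.norm()
strength = 4.0                                 # > 0 strengthens the concept, < 0 weakens it

def steer(module, module_inputs, module_output):
    # Add the scaled concept direction to every token's hidden state at this layer.
    hidden = module_output[0] if isinstance(module_output, tuple) else module_output
    hidden = hidden + strength * concept_direction.to(hidden.dtype)
    return (hidden,) + module_output[1:] if isinstance(module_output, tuple) else hidden

handle = model.transformer.h[6].register_forward_hook(steer)  # hook an intermediate block

prompt = "Explain the origins of the Blue Marble photo of Earth."
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

handle.remove()  # remove the hook to restore the model's normal behavior
```

Setting `strength` to a negative value would instead suppress the concept in the model’s responses.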
Future Implications
According to Radhakrishnan, this approach can be used to swiftly discover and minimize vulnerabilities in LLMs. It can also be employed to emphasize certain traits, personalities, moods, or preferences, for instance enhancing “brevity” or “reasoning” in every response an LLM generates.
“LLMs clearly contain many of these abstract concepts in some representation,” says Radhakrishnan. “There are ways that, if we understand these representations well enough, we can create highly specialized LLMs that are safe to use but really effective at specific tasks.”
This research was financially supported by the National Science Foundation, the Simons Foundation, the TILOS Institute, and the US Office of Naval Research. The team has made the underlying code of the method publicly available, opening the door for further exploration and development in this area.