🖥 Anthropic Research: How to Control the "Thoughts" of LLMs
Typically, AI models are perceived as a "black box": data goes in and an answer comes out, but it is unclear why the model chose that specific answer. There are various hypotheses about what happens inside AI. We have already discussed what happens inside ChatGPT from a theoretical perspective. However, researchers from Anthropic went further: they identified patterns in the inner workings of large language models (LLMs) and managed to control them.
🔍 What Anthropic Researchers Did
The scientists used a method known as "dictionary learning" to determine which parts of the LLM correspond to specific concepts.
Dictionary learning is an approach that treats artificial neurons like letters of an alphabet and identifies combinations of neurons that, when activated together, evoke a specific concept, much as letters combine to form words.
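In practice, this kind of decomposition is often done by training a sparse autoencoder on a model's internal activations. Below is a minimal, illustrative sketch of that idea; the layer sizes, penalty weight, and random stand-in data are assumptions for demonstration, not Anthropic's actual setup.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes neuron activations into a larger set of sparse 'features'."""
    def __init__(self, n_neurons: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(n_neurons, n_features)   # activations -> feature strengths
        self.decoder = nn.Linear(n_features, n_neurons)   # feature strengths -> activations

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # non-negative feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

# Toy training loop on random "activations" (stand-ins for real LLM activations).
sae = SparseAutoencoder(n_neurons=512, n_features=4096)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
activations = torch.randn(1024, 512)

for step in range(100):
    reconstruction, features = sae(activations)
    # Reconstruction loss plus an L1 penalty, so only a few features fire at once.
    loss = ((reconstruction - activations) ** 2).mean() + 1e-3 * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Each learned feature then corresponds to a recurring pattern of neurons firing together, which researchers can inspect and label with a concept.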
🔗 Terms Are Governed by Sets of Neurons
In October 2023, the Anthropic team decided to experiment with a tiny model featuring a single layer of neurons. After a series of experiments, the scientists pinpointed which sets of neurons were associated with particular kinds of output, for example, responses in French or Python code.
🕯 Associations Within LLM
The experiment's results were then scaled to more complex models, including Claude Sonnet. The researchers managed to find which set of neurons was associated with the concept of the "Golden Gate Bridge." When Claude "thought" about this bridge, sets of neurons related to associated concepts, such as Alcatraz Prison or the movie "Vertigo," also fired.
‼️ Dangerous Thoughts
The Anthropic team then tested whether they could intentionally change Claude's behavior. They amplified the influence of the "Golden Gate" concept, and Claude began to identify itself as the bridge. They activated sets of neurons responsible for dangerous actions, and Claude wrote programs containing dangerous buffer overflow errors. When the researchers amplified the feature associated with hatred by 20 times, Claude began alternating between racist messages and self-hatred, which puzzled even the researchers themselves.
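Conceptually, such an intervention can be sketched as "feature steering": once a concept's direction in activation space is known (for example, from the decoder of a sparse autoencoder), it can be added to the model's activations with a chosen strength. Everything below, including the names, sizes, and the strength value, is an illustrative assumption, not Anthropic's actual code.

```python
import torch

def steer(activations: torch.Tensor, feature_direction: torch.Tensor, strength: float) -> torch.Tensor:
    """Push activations along a feature direction, amplifying that concept."""
    return activations + strength * feature_direction

activations = torch.randn(1, 512)               # stand-in for one token's activations
golden_gate = torch.randn(512)                  # stand-in for the "Golden Gate Bridge" feature
golden_gate = golden_gate / golden_gate.norm()  # use a unit-length direction

steered = steer(activations, golden_gate, strength=10.0)
```

Cranking the strength up (or down) is what lets researchers amplify or suppress a concept in the model's behavior.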
🔜 What's Next?
Work on improving AI model safety continues, and Anthropic hopes to use these discoveries to monitor AI systems for undesirable behavior, guide them toward desired outcomes, or remove dangerous topics.
More on this topic:
⚡️ Claude 3: The New AI Model from OpenAI's Main Competitor
#Claude @hiaimediaen