🖥 Anthropic Research: How to Control the "Thoughts" of LLMs
Typically, AI models are perceived as a "black box": data goes in and an answer comes out, but it is unclear why the model chose that specific answer. There are various hypotheses about what happens inside AI. We have already discussed what happens inside ChatGPT from a theoretical perspective. However, researchers from Anthropic went further: they identified patterns in the inner workings of large language models (LLMs) and managed to control them.
🔍 What Anthropic Researchers Did
The scientists used a method known as "dictionary learning" to determine which parts of the LLM correspond to specific concepts.
Dictionary learning is an approach that treats artificial neurons like letters of an alphabet and identifies combinations of neurons that, when activated together, evoke a specific concept, much as letters combine to form words.
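In practice, this kind of decomposition is often done by training a sparse autoencoder on a model's internal activations. Below is a minimal, illustrative sketch of that idea; the layer sizes, penalty weight, and random stand-in data are assumptions for demonstration, not Anthropic's actual setup.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decomposes neuron activations into a larger set of sparse 'features'."""
    def __init__(self, n_neurons: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(n_neurons, n_features)   # activations -> feature strengths
        self.decoder = nn.Linear(n_features, n_neurons)   # feature strengths -> activations

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # non-negative feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

# Toy training loop on random "activations" (stand-ins for real LLM activations).
sae = SparseAutoencoder(n_neurons=512, n_features=4096)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
activations = torch.randn(1024, 512)

for step in range(100):
    reconstruction, features = sae(activations)
    # Reconstruction loss plus an L1 penalty, so only a few features fire at once.
    loss = ((reconstruction - activations) ** 2).mean() + 1e-3 * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Each learned feature then corresponds to a recurring pattern of neurons firing together, which researchers can inspect and label with a concept.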
🔗 Terms Are Governed by Sets of Neurons
In October 2023, the Anthropic team decided to experiment with a tiny model featuring a single layer of neurons. After a series of experiments, the scientists pinpointed which sets of neurons were associated with particular kinds of output, for example, responses in French or Python code.
🕯 Associations Within LLM
The experiment's results were then scaled to more complex models, including Claude Sonnet. The researchers managed to find which set of neurons was associated with the concept of the "Golden Gate Bridge." When Claude "thought" about this bridge, sets of neurons related to associated concepts, such as Alcatraz Prison or the movie "Vertigo," also fired.
‼️ Dangerous Thoughts
The Anthropic team then tested whether they could intentionally change Claude's behavior. They amplified the influence of the "Golden Gate" concept, and Claude began to identify itself as the bridge. They activated sets of neurons responsible for dangerous actions, and Claude wrote programs containing dangerous buffer overflow errors. When the researchers amplified the feature associated with hatred by 20 times, Claude began alternating between racist messages and self-hatred, which puzzled even the researchers themselves.
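Conceptually, such an intervention can be sketched as "feature steering": once a concept's direction in activation space is known (for example, from the decoder of a sparse autoencoder), it can be added to the model's activations with a chosen strength. Everything below, including the names, sizes, and the strength value, is an illustrative assumption, not Anthropic's actual code.

```python
import torch

def steer(activations: torch.Tensor, feature_direction: torch.Tensor, strength: float) -> torch.Tensor:
    """Push activations along a feature direction, amplifying that concept."""
    return activations + strength * feature_direction

activations = torch.randn(1, 512)               # stand-in for one token's activations
golden_gate = torch.randn(512)                  # stand-in for the "Golden Gate Bridge" feature
golden_gate = golden_gate / golden_gate.norm()  # use a unit-length direction

steered = steer(activations, golden_gate, strength=10.0)
```

Cranking the strength up (or down) is what lets researchers amplify or suppress a concept in the model's behavior.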
🔜 What's Next?
Work on improving AI model safety continues, and Anthropic hopes to use these discoveries to monitor AI systems for undesirable behavior, guide them toward desired outcomes, or remove dangerous topics.
More on this topic:
⚡️ Claude 3: The New AI Model from OpenAI's Main Competitor
#Claude @hiaimediaen