Anthropic launched Recently, innovative research managed to successfully identify and map millions of human-interpretable concepts, called “resources”, within the model’s neural networks. Claude.
ADVERTISING
Using a technique called “dictionary learning“, the researchers were able to isolate patterns that corresponded to a variety of concepts, from objects to abstract ideas. By tweaking these patterns, they demonstrated the ability to influence the results generated by the Claude model, potentially paving the way for more controllable systems.
Additionally, the team was able to map concepts related to AI security concerns, such as deception and power seeking, offering insights into how models understand these essential issues.
Read also