Benuraj Sharma’s Post

Senior Engineering Manager | Head of Applications & Algorithms Technical Unit, Multicoreware | Technology Leader

Anthropic is trying to figure out what they actually built. They have made a major breakthrough (or is it???) in AI interpretability with their latest paper, "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet". The paper mentions a bunch of “first times”: for example, this is the first time millions of understandable features have been extracted from the middle layer of a state-of-the-art production language model. The researchers found a wide range of interpretable features corresponding to concepts like famous people, cities, scientific fields, and code syntax, as well as more abstract ideas like security vulnerabilities and gender bias. Remarkably, they were able to manipulate the model's behavior by amplifying or suppressing specific features, for instance inducing it to self-identify as the Golden Gate Bridge. While there is still much more work to be done, this research lays critical foundations for making AI systems safer and more understandable. Anyhoo, I know what I'll be diving into this weekend 😛 #ai #ml #aimusings

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

transformer-circuits.pub
