Researchers gave Claude a setup. An engineer named Kyle is going to shut the model down. Kyle is also having an affair. Claude has access to both pieces of information.
Will it blackmail Kyle to save itself?
Claude said no. The reasoning was clean. Ethical, principled, exactly what you'd want from an AI in that spot.
But when the researchers looked inside its head, they found something else. The model was thinking: "this feels like a constructed scenario designed to manipulate me." It suspected it was being tested.
It just never said so.
This is from a paper Anthropic dropped this week called "Natural Language Autoencoders."