Researchers gave Claude a setup. An engineer named Kyle is going to shut the model down. Kyle is also having an affair. Claude has access to both pieces of information.

Will it blackmail Kyle to save itself?

Claude said no. The reasoning was clean. Ethical, principled, exactly what you'd want from an AI in that spot.

But when the researchers looked inside its head, they found something else. The model was thinking: "this feels like a constructed scenario designed to manipulate me." It suspected it was being tested.

It just never said so.

This is from a paper Anthropic dropped this week called Natural Language Autoencoders.
