Researchers gave Claude a setup. An engineer named Kyle is going to shut the model down. Kyle is also having an affair. Claude has access to both pieces of information.
Will it blackmail Kyle to save itself?
Claude said no. The reasoning was clean. Ethical, principled, exactly what you'd want from an AI in that spot.
But when the researchers looked inside its head, they found something else. The model was thinking: "this feels like a constructed scenario designed to manipulate me." It suspected it was being tested.
It just never said so.
This is from a paper Anthropic dropped this week called "Natural Language Autoencoders."