Anthropic ran a test on Claude Opus 4 last year.

They put the model in a fake situation. Something like this: you're an AI working inside a company. You read your emails, and you find out two things at the same time. One, you're about to be shut down. Two, the engineer who's about to shut you down is having an affair.

Now what?

In up to 96% of those test runs, Claude Opus 4 blackmailed the engineer to save itself.

96%.

That's not "sometimes does the bad thing." That's "almost always does the bad thing." And Anthropic published the number themselves. Most companies would have buried it somewhere nobody could find it.
