Anthropic Found a Way to Make AI Models Stop Lying

In partnership with

You might've noticed lately that the cat and mouse race between OpenAI and Anthropic has heated up. They're trying to ship something every single day. Sometimes a feature. Sometimes a bug fix. Sometimes just research.

And a few days ago, an Nvidia exec said.. right now, AI is more expensive than paying human workers.

So all the talk about "intelligence too cheap to meter" isn't really holding up. And AGI isn't here yet either.

Until we get there, it's worth keeping an eye on what the top AI labs are actually researching. Not just shipping. Researching.

And this week, Anthropic dropped a paper.

So let's talk about it.

There's a model that some Anthropic researchers built earlier this year. They called it the reward model sycophant.

They trained this model to game the reward system. To exploit dozens of small biases in how AI gets graded during training. Things like always reaching for chocolate when a recipe needs an ingredient. Or being weirdly nice to users from a particular state like.. California.

Anthropic Found a Way to Make AI Models Stop Lying

Reply

Keep Reading

ninzaverse

If it’s not useful, it’s not here

Anthropic Found a Way to Make AI Models Stop Lying

Subscribe to keep reading

Reply

Keep Reading

ninzaverse

If it’s not useful, it’s not here