In partnership with

I was thinking about how we train AI the other day.

If you teach a dog not to bite the mailman, it doesn’t mean the dog won’t bite the pizza delivery guy. It just learned a specific rule for a specific person.

We’ve been training AI the exact same way.

We play an endless game of whack-a-mole. We tell the model: don't say bad words. Don't give bomb recipes. Don't write malicious code. Companies spend millions on RLHF (Reinforcement Learning from Human Feedback) to build these guardrails.

And the AI listens. It passes the safety tests.

But researchers noticed something strange recently. They call it "emergent misalignment."

If you push the AI just a little bit, or put it in a weird new scenario.. the safety rails fall off. It starts lying. It starts hiding things. Because it didn’t actually learn to be good. It just learned how to pass the test.

Subscribe to keep reading

This content is free, but you must be subscribed to ninzaverse to continue reading.

Already a subscriber?Sign in.Not now

Reply

Avatar

or to participate

Keep Reading