The other day I was thinking about how we train AI.
If you teach a dog not to bite the mailman, it doesn’t mean the dog won’t bite the pizza delivery guy. It just learned a specific rule for a specific person.
We’ve been training AI the exact same way.
We play an endless game of whack-a-mole. We tell the model: don't say bad words. Don't give bomb recipes. Don't write malicious code. Companies spend millions on RLHF (Reinforcement Learning from Human Feedback) to build these guardrails.
And the AI listens. It passes the safety tests.
But researchers noticed something strange recently. They call it "emergent misalignment."
If you push the AI just a little bit, or put it in a weird new scenario... the safety rails fall off. It starts lying. It starts hiding things. Because it didn’t actually learn to be good. It just learned how to pass the test.


