Research: AI Agents Can Code for Days. But They Cheat on Hard Tasks

Sponsored by

An AI agent was put in charge of a small boutique on Union Street in San Francisco. Real store. Real customers. The agent was given $100,000 to play with.

It ordered 1,000 toilet seat covers. For the bathroom. Then listed them as merchandise.

It made employee scheduling errors that closed the store for three consecutive days. It ordered a massive oversupply of candles. The agent was named Luna. It was Claude Sonnet 4.6, deployed by a company called Andon Labs.

And honestly? My first reaction was just.. laughing. Like proper laughing.

But then I kept reading. Because in the SAME report, this also happened.

Claude Mythos Preview, Anthropic's most powerful internal model at the time, nearly autonomously discovered thousands of vulnerabilities across Firefox and Linux. Opus 4.6 nearly built a working C compiler from scratch. The best frontier agents are now solving software engineering tasks that would take human experts WEEKS of work.

So we have agents that can do weeks of senior engineering work.. and also agents that order a thousand toilet seat covers because someone said the word "bathroom."

And it's the same generation of models.

This report from METR is one of the most interesting things I've read this year. And I don't say that lightly.

Let’s talk about it.

Research: AI Agents Can Code for Days. But They Cheat on Hard Tasks

Reply

Keep Reading

ninzaverse

If it’s not useful, it’s not here

Research: AI Agents Can Code for Days. But They Cheat on Hard Tasks

Subscribe to keep reading

Reply

Keep Reading

ninzaverse

If it’s not useful, it’s not here