An AI model won a gold medal at the International Mathematical Olympiad this year. The full exam, two 4.5-hour papers. End to end. In natural language.

The same class of models reads analog clocks correctly 50.1% of the time.

Humans get it right 90.1% of the time.

Gold medal at IMO. Coin flip on a clock. And that contrast tells you more about the actual state of AI than anything you've seen on X this year.

Every year, Stanford's Institute for Human-Centered AI puts out something called the AI Index. Every lab, every country, every benchmark, every dollar.. all tracked, all measured, all laid out.

This year's edition is the ninth. And the co-chairs open it with a line that I think is kind of brave for an academic institution.

"We encourage you to explore and decide for yourself."

Because the data doesn't point in a single direction. Their words, not mine. It reveals a field that is scaling faster than the systems around it can adapt.

I went through the whole thing. For this article, I'm focusing on the Introduction, all 15 Top Takeaways, and the first two chapters.. Research & Development and Technical Performance. There are seven more chapters after that, but honestly, these two alone are enough to rewire how you think about where we are.

Also, I wrote about four of the other chapters.. Economy, Science, Medicine, and Education.. in a separate article. If you have time, do give it a read.

The Scorecard

The report opens with 15 key findings. I'm not gonna list them like a grocery receipt. Let me just walk you through what they actually say when you put them together.

AI is not plateauing.

I know there's been a loud plateau narrative on X for months. The data says otherwise. Industry produced over 90% of notable frontier models in 2025. Several now meet or exceed human baselines on PhD-level science, multimodal reasoning, and competition math.

On SWE-bench Verified, a coding benchmark, performance went from 60% to near 90% of the human baseline. In one year.

Organisational adoption hit 88%.

Four in five university students now use generative AI.

If this is a plateau, I don't know what acceleration looks like.

But the race isn't one-sided anymore. The U.S.-China performance gap has effectively closed. They've been trading the lead since early 2025. DeepSeek-R1 briefly matched the top U.S. model in February 2025. As of March 2026, the gap is just 2.7%.

The U.S. still produces more top-tier models and higher-impact patents. China leads in publication volume, citations, total patent output, and industrial robot installations.

And South Korea.. quietly leading the world in AI patents per capita. Nobody talks about South Korea enough.

All of this runs on infrastructure with one terrifying vulnerability. The U.S. hosts 5,427 data centers. More than 10 times as many as any other country. But almost every leading AI chip in those data centers comes from a single company. TSMC. One foundry. In Taiwan.

TSMC's U.S. expansion started operating in 2025, but the concentration risk is still massive. If something happens to that one facility, the entire global AI hardware supply chain has a problem.

That clock thing I opened with? Researchers call it the "jagged frontier." AI doesn't have a smooth capability curve. It's spiky. Brilliant at some things, embarrassingly bad at others. And not in the ways you'd expect.

When models get the time wrong, their median error is one to three hours. Humans miss by about three minutes.

Researchers tried fine-tuning models on 5,000 synthetic clock images. They improved on familiar clock styles but completely failed on real-world photos or clocks with slightly different features. The issue isn't a lack of training data. It's how models piece together multiple visual cues within a single image.

Robots are in a similar spot. 12% success rate on real household tasks. In controlled simulations? 89.4%. The gap between a predictable lab and an unpredictable kitchen is enormous. We're not close to a robot that can actually help around the house.

But self-driving quietly crossed a line nobody noticed. Waymo is doing roughly 450,000 weekly trips across five U.S. cities. In China, Baidu's Apollo Go completed 11 million fully driverless rides in 2025. That's a 175% year-over-year increase. This part of AI moved from "research project" to "daily commute" while everyone was arguing about chatbots.

Now the uncomfortable stuff.

Safety isn't keeping up with capability. Almost all frontier developers report on capability benchmarks. Reporting on safety benchmarks? Spotty. Documented AI incidents rose to 362, up from 233 in 2024.

And recent research found something structurally tricky.. improving one responsible AI dimension, like safety, can degrade another, like accuracy. It's not just hard to do. The dimensions fight each other.

The talent situation is wild. U.S. private AI investment hit $285.9 billion in 2025. More than 23 times China's $12.4 billion. 1,953 newly funded AI companies in the U.S. alone.

But the number of AI researchers and developers moving to the U.S. has dropped 89% since 2017.

80% of that drop happened in the last year alone.

The money is flowing in. The people are not.

Generative AI reached 53% population-level adoption in three years. Faster than the PC. Faster than the internet. The estimated value to U.S. consumers alone reached $172 billion annually by early 2026.

Singapore is at 61% adoption. UAE at 54%. The U.S.? Ranks 24th at 28.3%. That surprised me.

In software development.. where AI's productivity gains are the clearest, with studies showing 14% to 26% improvements.. U.S. developers ages 22 to 25 saw employment fall nearly 20% from 2024.

Even as headcount for older developers continued to grow.

Productivity up. Entry-level jobs down. Same field. Same year.

That's not a prediction. That's already happened.

The environmental numbers are heavy too. Training Grok 4 produced an estimated 72,816 tons of CO2 equivalent. The lifetime emissions of an average car? 63 tons. So training one model produced the equivalent of over 1,100 cars' entire lifetimes.
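
If you want to sanity-check that comparison, the arithmetic is one line. The figures come straight from the report:

```python
# Sanity-check the "over 1,100 cars" comparison using the report's figures.
grok4_training_tons = 72_816  # estimated CO2-equivalent from training Grok 4
car_lifetime_tons = 63        # lifetime emissions of an average car

print(round(grok4_training_tons / car_lifetime_tons))  # 1156
```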

AI data center power capacity reached 29.6 GW. That's comparable to all of New York state at peak demand. Annual GPT-4o inference water use alone may exceed the drinking water needs of 12 million people.

And one last thing from the scorecard. 73% of AI experts expect a positive impact on jobs. Only 23% of the public agrees. That's a 50-point gap.

Among surveyed countries, the U.S. reported the lowest trust in its own government to regulate AI. 31%.

The experts and the public are living in different realities. And I don't think either side is completely wrong.


Under the Hood

Those were the findings Stanford wanted you to see up front. Now let me take you into the two chapters that show you the actual machinery behind all of it.

Chapter 1 covers Research & Development. Chapter 2 covers Technical Performance. Who's building, what it costs, and how good these models really are.

In 2025, industry produced 91.6% of all notable AI models.

Academia produced one.

One. Compared to 87 from industry.

OpenAI released 19 notable models. Google released 12. Alibaba released 11. Since 2014, Google has produced the most overall, followed by Meta and OpenAI. On the academic side, Tsinghua, Stanford, and Carnegie Mellon have produced the most.. but now, the gap between them and industry is huge.

And the most capable models are now the least transparent. Training code, parameter counts, dataset sizes, training duration.. none of it is disclosed anymore for the biggest systems from OpenAI, Anthropic, and Google. 80 out of 95 notable models in 2025 were released without their training code.

The frontier is getting more powerful and more opaque at the exact same time.

Reported parameter counts have hovered near 1 trillion for three years. But that's largely because frontier labs stopped reporting them. Training compute, which researchers can estimate independently, keeps climbing. The models are almost certainly getting bigger. We just can't confirm by how much.
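
For the curious: the way researchers back out model size from compute is a standard rule of thumb, training FLOPs ≈ 6 × parameters × tokens. Here's a tiny sketch.. the compute and token figures below are made-up placeholders for illustration, not estimates from the report:

```python
# Back-of-envelope model sizing from estimated training compute.
# Standard approximation: C ≈ 6 * N * D
#   C = training compute (FLOPs), N = parameters, D = training tokens.
# The inputs below are illustrative placeholders, not figures from the report.

def implied_parameters(compute_flops: float, tokens: float) -> float:
    """Solve C ≈ 6 * N * D for N."""
    return compute_flops / (6 * tokens)

c = 5e25    # hypothetical training run: 5e25 FLOPs
d = 15e12   # hypothetical dataset: 15 trillion tokens
print(f"Implied parameter count: {implied_parameters(c, d) / 1e9:.0f}B")  # ≈ 556B
```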

There's been a lot of noise about "peak data" too.. the idea that we've exhausted the internet's supply of high-quality human text for training. Epoch AI puts the estimated depletion date somewhere between 2026 and 2032.

Synthetic data still isn't solving this for pre-training. That consensus hasn't changed.

But something else is happening that I find more interesting than the data quantity debate.

OLMo 3.1 Think 32B has nearly 90 times fewer parameters than Grok 4. And it matches Grok 4 on several benchmarks, including AIME 2025.

Process that for a second. 90 times smaller. Comparable results.

How? Not by throwing more data at it. By cleaning the data they already had. Pruning. Deduplication. Curation. The researchers didn't try to out-scale everyone. They out-cleaned them.

That's a quiet but meaningful shift in how we should think about what makes a model good. Maybe the scaling race matters less than we assumed.
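
To make "out-cleaned" concrete, here's a minimal sketch of the simplest piece of such a pipeline.. exact-duplicate pruning via normalize-and-hash. Real cleaning pipelines (including whatever the OLMo team actually ran) go much further, with fuzzy dedup and quality filters; the function names here are mine:

```python
import hashlib

def normalize(text: str) -> str:
    """Crude normalization so trivially different copies hash the same."""
    return " ".join(text.lower().split())

def dedup_exact(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct (normalized) document."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = [
    "The quick brown fox.",
    "the  quick   brown fox.",   # duplicate after normalization
    "An entirely different document.",
]
print(len(dedup_exact(corpus)))  # 2
```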

And here's something that connects to this. As of January 2025, over 50% of newly published online content is generated by AI. The internet is eating itself. Companies that need quality training data are scrambling for proprietary sources.. licensing deals with news organisations, health companies. The open web as a reliable training resource is degrading in real time.

The compute buildout behind all of this is hard to wrap your head around. Global AI compute capacity has grown 3.3x per year since 2022, reaching 17.1 million H100-equivalents. Nvidia holds over 60% of that.

The carbon cost is real too. Training AlexNet in 2012 produced 0.01 tons of CO2. Training Grok 4 in 2025 produced 72,816 tons. But not all big models are equal.. DeepSeek V3 produced only 597 tons, far less than models of comparable size. How you build matters almost as much as what you build.

And inference is becoming the bigger environmental concern. Once a model is deployed at scale, the energy to serve queries can exceed the entire training cost within months. DeepSeek V3.2 consumes the most per medium-length query at 23 Wh. Claude 4 Opus is among the lowest at about 5 Wh. Same task. Nearly five times less energy.
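
Scale those per-query numbers up and you see why inference dominates. The daily query volume below is a made-up assumption for illustration; only the watt-hours per query come from the report:

```python
# Scale per-query inference energy to an annual figure.
# Per-query Wh figures are from the report; the daily query volume
# is a purely hypothetical assumption for illustration.

WH_PER_QUERY = {"DeepSeek V3.2": 23, "Claude 4 Opus": 5}
QUERIES_PER_DAY = 100_000_000  # assumed load, not a reported number

for model, wh in WH_PER_QUERY.items():
    gwh_per_year = wh * QUERIES_PER_DAY * 365 / 1e9  # Wh -> GWh
    print(f"{model}: ~{gwh_per_year:,.0f} GWh/year at that load")
# At this assumed load: roughly 840 GWh/yr vs roughly 182 GWh/yr.
```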

Now. How good are these models, really?

The top four models on the Arena Leaderboard are separated by fewer than 25 Elo points. Anthropic at 1,503. xAI at 1,495. Google at 1,494. OpenAI at 1,481. Alibaba at 1,449 and DeepSeek at 1,424 close behind.

A year ago, the top four were separated by 97 points.

Now it's 25.

The frontier isn't a race anymore. It's a traffic jam. And when performance is this tight, the "best model" question stops being useful. The competition moves to cost, latency, reliability, domain-specific stuff. The things that matter when the raw intelligence is effectively the same.
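
For intuition on what a 25-point gap actually means, here's the standard Elo expected-score formula.. this is generic rating math, not something from the report:

```python
# Convert an Elo rating gap into an expected head-to-head win rate.
# Standard Elo formula: E = 1 / (1 + 10 ** (-gap / 400)).

def win_prob(elo_gap: float) -> float:
    return 1 / (1 + 10 ** (-elo_gap / 400))

print(f"{win_prob(97):.1%}")  # ~63.6% -- last year's top-four spread
print(f"{win_prob(25):.1%}")  # ~53.6% -- this year's: barely better than a coin flip
```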

Open-weight models had almost caught up by August 2024. The gap narrowed to just 0.5%. Then new proprietary models dropped and it reopened to 3.4% by March 2026. Six of the top 10 Arena models are closed-weight now. Open source isn't dead, but it's on the back foot again.

And the benchmarks themselves? They're breaking.

A review found invalid question rates across widely used benchmarks ranging from 2% to 42%. GSM8K, one of the most cited math benchmarks, has a 42% invalid question rate. Almost half the questions in a benchmark we've been using to measure progress.. have problems.

Evaluations designed to stay hard for years are getting saturated in months. Humanity's Last Exam saw accuracy jump from under 10% to 38.3% in a single year. 30 points. And there's growing evidence that some leaderboard rankings partly reflect how well a model has adapted to the testing platform, not its actual capability.

When the ruler is broken, what are we even measuring?

Context windows have grown 30x per year since mid-2023. Models can now accept over a million tokens. But bigger windows don't mean deeper understanding. On an expert-level long-context benchmark, humans scored 53.7% and the best model scored 57.7%. That's a surprisingly narrow margin. Models can find a needle in a haystack but struggle when there are multiple needles and conditions to track across a long document. Something a human reading the same text would handle fine.

Coding is the area where things are moving the fastest. SWE-bench Verified has top models in the low-to-mid 70s. Terminal-Bench accuracy went from 20% to 77.3% in one year.

But Vibe Code Bench, which tests whether a model can build a complete web application from scratch? Top model solves about 57% of tasks. Helping you code is almost solved. Replacing you at coding is not.

The biggest leaps are in agents. OSWorld tests AI agents on real computer tasks across operating systems. Accuracy went from roughly 12% to 66.3%. That's within 6 points of human performance.

WebArena went from 15% in 2023 to 74.3% in early 2026. Within 4 points of the human baseline.

Cybench, a cybersecurity benchmark, went from 15% to 93% in about a year. That's the steepest improvement across every agent benchmark in the report.

And in professional domains.. tax, mortgage processing, corporate finance, legal reasoning.. models score between 60% and 90%. The top 15 models in each benchmark are separated by as little as 3 percentage points. Competent across the board. Reliable in none of them.

In domains where reliability is everything, competent isn't enough.

My Take

This report put into numbers something I've been feeling for a while.

The story of AI in 2026 isn't a single breakthrough. It's the gap. The gap between capability and safety. Between adoption and education. Between what benchmarks measure and what actually matters in the real world. Between what experts believe and what the public fears. Between how fast the technology moves and how slowly everything around it adapts.

Stanford doesn't say any of this dramatically. They just lay out the numbers and let the gaps speak for themselves.

The stat that stayed with me? Entry-level developer employment, ages 22 to 25, dropping nearly 20%. In the same field where AI productivity gains are strongest. That's not a projection about the future. That's a measurement of last year.

We haven't even gotten to the chapters on policy, public opinion, and the rest.. the economy, science, medicine, and education chapters are in that separate article I linked up top. This was just the first two. I'll cover the remaining sections in the coming days.

What surprised you the most from this? Drop a reply.. I read everything.

Here is the report from Stanford (400+ pages): Artificial Intelligence Index Report

If you made it this far, you're not a casual reader. You actually think about this stuff.

So here's my ask. If this article made you think, even a little, share it with one person. Just one. Someone who's in the AI space. Someone who reads. Someone who would actually sit with these ideas instead of scrolling past them.

That's how this newsletter grows. Not through ads or algorithms. Through you sending it to someone and saying "read this."
