There's a question that's been floating around in AI circles for years. Quietly. Without a real answer.
Can AI get good enough to make itself better?
Can an AI system look at its own model architecture, its own training data, its own learning algorithms.. and improve them? The way a human researcher would. But faster. And at scale.
A few months ago, I would've said we're years away from seeing this work in any meaningful way. But a team from Shanghai Jiao Tong University and SII just changed my mind.
They published a paper called ASI-Evolve. And the results are interesting tbh.
The system designs new model architectures. It figures out how to clean training data. It invents reinforcement learning algorithms. And in all three cases.. it outperformed what human researchers had come up with.
If you've been reading me for a while, you know I don't throw words like that around casually. I care about what's real. And I think this one is real.
Let me explain why.
The Slow Grind
Everyone talks about AI breakthroughs. Nobody talks about how painfully slow the process of creating those breakthroughs actually is.
A researcher spends weeks reading papers. Forms a hypothesis. Writes code. Trains a model. Waits hours.. sometimes days.. for results. Looks at the numbers. Realizes the hypothesis didn't work. Adjusts. Repeats.
This cycle runs hundreds of times before a single meaningful result comes out.
That's the reality of AI research. And nobody really talks about it because it's not a cool problem. But it's the bottleneck behind everything.
The number of hypotheses a single researcher can test is very small compared to the actual design space. Each experiment eats up GPU hours and real money. And here's the worst part.. the insights from past experiments live in the researcher's head. When they move to a new lab or a new project, those insights walk out the door with them.
There's no system. No shared memory, generally. No compounding.
And I think this is why ASI-Evolve matters. Not because of the benchmark numbers, although we'll get to those. But because it goes after this exact bottleneck.
So, what if the AI ran this entire loop on its own?
Read the literature. Generate hypotheses. Run experiments. Analyze results. Learn from them. Feed those lessons into the next round. Repeat.
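Here's that loop in toy form. Every name and the fake one-dimensional "design space" below are my own stand-ins, not anything from the paper.. but the shape is the point: propose near what worked, run, compress the result into a lesson, feed the lessons back in.

```python
import random

# Toy stand-in for the hypothesize -> experiment -> analyze -> remember loop.
# Everything here is hypothetical: a 1-D "design space" where each round
# proposes a tweak near the best design learned so far.

def propose(lessons):
    # Start near the best-known design instead of from scratch
    base = max(lessons, key=lambda l: l["score"])["design"] if lessons else 0.5
    return base + random.uniform(-0.1, 0.1)

def run_experiment(design):
    # Pretend benchmark: performance peaks when design == 0.8
    return 1.0 - abs(design - 0.8)

def analyze(design, score):
    # Compress the "experiment" into a compact, reusable record
    return {"design": design, "score": score}

def research_loop(rounds=200, seed=0):
    random.seed(seed)
    lessons = []  # the shared memory: insights persist across rounds
    for _ in range(rounds):
        design = propose(lessons)
        score = run_experiment(design)
        lessons.append(analyze(design, score))
    return max(l["score"] for l in lessons)
```

The part that matters is the `lessons` list: nothing walks out the door between rounds, so the search compounds instead of restarting.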
And they didn't just ask the question. They built the system and tested it across three fundamental pillars of AI development.. architecture, training data, and learning algorithms.
To my knowledge, nobody had done all three in a single unified framework before.
What Makes ASI-Evolve Different
Now, the idea of AI exploring solutions automatically isn't new. Google's AlphaEvolve does something similar. FunSearch from DeepMind does too. There are multiple evolutionary frameworks out there.
But most of them work on well-scoped tasks. Small code changes. Quick feedback loops. A scalar score that tells you if you did well or not.
Real AI research isn't like that.
When you're designing a new model architecture, a single experiment might take hundreds of GPU hours. The output isn't a clean score.. it's a mess of training dynamics, benchmark distributions, and efficiency metrics. And the design space has no boundaries. You're not tweaking a function. You're modifying large, interconnected codebases.
ASI-Evolve was built for this harder regime. And I think two specific design choices make the difference.
The first is the cognition base.
Imagine a new PhD student joining a research lab. The first thing they do isn't run experiments. They read. For months. Because starting from zero when decades of literature already exists is insanely wasteful.
ASI-Evolve works the same way. Before the system starts exploring, it's loaded with structured knowledge extracted from real research papers. For the architecture task, they pulled insights from about 150 papers on linear attention, state space models, and efficient transformers.
So the system doesn't begin from scratch. It begins from roughly where the research community currently stands.
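To give a feel for what "loaded with structured knowledge" could look like, here's a deliberately simple sketch. The schema and the tag-matching retrieval are my invention, not the paper's actual format:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a cognition base: distilled insights from papers,
# queried for relevance before each exploration round.

@dataclass
class Insight:
    source: str                          # e.g. the paper it came from
    claim: str                           # the distilled takeaway
    tags: set = field(default_factory=set)

class CognitionBase:
    def __init__(self):
        self.insights = []

    def add(self, insight):
        self.insights.append(insight)

    def relevant(self, tags, k=3):
        # Rank stored insights by tag overlap with the current task
        ranked = sorted(self.insights,
                        key=lambda i: len(i.tags & tags),
                        reverse=True)
        return ranked[:k]
```

The retrieval step is the "months of reading" compressed into a lookup: the system starts each round from the most relevant prior knowledge instead of from zero.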
The second is the analyzer.
After every experiment, the system doesn't just look at a score and move on. It takes the entire experimental output.. training logs, benchmark breakdowns across different tasks, efficiency traces.. and compresses all of it into a compact, actionable report. That report gets stored and reused in future rounds.
Why does this matter so much? Because the difference between a mediocre researcher and a great one isn't intelligence. It's the ability to look at complex results and extract the right lesson. Most researchers spend hours after each experiment just figuring out what actually happened.
The analyzer automates that process. And it never forgets.
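A toy version of that compression step, with made-up benchmark names, might look like this. The idea: throw away everything except the deltas that are large enough to act on:

```python
# Toy analyzer: turn a raw experiment dump into a compact report of what
# moved relative to the baseline. Names and threshold are illustrative.

def analyze(baseline, result, threshold=0.5):
    """Keep only the per-benchmark deltas big enough to act on."""
    deltas = {k: round(result[k] - baseline[k], 2) for k in baseline}
    notable = {k: v for k, v in deltas.items() if abs(v) >= threshold}
    return {
        "wins":   sorted(k for k, v in notable.items() if v > 0),
        "losses": sorted(k for k, v in notable.items() if v < 0),
        "deltas": notable,
    }
```

A report like this is what gets stored and reused, rather than the raw logs.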
And here's the insight I keep coming back to when I think about this paper.
Previous evolutionary systems evolved solutions. They got better at finding answers. ASI-Evolve evolves cognition itself. It gets better at knowing where to look.
That's a fundamentally different thing. And I think it's why the results look the way they do.
The Results
Architecture design.
They used DeltaNet as the baseline. It's a well-known linear attention architecture that a lot of the research community has been building on. The system ran 1,773 exploration rounds and generated 1,350 candidate architectures.
105 of them beat DeltaNet.
The best model scored +0.97 points above DeltaNet averaged across benchmarks. Now, +0.97 sounds small until you see the context. The most recent human-designed improvement over DeltaNet was Mamba2. Its gain? +0.34 points.
The AI delivered nearly 3x the improvement of the best recent human effort.
But what I find even more interesting than the number is the pattern. When they looked at the top 5 architectures, they all followed the same idea: adaptive routing. Instead of fixed structures, these designs change how much compute they use based on the actual input.
Nobody told the system to go in this direction. It reached there on its own, through trial and error, learning from its own results.
That's not just a benchmark improvement. From my point of view, that's a research insight. The kind a human researcher might publish a paper about.
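To make "adaptive routing" concrete, here's a minimal illustration of the general idea.. an input-dependent gate deciding, per token, whether to spend the expensive path or the cheap one. This is the concept only, not the discovered architectures, and a real implementation would be a learned neural module rather than this control flow:

```python
import math

# Minimal illustration of adaptive routing: a gate (weights assumed to be
# learned) decides per token whether to pay for the heavy compute path.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def adaptive_route(tokens, w_gate, heavy, light, threshold=0.5):
    """tokens: list of feature vectors; w_gate: hypothetical gating weights."""
    outputs, gates = [], []
    for x in tokens:
        g = sigmoid(sum(xi * wi for xi, wi in zip(x, w_gate)))
        gates.append(g)
        # Only tokens that clear the gate get the expensive path
        outputs.append(heavy(x) if g > threshold else light(x))
    return outputs, gates
```

The fixed-structure alternative would apply `heavy` to everything; the gate is what makes compute depend on the input.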
Training data curation.
They gave the system the Nemotron-CC dataset.. 672 billion tokens of academic content across math, medicine, computer science, and other STEM fields. The task: figure out how to clean this data so models trained on it perform better.
The system designed its own cleaning strategies. Nobody told it what to prioritize. Nobody said "remove HTML artifacts" or "normalize formatting." It figured all of that out on its own.
The results?
Average benchmark performance improved by nearly 4 points over raw data. But the real story is on knowledge-intensive benchmarks. MMLU jumped by over 18 points. CSQA by almost 19 points. MedQA by more than 13 points.
All models had 3B parameters and were trained on 500B tokens. The setup was the same for all of them. And still, the system's curated data outperformed well-known human-curated datasets like DCLM, FineWeb-Edu, and Ultra-FineWeb.
But the interesting part isn't just the numbers. It's that the system figured out on its own that cleaning the data matters most. It tried different methods and learned that removing noise in a structured way, while keeping important domain-specific content, beats blunt filtering heuristics. This is the kind of practical insight that usually takes a data team months of trial and error to reach.
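For intuition, here's what "clean before you filter" can look like in miniature. The rules below are mine and purely illustrative.. not the strategies the system actually discovered:

```python
import re

# Hypothetical example of structured noise removal: strip formatting debris
# while keeping domain content, instead of dropping whole documents.

def clean_document(text):
    text = re.sub(r"<[^>]+>", " ", text)   # HTML artifacts
    text = re.sub(r"&[a-z]+;", " ", text)  # stray HTML entities
    text = re.sub(r"[ \t]+", " ", text)    # normalize whitespace
    return text.strip()

def curate(docs, min_chars=20):
    # Clean first, filter second: a cleaned doc may be worth keeping
    # even if its raw form looked like noise.
    cleaned = (clean_document(d) for d in docs)
    return [d for d in cleaned if len(d) >= min_chars]
```

The ordering is the insight: an aggressive filter applied to raw text would have thrown away the first document below along with the second.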
Reinforcement learning algorithm design.
This one is the most technical. But the results speak loudly enough that I think it's worth understanding.
Using GRPO as the starting point, the system created new RL algorithms for training language models. Over 300 evolutionary rounds, it produced algorithms that beat GRPO by +12.5 points on AMC32, +11.67 on AIME24, and +5.04 on OlympiadBench. These were tested on Qwen-3-14B.
And these weren’t random changes that got lucky. The system came up with real mathematical improvements.
One algorithm introduced pairwise advantage estimation with asymmetric clipping, where the clipping range changes dynamically depending on whether the advantage is positive or negative. Another introduced something called a Global Update Budget, which ensures that the total policy update stays within a fixed limit.
Just think about that for a second. These are the kinds of ideas that usually come from experienced RL researchers who have spent years working on training stability. The AI reached similar ideas through repeated experimentation.
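The paper's exact formulations aren't reproduced here, but a sketch makes both ideas concrete: sign-dependent clip widths on a PPO/GRPO-style ratio, and a hard cap on the total update. The clip values and the scaling rule below are my assumptions:

```python
# Sketch of asymmetric clipping: a tighter clip on negative advantages
# (assumed) to curb destructive updates. Not the paper's formulation.

def clipped_objective(ratio, advantage, eps_pos=0.2, eps_neg=0.1):
    eps = eps_pos if advantage > 0 else eps_neg
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    # Standard pessimistic min over unclipped and clipped surrogates
    return min(ratio * advantage, clipped * advantage)

# Sketch of a "Global Update Budget": rescale the whole update so its
# total magnitude never exceeds a fixed cap.

def apply_budget(updates, budget):
    total = sum(abs(u) for u in updates)
    scale = min(1.0, budget / total) if total > 0 else 1.0
    return [u * scale for u in updates]
```

Both are stability mechanisms: the first limits how far any single ratio can push the policy, the second limits how far the whole step can go.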
There are two more results worth mentioning.
They tested it on circle packing, a classic optimization problem used to compare different systems. The task is to place 26 circles inside a 1x1 square and maximize the sum of their radii. ASI-Evolve reached state-of-the-art performance in just 17 rounds. OpenEvolve took 460. SkyDiscover took 89.
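Part of why circle packing works as a comparison task is that the evaluator is trivial to write, so all the difficulty lives in the search. A tiny checker (my own helper, not the benchmark harness):

```python
from math import hypot

# Feasibility + score for the circle-packing task: circles must fit
# inside the unit square without overlapping; score is the sum of radii.

def packing_score(circles, tol=1e-9):
    """circles: list of (x, y, r). Returns sum of radii if valid, else None."""
    for i, (x, y, r) in enumerate(circles):
        # Inside the unit square
        if x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            return None
        # No overlap with any later circle
        for (x2, y2, r2) in circles[i + 1:]:
            if hypot(x - x2, y - y2) < r + r2 - tol:
                return None
    return sum(r for _, _, r in circles)
```

A clean scalar score like this is exactly the regime earlier evolutionary systems were built for.. which is what makes the 17-vs-460 round comparison meaningful.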
They also applied ASI-Evolve to drug-target interaction prediction, which is a completely different domain. The evolved architecture improved cold-start performance by almost 7 AUROC points for unseen drugs.
That last result is important. It shows that the design ideas this system develops are not limited to AI problems. They can transfer to other fields too.
My Take
I've been thinking about what this paper means for a few days now. And I keep coming back to one idea.
We've always assumed that AI progress requires human researchers at the center of the loop. Humans read the papers. Humans form the hypotheses. Humans design the experiments. Humans interpret the results. AI was the tool. Humans were the scientists.
This paper doesn't fully break that assumption. And I want to be very honest about that. ASI-Evolve operates within carefully designed boundaries. Humans set the evaluation criteria. Humans curated the cognition base. Humans defined the search spaces. A lot of real, unglamorous work went into creating the conditions for the AI to succeed.
This isn't AGI. This isn't the machine waking up.
But what it does show.. is that the bottleneck is shifting.
The framework is fully open-sourced on GitHub. And I think that matters more than the paper itself. Because if this approach scales the way these results suggest.. every major lab is going to build their own version. Google already has AlphaEvolve. Now there's ASI-Evolve. The ecosystem is growing.
From my point of view, researchers aren't being replaced. The grind is.
Also I think you'll hear about ASI-Evolve again. Or something built on top of it.
Let me know what you think..
