When it comes to research, researchers from all around the world are doing commendable work. But I mostly read AI research, and to be honest, researchers from China are on another level. You’ve probably heard what the DeepSeek team has done; it’s no secret.

Also, when it comes to research, I would say Tsinghua University is the epicentre of Chinese AI research. I don’t remember the exact numbers, but if you compare universities by published AI research, they are probably number one.

And now they have released another strong AI research paper. It asks a simple question: can AI learn scientific taste? It’s a wonderful read, and after this article, you’ll understand why I’m saying that.

What is Scientific Taste and Why Is It a Big Deal?

As you already know, AI is good at writing code, reading PDFs, and summarizing business files, along with many other things.

For everyday people, this is cool. But there is still a missing piece.

To understand it, let me simplify a concept called “scientific taste.”

Scientific taste is the ability to judge and propose research ideas that have a high potential for impact. It is not about executing a task. It is about knowing which task is actually worth executing.

I’m going to use an analogy here, but you can connect each point with AI to better understand what I’m trying to say.

Now imagine a highly skilled music producer who can:

  • Mix and master tracks perfectly

  • Understand every instrument and sound

  • Replicate any song with precision

  • Use all the latest tools flawlessly

That producer has incredible technical skills. But what if they don’t know what actually sounds good?

What if they can’t create a song people truly connect with?

They lack musical taste.

A few days ago, I wrote an article about what Terence Tao said; he clearly mentioned that AI lacks this kind of taste.

Right now, AI models are like that producer.

They are amazing assistants. They can search data, run experiments, and generate ideas.

But when they generate ideas, they often struggle to tell the difference between something truly groundbreaking and something that just looks new but is actually useless.

Great human scientists have this taste. They have foresight. They understand what truly matters.

Now you can see why researchers care about this.

And honestly, with the pace of innovation and tech acceleration we’re seeing in 2026, we really need this kind of taste in AI.

Now you might be thinking: how do you teach a machine “taste”? It sounds like a human trait, something subjective.

But the paper clearly says that taste is not just a personal preference.

In the scientific community, taste is a shared verdict. Over time, the community decides what is valuable through long-term interactions. Work that aligns with this collective taste gets reused, extended, and celebrated.

How do we measure this? Through citations.

Citations are the most common way to measure the impact of scientific research. A paper with many citations usually means the idea connected with the community. It had strong potential impact.

Up to this point in AI training, if we talk about big AI labs, the final step is RLHF (Reinforcement Learning from Human Feedback). They pay human experts to rate AI outputs. But this is very expensive, and individual experts bring their own personal biases.

Also, if a task has a clear output, you can verify it using rewards. For example, if the code compiles, the AI gets a reward. If not, no reward. But how do you verify an open-ended research idea? You cannot just run a unit test on a hypothesis, right?
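The “code compiles, AI gets a reward” idea can be sketched in a few lines. This is my own illustration, not anything from the paper; it uses Python’s built-in `compile` as a stand-in for a real compiler-and-test pipeline:

```python
def compile_reward(source: str) -> float:
    # Verifiable reward: 1.0 if the candidate code parses, else 0.0.
    # Real pipelines would also run the code against tests; parsing
    # is just a minimal stand-in for "the code compiles".
    try:
        compile(source, "<candidate>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0

print(compile_reward("x = 1 + 1"))   # valid code earns the reward
print(compile_reward("def f(:"))     # broken code earns nothing
```

Notice there is no equivalent one-liner for “is this research idea groundbreaking?”, which is exactly the gap the paper tries to fill.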

So, researchers came up with something new. They call it Reinforcement Learning from Community Feedback (RLCF).

Instead of relying on expensive human experts, they used natural feedback that already exists in the real world, like millions of citations from scientific history.

Building the “Scientific Judge”

The researchers collected 2.1 million papers from arXiv published up to 2024.

From this massive pool, they built a training dataset called SciJudgeBench. It contains about 700,000 pairs of paper abstracts.

To make the data fair, they created strict rules. They matched papers from the exact same field and the exact same publication time. In each pair, one paper had significantly more citations than the other.
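The pairing rule described above can be sketched like this. Everything here (field names, the citation-gap ratio, the data layout) is an illustrative assumption, not the paper’s actual pipeline:

```python
from itertools import combinations

def build_pairs(papers, min_ratio=2.0):
    # Group papers by (field, publication month) so that each pair
    # compares like with like, as the dataset construction requires.
    groups = {}
    for p in papers:
        groups.setdefault((p["field"], p["month"]), []).append(p)

    pairs = []
    for bucket in groups.values():
        for a, b in combinations(bucket, 2):
            hi, lo = (a, b) if a["citations"] >= b["citations"] else (b, a)
            # Keep only pairs with a clear citation gap (assumed ratio rule).
            if hi["citations"] >= min_ratio * max(lo["citations"], 1):
                pairs.append({"winner": hi["id"], "loser": lo["id"]})
    return pairs

papers = [
    {"id": "A", "field": "cs.LG",   "month": "2023-05", "citations": 120},
    {"id": "B", "field": "cs.LG",   "month": "2023-05", "citations": 15},
    {"id": "C", "field": "math.CO", "month": "2023-05", "citations": 40},
]
print(build_pairs(papers))
```

Paper C has no same-field, same-month partner, so it produces no pair; A clearly out-cites B, so that pair survives.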

They then trained an AI model (based on the Qwen architecture) to act as a “Scientific Judge.”

The Judge reads the title and abstract of two competing papers, thinks step by step, and predicts which one will get more citations. If the AI guesses correctly, it gets a reward. If it’s wrong, it gets nothing.

Through this massive scale of trial and error, the AI started to learn the patterns of what makes an idea impactful.
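The training signal in that loop is just a binary reward over pairs. Here is a toy sketch; the “judge” below is a trivial abstract-length heuristic standing in for the actual Qwen-based model:

```python
def judge_reward(prediction: str, label: str) -> float:
    # Binary reward: 1.0 if the Judge named the truly higher-cited
    # paper, 0.0 otherwise; no partial credit.
    return 1.0 if prediction == label else 0.0

def toy_judge(abstract_a: str, abstract_b: str) -> str:
    # Placeholder policy: prefer the longer abstract. The real Judge
    # reasons step by step over both titles and abstracts.
    return "A" if len(abstract_a) >= len(abstract_b) else "B"

pair = {
    "A": "A short note on X.",
    "B": "A detailed framework for X with theory and experiments.",
    "label": "B",  # ground truth: B received more citations
}
pred = toy_judge(pair["A"], pair["B"])
print(judge_reward(pred, pair["label"]))
```

Scaled to hundreds of thousands of pairs, this sparse right-or-wrong signal is what gradually shapes the model’s sense of impact.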

Now, if you are wondering whether this actually worked, look at the numbers.

Learning scientific judgement is highly scalable. The researchers found a direct relationship between model size and performance. As they scaled up the training data and the model parameters, the AI’s taste got sharper.

The largest model they trained, SciJudge-Qwen3-30B, achieved an 80.6% accuracy rate in predicting the winning papers.

To put that into perspective, they tested the exact same dataset on the best proprietary models in the world.

Gemini 3 Pro scored 75.7%.
GPT-5.2-Thinking scored 72.7%.
DeepSeek-V3.2-Thinking scored 69.9%.

A 30-billion parameter open-source model, trained specifically on community feedback, beat the most advanced AI models on the planet at judging scientific impact.

The numbers above are impressive, but benchmark scores alone don’t convince me. I want to see what happens when the model deals with data it has never seen.

The researchers tested the Scientific Judge on three out-of-domain scenarios to see if it actually learned “taste” or if it just memorized the training data.

  1. It predicted the future.
    The model was trained on papers published up to 2024, but it was tested on new papers from 2025. What it learned still held up on these future papers. Some trained models showed up to a 55-point jump in accuracy over their untrained counterparts.

  2. It crossed disciplines.
    To test true understanding, the researchers trained a version of the model purely on Computer Science papers. Then, they tested it on Mathematics, Physics, and Biology papers. It still worked. It consistently improved impact prediction across all these unseen fields.

  3. It transferred to human peer review.
    Citations happen years after a paper is published. Peer review happens before. The researchers tested the model on papers submitted to the ICLR conference. Instead of predicting citations, the model had to predict which paper got higher peer review scores. The 30B model hit an 87.7% accuracy rate. It successfully mapped citation-derived quality signals onto human acceptance likelihood.

From Judging to Thinking

Learning to judge is only half the picture. A scientist must also propose promising directions.

So, the researchers took their Scientific Judge and used it as a reward mechanism to train a new policy model. They called this one the “Scientific Thinker.”

The goal of the Thinker is to generate new research ideas.

Here is how they trained it. They gave the Thinker a seed paper and asked it to propose a follow-up research idea. The Thinker generated a group of different ideas. The Scientific Judge then compared each idea against the others and assigned each one a win rate.

The Thinker updated its approach based on what the Judge preferred.
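The group-comparison step can be sketched as computing a per-idea win rate, which then serves as the Thinker’s reward. This is a hedged sketch; the `judge` here is any pairwise preference function, stubbed below with a hand-assigned score instead of the trained Judge model:

```python
def group_win_rates(ideas, judge):
    # For each idea, the fraction of head-to-head comparisons it wins
    # against every other idea in its group. In the paper's setup this
    # win rate is the reward signal for updating the Thinker.
    rates = []
    for i, a in enumerate(ideas):
        wins = sum(judge(a, b) for j, b in enumerate(ideas) if j != i)
        rates.append(wins / (len(ideas) - 1))
    return rates

# Toy judge: prefer the idea with the higher hand-assigned "impact" score.
ideas = [{"text": "idea 1", "impact": 0.2},
         {"text": "idea 2", "impact": 0.9},
         {"text": "idea 3", "impact": 0.5}]
judge = lambda a, b: a["impact"] > b["impact"]
print(group_win_rates(ideas, judge))
```

Ideas that beat most of their siblings get win rates near 1, so the Thinker is steadily pushed toward whatever the Judge considers impactful.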

After training, the Scientific Thinker achieved an 81.5% win rate against its own base model. It also beat GPT-5.2 and GLM-5 in head-to-head ideation battles.

My Take

Let me explain this with a real example from the paper.

They gave the AI a starting paper about a problem in Reinforcement Learning. The problem was that models often stick to what they already know and don’t explore new ways of reasoning.

Before training, the base AI suggested a simple solution. It proposed a fixed “distributional priming” method. The idea was to force the model to generate more varied answers early on. It was a decent idea, but very limited and required changes to the pre-training stage.

After training, the Scientific Thinker looked at the same paper again. This time, it came up with a more flexible, in-training solution. It suggested something called “Uncertainty-Guided Exploration.” The idea was to use the model’s own uncertainty to guide it toward less likely but more promising paths.

When strong AI models were used as evaluators to compare both ideas, they all chose the Scientific Thinker’s approach.

Why? Because a change to an algorithm or reward function is much easier for other researchers to adopt than a method that requires retraining the whole model. The Thinker understood that a reusable, plug-and-play solution has a much higher chance of being used and cited.

It understood how the “market of ideas” works.

So, I am not saying that this paper will change everything in AI research.

Also, you might be thinking, if AI can now judge and generate high-impact research, why hasn’t the scientific world felt a massive disruption?

Because the real impact will only be visible when it hits the global research economy.

Right now, most people are still using AI simply to summarize PDFs or fix their grammar. Many institutions label AI tools as “efficiency boosters” or “literature review aids.”

But in two to three years, the cognitive shifts will become much harder to ignore. We are moving from AI that reads science to AI that directs science. We are in a transition phase right now. 

Just think about how far AI for science has come in just the last two years.

Now imagine where we will be in the next two to three years as these models scale up even further.

The bottleneck in science is no longer going to be execution. We have automated coding. We are automating lab work.

The final frontier was always ideation. Knowing what to build next.

If AI is learning scientific taste, it isn’t just speeding up research. It is changing the trajectory of human discovery.

It is something we will realize, bit by bit, as the world keeps changing.

Let me know what you think. Do you feel like AI can truly possess scientific taste?

What was the first moment you realized AI was actually generating good ideas?

The next idea is already on its way. Join my newsletter: https://ninzaverse.beehiiv.com/

Keep Reading