There are two camps in AI right now. Before we get into their debate, it's worth noting that both share the same goal: achieving AGI. But their paths are very different.
One group is spending billions on GPUs. They believe in scaling laws and hope that increasing model size will soon lead to AGI.
The second group thinks differently. They believe that next-token prediction models will never give us true AGI. This group is led by Yann LeCun and other top scientists.
They believe in something called a “world model.” Their idea is that AI should learn by observing the environment, for example by watching videos and building up an understanding of how the world works.
And related to this, an interesting new piece of research just dropped. In this article, we’re going to talk about it.
What is a World Model?
Many of you might already know what a World Model is, because I’ve talked about it before and Yann LeCun also promotes it a lot. But if you don’t, let me explain it in a simple way.
A World Model is an AI that learns how the environment works just by observing it.
And honestly, there is no better example than a toddler to understand this. In fact, Yann LeCun uses this example a lot.
Imagine a toddler playing with a ball.
They learn gravity by dropping it.
They learn momentum by pushing it.
They don’t need to calculate the exact pixel colors of the floor to know the ball will fall. They just understand the core concept.
For a long time, researchers tried to train AI by making it predict the exact pixels of the next video frame. This needs a huge amount of computing power, so as you can guess, it’s slow and expensive.
Then Yann LeCun introduced a concept called JEPA.
JEPA stands for Joint Embedding Predictive Architecture. Instead of predicting heavy pixels, JEPA predicts core concepts in a compressed space. We call this the latent space.
It is basically the AI’s imagination.
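To make that concrete, here is a minimal sketch of the idea in PyTorch. This is my own illustration, not the actual architecture from any paper: the names `Encoder` and `Predictor` and all the sizes are made up. The point is just that the loss is computed between small vectors, never between images.

```python
# A minimal sketch of the JEPA idea (illustrative only): encode the current
# and next frame, then predict the next frame's embedding from the current
# one. The loss lives in latent space, never in pixel space.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

class Predictor(nn.Module):
    def __init__(self, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, z):
        return self.net(z)

encoder, predictor = Encoder(), Predictor()
frame_t, frame_t1 = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)

z_t = encoder(frame_t)     # the concept of "now"
z_t1 = encoder(frame_t1)   # the concept of "next"
z_pred = predictor(z_t)    # the imagined "next"

# Compare concepts, not pixels: a cheap vector distance instead of a
# full image reconstruction.
prediction_loss = ((z_pred - z_t1) ** 2).mean()
```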
Now, you might be thinking this concept is cool and could bring AGI... maybe. But hold on, because there is a big problem here.
When you ask an AI to predict the next concept without forcing it to actually generate pixels, it often finds a shortcut to cheat.
If you still don’t get it, imagine I give you a true-or-false test and you just answer “true” for every question. That means you’re not really thinking at all. (Of course, let’s ignore the case where every answer actually is true. You get what I mean, right?)
AI does the same thing. It maps every image to the same blank representation. So it predicts the future perfectly... because for the AI, the future is always blank.
Researchers call this “representation collapse.”
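And you can see how cheap the cheat is with a toy continuation of the sketch above (again, my own illustration, not anything from the paper):

```python
import torch

# The "lazy" shortcut: an encoder that ignores its input entirely.
def collapsed_encoder(x):
    # Every image maps to the same blank vector.
    return torch.zeros(x.shape[0], 128)

frame_t, frame_t1 = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
z_t = collapsed_encoder(frame_t)
z_t1 = collapsed_encoder(frame_t1)

# Even a do-nothing predictor is now "perfect": the future is always blank.
z_pred = z_t
print(((z_pred - z_t1) ** 2).mean())  # tensor(0.) -- zero loss, zero understanding
```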
To prevent this, scientists started adding patchwork solutions.
They added complex math.
They froze parts of the neural network.
They built systems with six or seven different hyperparameters that need to be tuned perfectly just to make the model work.
And yeah, you’re right, this makes the whole system very complex and unstable.
LeWorldModel
This is where the new research comes in. Yann LeCun and his collaborators built something called LeWorldModel (LmWM), removing all of that complex patchwork.
It lets you train a stable World Model end-to-end, directly from raw pixels. And they did it using just two simple rules.
Rule 1: Predict the next state.
Rule 2: Do not be lazy.
To enforce Rule 2, they used something called SIGReg.
SIGReg is a mathematical regularizer. It forces the AI’s internal thoughts to spread out and follow a natural bell curve. It forces the AI to use its entire brain.
If the AI tries to collapse all answers into one single point, SIGReg stops it.
Because of this one simple rule, LeWorldModel requires only one dial to tune.
Not six. Just one.
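If you want a feel for what that one dial looks like, here is a rough sketch of the spirit of SIGReg. To be clear: the paper has a precise statistical formulation, and this is only my simplified stand-in for it. The idea is to project the embeddings onto random directions and penalize any direction whose distribution drifts away from a bell curve with mean 0 and variance 1.

```python
import torch

def sigreg_sketch(z, num_directions=16):
    """A simplified stand-in for SIGReg (not the paper's exact statistic).
    Project embeddings onto random directions and penalize any direction
    whose mean drifts from 0 or whose variance drifts from 1."""
    d = z.shape[1]
    dirs = torch.randn(d, num_directions)
    dirs = dirs / dirs.norm(dim=0, keepdim=True)   # unit-length directions
    proj = z @ dirs                                # (batch, num_directions)
    mean_penalty = proj.mean(dim=0).pow(2).mean()
    var_penalty = (proj.var(dim=0) - 1.0).pow(2).mean()
    return mean_penalty + var_penalty

embeddings = torch.randn(256, 128)   # healthy, spread-out embeddings
collapsed = torch.zeros(256, 128)    # the lazy solution: one single point
print(sigreg_sketch(embeddings))     # close to 0
print(sigreg_sketch(collapsed))      # large by comparison: collapse is punished

# Rule 1 + Rule 2 in one line, with a single dial `lam` balancing them:
# total_loss = prediction_loss + lam * sigreg_sketch(z)
```

Notice what happens to the lazy solution: collapsing everything to one point means zero variance on every direction, so the penalty cannot be dodged.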
And that makes it a big deal, especially from a speed and compute perspective.
Because LeWorldModel is so simple, it can be trained on a single GPU in just a few hours.
You do not need a massive supercomputer factory to train it.
Also, when the researchers tested it against other models, LeWorldModel planned actions up to 48 times faster than heavy foundation models like DINO-WM.
48 times faster.
This lowers the barrier to entry for anyone trying to build smart robots. In robotics, one very important factor is speed: how quickly your robot responds. When you are controlling a robot in real time, waiting for an AI to think is not an option. You need speed. LeWorldModel completes its full planning loop in under one second.
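For context, here is a hedged sketch of what a planning loop with a world model typically looks like. This is the standard random-shooting recipe, not necessarily the exact planner used in the paper, and every name in it is mine: imagine a bunch of candidate action sequences, roll each one forward in latent space, and execute the first action of the best one.

```python
import torch

def plan(encode, predict, frame, goal_z, horizon=5, num_candidates=256, action_dim=2):
    """Random-shooting planner over a learned world model (illustrative only).
    encode: pixels -> latent, predict: (latent, action) -> next latent."""
    z0 = encode(frame)                                 # where we are now
    actions = torch.randn(num_candidates, horizon, action_dim)
    z = z0.expand(num_candidates, -1)
    for t in range(horizon):
        z = predict(z, actions[:, t])                  # imagine one step ahead
    cost = ((z - goal_z) ** 2).sum(dim=1)              # distance to goal, in latent space
    best = cost.argmin()
    return actions[best, 0]                            # execute only the first action

# Toy stand-ins so the sketch runs end to end; a real run uses trained networks.
encode = lambda frame: frame.flatten(1)[:, :16]
predict = lambda z, a: z + torch.nn.functional.pad(a, (0, 14))
frame = torch.randn(1, 3, 64, 64)
goal_z = torch.zeros(16)
action = plan(encode, predict, frame, goal_z)
```

All of the imagining happens on small vectors, which is exactly why a simple latent-space model can close this loop so quickly.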
Now you might say, yeah speed is great, but how good are these new world models? Do they actually work?
The researchers tested this model on several complex robot environments.
A 2D task where an agent pushes a T-shaped block to a target (Push-T).
A 3D task where a robotic arm picks up a cube and moves it (OGBench-Cube).
A navigation task where an agent moves between rooms.
LeWorldModel matched or beat the heavily complicated models. On the Push-T task, LeWorldModel achieved a 96% success rate.
It even beat models that had secret access to the robot’s internal joint data. LeWorldModel only had access to raw pixels. It figured out the physical space purely by watching.
But my favorite finding of the entire paper is something else. How do you know if an AI actually understands physics?
Psychologists do this with human babies. It is called the “violation of expectation” test. If you show a baby a magic trick where a toy suddenly teleports across the room, the baby looks surprised. Their internal world model knows that is impossible.
The researchers did exactly this with LeWorldModel.
They gave the AI a normal video of a robot moving a block. The AI’s internal surprise level stayed completely low and normal.
Then, they randomly teleported the block across the screen.
The AI’s surprise level spiked drastically. The math literally registered a physical violation.
Interestingly, when they just changed the color of the block midway through the video, the AI was not nearly as surprised. It knew that a simple color change does not break the laws of physics.
But teleportation does.
It learned the rules of physics just by watching pixels.
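If you are wondering what “surprise level” means mechanically, here is one simple way to measure it (my own formulation; the paper may define its metric differently): at every frame, compare what the model predicted against what it actually saw, in latent space. Smooth video keeps the error flat; a teleporting block makes it spike.

```python
import torch

def surprise_curve(encode, predict, frames):
    """Per-frame surprise: how far the observed next embedding lands
    from the model's prediction. frames: (T, C, H, W)."""
    errors = []
    for t in range(len(frames) - 1):
        z_t = encode(frames[t].unsqueeze(0))
        z_next = encode(frames[t + 1].unsqueeze(0))
        z_pred = predict(z_t)
        errors.append(((z_pred - z_next) ** 2).mean().item())
    return errors  # a spike here = "the world just broke physics"

# Toy usage with stand-in networks; a real run would use trained models.
encode = lambda x: x.flatten(1)[:, :8]
predict = lambda z: z
frames = torch.randn(10, 3, 32, 32)
print(surprise_curve(encode, predict, frames))
```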
My Take
There is one more important detail in the research.
Neuroscientists have a theory about the human brain. They believe that as we process complex video over time, our internal brain activity straightens out. Our thoughts become smooth, linear paths.
The researchers tracked the internal latent paths of LeWorldModel as it trained.
Without anyone coding it to do so, the AI’s thoughts became increasingly straight over time. It was a purely emergent phenomenon.
The model found the most efficient way to think on its own. It learned to break down the complex visual world into simple, clear lines of logic.
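Straightness is also easy to measure. One common metric, sketched below (I’m not claiming it’s the paper’s exact definition), is the cosine similarity between consecutive steps of the latent trajectory: a path that never turns scores 1.0, and a random walk scores near 0.

```python
import torch

def trajectory_straightness(latents):
    """latents: (T, D) sequence of embeddings over a video.
    Returns mean cosine similarity between consecutive steps:
    1.0 means the path never turns, lower means it curves."""
    deltas = latents[1:] - latents[:-1]   # step directions
    cos = torch.nn.functional.cosine_similarity(deltas[:-1], deltas[1:], dim=1)
    return cos.mean().item()

straight = torch.arange(10.0).unsqueeze(1) * torch.ones(1, 4)  # a straight line in 4-D
print(trajectory_straightness(straight))            # 1.0
print(trajectory_straightness(torch.randn(10, 4)))  # much lower for a random walk
```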
Just think about how far we have come.
We are moving away from brute force.
We don’t always need billions of parameters or insanely complex equations to make AI understand the world.
Sometimes, giving the AI a simple structure and a simple rule to stay diverse is all it takes.
LeWorldModel is a clear sign that efficiency and simplicity are the next real frontiers in AI robotics. We are starting to build AI that actually understands the physical space around it. And it can do it fast enough to be useful in the real world.
Let me know what you think. Do you feel like simple, efficient models are the true path forward for AI?
What are your thoughts on an AI learning the laws of physics just by watching?
The next idea is already on its way. Join my newsletter: https://ninzaverse.beehiiv.com/


