A million tokens through Claude Opus costs around $25. GPT-5.5 is priced in a similar range. But DeepSeek-V4 can handle the same task at a fraction of that cost.

DeepSeek is doing it at nearly one-tenth of the price.

After DeepSeek released its V4 model, most people on social media compared benchmark scores with their favorite AI models, shared their reactions, and moved on.

But from my point of view, the benchmark scores, good as they were, aren't the interesting part. The really interesting part is that DeepSeek answered a question the big labs haven't been able to answer for the past two years.

How do you make long context cheap?

In this post, I want to explain what they actually did.

Because the "how" matters more than the "what". The "what" is just numbers on a chart. The "how" tells you something about where the entire field is heading.

The sentence that explains the whole game

There's one line in the paper..

In plain English, what it says is this. Same model family, one version apart. The new one processes a million tokens with roughly one-tenth the memory and one-quarter the compute of the old one. Same task. Different math underneath.

And here's the thing.

The reason you pay $25 to push a million tokens through Claude isn't because Anthropic is greedy. It's because attention math is brutal. Every token, in a vanilla transformer, gets compared to every other token. So a million tokens means a trillion comparisons.

The cost goes up quadratically. Double the context, quadruple the cost. That's the wall the entire industry has been running into for two years.
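
If you want that scaling as plain arithmetic, here's a tiny back-of-the-envelope sketch. It's just the math from the paragraph above, nothing from DeepSeek's code.

```python
def pairwise_comparisons(num_tokens: int) -> int:
    # In vanilla attention, every token is scored against every other token.
    return num_tokens * num_tokens

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,} tokens -> {pairwise_comparisons(n):>22,} comparisons")

# 1,000,000 tokens -> 1,000,000,000,000 comparisons: the trillion from above.
# Double the context and the count quadruples.
```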

DeepSeek didn't break the wall. They built a tunnel under it.

What they actually did

I'll skip the equations. The real idea is something you already do every time you read a long book.

When you're on chapter 20, you don't keep chapters 1 through 19 word-for-word in your head. You keep a compressed version. The gist. The few sentences that actually mattered.

But the paragraph you just read? That stays sharp. Full fidelity. You can quote it.

That's basically what DeepSeek built. They call it Compressed Sparse Attention and Heavily Compressed Attention. Underneath, it's two reading modes layered together.

Mode one is the close-up reader. It takes every 4 tokens and compresses them into 1. Then, for any question the model is trying to answer, it scans these compressed chunks and picks out only the most relevant 1024 of them. Not the whole document. Just the parts that actually matter for the question being asked.

Think of it like searching your memory for one specific scene from a novel. You don't replay the entire book. You jump to the part you need.
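
Here's a toy Python sketch of that close-up mode, just to make the idea concrete. The names, the mean-pooling compression, and the shapes are my own illustration of the description above, not DeepSeek's actual implementation.

```python
import numpy as np

def close_up_attention(query, keys, block_size=4, top_k=1024):
    """Compress keys in blocks of `block_size`, score each compressed chunk
    against the query, and attend over only the `top_k` most relevant ones."""
    n, d = keys.shape
    n_blocks = n // block_size
    # Compress: average every `block_size` tokens into one summary vector.
    chunks = keys[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    # Score every chunk against the query, then keep only the best ones.
    scores = chunks @ query
    keep = np.argsort(scores)[-top_k:]
    # Softmax over the selected chunks only; the rest are never touched.
    weights = np.exp(scores[keep] - scores[keep].max())
    weights /= weights.sum()
    return weights @ chunks[keep]

rng = np.random.default_rng(0)
query = rng.standard_normal(64).astype(np.float32)
keys = rng.standard_normal((100_000, 64)).astype(np.float32)  # a long context
out = close_up_attention(query, keys)  # attends over 1,024 chunks, not 100,000 tokens
print(out.shape)  # (64,)
```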

Mode two is the big-picture reader. It compresses much harder. Every 128 tokens get squished into 1. You lose the details, but you keep the shape. The tone. The general sense of what's been happening.

Like remembering a book you read years ago. You can't quote it. But you remember what it was about.

DeepSeek alternates between these two modes across the model's layers. Some layers zoom in. Some layers zoom out.
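
Continuing the same toy sketch (it reuses `close_up_attention` and numpy from above), the big-picture mode is the same compression trick with a much larger block and no filtering. The even/odd layer split below is only an illustration of the alternation, not the model's actual schedule.

```python
def big_picture_attention(query, keys, block_size=128):
    """Same idea, much harder compression, no top-k filter: every chunk
    contributes, but only as a coarse 128-token summary."""
    n, d = keys.shape
    n_blocks = n // block_size
    chunks = keys[: n_blocks * block_size].reshape(n_blocks, block_size, d).mean(axis=1)
    scores = chunks @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ chunks

def attend(layer_index, query, keys):
    # Illustrative alternation only: even layers zoom in, odd layers zoom out.
    if layer_index % 2 == 0:
        return close_up_attention(query, keys)
    return big_picture_attention(query, keys)

outs = [attend(layer, query, keys) for layer in range(4)]  # in, out, in, out
```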

And the two modes do different jobs. The close-up reader is for finding things. Names, facts, instructions buried somewhere earlier in the conversation. The big-picture reader is for context. The running feel of what the document is about.

Both matter. Neither is enough on its own. Most architectures pick one and accept the tradeoff. DeepSeek refused to pick.

The combined effect is a model that can hold a million tokens in working memory without melting your GPU.

That's the headline. But it's not the only trick.

They also store the model's weights as 4-bit numbers instead of the 16-bit numbers most labs use. A quarter of the memory. Same model, smaller footprint.
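
A quick toy illustration of why that works out to a quarter: two 4-bit values fit in one byte, while a 16-bit value needs two bytes on its own. Real 4-bit formats also carry per-group scale factors, which this sketch skips.

```python
import numpy as np

values = np.arange(8)                    # eight toy weight values, 0..7
w16 = values.astype(np.float16)          # 16-bit storage: 2 bytes per weight
q4 = values.astype(np.uint8)             # each value fits in 4 bits...
packed = (q4[0::2] << 4) | q4[1::2]      # ...so two of them share one byte
print(w16.nbytes, "bytes at 16-bit vs", packed.nbytes, "bytes packed at 4-bit")  # 16 vs 4
```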

And out of 1.6 trillion parameters in the model, only 49 billion of them actually fire when you ask it a question. The other 97% sits idle until it's needed. Like having a giant team where most people stay quiet unless their specific expertise comes up.
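
Here's a minimal routing sketch of that idea, with made-up sizes. The point is just that a router picks a few experts per token and the rest never run.

```python
import numpy as np

# Toy mixture-of-experts routing sketch. The sizes (32 experts, 2 active per
# token) are illustrative only, not V4's actual configuration.
rng = np.random.default_rng(1)
d, n_experts, top_k = 64, 32, 2

experts = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)]
router = rng.standard_normal((n_experts, d)) / np.sqrt(d)

def moe_forward(x):
    scores = router @ x                      # the router scores every expert...
    active = np.argsort(scores)[-top_k:]     # ...but only the top few actually run
    gates = np.exp(scores[active] - scores[active].max())
    gates /= gates.sum()
    return sum(g * (experts[i] @ x) for g, i in zip(gates, active))

y = moe_forward(rng.standard_normal(d))
print(f"{top_k} of {n_experts} experts ran; the other {n_experts - top_k} never touched the token.")

# For the real numbers above: 49e9 / 1.6e12 is about 3%, which is where the
# "other 97% sits idle" figure comes from.
```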

There's more. They offload memory to SSDs and stream it back when needed. They use a custom optimizer that converges faster than the standard one. They fused their networking code so the model isn't sitting around waiting for data to move between GPUs.

None of these ideas are new on their own. What's new is that they put all of them in one model. And got it to actually work at scale.

The paper itself admits the architecture is, quote, "relatively complex" and that future versions will try to simplify it. They duct-taped many ideas together because constraint forces creativity. They didn't have the option of throwing more H100s at the problem. So they made every H100 do more.

That's the whole story, really. Not a single breakthrough. A dozen small ideas, stacked thoughtfully, by people who couldn't afford the brute-force route.

My take

The big closed labs have been optimizing for benchmarks for two years. Smarter answers, harder reasoning, better agent performance. That's where the rewards are. That's what the leaderboards measure. That's what gets shipped.

Long context efficiency? That's a backend problem. Nobody screenshots their attention FLOPs on Twitter. So it gets deprioritized. Until a lab that has every reason to care about cost.. comes along and publishes a paper that says yeah, this thing you've been treating as expensive doesn't actually have to be.

I think that pattern is going to keep repeating. Not just on context. On training cost. On inference latency. On every "boring" engineering problem the closed labs treat as solved enough.

Because here's the asymmetry. When OpenAI optimizes a kernel, you don't see it. It just shows up as a small price drop six months later. When DeepSeek optimizes a kernel, they publish 60 pages on exactly how they did it. And then somebody at Mistral reads it. Somebody at Qwen reads it. Somebody in a university lab reads it.

Three months later there are five open implementations.

For the average person using ChatGPT, this is invisible. You don't see the cost, you just see the answer. But for anyone building products on top of these APIs and watching their bill at the end of the month, this is everything. It's the difference between AI as a feature you sprinkle into your product casually, and AI as a feature you have to ration because the unit economics don't work yet.

The closed labs still have the lead on raw capability. They probably will for a while. But the gap on efficiency, on serving cost, on the parts of the stack that decide whether AI is a luxury or a utility.. that gap is collapsing. Fast.

Most people aren't running multi-agent systems against their entire codebase right now. They aren't pouring million-token contexts into a model every five minutes. They aren't doing it because it would cost them $400 a day.

What happens when it costs them $4?

I don't know yet. I don't think anyone does.

Let me know what you think.

If you made it this far, you're not a casual reader. You actually think about this stuff.

So here's my ask. If this article made you think, even a little, share it with one person. Just one. Someone who's in the AI space. Someone who reads. Someone who would actually sit with these ideas instead of scrolling past them.

That's how this newsletter grows. Not through ads or algorithms. Through you sending it to someone and saying "read this."
