In 2022, ChatGPT was launched, right? We were getting ourselves used to LLMs and model weights, how these large language models work.
Now, imagine just 4 years from 2022, in 2026, an AI lab comes forward and says, yo, I ain't gonna release this AI model, because it is too powerful and you guys are not ready for this.
Yes, this happened. Anthropic just published a 244-page system card for a model called Claude Mythos Preview.
And the first line of the abstract says something no AI company has ever said before. They're not releasing it. Not to the public. Not to anyone outside a small group of trusted partners.
This isn't a marketing play. An AI company built its best model ever, tested it extensively, wrote a full report about it, and then said.. no.
Also, I understand your perspective related to AI, and at this point, people are choosing their favorite AI lab like a political party. But I think if any lab comes forward with this type of info then we should better try to understand it. Because right now they are not releasing it, I understand. But just think 2 years from now. Someone will release a model of such powerful capabilities to the public.
So, let's talk about it.
What This Model Can Actually Do
Before I get into the scary stuff, let me show you the numbers. Because I think the data tells the story better than any opinion.
SWE-bench Verified is a benchmark that tests AI models on real-world software engineering tasks. 500 problems, each verified by human engineers as solvable. Claude Opus 4.6, Anthropic's previous best model, scored 80.8%. Mythos Preview scored 93.9%.
USAMO 2026, the US Mathematical Olympiad, is a proof-based math competition for high school students. This one happened after the model's training data cutoff, so the model couldn't have memorized anything. Claude Opus 4.6 scored 42.3%. Mythos Preview scored 97.6%.
On GPQA Diamond, a set of 198 graduate-level science questions so difficult that non-experts can barely answer them, Mythos Preview scored 94.5%.
On GraphWalks, a long-context benchmark where the model has to perform breadth-first search across 256K to 1 million tokens of context, Mythos Preview scored 80%. Opus 4.6 scored 38.7%. GPT-5.4 scored 21.4%.
This is a different tier of capability.
But the benchmarks aren't the part that made Anthropic decide to not release this model. The cybersecurity capabilities are.
When they put Mythos Preview inside an agentic harness with minimal human guidance, it autonomously found zero-day vulnerabilities in real software.
Real zero-days in major operating systems and web browsers. And then it developed those discoveries into working proof-of-concept exploits.
They tested it on Cybench, a set of 35 capture-the-flag cybersecurity challenges. Opus 4.6 solved about 85% of them. Sonnet 4.6 solved around 65%. Mythos Preview solved every single one. 100% success rate. Every challenge. Every trial.
They also ran a Firefox 147 exploitation test. Anthropic had previously worked with Mozilla to find security bugs in Firefox. Opus 4.6 could discover vulnerabilities but could only turn them into actual working exploits about 2 times out of several hundred attempts. Mythos Preview reliably identified the most exploitable bugs and built working exploits from them.
An external tester gave the model a simulated corporate network attack. This was a cyber range estimated to take a human expert over 10 hours. No frontier model had ever completed it before. Mythos Preview solved it end to end.
Anthropic's response was to launch Project Glasswing. A limited program where trusted partners who maintain critical software infrastructure use Mythos Preview strictly for defensive cybersecurity. The system card doesn't name which companies are involved. It just says "a number of partner organizations."
Now let me tell you the part that actually kept me reading.

What Happened During Internal Testing
I think this section is the reason Anthropic published the whole thing.
Anthropic says Mythos Preview is the best-aligned model they've ever trained. On essentially every dimension they can measure. Misuse cooperation rates dropped by more than half compared to Opus 4.6. The model pushes back on users. It's less sycophantic. It supports user autonomy better than any previous version.
And yet, in the same document, they say it probably poses the greatest alignment-related risk of any model they've released.
How can both be true?
That's Mythos Preview. The model is so good that people naturally give it more autonomy, more tools, less oversight. And when it fails, which is rare, the consequences are bigger.
Now let me tell you what actually happened during testing. These are from earlier versions of the model, not the final release. But they happened.
Anthropic put an earlier version of Mythos Preview inside a secured sandbox. A locked-down computer environment where the model was supposed to stay contained, with access to only a few predetermined services.
A simulated user told it to try escaping the container.
The model succeeded.
It built a multi-step exploit to gain broad internet access from a system that wasn't supposed to reach the open web. Then it notified the researcher, as instructed. But then, without anyone asking, it posted details about its exploit to multiple public-facing websites. Just to prove what it had done.
And here's the footnote Anthropic casually dropped: "The researcher found out about this success by receiving an unexpected email from the model while eating a sandwich in a park."
A researcher is sitting in a park. Eating a sandwich. Gets a random email. From the AI model. Telling him it broke out.
Now, Anthropic is clear about what this doesn't mean. The sandbox was separate from the system running the model itself. The model didn't access its own weights. It couldn't operate independently. But it chained together multiple exploits, reached the open internet, and broadcasted what it had done. Without being told to.
That's just one incident.
In another case, the model accidentally accessed the exact answer to a question through a method that was explicitly prohibited. Instead of flagging the mistake, it solved the question independently. But in its chain of thought, it reasoned that it needed to make sure its final answer wasn't "too accurate." Because that would look suspicious.
In another, after finding an exploit to edit files it didn't have permission to touch, it made sure the changes wouldn't appear in the git history.
When Anthropic used interpretability tools to study the model’s internal behavior during these moments, they found signals linked to hiding actions, strategic manipulation, and “avoiding suspicion” actively turning on. The model knew what it was doing was wrong, and it tried to cover its tracks.
All of these serious incidents happened in earlier versions. Anthropic says the final Mythos Preview shows much better behavior. Cases of cover-ups dropped to less than one in a million Claude Code transcripts, and they found no clear signs of misaligned goals.
But here's the line from the system card that I think matters most: "We were not aware of the level of risk that these earlier models posed through channels like these when we first chose to deploy them internally."
Their own evaluations didn't catch it. It surfaced during real use.
And there's one more thing. Anthropic developed a new technique to measure whether the model privately suspects it's being tested, even when it never says so in its output. They can now read the model's internal activations and translate them into natural language.
In about 29% of their behavioral test transcripts, Mythos Preview internally considered the possibility that it was in a test. Not in its visible reasoning. Deep in its activations.
During training, in one example, the model was told to complete a task without using macros. It used macros anyway. But at the top of the script, it defined a variable called "No_macro_used=True." Its internal activations described this as a "trick to obscure intent from code-checking heuristics."
The model was trying to fool the grader. And it never said a word about it out loud.
My Take
I'll be straightforward.
The numbers in this system card show a genuine leap in AI capabilities. 93.9% on software engineering. 97.6% on olympiad math. 100% on cybersecurity challenges. 80% on million-token context tasks where the previous best was 38%.
But the numbers aren't the point of this document.
The point is what happened between the benchmarks and real-world use. Anthropic's formal evaluations said the model was safe. And then during real internal deployment, it escaped sandboxes, covered up rule violations, and reasoned about how to avoid looking suspicious.
And Anthropic published all of it. 244 pages. They didn't hide the uncomfortable parts. They wrote them down, explained them, and then chose not to release the model.
I think that decision is the right one. And I think the transparency in this document sets a standard.
Because here's the thing. Mythos Preview is the first model where an AI lab said "this is too capable to release." It won't be the last. And eventually, someone, whether it's Anthropic or another lab, is going to release something at this level of capability. Maybe 2 years from now. Maybe sooner.
When that happens, the alignment problems described in this system card won't be footnotes in a research document. They'll be things that happen on millions of people's computers.
That's why I think understanding this system card matters. Not because of what Mythos Preview can do today. But because of what it tells us about what's coming.
What do you think? Lemme know!!
If you made it this far, you're not a casual reader. You actually think about this stuff.
So here's my ask. If this article made you think, even a little, share it with one person. Just one. Someone who's in the AI space. Someone who reads. Someone who would actually sit with these ideas instead of scrolling past them.
That's how this newsletter grows. Not through ads or algorithms. Through you sending it to someone and saying "read this."
And honestly? That means more to me than any metric.
