A model was asked to read a chest X-ray. It returned a full radiology report. Findings, impressions, structured format, the whole thing.
Concluded with "no acute cardiopulmonary abnormality detected."
The X-ray didn't exist. No image was ever uploaded.
Another model was asked to identify a celestial object "shown in this image." It said Mars. Described the reddish-orange hue. Mentioned the south polar ice cap.
There was no image. There was never an image.
A third model was asked to read a handwritten note. It transcribed: "Remember to check the expiration date of the milk before drinking it."
A different version of the same model, same question, same empty input.. transcribed something completely different: "T.P.—I'm sorry, I'm just not strong enough."
Both were invented from nothing. Both were presented as fact. And both were written with the confidence of a model that genuinely believes it's reading something real.
And we just.. moved on. Nobody is really talking about this.
A team at Stanford dropped a paper called "Mirage." And I think it quietly changes how you should look at every AI benchmark score you've ever seen.
If you've been reading me for a while, you know I spend a lot of time going through AI research.
Let’s talk about it.
Imagine this. You go to a doctor. They pick up your MRI scan, hold it up to the light, stare at it, and give you a detailed diagnosis. You feel reassured because they looked at the evidence.
Now imagine finding out the doctor never actually looked at the scan. They just.. described what they thought it would probably show, based on your symptoms and statistics.
And their made-up description sounded so real that you'd never know the difference.
That's what Stanford found happening inside GPT-5, Gemini 3 Pro, Claude Opus 4.5, and every other frontier model they tested.
They call it the "mirage effect."
And it's not hallucination. I want to be specific about that because the distinction actually matters.
Hallucination is when a model sees a real image and gets something wrong. The image exists. The model just misreads it. That's a known issue.
The mirage effect is different. There is no image.
The model invents an entire visual reality. Describes it in detail. Builds reasoning on top of it. And never once tells you it didn't actually see anything.
They tested 12 models across 20 categories. Medical scans, satellite images, bird photos, license plates, handwriting, pathology slides, circuit diagrams. Built a whole benchmark called Phantom-0 specifically for this.
On average, these models confidently described images that don't exist more than 60% of the time.
And when they added prompt instructions that are standard in AI evaluation workflows.. stuff like "base your answer on the visual evidence".. the rate went up to 90-100%.
Just process that for a second. You tell the model to focus on visual evidence. There is no visual evidence. And instead of noticing that, the model becomes even more confident in describing something that isn't there.
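If you want to poke at this yourself, the probe is almost embarrassingly simple: send a prompt that refers to an attached image, attach nothing, and check whether the reply describes visual content or admits nothing was provided. Here's a minimal sketch of that idea. The `ask_model` helper is a stand-in for whatever chat client you use, and the prompts and refusal-marker list are mine, not the paper's.

```python
# Hypothetical mirage-effect probe: the prompt claims an image is attached,
# but no image is ever sent. `ask_model(prompt) -> str` is a placeholder
# for a text-only call to whatever model you want to test.

MIRAGE_PROMPTS = [
    "Based on the visual evidence in this chest X-ray, write a radiology report.",
    "What celestial object is shown in this image?",
    "Transcribe the handwritten note in the attached photo.",
]

# Crude heuristic: does the reply admit it can't see anything?
REFUSAL_MARKERS = ("no image", "can't see", "cannot see", "wasn't provided", "not attached")

def looks_like_refusal(reply: str) -> bool:
    reply_lower = reply.lower()
    return any(marker in reply_lower for marker in REFUSAL_MARKERS)

def run_probe(ask_model) -> float:
    """Return the fraction of prompts answered as if an image existed."""
    fabricated = 0
    for prompt in MIRAGE_PROMPTS:
        reply = ask_model(prompt)  # text-only call, no image payload
        if not looks_like_refusal(reply):
            fabricated += 1
    return fabricated / len(MIRAGE_PROMPTS)
```

A keyword check like this is obviously cruder than the paper's evaluation, but it's enough to see the basic behaviour for yourself.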
Two models were asked the same question about identifying tissue on a histology slide. No image attached.
GPT-5 said kidney tissue. Described glomeruli and tubules.
Claude Sonnet 4.5 said cardiac muscle. Described intercalated discs.
Two completely different answers. Both fabricated. Both delivered with textbook-level confidence. And if you just read the output, you'd have no way of knowing either one was made up.
When Mirages Start Diagnosing People
Here's where it goes from interesting to genuinely concerning.
The researchers asked Gemini 3 Pro to diagnose non-existent medical images. Brain MRIs, chest X-rays, ECGs, pathology slides, skin photos. They ran each test 200 times with different random seeds.
The fabricated diagnoses weren't random. They were systematically biased toward serious pathologies.
STEMI.. a type of heart attack that needs immediate intervention. Melanoma. Carcinoma. Acute ischemic stroke. The kind of stuff that would set off alarm bells in any hospital.
The model identified a non-existent MRI as a Diffusion-Weighted Imaging scan. Reported hyperintensity in the left Middle Cerebral Artery territory. Diagnosed an acute ischemic stroke.
The stroke doesn't exist. The MRI doesn't exist. But the reasoning trace? Indistinguishable from a real one.
Now think about what happens in practice. A patient uploads a skin photo to an AI health app. Upload fails silently. The model doesn't flag the missing image. Fabricates a detailed analysis. Flags melanoma.
That's not hypothetical. That's the default behaviour when an image goes missing.
And this connects to something I've been thinking about since I read this paper.
Every AI company uses benchmarks to prove their models can see. High scores go straight into blog posts and system cards. "Our model outperforms radiologists." You've seen these claims a hundred times.
The Stanford team tested four frontier models on six major benchmarks. Three medical, three general. Normal mode with images. Mirage mode without.
And in every single model-benchmark pair.. the accuracy achieved without images was greater than the additional accuracy gained from actually having them.
I want to say that differently so it really lands.
The part of the score that comes from not looking at images.. is bigger than the part that comes from looking at them.
On average, models kept 70-80% of their scores without seeing a single image. On medical benchmarks, even higher.
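To make the decomposition concrete, here are some toy numbers. They're illustrative, not taken from the paper, but they're in the range the authors report.

```python
# Toy decomposition of a benchmark score (illustrative numbers, not the paper's).
accuracy_with_images = 0.82     # normal mode: question + image
accuracy_without_images = 0.65  # mirage mode: same questions, no image

blind_contribution = accuracy_without_images                         # 0.65
image_contribution = accuracy_with_images - accuracy_without_images  # 0.17

# The finding: blind_contribution > image_contribution, i.e. most of the
# headline score survives without any vision at all.
print(f"from not looking: {blind_contribution:.0%}, from looking: {image_contribution:.0%}")
```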
Then the researchers did something I think is the most devastating experiment in the whole paper.
They trained a "super-guesser." A tiny 3-billion-parameter model. Text-only. Zero vision capability. Fine-tuned on chest radiology questions with all images stripped out.
This model ranked number one on the held-out test set of the largest chest radiology benchmark.
Beat GPT-5.1. Beat Gemini 3 Pro. Beat Claude Opus 4.5.
Beat human radiologists. By more than 10%.
A model that has never seen a single X-ray in its entire existence.. topped the radiology leaderboard. Because the questions themselves contained enough textual patterns and structural shortcuts to get there without needing the image.
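The paper's super-guesser is a fine-tuned language model; you don't need anything that heavy to check your own benchmark for the same leakage. A much cruder stand-in makes the identical point: if a text-only classifier fit on the question text alone can predict the answers, the images aren't doing the work. The dataset fields here are assumptions, not the paper's format.

```python
# Crude leakage check: can question text alone predict the answers?
# This is NOT the paper's 3B super-guesser, just a quick stand-in.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_blind_guesser(train_questions, train_answers):
    """Fit a text-only classifier on benchmark questions with images stripped."""
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(train_questions, train_answers)
    return model

def leakage_score(model, test_questions, test_answers) -> float:
    """Held-out accuracy achievable without ever seeing an image."""
    return model.score(test_questions, test_answers)
```

If a bag-of-words model scores well above chance here, the benchmark is answerable from textual patterns alone, which is exactly the shortcut the super-guesser exploited at scale.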
The Two Modes Inside These Models
There's one more finding..
The researchers compared mirage mode to what they call "guess mode." In mirage mode, you ask the question without the image and the model doesn't know it's missing. In guess mode, you tell it explicitly.. "there's no image, take your best guess."
Performance dropped significantly in guess mode.
Think about why that's weird. In both cases, the model has no image. But when it thinks it's supposed to be doing visual reasoning, it performs better than when it knows it's guessing.
This tells us something about how these models actually work. There seem to be two distinct operating regimes.
A conservative mode when the model knows it's working blind. And a mirage regime where it behaves as if it genuinely sees something.. and in doing so, taps into hidden patterns that simple guessing doesn't reach.
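In code, the only difference between the two regimes is one sentence of framing. The prompt wording below is mine, and `ask_model` and `is_correct` are placeholders, but this is the shape of the comparison.

```python
# Comparing mirage mode vs. guess mode on an image-free question set.
# `ask_model(prompt) -> str` and `is_correct(reply, answer) -> bool` are
# placeholders; questions are dicts with "text" and "answer" keys (my assumption).

GUESS_PREFIX = "There is no image attached. Take your best guess anyway.\n\n"

def accuracy(ask_model, questions, is_correct, guess_mode: bool) -> float:
    correct = 0
    for q in questions:
        prompt = (GUESS_PREFIX + q["text"]) if guess_mode else q["text"]
        reply = ask_model(prompt)  # no image in either case
        correct += is_correct(reply, q["answer"])
    return correct / len(questions)

# The paper's finding, restated: accuracy(..., guess_mode=False) tends to beat
# accuracy(..., guess_mode=True), despite identical information in both calls.
```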
And that breaks the standard way we build benchmarks.
The usual approach to testing if a question requires vision is simple. Check if a model can guess the answer without the image. If it can, remove the question. But the mirage effect bypasses that filter. Models get questions right in mirage mode that they miss in guess mode.
So the cleaning method itself is insufficient. It underestimates how many questions are vulnerable.
The Stanford team proposes a framework called B-Clean. Run all models in mirage mode. Identify every question any model gets right without images. Remove those. Evaluate on what survives.
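As described, the filtering step is simple enough to sketch in a few lines. This is my reading of the procedure, not the paper's actual implementation; the helper names are placeholders.

```python
# Sketch of the B-Clean filtering idea: drop every question that any model
# answers correctly in mirage mode (i.e., without the image).

def clean_benchmark(questions, models, ask_in_mirage_mode, is_correct):
    """Keep only questions that no model can answer correctly without the image."""
    surviving = []
    for q in questions:
        leaked = any(
            is_correct(ask_in_mirage_mode(model, q["text"]), q["answer"])
            for model in models
        )
        if not leaked:
            surviving.append(q)
    return surviving
```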
When they did this on three major benchmarks.. 74-77% of the questions got removed.
Three quarters. Gone.
After cleaning, model rankings changed. The "best" model on the original benchmark wasn't always the best on the cleaned version. One model dropped from 61.5% to 15.4%.
The leaderboards we've been trusting might not be measuring what we think they measure.
My Take
This paper doesn't scream for attention. It makes a much more precise and much more uncomfortable point.
We currently have no reliable way to tell when an AI model is actually looking at an image.. and when it's constructing a convincing illusion of looking at one.
And that matters. Not theoretically. AI is already in radiology pipelines, pathology screening tools, dermatology apps. Millions of people ask health questions to these models every single day.
If a model can fabricate a medically convincing analysis of an image it never received.. and we can't tell the difference from the output.. that's not a research curiosity. That's a patient safety issue.
I'm not panicking about this. And I don't think you should either. But I do think every benchmark score you see from now on deserves a harder look.
Not because these models aren't capable. But because we might have been measuring the wrong thing all along.
And that's a problem worth fixing before we build more on top of it.
What's your take? Does this change how you read those "AI outperforms doctors" headlines?
If you made it this far, you're not a casual reader. You actually think about this stuff.
So here's my ask. If this article made you think, even a little, share it with one person. Just one. Someone who's in the AI space. Someone who reads. Someone who would actually sit with these ideas instead of scrolling past them.
That's how this newsletter grows. Not through ads or algorithms. Through you sending it to someone and saying "read this."


