ARISE Is Rewiring Medical AI

The headlines scream it daily. AI passes the boards. AI beats the experts.

But what does that actually mean? We are running out of ways to judge tools that move faster than our understanding of them. Enter the ARISE Network.

Harvard and Stanford physicians teamed up to answer the messy questions nobody wants to ask. What holds up in a chaotic ER? What is clinical reasoning, really? When does a human fail? When does the machine?

“We need to define what medical superintelligence looks like.”

The Magician Doctor

Jonathan Chen knows a thing or two about illusions.

He is a physician, a data scientist, and a performing magician. His background is unusual. He went to college at 13. He coded for years before getting an MD and PhD. Now he teaches at Stanford.

He treats LLMs like magic tricks.

The rule? Misdirection. Everyone stares at the right hand. Chen watches the left. When ChatGPT hit the scene, he didn’t cheer. He looked for where it failed. He looked for the glitch in the matrix.

In late 2024 his team found something strange. LLMs working alone diagnosed patients better than doctors using AI. Better than doctors working alone, too.

This broke the golden rule. The old dogma said doctors plus AI would beat either group alone. The combination didn’t.

Why? Timing. Doctors treated the AI like Google. They didn’t trust it.

Chen’s team tried again. They built a customized model. They taught doctors how to talk to it. Real-time collaboration.

The results flipped. Doctors with AI beat doctors alone. They matched the AI, too. Still not beating it, but close.

Same result in management tasks. ARISE is now funding a “flight simulator” for medicine. Practice makes perfect, they hope.

This leaves a ghost question haunting the field. If the robot beats the human-robot team, what were we even testing?

The Historian’s Warning

Adam Rodman talks fast. He thinks faster.

A Harvard internist and medical historian. He says this isn’t new. Stethoscopes changed us. Penicillin changed us. EHRs ruined our backs.

But AI moves higher. It eats cognition. It eats thinking itself.

Rodman digs into the dirt. He points to World War II. Signal detection theory gave us sensitivity and specificity. In 1959, two guys named Ledley and Lusted wrote that diagnosis is just math. Probability. Logic.
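
What does "diagnosis is just math" look like in practice? Here is a minimal sketch, not anything taken from Ledley and Lusted's paper: Bayes' rule applied to a single test result. The function names and the example numbers (a 10% prior, a test with 90% sensitivity and 80% specificity) are hypothetical, chosen only to illustrate the idea.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Compute sensitivity and specificity from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of sick patients the test catches
    specificity = tn / (tn + fp)  # share of healthy patients the test clears
    return sensitivity, specificity


def post_test_probability(prior, sensitivity, specificity, positive=True):
    """Update the probability of disease after one test result (Bayes' rule)."""
    if positive:
        true_pos = sensitivity * prior
        false_pos = (1.0 - specificity) * (1.0 - prior)
        return true_pos / (true_pos + false_pos)
    false_neg = (1.0 - sensitivity) * prior
    true_neg = specificity * (1.0 - prior)
    return false_neg / (false_neg + true_neg)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    sens, spec = sensitivity_specificity(tp=90, fn=10, tn=80, fp=20)
    print(round(post_test_probability(0.10, sens, spec, positive=True), 3))
    print(round(post_test_probability(0.10, sens, spec, positive=False), 3))
```

With those made-up numbers, a positive result lifts a 10% suspicion to about one in three. A negative result drops it to just over 1%. Tidy. Which is exactly Rodman's point about how computers think.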

Computers followed. Then medical schools did too.

We train residents with rigid frameworks. Nomograms. Rules of thumb. These reflect how computers think, Rodman says. Not how experts actually behave. Real doctors are fast. Intuitive. Non-linear. Messy.

Now we are building AI to mimic the abstract rules. Not the human mess.

So the benchmarks got stale. The New England Journal of Medicine cases. Classic vignettes.

GPT-4 nailed two-thirds of them. Impressive? Sure. Limited? Also yes. Benchmarks get saturated. Physicians aren’t comparators; they are variables.

Rodman went deeper. The journal Science published their new findings. OpenAI’s o1 model crushed doctors on historical tasks.

Worse, it beat real Harvard internists at reading EHR data for emergency cases. 76 patients. Real chaos.

“What we need now are prospective clinical trials.”

Rodman isn’t impressed by the hype. He wants RCTs. He wants patient outcomes, not benchmark points.

The Bridge Builder

Ethan Goh fits in a hospital. And in Silicon Valley.

Hospitalist. Policymaker in Singapore. NHS advisor. Startup exec. Now the executive director of ARISE at Stanford. He speaks the languages of both worlds.

He hates standardized tests for AI evaluation. The USMLE tests memory. Medicine tests intuition. Patients don’t read from textbooks. They arrive bleeding. Confused. Complex.

A high score doesn’t make a good doctor. A high score doesn’t save lives.

The field is moving toward simulations. Rubrics instead of binary answers.

Goh pushes further. Benchmarks need to be precise. Triage. Diagnosis. Treatment. Communication. Each needs a different bar.

Enter the MAST. The Medical AI Superintelligence Test.

It covers diagnosis. Management. Safety. Agentic workflows. It benchmarks against real doctors. Not just models in a vacuum.

One part of MAST is called NOHARM. It tracks harm. How often does an LLM give bad advice?

Top models still fail 22% of the time. Mostly omissions. Leaving things out.

Surprisingly? They were still safer than generalist physicians. And ensembles—combining models—were safer than single ones.

Then there is MedAgentBench. Can the AI order meds? Aggregate labs? Do it in a real FHIR-based EHR?

In mid-2025, the best model hit 70%. It choked once a task ran past step two.

Six months later. Anthropic dropped Opus 4.6. It hit 92%.

Speed. The clock is ticking. The benchmark is dead before it is published.

So ARISE built PhysicianBench. Multi-step tasks. Real execution.

It will last six months, maybe.

Then comes the cliff. If AI reaches superintelligence, meaning it outperforms top clinicians on meaningful tasks under real conditions, we break the scoreboard.

Like AlphaGo’s Move 37. It baffled the human masters because it looked irrational to them. It was also the right move.

Concordance with experts will vanish.

We will be forced back to reality. Randomized controlled trials. Hard outcomes. Did the patient get better? Or worse?

The Open Ending

Chen sees the shift coming. Rodman expects it. Goh prepares for it.

They agree on one thing. The definition of “doctor” is changing.

Whether they are right remains unproven.

AI is prying open the door on clinical reasoning. It asks who we are. Where we help. When we should step aside.

The questions are no longer academic. They are clinical. They are immediate.

And we have no perfect answers. Yet. 🩺