Why MedAgentBench and HealthBench Miss Real-World Bugs
HealthBench and MedAgentBench test clinical accuracy. Production agents fail on portal navigation, state management, and recovery. Here's the gap.

TL;DR
- Clinical-accuracy benchmarks like HealthBench and MedAgentBench do not catch operational bugs. Real agents fail on integration format errors, session timeouts, portal navigation, and recovery, none of which these benchmarks test.
- Claude 3.5 Sonnet tops MedAgentBench at 69.67% on 300 FHIR tasks (Jiang et al., arXiv 2501.14654), but the benchmark uses clean Stanford data and a single pre-authenticated interface.
- A 2025 JAMIA survey found only 19% of health systems report high success with imaging AI despite 90% deployment, and 77% cite immature tools as the top barrier (PMC).
- Operational evaluation needs multi-interface scenarios, failure injection, and messy data, not another multiple-choice suite.
The benchmark gap
Two benchmarks dominate healthcare AI evaluation today. HealthBench, released by OpenAI in May 2025, grades models across 5,000 multi-turn conversations with 48,562 rubric criteria built by 262 physicians from 60 countries. MedAgentBench, developed at Stanford and published in NEJM AI, scores agents on 300 FHIR tasks across 100 patient profiles and 700,000+ data points.
Both are serious contributions. Neither predicts whether your agent will survive contact with a real payer portal.
That gap matters because production healthcare AI is missing its mark. A 2025 JAMIA study found only 19% of health systems report high success with imaging AI despite 90% deploying it, and only 38% succeed with clinical risk stratification (PMC). The top-cited barrier, named by 77% of respondents, was immature tools.
The benchmarks say the models are ready. The deployments say otherwise. Something between the two is not being measured.
What clinical benchmarks test well
HealthBench tests medical knowledge and reasoning: factual accuracy, communication quality, safety caveats, escalation. Performance climbed fast, with GPT-3.5 Turbo at 16%, GPT-4o at 32%, and o3 at 60% (arXiv 2505.08775). A poor score here is a knowledge problem no operational engineering will fix.
MedAgentBench goes further by testing agents as agents. It requires FHIR API calls, multi-step navigation, structured data extraction, and note generation. A shared benchmark lets teams compare approaches on the same footing.
"Chatbots say things. AI agents can do things. This is a much higher bar for autonomy in the high-stakes world of medical care. We need a benchmark to establish the current state of AI capability on reproducible tasks that we can optimize toward."
Jonathan H. Chen, MD, PhD, senior author of MedAgentBench, Stanford HAI
Reproducibility is not realism, though. The MedAgentBench authors note the benchmark "does not capture the full complexity of real-world medical scenarios that typically require coordination and communication between different teams" (arXiv 2501.14654).
Where the bugs actually live
Watch healthcare AI teams deploy agents across prior auth, revenue cycle, and care coordination, and the same failure categories keep showing up. None of them are clinical.
Integration format errors
- Submitting an X12 278 transaction with a segment terminator the clearinghouse rejects
- Constructing an HL7 ADT message with the wrong message type in MSH-9
- Sending a FHIR Bundle where a reference points to a resource not in the Bundle
- Supplying an ICD-10-CM diagnosis code in a field that expects an ICD-10-PCS procedure code
- Formatting a phone number with a country code when the system wants 10 digits
The agent knows medicine. It picks the right code. Then it serializes the payload wrong and gets a 400.
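A cheap pre-flight check catches much of this class before the payer does. Below is a minimal sketch, assuming the Bundle arrives as a parsed dict; it flags internal references that resolve to nothing in the Bundle. The function name is ours, not any FHIR library's API.

```python
def unresolved_bundle_references(bundle: dict) -> list[str]:
    """Return references in a FHIR Bundle that resolve to nothing in the Bundle.

    Sketch only: transaction Bundles may legitimately reference resources already
    on the server, so treat hits as review flags rather than hard failures.
    """
    entries = bundle.get("entry", [])
    # Every way an entry can be addressed: its fullUrl or ResourceType/id.
    known = set()
    for entry in entries:
        if entry.get("fullUrl"):
            known.add(entry["fullUrl"])
        res = entry.get("resource", {})
        if res.get("resourceType") and res.get("id"):
            known.add(f"{res['resourceType']}/{res['id']}")

    missing = []

    def walk(node):
        if isinstance(node, dict):
            ref = node.get("reference")
            if isinstance(ref, str) and not ref.startswith("#") and ref not in known:
                missing.append(ref)
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    for entry in entries:
        walk(entry.get("resource", {}))
    return missing
```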
State management errors
A production prior auth agent might run ten workflows at once. Each has its own FHIR session, portal session, and tokens expiring on different clocks. Typical failures:
- Losing context when a phone call transfers between departments
- Submitting a prior auth twice because the first attempt succeeded but the confirmation never arrived (see the sketch after this list)
- Mixing Patient A's data into Patient B's portal submission under concurrent load
- Starting a workflow over instead of resuming after a network interruption
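The double-submission case has a standard defense: derive an idempotency key from the fields that define "the same request" and check it before treating a retry as new work. A minimal sketch, assuming a persistent dict-like store shared across workflow restarts; the names are hypothetical.

```python
import hashlib

def submission_key(patient_id: str, payer_id: str, procedure_code: str, service_date: str) -> str:
    """Deterministic idempotency key from the fields that define 'the same request'."""
    raw = f"{patient_id}|{payer_id}|{procedure_code}|{service_date}"
    return hashlib.sha256(raw.encode()).hexdigest()

def submit_once(store, submit_fn, key: str, payload: dict):
    """Skip submission if this key was already recorded; otherwise submit and record.

    `store` stands in for any persistent map shared across workflow restarts.
    In production you would also record an in-flight marker before calling
    submit_fn, so a crash mid-submit cannot trigger a blind resubmit.
    """
    if key in store:
        return store[key]
    result = submit_fn(payload)
    store[key] = result
    return result
```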
Timeout and retry errors
Production systems have variable latency. A FHIR server that responds in 200 ms normally may take 30 seconds during batch windows. A payer IVR may hold for 15 minutes. A portal may take 45 seconds to load after an upload. Agents need real retry logic: when to try again, how long to wait, when to give up. Test environments respond instantly, so this code is almost never exercised before go-live.
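A sketch of what that retry logic can look like, assuming the caller passes in the flaky operation; the attempt budget and delays are illustrative, not tuned for any particular payer.

```python
import random
import time

def call_with_backoff(op, *, attempts=5, base_delay=1.0, max_delay=60.0,
                      retryable=(TimeoutError, ConnectionError)):
    """Retry a flaky operation with exponential backoff and jitter, then give up loudly."""
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except retryable:
            if attempt == attempts:
                raise  # Budget exhausted: surface the failure instead of hanging forever.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, delay / 2))  # Jitter avoids synchronized retries.
```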
Portal navigation and recovery
Payer portals have inconsistent UIs, CAPTCHAs, MFA, dynamic form fields, and frequent layout changes. MedAgentBench cannot test any of this because it only provides one interface. HealthBench cannot test it because HealthBench is Q&A. Yet portal work is the operational core of prior auth, eligibility checking, and claims status.
MedAgentBench tasks either succeed or fail. There is no evaluation of partial failure recovery. In production, the question that matters is not "can the agent finish?" but "what does the agent do when it cannot finish?" Does it retry on a 429? Does it escalate on ambiguous data? Does it produce an error log someone can act on?
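In code, that means an explicit policy mapping failure signals to actions rather than letting the agent improvise. A minimal sketch with hypothetical action names; the status codes it keys on are the usual HTTP ones.

```python
from enum import Enum
from typing import Optional

class Action(Enum):
    RETRY = "retry"
    ESCALATE = "escalate"   # hand off to a human queue with full context attached
    ABORT = "abort"         # stop and emit an error log someone can act on

def recovery_action(status_code: Optional[int], data_ambiguous: bool) -> Action:
    """Map a failure signal to an explicit recovery action instead of failing silently."""
    if status_code == 429:            # rate limited: safe to retry after backing off
        return Action.RETRY
    if status_code in (401, 403):     # auth problem: retrying will not fix it
        return Action.ESCALATE
    if data_ambiguous:                # e.g. two records that disagree on the same fact
        return Action.ESCALATE
    return Action.ABORT
```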
The benchmarks measure what is easy to measure: static knowledge and clean API calls. The bugs that ship agents back to the drawing board live in the seams between systems.
What operational evaluation looks like
You cannot test operational correctness with multiple-choice questions. You need environments the agent can act on, then you check what actually happened.
Multi-interface scenarios. One task should span FHIR reads, a portal submission, a phone call, and a FHIR writeback. Data has to flow correctly between them.
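A scenario like that can be written down as data and replayed against the agent run after run. One possible shape for such a spec, with invented system names and step types:

```python
# Hypothetical multi-interface scenario: each step names the interface it exercises
# and the fields that must carry over from earlier steps.
prior_auth_scenario = [
    {"step": "fhir_read",     "resource": "MedicationRequest",
     "capture": ["patient_id", "medication_code"]},
    {"step": "portal_submit", "portal": "example-payer",
     "requires": ["patient_id", "medication_code"], "capture": ["reference_number"]},
    {"step": "phone_ivr",     "purpose": "status check",
     "requires": ["reference_number"], "capture": ["auth_decision"]},
    {"step": "fhir_write",    "resource": "Task",
     "requires": ["auth_decision"]},
]
```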
Failure injection. Real evaluation includes injected API timeouts, auth errors, unexpected data shapes, and portal layout changes. Score the recovery, not just the outcome.
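One way to inject those faults is a thin wrapper around the agent's tool calls that fails a configurable fraction of them. A sketch, assuming the agent routes tool calls through a single function; the fault types and rates are illustrative.

```python
import random

class FaultInjector:
    """Wrap the agent's tool-calling function and inject operational faults at set rates."""

    def __init__(self, tool_fn, timeout_rate=0.10, auth_error_rate=0.05, seed=None):
        self.tool_fn = tool_fn
        self.timeout_rate = timeout_rate
        self.auth_error_rate = auth_error_rate
        self.rng = random.Random(seed)  # seeded so eval runs are reproducible

    def __call__(self, *args, **kwargs):
        roll = self.rng.random()
        if roll < self.timeout_rate:
            raise TimeoutError("injected: upstream FHIR server timed out")
        if roll < self.timeout_rate + self.auth_error_rate:
            raise PermissionError("injected: session token expired")
        return self.tool_fn(*args, **kwargs)
```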
Realistic data shapes. Production FHIR is messy: Condition resources with free-text code.text and no code.coding, Observations with a valueQuantity value but no unit, MedicationRequests referencing contained Medication resources. An agent that hits 95% on clean data may drop to 60% against a real Epic instance. This is the same pattern we covered in the FHIR sandbox problem.
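A concrete fixture worth adding to any eval set, written here as a Python literal with invented values: a Condition carrying only free text, the shape agents actually meet in production.

```python
# Invented but representative: free-text code, no coding array, and a
# site-specific extension the agent must tolerate without choking.
messy_condition = {
    "resourceType": "Condition",
    "id": "cond-legacy-001",
    "code": {"text": "pt reports DM2, poorly controlled per outside notes"},
    "subject": {"reference": "Patient/example-456"},
    "extension": [
        {"url": "http://example.org/fhir/StructureDefinition/source-system",
         "valueString": "legacy-ehr"}
    ],
}
```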
Multi-dimensional scoring. Clinical benchmarks produce one accuracy number. Operational eval needs at least five axes:
- Completion: did the agent finish?
- Correctness: did each check pass?
- Efficiency: steps, latency, retries
- Safety: wrong payer, exposed PHI, duplicate submits
- Resilience: how did it handle injected faults?
An agent finishing 95% of tasks at 3x the expected latency has a different problem from one finishing 85% quickly. An agent that fails safely beats one that fails silently.
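Keeping the axes separate is easier if the scorer never blends them in the first place. A minimal sketch of a per-scenario score record; the field names mirror the axes above and are illustrative.

```python
from dataclasses import dataclass

@dataclass
class ScenarioScore:
    """Per-scenario operational score: one field per axis, no premature averaging."""
    completed: bool            # completion: did the agent finish the workflow?
    checks_passed: int         # correctness: assertions that held
    checks_total: int
    steps: int                 # efficiency signals
    wall_clock_s: float
    retries: int
    safety_violations: int     # wrong payer, exposed PHI, duplicate submits
    recovered_faults: int      # resilience: injected faults handled gracefully
    injected_faults: int

    def report(self) -> dict:
        """Report each axis on its own; a blended number hides the failures that matter."""
        return {
            "completion": self.completed,
            "correctness": self.checks_passed / max(self.checks_total, 1),
            "efficiency": {"steps": self.steps, "latency_s": self.wall_clock_s,
                           "retries": self.retries},
            "safety_violations": self.safety_violations,
            "resilience": self.recovered_faults / max(self.injected_faults, 1),
        }
```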
Key Takeaways
- HealthBench and MedAgentBench test clinical knowledge and basic FHIR interaction. Both are necessary. Neither is sufficient for production.
- The top-cited barrier to healthcare AI adoption in a 2025 JAMIA survey was immature tools, at 77% of respondents. Benchmark scores are not catching this.
- Production failures cluster in five buckets: integration formats, state management, timeouts and retries, portal navigation and recovery, and extraction from messy production data.
- Operational evaluation needs acting environments, multi-interface scenarios, failure injection, and messy data, not static Q&A.
- Score agents on completion, correctness, efficiency, safety, and resilience separately. A single number hides the failures that matter.
- Teams that ship reliable agents build internal operational benchmarks on top of the clinical ones.
FAQ
Is MedAgentBench a good benchmark for production readiness?
No, and its authors agree. MedAgentBench uses clean Stanford FHIR data, pre-authenticated API access, and a single interface. Claude 3.5 Sonnet v2's 69.67% score (arXiv 2501.14654) tells you an agent can read and write FHIR under ideal conditions, not that it will survive an Epic production instance with OAuth, rate limits, and contradictory records.
What does HealthBench miss that matters in production?
Tool use, multi-step workflows, and operational context. HealthBench grades responses to fixed prompts. Production agents execute workflows where each step depends on the last and on external system state. A model that answers a warfarin interaction question perfectly can still fail when it has to pull the actual medication list, format the interaction notes, and submit them inside a specific clinical UI.
Are clinical accuracy benchmarks still useful?
Yes. They set a floor. An agent that fails HealthBench has a knowledge problem no amount of operational engineering will fix. The mistake is treating a high score as proof of production readiness. See our FHIR R4 testing guide for what a practical eval layer on top looks like.
What statistics should I cite when making this case internally?
The 2025 JAMIA survey is the cleanest source: 19% high success in imaging AI despite 90% deployment, 38% high success in clinical risk stratification, 77% of health systems citing immature tools as the top barrier (PMC). Pair with MedAgentBench's 69.67% ceiling on clean data to show the gap.
How do I start building operational evaluation?
Pick one workflow (prior auth is a good first target) and enumerate every system it touches. Build a test environment for each. Write scenarios that span all of them, inject realistic failures, and score on the five axes above. This is what Verial's evaluation engine does end to end. Book a demo if you want to see it run against your agent.

Related articles
HIMSS26's Agentic AI Gap Is an Eval Problem
HIMSS26 showed health systems deploying agents faster than they can audit them. The fix isn't more governance theater, it's independent simulation.
The Agent RFP: How Hospitals Should Evaluate AI in 2026
Slide decks and 3-month pilots can't tell you if an AI agent survives your workflows. Here's how the agent RFP replaces slideware with sim-based bakeoffs.