
HIMSS26's Agentic AI Gap Is an Eval Problem

HIMSS26 showed health systems deploying agents faster than they can audit them. The fix isn't more governance theater, it's independent simulation.

Kevin Huang · Co-founder, Verial · 8 min read

TL;DR

  • Only 22% of hospital leaders at HIMSS26 said they could deliver auditable AI explanations to regulators within 30 days, and just 4.2% of combined IT and Quality/Safety budgets fund AI governance (Presidio, 2026). That's not a policy gap, it's a missing eval layer.
  • A Mount Sinai study cited at HIMSS26 showed single-agent accuracy collapsing from 73% to 16% under clinical-scale workloads. Governance frameworks don't catch this. Running the agent against a simulated hospital does.
  • The vendors who own the agent platform (Epic Agent Factory, Curiosity, Seismometer) cannot also be the independent validator. Health systems need a test environment that sits outside the EHR stack.
  • We think of this as the Verial wedge: simulated multi-system environments (FHIR, HL7, payer portal, IVR, fax) where an agent's behavior is measured before it ever touches production.

HIMSS26 said the quiet part out loud

Presidio's HIMSS26 writeup captures a tension every health system CIO I've talked to this year has been circling around. Vendors showed up with autonomous agents for scheduling, prior auth, care gaps, denial management. Hospitals showed up asking how to govern the agents they've already bought.

The headline number: 22% of hospital leaders are confident they could produce an auditable AI explanation for a regulator in 30 days. 88% are using AI internally. 4.2% of their combined IT and Quality/Safety budgets fund governance. The math doesn't work.

This is the gap nobody's testing. And it's not a new CHAI playbook or a Joint Commission certification that closes it; those are policy scaffolding. The missing piece is empirical: did this agent actually do the job, in an environment that looks like my hospital, before I let it touch a patient?

Governance without evidence is theater

Presidio frames the fix as a three-stage maturity model: activate, orchestrate, govern. The sequencing is right. But "govern" in practice today means model cards, vendor questionnaires, and self-reported metrics. That's not evidence, that's paperwork.

The Mount Sinai finding cited in the article is the clearest tell. Single-agent accuracy at 73% in isolation, 16% under clinical-scale load. Orchestrated multi-agent systems held up with 65x fewer resources. You only discover that by running the agent against a workload that resembles production. Static benchmarks, compliance attestations, and vendor demos won't show it.

"If you cannot reproduce the environment, you cannot trust the agent. If you cannot trust the agent, it does not ship."

This is the frame Simon Willison surfaced recently about agent dev broadly, and healthcare is the hardest version of it. The environment is multi-system (EHR plus payer portals plus HIE plus fax plus IVR), the data is the messiest in any vertical, and the cost of being wrong is regulatory or clinical.

Why Epic can't be the validator

Epic is simultaneously building the agent platform (Agent Factory), the foundation models (Curiosity), and the validation suite (Seismometer). This is a structural conflict, not a swipe at Epic. The builder of a platform cannot credibly also be the independent tester of agents built on that platform. As liability questions escalate and regulators push for pre-deployment evidence, governance committees will ask the obvious question: is vendor-reported validation enough?

Consider what 2026 already holds. CMS-0057-F forces payer FHIR APIs by January 2027. Information blocking enforcement kicked in February 2026. The Joint Commission's voluntary AI certification through CHAI is rolling out this year (Healthcare Dive, 2025). The regulatory direction of travel is toward documented, pre-deployment testing evidence. That evidence has to come from somewhere neutral.

Just as a SaaS vendor doesn't grade its own SOC 2 audit, healthcare AI validation is drifting toward an independent third party. Verial is building for that drift.

What a real eval layer looks like

If the bar is "reproduce the environment," then the eval layer has to simulate the whole stack, not just a FHIR endpoint. A prior auth agent doesn't fail on FHIR syntax. It fails when the payer portal changes a CAPTCHA mid-quarter, when the IVR reroutes, when an X12 278 response comes back with a code the agent's never seen.

Here's the test surface that actually matters:

| Surface | What breaks in production | What static benchmarks miss |
| --- | --- | --- |
| FHIR R4 / US Core | Field population rates, coding variation, rate limits | MedAgentBench uses clean Stanford data, single endpoint |
| HL7v2 (ADT, SIU, ORU) | Site-specific segment variation, ordering | Not tested at all |
| Payer portal | CAPTCHA, session timeout, UI drift | Not tested at all |
| X12 (278, 270/271, 837) | Version mismatches, payer-specific quirks | Not tested at all |
| IVR / voice | Menu trees, hold times, transfer loops | Not tested at all |
| Fax / CDA | Document format variation, OCR recovery | Not tested at all |

MedAgentBench scores the top model at 85% read and 54% write on FHIR (Jiang et al., arXiv 2501.14654). Clinical accuracy benchmarks like HealthBench push models past 60% (OpenAI, 2025). Both are real progress. Neither predicts whether the agent handles a payer portal with a broken session.
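To make the "what breaks in production" column concrete, here is a minimal sketch of fault injection against a simulated payer portal. Every name in it (the portal class, the fault list, the two toy agents) is hypothetical and illustrative; it is the shape of the idea, not Verial's actual harness.

```python
import random

# Hypothetical sketch: fault injection against a simulated payer portal.
# All names here (SimulatedPayerPortal, FAULTS, the agents) are illustrative.

FAULTS = ["captcha_challenge", "session_timeout", "ui_drift"]

class SimulatedPayerPortal:
    """Fake portal that injects a fault on each submission with some probability."""

    def __init__(self, fault_rate=0.3, seed=None):
        self.rng = random.Random(seed)
        self.fault_rate = fault_rate

    def submit_prior_auth(self, request):
        if self.rng.random() < self.fault_rate:
            return {"status": "error", "fault": self.rng.choice(FAULTS)}
        return {"status": "accepted", "auth_id": f"PA-{request['member_id']}"}

def naive_agent(portal, request):
    # Submits once and gives up on any fault: the behavior a clean,
    # single-endpoint benchmark never penalizes.
    return portal.submit_prior_auth(request)

def retrying_agent(portal, request, max_retries=3):
    # Retries through transient faults before giving up.
    result = portal.submit_prior_auth(request)
    for _ in range(max_retries):
        if result["status"] == "accepted":
            break
        result = portal.submit_prior_auth(request)
    return result

def run_scenario(agent, portal, n_requests=100):
    """Fraction of requests the agent drives to an accepted submission."""
    ok = sum(
        agent(portal, {"member_id": f"M{i:04d}"})["status"] == "accepted"
        for i in range(n_requests)
    )
    return ok / n_requests

naive_score = run_scenario(naive_agent, SimulatedPayerPortal(seed=7))
retry_score = run_scenario(retrying_agent, SimulatedPayerPortal(seed=7))
print(f"naive: {naive_score:.2f}  retrying: {retry_score:.2f}")
```

The naive agent scores well under zero faults and collapses the moment the portal misbehaves, which is exactly the gap between a static benchmark score and a production-shaped eval. A real harness would swap the ten-line fake portal for full protocol simulation: session state, CAPTCHA timing, X12 response variants.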

This is the space Verial is building into. We simulate the multi-system environment, run the agent against it, and produce the measurement.

Activate, orchestrate, govern, and validate

Presidio's three-stage model is useful. I'd add a fourth stage that sits between orchestrate and govern: validate. You can't govern what you haven't measured, and you can't afford to do the measuring in production; that's the whole point of pre-deployment testing. Validation is the step where you run the agent against realistic scenarios and collect the evidence that governance committees consume.

Concretely, for a health system today, that looks like:

  1. Define the workflow you want to validate (scheduling, prior auth, care gap outreach).
  2. Simulate your environment: population, payer mix, EHR config, external systems.
  3. Run the candidate agent (vendor's, internal team's, whatever) against the sim.
  4. Measure accuracy, recovery, latency, failure modes.
  5. Hand the report to the AI governance committee.
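
As a sketch, the five steps above reduce to a harness loop like the one below. Every name in it (`simulate_environment`, `toy_agent`, the scenario keys) is hypothetical; this shows the shape of the loop, not Verial's actual API.

```python
import json
import statistics
import time

# Hypothetical harness sketch mapping the five validation steps.
# The environment here is a trivial list of synthetic tasks; a real
# simulation would stand in for the EHR, payer portal, IVR, and fax stack.

def simulate_environment(scenario):
    # Step 2: build synthetic tasks reflecting the hospital's payer mix.
    return [
        {"task_id": i, "payer": scenario["payer_mix"][i % len(scenario["payer_mix"])]}
        for i in range(scenario["n_tasks"])
    ]

def run_validation(agent, scenario):
    tasks = simulate_environment(scenario)
    results = []
    for task in tasks:  # Step 3: run the candidate agent against the sim.
        start = time.perf_counter()
        outcome = agent(task)
        results.append({
            "ok": outcome.get("ok", False),
            "latency_s": time.perf_counter() - start,
            "failure_mode": outcome.get("failure_mode"),
        })
    # Step 4: measure accuracy, latency, and observed failure modes.
    return {
        "workflow": scenario["workflow"],  # Step 1: the workflow under test.
        "accuracy": sum(r["ok"] for r in results) / len(results),
        "p50_latency_s": statistics.median(r["latency_s"] for r in results),
        "failure_modes": sorted({r["failure_mode"] for r in results if r["failure_mode"]}),
    }  # Step 5: this dict is what gets handed to the governance committee.

def toy_agent(task):
    # Deterministic stand-in: fails on one hypothetical payer quirk.
    if task["payer"] == "payer_b":
        return {"ok": False, "failure_mode": "x12_278_unknown_code"}
    return {"ok": True}

scenario = {"workflow": "prior_auth", "payer_mix": ["payer_a", "payer_b", "payer_a"], "n_tasks": 30}
report = run_validation(toy_agent, scenario)
print(json.dumps(report, indent=2))
```

The point of the shape: the output is a measurement over scenarios the hospital defined, not a number the agent vendor self-reported.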

We don't yet know what the exact report format looks like; that's something we're actively iterating on with pilot health systems. What we do know is that committees don't want another vendor self-assessment. They want something the EHR vendor didn't produce.

What to watch next

With HIMSS26 behind us, the question for the rest of 2026 is whether health system AI governance committees start treating independent validation as a hard requirement or keep accepting vendor-provided evidence. The early signal from the 22% number is that the current model isn't holding. Budgets will shift, the question is where.

Still, this is not easy to do. Simulating a real hospital's multi-system environment with enough fidelity that the eval actually predicts production behavior is the hard part. We have opinions on how to get there (start with scheduling and prior auth, let hospitals define their own scenarios, build from real traffic patterns over time) but the work is ahead of us, not behind.

If you're a health system leader wrestling with the governance gap HIMSS26 surfaced, or an AI team that's tired of vendors grading their own homework, we'd love to talk.

Key Takeaways

  • The HIMSS26 governance gap is structural: only 22% of hospitals can audit AI in 30 days, and 4.2% of relevant budgets fund governance.
  • Clinical accuracy benchmarks don't catch operational failure. Mount Sinai's 73% to 16% drop under load is a simulation problem, not a model problem.
  • Epic and other EHR vendors can't credibly validate agents built on their own platforms. Independent eval infrastructure is the missing layer.
  • Real eval means simulating the whole stack: FHIR, HL7, payer portal, X12, IVR, fax. FHIR-only benchmarks are table stakes, not the finish line.
  • Add "validate" between "orchestrate" and "govern" in the maturity model. Governance without empirical evidence is paperwork.

FAQ

What's the 22% number from HIMSS26?

22% of hospital leaders surveyed said they could deliver an auditable AI explanation to a regulator within 30 days, per Presidio's HIMSS26 recap. The gap reflects missing eval infrastructure, not just missing policies.

Why can't Epic's Seismometer serve as the validator for agents built on Agent Factory?

Epic builds the platform, the foundation model, and the validator. That's a structural conflict of interest. Governance committees increasingly want validation evidence produced by an environment the EHR vendor doesn't control.

How is this different from HealthBench or MedAgentBench?

Those are clinical-accuracy and FHIR-only benchmarks. Production agents fail on portal CAPTCHA, IVR loops, X12 version mismatches, session timeouts, things those benchmarks don't test. The eval layer has to cover the full multi-system stack.

What does independent validation actually produce?

A report the AI governance committee can use for a go/no-go decision: accuracy under realistic load, failure modes, recovery behavior, resource consumption, all measured against scenarios the hospital defined. It's the evidence layer that sits under policy frameworks like CHAI and Joint Commission certification.
