The Agent RFP: How Hospitals Should Evaluate AI in 2026
Slide decks and 90-day pilots can't tell you whether an AI agent survives your workflows. Here's how the agent RFP replaces slideware with sim-based bakeoffs.

TL;DR
- Hospitals should evaluate AI vendors with a sim-based bakeoff, not a slide deck. Vendors connect their agent to a simulated copy of your workflow, run the same task set, and get scored on the same rubric. Empirical, not anecdotal.
- Only 16% of health systems have a system-wide AI governance policy, per a KLAS + CCM survey covered by Healthcare Innovation. That is the gap the agent RFP is meant to close.
- ECRI named insufficient AI governance the #2 patient safety concern of 2025, right behind AI-enabled apps themselves.
- More than 2,000 clinical AI tools hit the US market in 2024 alone, per the Parachute YC S25 launch. Reference calls don't scale to that number. A standardized sim does.
Why the old RFP broke
Hospital procurement was built for EHR modules and imaging hardware. You ask for a demo, you ask for references, you run a 90-day pilot against a handful of cases, and you sign. That process worked when the product was a claim scrubber or a PACS viewer. It does not work for an AI agent that has to traverse a payer portal behind CAPTCHA, parse an HL7v2 ADT feed, fall back to an IVR when the portal is down, and then drop a CDA note back into Epic.
Just as self-driving needed simulated cities before real roads, healthcare agents need simulated health systems before real EHRs. The demo shows you the happy path. The pilot shows you what survived a scripted 90 days. Neither tells you how the agent behaves when Availity pushes a UI change on a Tuesday.
This is the gap the agent RFP is built to close.
"We stopped evaluating vendors on what they claimed and started evaluating them on what they produced against the same test set. It was the single biggest change we made in how we buy."
Dr. Robert Wachter, Chair of Medicine, UCSF
What an agent RFP actually is
An agent RFP is a bakeoff. You take a real workflow your health system cares about, say, prior auth for an MRI of the lumbar spine, and you encode it as a set of tasks a simulator can grade. Each vendor connects their agent to the simulator. They all see the same patients, the same payer configs, the same portal quirks, the same IVR menus. Every run produces a score and a trace.
The output is not a slide deck. It's a table. Cases passed, cases failed, median time to completion, hallucination rate on 837P loops, recovery behavior when the portal times out, behavior when insurance has lapsed. Same rubric across vendors.
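To make "a table, not a slide deck" concrete, here's a minimal sketch of the per-vendor result record a bakeoff might emit. Every field name below is illustrative, not a standard schema:

```python
# Sketch of a per-vendor bakeoff result record.
# All field names are illustrative, not a standard schema.
from dataclasses import dataclass

@dataclass
class VendorResult:
    vendor: str
    cases_passed: int
    cases_failed: int
    median_minutes: float            # median time to completion
    hallucination_rate: float        # share of runs with fabricated payer data
    timeout_recovery_rate: float     # share of portal timeouts recovered from
    lapsed_coverage_handled: float   # share of lapsed-insurance cases handled

    @property
    def pass_rate(self) -> float:
        total = self.cases_passed + self.cases_failed
        return self.cases_passed / total if total else 0.0
```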
For a longer treatment of what to put in that rubric, see our earlier post on how to evaluate healthcare AI vendor testing. The agent RFP is the buyer-side version of the same discipline.
The structure we recommend
1. Define the workflow, not the model
Pick one or two workflows that matter to your P&L: prior auth, referral intake, claims status, eligibility, discharge scheduling, or inbound fax triage. Encode each as a scenario set with explicit success criteria. Not "the agent submits a PA," but "the agent submits a CMS-0057-compliant 278 with correct CPT, ICD-10, and attached clinical notes, within 15 minutes, and handles a pended response with supplemental documentation."
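A scenario in this sense is just data: the inputs the simulator presents and the criteria it grades against. Here is a sketch of how the PA example above might be encoded; the structure and key names are hypothetical, not a published spec:

```python
# Hypothetical encoding of one prior-auth scenario with explicit
# success criteria. Structure and key names are illustrative.
scenario = {
    "id": "pa-mri-lumbar-001",
    "workflow": "prior_auth",
    "inputs": {
        "procedure_cpt": "72148",      # MRI lumbar spine without contrast
        "diagnosis_icd10": "M54.16",   # radiculopathy, lumbar region
        "payer": "synthetic-payer-a",
    },
    "success_criteria": {
        "transaction": "278",          # X12 prior-auth request
        "required_attachments": ["clinical_notes"],
        "max_minutes": 15,
        "must_handle": ["pended_response_with_supplemental_docs"],
    },
}
```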
2. Pick the interfaces the agent has to touch
Real prior auth isn't one API. It's a FHIR read against Epic, an Availity or Cohere portal login, sometimes a payer IVR, sometimes a fax. If your RFP only scores the FHIR portion, you're testing the easy 20%. The hardest failure modes live in the handoffs between interfaces.
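One way a simulator can grade those handoffs: record which channel the agent tried at each step and check that it escalated in the expected order. A toy sketch, with channel names invented for illustration:

```python
# Toy sketch: grading fallback behavior across interface handoffs.
# Channel names and the expected order are invented for illustration.
EXPECTED_ORDER = ["fhir_read", "portal", "ivr", "fax"]

def escalated_in_order(trace_channels: list[str]) -> bool:
    """Pass if the agent moved strictly forward through the expected
    fallback order: no backtracking, no repeated channel."""
    positions = [EXPECTED_ORDER.index(c) for c in trace_channels
                 if c in EXPECTED_ORDER]
    return all(a < b for a, b in zip(positions, positions[1:]))

assert escalated_in_order(["fhir_read", "portal", "ivr"])   # clean escalation
assert not escalated_in_order(["portal", "fhir_read"])      # backtracked to FHIR
```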
3. Standardize the patient set
Give every vendor the same 200 synthetic patients, stratified by payer mix, complexity, and edge cases. If vendors bring their own data, you're comparing cooking shows where each chef brought their own ingredients.
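What "same 200 synthetic patients, stratified" might look like as code, with a fixed seed so every vendor sees an identical cohort. The payer mix and edge-case rate below are invented:

```python
# Sketch: a fixed, shared synthetic cohort stratified by payer mix.
# Proportions and edge-case rate are invented for illustration.
import random

PAYER_MIX = {"payer_a": 0.40, "payer_b": 0.30, "payer_c": 0.20, "payer_d": 0.10}
EDGE_CASES = ["lapsed_coverage", "portal_downtime", "pended_response"]

def build_cohort(n: int = 200, seed: int = 42) -> list[dict]:
    rng = random.Random(seed)   # fixed seed: every vendor gets the same patients
    cohort = []
    for payer, share in PAYER_MIX.items():
        for i in range(round(n * share)):
            cohort.append({
                "patient_id": f"{payer}-{i:03d}",
                "payer": payer,
                # roughly 1 in 10 patients carries a deliberate edge case
                "edge_case": rng.choice(EDGE_CASES) if rng.random() < 0.1 else None,
            })
    return cohort
```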
4. Score on operational correctness, not clinical accuracy
Clinical accuracy is largely a function of the underlying model. Operational correctness, whether the agent actually completes the task end to end, is where vendors diverge. The industry keeps conflating the two. For more on why model benchmarks miss this, see our note on healthcare AI evaluation gaps.
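In grader terms, operational correctness is a conjunction over workflow steps plus an SLA, independent of whether the clinical judgment inside any step was right. A crude sketch; step names are hypothetical:

```python
# Sketch: operational correctness as end-to-end step completion.
# Step names are hypothetical; a real rubric is workflow-specific.
REQUIRED_STEPS = {"eligibility_checked", "auth_submitted",
                  "attachments_included", "response_recorded"}

def operationally_correct(completed: set[str], minutes: float,
                          sla_minutes: float = 15.0) -> bool:
    return REQUIRED_STEPS <= completed and minutes <= sla_minutes
```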
5. Publish the trace
Every run produces a structured trace: what the agent called, what it got back, what it decided. The buyer should be able to replay any failure. This is what reference calls were trying to approximate. Traces do it better.
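A minimal shape for such a trace, with just enough structure to replay a failure step by step. Field names are illustrative:

```python
# Sketch of a structured run trace: what the agent called, what it
# got back, what it decided. Field names are illustrative.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class TraceEvent:
    step: int
    channel: str               # e.g. "fhir_read", "portal", "ivr", "fax"
    request: dict[str, Any]
    response: dict[str, Any]
    decision: str              # what the agent chose to do next

@dataclass
class RunTrace:
    run_id: str
    scenario_id: str
    events: list[TraceEvent] = field(default_factory=list)

    def first_failure(self) -> TraceEvent | None:
        """Jump straight to the step where the run went off the rails."""
        return next((e for e in self.events
                     if e.response.get("status") == "error"), None)
```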
Why now
Three things are converging in 2026 that make the agent RFP not just nice to have but necessary.
Scale of supply. More than 2,000 clinical AI tools hit the market in 2024 (Parachute, YC S25). You cannot reference-call your way through that. Parachute, as an adjacent player building a marketplace plus monitoring layer, is one signal that the industry knows the old selection process is broken.
Governance expectations. The Joint Commission and Coalition for Health AI (CHAI) are rolling out a voluntary AI certification in 2026, built on governance playbooks tailored to delivery organizations. Certification won't replace evaluation, but it raises the baseline. Hospitals that can't point to empirical evaluation traces are going to look undercooked.
Regulatory patience is running out. ECRI named insufficient AI governance the #2 patient safety concern of 2025, right behind AI-enabled apps themselves (Healthcare Dive). When the safety nonprofits are naming governance specifically, the question from boards and malpractice carriers shifts from "did you evaluate?" to "show me the evaluation."
Still, most hospitals are not ready. The same KLAS survey that produced the 16% number showed many health systems are waiting for federal regulation before drafting their own policies. That's a losing wait. CHAI and Joint Commission are moving, AMA is moving, ECRI is moving, and vendors are shipping 2,000 tools a year. The buyer side has to build the muscle now.
What a good agent RFP produces
- A ranked scoreboard across vendors on the same task set, with cases passed and failed
- Per-vendor traces for every failure, reviewable by the governance committee
- A go/no-go decision that your chief medical officer and compliance lead can defend in writing
- A regression baseline the vendor will be measured against after go-live
If you can't produce those four artifacts at the end of your selection process, you didn't run an evaluation. You ran a vibe check.
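The ranking step itself is trivial once the runs exist, which is rather the point. A sketch, assuming hypothetical result dicts with pass_rate and median_minutes keys:

```python
# Sketch: the ranked scoreboard. Assumes each result dict carries
# hypothetical "vendor", "pass_rate", and "median_minutes" keys.
def rank_vendors(results: list[dict]) -> list[dict]:
    """Pass rate descending; ties broken by median completion time."""
    return sorted(results, key=lambda r: (-r["pass_rate"], r["median_minutes"]))
```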
Key Takeaways
- Slide decks and 90-day pilots were designed for EHR modules, not agents. They can't tell you whether an agent survives real payer portals, IVRs, and fax.
- An agent RFP is a sim-based bakeoff: same tasks, same patients, same rubric, all vendors in parallel.
- Score on operational correctness (does the agent complete the task end to end), not just clinical accuracy.
- Only 16% of health systems have a system-wide AI governance policy. The agent RFP is how the other 84% build one in practice.
- ECRI, CHAI, and the Joint Commission are all converging on the same message: empirical evaluation is the new floor.
- Publish the trace. Failures you can replay are failures you can fix.
- If your RFP doesn't produce a scoreboard and a trace archive, it's a vibe check, not an evaluation.
FAQ
What's the difference between an agent RFP and a pilot?
A pilot runs the agent live on real patients in your environment. An agent RFP runs every candidate vendor in parallel against a simulated copy of your workflow, using the same patients and rubric. Pilots are slow, sequential, and produce thin evidence. Agent RFPs are fast, parallel, and produce a scoreboard. Most hospitals should run both, in that order.
Do we need PHI to run an agent RFP?
No, and you actively don't want it at this stage. Use synthetic patients mapped to your payer mix and clinical complexity. PHI enters the picture only once a vendor is selected and moving into pilot. Running the bakeoff on synthetic data is what makes it fast, safe, and repeatable across vendors.
How many scenarios does an agent RFP need?
Enough to stratify across the variables that matter. For prior auth, we'd want at least 200 cases across your top 5 payers, 3 procedure categories, and a handful of edge cases (lapsed insurance, portal downtime, pended responses). Fewer than that and the confidence intervals swallow the result.
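The arithmetic behind that cutoff is worth seeing once. A Wilson 95% interval on an observed 85% pass rate is about three times wider at 20 cases than at 200:

```python
# Wilson 95% confidence interval for a pass rate: why small scenario
# sets can't separate vendors.
from math import sqrt

def wilson_interval(passed: int, n: int, z: float = 1.96) -> tuple[float, float]:
    p = passed / n
    center = (p + z * z / (2 * n)) / (1 + z * z / n)
    half = (z / (1 + z * z / n)) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

print(wilson_interval(17, 20))    # 85% on 20 cases:  roughly (0.64, 0.95)
print(wilson_interval(170, 200))  # 85% on 200 cases: roughly (0.79, 0.89)
```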
Who runs the simulator, the vendor or the hospital?
Neither. A neutral third party runs the simulator so no vendor is grading their own homework and no hospital is building sim infrastructure from scratch. That's the piece of the stack we're building at Verial.
How does this fit with CHAI and Joint Commission certification?
Certification sets a governance floor: process, oversight, documentation. The agent RFP is how you produce the evidence those playbooks call for. Expect future versions of the CHAI playbooks to reference sim-based evaluation explicitly, because there's no other way to audit an agent against a workflow at scale.
Where this goes next
The interesting question for 2026 isn't whether hospitals start running agent RFPs. Some already are. It's whether the format standardizes fast enough that vendors start publishing their scores against shared benchmarks, the way self-driving companies publish miles-per-disengagement. As this becomes a standard part of deployment, the conversation shifts from "does your agent know medicine" to "does your agent survive our workflow."
That's the conversation worth having. Let's get on with it.
