Sim-to-Prod: One API for Healthcare AI Test and Live

TL;DR

The sim-to-prod gap is the single biggest source of healthcare agent bugs after go-live. We close it by putting one canonical API surface in front of both sandbox and production rails, then switching between them with a credential flip. It is the same pattern Plaid, Stripe, and Twilio proved in fintech and telecom.
Sandboxes and production diverge on auth flow, rate limits, error taxonomy, data messiness, reference integrity, pagination, and idempotency. Every one of those is a place your agent can fail silently.
ONC reports 70% of US hospitals enabled FHIR app access in 2024, up from 56% in 2021, and CMS-0057-F makes four payer FHIR APIs mandatory by January 1, 2027. The integration surface is shifting under us.
Baseline your agent's simulator performance and treat any drift between sim and prod as a production alarm, not a modeling problem.

The Plaid pattern, applied to healthcare

Plaid, Stripe, and Twilio all solved the same developer problem. You code against a sandbox, pass a sandbox key, ship, swap one credential for a production key, and the same code runs. The API surface, the error shapes, the webhook payloads, the pagination, all the same. The only thing that changes is what's behind the credential.

Healthcare does not work this way. We build against HAPI FHIR, Synthea-seeded test servers, and Epic's App Orchard sandbox, then deploy against Redox, Health Gorilla, direct Epic or Oracle Health FHIR endpoints, and whatever SMART-on-FHIR quirks each hospital's IT team bolted on. Different auth, different error shapes, different data messiness, different everything.

This is where most post-go-live bugs come from. Not the model. Not the prompt. The integration.

Just as self-driving needed simulated cities before real roads, healthcare agents need sandboxes that behave like production. We're not there yet, and the gap is expensive. Redox has raised $95M and processes over 12 million patient records per day across 50+ EHR systems, largely because so many AI companies figured out it's cheaper to outsource that translation layer than to build it (Tracxn, 2026).

"Interoperability isn't the goal. The goal is value-based care enabled by interoperability."

Micky Tripathi, former National Coordinator for Health IT, ONC

Tripathi is right, but the translation problem is still in the way of the goal. Let's get concrete about where sim and prod actually diverge.

Where sandbox and production split

We've seen the same seven divergences in almost every agent deployment we've evaluated.

Auth envelope. HAPI has no auth. Synthea has no auth. Epic's sandbox uses SMART-on-FHIR with a public client. Production almost always uses confidential clients with dynamic registration, backend services with JWT assertions, or a broker like Redox that hides auth entirely behind a Bearer token. Agents that hard-code sandbox auth break the first time they hit a real OAuth dance with PKCE and rotating keys.

Rate limits. Sandboxes don't enforce them. Production EHRs absolutely do, and they vary: Epic's app-level limits, Oracle Health's tenant-level throttling, and payer APIs that cap at 60 requests per minute. Your agent's retry logic never got tested.
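To make the retry point concrete, here is a minimal sketch of backoff that honors `Retry-After` on a 429. The `send` callable, the `(status, headers, body)` tuple shape, and the default delays are assumptions for illustration, not part of any EHR vendor's SDK. It also assumes header names have already been normalized to the casing shown.

```python
import time
from typing import Callable, Dict, Tuple

# (status code, response headers, body): a hypothetical shape for this sketch.
Response = Tuple[int, Dict[str, str], str]


def call_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() until it stops returning 429 or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    response = send()
    for attempt in range(1, max_attempts):
        status, headers, _ = response
        if status != 429:
            return response
        # Retry-After may also be an HTTP date; this sketch only handles
        # the integer-seconds form and otherwise falls back to exponential delay.
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            delay = base_delay * (2 ** (attempt - 1))
        sleep(delay)
        response = send()
    # Either a non-429 response or the last 429 after exhausting attempts.
    return response
```

Passing `sleep` as a parameter lets a test harness record the delays instead of actually waiting, which is what makes this testable in a sim.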

Error taxonomy. HAPI returns clean OperationOutcome resources. Production returns those plus HTML error pages, HTTP 500s with no body, 403s that actually mean "patient opted out of data sharing", and the occasional 200 with an empty Bundle when you expected a 404.
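One way to cope with this is to normalize every upstream failure into a small, fixed set of codes and keep the raw response for debugging. The sketch below is illustrative only: the status-to-code mapping and the choice to bucket anything unmapped (including 403s and empty 500s) as `upstream_unavailable` are our assumptions, and a real layer would likely need dedicated handling for consent-related 403s.

```python
import json
from typing import Any, Dict

# Hypothetical mapping from upstream HTTP status to a canonical error code.
STATUS_TO_CODE = {
    400: "invalid_request",
    422: "invalid_request",
    404: "not_found",
    410: "not_found",
    409: "data_conflict",
    412: "data_conflict",
    429: "rate_limited",
}


def normalize_error(status: int, content_type: str, body: str) -> Dict[str, Any]:
    """Turn any upstream failure into one canonical error dict."""
    # Assumption: anything unmapped is reported as upstream_unavailable.
    code = STATUS_TO_CODE.get(status, "upstream_unavailable")
    message = ""
    if "json" in content_type.lower() and body.strip():
        try:
            parsed = json.loads(body)
        except ValueError:
            parsed = None  # Body claimed to be JSON but was not.
        if isinstance(parsed, dict) and parsed.get("resourceType") == "OperationOutcome":
            issues = parsed.get("issue") or []
            if issues and isinstance(issues[0], dict):
                first = issues[0]
                details = first.get("details") or {}
                message = first.get("diagnostics") or details.get("text") or ""
    return {
        "code": code,
        "message": message,
        # Raw upstream response is preserved (truncated) for debugging.
        "debug": {"status": status, "content_type": content_type, "raw_body": body[:2000]},
    }
```

HTML error pages and empty bodies fall through with an empty `message`, so the agent only ever branches on `code`.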

FHIR data messiness. Synthea generates perfect US Core. Production FHIR has Observations with effectivePeriod instead of effectiveDateTime, Conditions with no clinicalStatus, references to Patients that don't exist, and extensions nobody documented. If you haven't seen this before, see our deep dive on the FHIR sandbox problem.
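As a small illustration of defensive reading, the sketch below extracts a usable timestamp from whichever `effective[x]` variant an Observation carries, and reports a missing `clinicalStatus` as "unknown" instead of guessing. The function names are ours, and treating `recurrence` and `relapse` as active is an assumption you may want to change.

```python
from typing import Any, Dict, Optional


def effective_time(observation: Dict[str, Any]) -> Optional[str]:
    """Return one timestamp string regardless of which effective[x] is used."""
    for key in ("effectiveDateTime", "effectiveInstant"):
        value = observation.get(key)
        if isinstance(value, str):
            return value
    period = observation.get("effectivePeriod")
    if isinstance(period, dict):
        # Prefer the start of the period; fall back to the end if start is absent.
        return period.get("start") or period.get("end")
    return None  # No usable effective time; the caller decides what to do.


def condition_is_active(condition: Dict[str, Any]) -> Optional[bool]:
    """True or False when clinicalStatus is coded, None when it is missing."""
    codings = (condition.get("clinicalStatus") or {}).get("coding") or []
    codes = {c.get("code") for c in codings if isinstance(c, dict)}
    codes.discard(None)
    if not codes:
        return None
    # Assumption: recurrence and relapse count as active for this sketch.
    return bool(codes & {"active", "recurrence", "relapse"})
```

Returning `None` rather than a default forces the agent to handle "unknown" explicitly, which is the case clean Synthea data never exercises.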

Reference integrity. Sandboxes guarantee Patient/123 actually resolves. Production does not. Dangling references, version-specific references, absolute URLs pointing to another tenant's server, all normal.
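A rough sketch of how an agent-side check might classify references before trying to resolve them. The `kind` labels and the `known_ids` set are hypothetical names for this example, and the regular expression covers only the common reference forms, not every case the FHIR specification allows.

```python
import re
from typing import Dict, Iterable, List, Optional, Set, Tuple

# Matches "Type/id", "Type/id/_history/v", and the same with an absolute base URL.
_REFERENCE = re.compile(
    r"^(?:(?P<base>https?://.+)/)?"
    r"(?P<type>[A-Z][A-Za-z]+)/(?P<id>[A-Za-z0-9\-.]{1,64})"
    r"(?:/_history/(?P<version>[A-Za-z0-9\-.]{1,64}))?$"
)


def classify_reference(ref: str, own_base: str) -> Dict[str, Optional[str]]:
    """Label a reference as local, versioned, foreign, contained, urn, or unparseable."""
    result: Dict[str, Optional[str]] = {
        "kind": "unparseable", "type": None, "id": None, "version": None,
    }
    if ref.startswith("#"):
        result["kind"] = "contained"
        return result
    if ref.startswith("urn:"):
        result["kind"] = "urn"
        return result
    match = _REFERENCE.match(ref)
    if match is None:
        return result
    result.update(type=match["type"], id=match["id"], version=match["version"])
    base = match["base"]
    if base is not None and base.rstrip("/") != own_base.rstrip("/"):
        result["kind"] = "foreign"  # Absolute URL pointing at another server.
    elif match["version"] is not None:
        result["kind"] = "versioned"
    else:
        result["kind"] = "local"
    return result


def dangling(refs: Iterable[str], own_base: str, known_ids: Set[Tuple[str, str]]) -> List[str]:
    """Return same-server references whose (type, id) is not in known_ids."""
    missing = []
    for ref in refs:
        info = classify_reference(ref, own_base)
        if info["kind"] in ("local", "versioned") and (info["type"], info["id"]) not in known_ids:
            missing.append(ref)
    return missing
```

Foreign and unparseable references are deliberately not reported as dangling here; whether to follow, skip, or flag them is a policy decision for the canonical layer.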

Pagination. Sandboxes return everything in one page. Production uses next links, cursor tokens, and sometimes different page sizes per resource type. Agents that don't paginate miss data silently, which is the worst kind of failure.
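For FHIR backends specifically, following pages means chasing the Bundle's `next` link until it disappears. Below is a minimal sketch; the `fetch` callable (which takes a URL and returns a parsed Bundle dict) and the `max_pages` guard are assumptions for illustration, so no HTTP client is implied.

```python
from typing import Any, Callable, Dict, Iterator, Optional


def next_link(bundle: Dict[str, Any]) -> Optional[str]:
    """Return the URL of the Bundle's 'next' link, or None on the last page."""
    for link in bundle.get("link") or []:
        if link.get("relation") == "next":
            return link.get("url")
    return None


def iter_resources(
    first_url: str,
    fetch: Callable[[str], Dict[str, Any]],
    max_pages: int = 1000,
) -> Iterator[Dict[str, Any]]:
    """Yield every resource across all pages, guarding against loops."""
    url: Optional[str] = first_url
    seen = set()
    while url is not None:
        if url in seen or len(seen) >= max_pages:
            # Fail loudly rather than silently returning partial data.
            raise RuntimeError("pagination loop or page cap hit at " + url)
        seen.add(url)
        bundle = fetch(url)
        for entry in bundle.get("entry") or []:
            if "resource" in entry:
                yield entry["resource"]
        url = next_link(bundle)
```

The loop raises instead of stopping quietly when it hits the cap, because a truncated result set is exactly the silent failure described above.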

Idempotency. Sandboxes let you POST the same Observation ten times and happily create ten copies. Production varies: some endpoints dedupe on logical ID, some respect If-None-Exist, some don't. Without idempotency keys, retries become duplicates, and duplicates become clinical incidents.
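To show what enforcing idempotency keys can look like, here is a small in-memory sketch. The `IdempotentWriter` class and its `create` callback are hypothetical; a real implementation would need a durable store with expiry, and this one only illustrates the replay and conflict rules.

```python
import hashlib
import json
from typing import Any, Callable, Dict, Tuple


class IdempotentWriter:
    """Deduplicate writes by idempotency key (in-memory, illustration only)."""

    def __init__(self, create: Callable[[Dict[str, Any]], Dict[str, Any]]) -> None:
        self._create = create
        self._seen: Dict[str, Tuple[str, Dict[str, Any]]] = {}

    def write(self, idempotency_key: str, resource: Dict[str, Any]) -> Dict[str, Any]:
        # Fingerprint the payload so a reused key with a different body is detected.
        payload = json.dumps(resource, sort_keys=True).encode("utf-8")
        fingerprint = hashlib.sha256(payload).hexdigest()
        if idempotency_key in self._seen:
            previous_fingerprint, previous_result = self._seen[idempotency_key]
            if previous_fingerprint != fingerprint:
                raise ValueError("data_conflict: key reused with a different payload")
            return previous_result  # Replay: return the original result, create nothing.
        result = self._create(resource)
        self._seen[idempotency_key] = (fingerprint, result)
        return result
```

With this in front of the backend, a retried POST of the same Observation returns the first result instead of creating a second copy, whether the upstream is a sim or a live EHR.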

What to standardize

The answer we've landed on, after building simulators for FHIR, HL7v2, X12, voice, and fax, is the Plaid pattern. Put one canonical API in front of every rail, simulated or live.

Here's what the canonical layer owns:

| Concern | What we standardize |
| --- | --- |
| Auth | Single `Authorization: Bearer <token>` envelope. Credential flip switches backend. |
| Errors | One error taxonomy: `invalid_request`, `not_found`, `rate_limited`, `upstream_unavailable`, `data_conflict`. Raw upstream errors in a debug field. |
| Rate limits | Always return `X-RateLimit-Remaining` and `Retry-After`, even in sim. Agents learn to back off in test, not in prod. |
| Idempotency | Require `Idempotency-Key` header on all writes. Sim and prod both enforce it. |
| Pagination | Cursor-based, one shape, regardless of whether the backend uses FHIR `next` links or X12 continuations. |
| Data shape | US Core profiles as the canonical wire format, with extensions preserved but namespaced. |

With this in place, the agent code that works against a Synthea-seeded sim hits a real Epic tenant with one token swap. That is the whole goal.
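The credential flip itself can be very small. The sketch below routes on a token prefix; the `test_` and `live_` prefixes, the `Backend` type, and the example URLs are all invented for illustration. A production router would look the credential up in a secure store rather than trust a prefix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    base_url: str


# Hypothetical prefixes and URLs; a real layer would resolve tokens server-side.
BACKENDS = {
    "test_": Backend("sim", "https://sim.example.invalid/fhir"),
    "live_": Backend("prod", "https://ehr.example.invalid/fhir"),
}


def route(authorization_header: str) -> Backend:
    """Pick the backend from the bearer token; the agent's request is otherwise identical."""
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token:
        raise ValueError("invalid_request: expected 'Authorization: Bearer <token>'")
    for prefix, backend in BACKENDS.items():
        if token.startswith(prefix):
            return backend
    raise ValueError("invalid_request: unrecognized credential")
```

The agent never sees `Backend`. It sends the same request either way, and only the token decides which rail answers.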

Sim as a production drift alarm

This is the move most teams miss. Once you have a canonical API, your sim becomes a continuous regression harness. Run your agent against the same 200 benchmark tasks every night. Track latency, error rate, check pass rate, and token usage per task. These become your baseline.

When production starts drifting (say Epic rolls out a new FHIR version, or a payer portal changes its rate limits), the sim numbers won't move but the production numbers will. That delta is your alarm, and it lets you catch the regression before the hospital does.
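A drift check can be as simple as comparing each production metric to its sim baseline with a per-metric tolerance. This is a sketch under assumptions: the metric names and tolerance values below are made up, and relative difference is just one reasonable way to define drift.

```python
from typing import Dict, List


def drift_alerts(
    sim: Dict[str, float],
    prod: Dict[str, float],
    tolerance: Dict[str, float],
) -> List[str]:
    """Return one alert string per metric whose relative drift exceeds its tolerance."""
    alerts = []
    for metric, allowed in tolerance.items():
        if metric not in sim or metric not in prod:
            alerts.append(f"{metric}: missing from sim or prod metrics")
            continue
        baseline = sim[metric]
        delta = prod[metric] - baseline
        if baseline != 0:
            relative = abs(delta) / abs(baseline)
        else:
            # A zero baseline makes any change infinitely large in relative terms.
            relative = 0.0 if delta == 0 else float("inf")
        if relative > allowed:
            alerts.append(f"{metric}: sim={baseline} prod={prod[metric]} drift={relative:.2f}")
    return alerts


# Example with hypothetical numbers: only error_rate drifts past its tolerance.
sim_baseline = {"p95_latency_ms": 800.0, "error_rate": 0.02, "check_pass_rate": 0.95}
prod_metrics = {"p95_latency_ms": 840.0, "error_rate": 0.05, "check_pass_rate": 0.94}
tolerances = {"p95_latency_ms": 0.25, "error_rate": 0.5, "check_pass_rate": 0.05}
alerts = drift_alerts(sim_baseline, prod_metrics, tolerances)
```

Tolerances are per metric because a 5% latency change is noise while a 5% drop in check pass rate may not be.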

To make this work, the sim has to be faithful enough that drift means something. A HAPI server with Synthea data isn't faithful enough. It's missing the auth friction, the rate limits, the reference breakage, the 500s. That's why we built Verial around protocol-level simulators with production-shaped failure modes, not just clean happy-path data.

Redox, Health Gorilla, direct FHIR, and the honest tradeoffs

Briefly, because this could be its own post: Redox abstracts EHRs behind a canonical JSON API, great for time-to-integration, opaque when you need to debug a specific Epic quirk. Health Gorilla sits closer to FHIR with a strong focus on records retrieval. Direct EHR FHIR is purest but means N integrations where N grows every quarter.

If we had to choose, we would use a Redox-style broker for breadth and go direct FHIR only for the two or three EHR tenants that represent most of your volume. Either way, your agent should not know the difference. That's what the canonical API is for.

CMS-0057-F will force more parity

With CMS-0057-F, Medicare Advantage, Medicaid, CHIP, and QHP payers must implement Patient Access, Provider Access, Payer-to-Payer, and Prior Authorization FHIR APIs by January 1, 2027 (CMS, 2024). This is the first time payer FHIR has been mandated at this scale.

The good news is that a standard implementation guide means a smaller sim-to-prod gap, at least for payer data. The bad news is that every payer will implement the spec slightly differently, and we'll be building new simulators for each one. As this rule takes effect, expect the divergence problem to migrate from EHR vendors to payer vendors. The canonical API approach generalizes cleanly. One-off payer SDKs do not.

ONC data backs this trend: 70% of hospitals enabled FHIR-based app access in 2024, up from 56% in 2021 (ONC ASTP Data Brief 79, August 2025). FHIR is winning. The implementation details are where the pain lives.

Key takeaways

The sim-to-prod gap, not model accuracy, is what breaks most healthcare agents after go-live.
Borrow the Plaid and Stripe pattern: one canonical API in front of sim and prod, credential flip to switch.
Standardize the auth envelope, error taxonomy, rate limit headers, idempotency keys, pagination shape, and data profile.
Seven places sim and prod reliably diverge: auth, rate limits, errors, data messiness, reference integrity, pagination, idempotency.
Run your sim continuously as a regression harness. Drift between sim and prod is a production alarm.
CMS-0057-F's January 1, 2027 deadline means payer FHIR APIs are coming fast. Canonicalize now or rewrite later.
For deeper implementation details, see connecting an agent to a FHIR sandbox.

FAQ

Why not just test against production?

PHI. You can't put real patients in a test harness without a BAA, an IRB, or both, and you can't evaluate agents responsibly without a ground-truth signal. Sims give you both PHI-free data and known ground truth. The canonical API pattern gives you the confidence that the sim results generalize.

Does this mean we can't use Epic's sandbox?

You can and should, as one backend among several. The point is that your agent code doesn't talk to Epic's sandbox directly. It talks to your canonical API, which routes to Epic's sandbox in test and Epic production in prod. The agent doesn't know or care.

How faithful does the sim have to be?

Faithful enough that drift means something. That means real auth flows, real rate limits, real error shapes, messy data, broken references. If your sim is a HAPI server with clean Synthea data, it's not faithful enough to be a production alarm.

What about HL7v2 and X12?

Same pattern. The canonical API wraps an HL7v2 listener or an X12 278 transaction the same way it wraps FHIR. Agent code stays protocol-agnostic until it absolutely has to know. For a broader picture of where this fits with go-live readiness, see our healthcare AI go-live checklist.

Is the canonical API a product or a pattern?

Both. Verial ships a canonical API you can point agents at, backed by our simulators in test and by broker or direct EHR integrations in prod. But the pattern works regardless of vendor. The worst thing you can do is hard-code to a specific sandbox's quirks and discover in week one of pilot that production doesn't look like that.