Testing Healthcare AI Across FHIR, Voice, and Portals
Real healthcare workflows span FHIR servers, payer phone lines, insurance portals, and claims. Testing them in isolation misses the failures that matter.
TL;DR
- Healthcare AI workflows cross four or more interfaces (FHIR, IVR, portal, X12), and isolated interface tests routinely miss the cross-boundary failures that cause production outages.
- CAQH's 2024 Index puts the US medical and dental industry's admin spending at $597 billion annually, with prior auth among the most costly manual transactions.
- A single prior auth workflow typically touches four interfaces: FHIR read, IVR call, portal submission, and X12 276/277 status, each with its own auth, timing, and data format.
- A unified sandbox where FHIR, IVR, and portal share the same patient state is what makes multi-interface testing meaningful.
A single workflow, four interfaces
Consider an AI agent handling a prior auth request for a knee MRI. On paper, the workflow is simple: check eligibility, submit, track status, handle the result. In practice, the agent touches at least four interfaces.
It reads clinical data from the EHR's FHIR API: ordering provider NPI, diagnosis codes, insurance, imaging history. Then it calls the payer's phone IVR to check eligibility and pick the correct submission pathway. Some payers require phone-based auth for imaging. Others accept portal submissions. Rules change quarterly. Next, it navigates the payer's portal to submit the request, uploading documentation and filling fields that vary by payer and procedure. Finally, it polls for status through a mix of portal checks, phone callbacks, and X12 276/277 transactions.
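The shape of that workflow, with every client and method name a hypothetical stand-in rather than a real SDK, looks roughly like this:

```python
# Minimal sketch of the four-interface flow. All client objects and method
# names are hypothetical stand-ins, not a real SDK.

def handle_prior_auth(patient_id: str, fhir, ivr, portal, x12):
    # 1. FHIR read: clinical context for the request (milliseconds).
    context = fhir.read_clinical_context(patient_id)  # NPI, dx codes, coverage

    # 2. IVR call: eligibility and the payer's required submission pathway
    #    (minutes, including hold time).
    ruling = ivr.check_eligibility(context.coverage, procedure="knee MRI")

    # 3. Portal submission: document upload plus fields that vary by payer.
    confirmation = portal.submit_prior_auth(context, ruling.reference_number)

    # 4. Status tracking: portal polls, phone callbacks, and X12 276/277
    #    queries over hours or days.
    return x12.track_status(confirmation.tracking_id)
```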
Each interface has its own auth, data format, failure modes, and timing. Testing them in isolation tells you each integration works. It does not tell you the workflow works.
"Providers are drowning in administrative transactions. Automation is only valuable if the systems actually talk to each other end to end."
-- April Todd, SVP Policy and CORE, CAQH (CAQH 2024 Index)
Why isolated interface testing fails
Most teams build and test each integration independently. The FHIR client has unit tests that read Patient and Condition resources. The voice module has tests for phone tree navigation. The portal automation has login and form tests. Each suite passes.
Then the agent runs the full workflow and fails in ways no individual test anticipated. These are the operational failures that clinical benchmarks never test.
Data format mismatches between interfaces
The FHIR server returns a diagnosis as a SNOMED code. The portal expects an ICD-10 code in a specific field. The IVR asks for the diagnosis "in plain English." Your agent translates the same concept across three representations, and each step is a failure point.
Common bug: the agent reads Condition.code.coding[0].code (SNOMED), maps to ICD-10 via a lookup, enters it in the portal. But the portal field has a typeahead that matches on ICD-10 description text, not the code. The agent types "M17.11" and gets nothing. A human would type "primary osteoarthritis, right knee" instead. This only surfaces when FHIR data flows into portal context.
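A minimal sketch of the translation chain, with an illustrative one-entry crosswalk standing in for a real SNOMED-to-ICD-10 mapping:

```python
# Illustrative crosswalk; a real SNOMED-to-ICD-10 mapping service would
# replace this dict.
SNOMED_TO_ICD10 = {
    # SNOMED CT code (illustrative) -> (ICD-10-CM code, description text)
    "239873007": ("M17.11", "Unilateral primary osteoarthritis, right knee"),
}

def portal_typeahead_value(snomed_code: str) -> str:
    """Return the text a description-matching portal typeahead will accept.

    Typing the bare code ("M17.11") into a typeahead that matches on
    description text returns nothing; the agent must type the description.
    """
    icd10_code, description = SNOMED_TO_ICD10[snomed_code]
    return description  # not icd10_code
```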
State dependencies across interfaces
The IVR tells the agent prior auth is required and gives a reference number. The agent needs that number in the portal submission. If the IVR returns the reference in an unexpected format (with dashes, with a prefix, spoken as individual digits), the agent stores it wrong and portal validation fails.
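One defensive pattern is to normalize the reference before any other interface sees it. A sketch, with the accepted input formats assumed for illustration:

```python
import re

# Hedged sketch of normalizing an IVR-provided reference number before it
# crosses into the portal. The accepted input formats are assumptions.
SPOKEN_TOKENS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "dash": "-",
}

def normalize_reference(raw: str) -> str:
    """Collapse 'PA-2024-001', 'PA 2024 001', and spoken digit strings like
    'P A two zero two four dash zero zero one' into one canonical form."""
    tokens = [SPOKEN_TOKENS.get(t, t) for t in raw.lower().split()]
    # Drop separators, then uppercase any alpha prefix.
    return re.sub(r"[-\s]", "", "".join(tokens)).upper()
```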
Multi-step workflows compound this. An agent might need to:
- Read clinical data from FHIR
- Call the IVR for auth requirements
- Gather extra documentation from the EHR
- Submit via portal with fields populated from steps 1 to 3
- Store confirmation back in the EHR as a DocumentReference
A subtle error in step 2 cascades through steps 3 to 5. Isolated tests cannot catch this.
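One mitigation is to carry explicit workflow state and validate it at each boundary. A sketch, with field names and the reference format assumed for illustration:

```python
import re
from dataclasses import dataclass, field

# Explicit workflow state with a guard between steps, so a bad value from
# step 2 fails fast instead of cascading into steps 3 to 5. Field names
# and the reference format are assumptions.
@dataclass
class PriorAuthState:
    patient_id: str
    diagnosis_codes: list[str] = field(default_factory=list)  # from step 1
    ivr_reference: str | None = None                          # from step 2
    portal_confirmation: str | None = None                    # from step 4

def guard_ivr_reference(state: PriorAuthState) -> None:
    """Validate the IVR reference before the portal submission consumes it."""
    if not state.ivr_reference or not re.fullmatch(r"PA\d{7}", state.ivr_reference):
        raise ValueError(f"IVR reference looks malformed: {state.ivr_reference!r}")
```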
Timing and ordering dependencies
Interfaces operate on wildly different timescales. A FHIR call returns in milliseconds. An IVR call takes 3 to 15 minutes including hold time. A portal load with upload takes 10 to 60 seconds. Status updates appear hours or days later.
Subtle bugs that result:
- Token expiration during slow operations. The agent authenticates to FHIR, starts a long IVR call, then writes back to FHIR, but the token expired mid-call. Without mid-workflow refresh, the write-back fails and the results of the IVR call are lost (see the sketch after this list).
- Portal session timeouts. Agent logs in, gathers FHIR and IVR data, returns to the portal. Session is gone. Agent must re-auth without losing form data.
- Race conditions on status checks. Agent submits and immediately checks status. The payer has not processed yet. Agent sees "not found" and reads it as a rejection.
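A minimal sketch of the first fix: a token manager that refreshes proactively before the write-back, assuming a hypothetical refresh_token() callable:

```python
import time

# Hedged sketch of proactive token refresh. refresh_token is a hypothetical
# callable returning (access_token, expires_in_seconds), not a real SDK call.
class TokenManager:
    def __init__(self, refresh_token, skew_seconds: int = 60):
        self._refresh = refresh_token
        self._skew = skew_seconds
        self._token = None
        self._expires_at = 0.0

    def get(self) -> str:
        # Refresh if the token is expired or inside the skew window, so a
        # 15-minute IVR call never leaves the write-back with a dead token.
        if self._token is None or time.time() >= self._expires_at - self._skew:
            self._token, ttl = self._refresh()
            self._expires_at = time.time() + ttl
        return self._token
```

The agent calls get() immediately before the write-back instead of reusing the token it fetched before the IVR call.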
Authentication complexity
Each interface requires different auth:
- FHIR: OAuth 2.0 with SMART on FHIR launch context, per-patient tokens
- Voice IVR: Caller ID verification, PIN or DOB spoken auth
- Payer portals: Username/password with MFA, session cookies, CSRF tokens
- X12: Trading partner agreements, SFTP keys, clearinghouse credentials
Each isolated auth test passes. The full workflow requires the agent to hold concurrent sessions across all four contexts, refreshing tokens and handling drops without interrupting work.
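A sketch of what holding those contexts side by side might look like; the ensure_live/reauthenticate interface is illustrative, not a real API:

```python
# Illustrative broker for the four concurrent auth contexts. Each context
# object is assumed to expose is_live() and reauthenticate().
class SessionBroker:
    def __init__(self, fhir_oauth, ivr_caller_id, portal_login, x12_sftp):
        self.contexts = {
            "fhir": fhir_oauth,       # OAuth 2.0 / SMART on FHIR token
            "ivr": ivr_caller_id,     # caller ID plus spoken PIN or DOB
            "portal": portal_login,   # password + MFA, cookies, CSRF
            "x12": x12_sftp,          # SFTP keys, clearinghouse creds
        }

    def ensure_live(self, name: str):
        """Re-auth the named context if its session has dropped, without
        touching the other three."""
        ctx = self.contexts[name]
        if not ctx.is_live():
            ctx.reauthenticate()
        return ctx
```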
Building end-to-end test scenarios
Scenario structure
A good multi-interface scenario specifies four things (a minimal sketch follows the list):
- Initial state: what exists in the FHIR server before the workflow (demographics, conditions, coverage)
- Interface behavior: how each interface responds (IVR menu, portal fields, error paths)
- Expected actions: what the agent should do at each step
- Expected outcomes: final state (new FHIR resources, portal confirmation, tracked status)
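Expressed as data, a scenario might look like this; every key and value is an assumed shape for illustration, not a real schema:

```python
# Hypothetical scenario definition; keys and values are assumptions.
KNEE_MRI_SCENARIO = {
    "initial_state": {
        "fhir": {
            "Patient": {"name": "Robert Chen", "birthDate": "1968-03-14"},
            "Condition": {"code": "M17.11"},
            "Coverage": {"payor": "ExamplePayer", "member_id": "XP123456"},
        },
    },
    "interface_behavior": {
        "ivr": {"auth_required": True, "reference": "PA2024001"},
        "portal": {"required_fields": ["icd10_description", "reference"]},
    },
    "expected_actions": [
        "fhir: read Patient, Condition, Coverage",
        "ivr: navigate eligibility -> prior auth requirements",
        "portal: submit with the IVR reference number",
    ],
    "expected_outcomes": {
        "fhir": {"DocumentReference": "submission confirmation"},
        "portal": {"status": "submitted"},
    },
}
```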
Scenario categories
Happy path with variations. Standard workflow succeeds with varied menu structures, form layouts, and FHIR shapes.
Cross-interface failures. One interface fails mid-workflow and the agent recovers. IVR disconnects. Portal returns 500. FHIR is briefly unavailable for the write-back.
Data inconsistency. Patient's name in FHIR is "Robert" but IVR asks for "Bob." Insurance ID format differs. The diagnosis from FHIR maps to an ICD-10 code the payer rejects for this procedure.
Timing scenarios. Workflow spans hours or days. Agent submits, waits, checks back. Agent over-polls and triggers rate limiting. Or under-polls and misses a time-sensitive denial.
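A hedged sketch of the polling discipline these timing scenarios should exercise: bounded backoff to avoid rate limiting, with a grace window before "not found" counts as a real failure:

```python
import time

# check is a hypothetical callable returning "approved", "denied",
# "pending", or "not_found". All thresholds are illustrative.
def poll_status(check, grace_seconds=600, max_interval=300, deadline=86400):
    start = time.time()
    interval = 15.0
    while time.time() - start < deadline:
        status = check()
        if status in ("approved", "denied"):
            return status
        if status == "not_found" and time.time() - start > grace_seconds:
            return "not_found"  # only past the grace window is this a failure
        time.sleep(interval)
        interval = min(interval * 2, max_interval)  # backoff, capped
    return "timeout"
```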
Payer-specific scenarios. Aetna works. United fails because United requires an extra clinical questionnaire that Aetna does not.
The case for unified sandbox environments
Separate test environments for each interface are expensive and fragile. You end up with a FHIR test server, a mock IVR, recorded portal interactions, and a simulated claims pipeline. Each needs maintenance. Worse, they cannot coordinate. The mock IVR does not know what the FHIR server contains, so it cannot answer patient-specific questions.
A unified sandbox provides all interfaces as a coordinated system. FHIR, IVR, portal, and claims share the same patient data. When the agent reads a diagnosis from FHIR and reports it to the IVR, the IVR knows the correct value and can validate. When the agent fills a portal form with IVR data, the portal checks that the reference number matches.
This coordination is what makes multi-interface testing meaningful. Without it, you test each integration in a vacuum.
What coordination looks like
A coordinated sandbox for prior auth includes:
- A FHIR R4 server seeded with a scenario-driven synthetic patient carrying conditions, coverage, and a pending ServiceRequest
- A simulated IVR that knows the patient's coverage and responds with correct auth requirements
- A simulated portal with form fields matching the payer's real submission, expecting data from the FHIR server and IVR
- An evaluation engine that checks each step: right FHIR read, right IVR navigation, accurate portal fill, correct write-back
Each component is aware of the others. Evaluation checks that data flowed across all of them, not just that the agent called each.
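A sketch of that coordination, with all classes illustrative: one shared record backs every simulated interface, so each can validate what the agent carries across boundaries.

```python
# Illustrative sandbox components; not a product API.
class SharedPatientState(dict):
    """Single source of truth, seeded from the scenario definition."""

class SimulatedIVR:
    def __init__(self, state: SharedPatientState):
        self.state = state

    def answer_member_id(self) -> str:
        # The IVR can answer patient-specific questions because it reads
        # the same record the FHIR server was seeded from.
        return self.state["coverage"]["member_id"]

class SimulatedPortal:
    def __init__(self, state: SharedPatientState):
        self.state = state

    def validate_submission(self, form: dict) -> bool:
        # Cross-interface check: the reference number the agent types must
        # match the one the simulated IVR issued for this same patient.
        return form.get("reference") == self.state.get("ivr_reference")
```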
Key Takeaways
- Real healthcare workflows cross four or more interfaces. Testing each in isolation catches the wrong class of bugs.
- Cross-interface failures include data format mismatches, state dependencies, timing issues, and auth session drops.
- A unified sandbox where all interfaces share the same patient state is what makes end-to-end testing real.
- Scenario categories to build: happy-path variations, cross-interface failures, data inconsistency, timing, and payer-specific.
- CAQH's 2024 Index reports $597B in US admin spending; prior auth is among the most expensive manual transactions.
- Evaluation must check data flow across interfaces, not just that each interface was called.
FAQ
Why do isolated integration tests miss multi-interface bugs?
Because they cannot exercise state that spans interfaces: token expiry during an IVR call, a reference number corrupted between IVR and portal, FHIR-to-ICD-10 text mismatches in typeahead fields. None of these surface when you test interfaces one at a time.
What belongs in a multi-interface test scenario?
Initial FHIR state, interface behavior definitions (IVR menus, portal fields, error paths), expected agent actions at each step, and expected final state across all interfaces. Scenarios should be deterministic and rerunnable.
Does a unified sandbox need to match production performance characteristics?
It should simulate the right timing behavior (IVR hold time, portal load, async status updates) so your agent's retry and timeout logic is exercised. Exact latency matching is not required; ordering and relative timing are.
How many payer-specific scenarios should I test?
At minimum, cover your top three payers by volume, each with distinct auth, portal, and IVR patterns. Add scenarios when you see production failures that an existing scenario does not reproduce.
The cost of not testing across interfaces
Teams that test in isolation report the same pattern. Individual tests pass. Production workflows fail at the boundaries. FHIR data does not match portal expectations. The IVR reference gets corrupted. Tokens expire between steps.
These are not exotic edge cases. They are the normal failure modes of multi-interface workflows. The only way to catch them is to test the way the agent actually works.
Related articles
HIMSS26's Agentic AI Gap Is an Eval Problem
HIMSS26 showed health systems deploying agents faster than they can audit them. The fix isn't more governance theater; it's independent simulation.
The Agent RFP: How Hospitals Should Evaluate AI in 2026
Slide decks and 3-month pilots can't tell you if an AI agent survives your workflows. Here's how the agent RFP replaces slideware with sim-based bakeoffs.