Testing Healthcare AI Across FHIR, Voice, and Portals
Real healthcare workflows span FHIR servers, payer phone lines, insurance portals, and claims. Testing them in isolation misses the failures that matter.
TL;DR
- Healthcare AI workflows cross four or more interfaces (FHIR, IVR, portal, X12), and isolated interface tests routinely miss the cross-boundary failures that cause production outages.
- CAQH's 2024 Index puts the US medical and dental industry's admin spending at $597 billion annually, with prior auth among the most costly manual transactions.
- A single prior auth workflow typically touches four interfaces: FHIR read, IVR call, portal submission, and X12 276/277 status, each with its own auth, timing, and data format.
- A unified sandbox where FHIR, IVR, and portal share the same patient state is what makes multi-interface testing meaningful.
A single workflow, four interfaces
Consider an AI agent handling a prior auth request for a knee MRI. On paper, the workflow is simple: check eligibility, submit, track status, handle the result. In practice, the agent touches at least four interfaces.
It reads clinical data from the EHR's FHIR API: ordering provider NPI, diagnosis codes, insurance, imaging history. Then it calls the payer's phone IVR to check eligibility and pick the correct submission pathway. Some payers require phone-based auth for imaging. Others accept portal submissions. Rules change quarterly. Next, it navigates the payer's portal to submit the request, uploading documentation and filling fields that vary by payer and procedure. Finally, it polls for status through a mix of portal checks, phone callbacks, and X12 276/277 transactions.
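The shape of that workflow, with every client and method name a hypothetical stand-in rather than a real SDK, looks roughly like this:

```python
# Minimal sketch of the four-interface flow. All client objects and method
# names are hypothetical stand-ins, not a real SDK.

def handle_prior_auth(patient_id: str, fhir, ivr, portal, x12):
    # 1. FHIR read: clinical context for the request (milliseconds).
    context = fhir.read_clinical_context(patient_id)  # NPI, dx codes, coverage

    # 2. IVR call: eligibility and the payer's required submission pathway
    #    (minutes, including hold time).
    ruling = ivr.check_eligibility(context.coverage, procedure="knee MRI")

    # 3. Portal submission: document upload plus fields that vary by payer.
    confirmation = portal.submit_prior_auth(context, ruling.reference_number)

    # 4. Status tracking: portal polls, phone callbacks, and X12 276/277
    #    queries over hours or days.
    return x12.track_status(confirmation.tracking_id)
```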
Each interface has its own auth, data format, failure modes, and timing. Testing them in isolation tells you each integration works. It does not tell you the workflow works.
"Providers are drowning in administrative transactions. Automation is only valuable if the systems actually talk to each other end to end."
-- April Todd, SVP Policy and CORE, CAQH (CAQH 2024 Index)
Why isolated interface testing fails
Most teams build and test each integration independently. The FHIR client has unit tests that read Patient and Condition resources. The voice module has tests for phone tree navigation. The portal automation has login and form tests. Each suite passes.
Then the agent runs the full workflow and fails in ways no individual test anticipated. These are the operational failures that clinical benchmarks never test.
Data format mismatches between interfaces
The FHIR server returns a diagnosis as a SNOMED code. The portal expects an ICD-10 code in a specific field. The IVR asks for the diagnosis "in plain English." Your agent translates the same concept across three representations, and each step is a failure point.
Common bug: the agent reads Condition.code.coding[0].code (SNOMED), maps to ICD-10 via a lookup, enters it in the portal. But the portal field has a typeahead that matches on ICD-10 description text, not the code. The agent types "M17.11" and gets nothing. A human would type "primary osteoarthritis, right knee" instead. This only surfaces when FHIR data flows into portal context.
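A minimal sketch of the translation chain, with an illustrative one-entry crosswalk standing in for a real SNOMED-to-ICD-10 mapping:

```python
# Illustrative crosswalk; a real SNOMED-to-ICD-10 mapping service would
# replace this dict.
SNOMED_TO_ICD10 = {
    # SNOMED CT code (illustrative) -> (ICD-10-CM code, description text)
    "239873007": ("M17.11", "Unilateral primary osteoarthritis, right knee"),
}

def portal_typeahead_value(snomed_code: str) -> str:
    """Return the text a description-matching portal typeahead will accept.

    Typing the bare code ("M17.11") into a typeahead that matches on
    description text returns nothing; the agent must type the description.
    """
    icd10_code, description = SNOMED_TO_ICD10[snomed_code]
    return description  # not icd10_code
```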
State dependencies across interfaces
The IVR tells the agent prior auth is required and gives a reference number. The agent needs that number in the portal submission. If the IVR returns the reference in an unexpected format (with dashes, with a prefix, spoken as individual digits), the agent stores it wrong and portal validation fails.
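One defensive pattern is to normalize the reference before any other interface sees it. A sketch, with the accepted input formats assumed for illustration:

```python
import re

# Hedged sketch of normalizing an IVR-provided reference number before it
# crosses into the portal. The accepted input formats are assumptions.
SPOKEN_TOKENS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "dash": "-",
}

def normalize_reference(raw: str) -> str:
    """Collapse 'PA-2024-001', 'PA 2024 001', and spoken digit strings like
    'P A two zero two four dash zero zero one' into one canonical form."""
    tokens = [SPOKEN_TOKENS.get(t, t) for t in raw.lower().split()]
    # Drop separators, then uppercase any alpha prefix.
    return re.sub(r"[-\s]", "", "".join(tokens)).upper()
```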
Multi-step workflows compound this. An agent might need to:
- Read clinical data from FHIR
- Call the IVR for auth requirements
- Gather extra documentation from the EHR
- Submit via portal with fields populated from steps 1 to 3
- Store confirmation back in the EHR as a DocumentReference
A subtle error in step 2 cascades through steps 3 to 5. Isolated tests cannot catch this.
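One mitigation is to carry explicit workflow state and validate it at each boundary. A sketch, with field names and the reference format assumed for illustration:

```python
import re
from dataclasses import dataclass, field

# Explicit workflow state with a guard between steps, so a bad value from
# step 2 fails fast instead of cascading into steps 3 to 5. Field names
# and the reference format are assumptions.
@dataclass
class PriorAuthState:
    patient_id: str
    diagnosis_codes: list[str] = field(default_factory=list)  # from step 1
    ivr_reference: str | None = None                          # from step 2
    portal_confirmation: str | None = None                    # from step 4

def guard_ivr_reference(state: PriorAuthState) -> None:
    """Validate the IVR reference before the portal submission consumes it."""
    if not state.ivr_reference or not re.fullmatch(r"PA\d{7}", state.ivr_reference):
        raise ValueError(f"IVR reference looks malformed: {state.ivr_reference!r}")
```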
Timing and ordering dependencies
Interfaces operate on wildly different timescales. A FHIR call returns in milliseconds. An IVR call takes 3 to 15 minutes including hold time. A portal load with upload takes 10 to 60 seconds. Status updates appear hours or days later.
Subtle bugs that result:
- Token expiration during slow operations. The agent authenticates to FHIR, starts a long IVR call, then writes back to FHIR, but the token expired mid-call. Without mid-workflow refresh, the write-back fails and the results of the IVR call are lost (see the sketch after this list).
- Portal session timeouts. Agent logs in, gathers FHIR and IVR data, returns to the portal. Session is gone. Agent must re-auth without losing form data.
- Race conditions on status checks. Agent submits and immediately checks status. The payer has not processed yet. Agent sees "not found" and reads it as a rejection.
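A minimal sketch of the first fix: a token manager that refreshes proactively before the write-back, assuming a hypothetical refresh_token() callable:

```python
import time

# Hedged sketch of proactive token refresh. refresh_token is a hypothetical
# callable returning (access_token, expires_in_seconds), not a real SDK call.
class TokenManager:
    def __init__(self, refresh_token, skew_seconds: int = 60):
        self._refresh = refresh_token
        self._skew = skew_seconds
        self._token = None
        self._expires_at = 0.0

    def get(self) -> str:
        # Refresh if the token is expired or inside the skew window, so a
        # 15-minute IVR call never leaves the write-back with a dead token.
        if self._token is None or time.time() >= self._expires_at - self._skew:
            self._token, ttl = self._refresh()
            self._expires_at = time.time() + ttl
        return self._token
```

The agent calls get() immediately before the write-back instead of reusing the token it fetched before the IVR call.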
Authentication complexity
Each interface requires different auth:
- FHIR: OAuth 2.0 with SMART on FHIR launch context, per-patient tokens
- Voice IVR: Caller ID verification, PIN or DOB spoken auth
- Payer portals: Username/password with MFA, session cookies, CSRF tokens
- X12: Trading partner agreements, SFTP keys, clearinghouse credentials
Each isolated auth test passes. The full workflow requires the agent to hold concurrent sessions across all four contexts, refreshing tokens and handling drops without interrupting work.
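A sketch of what holding those contexts side by side might look like; the ensure_live/reauthenticate interface is illustrative, not a real API:

```python
# Illustrative broker for the four concurrent auth contexts. Each context
# object is assumed to expose is_live() and reauthenticate().
class SessionBroker:
    def __init__(self, fhir_oauth, ivr_caller_id, portal_login, x12_sftp):
        self.contexts = {
            "fhir": fhir_oauth,       # OAuth 2.0 / SMART on FHIR token
            "ivr": ivr_caller_id,     # caller ID plus spoken PIN or DOB
            "portal": portal_login,   # password + MFA, cookies, CSRF
            "x12": x12_sftp,          # SFTP keys, clearinghouse creds
        }

    def ensure_live(self, name: str):
        """Re-auth the named context if its session has dropped, without
        touching the other three."""
        ctx = self.contexts[name]
        if not ctx.is_live():
            ctx.reauthenticate()
        return ctx
```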
Building end-to-end test scenarios
Scenario structure
A good multi-interface scenario specifies four things (a minimal sketch follows the list):
- Initial state: what exists in the FHIR server before the workflow (demographics, conditions, coverage)
- Interface behavior: how each interface responds (IVR menu, portal fields, error paths)
- Expected actions: what the agent should do at each step
- Expected outcomes: final state (new FHIR resources, portal confirmation, tracked status)
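Expressed as data, a scenario might look like this; every key and value is an assumed shape for illustration, not a real schema:

```python
# Hypothetical scenario definition; keys and values are assumptions.
KNEE_MRI_SCENARIO = {
    "initial_state": {
        "fhir": {
            "Patient": {"name": "Robert Chen", "birthDate": "1968-03-14"},
            "Condition": {"code": "M17.11"},
            "Coverage": {"payor": "ExamplePayer", "member_id": "XP123456"},
        },
    },
    "interface_behavior": {
        "ivr": {"auth_required": True, "reference": "PA2024001"},
        "portal": {"required_fields": ["icd10_description", "reference"]},
    },
    "expected_actions": [
        "fhir: read Patient, Condition, Coverage",
        "ivr: navigate eligibility -> prior auth requirements",
        "portal: submit with the IVR reference number",
    ],
    "expected_outcomes": {
        "fhir": {"DocumentReference": "submission confirmation"},
        "portal": {"status": "submitted"},
    },
}
```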
Scenario categories
Happy path with variations. Standard workflow succeeds with varied menu structures, form layouts, and FHIR shapes.
Cross-interface failures. One interface fails mid-workflow and the agent recovers. IVR disconnects. Portal returns 500. FHIR is briefly unavailable for the write-back.
Data inconsistency. Patient's name in FHIR is "Robert" but IVR asks for "Bob." Insurance ID format differs. The diagnosis from FHIR maps to an ICD-10 code the payer rejects for this procedure.
Timing scenarios. Workflow spans hours or days. Agent submits, waits, checks back. Agent over-polls and triggers rate limiting. Or under-polls and misses a time-sensitive denial.
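A hedged sketch of the polling discipline these timing scenarios should exercise: bounded backoff to avoid rate limiting, with a grace window before "not found" counts as a real failure:

```python
import time

# check is a hypothetical callable returning "approved", "denied",
# "pending", or "not_found". All thresholds are illustrative.
def poll_status(check, grace_seconds=600, max_interval=300, deadline=86400):
    start = time.time()
    interval = 15.0
    while time.time() - start < deadline:
        status = check()
        if status in ("approved", "denied"):
            return status
        if status == "not_found" and time.time() - start > grace_seconds:
            return "not_found"  # only past the grace window is this a failure
        time.sleep(interval)
        interval = min(interval * 2, max_interval)  # backoff, capped
    return "timeout"
```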
Payer-specific scenarios. Aetna works. United fails because United requires an extra clinical questionnaire that Aetna does not.
The case for unified sandbox environments
Separate test environments for each interface are expensive and fragile. You end up with a FHIR test server, a mock IVR, recorded portal interactions, and a simulated claims pipeline. Each needs maintenance. Worse, they cannot coordinate. The mock IVR does not know what the FHIR server contains, so it cannot answer patient-specific questions.
A unified sandbox provides all interfaces as a coordinated system. FHIR, IVR, portal, and claims share the same patient data. When the agent reads a diagnosis from FHIR and reports it to the IVR, the IVR knows the correct value and can validate. When the agent fills a portal form with IVR data, the portal checks that the reference number matches.
This coordination is what makes multi-interface testing meaningful. Without it, you test each integration in a vacuum.
What coordination looks like
A coordinated sandbox for prior auth includes:
- A FHIR R4 server seeded with a scenario-driven synthetic patient carrying conditions, coverage, and a pending ServiceRequest
- A simulated IVR that knows the patient's coverage and responds with correct auth requirements
- A simulated portal with form fields matching the payer's real submission, expecting data from the FHIR server and IVR
- An evaluation engine that checks each step: right FHIR read, right IVR navigation, accurate portal fill, correct write-back
Each component is aware of the others. Evaluation checks that data flowed across all of them, not just that the agent called each.
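A sketch of that coordination, with all classes illustrative: one shared record backs every simulated interface, so each can validate what the agent carries across boundaries.

```python
# Illustrative sandbox components; not a product API.
class SharedPatientState(dict):
    """Single source of truth, seeded from the scenario definition."""

class SimulatedIVR:
    def __init__(self, state: SharedPatientState):
        self.state = state

    def answer_member_id(self) -> str:
        # The IVR can answer patient-specific questions because it reads
        # the same record the FHIR server was seeded from.
        return self.state["coverage"]["member_id"]

class SimulatedPortal:
    def __init__(self, state: SharedPatientState):
        self.state = state

    def validate_submission(self, form: dict) -> bool:
        # Cross-interface check: the reference number the agent types must
        # match the one the simulated IVR issued for this same patient.
        return form.get("reference") == self.state.get("ivr_reference")
```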
Key Takeaways
- Real healthcare workflows cross four or more interfaces. Testing each in isolation catches the wrong class of bugs.
- Cross-interface failures include data format mismatches, state dependencies, timing issues, and auth session drops.
- A unified sandbox where all interfaces share the same patient state is what makes end-to-end testing real.
- Scenario categories to build: happy-path variations, cross-interface failures, data inconsistency, timing, and payer-specific.
- CAQH's 2024 Index reports $597B in US admin spending; prior auth is among the most expensive manual transactions.
- Evaluation must check data flow across interfaces, not just that each interface was called.
FAQ
Why do isolated integration tests miss multi-interface bugs?
Because they cannot exercise state that spans interfaces: token expiry during an IVR call, a reference number corrupted between IVR and portal, FHIR-to-ICD-10 text mismatches in typeahead fields. None of these surface when you test interfaces one at a time.
What belongs in a multi-interface test scenario?
Initial FHIR state, interface behavior definitions (IVR menus, portal fields, error paths), expected agent actions at each step, and expected final state across all interfaces. Scenarios should be deterministic and rerunnable.
Does a unified sandbox need to match production performance characteristics?
It should simulate the right timing behavior (IVR hold time, portal load, async status updates) so your agent's retry and timeout logic is exercised. Exact latency matching is not required; ordering and relative timing are.
How many payer-specific scenarios should I test?
At minimum, cover your top three payers by volume, each with distinct auth, portal, and IVR patterns. Add scenarios when you see production failures that an existing scenario does not reproduce.
The cost of not testing across interfaces
Teams that test in isolation report the same pattern. Individual tests pass. Production workflows fail at the boundaries. FHIR data does not match portal expectations. The IVR reference gets corrupted. Tokens expire between steps.
These are not exotic edge cases. They are the normal failure modes of multi-interface workflows. The only way to catch them is to test the way the agent actually works.
Related articles
HIMSS26's Agentic AI Gap Is an Eval Problem
HIMSS26 showed health systems deploying agents faster than they can audit them. The fix isn't more governance theater; it's independent simulation.
The Agent RFP: How Hospitals Should Evaluate AI in 2026
Slide decks and 3-month pilots can't tell you if an AI agent survives your workflows. Here's how the agent RFP replaces slideware with sim-based bakeoffs.