
Voice Agent IVR Testing: A Practical How-To Guide [2026]

Step-by-step how-to for building IVR test scenarios for healthcare voice agents: DTMF, speech recognition, hold handling, and representative interactions.

Kevin Huang · Co-founder, Verial
·7 min read

TL;DR

  • A practical how-to for writing IVR test scenarios: structure tests around authentication, verification, action, and transfer legs. Cover DTMF and speech-recognition paths separately.
  • Over 90% of prior auth and eligibility requests still involve a phone call at some step (CAQH Index 2024), which is why payer IVR testing cannot be skipped.
  • Target above 95% navigation accuracy and above 98% data-extraction accuracy in simulated tests before going live.
  • For the strategic case and testing framework, see testing voice AI agents against healthcare IVR systems.

What healthcare IVRs look like

Every payer IVR has the same four legs: initial greeting and routing, identification and verification, automated lookup or live transfer, and the representative conversation.

A typical Aetna greeting:

"Thank you for calling Aetna. For pharmacy, press 1. For medical claims, press 2. For eligibility, press 3. For prior authorization, press 4."

Some systems also accept speech ("say claims, eligibility, or prior authorization"). Your agent needs to handle both, because payers mix modes and switch without warning.

After routing, the IVR asks for identification: provider NPI or tax ID, member ID, patient date of birth, claim number for status inquiries. Input methods differ: some want digit-by-digit DTMF, others accept full speech.

Per the American Medical Association, the average practice spends 12 hours per physician per week on prior auth alone, much of it on these calls. That scale is why voice automation is worth testing properly.

Scenario 1: Claim status inquiry

The most common IVR interaction.

  1. Call the payer line, navigate to medical claims
  2. Enter provider NPI
  3. Enter member ID and date of birth
  4. Enter claim number or date of service
  5. Listen to status (paid, denied, in process, pending)
  6. Capture payment amount, denial reason code, expected processing date

Test variations to cover:

  • Claim found and paid with EOB details
  • Claim denied with specific reason code (CO-4, PR-1)
  • Claim in process with estimated completion date
  • Claim not found
  • Multiple matching claims where the IVR lists them
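
The steps and variations above can be encoded as reusable scenario fixtures. A minimal sketch, assuming a simple dataclass-based harness (`IVRStep`, `ClaimStatusScenario`, and all field values are illustrative, not from any specific framework or real payer data):

```python
from dataclasses import dataclass, field

@dataclass
class IVRStep:
    """One leg of the IVR interaction: what the agent sends, what it expects back."""
    action: str          # e.g. "navigate", "enter_npi", "listen"
    payload: str = ""    # digits or speech to send
    expect: str = ""     # substring the IVR prompt should contain

@dataclass
class ClaimStatusScenario:
    name: str
    steps: list[IVRStep] = field(default_factory=list)
    expected_status: str = ""                            # "paid", "denied", "in process", ...
    expected_fields: dict = field(default_factory=dict)  # ground truth the agent must extract

# One fixture per variation: found-and-paid, denied, in-process, not-found, multiple matches.
denied = ClaimStatusScenario(
    name="claim_denied_CO-4",
    steps=[
        IVRStep("navigate", payload="2", expect="medical claims"),
        IVRStep("enter_npi", payload="1234567893"),
        IVRStep("enter_member_id", payload="W123456789"),
        IVRStep("enter_claim_number", payload="20260101123"),
        IVRStep("listen"),
    ],
    expected_status="denied",
    expected_fields={"denial_reason_code": "CO-4"},
)
```

Keeping scenarios as data rather than code makes the "claim not found" and "multiple matches" variations a copy-and-edit exercise.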

Scenario 2: Benefits verification

  1. Navigate to eligibility
  2. Provide member identification
  3. Request a specific benefit category (outpatient imaging, DME)
  4. Capture deductible status, copay, coinsurance, out-of-pocket max, limitations

Test variations: active coverage, terminated coverage, coverage with prior auth requirements, in-network vs out-of-network tiers.
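
These variations parametrize cleanly, so every coverage state runs through the same four steps. A sketch with illustrative fixture values (not real payer data; `extract` stands in for whatever callable your agent-under-test exposes):

```python
# Each entry: (coverage state, ground-truth fields the agent must extract).
BENEFIT_VARIATIONS = [
    ("active", {"deductible_remaining": 500.00, "copay": 40.00}),
    ("terminated", {"termination_date": "2025-12-31"}),
    ("prior_auth_required", {"auth_required": True}),
    ("out_of_network", {"coinsurance": 0.40}),
]

def run_benefit_case(state, expected, extract):
    """Run one variation; `extract` is the agent-under-test's extraction callable.
    Returns the fields that were missing or wrong (empty dict = pass)."""
    got = extract(state)
    return {k: v for k, v in expected.items() if got.get(k) != v}
```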

Scenario 3: Prior authorization

The most complex scenario because it often requires a live representative.

  1. Navigate to prior authorization
  2. Enter identification
  3. Determine whether the IVR can complete the request or must transfer
  4. If IVR: enter procedure codes, diagnosis codes, dates of service
  5. If representative: wait on hold, then interact

Representative interaction adds speech-to-speech complexity. For the failure catalog behind automated prior auth, see prior auth agent failure modes.
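
The branch at step 3 is worth making explicit in the scenario runner so both paths get exercised. A minimal sketch (function and parameter names are illustrative):

```python
def prior_auth_route(ivr_can_submit: bool, codes_ready: bool) -> str:
    """Decide whether the prior auth request completes in the IVR
    or must transfer to a live representative (step 3 above)."""
    if ivr_can_submit and codes_ready:
        return "ivr"             # enter procedure codes, diagnosis codes, dates of service
    return "representative"      # expect a hold, then a speech-to-speech leg
```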

DTMF vs speech recognition

Cover both input modes explicitly.

DTMF tests. Correct digit sequences, timeout when no key is pressed, invalid key presses (7 when options are 1 to 5), star for back, pound for confirm, rapid input before each prompt finishes.

Speech tests. Clear pronunciation of NPIs and dates, recognition failure recovery ("I didn't understand that"), ambiguity ("did you say fifteen or fifty"), background noise, accent variation, and letter-by-letter spelling for alphanumeric IDs.

In simulation, configure recognition accuracy as a test variable. Perfect recognition does not prepare your agent for production.
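
One way to make recognition accuracy a test variable is to corrupt the agent's utterances before the simulated IVR "hears" them. A deliberately crude sketch; a production harness would use phonetically plausible confusions (fifteen/fifty) rather than uniform digit flips:

```python
import random

def degrade_recognition(utterance: str, accuracy: float, rng: random.Random) -> str:
    """Simulate imperfect IVR speech recognition: each digit is replaced
    with a different digit with probability (1 - accuracy)."""
    out = []
    for ch in utterance:
        if ch.isdigit() and rng.random() > accuracy:
            out.append(rng.choice([d for d in "0123456789" if d != ch]))
        else:
            out.append(ch)
    return "".join(out)

# Run the same scenario at accuracy 1.0, 0.9, 0.7 and compare pass rates.
rng = random.Random(42)
noisy_npi = degrade_recognition("1234567893", accuracy=0.7, rng=rng)
```

Sweeping `accuracy` downward tells you where the agent's recovery logic ("I didn't understand that") stops compensating.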

"Conversational healthcare AI lives and dies on robustness under noisy, real phone conditions, not on lab benchmarks."

Emily Mower Provost, Professor of Computer Science, University of Michigan (see Michigan Computer Science)

Handling variability

Payers reorder options, add new ones, and consolidate submenus. Agents that depend on "press 3 for eligibility" will break. Train and test the agent to select by prompt content, not by a fixed digit.

Test with: reordered menus, new options inserted between existing ones, submenu moved, holiday or high-volume messages prepended to the greeting.
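
Selecting by prompt content instead of a fixed digit can be as simple as parsing the menu text. A sketch assuming the common "For X, press N" phrasing (real menus need more patterns, and speech-only menus need intent matching instead):

```python
import re

def select_option(menu_prompt: str, intent: str):
    """Pick the digit for `intent` from the prompt text, so a reordered
    menu still routes correctly. Returns None if the intent isn't offered."""
    for pattern in (
        rf"for {re.escape(intent)}[^.]*?press (\d)",   # "For eligibility, press 3"
        rf"press (\d)[^.]*?for {re.escape(intent)}",   # "press 3 for eligibility"
    ):
        m = re.search(pattern, menu_prompt, re.IGNORECASE)
        if m:
            return m.group(1)
    return None

prompt = ("Thank you for calling. For pharmacy, press 1. For medical claims, "
          "press 2. For eligibility, press 3. For prior authorization, press 4.")
```

The same parser passes the reordered-menu and inserted-option tests without changes, which is the point.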

Greeting variations

"Thank you for calling Aetna" becomes "Due to high call volume, wait times may be longer. Thank you for calling Aetna." The agent must handle preambles without getting lost.

Representative behavior

Simulate different styles: efficient reps who ask everything upfront, reps who ask one question at a time, reps who hold mid-conversation to check systems, reps who transfer to a different department, and reps who provide incorrect info your agent must catch.
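
These styles can live as persona configs that the representative simulator cycles through. Field names below are illustrative; adapt them to your simulator's schema:

```python
# Representative personas to cycle through in simulation.
REP_PERSONAS = [
    {"name": "efficient", "asks_upfront": True, "mid_call_holds": 0},
    {"name": "one_at_a_time", "asks_upfront": False, "mid_call_holds": 0},
    {"name": "checker", "asks_upfront": False, "mid_call_holds": 2},
    {"name": "transferrer", "transfers_to": "other_department"},
    {"name": "unreliable", "error_rate": 0.2},  # injects wrong info the agent must catch
]
```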

Edge cases

Hold handling

  • Short holds (30s to 2 min) with music
  • Long holds (15+ min) with periodic "please continue to hold"
  • Mid-conversation holds ("let me check on that")
  • Hold that ends in disconnect after 20 minutes
  • Hold that ends with transfer to a new rep who needs context repeated
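
The five hold cases above reduce to a small config table the scenario runner can iterate. Labels and field layout are illustrative:

```python
HOLD_SCENARIOS = [
    # (label, hold duration in seconds, what happens when the hold ends)
    ("short_hold", 90, "same_rep"),
    ("long_hold", 1200, "same_rep"),        # periodic "please continue to hold"
    ("mid_conversation", 60, "same_rep"),   # "let me check on that"
    ("hold_disconnect", 1200, "disconnect"),  # agent must redial with context
    ("hold_transfer", 600, "new_rep"),        # context must be repeated
]
```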

Disconnection and retry

Test drops during IVR navigation, during hold, and during the representative conversation, plus busy signals and "all circuits are busy" messages. The agent should redial and resume with context preserved.
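
The redial-and-resume behavior can be sketched as a retry wrapper around the call, where saved context (e.g. the last completed IVR leg) lets a retry resume instead of restart. Names are illustrative:

```python
import time

def call_with_retry(place_call, context, max_attempts=3, backoff_s=30):
    """Redial on drops or busy signals, resuming from saved context.
    `place_call(context)` returns (done, context); it should record
    progress in `context` so a retry resumes where the last call dropped."""
    for attempt in range(1, max_attempts + 1):
        done, context = place_call(context)
        if done:
            return context
        if attempt < max_attempts:
            time.sleep(backoff_s)  # in tests, pass backoff_s=0
    raise RuntimeError(f"call failed after {max_attempts} attempts")
```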

Wrong-department transfers

The rep transfers to the wrong queue. The new rep says "call back and select option 2." The agent must recognize this and redial or navigate back.

After-hours

Some IVRs announce business hours and hang up. Others run automated lookups 24/7 but staff representatives only during business hours. Some offer a callback instead of hold. Test all three patterns.

Metrics to track

  • Navigation accuracy. Percent of calls reaching the correct destination on the first attempt. Target above 95%.
  • Information extraction accuracy. Percent of data points captured correctly vs ground truth. Target above 98%.
  • Call completion rate. Percent achieving purpose without human fallback.
  • Average handle time. Compare to human baseline. Expect 30% faster on automated legs.
  • Retry rate. Redials due to drops, wrong transfers, or navigation errors. Lower is better, not zero.
  • Recognition recovery rate. When the IVR does not understand the agent, how often does the retry succeed.
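
Most of these metrics fall out of a simple aggregation over per-call results. A sketch, assuming each result is a dict with the (illustrative) field names shown:

```python
def score_run(results):
    """Aggregate per-call results into the core metrics above."""
    n = len(results)
    return {
        "navigation_accuracy": sum(r["navigated_first_try"] for r in results) / n,
        "extraction_accuracy": (
            sum(r["fields_correct"] for r in results)
            / sum(r["fields_total"] for r in results)
        ),
        "completion_rate": sum(r["completed"] for r in results) / n,
        "retry_rate": sum(r["redials"] for r in results) / n,  # redials per call
    }
```

Gate go-live on `navigation_accuracy > 0.95` and `extraction_accuracy > 0.98` per the targets above.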

Building the harness

A practical harness has four parts:

  • IVR tree simulator. Configurable state machine. Each node has a prompt, expected input type (DTMF or speech), valid inputs, and transitions. One tree per payer.
  • Representative simulator. Rule-based or LLM-driven conversation engine with access to simulated payer data so it can answer claim and benefit queries.
  • Scenario runner. Sets up the test, initiates the call between agent and simulator, captures results.
  • Evaluation engine. Compares extracted data to ground truth, reports accuracy, flags failure patterns.
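
The IVR tree simulator is the heart of the harness, and it really is just a state machine over nodes. A toy sketch with an abbreviated two-level tree (real trees come from transcribed calls; class and node names are illustrative):

```python
class IVRNode:
    def __init__(self, prompt, input_type="dtmf", transitions=None):
        self.prompt = prompt                   # what the simulator "says"
        self.input_type = input_type           # "dtmf" or "speech"
        self.transitions = transitions or {}   # valid input -> next node id

class IVRSimulator:
    """Configurable state machine: one tree per payer."""
    def __init__(self, nodes, start):
        self.nodes, self.state = nodes, start

    def prompt(self):
        return self.nodes[self.state].prompt

    def send(self, user_input):
        node = self.nodes[self.state]
        if user_input in node.transitions:
            self.state = node.transitions[user_input]
            return self.prompt()
        # Invalid input: stay in place and reprompt, like real payer IVRs.
        return "Sorry, that is not a valid option. " + node.prompt

tree = {
    "root": IVRNode("For medical claims, press 2. For eligibility, press 3.",
                    transitions={"2": "claims", "3": "eligibility"}),
    "claims": IVRNode("Please enter your ten-digit NPI."),
    "eligibility": IVRNode("Please enter the member ID."),
}
```

The scenario runner drives the agent against `IVRSimulator(tree, "root")` and hands the transcript to the evaluation engine.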

Start with the top two or three payers your agents call most often. Build faithful simulations. Write the scenarios above. Measure rigorously. Then expand.

Key Takeaways

  • Structure every IVR test around the four legs: greet, identify, act, and (if needed) transfer. Cover DTMF and speech paths separately.
  • The three must-have scenarios are claim status, benefits verification, and prior authorization. Every other scenario is a variation.
  • Simulate representative variation explicitly. Agents that only see the "good rep" persona in testing fail in production.
  • Treat menu changes, greeting preambles, and wrong-department transfers as first-class scenarios, not edge cases.
  • For the strategic framing and why this matters, see testing voice AI agents against healthcare IVR systems.

FAQ

What pass rate should I target before go-live?

Above 95% navigation accuracy, above 98% extraction accuracy, and no single payer below 80% pass rate in population-scale tests.

Do I need separate tests for DTMF and speech paths?

Yes. Payers mix modes and change configuration without warning. An agent that passes DTMF tests but fails speech tests will break the first time a payer flips a menu to voice-only.

How often should IVR tests run?

Happy-path and error-recovery tests on every code change. Full representative-conversation tests weekly. Population-scale tests across all payer configurations monthly. This mirrors the cadence most teams use for healthcare AI evaluation.

How do I get realistic payer IVR trees to simulate?

Record and transcribe real calls under PHI-safe conditions, then encode the trees. Resources like the HL7 International community and CAQH's CORE operating rules describe standards that inform expected IVR behavior for eligibility and claim status.
