
Voice Agent IVR Testing: A Practical How-To Guide [2026]

Step-by-step how-to for building IVR test scenarios for healthcare voice agents: DTMF, speech recognition, hold handling, and representative interactions.

Kevin Huang · Co-founder, Verial
·7 min read

TL;DR

  • A practical how-to for writing IVR test scenarios: structure tests around authentication, verification, action, and transfer legs. Cover DTMF and speech-recognition paths separately.
  • Over 90% of prior auth and eligibility requests still involve a phone call at some step (CAQH Index 2024), which is why payer IVR testing cannot be skipped.
  • Target above 95% navigation accuracy and above 98% data-extraction accuracy in simulated tests before going live.
  • For the strategic case and testing framework, see testing voice AI agents against healthcare IVR systems.

What healthcare IVRs look like

Every payer IVR has the same four legs: initial greeting and routing, identification and verification, automated lookup or live transfer, and the representative conversation.

A typical Aetna greeting:

"Thank you for calling Aetna. For pharmacy, press 1. For medical claims, press 2. For eligibility, press 3. For prior authorization, press 4."

Some systems also accept speech ("say claims, eligibility, or prior authorization"). Your agent needs to handle both, because payers mix modes and switch without warning.

After routing, the IVR asks for identification: provider NPI or tax ID, member ID, patient date of birth, claim number for status inquiries. Input methods differ: some want digit-by-digit DTMF, others accept full speech.

Per the American Medical Association, the average practice spends 12 hours per physician per week on prior auth alone, much of it on these calls. That scale is why voice automation is worth testing properly.

Scenario 1: Claim status inquiry

The most common IVR interaction.

  1. Call the payer line, navigate to medical claims
  2. Enter provider NPI
  3. Enter member ID and date of birth
  4. Enter claim number or date of service
  5. Listen to status (paid, denied, in process, pending)
  6. Capture payment amount, denial reason code, expected processing date

Test variations to cover:

  • Claim found and paid with EOB details
  • Claim denied with specific reason code (CO-4, PR-1)
  • Claim in process with estimated completion date
  • Claim not found
  • Multiple matching claims where the IVR lists them
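
The steps and variations above can be encoded as reusable scenario fixtures. A minimal sketch, assuming a simple dataclass-based harness (`IVRStep`, `ClaimStatusScenario`, and all field values are illustrative, not from any specific framework or real payer data):

```python
from dataclasses import dataclass, field

@dataclass
class IVRStep:
    """One leg of the IVR interaction: what the agent sends, what it expects back."""
    action: str          # e.g. "navigate", "enter_npi", "listen"
    payload: str = ""    # digits or speech to send
    expect: str = ""     # substring the IVR prompt should contain

@dataclass
class ClaimStatusScenario:
    name: str
    steps: list[IVRStep] = field(default_factory=list)
    expected_status: str = ""                            # "paid", "denied", "in process", ...
    expected_fields: dict = field(default_factory=dict)  # ground truth the agent must extract

# One fixture per variation: found-and-paid, denied, in-process, not-found, multiple matches.
denied = ClaimStatusScenario(
    name="claim_denied_CO-4",
    steps=[
        IVRStep("navigate", payload="2", expect="medical claims"),
        IVRStep("enter_npi", payload="1234567893"),
        IVRStep("enter_member_id", payload="W123456789"),
        IVRStep("enter_claim_number", payload="20260101123"),
        IVRStep("listen"),
    ],
    expected_status="denied",
    expected_fields={"denial_reason_code": "CO-4"},
)
```

Keeping scenarios as data rather than code makes the "claim not found" and "multiple matches" variations a copy-and-edit exercise.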

Scenario 2: Benefits verification

  1. Navigate to eligibility
  2. Provide member identification
  3. Request a specific benefit category (outpatient imaging, DME)
  4. Capture deductible status, copay, coinsurance, out-of-pocket max, limitations

Test variations: active coverage, terminated coverage, coverage with prior auth requirements, in-network vs out-of-network tiers.
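
These variations parametrize cleanly, so every coverage state runs through the same four steps. A sketch with illustrative fixture values (not real payer data; `extract` stands in for whatever callable your agent-under-test exposes):

```python
# Each entry: (coverage state, ground-truth fields the agent must extract).
BENEFIT_VARIATIONS = [
    ("active", {"deductible_remaining": 500.00, "copay": 40.00}),
    ("terminated", {"termination_date": "2025-12-31"}),
    ("prior_auth_required", {"auth_required": True}),
    ("out_of_network", {"coinsurance": 0.40}),
]

def run_benefit_case(state, expected, extract):
    """Run one variation; `extract` is the agent-under-test's extraction callable.
    Returns the fields that were missing or wrong (empty dict = pass)."""
    got = extract(state)
    return {k: v for k, v in expected.items() if got.get(k) != v}
```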

Scenario 3: Prior authorization

The most complex scenario because it often requires a live representative.

  1. Navigate to prior authorization
  2. Enter identification
  3. Determine whether the IVR can complete the request or must transfer
  4. If IVR: enter procedure codes, diagnosis codes, dates of service
  5. If representative: wait on hold, then interact

Representative interaction adds speech-to-speech complexity. For the failure catalog behind automated prior auth, see prior auth agent failure modes.
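
The branch at step 3 is worth making explicit in the scenario runner so both paths get exercised. A minimal sketch (function and parameter names are illustrative):

```python
def prior_auth_route(ivr_can_submit: bool, codes_ready: bool) -> str:
    """Decide whether the prior auth request completes in the IVR
    or must transfer to a live representative (step 3 above)."""
    if ivr_can_submit and codes_ready:
        return "ivr"             # enter procedure codes, diagnosis codes, dates of service
    return "representative"      # expect a hold, then a speech-to-speech leg
```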

DTMF vs speech recognition

Cover both input modes explicitly.

DTMF tests. Correct digit sequences, timeout when no key is pressed, invalid key presses (7 when options are 1 to 5), star for back, pound for confirm, rapid input before each prompt finishes.

Speech tests. Clear pronunciation of NPIs and dates, recognition failure recovery ("I didn't understand that"), ambiguity ("did you say fifteen or fifty"), background noise, accent variation, and letter-by-letter spelling for alphanumeric IDs.

In simulation, configure recognition accuracy as a test variable. Perfect recognition does not prepare your agent for production.
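
One way to make recognition accuracy a test variable is to corrupt the agent's utterances before the simulated IVR "hears" them. A deliberately crude sketch; a production harness would use phonetically plausible confusions (fifteen/fifty) rather than uniform digit flips:

```python
import random

def degrade_recognition(utterance: str, accuracy: float, rng: random.Random) -> str:
    """Simulate imperfect IVR speech recognition: each digit is replaced
    with a different digit with probability (1 - accuracy)."""
    out = []
    for ch in utterance:
        if ch.isdigit() and rng.random() > accuracy:
            out.append(rng.choice([d for d in "0123456789" if d != ch]))
        else:
            out.append(ch)
    return "".join(out)

# Run the same scenario at accuracy 1.0, 0.9, 0.7 and compare pass rates.
rng = random.Random(42)
noisy_npi = degrade_recognition("1234567893", accuracy=0.7, rng=rng)
```

Sweeping `accuracy` downward tells you where the agent's recovery logic ("I didn't understand that") stops compensating.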

"Conversational healthcare AI lives and dies on robustness under noisy, real phone conditions, not on lab benchmarks."

Emily Mower Provost, Professor of Computer Science, University of Michigan (see Michigan Computer Science)

Handling variability

Payers reorder options, add new ones, and consolidate submenus. Agents that depend on "press 3 for eligibility" will break. Train and test the agent to select by prompt content, not by a fixed digit.

Test with: reordered menus, new options inserted between existing ones, submenu moved, holiday or high-volume messages prepended to the greeting.
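
Selecting by prompt content instead of a fixed digit can be as simple as parsing the menu text. A sketch assuming the common "For X, press N" phrasing (real menus need more patterns, and speech-only menus need intent matching instead):

```python
import re

def select_option(menu_prompt: str, intent: str):
    """Pick the digit for `intent` from the prompt text, so a reordered
    menu still routes correctly. Returns None if the intent isn't offered."""
    for pattern in (
        rf"for {re.escape(intent)}[^.]*?press (\d)",   # "For eligibility, press 3"
        rf"press (\d)[^.]*?for {re.escape(intent)}",   # "press 3 for eligibility"
    ):
        m = re.search(pattern, menu_prompt, re.IGNORECASE)
        if m:
            return m.group(1)
    return None

prompt = ("Thank you for calling. For pharmacy, press 1. For medical claims, "
          "press 2. For eligibility, press 3. For prior authorization, press 4.")
```

The same parser passes the reordered-menu and inserted-option tests without changes, which is the point.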

Greeting variations

"Thank you for calling Aetna" becomes "Due to high call volume, wait times may be longer. Thank you for calling Aetna." The agent must handle preambles without getting lost.

Representative behavior

Simulate different styles: efficient reps who ask everything upfront, reps who ask one question at a time, reps who hold mid-conversation to check systems, reps who transfer to a different department, and reps who provide incorrect info your agent must catch.
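
These styles can live as persona configs that the representative simulator cycles through. Field names below are illustrative; adapt them to your simulator's schema:

```python
# Representative personas to cycle through in simulation.
REP_PERSONAS = [
    {"name": "efficient", "asks_upfront": True, "mid_call_holds": 0},
    {"name": "one_at_a_time", "asks_upfront": False, "mid_call_holds": 0},
    {"name": "checker", "asks_upfront": False, "mid_call_holds": 2},
    {"name": "transferrer", "transfers_to": "other_department"},
    {"name": "unreliable", "error_rate": 0.2},  # injects wrong info the agent must catch
]
```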

Edge cases

Hold handling

  • Short holds (30s to 2 min) with music
  • Long holds (15+ min) with periodic "please continue to hold"
  • Mid-conversation holds ("let me check on that")
  • Hold that ends in disconnect after 20 minutes
  • Hold that ends with transfer to a new rep who needs context repeated
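
The five hold cases above reduce to a small config table the scenario runner can iterate. Labels and field layout are illustrative:

```python
HOLD_SCENARIOS = [
    # (label, hold duration in seconds, what happens when the hold ends)
    ("short_hold", 90, "same_rep"),
    ("long_hold", 1200, "same_rep"),        # periodic "please continue to hold"
    ("mid_conversation", 60, "same_rep"),   # "let me check on that"
    ("hold_disconnect", 1200, "disconnect"),  # agent must redial with context
    ("hold_transfer", 600, "new_rep"),        # context must be repeated
]
```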

Disconnection and retry

Test drops during IVR navigation, during hold, and during the representative conversation, plus busy signals and "all circuits are busy" messages. The agent should redial and resume with context preserved.
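
The redial-and-resume behavior can be sketched as a retry wrapper around the call, where saved context (e.g. the last completed IVR leg) lets a retry resume instead of restart. Names are illustrative:

```python
import time

def call_with_retry(place_call, context, max_attempts=3, backoff_s=30):
    """Redial on drops or busy signals, resuming from saved context.
    `place_call(context)` returns (done, context); it should record
    progress in `context` so a retry resumes where the last call dropped."""
    for attempt in range(1, max_attempts + 1):
        done, context = place_call(context)
        if done:
            return context
        if attempt < max_attempts:
            time.sleep(backoff_s)  # in tests, pass backoff_s=0
    raise RuntimeError(f"call failed after {max_attempts} attempts")
```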

Wrong-department transfers

The rep transfers to the wrong queue. The new rep says "call back and select option 2." The agent must recognize this and redial or navigate back.

After-hours

Some IVRs announce business hours and hang up. Others run automated lookups 24/7 but staff representatives only during business hours. Some offer a callback instead of hold. Test all three patterns.

Metrics to track

  • Navigation accuracy. Percent of calls reaching the correct destination on the first attempt. Target above 95%.
  • Information extraction accuracy. Percent of data points captured correctly vs ground truth. Target above 98%.
  • Call completion rate. Percent achieving purpose without human fallback.
  • Average handle time. Compare to human baseline. Expect 30% faster on automated legs.
  • Retry rate. Redials due to drops, wrong transfers, or navigation errors. Lower is better, not zero.
  • Recognition recovery rate. When the IVR does not understand the agent, how often does the retry succeed.
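
Most of these metrics fall out of a simple aggregation over per-call results. A sketch, assuming each result is a dict with the (illustrative) field names shown:

```python
def score_run(results):
    """Aggregate per-call results into the core metrics above."""
    n = len(results)
    return {
        "navigation_accuracy": sum(r["navigated_first_try"] for r in results) / n,
        "extraction_accuracy": (
            sum(r["fields_correct"] for r in results)
            / sum(r["fields_total"] for r in results)
        ),
        "completion_rate": sum(r["completed"] for r in results) / n,
        "retry_rate": sum(r["redials"] for r in results) / n,  # redials per call
    }
```

Gate go-live on `navigation_accuracy > 0.95` and `extraction_accuracy > 0.98` per the targets above.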

Building the harness

A practical harness has four parts:

  • IVR tree simulator. Configurable state machine. Each node has a prompt, expected input type (DTMF or speech), valid inputs, and transitions. One tree per payer.
  • Representative simulator. Rule-based or LLM-driven conversation engine with access to simulated payer data so it can answer claim and benefit queries.
  • Scenario runner. Sets up the test, initiates the call between agent and simulator, captures results.
  • Evaluation engine. Compares extracted data to ground truth, reports accuracy, flags failure patterns.
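
The IVR tree simulator is the heart of the harness, and it really is just a state machine over nodes. A toy sketch with an abbreviated two-level tree (real trees come from transcribed calls; class and node names are illustrative):

```python
class IVRNode:
    def __init__(self, prompt, input_type="dtmf", transitions=None):
        self.prompt = prompt                   # what the simulator "says"
        self.input_type = input_type           # "dtmf" or "speech"
        self.transitions = transitions or {}   # valid input -> next node id

class IVRSimulator:
    """Configurable state machine: one tree per payer."""
    def __init__(self, nodes, start):
        self.nodes, self.state = nodes, start

    def prompt(self):
        return self.nodes[self.state].prompt

    def send(self, user_input):
        node = self.nodes[self.state]
        if user_input in node.transitions:
            self.state = node.transitions[user_input]
            return self.prompt()
        # Invalid input: stay in place and reprompt, like real payer IVRs.
        return "Sorry, that is not a valid option. " + node.prompt

tree = {
    "root": IVRNode("For medical claims, press 2. For eligibility, press 3.",
                    transitions={"2": "claims", "3": "eligibility"}),
    "claims": IVRNode("Please enter your ten-digit NPI."),
    "eligibility": IVRNode("Please enter the member ID."),
}
```

The scenario runner drives the agent against `IVRSimulator(tree, "root")` and hands the transcript to the evaluation engine.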

Start with the top two or three payers your agents call most often. Build faithful simulations. Write the scenarios above. Measure rigorously. Then expand.

Key Takeaways

  • Structure every IVR test around the four legs: greet, identify, act, and (if needed) transfer. Cover DTMF and speech paths separately.
  • The three must-have scenarios are claim status, benefits verification, and prior authorization. Every other scenario is a variation.
  • Simulate representative variation explicitly. Agents that only see the "good rep" persona in testing fail in production.
  • Treat menu changes, greeting preambles, and wrong-department transfers as first-class scenarios, not edge cases.
  • For the strategic framing and why this matters, see testing voice AI agents against healthcare IVR systems.

FAQ

What pass rate should I target before go-live?

Above 95% navigation accuracy, above 98% extraction accuracy, and no single payer below 80% pass rate in population-scale tests.

Do I need separate tests for DTMF and speech paths?

Yes. Payers mix modes and change configuration without warning. An agent that passes DTMF tests but fails speech tests will break the first time a payer flips a menu to voice-only.

How often should IVR tests run?

Happy-path and error-recovery tests on every code change. Full representative-conversation tests weekly. Population-scale tests across all payer configurations monthly. This mirrors the cadence most teams use for healthcare AI evaluation.

How do I get realistic payer IVR trees to simulate?

Record and transcribe real calls under PHI-safe conditions, then encode the trees. Resources like the HL7 International community and CAQH's CORE operating rules describe standards that inform expected IVR behavior for eligibility and claim status.
