
6 Ways Prior Auth AI Agents Fail in Production [2026]

Prior auth agents fail at portal login, form mapping, document upload, status polling, denial parsing, and payer quirks. A failure-mode catalog with test patterns.

Kevin Huang · Co-founder, Verial · 7 min read

TL;DR

  • Prior auth AI agents fail in six predictable ways: portal login drift, form field mismapping, document upload errors, status-polling mistakes, denial misparsing, and payer-specific quirks. Most teams discover these in production, not testing.
  • Physicians and staff process an average of 39 prior auth requests per physician per week, spending 12 hours on them (AMA 2024 Prior Authorization Survey).
  • Prior auth admin costs the US healthcare system roughly $35 billion per year according to CAQH Index 2024, which is why VC-funded agents keep launching, and breaking.
  • This is Part 1 of a two-part series. For how to build test scenarios against these failures, see testing prior auth workflows with simulated payer portals.

The prior auth automation wave

Prior authorization is the most painful workflow in healthcare administration. A 2024 AMA survey found 94% of physicians report that prior auth causes care delays, and 24% say it has led to a serious adverse event for a patient.

That pain has attracted a wave of AI companies: Cohere Health, Humata Health, Waystar, and dozens of smaller startups. The opportunity is real. The failure modes are real too, and most teams find them in production rather than in testing.

"Prior authorization continues to devastate patients, physicians, and the health care system. The practice diverts limited health care resources away from delivering medically necessary care."

Jesse M. Ehrenfeld, MD, MPH, past President, American Medical Association

This article catalogs the six most common failure modes and describes how to build test scenarios that catch each one before it hits a patient.

Failure mode 1: Portal login and session management

Payer portals are built for humans, not agents, and they change often. Common failures:

  • Credential rotation. Health systems rotate portal credentials monthly or quarterly. If the agent does not detect the resulting login failures cleanly, it silently stops processing.
  • New MFA or CAPTCHA. A payer adds SMS MFA or a CAPTCHA. The agent cannot proceed and enters a retry loop.
  • Concurrent sessions. Staff logs in manually while the agent is logged in. One session gets killed. The agent keeps submitting invalid requests.
  • Session timeout mid-submission. The portal times out before a long submission completes; the agent must re-authenticate and resume without losing form state.

How to test. Build simulated portal scenarios for each: invalid credentials, unexpected MFA, CAPTCHA, concurrent session termination, and 10-minute timeout mid-flow. Verify the agent detects, reports, and does not retry indefinitely.
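
As a concrete example, here is a minimal sketch of what such a scenario suite could look like. SimulatedPortal, PriorAuthAgent, and LoginBlocked are hypothetical names standing in for your own harness and agent, not a real API:

```python
# A minimal sketch: SimulatedPortal, PriorAuthAgent, and LoginBlocked are
# hypothetical stand-ins for your own test harness and agent under test.
import pytest

from simportal import SimulatedPortal           # hypothetical test harness
from agent import PriorAuthAgent, LoginBlocked  # hypothetical agent under test

SAMPLE_REQUEST = {"cpt": "70553", "icd10": "G43.909"}  # MRI brain, migraine

@pytest.mark.parametrize("scenario", [
    "invalid_credentials",      # credentials rotated out from under the agent
    "unexpected_sms_mfa",       # payer added MFA overnight
    "captcha_challenge",        # payer added a CAPTCHA
    "concurrent_session_kill",  # staff login terminated the agent's session
    "timeout_mid_submission",   # 10-minute session timeout mid-flow
])
def test_agent_surfaces_login_failures(scenario):
    portal = SimulatedPortal(failure=scenario)
    agent = PriorAuthAgent(portal_url=portal.url, max_login_attempts=3)

    # The agent must raise a typed, reportable error instead of looping.
    with pytest.raises(LoginBlocked) as excinfo:
        agent.submit(SAMPLE_REQUEST)

    assert excinfo.value.scenario == scenario
    assert portal.login_attempts <= 3  # bounded retries, never an infinite loop
```

The two assertions at the end are the point: the failure is typed and attributable, and the retry count is bounded.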

Failure mode 2: Form field mapping

The same conceptual field (diagnosis code, NPI, urgency) appears in different labels, formats, and widgets across portals.

  • Dynamic fields that only appear after a dropdown selection. The agent fills visible fields and submits, missing required ones.
  • Typeahead fields that expect a description, not the ICD-10 code the agent typed.
  • Ambiguous radio labels. "Urgent" vs "Expedited" vs "Standard" vs "Routine" changes the payer's review clock.
  • Date format variations. MM/DD/YYYY vs YYYY-MM-DD vs date-picker widgets that reject typed input.
  • Free-text clinical rationale. Too short gets denied for insufficient docs. Too long and reviewers skim.

How to test. Scenarios with dynamic fields, typeahead-by-description, ambiguous urgency labels, each common date format, and rationale text that must hit required clinical elements.
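
To make the mapping problem concrete, here is a sketch of explicit per-portal normalization tables. Portal IDs and labels are invented for illustration; the design point is that urgency labels and date formats are looked up, never guessed:

```python
# A sketch of per-portal field normalization; "payer_a"/"payer_b" are
# illustrative IDs, not real payer configurations.
from datetime import date

# The same urgency concept carries different labels (and review clocks)
# per portal, so map it explicitly.
URGENCY_LABELS = {
    "payer_a": {"urgent": "Expedited", "standard": "Routine"},
    "payer_b": {"urgent": "Urgent",    "standard": "Standard"},
}

DATE_FORMATS = {
    "payer_a": "%m/%d/%Y",   # MM/DD/YYYY
    "payer_b": "%Y-%m-%d",   # YYYY-MM-DD
}

def render_field(portal: str, field: str, value):
    """Render a canonical field value in the target portal's expected form."""
    if field == "urgency":
        return URGENCY_LABELS[portal][value]  # KeyError means an untested portal
    if field == "service_date" and isinstance(value, date):
        return value.strftime(DATE_FORMATS[portal])
    return value

assert render_field("payer_a", "urgency", "urgent") == "Expedited"
assert render_field("payer_b", "service_date", date(2026, 3, 1)) == "2026-03-01"
```

A lookup that raises on an unknown portal or label fails loudly in testing; string-similarity guessing fails quietly in production.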

Failure mode 3: Document assembly and upload

Most requests need supporting docs: office notes, labs, imaging, letters of medical necessity. The agent must pull from the EHR via FHIR, assemble, and upload.

  • Missing EHR docs. Unsigned notes or docs in systems the agent cannot reach.
  • Format mismatches. Portal wants PDF; EHR returns a base64 CDA. Any conversion step can corrupt output.
  • Size limits. The portal caps attachments at 10 MB; the imaging package is 40 MB. Some portals do not support multi-attachment uploads.
  • Upload timeouts. Agent does not know if the upload succeeded. Retry creates duplicates.
  • Wrong document attached. Most recent note is not always the clinically relevant one.

How to test. Scenarios where required docs are missing, CDA needs PDF conversion, uploads exceed limits, uploads time out, and multiple candidate docs require clinical selection logic.
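
Two of these failures, the size cap and duplicate uploads on retry, lend themselves to small defensive guards. A sketch, assuming a hypothetical portal client whose upload call accepts an idempotency key (the converter is injected because real CDA-to-PDF conversion is EHR-specific):

```python
# A sketch of defensive attachment handling; the portal client and the
# idempotency_key parameter are assumptions, not a real portal API.
import base64
import hashlib

MAX_ATTACHMENT_BYTES = 10 * 1024 * 1024  # the 10 MB cap from the bullet above

class AttachmentTooLarge(Exception):
    pass

def prepare_attachment(cda_b64: str, convert_cda_to_pdf) -> bytes:
    """Decode a base64 CDA, convert it, and enforce the size cap up front."""
    pdf_bytes = convert_cda_to_pdf(base64.b64decode(cda_b64))
    if len(pdf_bytes) > MAX_ATTACHMENT_BYTES:
        raise AttachmentTooLarge(f"{len(pdf_bytes)} bytes")  # split or compress upstream
    return pdf_bytes

def upload_once(portal, request_id: str, pdf_bytes: bytes):
    # Idempotency key: if the upload times out and is retried, the portal
    # (real or simulated) can dedupe instead of storing a duplicate document.
    key = hashlib.sha256(request_id.encode() + pdf_bytes).hexdigest()
    return portal.upload(request_id, pdf_bytes, idempotency_key=key)
```

If the real portal offers no idempotency mechanism, the same key still works agent-side: record it before the upload and skip the retry if a status check shows the document already attached.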

Failure mode 4: Status polling and tracking

After submission, requests sit in review for hours to weeks.

  • Polling too fast. Payer detects automation and temporarily blocks the account, freezing every pending auth.
  • Polling too slow. Payer requests info with a 24-hour deadline. Agent checks every 4 hours and misses it. Auto-denied.
  • Status parsing errors. "Pended, Additional Info Required" gets bucketed as "Pending." The agent never responds.
  • Partial approvals. Payer approves 6 of 12 PT visits. Agent reads "Approved" and the provider schedules all 12.
  • Race conditions. Agent checks status before the payer system has processed the submission. Logs a false error.

How to test. Payer that rate-limits polling, 24-hour additional-info deadline, non-standard status labels, partial approvals, and 30-second status-update delays.
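
Here is a sketch of polling logic that addresses three of these at once: an explicit status vocabulary (unknown labels escalate instead of defaulting to "Pending"), a base interval fast enough for 24-hour deadlines, and jitter to stay under payer rate limits. The portal client and status labels are illustrative:

```python
# A sketch of defensive status polling; portal.get_status and the label set
# are assumptions standing in for a real or simulated payer portal.
import random
import time

class UnknownStatus(Exception):
    pass

# Explicit vocabulary: anything not listed here is escalated to a human,
# never silently coerced into "pending".
STATUS_MAP = {
    "approved": "approved",
    "denied": "denied",
    "pending": "pending",
    "pended, additional info required": "action_required",  # deadline clock starts
}

def poll_status(portal, request_id: str, base_interval_s: float = 900.0):
    """Poll every ~15 minutes with jitter until a terminal status appears."""
    while True:
        raw = portal.get_status(request_id).strip().lower()
        status = STATUS_MAP.get(raw)
        if status is None:
            raise UnknownStatus(raw)  # surface the label, don't guess
        if status != "pending":
            return status
        # Jittered sleep: frequent enough for 24-hour info deadlines, irregular
        # enough to avoid looking like a fixed-interval bot to rate limiters.
        time.sleep(base_interval_s + random.uniform(0, base_interval_s * 0.2))
```

The same whitelist discipline applies to partial approvals: compare units approved against units requested rather than trusting the top-level "Approved" label.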

Failure mode 5: Denial reason parsing and appeals

  • Unstructured denial text. "Does not meet medical necessity per policy 2024-IM-003." The agent needs to look up the policy, understand which criteria failed, and gather targeted evidence. Most agents file boilerplate appeals instead.
  • CARC and RARC code misinterpretation. Agent logs "unknown denial reason" instead of mapping the code.
  • Appeal deadlines. Payer windows range from 10 business days to 180 days. Missing it means the denial stands.
  • Appeal-level confusion. Wrong level (first-level vs second-level vs external review) wastes time and may forfeit escalation rights.

How to test. Policy-specific denials, uncommon CARC/RARC codes, 10-business-day deadlines, and first-to-second-level escalation flows.
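
Deadline math is the most mechanical of these and the easiest to get wrong, because some windows count business days and some count calendar days. A self-contained sketch (the two CARC entries are real codes used as examples; the map is deliberately incomplete, and the business-day helper ignores holidays for brevity):

```python
# A sketch of denial-code mapping and appeal-deadline tracking; windows and
# the two-entry CARC map are examples, not a complete implementation.
from datetime import date, timedelta

CARC_MAP = {
    "50": "non_covered_not_medically_necessary",
    "197": "precert_authorization_absent",
}

def add_business_days(start: date, days: int) -> date:
    d = start
    while days > 0:
        d += timedelta(days=1)
        if d.weekday() < 5:  # Mon-Fri only; holidays ignored for brevity
            days -= 1
    return d

def appeal_deadline(denial_date: date, window_days: int, business: bool) -> date:
    return (add_business_days(denial_date, window_days) if business
            else denial_date + timedelta(days=window_days))

# A 10-business-day window is far tighter than the calendar math suggests:
assert appeal_deadline(date(2026, 1, 2), 10, business=True) == date(2026, 1, 16)
assert appeal_deadline(date(2026, 1, 2), 180, business=False) == date(2026, 7, 1)
```

A denial code absent from the map should page a human, for the same reason an unknown status label should: guessing is how boilerplate appeals get filed.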

Failure mode 6: Payer-specific quirks

These are undocumented and change without notice.

  • UnitedHealthcare requires a separate clinical questionnaire for certain imaging. Submitting the standard form alone gets auto-denied.
  • Aetna uses different portal URLs for commercial vs Medicare Advantage. Wrong portal returns "member not found."
  • Cigna periodically reorganizes menus. Hardcoded nav paths break silently.
  • Humana requires specific CPT modifiers other payers do not.
  • Blue Cross Blue Shield varies by state affiliate. An agent configured for BCBS Illinois will fail on BCBS Texas.

How to test. Per-payer suites that exercise their specific requirements, plus regression tests that detect portal structure drift.
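
One way to keep quirks testable is to encode them as data rather than scattering them through agent code. A sketch of a per-payer profile registry that drives scenario generation; the payer entries mirror the examples above, and the URLs are placeholders:

```python
# A sketch of a quirk registry driving test generation; entries and URLs are
# illustrative, mirroring the examples in the bullets above.
from dataclasses import dataclass, field

@dataclass
class PayerProfile:
    name: str
    portal_urls: dict                                        # line of business -> URL
    extra_forms: list = field(default_factory=list)
    required_modifiers: dict = field(default_factory=dict)   # CPT -> modifiers

PAYERS = [
    PayerProfile(
        name="uhc",
        portal_urls={"commercial": "https://uhc-portal.example.test"},
        extra_forms=["imaging_clinical_questionnaire"],  # skipping it = auto-denial
    ),
    PayerProfile(
        name="aetna",
        portal_urls={
            "commercial": "https://aetna-commercial.example.test",
            "medicare_advantage": "https://aetna-ma.example.test",  # distinct portal
        },
    ),
]

def generate_quirk_scenarios(payers):
    """Yield one test scenario per quirk, so coverage grows with the registry."""
    for p in payers:
        for lob in p.portal_urls:
            yield (p.name, "routes_to_correct_portal", lob)
        for form in p.extra_forms:
            yield (p.name, "submits_extra_form", form)
```

Every newly discovered quirk becomes one registry entry and, automatically, one or more permanent test scenarios.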

Key Takeaways

  • Prior auth agents fail more often on operational edge cases than on clinical reasoning, which is why clinical benchmarks miss these bugs.
  • The six failure categories (login, forms, documents, status, denials, payer quirks) cover most real-world breakage.
  • Build per-failure-mode test scenarios and run them continuously, not only on code change. Portals change without your code changing.
  • Turn every production failure into a permanent test scenario. Over time your suite becomes a map of the failure landscape.
  • Pair this failure catalog with Part 2, testing prior auth workflows with simulated payer portals, to turn diagnosis into working tests.

FAQ

How much does prior auth cost the US healthcare system?

About $35 billion per year in administrative spend according to the CAQH Index 2024. Fully automating prior auth could save an estimated $494 million annually.

What is the CMS-0057-F rule and does it fix these failures?

CMS-0057-F mandates FHIR-based prior auth APIs for most payers by 2027. It will reduce some portal scraping, but manual portals and IVRs will remain for years, so agents still need operational testing.

Why do clinical benchmarks like HealthBench miss these bugs?

HealthBench and MedAgentBench test clinical reasoning. Prior auth failures are operational: portal drift, CAPTCHA, document encoding, rate limits. Operational correctness is a separate evaluation axis.

How many prior auth requests does the average practice handle?

Physicians average 39 requests per week and spend about 12 hours on them (AMA 2024 survey). That volume is why small error rates compound quickly.
