| Rank | Model | Score | Avg. Steps |
|---|---|---|---|
Click the links below to view live, hosted versions of all GUI environments. You can also self-host them by following the instructions in the GitHub repo.
HealthAdminBench contains 135 tasks sourced from three core healthcare administrative workflows (prior authorization, durable medical equipment (DME) orders, and appeals) across three difficulty levels.
A reproducible evaluation framework for measuring AI agent capability on healthcare workflows.
The AI agent is given a healthcare task mirroring real administrative workflows and observes a live web portal via an accessibility tree, screenshots, or both.
The agent navigates forms, reviews clinical documentation, checks coverage criteria, and submits decisions, all through standard browser interactions.
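The observe-act cycle above can be sketched as a simple episode loop. This is a minimal illustration only; the `env`/`agent` interface, method names, and step budget are assumptions for exposition, not the benchmark's actual API:

```python
def run_episode(env, agent, max_steps=30):
    """Run one task episode: observe the portal, act, repeat until done.

    `env.reset()` is assumed to return the initial observation (an
    accessibility tree and/or screenshot); `env.step(action)` applies a
    browser interaction (click, type, submit) and returns the next
    observation plus a done flag. Returns the number of steps used,
    which is what a column like "Avg. Steps" would aggregate.
    """
    obs = env.reset()
    for step in range(max_steps):
        action = agent.act(obs)      # e.g. a click, keystroke, or form submit
        obs, done = env.step(action)
        if done:
            return step + 1
    return max_steps                 # budget exhausted without finishing
```

The loop deliberately caps steps so a stuck agent terminates; real harnesses typically add per-step logging and timeouts on top of this skeleton.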
Each task is scored against multiple evaluation criteria (from exact state checks to LLM-judged clinical accuracy), with statistical reproducibility across runs.
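Multi-criteria scoring of this kind can be pictured as a weighted combination of per-criterion checks on the task's final state. The criterion names, weights, and state keys below are illustrative assumptions, not the benchmark's actual rubric; an LLM-judged criterion would simply be another `check` callable:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    name: str
    check: Callable[[dict], float]  # maps final state to a score in [0, 1]
    weight: float

def score_task(final_state: dict, criteria: list[Criterion]) -> float:
    """Weighted average over all evaluation criteria for one task."""
    total = sum(c.weight for c in criteria)
    return sum(c.weight * c.check(final_state) for c in criteria) / total

# Hypothetical criteria: an exact state check plus a correctness check.
criteria = [
    Criterion("form_submitted",
              lambda s: float(s.get("submitted", False)), weight=1.0),
    Criterion("decision_correct",
              lambda s: float(s.get("decision") == s.get("expected_decision")),
              weight=2.0),
]

state = {"submitted": True, "decision": "approve", "expected_decision": "approve"}
print(score_task(state, criteria))  # 1.0
```

Averaging scores over repeated runs of the same task is then what gives the reported numbers their statistical reproducibility.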
If you find this work helpful, please cite it as:
```bibtex
@article{healthadminbench,
  title={HealthAdminBench: A Benchmark for Evaluating LLMs on Solving Administrative Healthcare Tasks},
  author={Suhana Bedi and Ryan Welch and Ethan Steinberg and Michael Wornow and Taeil Matthew Kim and Haroun Ahmed and Peter Sterling and Bravim Purohit and Qurat Akram and Angelic Acosta and Esther Nubla and Pritika Sharma and Mike Pfeffer and Sanmi Koyejo and Nigam Shah},
  year={2026}
}
```