| Rank | Model | Score | Avg. Steps |
|---|---|---|---|
Click the links below to view live, hosted versions of all GUI environments. You can also self-host them by following the instructions in the GitHub repo.
HealthAdminBench contains 135 tasks sourced from three core healthcare administrative workflows (prior authorization, durable medical equipment (DME) orders, and appeals) across three difficulty levels.
A reproducible evaluation framework for measuring AI agent capability on healthcare workflows.
The AI agent is given a healthcare task mirroring real administrative workflows and observes a live web portal via an accessibility tree, screenshots, or both.
The agent navigates forms, reviews clinical documentation, checks coverage criteria, and submits decisions, all through standard browser interactions.
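The observe-act cycle above can be sketched as a simple episode loop. This is a minimal illustration only; the `env`/`agent` interface, method names, and step budget are assumptions for exposition, not the benchmark's actual API:

```python
def run_episode(env, agent, max_steps=30):
    """Run one task episode: observe the portal, act, repeat until done.

    `env.reset()` is assumed to return the initial observation (an
    accessibility tree and/or screenshot); `env.step(action)` applies a
    browser interaction (click, type, submit) and returns the next
    observation plus a done flag. Returns the number of steps used,
    which is what a column like "Avg. Steps" would aggregate.
    """
    obs = env.reset()
    for step in range(max_steps):
        action = agent.act(obs)      # e.g. a click, keystroke, or form submit
        obs, done = env.step(action)
        if done:
            return step + 1
    return max_steps                 # budget exhausted without finishing
```

The loop deliberately caps steps so a stuck agent terminates; real harnesses typically add per-step logging and timeouts on top of this skeleton.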
Each task is scored against multiple evaluation criteria (from exact state checks to LLM-judged clinical accuracy), with statistical reproducibility across runs.
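Multi-criteria scoring of this kind can be pictured as a weighted combination of per-criterion checks on the task's final state. The criterion names, weights, and state keys below are illustrative assumptions, not the benchmark's actual rubric; an LLM-judged criterion would simply be another `check` callable:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    name: str
    check: Callable[[dict], float]  # maps final state to a score in [0, 1]
    weight: float

def score_task(final_state: dict, criteria: list[Criterion]) -> float:
    """Weighted average over all evaluation criteria for one task."""
    total = sum(c.weight for c in criteria)
    return sum(c.weight * c.check(final_state) for c in criteria) / total

# Hypothetical criteria: an exact state check plus a correctness check.
criteria = [
    Criterion("form_submitted",
              lambda s: float(s.get("submitted", False)), weight=1.0),
    Criterion("decision_correct",
              lambda s: float(s.get("decision") == s.get("expected_decision")),
              weight=2.0),
]

state = {"submitted": True, "decision": "approve", "expected_decision": "approve"}
print(score_task(state, criteria))  # 1.0
```

Averaging scores over repeated runs of the same task is then what gives the reported numbers their statistical reproducibility.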
If you find this work helpful, please cite it as:
```bibtex
@article{healthadminbench,
  title={HealthAdminBench: A Benchmark for Evaluating LLMs on Solving Administrative Healthcare Tasks},
  author={Suhana Bedi and Ryan Welch and Ethan Steinberg and Michael Wornow and Taeil Matthew Kim and Haroun Ahmed and Peter Sterling and Bravim Purohit and Qurat Akram and Angelic Acosta and Esther Nubla and Pritika Sharma and Mike Pfeffer and Sanmi Koyejo and Nigam Shah},
  year={2026}
}
```