👂 💉

EHRSHOT

An EHR Benchmark for Few-Shot Evaluation of Foundation Models

6,739 patients · 41,661,637 clinical events · 921,499 visits · 15 prediction tasks
While the general machine learning (ML) community has benefited from public datasets, tasks, and models, progress in healthcare ML has been hampered by a lack of such shared assets. The success of foundation models compounds this challenge: validating their performance benefits requires access to shared pretrained models, which are rarely released for clinical data.

We help address these challenges through three contributions.

  1. We publish a new dataset, EHRSHOT, which contains de-identified structured data from the electronic health records (EHRs) of 6,739 patients from Stanford Medicine. Unlike MIMIC-III/IV and other popular EHR datasets, EHRSHOT is longitudinal and not restricted to ICU/ED patients.
  2. We release the full weights of CLMBR-T-base, a 141M parameter clinical foundation model pretrained on the structured EHR data of 2.57M patients. We are one of the first to fully release such a model for coded EHR data; in contrast, most prior models released for clinical data (e.g. GatorTron, ClinicalBERT) only work with unstructured text and cannot process the rich, structured data within an EHR. We provide an end-to-end pipeline for the community to validate and build upon its performance.
  3. We define 15 few-shot clinical prediction tasks, enabling evaluation of the benefits foundation models promise, such as sample efficiency and task adaptation.
Our model is available at this link. The dataset is available at this link. Code to reproduce our results is available at this link.
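As a rough illustration of working with the release, here is a minimal sketch that loads the structured events and the labels for one task with pandas. The file names and column names (`ehrshot.csv`, `labels/long_los.csv`, `patient_id`, `start`, `code`, `prediction_time`) are assumptions for illustration only; consult the documentation that ships with the dataset for the actual schema.

```python
import pandas as pd

# File and column names below are illustrative assumptions, not the
# guaranteed schema of the public release -- check the shipped docs.
events = pd.read_csv("ehrshot.csv", parse_dates=["start"])
labels = pd.read_csv("labels/long_los.csv", parse_dates=["prediction_time"])

# One row per clinical event: (patient_id, start, code, ...), with codes
# drawn from standard vocabularies (e.g. SNOMED, LOINC, RxNorm).
print(events.groupby("patient_id").size().describe())

# For a given task, each labeled example is (patient, prediction time, label).
# Keep only events occurring before the prediction time to avoid leakage.
example = labels.iloc[0]
history = events[
    (events["patient_id"] == example["patient_id"])
    & (events["start"] <= example["prediction_time"])
]
print(len(history), "events available at prediction time")
```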

Overview


We collect the structured data within the de-identified longitudinal EHRs of patients from Stanford Hospital.

Comparison to Prior Work


Most prior benchmarks are (1) limited to the ICU setting and (2) not tailored towards few-shot evaluation of pre-trained models. In contrast, EHRSHOT contains (1) the full breadth of longitudinal data that a health system would expect to have on the patients it treats and (2) a broad range of tasks designed to evaluate models' task adaptation and few-shot capabilities.

Tasks


EHRSHOT includes 15 clinical classification tasks with canonical train/val/test splits, defined as follows.

| Task | Type | Prediction Time | Time Horizon |
|---|---|---|---|
| Long Length of Stay | Binary | 11:59pm on day of admission | Admission duration |
| 30-day Readmission | Binary | 11:59pm on day of discharge | 30 days post-discharge |
| ICU Transfer | Binary | 11:59pm on day of admission | Admission duration |
| Thrombocytopenia | 4-way Multiclass | Immediately before result is recorded | Next result |
| Hyperkalemia | 4-way Multiclass | Immediately before result is recorded | Next result |
| Hypoglycemia | 4-way Multiclass | Immediately before result is recorded | Next result |
| Hyponatremia | 4-way Multiclass | Immediately before result is recorded | Next result |
| Anemia | 4-way Multiclass | Immediately before result is recorded | Next result |
| Hypertension | Binary | 11:59pm on day of discharge | 1 year post-discharge |
| Hyperlipidemia | Binary | 11:59pm on day of discharge | 1 year post-discharge |
| Pancreatic Cancer | Binary | 11:59pm on day of discharge | 1 year post-discharge |
| Celiac | Binary | 11:59pm on day of discharge | 1 year post-discharge |
| Lupus | Binary | 11:59pm on day of discharge | 1 year post-discharge |
| Acute MI | Binary | 11:59pm on day of discharge | 1 year post-discharge |
| Chest X-Ray Findings | 14-way Multilabel | 24hrs before report is recorded | Next report |

We include a graphical summary of the different task definitions below.
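To make one row of the table concrete, the sketch below derives Long Length of Stay labels from a DataFrame of admissions. The ≥ 7-day threshold follows the paper's definition of a long stay; the column names (`admit_time`, `discharge_time`) are assumptions for illustration.

```python
import pandas as pd

def long_los_labels(admissions: pd.DataFrame) -> pd.DataFrame:
    """Binary Long Length of Stay labels, predicted at 11:59pm on the
    day of admission. Assumes `admit_time`/`discharge_time` datetime
    columns; the >= 7-day threshold follows the paper's definition."""
    out = admissions.copy()
    # Prediction time: 11:59pm on the calendar day of admission.
    out["prediction_time"] = (
        out["admit_time"].dt.normalize() + pd.Timedelta(hours=23, minutes=59)
    )
    # Label: total stay lasts at least 7 days.
    los = out["discharge_time"] - out["admit_time"]
    out["label"] = (los >= pd.Timedelta(days=7)).astype(int)
    # Drop stays that ended before the prediction time, since the label
    # would already be knowable there.
    return out[out["discharge_time"] > out["prediction_time"]]
```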

Benchmarking Results


We evaluate each baseline model in a few-shot setting. For each of the 15 benchmark tasks, we steadily increase the number of examples k that each model sees from k = 1 to the full training dataset, and record the model’s AUROC and AUPRC at each k.
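A minimal sketch of this protocol for a binary task, assuming frozen patient representations (e.g. from CLMBR-T-base) as the `X_*` arrays: at each k, subsample k positives and k negatives, fit a logistic regression head, and score the fixed test split. In practice one would also repeat each k over several random draws and average.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

def few_shot_curve(X_train, y_train, X_test, y_test,
                   ks=(1, 2, 4, 8, 16, 32, 64, 128), seed=0):
    """AUROC/AUPRC as a function of k training examples per class.

    X_* are frozen patient representations; y_* are binary task labels.
    A minimal sketch of the few-shot protocol, not the exact benchmark code."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y_train == 1)
    neg = np.flatnonzero(y_train == 0)
    results = {}
    for k in ks:
        if k > min(len(pos), len(neg)):
            break  # not enough examples of one class
        idx = np.concatenate([rng.choice(pos, k, replace=False),
                              rng.choice(neg, k, replace=False)])
        clf = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
        p = clf.predict_proba(X_test)[:, 1]
        results[k] = (roc_auc_score(y_test, p),
                      average_precision_score(y_test, p))
    return results
```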

In the figures below, the bold lines show the Macro-AUC for each model within a task category, averaged across all subtasks at each k. The lighter lines show each model's AUC on each individual subtask.

The rightmost point of every plot ("All") shows the performance of each model trained on the entire EHRSHOT training split.

Please note that these results differ slightly from those in the original EHRSHOT paper, due to small changes made to the dataset for public release.

Check out the Leaderboard for up-to-date results.

[Figures: few-shot AUROC and AUPRC curves for each model across the benchmark tasks.]

Citation

@article{wornow2023ehrshot,
  title={EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models}, 
  author={Michael Wornow and Rahul Thapa and Ethan Steinberg and Jason Fries and Nigam Shah},
  year={2023},
  eprint={2307.02028},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}