EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models
We collect structured data from the deidentified, longitudinal EHRs of patients treated at Stanford Hospital.
Most prior benchmarks are (1) limited to the ICU setting and (2) not tailored towards few-shot evaluation of pre-trained models. In contrast, EHRSHOT contains (1) the full breadth of longitudinal data that a health system would expect to have on the patients it treats and (2) a broad range of tasks designed to evaluate models' task adaptation and few-shot capabilities.
EHRSHOT includes 15 clinical classification tasks with canonical train/val/test splits, defined as follows.
| Task | Type | Prediction Time | Time Horizon |
|---|---|---|---|
| Long Length of Stay | Binary | 11:59pm on day of admission | Admission duration |
| 30-day Readmission | Binary | 11:59pm on day of discharge | 30 days post-discharge |
| ICU Transfer | Binary | 11:59pm on day of admission | Admission duration |
| Thrombocytopenia | 4-way Multiclass | Immediately before result is recorded | Next result |
| Hyperkalemia | 4-way Multiclass | Immediately before result is recorded | Next result |
| Hypoglycemia | 4-way Multiclass | Immediately before result is recorded | Next result |
| Hyponatremia | 4-way Multiclass | Immediately before result is recorded | Next result |
| Anemia | 4-way Multiclass | Immediately before result is recorded | Next result |
| Hypertension | Binary | 11:59pm on day of discharge | 1 year post-discharge |
| Hyperlipidemia | Binary | 11:59pm on day of discharge | 1 year post-discharge |
| Pancreatic Cancer | Binary | 11:59pm on day of discharge | 1 year post-discharge |
| Celiac | Binary | 11:59pm on day of discharge | 1 year post-discharge |
| Lupus | Binary | 11:59pm on day of discharge | 1 year post-discharge |
| Acute MI | Binary | 11:59pm on day of discharge | 1 year post-discharge |
| Chest X-Ray Findings | 14-way Multilabel | 24 hours before report is recorded | Next report |
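To make the "Prediction Time" and "Time Horizon" columns concrete, here is a minimal sketch of how a label for the Long Length of Stay task could be constructed. The function name, column semantics, and the 7-day threshold are illustrative assumptions for this sketch, not the benchmark's actual labeling code.

```python
from datetime import datetime, timedelta

def long_los_label(admit_time: datetime, discharge_time: datetime,
                   threshold_days: int = 7):
    """Illustrative labeler for Long Length of Stay.

    Prediction time: 11:59pm on the day of admission.
    Label: 1 if the admission lasts at least `threshold_days` days, else 0.
    (The 7-day threshold is an assumption for illustration.)
    """
    prediction_time = admit_time.replace(hour=23, minute=59,
                                         second=0, microsecond=0)
    label = int((discharge_time - admit_time) >= timedelta(days=threshold_days))
    return prediction_time, label

# A 9-day admission yields a positive label at 11:59pm on the admission day.
pred_time, label = long_los_label(datetime(2020, 1, 1, 14, 30),
                                  datetime(2020, 1, 10, 9, 0))
```

The key point is that every task pairs a fixed prediction time (when the model's input is frozen) with a horizon over which the outcome is observed.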
We include a graphical summary of the different task definitions below.
We evaluate each baseline model in a few-shot setting. For each of the 15 benchmark tasks, we steadily increase the number of examples k that each model sees, from k = 1 up to the full training dataset, and record the model's AUROC and AUPRC at each k.
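The loop above can be sketched as follows. This is a hedged illustration, not the benchmark's evaluation harness: the representations, labels, k-grid, and logistic-regression probe are all stand-in assumptions.

```python
# Sketch of a few-shot evaluation loop: for each k, subsample k training
# examples per class, fit a simple probe, and record AUROC/AUPRC on a
# held-out test set. All data here are random stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Toy stand-ins for pre-trained-model representations and task labels.
X_train = rng.normal(size=(1000, 32))
y_train = rng.integers(0, 2, 1000)
X_test = rng.normal(size=(200, 32))
y_test = rng.integers(0, 2, 200)

results = {}
for k in [1, 2, 4, 8, 16, 32, 64, 128]:
    # Sample k examples from each class to form the few-shot training set.
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y_train == c), size=k, replace=False)
        for c in (0, 1)
    ])
    probe = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    scores = probe.predict_proba(X_test)[:, 1]
    results[k] = (roc_auc_score(y_test, scores),
                  average_precision_score(y_test, scores))
```

In practice the per-k training subsets would be drawn from EHRSHOT's canonical train split, and the probe would sit on top of each foundation model's patient representations.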
In the figures below, the bolded lines show the Macro-AUC for each model within a task category, averaged across all subtasks at each k. The lighter lines show the AUC for each model on each individual subtask.
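The Macro-AUC at a given k is simply the unweighted mean of the per-subtask AUROCs within a task category. A minimal sketch (the subtask scores below are made-up values for illustration):

```python
# Macro-AUC for a task category = unweighted mean of each subtask's AUROC
# at a fixed k. These per-subtask values are illustrative, not real results.
per_subtask_auroc = {
    "thrombocytopenia": 0.71,
    "hyperkalemia": 0.68,
    "hypoglycemia": 0.74,
    "hyponatremia": 0.66,
    "anemia": 0.79,
}
macro_auc = sum(per_subtask_auroc.values()) / len(per_subtask_auroc)
print(round(macro_auc, 3))  # 0.716
```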
We include the performance of each model trained on the entire EHRSHOT training split on the far right of every plot as "All".
Please note that results differ slightly from the original EHRSHOT paper due to small changes made to the dataset for public release.
Check out the Leaderboard for up-to-date results.
```bibtex
@article{wornow2023ehrshot,
  title={EHRSHOT: An EHR Benchmark for Few-Shot Evaluation of Foundation Models},
  author={Michael Wornow and Rahul Thapa and Ethan Steinberg and Jason Fries and Nigam Shah},
  year={2023},
  eprint={2307.02028},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```