A Multimodal Dataset for Pulmonary Embolism Diagnosis and Prognosis

Paper GitHub Dataset download (TBD) Model download (TBD)




23,248 CTPA studies
225+ million medical events
3 fully linked modalities

Synthesizing information from various data sources plays a crucial role in the practice of modern medicine. Current applications of artificial intelligence in medicine often focus on single-modality data due to a lack of publicly available, multimodal medical datasets.

To address this limitation,

  1. We publish a new dataset, INSPECT (Integrating Numerous Sources for Prognostic Evaluation of Clinical Timelines), which contains de-identified longitudinal records from a large cohort of pulmonary embolism (PE) patients, along with ground truth labels for multiple outcomes.
  2. INSPECT contains (1) structured data from the electronic health record (EHR), (2) unstructured data from radiology reports, and (3) images from computed tomography pulmonary angiography (CTPA) scans. INSPECT is the first dataset to link these three modalities for a large cohort of patients.
  3. Using INSPECT, we develop and release a benchmark for evaluating several baseline modeling approaches on a variety of important PE-related tasks, covering both diagnosis and prognosis. We evaluate image-only, EHR-only, and fused models.
  4. Trained models and the de-identified dataset are made available for non-commercial use under a data use agreement.
Our model and dataset are available via a research data use agreement at this link. Code to reproduce our results is available at this link.


We collect three modalities from Stanford Hospital: (1) structured data from the electronic health record (EHR), (2) unstructured data from radiology reports, and (3) images from computed tomography pulmonary angiography (CTPA) scans.
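One way to picture how the three modalities hang together is a per-patient record that holds a timeline of EHR events alongside CTPA studies, each study carrying its report text and a pointer to its image volume. The sketch below is illustrative only: the class names, field names, and the "strictly before the study time" cutoff are assumptions made for this example, not the released INSPECT schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class EhrEvent:
    """One structured EHR event (hypothetical fields)."""
    time: datetime
    table: str  # e.g. "measurement" or "drug_exposure"
    code: str   # placeholder for a coded medical concept


@dataclass
class CtpaStudy:
    """One CTPA study with its report and image volume (hypothetical fields)."""
    study_id: str
    study_time: datetime
    report_text: str
    image_path: str


@dataclass
class PatientRecord:
    """All modalities for one patient, linked by a shared patient id."""
    patient_id: str
    events: List[EhrEvent] = field(default_factory=list)
    studies: List[CtpaStudy] = field(default_factory=list)

    def history_before(self, study: CtpaStudy) -> List[EhrEvent]:
        """Return EHR events strictly before the study time, oldest first.

        The strict cutoff is an assumption for this sketch.
        """
        earlier = [e for e in self.events if e.time < study.study_time]
        return sorted(earlier, key=lambda e: e.time)
```

With a structure like this, a model input for one study can be assembled from `study.image_path`, `study.report_text`, and `record.history_before(study)`.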


Multi-planar reconstruction of the 3D volumetric data for one INSPECT patient

Comparison to Prior Work

Most prior medical datasets have one or more of the following limitations: (1) they cover a single modality, (2) they include only one case per patient (i.e., no longitudinal data), (3) they define no prognosis tasks, (4) their cohort is small, and (5) they provide no model benchmarks.

In contrast, INSPECT (1) contains the full breadth of longitudinal data that a health system would expect to have on the patients it treats, (2) links all three modalities to each patient and study, (3) includes both diagnosis and prognosis tasks, and (4) provides reproducible code for benchmarking.

Dataset Imaging Modalities Reports EHR #Patients #Image Studies Diagnostic Tasks Prognostic Tasks
Open-I Chest X-ray 7,466
CheXpert Chest X-ray 65,240 224,316 14
MIMIC-CXR Chest X-ray 65,379 227,835 14
UK Biobank Multiple MRI, DXA, Ultrasound * 100,000 Many
RSPECT CT 12,195 12,195 13
RadFusion CT * 1,794 1,837 1
INSPECT (Ours) CT 19,402 23,248 1 3

(* denotes partial availability.)

Breadth of longitudinal EHR data

INSPECT includes 9 STARR OMOP tables (STAnford Research Repository, in the Observational Medical Outcomes Partnership format) that cover a broad range of medical events, making it well suited to building medical foundation models.

Table Type, Whole Cohort # Records (Percentage), Per Patient Median, Per Patient Min, Per Patient Max
Measurement 183,820,762 (76.9%) 3,783 0 500,368
Drug Exposure 17,288,279 (7.23%) 271 0 118,228
Procedure Occurrence 8,614,273 (3.6%) 190 1 35,926
Condition Occurrence 8,320,211 (3.48%) 148 0 27,480
Visit Occurrence 5,865,211 (2.45%) 126 1 16,336
Visit Detail 1,355,691 (0.56%) 23 0 4,840
Device Exposure 88,010 (0.03%) 1 0 682
Person 87,158 (0.03%) 4 1 48
Death 4,410 (0.001%) 0 0 13
Total 225,444,005 (100%) 5,080 7 741,873
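
The per-patient columns above are summary statistics over the number of records each patient has in a given table. A minimal sketch of one plausible way to compute them follows, assuming each record carries a patient identifier; the function and argument names are hypothetical and this is not the authors' pipeline. Patients with no records in a table are counted as zero, which is how a minimum of 0 can arise.

```python
from collections import Counter
from statistics import median


def per_patient_summary(record_patient_ids, cohort_patient_ids):
    """Summarize how many records each cohort patient has in one table.

    record_patient_ids: one patient id per record in the table.
    cohort_patient_ids: every patient id in the cohort, including
        patients with no records in this table.
    Records whose patient id is not in the cohort are ignored.
    """
    counts = Counter(record_patient_ids)
    per_patient = [counts.get(pid, 0) for pid in cohort_patient_ids]
    return {
        "total": sum(per_patient),
        "median": median(per_patient),
        "min": min(per_patient),
        "max": max(per_patient),
    }


# Toy example: patient "c" has no records in this table.
summary = per_patient_summary(["a", "a", "a", "b"], ["a", "b", "c"])
```

In the toy example the per-patient counts are 3, 1, and 0, so the summary reports a total of 4 records with median 1, minimum 0, and maximum 3.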

Benchmarking the dataset

We benchmark a separate backbone model for each modality and combine their outputs with a late fusion strategy to produce the final prediction.
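
Late fusion means each modality's model produces its own prediction and the predictions are combined afterwards. The page does not state the exact fusion rule, so the sketch below assumes the simplest common variant, a weighted average of per-modality probabilities; the function name, modality keys, and weights are illustrative only.

```python
from typing import Dict, Optional


def late_fusion(
    probabilities: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Combine per-modality probabilities with a weighted average.

    probabilities: maps a modality name (e.g. "image", "ehr") to the
        probability predicted by that modality's model.
    weights: optional non-negative weight per modality; defaults to
        equal weights. Averaging is an assumption of this sketch.
    """
    if not probabilities:
        raise ValueError("at least one modality prediction is required")
    if weights is None:
        weights = {name: 1.0 for name in probabilities}
    total_weight = sum(weights[name] for name in probabilities)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[name] * p for name, p in probabilities.items())
    return weighted_sum / total_weight


# Equal-weight fusion of an image-only and an EHR-only prediction.
fused = late_fusion({"image": 0.8, "ehr": 0.4})
```

A learned combiner, such as a small classifier trained on the per-modality outputs, is another common form of late fusion; averaging is shown here only because it needs no extra training.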

Check out the Leaderboard for up-to-date results.


  title={INSPECT: A Multimodal Dataset for Pulmonary Embolism Diagnosis and Prognosis},
  author={Huang, Shih-Cheng and Huo, Zepeng and Steinberg, Ethan and Chiang, Chia-Chun and Langlotz, Curtis and Lungren, Matthew P and Yeung, Serena and Shah, Nigam and Fries, Jason Alan},
  journal={arXiv preprint arXiv:2311.10798},