INSPECT

A Multimodal Dataset for Pulmonary Embolism Diagnosis and Prognosis


19,402 patients
23,248 CTPA studies
225+ million medical events
3 fully linked modalities

Synthesizing information from various data sources plays a crucial role in the practice of modern medicine. Current applications of artificial intelligence in medicine often focus on single-modality data due to a lack of publicly available, multimodal medical datasets.

To address this limitation, we publish a new dataset, INSPECT (Integrating Numerous Sources for Prognostic Evaluation of Clinical Timelines), which contains de-identified longitudinal records from a large cohort of pulmonary embolism (PE) patients, along with ground truth labels for multiple outcomes.
INSPECT contains (1) structured data from the electronic health record (EHR), (2) unstructured data from radiology reports, and (3) images from computed tomography pulmonary angiography (CTPA) scans. INSPECT is the first dataset to link these three modalities for a large cohort of patients.
Using INSPECT, we develop and release a benchmark for evaluating several baseline modeling approaches on a variety of important PE-related tasks, covering both diagnosis and prognosis. We evaluate image-only, EHR-only, and fused models.
Trained models and the de-identified dataset are made available for non-commercial use under a data use agreement.

Our model and dataset are available via a research data use agreement at this link. Code to reproduce our results is available at this link.

Overview

We collect (1) structured data from the electronic health record (EHR), (2) unstructured data from radiology reports, and (3) images from computed tomography pulmonary angiography (CTPA) scans, all from Stanford Hospital.
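As a rough illustration of what linking the three modalities can look like, the sketch below groups EHR events, radiology reports, and CTPA images by shared identifiers. The field names (`patient_id`, `study_id`, `text`, `path`) and the record layout are hypothetical and chosen only for this example; they are not the actual INSPECT file schema.

```python
from dataclasses import dataclass, field


@dataclass
class Study:
    study_id: str
    ctpa_path: str    # hypothetical location of the CTPA volume
    report_text: str  # radiology report for the same study ("" if missing)


@dataclass
class PatientRecord:
    patient_id: str
    ehr_events: list = field(default_factory=list)  # longitudinal EHR rows
    studies: list = field(default_factory=list)     # linked Study objects


def link_modalities(ehr_rows, report_rows, image_rows):
    """Group EHR events, reports, and CTPA images by patient and study."""
    # Index reports by study so each image can be paired with its report.
    reports = {r["study_id"]: r["text"] for r in report_rows}
    patients = {}
    for row in ehr_rows:
        pid = row["patient_id"]
        patients.setdefault(pid, PatientRecord(pid)).ehr_events.append(row)
    for img in image_rows:
        pid = img["patient_id"]
        study = Study(img["study_id"], img["path"], reports.get(img["study_id"], ""))
        patients.setdefault(pid, PatientRecord(pid)).studies.append(study)
    return patients
```

With records organized this way, a model can look up every EHR event, report, and scan for a given patient in one place, which is the property the three-modality linkage is meant to provide.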

CTPA Demo

Multi-planar reconstruction of the 3D volumetric data for one INSPECT patient

Comparison to Prior Work

Most prior medical datasets (1) are limited to a single modality, (2) include only one case per patient (i.e., no longitudinal data), (3) define no prognosis tasks, (4) cover a small cohort, and (5) provide no model benchmarks.

In contrast, INSPECT (1) contains the full breadth of longitudinal data that a health system would expect to have on the patients it treats, (2) links all three modalities to each patient and study, (3) includes both diagnosis and prognosis tasks, and (4) provides reproducible code for benchmarking.

Dataset	Imaging Modalities	Reports	EHR	#Patients	#Image Studies	Diagnostic Tasks	Prognostic Tasks
Open-I	Chest X-ray	✅	❌		7,466	❌	❌
CheXpert	Chest X-ray	❌	❌	65,240	224,316	14	❌
MIMIC-CXR	Chest X-ray	✅	✅	65,379	227,835	14	❌
UK Biobank	Multiple MRI, DXA, Ultrasound	❌	*	100,000	Many	❌	❌
RSPECT	CT	❌	❌	12,195	12,195	13	❌
RadFusion	CT	❌	*	1,794	1,837	1	❌
INSPECT (Ours)	CT	✅	✅	19,402	23,248	1	3

(* denotes partial availability.)

Breadth of longitudinal EHR data

INSPECT includes 9 STARR OMOP tables (STAnford Research Repository in the Observational Medical Outcomes Partnership common data model), which cover a broad range of medical events and make the dataset well suited to building medical foundation models.

Table Type	# Records (Whole Cohort)	Percentage (Whole Cohort)	Median (Per Patient)	Min (Per Patient)	Max (Per Patient)
Measurement	183,820,762	(76.9%)	3,783	0	500,368
Drug Exposure	17,288,279	(7.23%)	271	0	118,228
Procedure Occurrence	8,614,273	(3.6%)	190	1	35,926
Condition Occurrence	8,320,211	(3.48%)	148	0	27,480
Visit Occurrence	5,865,211	(2.45%)	126	1	16,336
Visit Detail	1,355,691	(0.56%)	23	0	4,840
Device Exposure	88,010	(0.03%)	1	0	682
Person	87,158	(0.03%)	4	1	48
Death	4,410	(0.001%)	0	0	13
Total	225,444,005	(100%)	5,080	7	741,873
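The per-patient columns in the table above can be derived from raw event rows roughly as follows. This is a minimal sketch, assuming each record carries a patient identifier and that patients with no rows in a table count as zero; the function name and inputs are illustrative and not part of the INSPECT codebase.

```python
from collections import Counter
from statistics import median


def per_patient_summary(event_patient_ids, cohort_patient_ids):
    """Summarize how many records of one table each cohort patient has.

    event_patient_ids: one patient ID per record in the table.
    cohort_patient_ids: every patient in the cohort, so that patients
    with no records contribute a count of zero.
    """
    counts = Counter(event_patient_ids)
    per_patient = [counts.get(pid, 0) for pid in cohort_patient_ids]
    return {
        "records": sum(per_patient),
        "median": median(per_patient),
        "min": min(per_patient),
        "max": max(per_patient),
    }
```

Counting against the full cohort rather than only the patients who appear in a table is what allows a minimum of 0, as seen for most table types above.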

Benchmarking the dataset

We build a benchmark backbone for each modality and combine their outputs with a late fusion strategy to produce the final prediction.
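Late fusion means each modality's model is trained and run separately, and only the resulting predictions are combined. The sketch below shows one common variant, a weighted average of per-patient probabilities from an image model and an EHR model. The averaging rule, the function name, and the `image_weight` parameter are assumptions made for illustration; the fusion rule actually used in the benchmark may differ.

```python
def late_fusion(image_probs, ehr_probs, image_weight=0.5):
    """Fuse per-patient probabilities from two independently trained models.

    image_probs, ehr_probs: predicted probabilities for the same patients,
    in the same order. image_weight is a hypothetical mixing parameter.
    """
    if len(image_probs) != len(ehr_probs):
        raise ValueError("both models must score the same patients")
    if not 0.0 <= image_weight <= 1.0:
        raise ValueError("image_weight must lie in [0, 1]")
    ehr_weight = 1.0 - image_weight
    return [
        image_weight * p_img + ehr_weight * p_ehr
        for p_img, p_ehr in zip(image_probs, ehr_probs)
    ]
```

Setting `image_weight` to 1.0 or 0.0 recovers the image-only or EHR-only prediction, so the single-modality baselines are special cases of the same function.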

Check out the Leaderboard for up-to-date results.

Citation

@article{huang2023inspect,
  title={INSPECT: A Multimodal Dataset for Pulmonary Embolism Diagnosis and Prognosis},
  author={Huang, Shih-Cheng and Huo, Zepeng and Steinberg, Ethan and Chiang, Chia-Chun and Langlotz, Curtis and Lungren, Matthew P and Yeung, Serena and Shah, Nigam and Fries, Jason Alan},
  journal={arXiv preprint arXiv:2311.10798},
  year={2023}
}