A Multimodal Dataset for Pulmonary Embolism Diagnosis and Prognosis
We collect three modalities from Stanford Hospital: (1) electronic health records (EHR), (2) unstructured data from radiology reports, and (3) images from computed tomography pulmonary angiography (CTPA) scans.
Multi-planar reconstruction of 3D volumetric data for one INSPECT patient
Most prior medical imaging datasets have one or more of the following limitations: (1) they cover a single modality, (2) they include only one case per patient (i.e., no longitudinal data), (3) they define no prognosis tasks, (4) their cohorts are small, and (5) they provide no model benchmarks.
In contrast, INSPECT (1) contains the full breadth of longitudinal data that a health system would expect to have on the patients it treats, (2) links all three modalities to each patient and study, (3) includes both diagnosis and prognosis tasks, and (4) provides reproducible code for benchmarking.
Dataset | Imaging Modalities | Reports | EHR | #Patients | #Image Studies | Diagnostic Tasks | Prognostic Tasks |
---|---|---|---|---|---|---|---|
Open-I | Chest X-ray | ✅ | ❌ | 7,466 | ❌ | ❌ | |
CheXpert | Chest X-ray | ❌ | ❌ | 65,240 | 224,316 | 14 | ❌ |
MIMIC-CXR | Chest X-ray | ✅ | ✅ | 65,379 | 227,835 | 14 | ❌ |
UK Biobank | Multiple MRI, DXA, Ultrasound | ❌ | * | 100,000 | Many | ❌ | ❌ |
RSPECT | CT | ❌ | ❌ | 12,195 | 12,195 | 13 | ❌ |
RadFusion | CT | ❌ | * | 1,794 | 1,837 | 1 | ❌ |
INSPECT (Ours) | CT | ✅ | ✅ | 19,402 | 23,248 | 1 | 3 |
(* denotes partial availability.)
INSPECT includes 9 STARR OMOP tables (STAnford Research Repository in Observational Medical Outcomes Partnership format) that cover a broad range of medical events, making it well suited to building medical foundation models.
Table Type | # Records (Whole Cohort) | Percentage (Whole Cohort) | Median (Per Patient) | Min (Per Patient) | Max (Per Patient) |
---|---|---|---|---|---|
Measurement | 183,820,762 | (76.9%) | 3,783 | 0 | 500,368 |
Drug Exposure | 17,288,279 | (7.23%) | 271 | 0 | 118,228 |
Procedure Occurrence | 8,614,273 | (3.6%) | 190 | 1 | 35,926 |
Condition Occurrence | 8,320,211 | (3.48%) | 148 | 0 | 27,480 |
Visit Occurrence | 5,865,211 | (2.45%) | 126 | 1 | 16,336 |
Visit Detail | 1,355,691 | (0.56%) | 23 | 0 | 4,840 |
Device Exposure | 88,010 | (0.03%) | 1 | 0 | 682 |
Person | 87,158 | (0.03%) | 4 | 1 | 48 |
Death | 4,410 | (0.001%) | 0 | 0 | 13 |
Total | 225,444,005 | (100%) | 5,080 | 7 | 741,873 |
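The per-patient Median, Min, and Max columns above can be thought of as summaries of how many rows each patient has in a given table. The following is a minimal sketch of that computation, not the code used to produce the table. The function name `per_patient_stats` and its inputs are hypothetical, and it assumes that patients with no rows in a table count as zero (which would be consistent with the Min values of 0 shown above).

```python
from collections import Counter
from statistics import median


def per_patient_stats(patient_ids, record_patient_ids):
    """Summarize how many records each cohort patient has in one table.

    patient_ids: every patient in the cohort (hypothetical input).
    record_patient_ids: one patient ID per row of the table (hypothetical input).
    Patients with no rows are counted as 0; rows for patients outside
    the cohort are ignored.
    """
    counts = Counter(record_patient_ids)
    per_patient = [counts.get(pid, 0) for pid in patient_ids]
    return {
        "records": sum(per_patient),
        "median": median(per_patient),
        "min": min(per_patient),
        "max": max(per_patient),
    }


if __name__ == "__main__":
    cohort = ["p1", "p2", "p3"]
    rows = ["p1", "p1", "p2", "p1"]  # toy data, not taken from INSPECT
    print(per_patient_stats(cohort, rows))
```

Counting over the full cohort rather than only over patients that appear in the table matters for sparse tables such as Death, where most patients have no rows.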
We benchmark a separate backbone model for each modality and use a late fusion strategy to produce the final prediction.
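Late fusion combines the outputs of independently trained per-modality models rather than their inputs or intermediate features. The sketch below illustrates one common variant, a weighted average of predicted probabilities. The exact fusion rule used in the INSPECT benchmark is not stated here, so the averaging rule, the function name `late_fusion`, and the modality keys are assumptions made for illustration only.

```python
def late_fusion(modality_probs, weights=None):
    """Fuse per-modality probabilities with a weighted average.

    modality_probs: dict mapping a modality name to a list of predicted
        probabilities, one per sample (hypothetical structure).
    weights: optional dict mapping modality name to a non-negative weight;
        defaults to equal weights (an assumption, not the benchmark's rule).
    """
    names = sorted(modality_probs)
    if not names:
        raise ValueError("at least one modality is required")
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    n_samples = len(modality_probs[names[0]])
    if any(len(modality_probs[name]) != n_samples for name in names):
        raise ValueError("all modalities must score the same samples")

    # Average the per-modality scores sample by sample.
    return [
        sum(weights[name] * modality_probs[name][i] for name in names) / total
        for i in range(n_samples)
    ]


if __name__ == "__main__":
    # Toy scores for two samples, not results from the benchmark.
    scores = {"ctpa": [0.9, 0.2], "ehr": [0.7, 0.4]}
    print(late_fusion(scores))
```

Because each backbone is trained and evaluated separately, a fusion step like this can be swapped out (for example, for a small model trained on the stacked scores) without retraining the per-modality models.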
Check out the Leaderboard for up-to-date results.
@article{huang2023inspect,
  title={INSPECT: A Multimodal Dataset for Pulmonary Embolism Diagnosis and Prognosis},
  author={Huang, Shih-Cheng and Huo, Zepeng and Steinberg, Ethan and Chiang, Chia-Chun and Langlotz, Curtis and Lungren, Matthew P and Yeung, Serena and Shah, Nigam and Fries, Jason Alan},
  journal={arXiv preprint arXiv:2311.10798},
  year={2023}
}