Zero-Shot Trial Matching with LLMs

TL;DR: We explore zero-shot clinical trial patient matching with large language models (LLMs) under two system designs: traditional prompting and reduced prompting via retrieval augmentation. (a) In the traditional design, we inject the patient's entire set of notes into the prompt given to an Assessment LLM (e.g. GPT-4) for evaluation. (b) In our two-stage retrieval pipeline, we first retrieve the top-k most relevant chunks from the patient's notes, then inject only those chunks into the prompt given to the Assessment LLM. Both designs are compared using the same prompting strategies; the only difference is the amount of patient information included in the prompt.

Matching patients to clinical trials is a key unsolved challenge in bringing new drugs to market. Today, identifying patients who meet a trial's eligibility criteria is highly manual, taking up to 1 hour per patient. LLMs offer a promising solution. In this work, we explore their application to trial matching. First, we design an LLM-based system which, given a patient's medical history as unstructured clinical text, evaluates whether that patient meets a set of inclusion criteria. Our zero-shot system achieves state-of-the-art scores on the n2c2 2018 cohort selection benchmark. Second, we improve the data and cost efficiency of our method by identifying a prompting strategy which matches patients an order of magnitude faster and more cheaply than the status quo, and develop a two-stage retrieval pipeline that reduces the number of tokens processed by up to a third while retaining high performance. Third, we evaluate the interpretability of our system by having clinicians evaluate the natural language justifications generated by the LLM for each eligibility decision, and show that it can output coherent explanations for 97% of its correct decisions and 75% of its incorrect ones. Our results establish the feasibility of using LLMs to accelerate clinical trial operations.

For our initial zero-shot evaluation, we feed each patient's entire medical history into the LLM and have it predict all criteria at once. Every model we test can fit each patient's history into its context window (Table 1). Despite not being tuned for trial matching or given any in-context examples, GPT-4 beats the prior state of the art by +6 Macro-F1 and +2 Micro-F1 points.
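To make the two reported metrics concrete, here is a minimal sketch of how Macro-F1 and Micro-F1 can differ on the same predictions. It assumes binary met/not-met labels per criterion and uses one common definition (macro averages per-criterion F1; micro pools the counts first); the benchmark's official scorer may differ in detail. The criterion names and labels are hypothetical.

```python
from typing import Dict, List, Tuple

Labels = Dict[str, List[int]]  # criterion name -> one 0/1 label per patient


def confusion(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(truth: Labels, preds: Labels) -> float:
    """Average the per-criterion F1 scores, so every criterion counts equally."""
    scores = [f1_from_counts(*confusion(truth[c], preds[c])) for c in truth]
    return sum(scores) / len(scores)


def micro_f1(truth: Labels, preds: Labels) -> float:
    """Pool counts over all criteria first, so frequent criteria dominate."""
    tp = fp = fn = 0
    for c in truth:
        c_tp, c_fp, c_fn = confusion(truth[c], preds[c])
        tp, fp, fn = tp + c_tp, fp + c_fp, fn + c_fn
    return f1_from_counts(tp, fp, fn)


# Hypothetical labels for four patients and two criteria.
truth = {"criterion_a": [1, 1, 1, 0], "criterion_b": [1, 0, 0, 0]}
preds = {"criterion_a": [1, 1, 1, 0], "criterion_b": [0, 1, 0, 0]}

print(macro_f1(truth, preds))  # 0.5: a perfect criterion averaged with a failed one
print(micro_f1(truth, preds))  # 0.75: pooled TP=3, FP=1, FN=1
```

Under this definition, a model that fails on a rare criterion is penalized more heavily in Macro-F1 than in Micro-F1, which is one reason the two margins can differ.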

Table 1: Zero-shot 2018 n2c2 benchmark results using the ACIN prompt strategy. We use versions of each model with at least 32k context length, with the exception of GPT-3.5 (limited to 16k tokens) and Llama-3-70b (limited to 8k tokens). Bootstrapped 95% confidence intervals on the test set (1000 samples) are shown in subscript.
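The bootstrapped intervals mentioned in the caption can be illustrated with a short sketch. It assumes a percentile bootstrap that resamples patients with replacement 1000 times; the caption does not say which bootstrap variant was used, so the method, the function names, and the toy labels below are all assumptions.

```python
import random
from typing import Callable, List, Tuple


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """Binary F1 for 0/1 labels; 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(
    y_true: List[int],
    y_pred: List[int],
    metric: Callable[[List[int], List[int]], float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval (assumed variant) for `metric`."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        # Resample patient indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return stats[lo_idx], stats[hi_idx]


# Hypothetical test-set labels for a single criterion.
y_true = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1]

low, high = bootstrap_ci(y_true, y_pred, f1_score)
print(f"F1 = {f1_score(y_true, y_pred):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Resampling whole patients, rather than individual predictions, keeps each patient's labels together, which is the usual choice when the test set is a sample of patients.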

Table 2: Performance, cost, and data efficiency across different prompting strategies. Considering one criterion or one note at a time improves performance. "Tokens" includes both prompt and completion tokens (i.e. inputs and outputs). "API Calls" is the total number of times the LLM was queried. "Cost" is based on OpenAI’s pricing as of January 25, 2024.
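The trade-off between performance and "API Calls" can be sketched with simple arithmetic. This example assumes the four-letter strategy names combine All/Individual Criteria (AC/IC) with All/Individual Notes (AN/IN); that expansion is our reading of the acronyms and is not spelled out on this page. The function name and the criterion and note counts are hypothetical.

```python
VALID_STRATEGIES = ("ACAN", "ACIN", "ICAN", "ICIN")


def api_calls(strategy: str, n_criteria: int, n_notes: int) -> int:
    """Number of LLM queries needed for one patient under a strategy.

    Assumed reading: 'IC' issues one query per criterion (otherwise all
    criteria share a query); 'IN' issues one query per note (otherwise all
    notes share a query).
    """
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    criteria_factor = n_criteria if strategy.startswith("IC") else 1
    notes_factor = n_notes if strategy.endswith("IN") else 1
    return criteria_factor * notes_factor


# Hypothetical patient: 10 criteria to check, 4 notes in the record.
for name in VALID_STRATEGIES:
    print(name, api_calls(name, n_criteria=10, n_notes=4))
```

Under these assumptions, the most fine-grained strategy multiplies the two counts, which is why considering one criterion and one note at a time raises the number of queries even when it improves accuracy.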

We are able to surpass the prior state-of-the-art on Macro-F1 using roughly one-third and one-half as many tokens as needed in the vanilla ICAN and ACAN strategies, respectively (using a patient's full note).
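The two-stage retrieval pipeline described in the TL;DR can be sketched as follows. This is a toy illustration, not the authors' implementation: a bag-of-words vector stands in for a real embedding model such as MiniLM or BGE, the chunk size and prompt wording are invented, and the notes and criterion are hypothetical. The sketch stops at building the prompt and does not call an LLM.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk_notes(notes: List[str], chunk_words: int = 40) -> List[str]:
    """Split each note into chunks of at most `chunk_words` words."""
    chunks = []
    for note in notes:
        words = note.split()
        for i in range(0, len(words), chunk_words):
            chunks.append(" ".join(words[i:i + chunk_words]))
    return chunks


def retrieve_top_k(criterion: str, chunks: List[str], k: int) -> List[str]:
    """Stage 1: rank chunks by similarity to the criterion, keep the top k."""
    query = embed(criterion)
    ranked = sorted(chunks, key=lambda c: cosine(query, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(criterion: str, chunks: List[str]) -> str:
    """Stage 2: place only the retrieved chunks into the assessment prompt."""
    context = "\n---\n".join(chunks)
    return (
        f"Patient note excerpts:\n{context}\n\n"
        f"Criterion: {criterion}\n"
        "Does the patient meet this criterion? Answer yes or no with a brief rationale."
    )


# Hypothetical notes and criterion.
notes = [
    "Patient reports taking aspirin daily to prevent myocardial infarction.",
    "Family history is notable for asthma. No known drug allergies.",
    "Follow-up visit scheduled in three months.",
]
criterion = "Use of aspirin to prevent myocardial infarction"

chunks = chunk_notes(notes)
top_chunks = retrieve_top_k(criterion, chunks, k=1)
prompt = build_prompt(criterion, top_chunks)
print(prompt)
```

With k=1, only the aspirin note reaches the prompt, so the assessment model processes a fraction of the tokens it would see if all notes were included. Raising k trades more tokens for a lower chance of missing relevant evidence.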

Figure 3: Model performance increases as the number (k) of retrieved documents increases, but quickly plateaus with diminishing returns. We test k ∈ {1, 3, 5, 10}. Each subfigure is a different prompting strategy. The y-axis is model performance (Macro/Micro-F1) and the x-axis is the total number of tokens processed by the model. Orange is GPT-4, blue is GPT-3.5, and the green line is the prior state-of-the-art. Stars represent each model’s best performance when feeding in all notes. The MiniLM embedding model is the dashed line, while BGE is the solid line.

We sample 468 rationales generated by GPT-4 and have two clinicians evaluate their veracity. Each rationale was rated on a three-point scale (Correct, Partially Correct, or Incorrect) based on how accurately it aligned with the relevant patient's EHR. The results show that GPT-4 is able to provide legitimate rationales for most of its decisions. When GPT-4 made a correct eligibility decision (Figure 4A), 89% of its rationales were judged fully correct, 8% partially correct, and 3% incorrect. When GPT-4 made an incorrect eligibility decision (Figure 4B), 67% of its rationales were judged fully correct, 8% partially correct, and 25% incorrect.

Figure 4: A (top). Clinician assessment of the rationales generated by GPT-4 for its correct eligibility decisions. B (bottom). Clinician assessment of the rationales generated by GPT-4 for its incorrect eligibility decisions.

BibTeX

@article{wornow2025zero,
  title={Zero-Shot Clinical Trial Patient Matching with {LLMs}},
  author={Wornow, Michael and Lozano, Alejandro and Dash, Dev and Jindal, Jenelle and Mahaffey, Kenneth W and Shah, Nigam H},
  journal={NEJM AI},
  volume={2},
  number={1},
  pages={AIcs2400360},
  year={2025},
  publisher={Massachusetts Medical Society}
}
