A short video demonstrates the report generation task; CheXagent can also perform other tasks related to chest X-rays. Please refer to our paper for more details.

Abstract

Chest X-rays (CXRs) are the most frequently performed imaging test in clinical practice. Recent advances in the development of vision-language foundation models (FMs) give rise to the possibility of performing automated CXR interpretation, which can assist physicians with clinical decision-making and improve patient outcomes. However, developing FMs that can accurately interpret CXRs is challenging due to the (1) limited availability of large-scale vision-language datasets in the medical image domain, (2) lack of vision and language encoders that can capture the complexities of medical data, and (3) absence of evaluation frameworks for benchmarking the abilities of FMs on CXR interpretation. In this work, we address these challenges by first introducing CheXinstruct - a large-scale instruction-tuning dataset curated from 28 publicly-available datasets. We then present CheXagent - an instruction-tuned FM capable of analyzing and summarizing CXRs. To build CheXagent, we design a clinical large language model (LLM) for parsing radiology reports, a vision encoder for representing CXR images, and a network to bridge the vision and language modalities. Finally, we introduce CheXbench - a novel benchmark designed to systematically evaluate FMs across 8 clinically-relevant CXR interpretation tasks. Extensive quantitative evaluations and qualitative reviews with five expert radiologists demonstrate that CheXagent outperforms previously-developed general- and medical-domain FMs on CheXbench tasks. Furthermore, in an effort to improve model transparency, we perform a fairness evaluation across factors of sex, race and age to highlight potential performance disparities. The figure below shows the overview of the proposed pipeline.

Overview

Data: CheXinstruct

We introduce CheXinstruct - a large-scale instruction-tuning dataset curated from 28 publicly-available datasets. The figure below shows the collection of datasets and tasks comprising CheXinstruct.

CheXinstruct

You can explore samples from CheXinstruct in the demo below, which shows three samples for each task.
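As a rough illustration of how instruction-tuning data of this kind can be browsed, the sketch below loads a CheXinstruct-style split with the Hugging Face datasets library and prints three samples per task, mirroring the demo. The dataset identifier and the field names ("task", "instruction", "answer") are assumptions for illustration; consult the official release for the exact schema.

# Sketch: browse a few CheXinstruct-style samples per task.
# Assumptions: the data is available on the Hugging Face Hub under an
# identifier like "StanfordAIMI/CheXinstruct" and each record carries a
# "task" field; the real identifier and schema may differ.
from collections import defaultdict

from datasets import load_dataset

dataset = load_dataset("StanfordAIMI/CheXinstruct", split="train")  # assumed ID

samples_per_task = defaultdict(list)
for record in dataset:
    task = record.get("task", "unknown")
    if len(samples_per_task[task]) < 3:  # three samples per task, as in the demo
        samples_per_task[task].append(record)

for task, samples in samples_per_task.items():
    print(f"=== {task} ===")
    for sample in samples:
        print(sample)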

Model: CheXagent

We then present CheXagent - an instruction-tuned FM capable of analyzing and summarizing CXRs. The figure below shows the four-stage training process of CheXagent, starting from adapting a general LLM for clinical use, through training a CXR vision encoder and a vision-language bridger, to the final stage of instruction tuning on diverse CXR tasks.

CheXagent
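For reference, a minimal inference sketch is shown below, assuming the released checkpoint is hosted on the Hugging Face Hub (e.g., under an identifier such as StanfordAIMI/CheXagent-8b) and loads through transformers with trust_remote_code. The prompt text and image path are illustrative placeholders; see the official repository for the exact usage.

# Sketch: run CheXagent on a single chest X-ray.
# Assumptions: the checkpoint is on the Hugging Face Hub under an identifier
# such as "StanfordAIMI/CheXagent-8b" and ships a custom processor via
# trust_remote_code; the prompt below is illustrative only.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "StanfordAIMI/CheXagent-8b"  # assumed identifier
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to(device)

image = Image.open("chest_xray.png").convert("RGB")  # placeholder path
prompt = "Generate the findings section of the radiology report."

inputs = processor(images=[image], text=prompt, return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.tokenizer.decode(output_ids[0], skip_special_tokens=True))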

Evaluation: CheXbench

We introduce CheXbench - a novel benchmark designed to systematically evaluate FMs across 8 clinically-relevant CXR interpretation tasks. The table below shows CheXbench results for the image perception tasks, comparing CheXagent with general-domain and medical-domain FMs on several CXR datasets. For each task, we report accuracy.

CheXbench
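As a small illustration of the accuracy metric reported for the image perception tasks, the sketch below aggregates per-task accuracy from a list of (task, prediction, answer) records. The record format is hypothetical; CheXbench's actual evaluation harness may organize results differently.

# Sketch: compute per-task accuracy for multiple-choice-style evaluations.
# The record format (task, predicted option, correct option) is hypothetical.
from collections import defaultdict

results = [
    {"task": "view_classification", "prediction": "PA", "answer": "PA"},
    {"task": "view_classification", "prediction": "AP", "answer": "PA"},
    {"task": "disease_classification", "prediction": "yes", "answer": "yes"},
]

correct = defaultdict(int)
total = defaultdict(int)
for r in results:
    total[r["task"]] += 1
    correct[r["task"]] += int(r["prediction"] == r["answer"])

for task in total:
    print(f"{task}: accuracy = {correct[task] / total[task]:.3f}")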

We show results from automated evaluations using GPT-4 for findings generation. The figure below shows the GPT-4 evaluation results, which demonstrate that reports generated by CheXagent outperform those generated by medical-domain FMs on the findings generation task on MIMIC-CXR.

CheXbench
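For context, the sketch below shows one way an automated GPT-4 comparison of generated findings could be set up with the OpenAI API; the prompt wording and the A/B/Tie output format are illustrative assumptions, not the exact protocol used in the paper.

# Sketch: ask GPT-4 to compare two candidate findings sections against a
# reference report. The prompt and output format are illustrative only and
# do not reproduce the paper's exact evaluation protocol.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def compare_findings(reference: str, candidate_a: str, candidate_b: str) -> str:
    prompt = (
        "You are a radiologist. Given the reference findings below, decide "
        "which candidate findings section (A or B) is more accurate and "
        "complete. Answer with 'A', 'B', or 'Tie', followed by a brief reason.\n\n"
        f"Reference:\n{reference}\n\nCandidate A:\n{candidate_a}\n\n"
        f"Candidate B:\n{candidate_b}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

# Example usage with placeholder strings:
# print(compare_findings(reference_report, chexagent_findings, baseline_findings))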

We conduct a reader study in which five radiologists compare text generated by CheXagent against text written by a physician. The figure below shows the corresponding results.

CheXbench

Fairness Evaluation

Furthermore, in an effort to improve model transparency, we perform a fairness evaluation across factors of sex, race and age to highlight potential performance disparities. The figure below shows CheXagent's subgroup performance on cardiomegaly classification, investigating potential model biases. F1 scores vary across sex, racial groups, and age categories.

CheXbench
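To make the subgroup analysis concrete, the sketch below computes F1 scores for cardiomegaly classification stratified by sex, race, and age group using pandas and scikit-learn; the column names and toy values are hypothetical placeholders.

# Sketch: stratified F1 for cardiomegaly classification.
# Column names ("sex", "race", "age_group", "label", "prediction") and the
# toy rows are hypothetical placeholders for illustration.
import pandas as pd
from sklearn.metrics import f1_score

df = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "race": ["White", "Black", "Asian", "White"],
    "age_group": ["<65", "<65", ">=65", ">=65"],
    "label": [1, 0, 1, 1],
    "prediction": [1, 0, 0, 1],
})

for attribute in ["sex", "race", "age_group"]:
    for group, subset in df.groupby(attribute):
        score = f1_score(subset["label"], subset["prediction"], zero_division=0)
        print(f"{attribute}={group}: F1 = {score:.3f}")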

BibTeX

@article{stanford-aimi-chexagent-2024,
    title={CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation},
    author={Chen, Zhihong and Varma, Maya and Delbrouck, Jean-Benoit and Paschali, Magdalini and Blankemeier, Louis and Veen, Dave Van and Valanarasu, Jeya Maria Jose and Youssef, Alaa and Cohen, Joseph Paul and Reis, Eduardo Pontes and Tsai, Emily B. and Johnston, Andrew and Olsen, Cameron and Abraham, Tanishq Mathew and Gatidis, Sergios and Chaudhari, Akshay S and Langlotz, Curtis},
    journal={arXiv preprint arXiv:2401.12208},
    url={https://arxiv.org/abs/2401.12208},
    year={2024}
}