GREEN: Generative Radiology Report Evaluation and Error Notation

Stanford University,
May 2024


Abstract

Evaluating radiology reports is a challenging problem because factual correctness is critical in a medical setting. Existing automatic evaluation metrics either fail to consider factual correctness (e.g., BLEU and ROUGE) or are limited in their interpretability (e.g., F1CheXpert and F1RadGraph). In this paper, we introduce GREEN (Generative Radiology Report Evaluation and Error Notation), a radiology report generation metric that leverages the natural language understanding of language models to identify and explain clinically significant errors in candidate reports, both quantitatively and qualitatively. Compared to current metrics, GREEN offers: 1) a score aligned with expert preferences, 2) human-interpretable explanations of clinically significant errors, enabling feedback loops with end-users, and 3) a lightweight, open-source method that reaches the performance of commercial counterparts. We validate GREEN by comparing it to GPT-4 as well as to the error counts of 6 experts and the preferences of 2 experts. Our method demonstrates not only a higher correlation with expert error counts but also higher alignment with expert preferences than previous approaches.

Overview

Metric

We introduce a novel metric named GREEN (Generative Radiology Report Evaluation and Error Notation), designed to assess the quality of radiology reports produced by machine learning models. Leveraging the natural language understanding capabilities of large language models, GREEN identifies and explains clinically significant discrepancies between reference and generated reports. The metric produces

1) a detailed score ranging from 0 to 1 for quantitative analysis,
2) a comprehensive summary for qualitative analysis.

Such interpretable evaluation makes GREEN a tool for providing user feedback and enhancing the quality of automated radiology reporting.
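For intuition about how counted errors map to the 0-to-1 score, the sketch below shows one way such a score could be computed from an evaluation of a single reference/candidate pair. The aggregation formula and the names GreenEvaluation and green_score are illustrative assumptions, not the released implementation.

from dataclasses import dataclass

@dataclass
class GreenEvaluation:
    """Hypothetical container for one reference/candidate report pair."""
    matched_findings: int                 # findings present in both reports
    significant_errors: dict[str, int]    # clinically significant errors per category

def green_score(ev: GreenEvaluation) -> float:
    """Map error counts to a 0-1 score.

    Assumed form: matched findings divided by matched findings plus the total
    number of clinically significant errors, so a candidate with no significant
    errors scores 1.0 and heavily erroneous candidates approach 0.
    """
    significant = sum(ev.significant_errors.values())
    denominator = ev.matched_findings + significant
    return ev.matched_findings / denominator if denominator else 0.0

# Example: 4 findings matched, one false finding and one omission
example = GreenEvaluation(
    matched_findings=4,
    significant_errors={"false_finding": 1, "missed_finding": 1},
)
print(f"GREEN score: {green_score(example):.2f}")  # 0.67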


Development

To build our training dataset, we compiled 100,000 pairs of reference and candidate radiology reports drawn from six chest X-ray datasets: MIMIC-CXR, MIMIC-PRO, CandidPTX, PadChest, BIMCV-COVID-19, and OpenI. Candidate generation was guided by GPT-4 to highlight differences in predefined clinical categories. Pairing strategies ranged from random matching to semantic similarity and RadGraph permutations, yielding 174,329 unique reports and ensuring diversity. We further pre-trained models on medical text datasets including MIMIC-IV Radiology and Discharge Summaries, MIMIC-CXR reports, PubMed content, Wiki Medical Terms, and Medical Guidelines. We then trained a variety of open-source large language models on this training set to outperform previous approaches quantitatively and qualitatively.
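As one illustration of the semantic-similarity pairing strategy mentioned above, the sketch below matches each reference report to its closest candidate. TF-IDF cosine similarity and the function pair_by_similarity are stand-in assumptions; the actual pairing may rely on a different semantic representation.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pair_by_similarity(references: list[str], candidates: list[str]) -> list[tuple[str, str]]:
    """Pair each reference report with its most similar candidate report."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(references + candidates)
    ref_vecs, cand_vecs = matrix[: len(references)], matrix[len(references):]
    sims = cosine_similarity(ref_vecs, cand_vecs)   # shape (n_refs, n_cands)
    best = np.argmax(sims, axis=1)                  # most similar candidate per reference
    return [(references[i], candidates[j]) for i, j in enumerate(best)]

pairs = pair_by_similarity(
    ["No acute cardiopulmonary process.", "Right lower lobe opacity concerning for pneumonia."],
    ["Opacity in the right lower lobe, pneumonia suspected.", "Lungs are clear."],
)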


Validation

Our validation shows that the GREEN score closely approximates the error assessment of an average radiologist, differing by 1.54 significant errors and nearly matching GPT-4's performance. GREEN also approaches the average inter-expert difference, indicating high fidelity in error evaluation. Comparative analysis further revealed that both versions of GREEN exhibit stronger correlations with total radiologist error counts than conventional metrics.
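To make these quantities concrete, the helpers below compute the mean absolute difference between a metric's per-report error counts and the per-report expert average, as well as the mean pairwise inter-expert difference. The function names, array layout, and data are assumptions for illustration, not values from the study.

import numpy as np

def mean_abs_error_difference(metric_counts, expert_counts):
    """Mean absolute difference between a metric's per-report error counts
    and the per-report average over experts."""
    metric_counts = np.asarray(metric_counts, dtype=float)   # shape (n_reports,)
    expert_counts = np.asarray(expert_counts, dtype=float)   # shape (n_reports, n_experts)
    return float(np.abs(metric_counts - expert_counts.mean(axis=1)).mean())

def mean_inter_expert_difference(expert_counts):
    """Average pairwise absolute difference in error counts between experts."""
    expert_counts = np.asarray(expert_counts, dtype=float)
    n_experts = expert_counts.shape[1]
    pair_diffs = [np.abs(expert_counts[:, i] - expert_counts[:, j]).mean()
                  for i in range(n_experts) for j in range(i + 1, n_experts)]
    return float(np.mean(pair_diffs))

# Toy data: 3 reports scored by a metric and by 3 experts
metric = [2, 0, 4]
experts = [[1, 2, 3], [0, 0, 1], [3, 5, 4]]
print(mean_abs_error_difference(metric, experts))   # distance of the metric from the expert average
print(mean_inter_expert_difference(experts))        # distance of experts from each other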


Additionally, GREEN error counts achieved a correlation coefficient of 0.79 with expert error counts, outperforming other metrics, including GPT-4-based models. This robust correlation, along with high expert preference alignment, highlights GREEN's efficacy and potential as an interpretable tool for medical report evaluation.
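Below is a minimal sketch of this comparison, correlating per-report error counts from an automatic metric with expert counts. Which correlation statistic the 0.79 figure refers to is not restated here, so Pearson's r and Kendall's tau are both shown, with toy values rather than study data.

from scipy.stats import kendalltau, pearsonr

def correlation_with_experts(metric_error_counts, expert_error_counts):
    """Correlate per-report error counts from an automatic metric with
    per-report expert error counts (illustrative statistics only)."""
    r, _ = pearsonr(metric_error_counts, expert_error_counts)
    tau, _ = kendalltau(metric_error_counts, expert_error_counts)
    return r, tau

# Toy values, not data from the study
r, tau = correlation_with_experts([0, 1, 3, 2, 5], [0, 2, 3, 1, 4])
print(f"Pearson r = {r:.2f}, Kendall tau = {tau:.2f}")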


BibTeX

@article{ostmeier2024green,
  title={GREEN: Generative Radiology Report Evaluation and Error Notation},
  author={Ostmeier, Sophie and Xu, Justin and Chen, Zhihong and Varma, Maya and Blankemeier, Louis and Bluethgen, Christian and Michalson, Arne Edward and Moseley, Michael and Langlotz, Curtis and Chaudhari, Akshay S and others},
  journal={arXiv preprint arXiv:2405.03595},
  year={2024}
}