Automated radiology report generation from chest X-ray (CXR) images has the potential to improve clinical efficiency and reduce radiologists' workload. However, most datasets, including the publicly available MIMIC-CXR and CheXpert Plus, consist entirely of free-form reports, which are inherently variable and unstructured. This variability poses challenges for both generation and evaluation: existing models struggle to produce consistent, clinically meaningful reports, and standard evaluation metrics fail to capture the nuances of radiological interpretation. To address this, we introduce Structured Radiology Report Generation (SRRG), a new task that reformulates free-text radiology reports into a standardized format, ensuring clarity, consistency, and structured clinical reporting. We create a novel dataset by restructuring reports using large language models (LLMs) following strict structured reporting desiderata. Additionally, we introduce SRR-BERT, a fine-grained disease classification model trained on 55 labels, enabling more precise and clinically informed evaluation of structured reports. To assess report quality, we propose F1-SRR-BERT, a metric that leverages SRR-BERT’s hierarchical disease taxonomy to bridge the gap between free-text variability and structured clinical reporting. We validate our dataset through a reader study conducted by five board-certified radiologists and extensive benchmarking experiments.
Dataset:
Dataset | Split | Num. Examples |
---|---|---|
SRRG-Impression | Train | 405,972 |
SRRG-Impression | Validate | 1,505 |
SRRG-Impression | Test | 2,219 |
SRRG-Impression | Test Reviewed | 231 |
SRRG-Impression | Total | 409,927 |
SRRG-Findings | Train | 181,874 |
SRRG-Findings | Validate | 976 |
SRRG-Findings | Test | 1,459 |
SRRG-Findings | Test Reviewed | 233 |
SRRG-Findings | Total | 184,542 |
We release multiple models for SRRG-Impression and SRRG-Findings.
Example usage:

    import io
    import tempfile

    import requests
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # step 1: Setup constants
    model_name = "StanfordAIMI/CheXagent-2-3b-srrg-findings"
    dtype = torch.bfloat16
    device = "cuda"

    # step 2: Load Processor and Model
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", trust_remote_code=True)
    model = model.to(dtype)
    model.eval()

    # step 3: Download image from URL, save to a local file, and prepare path list
    url = "https://huggingface.co/IAMJB/interpret-cxr-impression-baseline/resolve/main/effusions-bibasal.jpg"
    resp = requests.get(url)
    resp.raise_for_status()

    # Use a NamedTemporaryFile so it lives on disk
    with tempfile.NamedTemporaryFile(delete=False, suffix=".jpg") as tmpfile:
        tmpfile.write(resp.content)
        local_path = tmpfile.name  # this is a real file path on disk

    paths = [local_path]
    prompt = "Structured Radiology Report Generation for Findings Section"

    # build the multimodal input
    query = tokenizer.from_list_format(
        [*([{"image": img} for img in paths]), {"text": prompt}]
    )

    # format as a chat conversation
    conv = [
        {"from": "system", "value": "You are a helpful assistant."},
        {"from": "human", "value": query},
    ]

    # tokenize and generate
    input_ids = tokenizer.apply_chat_template(
        conv, add_generation_prompt=True, return_tensors="pt"
    )
    output = model.generate(
        input_ids.to(device),
        do_sample=False,
        num_beams=1,
        temperature=1.0,
        top_p=1.0,
        use_cache=True,
        max_new_tokens=512,
    )[0]

    # decode the "findings" text
    response = tokenizer.decode(output[input_ids.size(1) : -1])
    print(response)
Response:

    Lungs and Airways:
    - No evidence of pneumothorax.

    Pleura:
    - Bilateral pleural effusions.

    Cardiovascular:
    - Cardiomegaly.

    Other:
    - Bibasilar opacities.
    - Mild pulmonary edema.
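Because the generated report follows a fixed "section header, then bullet findings" layout, it can be turned into a dictionary for downstream checks. The helper below is a minimal sketch, not part of the released code: `parse_structured_report` and the `"Unlabeled"` fallback key are hypothetical names, and it assumes the decoded text places each header (ending in `:`) and each bullet (starting with `- `) on its own line.

```python
from typing import Dict, List, Optional


def parse_structured_report(text: str) -> Dict[str, List[str]]:
    """Group '- ' bullet lines under the most recent 'Header:' line."""
    sections: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank separator lines
        if line.startswith("- "):
            if current is None:
                # Bullet seen before any header: file it under a
                # hypothetical catch-all key instead of dropping it.
                current = "Unlabeled"
                sections.setdefault(current, [])
            sections[current].append(line[2:].strip())
        elif line.endswith(":"):
            current = line[:-1].strip()
            sections.setdefault(current, [])
        # Any other line shape is ignored in this sketch.
    return sections
```

Applied to the response shown above, this would yield keys such as `"Pleura"` and `"Other"`, each mapped to its list of finding sentences.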
Model | # Classes (cf. ontology) | Weighted Scores (P / R / F1 / Support) |
---|---|---|
🤗 StanfordAIMI/SRR-BERT-Leaves | 54 | 0.91 / 0.92 / 0.91 / 178,303 |
🤗 StanfordAIMI/SRR-BERT-Upper | 24 | 0.92 / 0.92 / 0.92 / 169,849 |
🤗 StanfordAIMI/SRR-BERT-Leaves-with-Statuses | 162 | 0.89 / 0.88 / 0.88 / 178,346 |
🤗 StanfordAIMI/SRR-BERT-Upper-with-Statuses | 72 | 0.89 / 0.88 / 0.88 / 168,454 |
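Dataset | Split | Num. Examples |
---|---|---|
🤗 StanfordAIMI/StructUtterances | Train | 1,203,332 |
🤗 StanfordAIMI/StructUtterances | Validate | 150,417 |
🤗 StanfordAIMI/StructUtterances | Test | 150,417 |
🤗 StanfordAIMI/StructUtterances | Test Reviewed | 1,609 |
🤗 StanfordAIMI/StructUtterances | Total | 1,506,158 |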
The evaluation script of F1-SRR-BERT is available here.
Example usage:

    import json

    import requests
    import torch
    from datasets import load_dataset
    from transformers import BertTokenizer, BertForSequenceClassification

    # Configuration
    MODEL_PATH = "StanfordAIMI/SRRG-BERT-Upper-with-Statuses"
    MAPPING_URL = "https://raw.githubusercontent.com/jbdel/StructEval/refs/heads/main/structeval/upper_with_statuses_mapping.json"
    MAX_LENGTH = 128
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Fetch mapping from GitHub
    resp = requests.get(MAPPING_URL)
    resp.raise_for_status()
    label_map = resp.json()
    idx2label = {v: k for k, v in label_map.items()}

    # Load tokenizer & model
    tokenizer = BertTokenizer.from_pretrained("microsoft/BiomedVLP-CXR-BERT-general")
    model = BertForSequenceClassification.from_pretrained(MODEL_PATH, num_labels=len(label_map))
    model.to(DEVICE).eval()

    # Grab one test sentence
    dataset = load_dataset("StanfordAIMI/StructUtterances", split="test_reviewed")
    sentence = dataset[35]["utterance"]

    # Tokenize and infer
    inputs = tokenizer(
        sentence,
        padding="max_length",
        truncation=True,
        max_length=MAX_LENGTH,
        return_tensors="pt"
    ).to(DEVICE)

    with torch.no_grad():
        logits = model(**inputs).logits
        preds = (torch.sigmoid(logits)[0].cpu().numpy() > 0.5).astype(int)

    pred_labels = [idx2label[i] for i, flag in enumerate(preds) if flag]

    print(f"Sentence: {sentence}")
    print("Predicted labels:", pred_labels)
Response:

    Sentence: Patchy consolidation in the left retrocardiac area, suggestive of atelectasis or early airspace disease.
    Predicted labels: ['Consolidation (Uncertain)', 'Air space opacity (Uncertain)']
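The "with-Statuses" models return each label with its status in trailing parentheses, as in `Consolidation (Uncertain)`. When the disease name and the status are needed separately, a small helper can split them. This is a sketch under the assumption that the status is always the final parenthesised group; `split_status` is a hypothetical name and not part of the released code.

```python
import re
from typing import Optional, Tuple

# Label text, optional whitespace, then one trailing "(status)" group.
_STATUS_RE = re.compile(r"^(?P<label>.+?)\s*\((?P<status>[^()]+)\)$")


def split_status(label: str) -> Tuple[str, Optional[str]]:
    """Return (disease_label, status); status is None if no suffix is found."""
    cleaned = label.strip()
    match = _STATUS_RE.match(cleaned)
    if match is None:
        # Labels from the models without statuses pass through unchanged.
        return cleaned, None
    return match.group("label"), match.group("status")
```

For the prediction above, `split_status("Consolidation (Uncertain)")` gives `("Consolidation", "Uncertain")`, while a label without a suffix is returned with `None` as its status.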
Difference between models:
Model | Predicted Labels |
---|---|
🤗 StanfordAIMI/SRR-BERT-Leaves | Atelectasis; Air space opacity–multifocal |
🤗 StanfordAIMI/SRR-BERT-Upper | Consolidation; Air space opacity |
🤗 StanfordAIMI/SRR-BERT-Leaves-with-Statuses | Atelectasis (Uncertain); Air space opacity–multifocal (Uncertain) |
🤗 StanfordAIMI/SRR-BERT-Upper-with-Statuses | Consolidation (Uncertain); Air space opacity (Uncertain) |
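Once the reference report and the generated report have each been mapped to a set of labels by one of these classifiers, the two sets can be compared with precision, recall, and F1. The snippet below is only an illustrative sketch of that idea. It is not the released F1-SRR-BERT evaluation script; the function name `micro_prf` and the choice of micro-averaging over label sets are assumptions.

```python
from typing import List, Set, Tuple


def micro_prf(refs: List[Set[str]], hyps: List[Set[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over paired label sets."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    # Count matches and mismatches across all report pairs.
    tp = sum(len(r & h) for r, h in zip(refs, hyps))  # labels in both
    fp = sum(len(h - r) for r, h in zip(refs, hyps))  # predicted only
    fn = sum(len(r - h) for r, h in zip(refs, hyps))  # reference only
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, if the reference carries both labels predicted by the Upper-with-Statuses model above and the generated report carries only one of them, precision is 1.0 and recall is 0.5. The weighted scores reported for the classifiers themselves use a different averaging scheme.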
    @inproceedings{delbrouck-etal-2025-automated,
        title = "Automated Structured Radiology Report Generation",
        author = "Delbrouck, Jean-Benoit and
          Xu, Justin and
          Moll, Johannes and
          Thomas, Alois and
          Chen, Zhihong and
          Ostmeier, Sophie and
          Azhar, Asfandyar and
          Li, Kelvin Zhenghao and
          Johnston, Andrew and
          Bluethgen, Christian and
          Reis, Eduardo and
          Muneer, Mohamed and
          Varma, Maya and
          Langlotz, Curtis",
        booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
        year = "2025",
        publisher = "Association for Computational Linguistics",
    }
1. No Finding
2. Lung Finding
2.1. Lung Opacity
2.1.1. Air space opacity
2.1.1.1. Diffuse air space opacity
2.1.1.1.1. Edema
2.1.1.2. Focal air space opacity
2.1.1.2.1. Consolidation
2.1.1.2.1.1. Pneumonia
2.1.1.2.1.2. Atelectasis
2.1.1.2.1.3. Aspiration
2.1.1.2.2. Segmental collapse
2.1.1.2.2.1. Lung collapse
2.1.1.2.3. Perihilar airspace opacity
2.1.1.3. Air space opacity–multifocal
2.1.2. Masslike opacity
2.1.2.1. Solitary masslike opacity
2.1.2.1.1. Mass/Solitary lung mass
2.1.2.1.2. Nodule/Solitary lung nodule
2.1.2.1.3. Cavitating mass with content
2.1.2.2. Multiple masslike opacities
2.1.2.2.1. Cavitating masses
2.2. Emphysema
2.3. Fibrosis
2.4. Pulmonary congestion
2.5. Hilar lymphadenopathy
2.6. Bronchiectasis
3. Pleural Finding
3.1. Pneumothorax
3.1.1. Simple pneumothorax
3.1.2. Loculated pneumothorax
3.1.3. Tension pneumothorax
3.2. Pleural Thickening
3.2.1. Pleural Effusion
3.2.1.1. Simple pleural effusion
3.2.1.2. Loculated pleural effusion
3.2.2. Pleural scarring
3.3. Hydropneumothorax
3.4. Pleural Other
4. Widened Cardiac Silhouette
4.1. Cardiomegaly
4.2. Pericardial effusion
5. Mediastinal Finding
5.1. Mediastinal Mass
5.1.1. Inferior mediastinal mass
5.1.2. Superior mediastinal mass
5.2. Vascular Finding
5.2.1. Widened aortic contour
5.2.1.1. Tortuous Aorta
5.2.2. Calcification of the Aorta
5.2.3. Enlarged pulmonary artery
5.3. Hernia
5.4. Pneumomediastinum
5.5. Tracheal deviation
6. Musculoskeletal Finding
6.1. Fracture
6.1.1. Acute humerus fracture
6.1.2. Acute rib fracture
6.1.3. Acute clavicle fracture
6.1.4. Acute scapula fracture
6.1.5. Compression fracture
6.2. Shoulder dislocation
6.3. Chest wall finding
6.3.1. Subcutaneous Emphysema
7. Support Devices
7.1. Suboptimal central line
7.2. Suboptimal endotracheal tube
7.3. Suboptimal nasogastric tube
7.4. Suboptimal pulmonary arterial catheter
7.5. Pleural tube
7.6. PICC line
7.7. Port catheter
7.8. Pacemaker
7.9. Implantable defibrillator
7.10. LVAD
7.11. Intraaortic balloon pump
8. Upper Abdominal Finding
8.1. Subdiaphragmatic gas
8.1.1. Pneumoperitoneum
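
The dotted numbering of the ontology encodes the hierarchy directly: the parent of an entry is obtained by dropping the last component of its index. The sketch below uses this to recover the chain of ancestors for a label. `parse_ontology` and `ancestors` are hypothetical helper names, and the code assumes each line has the form `<dotted index>. <label>` as listed above.

```python
from typing import Dict, List


def parse_ontology(lines: List[str]) -> Dict[str, str]:
    """Map each label to its dotted index, e.g. 'Atelectasis' -> '2.1.1.2.1.2'."""
    index: Dict[str, str] = {}
    for raw in lines:
        number, _, name = raw.strip().partition(" ")
        if name:
            index[name.strip()] = number.rstrip(".")
    return index


def ancestors(label: str, index: Dict[str, str]) -> List[str]:
    """Return the ancestors of a label, from the top-level category down to its parent."""
    by_number = {num: name for name, num in index.items()}
    parts = index[label].split(".")
    # Each proper prefix of the dotted index names one ancestor.
    return [by_number[".".join(parts[:i])] for i in range(1, len(parts))]


excerpt = [
    "2. Lung Finding",
    "2.1. Lung Opacity",
    "2.1.1. Air space opacity",
    "2.1.1.2. Focal air space opacity",
    "2.1.1.2.1. Consolidation",
    "2.1.1.2.1.2. Atelectasis",
]
print(ancestors("Atelectasis", parse_ontology(excerpt)))
```

In the model comparison table, the labels from the Upper model (Consolidation, Air space opacity) coincide with the direct parents of the labels from the Leaves model (Atelectasis, Air space opacity–multifocal). How the released models define the "upper" level in general is not spelled out here, so treat the parent lookup as an aid for reading the ontology rather than a reimplementation of that mapping.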