Abstract

Automated radiology report generation from chest X-ray (CXR) images has the potential to improve clinical efficiency and reduce radiologists' workload. However, most datasets, including the publicly available MIMIC-CXR and CheXpert Plus, consist entirely of free-form reports, which are inherently variable and unstructured. This variability poses challenges for both generation and evaluation: existing models struggle to produce consistent, clinically meaningful reports, and standard evaluation metrics fail to capture the nuances of radiological interpretation. To address this, we introduce Structured Radiology Report Generation (SRRG), a new task that reformulates free-text radiology reports into a standardized format, ensuring clarity, consistency, and structured clinical reporting. We create a novel dataset by restructuring reports using large language models (LLMs) following strict structured reporting desiderata. Additionally, we introduce SRR-BERT, a fine-grained disease classification model trained on 55 labels, enabling more precise and clinically informed evaluation of structured reports. To assess report quality, we propose F1-SRR-BERT, a metric that leverages SRR-BERT’s hierarchical disease taxonomy to bridge the gap between free-text variability and structured clinical reporting. We validate our dataset through a reader study conducted by five board-certified radiologists and extensive benchmarking experiments.

Overview

Report Generation

Dataset:

Dataset           Split           Num. Examples
SRRG-Impression   Train           405,972
                  Validate        1,505
                  Test            2,219
                  Test Reviewed   231
                  Total           409,927
SRRG-Findings     Train           181,874
                  Validate        976
                  Test            1,459
                  Test Reviewed   233
                  Total           184,542

We release multiple models for SRRG-Impression and SRRG-Findings.

Model Family    Dataset Variant   Model
CheXagent       SRRG-Impression   🤗 StanfordAIMI/CheXagent-2-3b-srrg-impression
                SRRG-Findings     🤗 StanfordAIMI/CheXagent-2-3b-srrg-findings
CheXpert-Plus   SRRG-Impression   🤗 StanfordAIMI/chexpert-plus-srrg_impression
                SRRG-Findings     🤗 StanfordAIMI/chexpert-plus-srrg_findings
MAIRA-2         SRRG-Impression   🤗 StanfordAIMI/maira2-srrg-impression
                SRRG-Findings     🤗 StanfordAIMI/maira2-srrg-findings

Example usage:

import tempfile

import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# step 1: Setup constants
model_name = "StanfordAIMI/CheXagent-2-3b-srrg-findings"
dtype = torch.bfloat16
device = "cuda"

# step 2: Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", trust_remote_code=True)
model = model.to(dtype)
model.eval()

# step 3: Download image from URL, save to a local file, and prepare path list
url = "https://huggingface.co/IAMJB/interpret-cxr-impression-baseline/resolve/main/effusions-bibasal.jpg"
resp = requests.get(url)
resp.raise_for_status()

# Use a NamedTemporaryFile so it lives on disk
with tempfile.NamedTemporaryFile(delete=False, suffix=".jpg") as tmpfile:
    tmpfile.write(resp.content)
    local_path = tmpfile.name  # this is a real file path on disk

paths = [local_path]

prompt = "Structured Radiology Report Generation for Findings Section"
# build the multimodal input
query = tokenizer.from_list_format(
    [*([{"image": img} for img in paths]), {"text": prompt}]
)

# format as a chat conversation
conv = [
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "human", "value": query},
]

# tokenize and generate
input_ids = tokenizer.apply_chat_template(
    conv, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(
    input_ids.to(device),
    do_sample=False,
    num_beams=1,
    temperature=1.0,
    top_p=1.0,
    use_cache=True,
    max_new_tokens=512,
)[0]

# decode only the newly generated findings text (prompt tokens and the final token are sliced off)
response = tokenizer.decode(output[input_ids.size(1) : -1])
print(response)

Response:

Lungs and Airways:
- No evidence of pneumothorax.

Pleura:
- Bilateral pleural effusions.

Cardiovascular:
- Cardiomegaly.

Other:
- Bibasilar opacities.
- Mild pulmonary edema.
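
The response above follows a simple layout: a section header ending in a colon, followed by one bullet per finding. For downstream use it can help to turn that text into a dictionary. The helper below is a minimal sketch under that layout assumption; `parse_structured_report` is a hypothetical name and is not part of the released models or tooling.

```python
def parse_structured_report(text):
    """Group '- finding' bullets under the most recent 'Section:' header.

    Assumes the layout shown in the example response. Lines that are neither
    a header nor a bullet are skipped.
    """
    sections = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("- "):
            # A bullet before any header goes into a placeholder section.
            if current is None:
                current = "Unsectioned"
                sections.setdefault(current, [])
            sections[current].append(line[2:].strip())
        elif line.endswith(":"):
            current = line[:-1].strip()
            sections.setdefault(current, [])
    return sections


example = """Lungs and Airways:
- No evidence of pneumothorax.

Pleura:
- Bilateral pleural effusions.

Other:
- Bibasilar opacities.
- Mild pulmonary edema.
"""

report = parse_structured_report(example)
for section, findings in report.items():
    print(section, "->", findings)
```

Stripping each line before matching makes the parser tolerant of the stray indentation that can appear in decoded model output.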
  

F1-SRR-BERT

Model                                            # Classes (cf. ontology)   Weighted Scores (P / R / F1 / Support)
🤗 StanfordAIMI/SRR-BERT-Leaves                  54                         0.91 / 0.92 / 0.91 / 178,303
🤗 StanfordAIMI/SRR-BERT-Upper                   24                         0.92 / 0.92 / 0.92 / 169,849
🤗 StanfordAIMI/SRR-BERT-Leaves-with-Statuses    162                        0.89 / 0.88 / 0.88 / 178,346
🤗 StanfordAIMI/SRR-BERT-Upper-with-Statuses     72                         0.89 / 0.88 / 0.88 / 168,454

Dataset                              Split           Num. Examples
🤗 StanfordAIMI/StructUtterances     Train           1,203,332
                                     Validate        150,417
                                     Test            150,417
                                     Test Reviewed   1,609
                                     Total           1,506,158

The evaluation script of F1-SRR-BERT is available here.

Example usage:

import requests
import torch
from datasets import load_dataset
from transformers import BertTokenizer, BertForSequenceClassification

# Configuration
MODEL_PATH = "StanfordAIMI/SRR-BERT-Upper-with-Statuses"
MAPPING_URL = "https://raw.githubusercontent.com/jbdel/StructEval/refs/heads/main/structeval/upper_with_statuses_mapping.json"
MAX_LENGTH = 128
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Fetch mapping from GitHub
resp = requests.get(MAPPING_URL)
resp.raise_for_status()
label_map = resp.json()
idx2label = {v: k for k, v in label_map.items()}

# Load tokenizer & model
tokenizer = BertTokenizer.from_pretrained("microsoft/BiomedVLP-CXR-BERT-general")
model = BertForSequenceClassification.from_pretrained(MODEL_PATH, num_labels=len(label_map))
model.to(DEVICE).eval()

# Grab one test sentence
dataset = load_dataset("StanfordAIMI/StructUtterances", split="test_reviewed")
sentence = dataset[35]["utterance"]

# Tokenize and infer
inputs = tokenizer(
    sentence,
    padding="max_length",
    truncation=True,
    max_length=MAX_LENGTH,
    return_tensors="pt"
).to(DEVICE)

with torch.no_grad():
    logits = model(**inputs).logits
    preds = (torch.sigmoid(logits)[0].cpu().numpy() > 0.5).astype(int)

pred_labels = [idx2label[i] for i, flag in enumerate(preds) if flag]

print(f"Sentence: {sentence}")
print("Predicted labels:", pred_labels)

Response:

Sentence: Patchy consolidation in the left retrocardiac area, suggestive of atelectasis or early airspace disease.
Predicted labels: ['Consolidation (Uncertain)', 'Air space opacity (Uncertain)']
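
The with-Statuses models return each label as a finding name followed by a status in parentheses, as in 'Consolidation (Uncertain)'. To compare predictions across model variants it can be convenient to separate the two parts. The following is a minimal sketch that assumes only this "name (status)" string format; `split_label` is a hypothetical helper, not part of the released code.

```python
import re

# Matches "<finding> (<status>)" where the status is the final parenthesized group.
_STATUS_RE = re.compile(r"^(?P<finding>.*?)\s*\((?P<status>[^()]*)\)$")


def split_label(label):
    """Return (finding, status); status is None when no trailing '(...)' is present."""
    label = label.strip()
    match = _STATUS_RE.match(label)
    if match:
        return match.group("finding"), match.group("status")
    return label, None


pred_labels = ["Consolidation (Uncertain)", "Air space opacity (Uncertain)"]
pairs = [split_label(label) for label in pred_labels]
print(pairs)
```

Labels from the models without statuses pass through unchanged with a status of None, so the same helper can be applied to all four variants.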

Difference between models:

Model                                            Predicted Labels
🤗 StanfordAIMI/SRR-BERT-Leaves                  Atelectasis; Air space opacity–multifocal
🤗 StanfordAIMI/SRR-BERT-Upper                   Consolidation; Air space opacity
🤗 StanfordAIMI/SRR-BERT-Leaves-with-Statuses    Atelectasis (Uncertain); Air space opacity–multifocal (Uncertain)
🤗 StanfordAIMI/SRR-BERT-Upper-with-Statuses     Consolidation (Uncertain); Air space opacity (Uncertain)
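
Once SRR-BERT labels are available for both a generated report and its reference, a label-level F1 can be computed between the two. The sketch below is an illustration only and is not the released F1-SRR-BERT evaluation script: it assumes each report has already been reduced to a set of predicted labels, and it uses a support-weighted average of per-label F1 scores, which is one common choice. The function name `weighted_f1` and the example label sets are hypothetical.

```python
from collections import Counter


def weighted_f1(references, hypotheses):
    """Support-weighted F1 over labels.

    references, hypotheses: equal-length lists of label sets, one set per report.
    Labels that never occur in the references have zero support and therefore
    zero weight in the average.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for ref, hyp in zip(references, hypotheses):
        for label in hyp & ref:
            tp[label] += 1
        for label in hyp - ref:
            fp[label] += 1
        for label in ref - hyp:
            fn[label] += 1

    total_support = 0
    weighted_sum = 0.0
    for label in set(tp) | set(fn):
        support = tp[label] + fn[label]
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        weighted_sum += f1 * support
        total_support += support
    return weighted_sum / total_support if total_support else 0.0


refs = [{"Consolidation", "Air space opacity"}, {"Cardiomegaly"}]
hyps = [{"Consolidation"}, {"Cardiomegaly", "Pleural Effusion"}]
score = weighted_f1(refs, hyps)
print(f"weighted F1: {score:.3f}")
```

In this toy example two of the three reference labels are recovered exactly and one is missed, so the score is 2/3.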

BibTeX

@inproceedings{delbrouck-etal-2025-automated,
    title     = "Automated Structured Radiology Report Generation",
    author    = "Delbrouck, Jean-Benoit and
                 Xu, Justin and
                 Moll, Johannes and
                 Thomas, Alois and
                 Chen, Zhihong and
                 Ostmeier, Sophie and
                 Azhar, Asfandyar and
                 Li, Kelvin Zhenghao and
                 Johnston, Andrew and
                 Bluethgen, Christian and
                 Reis, Eduardo and
                 Muneer, Mohamed and
                 Varma, Maya and
                 Langlotz, Curtis",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year      = "2025",
    publisher = "Association for Computational Linguistics",
}

Ontology:

1. No Finding
2. Lung Finding
   2.1. Lung Opacity
       2.1.1. Air space opacity
           2.1.1.1. Diffuse air space opacity
               2.1.1.1.1. Edema
           2.1.1.2. Focal air space opacity
               2.1.1.2.1. Consolidation
                   2.1.1.2.1.1. Pneumonia
                   2.1.1.2.1.2. Atelectasis
                   2.1.1.2.1.3. Aspiration
               2.1.1.2.2. Segmental collapse
                   2.1.1.2.2.1. Lung collapse
               2.1.1.2.3. Perihilar airspace opacity
           2.1.1.3. Air space opacity–multifocal
       2.1.2. Masslike opacity
           2.1.2.1. Solitary masslike opacity
               2.1.2.1.1. Mass/Solitary lung mass
               2.1.2.1.2. Nodule/Solitary lung nodule
               2.1.2.1.3. Cavitating mass with content
           2.1.2.2. Multiple masslike opacities
               2.1.2.2.1. Cavitating masses
   2.2. Emphysema
   2.3. Fibrosis
   2.4. Pulmonary congestion
   2.5. Hilar lymphadenopathy
   2.6. Bronchiectasis
3. Pleural Finding
   3.1. Pneumothorax
       3.1.1. Simple pneumothorax
       3.1.2. Loculated pneumothorax
       3.1.3. Tension pneumothorax
   3.2. Pleural Thickening
       3.2.1. Pleural Effusion
           3.2.1.1. Simple pleural effusion
           3.2.1.2. Loculated pleural effusion
       3.2.2. Pleural scarring
   3.3. Hydropneumothorax
   3.4. Pleural Other
4. Widened Cardiac Silhouette
   4.1. Cardiomegaly
   4.2. Pericardial effusion
5. Mediastinal Finding
   5.1. Mediastinal Mass
       5.1.1. Inferior mediastinal mass
       5.1.2. Superior mediastinal mass
   5.2. Vascular Finding
       5.2.1. Widened aortic contour
           5.2.1.1. Tortuous Aorta
       5.2.2. Calcification of the Aorta
       5.2.3. Enlarged pulmonary artery
   5.3. Hernia
   5.4. Pneumomediastinum
   5.5. Tracheal deviation
6. Musculoskeletal Finding
   6.1. Fracture
       6.1.1. Acute humerus fracture
       6.1.2. Acute rib fracture
       6.1.3. Acute clavicle fracture
       6.1.4. Acute scapula fracture
       6.1.5. Compression fracture
   6.2. Shoulder dislocation
   6.3. Chest wall finding
       6.3.1. Subcutaneous Emphysema
7. Support Devices
   7.1. Suboptimal central line
   7.2. Suboptimal endotracheal tube
   7.3. Suboptimal nasogastric tube
   7.4. Suboptimal pulmonary arterial catheter
   7.5. Pleural tube
   7.6. PICC line
   7.7. Port catheter
   7.8. Pacemaker
   7.9. Implantable defibrillator
   7.10. LVAD
   7.11. Intraaortic balloon pump
8. Upper Abdominal Finding
   8.1. Subdiaphragmatic gas
       8.1.1. Pneumoperitoneum
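
The numbered outline encodes parent-child relations directly in the item numbers: 2.1.1.2.1.2 (Atelectasis) sits under 2.1.1.2.1 (Consolidation). This is consistent with the model comparison, where the Leaves model predicts Atelectasis and the Upper model predicts Consolidation for the same sentence. The snippet below is a rough sketch of how the outline text could be parsed to look up the ancestors of any finding. `parse_ontology` is a hypothetical helper, and the sketch assumes each parent line appears before its children, as it does in the listing above.

```python
import re

# Matches lines such as "   2.1.1. Air space opacity".
_ITEM_RE = re.compile(r"\s*((?:\d+\.)+)\s+(.*\S)")


def parse_ontology(outline):
    """Map each finding name to the list of its ancestors, root first."""
    name_by_number = {}
    ancestors = {}
    for line in outline.splitlines():
        match = _ITEM_RE.match(line)
        if not match:
            continue
        number = match.group(1).rstrip(".")
        name = match.group(2)
        name_by_number[number] = name
        parts = number.split(".")
        # Every proper prefix of the item number identifies one ancestor.
        ancestors[name] = [
            name_by_number[".".join(parts[:depth])] for depth in range(1, len(parts))
        ]
    return ancestors


excerpt = """
2. Lung Finding
   2.1. Lung Opacity
       2.1.1. Air space opacity
           2.1.1.2. Focal air space opacity
               2.1.1.2.1. Consolidation
                   2.1.1.2.1.2. Atelectasis
"""

tree = parse_ontology(excerpt)
print(tree["Atelectasis"])
```

Walking up the ancestor list is one way to relate a leaf-level prediction to a coarser label, although the exact set of labels used by the Upper models is defined by the released mappings, not by this sketch.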