Automated Structured Radiology Report Generation

Abstract

Automated radiology report generation from chest X-ray (CXR) images has the potential to improve clinical efficiency and reduce radiologists' workload. However, most datasets, including the publicly available MIMIC-CXR and CheXpert Plus, consist entirely of free-form reports, which are inherently variable and unstructured. This variability poses challenges for both generation and evaluation: existing models struggle to produce consistent, clinically meaningful reports, and standard evaluation metrics fail to capture the nuances of radiological interpretation. To address this, we introduce Structured Radiology Report Generation (SRRG), a new task that reformulates free-text radiology reports into a standardized format, ensuring clarity, consistency, and structured clinical reporting. We create a novel dataset by restructuring reports using large language models (LLMs) following strict structured reporting desiderata. Additionally, we introduce SRR-BERT, a fine-grained disease classification model trained on 55 labels, enabling more precise and clinically informed evaluation of structured reports. To assess report quality, we propose F1-SRR-BERT, a metric that leverages SRR-BERT’s hierarchical disease taxonomy to bridge the gap between free-text variability and structured clinical reporting. We validate our dataset through a reader study conducted by five board-certified radiologists and extensive benchmarking experiments.

Report Generation

Dataset:

Dataset	Split	Num. Examples
SRRG-Impression	Train	405,972
	Validate	1,505
	Test	2,219
	Test Reviewed	231
	Total	409,927
SRRG-Findings	Train	181,874
	Validate	976
	Test	1,459
	Test Reviewed	233
	Total	184,542

We release multiple models for SRRG-Impression and SRRG-Findings.

Model Family	Dataset Variant	Model
CheXagent	SRRG-Impression	🤗 StanfordAIMI/CheXagent-2-3b-srrg-impression
CheXagent	SRRG-Findings	🤗 StanfordAIMI/CheXagent-2-3b-srrg-findings
CheXpert-Plus	SRRG-Impression	🤗 StanfordAIMI/chexpert-plus-srrg_impression
CheXpert-Plus	SRRG-Findings	🤗 StanfordAIMI/chexpert-plus-srrg_findings
MAIRA-2	SRRG-Impression	🤗 StanfordAIMI/maira2-srrg-impression
MAIRA-2	SRRG-Findings	🤗 StanfordAIMI/maira2-srrg-findings

Example usage:

import tempfile

import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# step 1: Setup constants
model_name = "StanfordAIMI/CheXagent-2-3b-srrg-findings"
dtype = torch.bfloat16
device = "cuda"

# step 2: Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", trust_remote_code=True)
model = model.to(dtype)
model.eval()

# step 3: Download image from URL, save to a local file, and prepare path list
url = "https://huggingface.co/IAMJB/interpret-cxr-impression-baseline/resolve/main/effusions-bibasal.jpg"
resp = requests.get(url)
resp.raise_for_status()

# Use a NamedTemporaryFile so it lives on disk
with tempfile.NamedTemporaryFile(delete=False, suffix=".jpg") as tmpfile:
    tmpfile.write(resp.content)
    local_path = tmpfile.name  # this is a real file path on disk

paths = [local_path]

prompt = "Structured Radiology Report Generation for Findings Section"
# build the multimodal input
query = tokenizer.from_list_format(
    [*([{"image": img} for img in paths]), {"text": prompt}]
)

# format as a chat conversation
conv = [
    {"from": "system", "value": "You are a helpful assistant."},
    {"from": "human", "value": query},
]

# tokenize and generate
input_ids = tokenizer.apply_chat_template(
    conv, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(
    input_ids.to(device),
    do_sample=False,
    num_beams=1,
    temperature=1.0,
    top_p=1.0,
    use_cache=True,
    max_new_tokens=512,
)[0]

# decode only the newly generated tokens: skip the prompt and drop the final token (typically end-of-sequence)
response = tokenizer.decode(output[input_ids.size(1) : -1])
print(response)

Response:

Lungs and Airways:
- No evidence of pneumothorax.

Pleura:
- Bilateral pleural effusions.

Cardiovascular:
- Cardiomegaly.

Other:
- Bibasilar opacities.
- Mild pulmonary edema.
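
The structured output above follows a simple "Section:" header plus "- item" bullet layout, so it can be turned into a dictionary for downstream checks. The following is a minimal sketch, not part of the released code: the function name `parse_structured_report` and the "Unsectioned" fallback bucket are hypothetical, and it assumes every header ends with a colon and every finding starts with "- ".

```python
from typing import Dict, List


def parse_structured_report(text: str) -> Dict[str, List[str]]:
    """Group '- item' bullets under the most recent 'Section:' header."""
    sections: Dict[str, List[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()  # tolerate inconsistent indentation in model output
        if not line:
            continue
        if line.startswith("- "):
            if current is None:
                # Hypothetical fallback for bullets that precede any header.
                current = "Unsectioned"
                sections.setdefault(current, [])
            sections[current].append(line[2:].strip())
        elif line.endswith(":"):
            current = line[:-1].strip()
            sections.setdefault(current, [])
    return sections


report = """Lungs and Airways:
  - No evidence of pneumothorax.

  Pleura:
  - Bilateral pleural effusions.

  Cardiovascular:
  - Cardiomegaly.

  Other:
  - Bibasilar opacities.
  - Mild pulmonary edema.
"""

sections = parse_structured_report(report)
print(sections["Pleura"])  # ['Bilateral pleural effusions.']
```

Lines that are neither headers nor bullets are ignored in this sketch; a stricter parser might raise on them instead.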

F1-SRR-BERT

Model	# Classes (cf. Ontology)	Weighted Scores (P / R / F1 / Support)
🤗 StanfordAIMI/SRR-BERT-Leaves	54	0.91 / 0.92 / 0.91 / 178,303
🤗 StanfordAIMI/SRR-BERT-Upper	24	0.92 / 0.92 / 0.92 / 169,849
🤗 StanfordAIMI/SRR-BERT-Leaves-with-Statuses	162	0.89 / 0.88 / 0.88 / 178,346
🤗 StanfordAIMI/SRR-BERT-Upper-with-Statuses	72	0.89 / 0.88 / 0.88 / 168,454

Dataset	Split	Num. Examples
🤗 StanfordAIMI/StructUtterances	Train	1,203,332
	Validate	150,417
	Test	150,417
	Test Reviewed	1,609
	Total	1,506,158

The evaluation script for F1-SRR-BERT is available in the StructEval repository: https://github.com/jbdel/StructEval

Example usage:

import requests
import torch
from datasets import load_dataset
from transformers import BertForSequenceClassification, BertTokenizer

# Configuration
MODEL_PATH = "StanfordAIMI/SRR-BERT-Upper-with-Statuses"
MAPPING_URL = "https://raw.githubusercontent.com/jbdel/StructEval/refs/heads/main/structeval/upper_with_statuses_mapping.json"
MAX_LENGTH = 128
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Fetch mapping from GitHub
resp = requests.get(MAPPING_URL)
resp.raise_for_status()
label_map = resp.json()
idx2label = {v: k for k, v in label_map.items()}

# Load tokenizer & model
tokenizer = BertTokenizer.from_pretrained("microsoft/BiomedVLP-CXR-BERT-general")
model = BertForSequenceClassification.from_pretrained(MODEL_PATH, num_labels=len(label_map))
model.to(DEVICE).eval()

# Grab one test sentence
dataset = load_dataset("StanfordAIMI/StructUtterances", split="test_reviewed")
sentence = dataset[35]["utterance"]

# Tokenize and infer
inputs = tokenizer(
    sentence,
    padding="max_length",
    truncation=True,
    max_length=MAX_LENGTH,
    return_tensors="pt"
).to(DEVICE)

with torch.no_grad():
    logits = model(**inputs).logits
    preds = (torch.sigmoid(logits)[0].cpu().numpy() > 0.5).astype(int)

pred_labels = [idx2label[i] for i, flag in enumerate(preds) if flag]

print(f"Sentence: {sentence}")
print("Predicted labels:", pred_labels)

Response:

Sentence: Patchy consolidation in the left retrocardiac area, suggestive of atelectasis or early airspace disease.
Predicted labels: ['Consolidation (Uncertain)', 'Air space opacity (Uncertain)']
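
The with-statuses models return each label with its status in trailing parentheses, as in 'Consolidation (Uncertain)'. A small helper can separate the two parts when labels need to be compared with or without status. This is a sketch under that formatting assumption; `split_label` is a hypothetical name, and only the "Uncertain" status is taken from the output above.

```python
import re
from typing import Optional, Tuple

# Label text, optional whitespace, then a final "(Status)" group.
_LABEL_RE = re.compile(r"^(?P<label>.+?)\s*\((?P<status>[^()]+)\)$")


def split_label(tagged: str) -> Tuple[str, Optional[str]]:
    """Split 'Label (Status)' into (label, status); status is None if absent."""
    match = _LABEL_RE.match(tagged.strip())
    if match is None:
        return tagged.strip(), None
    return match.group("label"), match.group("status")


pairs = [split_label(p) for p in ["Consolidation (Uncertain)", "Air space opacity (Uncertain)"]]
print(pairs)
```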

Difference between models:

Model	Predicted Labels
🤗 StanfordAIMI/SRR-BERT-Leaves	Atelectasis; Air space opacity–multifocal
🤗 StanfordAIMI/SRR-BERT-Upper	Consolidation; Air space opacity
🤗 StanfordAIMI/SRR-BERT-Leaves-with-Statuses	Atelectasis (Uncertain); Air space opacity–multifocal (Uncertain)
🤗 StanfordAIMI/SRR-BERT-Upper-with-Statuses	Consolidation (Uncertain); Air space opacity (Uncertain)
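
To illustrate how label predictions like these could be turned into a report-level score, the sketch below computes support-weighted precision, recall, and F1 between the label sets of reference and generated reports. It is an assumption-laden illustration, not the released evaluation script: the function name `weighted_prf` is hypothetical, and it assumes each report has already been reduced to a set of labels (for example, by running an SRR-BERT model on every sentence and taking the union).

```python
from collections import Counter
from typing import List, Set, Tuple


def weighted_prf(refs: List[Set[str]], hyps: List[Set[str]]) -> Tuple[float, float, float, int]:
    """Support-weighted precision, recall and F1 over per-report label sets."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for ref, hyp in zip(refs, hyps):
        for label in ref & hyp:
            tp[label] += 1  # label in both reference and hypothesis
        for label in hyp - ref:
            fp[label] += 1  # predicted but not in the reference
        for label in ref - hyp:
            fn[label] += 1  # in the reference but missed

    labels = set(tp) | set(fp) | set(fn)
    support = {label: tp[label] + fn[label] for label in labels}
    total = sum(support.values())
    if total == 0:
        return 0.0, 0.0, 0.0, 0

    precision = recall = f1 = 0.0
    for label in labels:
        predicted = tp[label] + fp[label]
        p = tp[label] / predicted if predicted else 0.0
        r = tp[label] / support[label] if support[label] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = support[label] / total  # labels absent from the references get weight 0
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1, total


refs = [{"Consolidation", "Air space opacity"}, {"Cardiomegaly"}]
hyps = [{"Consolidation"}, {"Cardiomegaly", "Pleural Effusion"}]
p, r, f, n = weighted_prf(refs, hyps)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f} support={n}")
```

Weighting by support means frequent findings dominate the score; a macro average would treat rare and common labels equally.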

BibTeX

@inproceedings{delbrouck-etal-2025-automated,
    title     = "Automated Structured Radiology Report Generation",
    author    = "Delbrouck, Jean-Benoit  and
                 Xu, Justin  and
                 Moll, Johannes  and
                 Thomas, Alois  and
                 Chen, Zhihong  and
                 Ostmeier, Sophie  and
                 Azhar, Asfandyar  and
                 Li, Kelvin Zhenghao  and
                 Johnston, Andrew  and
                 Bluethgen, Christian  and
                 Reis, Eduardo  and
                 Muneer, Mohamed  and
                 Varma, Maya  and
                 Langlotz, Curtis",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year      = "2025",
    publisher = "Association for Computational Linguistics",
}

Ontology:

1. No Finding
2. Lung Finding
   2.1. Lung Opacity
       2.1.1. Air space opacity
           2.1.1.1. Diffuse air space opacity
               2.1.1.1.1. Edema
           2.1.1.2. Focal air space opacity
               2.1.1.2.1. Consolidation
                   2.1.1.2.1.1. Pneumonia
                   2.1.1.2.1.2. Atelectasis
                   2.1.1.2.1.3. Aspiration
               2.1.1.2.2. Segmental collapse
                   2.1.1.2.2.1. Lung collapse
               2.1.1.2.3. Perihilar airspace opacity
           2.1.1.3. Air space opacity–multifocal
       2.1.2. Masslike opacity
           2.1.2.1. Solitary masslike opacity
               2.1.2.1.1. Mass/Solitary lung mass
               2.1.2.1.2. Nodule/Solitary lung nodule
               2.1.2.1.3. Cavitating mass with content
           2.1.2.2. Multiple masslike opacities
               2.1.2.2.1. Cavitating masses
   2.2. Emphysema
   2.3. Fibrosis
   2.4. Pulmonary congestion
   2.5. Hilar lymphadenopathy
   2.6. Bronchiectasis
3. Pleural Finding
   3.1. Pneumothorax
       3.1.1. Simple pneumothorax
       3.1.2. Loculated pneumothorax
       3.1.3. Tension pneumothorax
   3.2. Pleural Thickening
       3.2.1. Pleural Effusion
           3.2.1.1. Simple pleural effusion
           3.2.1.2. Loculated pleural effusion
       3.2.2. Pleural scarring
   3.3. Hydropneumothorax
   3.4. Pleural Other
4. Widened Cardiac Silhouette
   4.1. Cardiomegaly
   4.2. Pericardial effusion
5. Mediastinal Finding
   5.1. Mediastinal Mass
       5.1.1. Inferior mediastinal mass
       5.1.2. Superior mediastinal mass
   5.2. Vascular Finding
       5.2.1. Widened aortic contour
           5.2.1.1. Tortuous Aorta
       5.2.2. Calcification of the Aorta
       5.2.3. Enlarged pulmonary artery
   5.3. Hernia
   5.4. Pneumomediastinum
   5.5. Tracheal deviation
6. Musculoskeletal Finding
   6.1. Fracture
       6.1.1. Acute humerus fracture
       6.1.2. Acute rib fracture
       6.1.3. Acute clavicle fracture
       6.1.4. Acute scapula fracture
       6.1.5. Compression fracture
   6.2. Shoulder dislocation
   6.3. Chest wall finding
       6.3.1. Subcutaneous Emphysema
7. Support Devices
   7.1. Suboptimal central line
   7.2. Suboptimal endotracheal tube
   7.3. Suboptimal nasogastric tube
   7.4. Suboptimal pulmonary arterial catheter
   7.5. Pleural tube
   7.6. PICC line
   7.7. Port catheter
   7.8. Pacemaker
   7.9. Implantable defibrillator
   7.10. LVAD
   7.11. Intraaortic balloon pump
8. Upper Abdominal Finding
   8.1. Subdiaphragmatic gas
       8.1.1. Pneumoperitoneum
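
Because the hierarchy is encoded in the dotted numbering, parent and child relations can be recovered directly from the outline. The sketch below parses a short excerpt of the ontology and looks up the path from the root to a given label. The helper names (`parse_outline`, `ancestors`, `leaves`) are hypothetical, and the code assumes labels are unique and every entry has the form "number. label".

```python
import re
from typing import Dict, List

# Excerpt copied from the ontology above; the full outline parses the same way.
ONTOLOGY_EXCERPT = """\
2. Lung Finding
   2.1. Lung Opacity
       2.1.1. Air space opacity
           2.1.1.2. Focal air space opacity
               2.1.1.2.1. Consolidation
                   2.1.1.2.1.2. Atelectasis
           2.1.1.3. Air space opacity–multifocal
"""

_ENTRY_RE = re.compile(r"^\s*((?:\d+\.)+)\s+(.*\S)\s*$")


def parse_outline(text: str) -> Dict[str, str]:
    """Map each dotted number (e.g. '2.1.1') to its label."""
    nodes: Dict[str, str] = {}
    for line in text.splitlines():
        match = _ENTRY_RE.match(line)
        if match:
            nodes[match.group(1).rstrip(".")] = match.group(2)
    return nodes


def ancestors(nodes: Dict[str, str], label: str) -> List[str]:
    """Return the labels on the path from the root down to `label`, inclusive."""
    key = next(k for k, v in nodes.items() if v == label)
    parts = key.split(".")
    return [nodes[".".join(parts[:i])] for i in range(1, len(parts) + 1)]


def leaves(nodes: Dict[str, str]) -> List[str]:
    """Return labels whose number is not a prefix of any other entry."""
    return [v for k, v in nodes.items() if not any(o.startswith(k + ".") for o in nodes)]


nodes = parse_outline(ONTOLOGY_EXCERPT)
print(" > ".join(ancestors(nodes, "Atelectasis")))
print(leaves(nodes))
```

In this excerpt, Atelectasis resolves to a path that passes through Consolidation and Air space opacity, which is one way to relate a fine-grained label to a coarser one.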