(Update May 12, 2024): Thank you for everyone's participation in Discharge Me! The final leaderboard is available here.

Generating Discharge Summary Sections

The primary objective of this task is to reduce the time and effort clinicians spend on writing detailed notes in the electronic health record (EHR). Clinicians play a crucial role in documenting patient progress in discharge summaries, but the creation of concise yet comprehensive hospital course summaries and discharge instructions often demands a significant amount of time. This can lead to clinician burnout and operational inefficiencies within hospital workflows. By streamlining the generation of these sections, we can not only enhance the accuracy and completeness of clinical documentation but also significantly reduce the time clinicians spend on administrative tasks, ultimately improving patient care quality.

1. Task Overview

Participants are given a dataset based on MIMIC-IV which includes 109,168 visits to the Emergency Department (ED), split into training, validation, phase I testing, and phase II testing sets. Each visit includes chief complaints and diagnosis codes (either ICD-9 or ICD-10) documented by the ED, at least one radiology report, and a discharge summary with both "Brief Hospital Course" and "Discharge Instructions" sections. The goal is to generate these two critical sections in discharge summaries.

Please click here for Frequently Asked Questions (FAQ).

1.1 Rules

All participants will be invited to submit a paper describing their solution to be included in the Proceedings of the 23rd Workshop on Biomedical Natural Language Processing (BioNLP) at ACL 2024. If you do not wish to write a paper, you must at least provide a thorough description of your system which will be included in the overview paper for this task. Otherwise, your submission (and reported scores) will not be taken into account.

1.2 Timeline

All deadlines are 11:59 PM ("Anywhere on Earth")

1.3 How to Participate

Please visit the Codabench competition page to register for this shared task. Codabench [1] is the platform that we will use throughout the challenge, and an account is required to officially join the competition. All submissions and leaderboards will be available on that platform. Please direct any questions about the competition to the Codabench discussion forum or email xujustin@stanford.edu.

2. Data

The dataset for this task is created from MIMIC-IV's submodules MIMIC-IV-Note [2] and MIMIC-IV-ED [3] and is available on PhysioNet [4]. To download the data, you must have a credentialed PhysioNet account. If you do not have an account, you can create one here. Information on the required CITI training and credentialing process is available here.

The dataset for the "Discharge Me!" task is available on PhysioNet here.

Alternatively, the notebook to process the raw MIMIC-IV files from PhysioNet to form the challenge dataset is available on Colab here.

2.1 Dataset Description

The dataset has been split into training (68,785 samples), validation (14,719 samples), phase I testing (14,702 samples), and phase II testing (10,962 samples) sets. The phase II testing set will serve as the final test set and will be released on Friday, April 12, 2024. All datasets and tables are derived from the MIMIC-IV submodules.

Participants are free to use all or part of the provided dataset to develop their systems (except for phase I/II testing splits). To avoid any data contamination, any data in the official testing splits may not be used during training.

Discharge summaries are split into various sections and written under a variety of headings. However, each note in the dataset for this task includes a "Brief Hospital Course" and a "Discharge Instructions" section. The "Brief Hospital Course" section is usually located in the middle of the discharge summary, following information about patient history and treatments received during the current visit. The "Discharge Instructions" section is generally located at the end of the note as one of the last sections.

Each visit is defined by a unique hadm_id and is associated with a corresponding discharge summary and at least one radiology report. Most visits in the dataset have only one corresponding ED stay. However, a few visits have more than one ED stay (i.e., multiple stay_id values). Each stay_id can have multiple ICD diagnoses but only one chief complaint. Participants may use online resources for descriptions and details about ICD codes.
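To illustrate how the records relate to one another, here is a minimal sketch that groups ED stays, chief complaints, ICD diagnoses, and radiology reports under their hadm_id. The function name build_visit_inputs, the row-dictionary layout, and the column names chiefcomplaint, icd_version, icd_code, and text are assumptions made for this example; please check them against the actual table schemas before relying on them.

```python
from collections import defaultdict


def build_visit_inputs(edstays, chief_complaints, diagnoses, radiology):
    """Group ED and radiology rows (lists of dicts) under their hadm_id.

    Column names other than hadm_id and stay_id are assumptions.
    """
    # A visit (hadm_id) can have one or more ED stays (stay_id).
    stay_to_visit = {row["stay_id"]: row["hadm_id"] for row in edstays}

    visits = defaultdict(
        lambda: {"chief_complaints": [], "icd_codes": [], "radiology": []}
    )
    # One chief complaint per stay_id.
    for row in chief_complaints:
        hadm_id = stay_to_visit[row["stay_id"]]
        visits[hadm_id]["chief_complaints"].append(row["chiefcomplaint"])
    # Possibly several ICD diagnoses per stay_id (ICD-9 or ICD-10).
    for row in diagnoses:
        hadm_id = stay_to_visit[row["stay_id"]]
        visits[hadm_id]["icd_codes"].append((row["icd_version"], row["icd_code"]))
    # Radiology reports are linked to the visit directly.
    for row in radiology:
        visits[row["hadm_id"]]["radiology"].append(row["text"])
    return dict(visits)


# Toy rows: one visit with two ED stays.
edstays = [{"hadm_id": 100, "stay_id": 1}, {"hadm_id": 100, "stay_id": 2}]
triage = [
    {"stay_id": 1, "chiefcomplaint": "Chest pain"},
    {"stay_id": 2, "chiefcomplaint": "Dyspnea"},
]
diagnoses = [
    {"stay_id": 1, "icd_version": 10, "icd_code": "R079"},
    {"stay_id": 2, "icd_version": 9, "icd_code": "78605"},
]
radiology = [{"hadm_id": 100, "text": "No acute process."}]

visits = build_visit_inputs(edstays, triage, diagnoses, radiology)
print(visits[100])
```

Keying everything on hadm_id keeps the handful of multi-stay visits from being split into separate model inputs.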

Please note that we decided to leave the entire discharge summary intact in the text field of the discharge table to allow for the most flexibility from participating teams. As such, teams will have to remove the target sections by simply masking out the corresponding text in discharge_target or via some other method, as the target sections (text that we aim to generate) should not be a model input.
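One possible way to mask the target sections is sketched below. It assumes that each target section appears verbatim inside the full discharge summary text; the function name mask_target_sections and the placeholder argument are hypothetical and not part of the provided data or scripts.

```python
def mask_target_sections(note_text, target_sections, placeholder=""):
    """Return note_text with every target section replaced by placeholder.

    Assumes each target string occurs verbatim in note_text; targets that
    are empty or not found are skipped.
    """
    masked = note_text
    for section_text in target_sections:
        if section_text and section_text in masked:
            masked = masked.replace(section_text, placeholder)
    return masked


note = (
    "History: chest pain.\n"
    "Brief Hospital Course:\nAdmitted for observation.\n"
    "Discharge Instructions:\nRest and follow up.\n"
)
targets = ["Admitted for observation.", "Rest and follow up."]

masked_note = mask_target_sections(note, targets, placeholder="[MASKED]")
print(masked_note)
```

Note that this sketch leaves the section headings in place and replaces every occurrence of a target string, so a very short target (for example 'NA') could also match unrelated text. Teams may prefer to mask by section position instead.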

Please keep in mind the clinical workflows when deciding which other sections of the discharge summary to use during generation. Specifically, please only use sections that would be reasonably available to the clinicians at the time of writing the respective target sections. Notably, it would be acceptable to use the generated "Brief Hospital Course" as an input for the generation of the "Discharge Instructions", as it would have been available to the models/clinicians in a typical setting. Remember to justify your decisions in your system paper and/or submission description.
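As a sketch of the two-stage setup mentioned above, the generated "Brief Hospital Course" can be fed into the prompt for the "Discharge Instructions". Everything here is illustrative: generate_targets, the prompt wording, and the generate callable (a stand-in for whatever model a team uses) are assumptions, not a prescribed approach.

```python
from typing import Callable, Dict


def generate_targets(context: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Generate both target sections, reusing the first as input to the second.

    `context` holds only inputs that would be available at writing time;
    `generate` is any text-in, text-out model wrapper (hypothetical).
    """
    bhc = generate("Write the Brief Hospital Course.\n\n" + context)
    di_prompt = (
        "Write the Discharge Instructions.\n\n"
        + context
        + "\n\nBrief Hospital Course:\n"
        + bhc
    )
    di = generate(di_prompt)
    return {"brief_hospital_course": bhc, "discharge_instructions": di}


# Stub model that records its prompts, for demonstration only.
prompts_seen = []


def stub_generate(prompt: str) -> str:
    prompts_seen.append(prompt)
    return f"output {len(prompts_seen)}"


result = generate_targets("Chief complaint: chest pain.", stub_generate)
print(result)
```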

Special Note:

If you are using pandas to read the .csv.gz tables, please ensure you set keep_default_na=False. For instance:

pd.read_csv('discharge_target.csv.gz', keep_default_na=False)

Otherwise, pandas will automatically convert certain strings, such as in cases where the discharge instruction is 'NA' or 'N/A', into the float NaN.
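The difference can be seen on a small in-memory CSV. This is only a demonstration: the column names and rows below are made up, and io.StringIO stands in for the real .csv.gz file.

```python
import io

import pandas as pd

# Toy CSV; the real tables are gzip-compressed files on disk.
csv_text = (
    "hadm_id,discharge_instructions\n"
    "1,NA\n"
    "2,N/A\n"
    "3,Take medications as prescribed\n"
)

# Default behaviour: 'NA' and 'N/A' are parsed as missing values (NaN).
default_df = pd.read_csv(io.StringIO(csv_text))

# With keep_default_na=False the original strings are preserved.
kept_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_df["discharge_instructions"].isna().sum())
print(kept_df["discharge_instructions"].tolist())
```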

2.2 Dataset Statistics

The complete dataset contains the following items:

| Item | Total Count | Training | Validation | Phase I Testing | Phase II Testing |
|---|---|---|---|---|---|
| Visits | 109,168 | 68,785 | 14,719 | 14,702 | 10,962 |
| Discharge Summaries | 109,168 | 68,785 | 14,719 | 14,702 | 10,962 |
| Radiology Reports | 409,359 | 259,304 | 54,650 | 54,797 | 40,608 |
| ED Stays & Chief Complaints | 109,403 | 68,936 | 14,751 | 14,731 | 10,985 |
| ED Diagnoses | 218,376 | 138,112 | 29,086 | 29,414 | 21,764 |

2.3 Dataset Schemas

For consistency and ease-of-use, the schemas of the data tables have been kept the same as the ones originally provided in MIMIC-IV's submodules. An additional table in discharge_target.csv.gz is provided, which includes extracted "Brief Hospital Course" and "Discharge Instructions" sections from the discharge summaries.

3. Evaluation Metrics

A hidden subset of 250 samples from the testing datasets of the respective phases will be used to evaluate the submissions. The metrics for this task are based on a combination of textual similarity and factual correctness of the generated text. Specifically, we will consider the following metrics:

Additionally, the submissions from the top-6 scoring teams will be reviewed by a team of clinicians at the end of the competition. Generated sections will be evaluated for their Completeness, Correctness, Readability, and Holistic Comparison to the Reference Text. Specifically, the following criteria will be used (scored from 1 to 5):

There will be two separate leaderboards on the Codabench competition page. One will be dedicated to the scores from the initial phase I testing dataset, and the other to the final scores from the phase II testing dataset, which will be released on Friday, April 12, 2024.

Initially, submissions will be scored on both target sections separately. The mean across all test samples will be computed for each metric, resulting in several performance scores for each of the two target sections (not reported on the leaderboards). Then, for each metric, we will take the mean of the scores for each of the two target sections and report it under the metric name on the leaderboards. Finally, we will take the mean once again over all the metrics to arrive at a final overall system score (reported as Overall on the leaderboards). For instance, given $N$ samples, suppose $s_i$ is defined as the score for sample $i$ for a given metric; then the mean across all samples, $\bar{s}$, would be calculated by:

$$\bar{s} = \frac{1}{N} \sum_{i=1}^{N} s_i$$

We would then calculate $\bar{m}$, the mean of a given metric over both target sections, for each of the 8 metrics using:

$$\bar{m} = \frac{\bar{s}_{\text{BHC}} + \bar{s}_{\text{DI}}}{2}$$

where $\bar{s}_{\text{BHC}}$ and $\bar{s}_{\text{DI}}$ are the sample means for the "Brief Hospital Course" and "Discharge Instructions" sections, respectively.

Finally, the overall system score would be calculated by taking the mean of the 8 values $\bar{m}_1, \dots, \bar{m}_8$:

$$\text{Overall} = \frac{1}{8} \sum_{k=1}^{8} \bar{m}_k$$

All scoring calculations will be done on Codabench with a Python 3.9 environment. The evaluation scripts are available on GitHub for reference here.
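The three-step averaging described above can be sketched as follows. This is not the official evaluation script, only a minimal illustration: the function name leaderboard_scores, the nested-dictionary input layout, and the metric names metric_a and metric_b are placeholders chosen for this example.

```python
from statistics import mean

SECTIONS = ("brief_hospital_course", "discharge_instructions")


def leaderboard_scores(per_sample_scores):
    """Aggregate per-sample scores into per-metric and Overall scores.

    per_sample_scores[metric][section] is a list of per-sample scores.
    """
    results = {}
    for metric, by_section in per_sample_scores.items():
        # Step 1: mean over all samples, separately for each target section.
        section_means = [mean(by_section[section]) for section in SECTIONS]
        # Step 2: mean over the two target sections (reported per metric).
        results[metric] = mean(section_means)
    # Step 3: mean over all metrics (reported as Overall).
    overall = mean(list(results.values()))
    results["Overall"] = overall
    return results


# Toy example with two placeholder metrics and two samples per section.
scores = {
    "metric_a": {
        "brief_hospital_course": [0.2, 0.4],
        "discharge_instructions": [0.6, 0.8],
    },
    "metric_b": {
        "brief_hospital_course": [0.1, 0.3],
        "discharge_instructions": [0.5, 0.5],
    },
}

board = leaderboard_scores(scores)
print(board)
```

Because each section is averaged before the two are combined, both target sections carry equal weight in every metric, and every metric carries equal weight in the overall score.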

For specific submission instructions and details on evaluation, please visit the Codabench competition page.

4. Frequently Asked Questions

Please also visit the Codabench competition forum. You may find existing answers to your questions there, and are welcome to make your own post.

5. Organizers

Justin Xu

Jean-Benoit Delbrouck

Andrew Johnston

Louis Blankemeier

Curtis Langlotz

If you have any questions, please feel free to reach out to xujustin@stanford.edu. We hope you enjoy the shared task and look forward to your systems!

6. References

[1] Z. Xu et al., “Codabench: Flexible, easy-to-use, and reproducible meta-benchmark platform,” Patterns, vol. 3, no. 7, pp. 100543-100543, Jul. 2022, doi: https://doi.org/10.1016/j.patter.2022.100543.
[2] Johnson, A., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2023). MIMIC-IV-Note: Deidentified free-text clinical notes (version 2.2). PhysioNet. https://doi.org/10.13026/1n74-ne17.
[3] Johnson, A., Bulgarelli, L., Pollard, T., Celi, L. A., Mark, R., & Horng, S. (2023). MIMIC-IV-ED (version 2.2). PhysioNet. https://doi.org/10.13026/5ntk-km72.
[4] Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
[5] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
[6] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
[7] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating Text Generation with BERT,” arXiv.org, 2019. https://arxiv.org/abs/1904.09675.
[8] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.
[9] Yuheng Zha, Yichi Yang, Ruichen Li, and Zhiting Hu. 2023. AlignScore: Evaluating Factual Consistency with A Unified Alignment Function. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11328–11348, Toronto, Canada. Association for Computational Linguistics.
[10] W. Yim, Y. Fu, A. Ben Abacha, N. Snider, T. Lin, and M. Yetisgen, “Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation,” Scientific Data, vol. 10, no. 1, p. 586, Sep. 2023, doi: https://doi.org/10.1038/s41597-023-02487-3.