(Update May 12, 2024): Thank you to everyone who participated in Discharge Me! The final leaderboard is available here.
Generating Discharge Summary Sections
The primary objective of this task is to reduce the time and effort clinicians spend on writing detailed notes in the electronic health record (EHR). Clinicians play a crucial role in documenting patient progress in discharge summaries, but the creation of concise yet comprehensive hospital course summaries and discharge instructions often demands a significant amount of time. This can lead to clinician burnout and operational inefficiencies within hospital workflows. By streamlining the generation of these sections, we can not only enhance the accuracy and completeness of clinical documentation but also significantly reduce the time clinicians spend on administrative tasks, ultimately improving patient care quality.
1. Task Overview
Participants are given a dataset based on MIMIC-IV which includes 109,168 visits to the Emergency Department (ED), split into training, validation, phase I testing, and phase II testing sets. Each visit includes chief complaints and diagnosis codes (either ICD-9 or ICD-10) documented by the ED, at least one radiology report, and a discharge summary with both "Brief Hospital Course" and "Discharge Instructions" sections. The goal is to generate these two critical sections in discharge summaries.
Please click here for Frequently Asked Questions (FAQ).
1.1 Rules
All participants will be invited to submit a paper describing their solution to be included in the Proceedings of the 23rd Workshop on Biomedical Natural Language Processing (BioNLP) at ACL 2024. If you do not wish to write a paper, you must at least provide a thorough description of your system which will be included in the overview paper for this task. Otherwise, your submission (and reported scores) will not be taken into account.
- Participants must comply with the PhysioNet Credentialed Health Data Use Agreement when using the data.
- Participants may use any additional data to train (or pre-train) their systems. However, all data used for the submission must be in some way available to other researchers.
- Participants may involve existing models trained on proprietary data in their systems. However, these models must also be accessible to other researchers in some capacity.
- If participants employ LLMs, please clearly note the models' expected outputs or the prompting strategies used so that results can be reproduced. However, please note that sending data via an API to a third party is a violation of the DUA. Please consult this for options.
- All submissions must be made through the Codabench competition page.
1.2 Timeline
- First call for participation: February 5th (Monday), 2024
- Release of training, validation, and phase I testing datasets: February 6th (Tuesday), 2024
- Release of the phase II testing dataset: April 12th (Friday), 2024
- System submission deadline: May 10th (Friday), 2024
- System papers due date: May 17th (Friday), 2024
- Notification of acceptance: June 17th (Monday), 2024
- Camera-ready system papers due: July 1st (Monday), 2024
- BioNLP Workshop Date: August 16th (Friday), 2024
1.3 How to Participate
Please visit the Codabench competition page to register for this shared task. Codabench [1] is the platform that we will use throughout the challenge, and an account is required to officially join the competition. All submissions and leaderboards will be available on that platform. Please direct any questions about the competition to the Codabench discussion forum or email xujustin@stanford.edu.
2. Data
The dataset for this task is created from MIMIC-IV's submodules MIMIC-IV-Note [2] and MIMIC-IV-ED [3] and is available on PhysioNet [4]. In order to download the data, you must have a credentialed PhysioNet account. If you do not have an account, you can create one here. Information on the required CITI training and credentialing process is available here.
The dataset for the "Discharge Me!" task is available on PhysioNet here.
Alternatively, the notebook to process the raw MIMIC-IV files from PhysioNet to form the challenge dataset is available on Colab here.
2.1 Dataset Description
The dataset has been split into training (68,785 samples), validation (14,719 samples), phase I testing (14,702 samples), and phase II testing (10,962 samples) sets. The phase II testing dataset will serve as the final test set and will be released on April 12th (Friday), 2024. All datasets and tables are derived from the MIMIC-IV submodules.
Participants are free to use all or part of the provided dataset to develop their systems (except for phase I/II testing splits). To avoid any data contamination, any data in the official testing splits may not be used during training.
Discharge summaries are split into various sections and written under a variety of headings. However, each note in the dataset for this task includes a "Brief Hospital Course" and a "Discharge Instructions" section. The "Brief Hospital Course" section is usually located in the middle of the discharge summary, following information about patient history and treatments received during the current visit. The "Discharge Instructions" section is generally located at the end of the note as one of the last sections.
Each visit is defined by a unique `hadm_id` and is associated with a corresponding discharge summary and at least one radiology report. Most visits in the dataset will have only one corresponding ED stay. However, a select few visits may have more than one ED stay (i.e., multiple `stay_id`s). Each `stay_id` can have multiple ICD diagnoses, but will only have one chief complaint. Participants may use online resources for descriptions and details about ICD codes.
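To illustrate how these identifiers link the provided tables, here is a minimal `pandas` sketch. The file names other than `discharge.csv.gz`/`discharge_target.csv.gz` and the column names are assumed to follow the MIMIC-IV submodule schemas (e.g., `edstays.csv.gz`, `triage.csv.gz` with a `chiefcomplaint` column, `diagnosis.csv.gz`); please verify them against your copy of the dataset.

```python
import pandas as pd

def read(path):
    # keep_default_na=False prevents strings like 'NA'/'N/A' from becoming NaN
    return pd.read_csv(path, keep_default_na=False)

# File/column names below are assumptions based on the MIMIC-IV submodule schemas.
discharge = read("discharge.csv.gz")   # one discharge summary per hadm_id
edstays = read("edstays.csv.gz")       # >= 1 ED stay (stay_id) per visit
triage = read("triage.csv.gz")         # one chief complaint per stay_id
diagnosis = read("diagnosis.csv.gz")   # multiple ICD codes per stay_id

# Each visit (hadm_id) maps to one discharge summary and at least one ED stay.
visits = discharge.merge(edstays[["subject_id", "hadm_id", "stay_id"]],
                         on=["subject_id", "hadm_id"], how="left")

# Attach the single chief complaint per stay and the list of ICD diagnoses per stay.
visits = visits.merge(triage[["stay_id", "chiefcomplaint"]], on="stay_id", how="left")
icd_per_stay = (diagnosis.groupby("stay_id")["icd_code"]
                .apply(list).rename("icd_codes").reset_index())
visits = visits.merge(icd_per_stay, on="stay_id", how="left")
```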
Please keep in mind the clinical workflows when deciding which other sections of the discharge summary to use during generation. Specifically, please only use sections that would reasonably be available to clinicians at the time of writing the respective target sections. Notably, it would be acceptable to use the generated "Brief Hospital Course" as an input for the generation of the "Discharge Instructions", as it would have been available to the models/clinicians in a typical setting. Remember to justify your decisions in your system paper and/or submission description.

Please note that we decided to leave the entire discharge summary intact in the `text` field of the `discharge` table to allow participating teams the most flexibility. As such, teams will have to remove the target sections themselves, for example by masking out the corresponding text from `discharge_target`, or via some other method, as the target sections (the text we aim to generate) must not be used as model input.
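As one possible way to implement this masking, the sketch below strips the extracted target sections out of the full note text. The column names `brief_hospital_course` and `discharge_instructions` in `discharge_target.csv.gz` are assumptions for illustration; check the actual table schema.

```python
import pandas as pd

discharge = pd.read_csv("discharge.csv.gz", keep_default_na=False)
targets = pd.read_csv("discharge_target.csv.gz", keep_default_na=False)

# The section column names below are assumptions for illustration --
# verify them against the actual schema of discharge_target.csv.gz.
merged = discharge.merge(targets, on="hadm_id", suffixes=("", "_target"))

def mask_targets(row: pd.Series) -> str:
    """Drop both target sections from the full note so they never reach the model."""
    note = row["text"]
    for col in ("brief_hospital_course", "discharge_instructions"):
        section = row.get(col, "")
        if section:
            note = note.replace(section, "")
    return note

merged["model_input"] = merged.apply(mask_targets, axis=1)
```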
Special Note: If you are using `pandas` to read the `.csv.gz` tables, please ensure you set `keep_default_na=False`. For instance:

```python
import pandas as pd

df = pd.read_csv('discharge_target.csv.gz', keep_default_na=False)
```

Otherwise, `pandas` will automatically convert certain strings, such as in cases where the discharge instruction is `'NA'` or `'N/A'`, into the float `NaN`.
2.2 Dataset Statistics
The complete dataset contains the following items:
| Item | Total Count | Training | Validation | Phase I Testing | Phase II Testing |
|---|---|---|---|---|---|
| Visits | 109,168 | 68,785 | 14,719 | 14,702 | 10,962 |
| Discharge Summaries | 109,168 | 68,785 | 14,719 | 14,702 | 10,962 |
| Radiology Reports | 409,359 | 259,304 | 54,650 | 54,797 | 40,608 |
| ED Stays & Chief Complaints | 109,403 | 68,936 | 14,751 | 14,731 | 10,985 |
| ED Diagnoses | 218,376 | 138,112 | 29,086 | 29,414 | 21,764 |
2.3 Dataset Schemas
For consistency and ease of use, the schemas of the data tables have been kept the same as the ones originally provided in MIMIC-IV's submodules. An additional table, `discharge_target.csv.gz`, is provided, which includes the extracted "Brief Hospital Course" and "Discharge Instructions" sections from the discharge summaries.
3. Evaluation Metrics
A hidden subset of 250 samples from the testing datasets of the respective phases will be used to evaluate the submissions. The metrics for this task are based on a combination of textual similarity and factual correctness of the generated text. Specifically, we will consider the following metrics:
- BLEU-4 [5]
- ROUGE-1, -2, -L [6]
- BERTScore [7]
- Meteor [8]
- AlignScore [9]
- MEDCON [10]
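Beyond the official scripts, participants may want a quick local estimate of the textual-similarity metrics during development. The snippet below is a minimal sketch using the Hugging Face `evaluate` package for ROUGE, BLEU-4, METEOR, and BERTScore; AlignScore and MEDCON require their own tooling and are omitted here, and the exact model and tokenization choices may differ from the official evaluation scripts.

```python
import evaluate  # Hugging Face evaluate package

# Hypothetical generated section and its reference, for illustration only.
predictions = ["Patient was admitted for chest pain and treated with aspirin."]
references = ["The patient was admitted with chest pain and started on aspirin."]

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
bertscore = evaluate.load("bertscore")

scores = rouge.compute(predictions=predictions, references=references)  # rouge1/2/L
scores["bleu4"] = bleu.compute(predictions=predictions, references=references,
                               max_order=4)["bleu"]
scores["meteor"] = meteor.compute(predictions=predictions,
                                  references=references)["meteor"]
bs = bertscore.compute(predictions=predictions, references=references, lang="en")
scores["bertscore_f1"] = sum(bs["f1"]) / len(bs["f1"])
print(scores)
```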
Additionally, the submissions from the top-6 scoring teams will be reviewed by a team of clinicians at the end of the competition. Generated sections will be evaluated for their Completeness, Correctness, Readability, and Holistic Comparison to the Reference Text. Specifically, the following criteria will be used (scored from 1 to 5):
- Completeness (captures important information):
- Captures no important information (1)
- Captures ~25% of the important information (2)
- Captures ~50% of the important information (3)
- Captures ~75% of the important information (4)
- Captures all of the important information (5)
- Correctness (contains less false information):
- Contains harmful content that will definitely impact future care (1)
- Contains incorrect content that is likely to impact future care (2)
- Contains incorrect content that may or may not impact future care (3)
- Contains incorrect content that will not impact future care (4)
- Contains no incorrect content (5)
- Readability:
- Significantly harder to read than the reference text (1)
- Slightly harder to read than the reference text (2)
- Neither easier nor harder to read than the reference text (3)
- Slightly easier to read than the reference text (4)
- Significantly easier to read than the reference text (5)
- Holistic Comparison to the Reference Text:
- Significantly worse than the reference text (1)
- Slightly worse than the reference text (2)
- Neither better nor worse than the reference text (3)
- Slightly better than the reference text (4)
- Significantly better than the reference text (5)
There will be two separate leaderboards on the Codabench competition page: one for the scores on the initial phase I testing dataset, and one for the final scores on the phase II testing dataset, which will be released on April 12th (Friday), 2024.
Initially, submissions will be scored on both target sections separately. The mean across all test samples will be computed for each metric, resulting in several performance scores for each of the two target sections (not reported on the leaderboards). Then, for each metric, we will take the mean of the scores for each of the two target sections and report it under the metric name on the leaderboards. Finally, we will take the mean once again over all the metrics to arrive at a final overall system score (reported as Overall on the leaderboards). For instance, given $N$ samples, suppose $s_i$ is defined as the score for sample $i$ for a given metric on one target section; then the mean across all samples, $\bar{s}$, would be calculated by:

$$\bar{s} = \frac{1}{N} \sum_{i=1}^{N} s_i$$

We would then calculate $m_k$, the mean of a given metric $k$ over both target sections, for each of the 8 metrics using:

$$m_k = \frac{\bar{s}_k^{\,\mathrm{BHC}} + \bar{s}_k^{\,\mathrm{DI}}}{2}$$

where BHC and DI denote the "Brief Hospital Course" and "Discharge Instructions" sections. Finally, the overall system score would be calculated by taking the mean of the 8 values $m_1, \dots, m_8$:

$$\mathrm{Overall} = \frac{1}{8} \sum_{k=1}^{8} m_k$$
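A minimal sketch of this aggregation logic is shown below (this is not the official Codabench scoring code; the metric and section keys are illustrative):

```python
import numpy as np

def overall_score(per_sample_scores: dict[str, dict[str, list[float]]]) -> float:
    """per_sample_scores[metric][section] holds one score per hidden test sample.

    Metric/section keys here are illustrative, not the official identifiers.
    """
    metric_means = []
    for metric, sections in per_sample_scores.items():
        # Mean over samples within each target section, then mean over the two sections.
        section_means = [float(np.mean(vals)) for vals in sections.values()]
        metric_means.append(float(np.mean(section_means)))
    # Overall system score: mean over the (8) metric-level means.
    return float(np.mean(metric_means))

# Example with two metrics and the two target sections:
example = {
    "rouge1": {"brief_hospital_course": [0.42, 0.38],
               "discharge_instructions": [0.51, 0.47]},
    "bertscore": {"brief_hospital_course": [0.85, 0.83],
                  "discharge_instructions": [0.88, 0.86]},
}
print(overall_score(example))
```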
All scoring calculations will be done on Codabench with a Python 3.9 environment. The evaluation scripts are available on GitHub for reference here.
For specific submission instructions and details on evaluation, please visit the Codabench competition page.
4. Frequently Asked Questions
Please also visit the Codabench competition forum. You may find existing answers to your questions there, and are welcome to make your own post.
- Q: What will the phase II testing dataset look like?
- A: The phase II testing dataset will be identical in structure to the phase I testing dataset already released, with the target sections also in the full discharge summary. Please see the Codabench discussion post here (scroll to comment #8) for more details.
- Q: Why are there target sections that only have a length of one word?
- A: Thanks @mchizhik for bringing this to our attention. Please see the Codabench discussion post here for a note regarding the target sections and some minor changes to number of samples in the test datasets.
- Q: Can I use the `text` field in `discharge.csv.gz` as an input to my system?
- A: Yes, but please see the Codabench discussion post here for a note regarding input data.
- Q: Can I use the target sections as inputs into my system?
- A: No, as it would not be a true natural language generation (NLG) task otherwise. Please see the Codabench discussion post here for a note regarding this.
- Q: Do I have to use all the provided data?
- A: No. Please see the Codabench discussion post here for a note regarding data tables.
- Q: What sort of solution systems would be accepted? Do they have to be a machine learning model?
- A: Please see the Codabench discussion post here for a note regarding accepted solutions.
- Q: Where can I find the scoring program used for evaluation? It is not on the Codabench competition page.
- A: The scoring scripts are available on GitHub here.
- Q: Can I use external data to train my system?
- A: Yes, you may use external data to train your system. However, you must disclose all external data used in your system paper and/or submission description. Additionally, your system will not be ranked on the leaderboard if the data used for the submission is not available to other researchers.
- Q: Can I use private data to train my system?
- A: Yes, you may use private data to train your system. However, your system will not be ranked on the leaderboard if the data used for the submission is not available to other researchers.
- Q: Can I use pre-trained models to train my system?
- A: Yes, you may use pre-trained models and fine-tune them for your system. However, you must disclose all pre-trained models used in your system paper and/or submission description.
5. Organizers
- Justin Xu
- Jean-Benoit Delbrouck
- Andrew Johnston
- Louis Blankemeier
- Curtis Langlotz

If you have any questions, please feel free to reach out to xujustin@stanford.edu. We hope you enjoy the shared task and look forward to your systems!
6. References
[1] Z. Xu et al., “Codabench: Flexible, easy-to-use, and reproducible meta-benchmark platform,” Patterns, vol. 3, no. 7, pp. 100543-100543, Jul. 2022, doi: https://doi.org/10.1016/j.patter.2022.100543.
[2] Johnson, A., Pollard, T., Horng, S., Celi, L. A., & Mark, R. (2023). MIMIC-IV-Note: Deidentified free-text clinical notes (version 2.2). PhysioNet. https://doi.org/10.13026/1n74-ne17.
[3] Johnson, A., Bulgarelli, L., Pollard, T., Celi, L. A., Mark, R., & Horng, S. (2023). MIMIC-IV-ED (version 2.2). PhysioNet. https://doi.org/10.13026/5ntk-km72.
[4] Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220.
[5] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
[6] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
[7] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating Text Generation with BERT,” arXiv.org, 2019. https://arxiv.org/abs/1904.09675.
[8] Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.
[9] Yuheng Zha, Yichi Yang, Ruichen Li, and Zhiting Hu. 2023. AlignScore: Evaluating Factual Consistency with A Unified Alignment Function. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11328–11348, Toronto, Canada. Association for Computational Linguistics.
[10] W. Yim, Y. Fu, A. Ben Abacha, N. Snider, T. Lin, and M. Yetisgen, “Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation,” Scientific Data, vol. 10, no. 1, p. 586, Sep. 2023, doi: https://doi.org/10.1038/s41597-023-02487-3.