RadGraph2: A New Dataset for Tracking Disease Progression in Radiology Reports


Automated information extraction from radiology reports is a long-standing challenge in medical informatics. Researchers aim to build systems that can accurately extract and interpret complex medical information from free-text reports, particularly to track disease progression over time. The primary obstacle is the limited availability of suitably labeled data that captures the nuanced information these reports contain. Existing methods often struggle to represent the temporal aspects of patient conditions, especially comparisons with prior examinations, which are crucial for understanding a patient’s healthcare trajectory.

To overcome the limitations in capturing temporal changes in radiology reports, researchers have developed RadGraph2, an enhanced hierarchical schema for entities and relations. This new approach builds upon the original RadGraph schema, expanding its capabilities to represent various types of changes observed in patient conditions over time. RadGraph2 was developed through an iterative process, involving continuous feedback from medical practitioners to ensure its coverage, faithfulness, and reliability. The schema maintains the original design principles of maximizing clinically relevant information while preserving simplicity for efficient labeling. This method enables the capture of detailed information about findings and changes described in radiology reports, particularly focusing on comparisons with prior examinations.

RadGraph2 is accompanied by a Hierarchical Graph Information Extraction (HGIE) model that annotates radiology reports automatically, using the structured organization of the labels to improve information extraction performance. At the core of the system is a Hierarchical Recognition (HR) component built on the entity taxonomy, which exploits the inherent relationships between the entities used in graph labeling; for instance, entities such as CHAN-CON-WOR and CHAN-CON-AP both fall under changes in a patient’s condition. The HR component uses a BERT-based model as its backbone and produces 12 scalar outputs corresponding to the entity categories, each representing the conditional probability that an entity is present given that its parent in the entity hierarchy is present.
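
As a rough sketch of how such a head could be wired up, the code below pairs a BERT encoder with 12 sigmoid outputs and multiplies conditional probabilities along each category’s path to the root. The class name, the per-token output shape, and the parent_index encoding are our assumptions for illustration, not the authors’ implementation.

```python
# A minimal sketch, assuming a BERT encoder, one sigmoid output per entity
# category, and a parent_index list giving each category's parent (-1 for
# roots, with parents listed before their children). Names are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel

class HierarchicalRecognitionHead(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased", num_categories=12):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # One scalar per entity category, read as P(category | parent category holds).
        self.category_logits = nn.Linear(hidden, num_categories)

    def forward(self, input_ids, attention_mask, parent_index):
        token_states = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                                      # (batch, seq_len, hidden)
        cond = torch.sigmoid(self.category_logits(token_states)) # conditional probabilities
        # Marginal probability of a category = product of conditionals on its root path.
        marginals = []
        for i, parent in enumerate(parent_index):
            if parent >= 0:
                marginals.append(cond[..., i] * marginals[parent])
            else:
                marginals.append(cond[..., i])
        return torch.stack(marginals, dim=-1)                    # (batch, seq_len, num_categories)
```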

RadGraph2’s information schema defines three main entity types: “anatomy,” “observation,” and “change,” along with three relation types: “modify,” “located at,” and “suggestive of.” The entity types are further divided into subtypes, forming a hierarchical structure. Change entities (CHAN) are a key addition to the original RadGraph schema, encompassing subtypes such as No change (CHAN-NC), Change in medical condition (CHAN-CON), and Change in medical devices (CHAN-DEV). Each of these subtypes is further categorized to capture specific aspects of change, such as condition appearance, worsening, improvement, or resolution. Anatomy entities (ANAT) and Observation entities (OBS) are retained from the original schema, with OBS further divided into definitely present, uncertain, and absent subtypes. This hierarchical structure allows for a more nuanced representation of the information contained in radiology reports, particularly emphasizing the temporal aspects and changes in patient conditions.
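
For concreteness, the hierarchy described above can be written as a small parent-lookup table. Only the subtypes named here are listed, and the exact observation label strings (OBS-DP, OBS-U, OBS-DA) are assumed to follow the original RadGraph convention, so they should be checked against the released schema.

```python
# Illustrative parent map for the RadGraph2 entity hierarchy; subtypes not
# named in this article are omitted, and label strings are assumed.
ENTITY_PARENT = {
    "ANAT-DP": "ANAT",            # anatomy, definitely present
    "OBS-DP": "OBS",              # observation, definitely present
    "OBS-U": "OBS",               # observation, uncertain
    "OBS-DA": "OBS",              # observation, definitely absent
    "CHAN-NC": "CHAN",            # no change
    "CHAN-CON": "CHAN",           # change in medical condition
    "CHAN-DEV": "CHAN",           # change in medical devices
    "CHAN-CON-AP": "CHAN-CON",    # condition appeared
    "CHAN-CON-WOR": "CHAN-CON",   # condition worsened
    # improvement and resolution subtypes also exist; codes omitted here
}

def ancestors(label):
    """Walk up the hierarchy from a fine-grained label to its root type."""
    path = [label]
    while path[-1] in ENTITY_PARENT:
        path.append(ENTITY_PARENT[path[-1]])
    return path

print(ancestors("CHAN-CON-WOR"))  # ['CHAN-CON-WOR', 'CHAN-CON', 'CHAN']
```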

RadGraph2’s schema defines three types of relations as directed edges between entities:

1. Modify relations (modify):

   • Indicate that the first entity modifies the second entity

   • Connect entity types: (OBS-*, OBS-*), (ANAT-DP, ANAT-DP), (CHAN-*, *), and (OBS-*, CHAN-*)

   • Example: “right” → “lung” in “right lung”

2. Located at relations (located_at):

   • Connect anatomy and observation entities

   • Indicate that an observation is associated with an anatomical location

   • Connect entity types: (OBS-*, ANAT-DP)

   • Example: “clear” → “lungs” in “lungs are clear”

3. Suggestive of relations (suggestive_of):

   • Indicate that the status of the second entity is derived from the first entity

   • Connect entity types: (OBS-*, OBS-*), (CHAN-*, OBS-*), and (OBS-*, CHAN-*)

   • Example: “opacity” → “pneumonia” in “The opacity may indicate pneumonia”

These relations enable RadGraph2 to capture the complex relationships between different entities in radiology reports, including modifications, anatomical associations, and diagnostic inferences. The schema’s relational structure allows for a more comprehensive representation of the information contained in the reports, facilitating a better understanding of the interconnections between observations, anatomical structures, and changes in patient conditions.
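
To make this concrete, the snippet below hand-builds the tiny graph for “lungs are clear”, attaching a located_at edge from the observation to the anatomy entity. The dataclass and its field names are ours for illustration; only the label and relation strings come from the schema above.

```python
# A hand-built example graph for "lungs are clear": the observation "clear"
# is connected to the anatomy "lungs" by a located_at relation.
from dataclasses import dataclass, field

@dataclass
class Entity:
    tokens: str                    # surface text of the entity
    label: str                     # entity type, e.g. ANAT-DP or OBS-DP
    start_ix: int                  # index of the first token of the entity
    end_ix: int                    # index of the last token of the entity
    relations: list = field(default_factory=list)  # (relation_type, target_entity_id)

report_graph = {
    "1": Entity("lungs", "ANAT-DP", 0, 0),
    "2": Entity("clear", "OBS-DP", 2, 2, relations=[("located_at", "1")]),
}
```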

RadGraph2’s dataset is organized into three main partitions:

1. Training set:

   • Contains 575 manually labeled reports

   • Used for model training and optimization

2. Development set:

   • Consists of 75 manually labeled reports

   • Used for model validation and hyperparameter tuning

3. Test set:

   • Comprises 150 manually labeled reports

   • Used for final model evaluation

Key characteristics of the dataset:

• Patient disjointness: Reports in each partition are from distinct sets of patients

• Consistency with the original RadGraph: Reports carried over from the original dataset keep their original split assignment

• De-identification: All protected health information in the reports is removed

Additional dataset component:

• 220,000+ automatically labeled reports:

   – Annotated by the best-performing model (HGIE)

   – Provides a large-scale resource for further research and model development

This dataset structure ensures a robust evaluation framework for RadGraph2, maintaining data integrity and patient privacy while offering a substantial corpus for training and testing advanced information extraction models in the radiology domain.

RadGraph2 releases a comprehensive set of files to support researchers and developers. The dataset package includes a README.md file providing a brief overview, along with train.json, dev.json, and test.json files containing labeled reports from MIMIC-CXR-JPG and CheXpert. Also, two large inference files, inference-chexpert.json and inference-mimic.json, contain reports labeled by the benchmark model. The file format follows a structure similar to the original RadGraph dataset, utilizing a JSON format with a hierarchical dictionary structure. Each report is identified by a unique key and contains metadata such as the full text, data split, data source, and a flag indicating if it was part of the original RadGraph dataset. The “entities” key within each report’s dictionary encapsulates detailed information about entity and relation labels, including tokens, label types, token indices, and relations to other entities. This structured format allows for efficient data processing and analysis, enabling researchers to utilize the rich information contained in radiology reports for various natural language processing tasks and medical informatics applications.
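
A minimal way to read one of the released split files might look like the sketch below. The key names mirror the structure just described and the original RadGraph convention, but they should be verified against the README shipped with the dataset.

```python
# Hedged sketch of iterating over a released split file. Key names ("text",
# "data_split", "data_source", "entities", "tokens", "label", "start_ix",
# "end_ix", "relations") are assumed from the description above.
import json

with open("train.json") as f:
    reports = json.load(f)                      # {report_id: report_dict}

for report_id, report in reports.items():
    print(report_id, report.get("data_split"), report.get("data_source"))
    for entity_id, entity in report.get("entities", {}).items():
        # Each entity records its surface tokens, hierarchical label,
        # token span, and outgoing relations to other entity ids.
        print(entity_id, entity["tokens"], entity["label"],
              entity["start_ix"], entity["end_ix"], entity["relations"])
```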

RadGraph2 is an advanced approach to automated information extraction from radiology reports, addressing the challenges of tracking disease progression over time. Key aspects of RadGraph2 include:

1. Enhanced hierarchical schema: Built upon the original RadGraph, it introduces new entity types to represent various kinds of changes in patient conditions.

2. Hierarchical Graph Information Extraction model: Utilizes a structured organization of labels and a Hierarchical Recognition component with a BERT-based backbone.

3. Comprehensive entity types: Includes anatomy, observation, and change entities, with further subtypes to capture nuanced information.

4. Relation types: Defines modify, located_at, and suggestive_of relations to represent complex relationships between entities.

5. Dataset structure: Comprises training (575 reports), development (75 reports), and test (150 reports) sets, plus 220,000+ automatically labeled reports.

6. File format: Uses JSON structure with detailed metadata and entity information for each report.

RadGraph2 aims to provide a more comprehensive representation of temporal changes in radiology reports, enabling better tracking of disease progression and patient care trajectories. The dataset and schema offer researchers a robust framework for developing advanced natural language processing models in the medical domain.





Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.


