Multi-View and Multi-Scale Alignment (MaMA): Advancing Mammography with Contrastive Learning and Visual-Language Pre-training


Multi-View and Multi-Scale Alignment for Mammography Contrastive Learning:
Contrastive Language-Image Pre-training (CLIP) has shown promise in medical imaging, but adapting it to mammography is difficult because of limited labeled data, very high-resolution images with small regions of interest, and severe class imbalance. This study presents the first full adaptation of CLIP to mammography through a new framework, Multi-view and Multi-scale Alignment (MaMA). Mammography's inherent structure, including multiple views per study, bilateral asymmetry, and ipsilateral correspondence between views of the same breast, calls for specialized modeling. MaMA addresses these properties by exploiting the multi-view nature of mammography and aligning image features at different scales. A symmetric local alignment module captures fine-grained correspondence between image regions and report text, while a parameter-efficient fine-tuning approach adapts a pre-trained LLM with medical knowledge for text encoding. Together, these components help the framework cope with data scarcity and improve performance on downstream mammography tasks.
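As background, the sketch below shows the standard CLIP-style symmetric image-text contrastive objective that MaMA builds on; the function name, tensor shapes, and temperature value are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of the CLIP-style symmetric contrastive objective (assumed
# starting point for MaMA); names, shapes, and temperature are illustrative.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) global embeddings of paired images and reports."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric loss: match each image to its own report and each report to its image.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```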

The MaMA model outperforms existing state-of-the-art methods across multiple tasks on two large mammography datasets, EMBED and RSNA-Mammo, while using only 52% of the model size of the largest baseline. By combining multi-view image alignment with image-text supervision, MaMA learns detailed image representations at modest computational cost. The approach shows how visual-language pre-training can improve mammography interpretation, supporting cancer detection and diagnosis with fewer computational demands, and the code is publicly available to encourage further research.

Medical Visual-Language Pre-training Methods:
Existing medical Visual-Language Pre-training (VLP) models fall into two groups. The first comprises general-purpose models trained on large-scale datasets spanning multiple anatomical sites; these generalize well but are often outperformed by modality-specific models. The second focuses on chest X-rays, where large paired image-report datasets are available, yet these methods still face limitations such as pixel-level imbalance and coarse image-report alignment. Multi-view contrastive learning, which aligns images of the same subject taken from different views, has been applied to mammography, but it has yet to be fully combined with CLIP-style training to exploit multimodal supervision signals.

Method:
The proposed MaMA framework constructs structured mammography reports from tabular data and couples them with a multi-view contrastive image-text pre-training approach. Template-based caption generation turns tabular annotations into descriptive reports, avoiding the oversimplified captions that plain labels would provide. A multi-view contrastive learning objective relates different mammogram views of the same study, while the Symmetric Local Alignment (SLA) module establishes fine-grained correspondence between image patches and text tokens, as illustrated in the sketch below. In addition, parameter-efficient fine-tuning (PEFT) of a large pre-trained LLM improves text encoding without a substantial increase in computational cost.
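The paper defines its own SLA formulation; as a rough illustration of patch-token local alignment, the sketch below uses a FILIP-style max-similarity aggregation in both directions, which is an assumption rather than the authors' exact module.

```python
# Hedged sketch of a symmetric local alignment term between image patch tokens
# and text tokens. Aggregation scheme (max over patches / max over tokens) is
# an assumption; the paper's SLA module may differ in detail.
import torch
import torch.nn.functional as F

def symmetric_local_alignment(patch_emb: torch.Tensor,   # (B, P, D) image patch features
                              token_emb: torch.Tensor,   # (B, T, D) text token features
                              temperature: float = 0.07) -> torch.Tensor:
    patch_emb = F.normalize(patch_emb, dim=-1)
    token_emb = F.normalize(token_emb, dim=-1)

    # Pairwise patch-token similarities for every image/report pair in the batch.
    sim = torch.einsum("ipd,jtd->ijpt", patch_emb, token_emb)  # (B, B, P, T)

    # Image-to-text score: each patch matches its best token, averaged over patches.
    i2t = sim.max(dim=-1).values.mean(dim=-1)   # (B, B)
    # Text-to-image score: each token matches its best patch, averaged over tokens.
    t2i = sim.max(dim=-2).values.mean(dim=-1)   # (B, B)

    targets = torch.arange(patch_emb.size(0), device=patch_emb.device)
    loss_i2t = F.cross_entropy(i2t / temperature, targets)
    loss_t2i = F.cross_entropy(t2i / temperature, targets)
    return 0.5 * (loss_i2t + loss_t2i)
```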

Model Performance on Mammography Datasets:
The experiments used the Emory EMBED dataset, comprising over 72,000 multi-view mammograms from 23,356 patients, split into training, validation, and test sets (70%/10%/20%). The model used DINOv2 ViT-B/14 as the image encoder and BioMedLM as the text encoder, fine-tuned via LoRA for efficiency. Training used the AdamW optimizer with a 4e-5 learning rate, a cosine annealing schedule, and the SLA loss, with a batch size of 144 across four GPUs. The primary evaluations were BI-RADS assessment and breast density prediction, reported with balanced accuracy (bACC) and AUC.
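A minimal sketch of this optimization setup (AdamW at a 4e-5 learning rate with cosine annealing, LoRA on the text encoder) is shown below; the LoRA rank/alpha, target module names, weight decay, and step count are assumptions, not values from the paper.

```python
# Sketch of the reported training setup: AdamW (lr 4e-5), cosine annealing, and
# LoRA-based parameter-efficient fine-tuning of the text encoder via `peft`.
# LoRA hyperparameters, target modules, weight decay, and step count are assumptions.
import torch
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

text_encoder = AutoModel.from_pretrained("stanford-crfm/BioMedLM")   # BioMedLM checkpoint
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.1,
                      target_modules=["c_attn"])                     # assumed GPT-2-style attention projection
text_encoder = get_peft_model(text_encoder, lora_cfg)                # only adapter weights stay trainable

# The image encoder and projection heads would also be added to the optimizer in practice.
trainable = [p for p in text_encoder.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=4e-5, weight_decay=1e-2)

num_steps = 10_000                                                    # placeholder; depends on epochs and data size
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_steps)
```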

MaMA outperformed baselines such as CLIP, ConVIRT, and MM-MIL in both zero-shot and full fine-tuning settings, with a 4% improvement in balanced accuracy for BI-RADS assessment and strong breast density prediction. Its robustness was further validated on the out-of-domain RSNA-Mammo dataset for cancer detection, where it achieved higher balanced accuracy and AUC than the baselines while maintaining adequate sensitivity and specificity, highlighting strong generalization even with limited training data.
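For context on the zero-shot setting, the sketch below shows prompt-based prediction with a CLIP-style model, where each class is described by a text prompt and an image is assigned to the closest prompt; the prompt wording and encoder interfaces are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch of zero-shot classification with a CLIP-style model: class
# prompts are encoded once, and each image is assigned to the most similar one.
# Prompt wording and the encoder call signatures are assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_predict(image_encoder, text_encoder, images, class_prompts):
    text_emb = F.normalize(text_encoder(class_prompts), dim=-1)   # (num_classes, dim)
    image_emb = F.normalize(image_encoder(images), dim=-1)        # (batch, dim)
    logits = image_emb @ text_emb.t()                             # cosine similarity per class
    return logits.argmax(dim=-1)                                  # predicted class index per image

# Example prompts for breast density prediction (wording is an assumption):
density_prompts = [f"A mammogram with breast density category {c}." for c in ["A", "B", "C", "D"]]
```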




Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.


