Your other Left!
Vision-Language Models Fail
to Identify Relative Positions in Medical Images

MICCAI 2025


Main Findings:

We show that Vision-Language Models (VLMs) fail to identify the relative positions of anatomical structures in medical images, a fundamental requirement for clinical applicability. We analyze visual-marker strategies for improving performance, demonstrate that VLMs rely on prior anatomical knowledge rather than image content, and introduce MIRP, an open benchmark dataset for relative-positioning tasks in medical imaging. Our evaluation serves as a critical first step toward enabling VLMs for clinical use.
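
To make the task concrete, here is a minimal, hypothetical sketch of a relative-position query with a visual-marker prompt: two structures are marked with colored dots and the model is asked which one is further left in the image. The query_vlm stub, the structure names, the coordinates, and the question wording are illustrative assumptions, not the MIRP protocol or our evaluation code.

```python
from PIL import Image, ImageDraw

def draw_marker(image, center, color, radius=6):
    """Overlay a filled circular visual marker at a structure's centroid."""
    draw = ImageDraw.Draw(image)
    x, y = center
    draw.ellipse([x - radius, y - radius, x + radius, y + radius], fill=color)
    return image

def relative_position_question(name_a, name_b):
    """Pose a binary relative-position question to the VLM (illustrative wording)."""
    return f"Is the {name_a} to the left of the {name_b} in this image? Answer yes or no."

def ground_truth_left_of(centroid_a, centroid_b):
    """Image-space ground truth: a smaller x-coordinate means further left in the image."""
    return centroid_a[0] < centroid_b[0]

def query_vlm(image, question):
    """Hypothetical placeholder for a VLM call; swap in a real model here."""
    return "yes"

# Hypothetical example: two structure centroids on a dummy image.
image = Image.new("RGB", (512, 512), "black")
centroids = {"structure A": (160, 300), "structure B": (370, 310)}
image = draw_marker(image, centroids["structure A"], "red")
image = draw_marker(image, centroids["structure B"], "blue")

question = relative_position_question("red-marked structure", "blue-marked structure")
answer = query_vlm(image, question)
correct = (answer.strip().lower() == "yes") == ground_truth_left_of(
    centroids["structure A"], centroids["structure B"]
)
print(question, "->", answer, "| correct:", correct)
```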

The Story of our Paper

Coming soon.

How to use our Benchmark and Code

Coming soon.

BibTeX

@article{wolf2024less,
  title   = {Less is More: Selective reduction of CT data for self-supervised pre-training of deep learning models with contrastive learning improves downstream classification performance},
  author  = {Wolf, Daniel and Payer, Tristan and Lisson, Cathrina Silvia and Lisson, Christoph Gerhard and Beer, Meinrad and G{\"o}tz, Michael and Ropinski, Timo},
  journal = {Computers in Biology and Medicine},
  volume  = {183},
  year    = {2024},
  doi     = {10.1016/j.compbiomed.2024.109242}
}