
Abstract

Multimodal large language models (LLMs) are increasingly applied in radiology, but the effect of reasoning capabilities across text- and image-based tasks remains unclear. We evaluated four multimodal LLMs, two non-reasoning (ChatGPT-4, Gemini 1.5 Pro) and two reasoning-capable (ChatGPT-5.1, Gemini 3), using 50 text-based and 50 arrow-localized musculoskeletal (MSK) radiographic anatomy questions, and compared their performance with that of two board-certified radiologists. Accuracy with 95% confidence intervals was calculated, and image-based errors were categorized. Reasoning-capable models outperformed non-reasoning models in text-based tasks, achieving near-ceiling accuracy (96% and 94%; all p≤0.008) with minimal prompt sensitivity. In image-based tasks, reasoning models performed better than non-reasoning models (70–72% vs 46–48%; p<0.001) but remained inferior to radiologists (88–90%). Errors were mainly adjacent-structure substitution and projection-related overlap. While reasoning enhances text-based performance and robustness, multimodal LLMs remain limited in fine-grained visual grounding and are best suited for supportive roles.
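
The abstract reports accuracy with 95% confidence intervals but does not say which interval method was used. As an illustration only, the sketch below computes a Wilson score interval for a proportion of correct answers. The choice of the Wilson method, the function name `wilson_interval`, and the example count of 35 correct out of 50 (which assumes the reported 70% refers to a 50-question set) are assumptions, not details from the study.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Hypothetical helper: the source does not state its interval method.
    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")

    p_hat = correct / total
    z2 = z * z
    denom = 1 + z2 / total
    centre = (p_hat + z2 / (2 * total)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / total + z2 / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Assumed example: 35 of 50 image-based questions answered correctly (70%).
    low, high = wilson_interval(35, 50)
    print(f"accuracy = {35 / 50:.0%}, 95% CI = ({low:.3f}, {high:.3f})")
```

With only 50 questions per task the interval is wide, which is worth keeping in mind when comparing models whose accuracies differ by a few percentage points.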


Keywords

models, multimodal, reasoning, image-based, tasks
