Evaluating Reasoning Effect for LLMs: Prompt Sensitivity and Text-Image Based Performance in Musculoskeletal Radiology

Authors: Eren Çamur, Turay Cesur, Yasin Celal Güneş

Publication: Studies in Health Technology and Informatics, Opening the Personal Gate between Technology and Health Care

Published: May 21, 2026

Source: Crossref

Abstract

Multimodal large language models (LLMs) are increasingly applied in radiology, but the effect of reasoning capabilities across text- and image-based tasks remains unclear. We evaluated four multimodal LLMs, two non-reasoning (ChatGPT-4, Gemini 1.5 Pro) and two reasoning-capable (ChatGPT-5.1, Gemini 3), using 50 text-based and 50 arrow-localized musculoskeletal (MSK) radiographic anatomy questions, and compared them with two board-certified radiologists. Accuracy with 95% confidence intervals was calculated, and image-based errors were categorized. Reasoning-capable models outperformed non-reasoning models in text-based tasks, achieving near-ceiling accuracy (96% and 94%; all p≤0.008) with minimal prompt sensitivity. In image-based tasks, reasoning models performed better than non-reasoning models (70–72% vs 46–48%; p<0.001) but remained inferior to radiologists (88–90%). Errors were mainly adjacent-structure substitution and projection-related overlap. While reasoning enhances text-based performance and robustness, multimodal LLMs remain limited in fine-grained visual grounding and are best suited for supportive roles.
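The abstract reports accuracy with 95% confidence intervals but does not say which interval method was used. As a minimal sketch, assuming a Wilson score interval and assuming each reported percentage refers to a 50-question set (so 96% corresponds to 48 of 50 correct), the calculation could look like the following. The function name `wilson_interval` and the choice of method are illustrative assumptions, not the authors' stated procedure.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption: the source does not name its CI method; Wilson is used here
    only as a common choice for small samples with proportions near 0 or 1.
    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Correct-answer counts derived from the percentages in the abstract,
# assuming 50 questions per task (for example, 0.96 * 50 = 48).
reported = {
    "text, reasoning (96%)": 48,
    "image, reasoning (72%)": 36,
    "image, non-reasoning (46%)": 23,
    "image, radiologist (90%)": 45,
}

for label, correct in reported.items():
    low, high = wilson_interval(correct, 50)
    print(f"{label}: {correct / 50:.0%} (95% CI {low:.1%} to {high:.1%})")
```

With only 50 questions per task, intervals of this kind are wide, which is worth keeping in mind when comparing models whose accuracies differ by a few percentage points.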

Keywords

models, multimodal, reasoning, image-based, tasks
