Abstract
<jats:p>Generating diagnostic reports from chest X-ray images is a complex task that requires both accurate interpretation of medical images and clear clinical language. Manual reporting is time-consuming, demands specialist expertise, and is prone to fatigue-related errors. Although deep learning–based systems have been developed to support this process, most traditional approaches treat image analysis and report generation as separate tasks, which leads to inconsistencies and the loss of critical details. Recent vision–language models offer a more unified solution by linking visual understanding with natural language generation. This chapter proposes an end-to-end framework that integrates a Swin Transformer image encoder, a Q-Former, and a fine-tuned BioMedLM language model to produce accurate and clinically meaningful reports. Experiments on a curated chest X-ray dataset demonstrate improved pathology coverage, fewer errors, and clearer reports, supporting reliable and efficient clinical decision-making.</jats:p>