
Comparing LLMs to Extract Recommendations for Additional Imaging: Performance, Shortcomings, and Next Steps

Large Language Models (LLMs) have become a cornerstone of natural language processing (NLP) across complex fields, including radiology. While they excel at various information extraction tasks, they also have notable limitations. This post explores key performance aspects of widely available LLMs (GPT, Claude, and Llama) in radiology report analysis, highlights their common shortcomings, and points to a modular alternative designed to address these challenges.

Overview of LLM Performance in Radiology

LLMs are adept at handling nuanced language and context. In medical documents, including radiology reports, they can detect terms like "nodule," "lesion," and "mass," which may be used interchangeably to describe similar findings [1]. This language comprehension capability has made them attractive for identifying critical details in vast collections of unstructured reports.

Strength in Language Comprehension

  • Context-Aware: LLMs interpret terms used in variable ways, enabling them to identify essential sections of a report even when different terminology is employed [1].
  • Adaptability: LLMs perform well across multiple text-based tasks, from summarizing reports to extracting clinically relevant entities [1][2].

Impact on Data Extraction

  • Broad Applicability: LLMs can handle text from diverse imaging modalities (e.g., computed tomography and magnetic resonance imaging) and glean important information such as the size, presence, or location of nodules [1][2].
  • Ease of Deployment: Prompt-based approaches make it relatively straightforward to obtain initial results without extensive custom development [2]; a minimal prompt sketch follows below.
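
As a concrete illustration of this prompt-based approach, the sketch below asks a general-purpose LLM to return nodule findings as structured JSON. It is a minimal sketch under stated assumptions: the prompt wording, the `call_llm` placeholder, and the `extract_findings` helper are illustrative and not the exact setup used in the cited studies.

```python
import json

# Illustrative prompt (an assumption, not wording from the cited work):
# ask for structured JSON so downstream code can parse the answer.
PROMPT_TEMPLATE = """Extract every finding from the radiology report below.
Return JSON with a top-level "findings" list; each finding needs
"term" (e.g. nodule, lesion, mass), "location", and "size_mm" (null if unstated).

Report:
{report}
"""

def call_llm(prompt: str) -> str:
    """Placeholder for whichever model endpoint is available (GPT, Claude, Llama, ...)."""
    raise NotImplementedError

def extract_findings(report_text: str) -> list[dict]:
    # Real pipelines also need validation here: the model can return malformed
    # or hallucinated JSON, which is one of the shortcomings discussed below.
    raw = call_llm(PROMPT_TEMPLATE.format(report=report_text))
    return json.loads(raw)["findings"]

# Example usage (once call_llm is wired to an actual model):
# extract_findings("Chest CT: 6 mm nodule in the right upper lobe.")
# -> [{"term": "nodule", "location": "right upper lobe", "size_mm": 6}]
```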

These benefits align with the general observation that LLMs often outperform simpler, rule-based methods in tasks requiring a deeper understanding of text [2].

Shortcomings of LLMs

Despite their success, out-of-the-box LLMs such as GPT-4 exhibit several drawbacks that can significantly impact radiology use cases.

Numeric Reasoning Gaps

Research has shown that LLMs face challenges in accurately interpreting and comparing numeric values [3]. Properly distinguishing a 5 mm nodule from a 6 mm nodule or identifying subtle size variations can exceed the default capabilities of certain LLMs [3].
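
One practical mitigation is to keep the arithmetic outside the model: let the LLM (or a simpler extractor) surface the measurement text, then normalize units and compare against a follow-up threshold deterministically. The regular expression and the 6 mm threshold below are illustrative assumptions, not values from the cited work.

```python
import re

# Matches measurements such as "6 mm" or "0.6 cm"; an illustrative pattern that
# ignores ranges and multi-dimension sizes a production system would also need.
SIZE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(mm|cm)", re.IGNORECASE)

def sizes_in_mm(text: str) -> list[float]:
    """Extract every stated size and normalize it to millimeters."""
    return [
        float(value) * (10.0 if unit.lower() == "cm" else 1.0)
        for value, unit in SIZE_PATTERN.findall(text)
    ]

def exceeds_threshold(text: str, threshold_mm: float = 6.0) -> bool:
    """Deterministic size comparison instead of asking the LLM to compare numbers."""
    return any(size >= threshold_mm for size in sizes_in_mm(text))

print(sizes_in_mm("A 0.6 cm nodule; a second 5 mm nodule is unchanged."))  # [6.0, 5.0]
print(exceeds_threshold("A 0.6 cm nodule is seen."))                       # True
```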

Black-Box Complexity

  • Limited Explainability: It is often unclear how an LLM arrives at a specific conclusion or what underlying factors drive its output. Transparency is crucial in a high-stakes domain like healthcare [3].
  • Hallucinations: LLMs may produce text that appears correct linguistically but is factually incorrect. If not carefully audited, these errors can go unnoticed, posing serious risks in medical contexts [3] (a simple audit check is sketched below).
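
A lightweight audit that needs no access to model internals is to check that every extracted value can be traced back to the source report, flagging anything that cannot for human review. The sketch below assumes findings shaped like the JSON output from the earlier prompt example; it catches unsupported output rather than proving correctness.

```python
def audit_finding(report_text: str, finding: dict) -> list[str]:
    """Flag extracted fields that cannot be traced back to the report text.

    This does not prove an extraction is correct, but it catches the common
    failure mode of fluent output with no support in the source document.
    """
    problems = []
    report_lower = report_text.lower()

    location = finding.get("location")
    if location and location.lower() not in report_lower:
        problems.append(f"location '{location}' not found in report")

    size_mm = finding.get("size_mm")
    if size_mm is not None:
        # Accept either a mm or cm rendering of the stated size.
        candidates = (f"{size_mm:g} mm", f"{size_mm / 10:g} cm")
        if not any(c in report_lower for c in candidates):
            problems.append(f"size {size_mm} mm not stated in report")

    return problems

report = "Chest CT: 6 mm nodule in the right upper lobe."
print(audit_finding(report, {"location": "right upper lobe", "size_mm": 6}))  # []
print(audit_finding(report, {"location": "left lower lobe", "size_mm": 8}))   # two flags
```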

Data Requirements

Training or fine-tuning LLMs often demands hundreds of specialized examples. Acquiring that many labeled radiology reports is expensive and time-consuming, and the protected patient data they contain further restricts how freely they can be shared [2].

A Modular Alternative: Breaking Down the Task

A more targeted strategy for handling the complexities of radiology reports—particularly when determining if certain findings (like pulmonary nodules) meet specific size thresholds—emphasizes modularity. This approach breaks down the larger information-extraction challenge into smaller, focused components, each of which can be independently refined or replaced as new requirements emerge [4].

A key advantage of this design is that it allows for greater transparency and flexibility than a single, monolithic model. Should classification errors or ambiguities arise—such as incorrectly identifying nodules in regions other than the lungs—engineers and clinicians can more easily trace and address the source within a single component without disrupting the rest of the system [4]. 
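
To make the decomposition concrete, the sketch below wires small, independently replaceable stages into one pipeline: detect candidate findings, then keep only pulmonary ones; further stages such as size extraction or threshold rules would slot in the same way. The component boundaries and the keyword heuristics are assumptions for illustration, not the exact architecture described in [4].

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Finding:
    sentence: str
    location: Optional[str] = None
    size_mm: Optional[float] = None

# Each stage is a plain function, so a weak component (e.g. a location filter
# that confuses thyroid and lung nodules) can be swapped out in isolation.
Stage = Callable[[list[Finding]], list[Finding]]

def detect_candidates(report: str) -> list[Finding]:
    """Naive candidate detector: keep sentences that mention a nodule."""
    return [Finding(s.strip()) for s in report.split(".") if "nodule" in s.lower()]

def keep_pulmonary(findings: list[Finding]) -> list[Finding]:
    """Location filter: keep findings whose sentence mentions a lung-related term."""
    lung_terms = ("lobe", "lung", "pulmonary")
    return [f for f in findings if any(t in f.sentence.lower() for t in lung_terms)]

def run_pipeline(report: str, stages: list[Stage]) -> list[Finding]:
    findings = detect_candidates(report)
    for stage in stages:
        findings = stage(findings)
    return findings

report = "Chest CT: 6 mm nodule in the right upper lobe. Thyroid nodule noted."
print(run_pipeline(report, [keep_pulmonary]))
# Only the pulmonary finding survives the location filter.
```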

Performance Insights

Early evaluations indicate that while the system can correctly identify potential nodules, distinguishing them from similarly described findings in other body regions remains challenging. Because the methodology is split into smaller, focused components, however, the underperforming component can be updated without reworking the rest of the pipeline [4].

Next Steps

Ongoing work aims to refine key components, enhance domain-specific accuracy, and explore additional strategies for explaining model outputs in a high-stakes clinical context. The framework's modular design also allows the integration of new tasks or rules with minimal disruption to existing system parts [4].

While LLMs show promise in radiology report analysis, they face challenges in numeric reasoning, explainability, and data requirements. A modular approach offers a potential solution, balancing flexibility, interpretability, and scalability, making it well-suited for evolving radiology-related needs. Future research should focus on enhancing technical performance and explainability, particularly in healthcare settings where accurate interpretation of outputs is critical [3][4].

 

Sources

[1] Large language models simplify radiology report impressions https://www.auntminnie.com/imaging-informatics/artificial-intelligence/article/15667295/large-language-models-simplify-radiology-report-impressions

[2] a performance comparison between GPT-4, GPT-3.5-turbo, Flan-T5 ... https://academic.oup.com/jamiaopen/article/7/3/ooae060/7705527

[3] Evaluation and mitigation of the limitations of large language models ... https://www.nature.com/articles/s41591-024-03097-1

[4] Modular Approach to Solve Problems at Scale in Healthcare NLP https://www.nlpsummit.org/spark-nlp-for-healthcare-batteries-included/