Reviewing the Most Influential Papers on LLMs in Healthcare: Insights and Implications
- Shafi Ahmed
- Mar 4
- 9 min read
Large Language Models (LLMs), such as GPT-4, Gemini, and others, have become transformative tools across various industries, particularly in healthcare. These sophisticated models, proficient in comprehending and producing human-like text, have facilitated advancements in medical diagnostics, patient care, research, and other domains. Nonetheless, their implementation presents challenges that necessitate a comprehensive understanding of their foundational methodologies, capabilities, and limitations.
This week's edition of AI Horizons features a curated analysis of the most impactful papers on large language models in healthcare. These pioneering studies illustrate how researchers are using LLMs to enhance healthcare delivery, and they reveal the significant gaps that remain in the field.

AI vs. Doctors: Evaluating GPT-4's Role in Complex Medical Decision-Making
As artificial intelligence continues to make strides across various fields, its potential in healthcare has sparked both excitement and caution. With AI tools like GPT-4 and GPT-4o gaining attention for their natural language processing abilities, questions have emerged about how well they can assist in medical decision-making.
A new study titled “ChatGPT (GPT-4) versus doctors on complex cases of the Swedish family medicine specialist examination: an observational comparative study” compared GPT-4 and GPT-4o with Swedish family medicine doctors on open-ended case scenarios, revealing significant gaps in AI performance. Unlike multiple-choice tests, these cases required a nuanced understanding of multimorbidity, social issues, compliance, and legal aspects core to general practice. Top doctors scored 7.2/10, average doctors 6.0/10, GPT-4 4.5/10, and GPT-4o 5.2/10. Key findings highlight that GPT-4 lags behind doctors in suggesting diagnoses, tests, and referrals, and in addressing legal considerations. It also falls short on psychosocial and common medical issues, as well as medication details. However, GPT-4o improved on GPT-4, underscoring how rapidly these models are advancing. [1]
This study notes three caveats: zero-shot prompting without specialised training, exclusion of the latest GPT models, and lack of fine-tuning for medical use or Swedish context. It suggests AI like GPT-4o should not currently be used for clinical decision-making but shows potential in non-medical tasks to reduce doctors' workload. The authors stress the importance of defining “good enough” AI performance for healthcare and ensuring AI supports rather than replaces doctors. This research sets a high standard for evaluating AI’s role in medicine while reaffirming the essential role of human clinicians in patient care.
What, then, might address these shortcomings? Another study, titled "RAG in Health Care: A Novel Framework for Improving Communication and Decision-Making by Addressing LLM Limitations", explores the potential of retrieval-augmented generation (RAG) to enhance the capabilities of large language models (LLMs) in the healthcare sector. [2]
RAG addresses these limitations by connecting LLMs with external knowledge sources, allowing them to access information beyond their training data. This includes peer-reviewed studies, medical compendiums, and internal policies of healthcare organisations. By leveraging RAG, generative AI tools can consider public and private information, improving accuracy and relevance in healthcare settings. The study outlines RAG's current and future use cases for healthcare information exchange in both clinical and industrial settings.
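To make the pattern concrete, here is a minimal sketch of a RAG pipeline in Python. The toy document store, the bag-of-words scorer, and the `call_llm` stub are illustrative assumptions on my part, not code from the paper; a production system would use dense embeddings for retrieval and a real LLM API for generation.

```python
# Minimal RAG sketch: retrieve relevant documents, then ask the model to
# answer using only that retrieved context.
from collections import Counter
import math

# Hypothetical knowledge base: peer-reviewed guidance plus internal policy.
DOCUMENTS = {
    "guideline_htn": "Adults with stage 1 hypertension should start lifestyle "
                     "changes; reassess blood pressure in 3 to 6 months.",
    "policy_referral": "Internal policy: cardiology referral requires two "
                       "elevated readings on separate visits.",
}

def score(query: str, doc: str) -> float:
    """Cosine similarity over simple bag-of-words counts."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    overlap = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return overlap / norm if norm else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    ranked = sorted(DOCUMENTS.values(), key=lambda d: score(query, d), reverse=True)
    return ranked[:k]

def call_llm(prompt: str) -> str:
    """Stand-in for any chat-completion API call."""
    return "[model response grounded in retrieved context]"

def answer(query: str) -> str:
    # Grounding the prompt in retrieved text is what lets the model cite
    # sources beyond its training data.
    context = "\n".join(retrieve(query))
    prompt = (f"Answer using ONLY the context below; say so if it is "
              f"insufficient.\n\nContext:\n{context}\n\nQuestion: {query}")
    return call_llm(prompt)

print(answer("When should a patient with high blood pressure be referred?"))
```

The key design choice is that the model is instructed to rely on the retrieved context and to admit when it is insufficient, which is what curbs hallucination relative to a plain prompt.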
Advancing Healthcare with Large Language Models: Applications, Challenges, and Future Directions
Large Language Models (LLMs) are significantly impacting the medical field, transforming diagnostics, communication, and decision-making. Research has explored their potential, limitations, and ethical considerations in healthcare.
One study, titled "Superhuman Performance of a Large Language Model on the Reasoning Tasks of a Physician," evaluated OpenAI's o1-preview model across various medical tasks. The results showed significant improvements in differential diagnosis and diagnostic management but no advancements in probabilistic reasoning or triage. The study calls for more robust benchmarks to compare LLMs with human physicians in clinical reasoning tasks. [3]
Another paper, "The Future Landscape of Large Language Models in Medicine," explores the wide-ranging applications of LLMs, from patient care and medical research to education. While LLMs can aid in medical documentation, enhance patient communication, and democratise access to scientific knowledge, concerns about misinformation, privacy, and biases are raised. The authors stress the importance of developing responsible interaction guidelines and non-commercial open-source projects to prevent monopolies in medical knowledge. [4]
The paper "The Application of Large Language Models in Medicine: A Scoping Review" reviews 550 studies on LLMs in healthcare. It highlights their transformative impact on diagnostics, medical writing, and patient communication. However, the study also points out challenges such as limited contextual understanding and the risk of over-reliance. The need for ethical integration and empirical studies in clinical settings is emphasised to ensure responsible use. [5]
Lastly, "Large Language Models in Medical and Healthcare Fields: Applications, Advances, and Challenges" provides an overview of how LLMs are applied across various medical tasks, including clinical decision support and electronic health record generation. While LLMs have revolutionized healthcare, challenges such as data security, bias, and accountability remain. Solutions like de-identification frameworks and fairness-promoting prompting methods are proposed to address these concerns. [6]
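To illustrate the first of those proposed solutions, the sketch below shows the simplest form a de-identification step can take: rule-based masking of identifiers before clinical text reaches an LLM. The regex patterns and the sample note are invented for illustration and fall far short of any real PHI framework, which would also handle names, addresses, and free-text identifiers.

```python
# Toy rule-based de-identification: mask obvious identifiers in a note
# before it is sent to a language model. Illustrative only.
import re

PATTERNS = {
    "DATE": r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
    "PHONE": r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b",
    "MRN": r"\bMRN[:\s]*\d+\b",
}

def deidentify(text: str) -> str:
    """Replace each matched identifier with a bracketed placeholder."""
    for label, pattern in PATTERNS.items():
        text = re.sub(pattern, f"[{label}]", text)
    return text

note = "Pt seen 03/04/2025, MRN: 84721, call 555-013-2287 with results."
print(deidentify(note))  # -> Pt seen [DATE], [MRN], call [PHONE] with results.
```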
These studies show that LLMs are transforming medicine, but there are still hurdles to overcome. From enhancing diagnostic accuracy to improving patient communication, the future is promising, but it requires careful integration, rigorous testing, and ethical oversight.
Cognitive Challenges in AI: Assessing the Mental Agility of Large Language Models
The research paper titled "Age Against the Machine—susceptibility of large language models to cognitive impairment: cross-sectional analysis" evaluates the cognitive abilities of leading large language models (LLMs) using the Montreal Cognitive Assessment (MoCA) and additional tests. The study involved ChatGPT versions 4 and 4o, Claude 3.5 "Sonnet," and Gemini versions 1 and 1.5. ChatGPT 4o achieved the highest MoCA score (26/30), followed by ChatGPT 4 and Claude (25/30), with Gemini 1.0 scoring the lowest (16/30). Moreover, all LLMs showed poor performance in visuospatial/executive tasks, and only ChatGPT 4o succeeded in the incongruent stage of the Stroop test. Most large language models, except ChatGPT 4o, displayed mild cognitive impairment on the MoCA test. Similar to humans, "older" chatbots showed more significant cognitive decline. These findings question the assumption that AI can soon replace human doctors, as cognitive limitations in chatbots may impact their diagnostic reliability and patient trust. [7]
Empathy in Code: Exploring the Ethical and Emotional Boundaries of AI
A recent study titled "Large Language Models and Empathy: Systematic Review" explores the capacity of large language models (LLMs) to demonstrate empathy. The study reviewed 12 publications from 2023, focusing on ChatGPT-3.5 and other LLMs like GPT-4 and LLaMA. The studies used various metrics, including automatic metrics like ROUGE and BLEU, and human subjective evaluations. LLMs exhibited elements of empathy, such as emotion recognition and providing emotional support, particularly in medical scenarios. In some cases, LLMs outperformed humans in empathy-related tasks. For example, ChatGPT-3.5's responses were preferred over human responses in 78.6% of cases when answering patient questions from social media. Challenges included repetitive use of empathic phrases, difficulty following initial instructions, and sensitivity to prompts. The paper concludes that while LLMs demonstrate cognitive empathy, there is room for improvement in their performance and evaluation strategies for assessing soft skills. The findings suggest that LLMs have the potential to enhance patient care by providing emotionally supportive responses, but further research is needed to refine these models and their applications in healthcare. [8]
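For context on the automatic metrics mentioned above, here is a minimal sketch implementing ROUGE-1 (unigram overlap) directly from its definition. The candidate and reference replies are invented examples, and the reviewed studies used full metric suites rather than this toy version.

```python
# ROUGE-1 from first principles: F1 over overlapping unigrams between a
# model-generated reply and a human reference reply.
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[t], ref[t]) for t in cand)
    if not overlap:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "I am sorry you are in pain; let us find a way to help you feel better."
candidate = "I am sorry to hear you are in pain and I want to help you feel better."
print(f"ROUGE-1 F1: {rouge1_f1(candidate, reference):.2f}")
```

Such surface-overlap scores are cheap to compute but blind to tone, which is why these studies paired them with human subjective ratings of empathy.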
The research paper "Empathic AI can’t get under the skin" discusses the potential and limitations of large language models (LLMs) in emulating empathy. The paper highlights the historical context of chatbots, starting with ELIZA in the 1960s and the evolution to modern LLMs capable of fluent human-like conversations. The researchers noted that humans tend to project human traits onto chatbots, perceiving them as empathetic even though they lack genuine emotional understanding. The paper raises ethical questions about using empathic AI, particularly in applications like romantic chatbots, personal assistants, and mental health apps. It questions whether users should be informed that AI empathy is simulated. Moreover, personalised LLMs tailored to individual preferences could lead to privacy concerns, echo chambers, and unhealthy emotional attachments. The rapid development of personalised LLMs necessitates careful ethical consideration to avoid potential negative impacts on users and society. This paper emphasises the importance of responsible deployment and ethical research to address these challenges. [9]
Reflecting on these studies, it is clear that while large language models (LLMs) show impressive potential in exhibiting empathy, especially in medical contexts, we must remember that this empathy is simulated, not genuine. As we continue to explore and refine these models, it is essential to balance innovation with ethical considerations, ensuring that their use enhances patient care without compromising privacy or emotional well-being.
AI Scribes in Healthcare: Resolving the Paradox of Conflicting Study Outcomes
The debate around AI scribes is buzzing: are they revolutionary or overrated? The answer remains elusive, as even the research offers conflicting results. Take DAX Copilot, an AI tool for ambient documentation. Two recent studies examined its impact, yet their conclusions could not be more different.
One study, published in NEJM AI, analysed 112 Atrium Health primary care physicians using DAX Copilot versus 103 controls who did not. The results? No reduction in time spent on electronic health records (EHRs) and no increase in revenue per visit. [10]
In contrast, a JAMIA study evaluated 50 Stanford primary care physicians before and three months after adopting DAX. This time, the findings painted a brighter picture: reduced workload, decreased burnout, and better usability than traditional documentation methods. [11]
Why the discrepancy? The study designs and metrics were fundamentally different. The NEJM AI study compared DAX users with a matched control group, focusing on EHR time and financial outcomes. Meanwhile, the JAMIA study assessed the same physicians pre- and post-DAX adoption, examining task load, burnout, and usability. Then there is the context. No two health systems are identical—each has its own EHR setup, culture, workflows, and team dynamics. These variables can significantly influence AI's impact.
AI tools like DAX affect clinical efficiency, economic outcomes, and physician well-being in multifaceted ways. Pinpointing what matters most—and measuring it—remains a challenge. Unsurprisingly, we see variability in healthcare AI experiments, especially in the short term. Context is king, and as we dive deeper into AI's role, these nuances will be key to unlocking its full potential.
Unanswered Questions and Future Directions
Integrating LLMs into healthcare is a transformative development, offering opportunities to enhance patient care, streamline workflows, and democratise access to medical knowledge. However, these advancements come with responsibilities. The studies reviewed in this newsletter reflect LLMs' potential and pitfalls, underscoring the need for thoughtful, evidence-based implementation. Despite their promise, LLMs in healthcare are far from perfect. The reviewed papers expose several unresolved issues, such as bias in training data, the difficulty of keeping models current with healthcare's rapid evolution, and limited accessibility.
As we look to the future, clinicians, researchers, and policymakers need to collaborate in shaping the trajectory of LLMs in healthcare.
Subscribe to my Newsletter to remain at the forefront of AI innovations in healthcare.
References:
1. Arvidsson, R., Gunnarsson, R., Entezarjou, A., Sundemo, D., & Wikberg, C. (2024). ChatGPT (GPT-4) versus doctors on complex cases of the Swedish family medicine specialist examination: an observational comparative study. BMJ Open, 14(12), e086148. https://doi.org/10.1136/bmjopen-2024-086148
2. Ng, K. K. Y., Matsuba, I., & Zhang, P. C. (2024). RAG in Health Care: A Novel Framework for Improving Communication and Decision-Making by Addressing LLM Limitations. NEJM AI. https://doi.org/10.1056/aira2400380
3. Brodeur, P. G., Buckley, T. A., Kanjee, Z., Goh, E., Ling, E. B., Jain, P., Cabral, S., Abdulnour, R., Haimovich, A., Freed, J. A., Olson, A., Morgan, D. J., Hom, J., Gallo, R., Horvitz, E., Chen, J., Manrai, A. K., & Rodman, A. (2024). Superhuman performance of a large language model on the reasoning tasks of a physician. arXiv. https://arxiv.org/abs/2412.10849
4. Clusmann, J., Kolbinger, F.R., Muti, H.S. et al. The future landscape of large language models in medicine. Commun Med 3, 141 (2023). https://doi.org/10.1038/s43856-023-00370-1
5. Meng, X., et al. (2024). The application of large language models in medicine: A scoping review. iScience, 27(5), 109713. https://doi.org/10.1016/j.isci.2024.109713
6. Wang, D., & Zhang, S. (2024). Large language models in medical and healthcare fields: applications, advances, and challenges. Artificial Intelligence Review, 57(11). https://doi.org/10.1007/s10462-024-10921-0
7. Dayan, R., Uliel, B., & Koplewitz, G. (2024). Age against the machine—susceptibility of large language models to cognitive impairment: cross sectional analysis. BMJ, e081948. https://doi.org/10.1136/bmj-2024-081948
8. Sorin, V., Brin, D., Barash, Y., Konen, E., Charney, A., Nadkarni, G., & Klang, E. (2024). Large language models and empathy: Systematic review. Journal of Medical Internet Research, 26, e52597. https://doi.org/10.2196/52597
9. Empathic AI can’t get under the skin. (2024). Nature Machine Intelligence, 6(5), 495. https://doi.org/10.1038/s42256-024-00850-6
10. Liu, T., Hetherington, T. C., Dharod, A., Carroll, T., Bundy, R., Nguyen, H., Bundy, H. E., Isreal, M., McWilliams, A., & Cleveland, J. A. (2024). Does AI-powered clinical documentation enhance clinician efficiency? A longitudinal study. NEJM AI. https://doi.org/10.1056/aioa2400659
11. Shah, S. J., Devon-Sand, A., Ma, S. P., Jeong, Y., Crowell, T., Smith, M., Liang, A. S., Delahaie, C., Hsia, C., Shanafelt, T., Pfeffer, M. A., Sharp, C., Lin, S., & Garcia, P. (2024). Ambient artificial intelligence scribes: physician burnout and perspectives on usability and documentation burden. Journal of the American Medical Informatics Association. https://doi.org/10.1093/jamia/ocae295