Large language models (LLMs) like ChatGPT are rapidly entering clinical workflows, even though none are approved by the U.S. Food and Drug Administration (FDA) for clinical decision support. Physicians have long relied on digital resources such as UpToDate or even Google, but LLMs are unique in their seeming ability to reason clinically—and in their flexibility and unpredictability.

Can these models be trusted? If clinicians are going to use them, what guidance can we offer for how to do so effectively and safely? Early evaluations have been impressive, including high performance on multiple-choice United States Medical Licensing Examination (USMLE) questions and, more recently, on complex clinical reasoning tasks, such as open-ended case vignettes, at times exceeding that of the average physician. But these exams do not reflect real-world practice. Many symptoms never coalesce into an obvious diagnosis, and many management decisions lack a single “correct” recommendation.

In a recent study published in the Journal of General Internal Medicine, we set out to examine how commercially available and commonly used LLMs respond to nuanced clinical management scenarios that are often encountered in practice. We crafted four concise prompts reflecting real-world challenges in inpatient care and posed them to six different models, five times each.

We found significant disagreement among the models' management recommendations, which, while notable, was not entirely surprising: six human clinicians would also likely disagree about what to do. More surprising was that the models often disagreed with themselves, changing their recommendations even when presented with identical prompts. OpenEvidence, the only domain-specific model tested (it is trained on biomedical literature), was the most consistent in its responses and the most concrete in the reasoning it displayed.

For clinicians already using these tools at the bedside, it is crucial to understand how LLMs resemble—and differ from—traditional digital tools. They are powerful engines for surfacing relevant literature and outlining clinical considerations. But when asked to reason through that information, they may generate inconsistent conclusions, even in rapid succession. A clinician who treats an LLM like a deterministic calculator may be lulled into false certainty. At a minimum, it is worth re-prompting or sampling responses from more than one model when seeking AI-based input.
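For readers who want to see what that kind of re-prompting looks like in practice, below is a minimal sketch in Python. It assumes access to the OpenAI Python SDK with an API key set in the environment; the model names, the example prompt, and the ask_repeatedly helper are illustrative only and are not part of the study's protocol.

```python
# Minimal sketch: pose the same clinical-management prompt to one or more
# models several times and inspect how much the answers vary.
# Assumes: `pip install openai` and OPENAI_API_KEY set in the environment.
# The model names and the prompt below are illustrative, not from the study.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "A hospitalized patient with atrial fibrillation and a recent GI bleed "
    "needs a decision about resuming anticoagulation. Should it be restarted "
    "now, and if so, with what agent? Answer in one sentence."
)

def ask_repeatedly(model: str, prompt: str, n: int = 5) -> list[str]:
    """Send the identical prompt n times and collect the answers."""
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        answers.append(resp.choices[0].message.content.strip())
    return answers

# Swap in whichever models you have access to.
for model in ["gpt-4o", "gpt-4o-mini"]:
    print(f"--- {model} ---")
    for i, answer in enumerate(ask_repeatedly(model, PROMPT), start=1):
        print(f"[{i}] {answer}")
```

Printing the raw answers side by side makes divergence visible at a glance, both across models and across repeated runs of the same model; the same pattern can be adapted to other vendors' SDKs or to domain-specific tools.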

The results underscore the need for guidance around safe integration of LLMs into clinical workflows. These models currently sit outside FDA oversight, yet growing evidence suggests they likely meet the definition of a medical device under the Federal Food, Drug, and Cosmetic Act, as amended by the 21st Century Cures Act. Policymakers are already debating whether the existing regulatory framework is sufficient, or whether a more flexible, adaptive approach is needed to keep pace with rapidly evolving AI tools.

Finally, for researchers, the study highlights the importance of moving beyond controlled, vignette-based evaluations toward prospective, real-world assessments. Understanding how LLMs behave within electronic health records, in collaboration with clinicians, and in high-stakes patient care will surface both new challenges and new opportunities to harness their potential responsibly.

As generative AI becomes more deeply embedded in clinical practice, this study offers an early but important reminder: LLMs can enrich clinical thinking but cannot replace it. They are best used as one perspective among many—a tool that can broaden consideration, not narrow it—and their role in medicine will depend not just on their intelligence, but on how wisely they are integrated into the complex decisions clinicians make every day.


The study “Variation in Large Language Model Recommendations in Challenging Inpatient Management Scenarios” by Susan Landon, Thomas Savage, S. Ryan Greysen, and Eric Bressman appeared in the October 7, 2025 issue of the Journal of General Internal Medicine.


Author

Julia Hinckley, JD

Director of Policy Strategy

