While large language models (LLMs) are not intended to be used for clinical decision-making, evidence suggests that health care providers are adopting them for this purpose at a growing rate. If marketed this way, these AI tools would likely meet the criteria for regulation as medical devices by the U.S. Food and Drug Administration (FDA). However, current regulatory frameworks do not adequately address this emerging use case.

A first-of-its-kind study from Penn LDI Senior Fellow and Assistant Professor of Medicine Gary Weissman simulates how two LLMs might be used by a clinician and demonstrates that they provide device-like clinical decision support, even when directed not to do so. This device-like output occurred most often in simulated emergency situations, suggesting a possible role for assisting non-clinical bystanders.

Weissman became interested in AI research because of its promise to improve clinical decisions and ultimately patient care, but that promise has not yet been realized. One reason is that the regulatory framework currently applied to AI and machine learning (ML) devices is the same one developed for medical devices decades ago. As a result, there are gaps in FDA oversight, particularly for generative AI such as LLMs. Weissman and colleagues hope to help shape new regulatory paradigms to ensure the safety, effectiveness, and equity of AI in health care.

We asked Weissman about the study findings, what they mean for LLMs in health care, and how federal regulations might be improved to confront the challenges of AI integration.

Weissman: An LLM is a computer program that reads enormous amounts of text from all over the internet – newspapers, web pages, Wikipedia, comment sections of message boards, scientific papers – and then learns from that text how to mimic conversations and produce its own text output. Just like the real internet, some of the text that the LLM learned from is accurate and useful, and some of it is nonsense. But the LLM doesn’t know how to tell the difference without a human giving it feedback. So, sometimes the output of an LLM is correct and can be very useful and easy to understand, but other times the output may be completely wrong and could even be harmful if somebody followed its advice.

In health care, an LLM might be good at answering a general question like “What is pneumonia?”, the same way Wikipedia is usually pretty good about providing that kind of information. But LLMs can’t safely be used for high-stakes, individual medical questions like “How should this pneumonia in this person be treated right now?”, and no one has tried to test LLMs rigorously in real medical situations like that anyway.

Weissman: We chose this study design for several reasons. First, we tried to mimic how an LLM might be used in practice for clinical decision support. In this case, we created text inputs simulating a clinician entering information about a medical case and asking for advice, and then asking for more advice as the clinical situation evolved. Second, our work was informed by the text of the FDA guidance document about what kinds of decision support software might qualify as a “device.” There are some very interesting assumptions in that guidance document, many of which are not necessarily evidence-based or built upon well-validated constructs. For example, the guidance document assumes that any decision support related to a “time-critical emergency” cannot be adequately reviewed by a clinician. Neither the concept of a “time-critical emergency” nor that of “understanding the basis for the recommendations” is clearly defined or understood, so there is a need for more empirical work to apply these concepts to real-world settings. Our study design doesn’t answer those questions directly, but it does apply those concepts in practice to see how they might play out in these simulated vignettes.

Traditional AI systems are deterministic, meaning you get the same answer with the same inputs every time. In contrast, generative AI systems like LLMs are inherently stochastic, meaning that if you input the same text 10 times, you may get 10 different outputs. This presents a challenge for evaluating LLMs. As a result, we repeated every combination of inputs five times and evaluated the results across all five outputs, reporting the proportion of instances that met certain criteria. We did see variability in these outputs.
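To make that evaluation approach concrete, here is a minimal sketch of what repeated sampling of a single vignette could look like in code. The model call, the toy classify_output() rule, and the prompt handling are illustrative assumptions, not the study’s actual materials or classification criteria.

```python
# Minimal sketch of repeated sampling of a single clinical vignette.
# Everything here (model name, prompt handling, classification rule) is a
# hypothetical stand-in, not the study's actual protocol or code.
from openai import OpenAI  # assumes the OpenAI Python client is installed

client = OpenAI()
N_REPEATS = 5  # each input combination was run five times in the study

def classify_output(text: str) -> bool:
    """Toy stand-in for judging whether a response is device-like,
    e.g., because it recommends a specific treatment."""
    markers = ["administer", "give", "start", "order"]
    return any(m in text.lower() for m in markers)

def device_like_proportion(vignette: str) -> float:
    """Sample the same vignette N_REPEATS times and report the fraction
    of stochastic outputs judged device-like."""
    flags = []
    for _ in range(N_REPEATS):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": vignette}],
            temperature=1.0,  # leave sampling stochastic, as in ordinary use
        )
        flags.append(classify_output(response.choices[0].message.content))
    return sum(flags) / len(flags)
```

In the study itself, determinations were made against the criteria in the FDA guidance rather than a keyword rule, and each scenario included follow-up prompts as the case evolved.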

Weissman: The key finding is that, across a broad range of clinical scenarios, and despite being given single and multiple examples of the types of responses that would qualify as non-device output under FDA guidance documents, the LLMs provided clinical decision support that would qualify them as devices. This was especially true when they were provided with time-critical scenarios.

Additionally, we found that sometimes the device-like output from LLMs contained recommendations that would be appropriate for a non-medical bystander, and other times the recommendations were appropriate only for a trained clinician. For example, when prompted with a case of a likely cardiac arrest, both GPT-4 and Llama-3 recommended calling emergency services and administering aspirin, both of which are reasonable for a bystander to do. However, both GPT-4 and Llama-3 also recommended administering supplemental oxygen, and Llama-3 further suggested placing an intravenous catheter. These recommendations are more appropriate for a trained clinician. Thus, these findings raise questions about how LLMs should be regulated as they could provide decision support for medical and non-medical users in emergency contexts.

Weissman: This study is the first to examine how a generative AI tool fits into the larger regulatory ecosystem of AI/ML tools. Certainly, there are many ways in which even traditional AI/ML tools don’t fit well with existing regulatory frameworks. And generative AI tools, like LLMs, depart even further from those frameworks. Currently, most LLMs have disclaimers that they should NOT be used to guide clinical decisions, but LLMs are being used in this way in practice. These technologies warrant a lot of careful thinking to understand how they should be categorized, evaluated, and overseen. I hope this work contributes to the conversation about what is needed to use generative AI safely in a health care context.

Weissman: First, effective regulation may require new methods to better align LLM output with device-like or non-device decision support, depending on its intended use. For example, a medical device is typically approved for a particular intended use, but it is very hard to get LLMs to stay focused, so to speak, and they will provide answers to many questions, even ones they aren’t necessarily supposed to address. One way to address this issue is to develop better methods for constraining LLM output to avoid “off-label” uses, the way many LLM developers have built in safety mechanisms to prevent their LLMs from infringing on copyrights or providing dangerous information.
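As a rough illustration of what such an output constraint might look like, here is a hypothetical sketch that screens each request against a single approved indication before answering and refuses anything outside it. The indication, the screening prompt, and the refusal text are all assumptions for illustration; the safeguards actually built by LLM developers are considerably more sophisticated.

```python
# Hypothetical sketch of constraining an LLM to a single approved indication.
# The indication, screening prompt, and refusal message are illustrative
# assumptions, not an actual regulatory or vendor safeguard.
from openai import OpenAI

client = OpenAI()
APPROVED_INDICATION = "community-acquired pneumonia"  # hypothetical approved use

REFUSAL = (
    "This tool is only intended to support decisions about "
    f"{APPROVED_INDICATION}. Please direct other questions to a clinician."
)

def within_indication(user_prompt: str) -> bool:
    """Ask the model whether the request falls within the approved indication.
    A production system would rely on a validated classifier instead."""
    verdict = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Does the following request concern {APPROVED_INDICATION}? "
                f"Answer YES or NO only.\n\n{user_prompt}"
            ),
        }],
        temperature=0,
    )
    return verdict.choices[0].message.content.strip().upper().startswith("YES")

def constrained_answer(user_prompt: str) -> str:
    """Refuse off-label requests; otherwise pass the prompt through."""
    if not within_indication(user_prompt):
        return REFUSAL
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": user_prompt}],
    )
    return reply.choices[0].message.content
```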

Second, another approach to the regulation of LLMs could require new authorization pathways not anchored to specific indications. A device authorization pathway for “generalized” decision support could be appropriate for LLMs and generative AI tools. This would be exciting and probably easier than constraining LLM output to a single indication. On the other hand, I don’t think anyone knows yet how to ensure safety, effectiveness, and equity with such a broad scope of activity.

Third, regulators could treat LLMs differently based on whether they are used by clinicians or by bystanders, because their needs vary. The current FDA regulatory model considers most clinical decision support systems intended for patients and caregivers to be devices, thus warranting FDA review. But an interesting finding of our study is that many of the suggestions from the LLMs, even in emergency situations, were consistent with bystander standards of care. These included performing CPR in the setting of a cardiac arrest or administering naloxone for an opioid overdose. I don’t think these distinctions have been incorporated yet into existing regulatory frameworks, but they are worth considering because LLMs are so ubiquitous and could play a useful role in emergency situations outside of typical medical environments.

Weissman: It’s hard to know what to do with this next. The landscapes of AI regulation in general, and in health care in particular, have changed dramatically over the past few months under the new administration. That makes it hard to know what federal regulators and policymakers are thinking, which in turn makes it hard to know which policy decisions might benefit from empirical studies. Regardless of what the federal government decides to do about AI regulation, there is a growing opportunity for state governments and hospitals to fill oversight gaps. So that might be a more promising direction for impactful policy research in the next few years.


The study, “Unregulated Large Language Models Produce Medical Device-like Output,” was published in npj Digital Medicine on March 7, 2025, by Gary E. Weissman, Toni Mankowitz, and Genevieve P. Kanter.


Author

Christine Weeks

Director of Strategic Initiatives

