Large language models (LLMs) are artificial intelligence (AI) systems that understand and generate human language by processing vast amounts of text data. The launch of ChatGPT in late 2022 catapulted one such model into many Americans’ lives.

At the same time, but with less fanfare, a variety of large language models became common in some healthcare settings, powering, for example, chatbots that answer health-related questions.

Tech companies and academic researchers, including those at Penn, took notice. “Artificial intelligence hadn’t translated well into the health area, but with the advent of large language models, suddenly the potential for improving the delivery of care grew enormously,” said LDI Senior Fellow Hamsa Bastani, Associate Professor of Operations, Information, and Decisions at Wharton. Just under a year ago, Penn opened the Wharton Healthcare Analytics Lab (WHAL), with Bastani and LDI Senior Fellow Marissa King as faculty co-leads. The lab includes a postdoctoral fellow, 10 affiliated faculty members, several graduate students, and an executive director.

The lab’s mandate is to help bring sophisticated data analysis into healthcare. “We want analytics to have an impact, and for that impact to be transformative,” said King, Professor of Health Care Management at Wharton.

But there are enormous obstacles, such as differences of opinion among stakeholders about how to implement models. “I think in order to harness the potential of large language models, everybody from regulators to practitioners to data scientists needs to come together to help overcome those challenges,” King said.

A great starting point, she added, is the October 10 conference co-sponsored by LDI and WHAL called “(Re)writing the Future of Healthcare with Generative AI.” King sees the first-ever symposium as a “neutral space” where stakeholders can share concerns and possible solutions. 

While there is significant excitement about the possibilities of LLMs in healthcare, the lab takes a cautious and rigorous approach. At the conference, Bastani will moderate a session called “Drafting the LLM Playbook: Key Questions for Health Systems.” She intends to raise issues that inform the work at WHAL and concern audience members, including patient safety and data accuracy, bias and equity challenges, and data infrastructure.

Patient Safety and Data Accuracy

One of the most exciting applications of large language models involves using them to enter information into patients’ electronic health records (EHRs) and generate summaries of clinical visits. “Physicians hate doing documentation. They came into this profession because they want to look at patients eye-to-eye and talk to them about what they’re feeling. Physicians don’t want to write notes during visits. I have yet to talk to one who’s not super-excited about the possibility of LLMs taking over,” said Bastani.

A problem with LLMs is that they can produce hallucinations: false or missing information. For instance, a discharge summary from an emergency room might inaccurately tag a diabetic patient as having high blood pressure because the two conditions often go together. “These models are extrapolation machines, and so they often extrapolate correctly, but they can also extrapolate incorrectly,” Bastani explained. Such errors could later lead to treatment mistakes.

In work with the Somaliland Ministry of Health and Development in the Horn of Africa, Bastani and her graduate students have the advantage of working in a health data system that is just starting to digitize. “We’re testing an LLM intervention that strikes a good balance between reducing a physician’s workload and maintaining high accuracy in the clinical notes. We’re going to randomly audit a subset of these notes and have them reviewed by actual clinical professionals to make sure they’re accurate. If we test a couple thousand and there are no more than one or two inaccuracies, then we call that a win. We want to make sure that we’re not doing much worse than the typical human rate of error,” Bastani said.
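
As a rough illustration of this kind of audit (a sketch only; the note IDs, the review step, and the 0.1% human error-rate baseline below are assumptions, not details of the WHAL study), one could randomly sample generated notes for clinician review and compare the observed error rate to a human baseline:

```python
import random

def sample_for_audit(note_ids, audit_size, seed=0):
    """Draw a random subset of LLM-generated notes for clinician review."""
    rng = random.Random(seed)
    return rng.sample(note_ids, min(audit_size, len(note_ids)))

def audit_passes(review_results, human_error_rate=0.001):
    """review_results maps note_id -> True if a clinician flagged an
    inaccuracy. Pass if the observed error rate is no worse than the
    (assumed) human baseline."""
    n = len(review_results)
    errors = sum(review_results.values())
    observed = errors / n
    print(f"Audited {n} notes, {errors} inaccuracies ({observed:.2%})")
    return observed <= human_error_rate

# Hypothetical run: audit 2,000 of 10,000 notes; 2 flagged -> 0.1%, a "win."
note_ids = [f"note-{i}" for i in range(10_000)]
sampled = sample_for_audit(note_ids, audit_size=2_000)
reviews = {nid: (i < 2) for i, nid in enumerate(sampled)}  # simulated clinician reviews
print("win" if audit_passes(reviews) else "needs more review")
```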

In the U.S., such interventions may be hard to pull off. “But if physicians are motivated to carefully audit these summaries and treat them as drafts rather than final products, that helps a lot. So there’s a question of how you behaviorally incentivize them to do that. Various tech companies have shown pretty promising results, but they have not been subject to rigorous academic evaluations,” Bastani said. In addition, LLMs can be “fine-tuned” by retraining them on a new data set relevant to the precise task on which they need to perform well.
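
For the fine-tuning Bastani describes, a common recipe is to continue training an open-weight model on task-specific text. The minimal sketch below uses the Hugging Face transformers Trainer; the base model, file path, and hyperparameters are illustrative assumptions, and any real clinical use would require de-identified data in a compliant environment:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "gpt2"  # stand-in for whichever open-weight base model is used
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical local file of de-identified, task-specific examples.
dataset = load_dataset("text", data_files={"train": "clinical_notes.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-notes-model",
                           num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    # mlm=False -> standard next-token (causal) language modeling objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```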

Bias and Equity Challenges

Biases are rampant in healthcare data, especially EHRs, according to current thinking. “A major issue related to algorithms and equity is that we often don’t have high-quality data in our health systems for underserved populations or minorities, leading to worse health outcomes for them. I think when you’re dealing with algorithmic bias, the best thing to do is, if possible, to collect better data,” said Bastani. The “gold standard,” she added, would be for a healthcare system to reach out to patients who have failed to come in and offer them, say, free rides to appointments. “If you can’t do that, there are various tools in machine learning to help alleviate biases, but they’re never going to be as effective as just collecting more and better data on people,” Bastani said.
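
One family of such tools, offered here purely as an illustration (the article does not say which methods WHAL uses), is reweighting: giving records from underrepresented groups more weight during training so a model is not dominated by the best-documented patients. A minimal scikit-learn sketch with toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def inverse_frequency_weights(groups):
    """Weight each record inversely to its group's frequency, so that
    each group contributes equally to the training objective."""
    groups = np.asarray(groups)
    _, inverse, counts = np.unique(groups, return_inverse=True, return_counts=True)
    return len(groups) / (len(counts) * counts[inverse])

# Toy data: 200 patients, 5 features, a binary outcome, and a group label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)
groups = np.array(["well-documented"] * 180 + ["underserved"] * 20)

model = LogisticRegression()
model.fit(X, y, sample_weight=inverse_frequency_weights(groups))
```

This mirrors scikit-learn's class_weight="balanced" heuristic, applied to group membership rather than to the outcome label.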

Bias can also come into play after a model is built. LDI Senior Fellow Kevin Volpp, head of Penn’s Center for Health Incentives and Behavioral Economics, works with WHAL on a cardiovascular risk reduction initiative at Penn Medicine, exploring the use of AI to more efficiently steer resources to those at highest predicted risk. “An important issue is whether AI leads to resources being allocated systematically to some groups rather than others. We need to make sure that in using AI, we are improving how effective the programs are without making health inequities worse,” he said.

Data Infrastructure

Large language models offer an unparalleled opportunity to mine data from sources that have not previously been tapped in a systematic way. Capitalizing on LLMs’ ability to make sense of text and emotion, King is analyzing issues related to workforce well-being. “One thing we’re trying to do is to utilize data from electronic health records to understand where there’s likely to be a high risk of burnout or emotional overload in clinicians, especially nurses. There’s immensely rich data within clinical notes,” she said.

However, “obtaining and building the data set for the project is a Herculean effort. Arguably, the data challenges associated with AI and health care are the biggest obstacle to progress,” said King. These and other issues will all be up for discussion and brainstorming at the October 10 joint conference. 

A key challenge for health systems is building HIPAA-compliant environments that can support the training and fine-tuning of LLMs on patient-identifiable notes or audio conversation data. Until they do, the full potential of these technologies to improve healthcare delivery at scale will remain out of reach.


Author

Nancy Stedman

Journalist

