The problem(s) with large language models in the exam room
Volume 10, Issue 17 | October 31, 2024
ALSO IN THIS ISSUE
Bird flu update: Humans and (gulp) maybe pigs
Do less-invasive colorectal cancer screening tests measure up?
Bird Flu Update:
Good news from humans, possibly bad news from pigs
Remember the Missouri H5N1 patient who got sick without known exposure to sick animals? Tests confirmed that the patient did not pass the disease on to any other humans.
Yesterday, the USDA announced that it plans to start carrying out bulk milk testing “first at the regional level, with additional testing at the farm level if necessary, until herds in an area are determined to be free of the virus.” (Up to this point, bulk milk testing has been led by the states, not the feds.)
In news that scientists were really hoping wouldn’t happen, bird flu has been found in pigs here in the US. Why does this matter? Because pigs can get infected with both bird and human flu viruses. If a pig gets infected with both at once, the viruses “can potentially create a hybrid that is better able to spread to people than bird flu viruses typically are,” STAT News explains. However, this news might be a false alarm. Since the pigs lived alongside infected poultry, it’s possible that they snuffled up some bird flu virus - hence the positive result on nasal swabs - but weren’t actually infected. Tests on the pigs’ tissue are pending.
LLMs in the exam room are still a work in progress
Like little kids who have built a full-sized race car out of Legos and cannot wait to drive that thing on a real road, AI enthusiasts these days are champing at the bit to bring large language models (LLMs like ChatGPT) into the exam room. A thoughtful Medscape perspective highlights the challenges that must be overcome before that should happen.
The biggest problem (and it’s a doozy) is that AI often confidently gives wrong or simply made-up answers to relatively simple questions, such as whether the patient has had a blood test recently. (Ask an LLM how many Rs there are in the word strawberry and see what it says.)
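For the record, the answer the model should give is easy to verify. A one-line Python check (ours, not Medscape's):

```python
# Ground truth for the classic LLM stumper: how many times
# does the letter "r" appear in "strawberry"?
count = "strawberry".count("r")
print(count)  # prints 3
```

Many general-purpose LLMs have famously answered "2" — a small but telling illustration of confident wrongness on a trivially checkable question.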
Two other recent publications from opposite ends of the AI-excitement spectrum provide a more comprehensive discussion of the LLM issue, while still maintaining a positive perspective on health-care AI in general. On the pro side, we have The AI Revolution in Medicine (one author interviewed by Eric Topol here). For the cons, there’s AI Snake Oil (one author interviewed by Eric Topol here).
COMMENTARY (also a work in progress): It seems like every new diagnostic innovation must claim “AI inside” to some degree. However, we see four clear current caveats:
No general-purpose LLM can possibly have enough domain-specific training data to be plug-and-play for medical decision-making. Access to relevant training data — much of it proprietary — is essential, and that data is scarce and expensive.
Even SLMs (small language models – LLMs trained with highly domain-specific data) are backward-looking. They can be very good at regurgitating the most likely and relevant parts of what has happened before, but they are still very bad at predicting the future if circumstances differ or change.
The potential for “self-inflicted wounds” is very real. If AI changes behavior - as it is supposed to - its own training data becomes increasingly obsolete/misleading. No one has yet figured out exactly how to tailor AI to different situations while managing this updating challenge.
LLMs interpret the lessons of the past in terms of probabilities. Humans do this too, but LLMs can do it more quickly and more broadly. However, at our best, humans overlay reasoning and logic (math) before deciding what is cause and what is effect. LLMs cannot do this without further modification. As an example, see recent Apple research showing that LLM “reasoning” errors increase as the questions being asked become more novel.
But all that being said, humans have stubborn, often irrational biases and can’t possibly bring our A game to work every single day. We believe that LLMs will improve. In the end, they’ll get used in the exam room - but by humans who use them as just one of many tools guiding a diagnosis or treatment decision.
Are less-invasive colorectal cancer screening options effective?
Colonoscopy is the gold-standard test for colorectal cancer (CRC). It’s highly sensitive for hard-to-detect, asymptomatic pre-cancerous changes (AKA polyps), allows them to be removed during the same procedure, and yields essentially no false positives.
But colonoscopy is a pain in the bum (pun intended). It requires unpleasant at-home preparation followed by a trip to a health-care facility and some level of sedation or anesthesia. Non-invasive stool tests are now a recognized option for first-line CRC screening, with the caveat that a follow-up colonoscopy may be required to confirm results. Blood tests are now available, too - the same caveat applies.
This week, Annals of Internal Medicine reported a comprehensive comparison of all currently available non-invasive CRC screening tests versus colonoscopy. The study is not based on new research. Instead, it uses an established model to estimate cases and deaths at a population level under different screening-test protocols.
The baseline: Without screening, 7,470 people per 100,000 will be diagnosed with CRC, and about half of them will die. Colonoscopy every 10 years reduces the number of cases that progress to a clinical CRC diagnosis by nearly 80%, saving nearly 3,000 lives.
This is a very tough benchmark, and all the rest of the tests fall short of it. The stool-based tests (Exact Sciences’ Cologuard, Geneoscopy’s ColoSense, and the older fecal immunochemical test) come closest, saving almost as many lives as colonoscopy on average (averting about 70% of deaths, versus colonoscopy’s 80%). The blood tests (Guardant and Freenome cfDNA tests) avert only about 50%.
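The study’s round numbers hang together, as a quick back-of-the-envelope check shows. (The script is ours; the inputs — 7,470 diagnoses per 100,000, roughly half fatal, and the averted-death percentages — come straight from the figures above.)

```python
# Back-of-the-envelope check of the modeled figures cited above.
diagnoses_per_100k = 7_470                   # projected CRC diagnoses per 100,000 without screening
deaths_per_100k = diagnoses_per_100k * 0.5   # "about half will die"

# Approximate share of baseline deaths each strategy averts, per the article.
strategies = [("Colonoscopy", 0.80), ("Stool tests", 0.70), ("Blood tests", 0.50)]
for name, averted in strategies:
    print(f"{name}: ~{deaths_per_100k * averted:,.0f} lives saved per 100,000")
```

The colonoscopy line works out to roughly 2,988 per 100,000 — the article’s “nearly 3,000 lives.”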
Given that the US currently spends $24 billion per year on CRC screening, the relative costs of these tests matter. Here’s the breakdown:
Colonoscopy: about $1,500, once every 10 years if normal.
Stool tests:
Stool DNA (e.g., Cologuard): about $700, once every three years. About 18% of folks will need follow-up colonoscopy at least once over 10 years.
FIT (fecal immunochemical test): about $20, but must be done annually. About 20% of folks will need follow-up colonoscopy at least once over 10 years.
Blood tests: about $1,500, once every three years. About 14% of folks will need follow-up colonoscopy at least once over 10 years. (Perhaps counterintuitively, that slightly lower follow-up percentage is not a good thing. That number is lower because this test flags fewer potentially positive cases - in other words, it returns more false negatives.)
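A rough per-person cost comparison falls out of those numbers. The arithmetic below is ours, not the study’s, and it leans on two assumptions we’re making explicit: an every-three-years test is taken about four times in a decade (an annual test, ten times), and a follow-up colonoscopy costs about the same $1,500 as a screening one.

```python
FOLLOWUP_COLONOSCOPY = 1500  # assumed equal to a screening colonoscopy (our assumption)

# (option, cost per test, tests per decade, share needing a follow-up colonoscopy)
options = [
    ("Colonoscopy",    1500, 1,  0.00),  # once every 10 years if normal
    ("Stool DNA test",  700, 4,  0.18),  # every 3 years -> ~4 tests per decade (our assumption)
    ("FIT",              20, 10, 0.20),  # annual
    ("Blood test",     1500, 4,  0.14),  # every 3 years
]
for name, per_test, n_tests, followup_share in options:
    expected = per_test * n_tests + followup_share * FOLLOWUP_COLONOSCOPY
    print(f"{name}: ~${expected:,.0f} expected per person per decade")
```

On these assumptions, FIT is by far the cheapest path per decade and the blood tests the most expensive — several times the cost of a single colonoscopy — which helps explain why relative pricing matters against that $24 billion annual spend.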
COMMENTARY: A test not taken saves no one, and this is where the great expectations for the less-invasive tests reside: to reduce the percentage of eligible individuals who do not screen at all.
Home stool-based tests are convenient and simple, but the “ick” factor likely discourages some. Blood tests can be ordered as part of a panel of tests by physicians, but how many of the unscreened visit a physician regularly?
Costs aside, there is no question that the best public-health answer is to provide a variety of different test options to encourage screening among the 40% of people who currently skip it. (One of the things we learned from COVID is that providing a range of testing options is the best way to maximize compliance.) Nevertheless, colonoscopy is still the best option if you’re willing to get it done.
COVID & COVID/Respiratory EUA Update
The FDA issued no new COVID 510(k) premarket notifications, no new EUAs, three amendments to existing EUAs, two revocations, and no warning letters in October.
510(k) Premarket Notifications: 0
New EUAs: 0
Amendments to Existing EUAs:
Antigen (COVID only): 0
COVID / Respiratory Panels: 3
Revocations: 2 | Warning Letters: 0
Revocations (2): Cue COVID-19 Test for Home and Over The Counter (OTC) Use | Cue COVID-19 Test