Thursday, October 31, 2024

Medical Practice is Rarely a Zero/One-Shot Activity

A new generative artificial intelligence (AI) study presents some interesting findings.(1) Physicians randomized to conventional information resources, with or without access to ChatGPT-4, showed no significant difference in their ability to solve diagnostic vignettes. In other words, physician access to ChatGPT did not improve their diagnostic ability. Another finding, however, was that ChatGPT alone scored far better than either group of physicians. The results can be interpreted in different ways, from generative AI providing no added value for physicians to it outperforming them.

This is a good study, and we need more like it, but we cannot lose sight of the bigger picture, which is that, as the title of this post says, real-world medical practice is rarely a zero- or one-shot activity. In other words, there are very few situations in medicine where the definitive diagnosis must be made the instant a patient presents. Clinical diagnosis and treatment are most often an iterative process, and sometimes the response to treatment helps make the diagnosis.

Additional insight into this issue in the context of generative AI comes from an unrelated study.(2) This analysis assessed proprietary and open-source models for extracting relevant findings from chest radiograph reports. While GPT-4 led the pack for zero-shot prompting, several open-source large language models (LLMs) were able to “catch up” with GPT-4 when the prompt included a few example reports. In other words, given the additional context of worked examples, the other LLMs were able to match GPT-4’s performance.
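
To make the zero-shot versus few-shot distinction concrete, here is a minimal Python sketch of the two prompting styles for a report-labeling task like the one in the study. The report text, label set, and example reports are invented for illustration, and the OpenAI chat API stands in for whatever interface a given LLM exposes.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical label set and report, for illustration only
LABELS = "pneumothorax, pleural effusion, consolidation, cardiomegaly"
REPORT = "Findings: Small left apical pneumothorax. No pleural effusion."

SYSTEM = (
    f"Label the chest radiograph report with any of: {LABELS}. "
    "Return a comma-separated list of labels, or 'none'."
)

# Zero-shot: the model sees only the task description and the report.
zero_shot = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": REPORT},
]

# Few-shot: the same task, but the prompt also carries worked examples,
# the kind of added context that let the open-source models catch up.
few_shot = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "Findings: Enlarged cardiac silhouette. Lungs clear."},
    {"role": "assistant", "content": "cardiomegaly"},
    {"role": "user", "content": "Findings: Right basilar opacity consistent with consolidation."},
    {"role": "assistant", "content": "consolidation"},
    {"role": "user", "content": REPORT},
]

for name, messages in [("zero-shot", zero_shot), ("few-shot", few_shot)]:
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    print(f"{name}: {response.choices[0].message.content}")

The only difference between the two prompts is the pair of labeled examples; nothing about the model or the task description changes.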

We can gain additional context from another recent publication that assessed diagnostic reasoning.(3) In this study, the generative AI performed worse than physicians, failing to follow diagnostic and treatment guidelines and unable to interpret laboratory results correctly. A criticism of this study is that it employed less commonly used LLMs; the stated reason for not using the leading LLMs was that the study data could not be submitted to commercial systems. This paper also did not attempt any sort of interactive prompting, so we do not know whether performance might have improved, as it did in the radiology study cited above.

We can certainly learn about LLM use in medical diagnosis from these studies, but in the long run, we really need studies that assess whether and how generative AI leads to real benefits for patients and their care. I do not underestimate the challenges of carrying out such studies, but they must be done to determine the value of generative AI in clinical medicine.

References

1. Goh E, et al., 2024. Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Netw Open 7, e2440969. https://doi.org/10.1001/jamanetworkopen.2024.40969
2. Dorfner FJ, et al., 2024. Comparing Commercial and Open-Source Large Language Models for Labeling Chest Radiograph Reports. Radiology 313, e241139. https://doi.org/10.1148/radiol.241139
3. Hager P, et al., 2024. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat Med 30, 2613–2622. https://doi.org/10.1038/s41591-024-03097-1