Informatics Professor: Generalizability and Reproducibility of Scientific Literature and the Limits to Machine Learning

A couple of years ago, some colleagues and I wrote a paper raising a number of caveats about the enthusiasm for leveraging the growing volume of patient data in electronic health records and other clinical information systems for so-called re-use or secondary use, such as clinical research, quality improvement, and public health [1]. While we shared the enthusiasm for that type of use, we also recognized some major challenges in trying to extract knowledge from these sources of data, and advocated a disciplined approach [2].

Now that the world’s knowledge is increasingly available in electronic form in online scientific papers and other resources, a growing number of researchers, companies, and others are calling for the same type of approach that will allow computers to process the world’s scientific literature to answer questions, give advice, and perform other tasks. Extracting knowledge from scientific literature may be easier than from medical records. After all, scientific literature is written in a way to report findings and conclusions in a relatively unambiguous manner. In addition, scientific writing is usually subject to copy-editing that decreases the likelihood of grammatical or spelling errors, both of which often make processing medical records more difficult.

There is no question that machine processing of literature can help answer many questions we have [3]. Google does an excellent job of answering questions I have about the time in various geographic locations, the status of current airplane flights, and calories or fat in a given food. But for more complex questions, such as the best treatment for a complex patient, we still have a long way to go.

Perhaps the system with the most hype around this sort of functionality is IBM’s Watson. Recently, one of the early leaders of artificial intelligence research, Dr. Roger Schank, took IBM to task for its excessive claims (really marketing hype) around Watson [4]. Among the concerns Schank raised were IBM claims that Watson can “out-think” cancer. I too have written about Watson, in a posting to this blog now four years ago, in which I lamented the lack of published research describing its benefits (as opposed to hype pieces extolling its “graduation” from medical school) [5]. While there have been some conference abstracts presented on Watson’s work, we have yet to see any major contributions toward improving the diagnosis or treatment of cancer [6]. Like Schank, I find Watson’s technology interesting, but claims of its value in helping clinicians to treat cancer or other diseases need scientific verification as much as the underlying treatments being used.

We also, however, need to do some thought experiments as to how likely it is that computers can carry out machine learning in this manner. In fact, there are many reasons why the published scientific literature must be approached with care. It has become clear in recent years that what is reported in the scientific literature may not reflect the totality of knowledge, but instead represents the “winner’s curse” of results that have been positive and thus more likely to be published [7,8]. Such “publication bias” pervades all of science [9].
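To make the winner's curse concrete, here is a minimal simulation sketch in Python. It is not taken from any of the cited papers. The true effect size, the number of subjects per study, the number of studies, and the rule that a study is "published" only when its test statistic exceeds 1.96 are all illustrative assumptions, and the function name is hypothetical.

```python
import random
import statistics


def simulate_publication_bias(true_effect=0.2, n_per_study=30,
                              n_studies=2000, seed=0):
    """Simulate many small studies of the same true effect.

    Assumption (illustrative only): a study is "published" when its
    estimate divided by its standard error exceeds 1.96.
    Returns (mean of all estimates, mean of published estimates,
    number of published studies).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    all_estimates = []
    published_estimates = []

    for _ in range(n_studies):
        # Each study draws noisy measurements around the same true effect.
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n_per_study)]
        estimate = statistics.fmean(sample)
        std_error = statistics.stdev(sample) / n_per_study ** 0.5

        all_estimates.append(estimate)
        # Only "positive" results make it into the simulated literature.
        if estimate / std_error > 1.96:
            published_estimates.append(estimate)

    return (statistics.fmean(all_estimates),
            statistics.fmean(published_estimates),
            len(published_estimates))


if __name__ == "__main__":
    all_mean, published_mean, n_published = simulate_publication_bias()
    print(f"Mean estimate, all studies:       {all_mean:.3f}")
    print(f"Mean estimate, published studies: {published_mean:.3f}")
    print(f"Studies published:                {n_published} of 2000")
```

Under these assumptions, the average across all simulated studies should sit near the true effect, while the average of the "published" subset should be noticeably larger. A system that learns only from the published subset would inherit that inflated estimate.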

In addition, further problems plague the scientific literature. It has been discovered in recent years that a good many scientific experiments are not reproducible. This was found to be quite prevalent in preclinical studies analyzed by pharmaceutical companies looking for promising drugs that might be candidates for commercial development [10]. It has also been demonstrated in psychology [11]. In a recent survey of scientists, over half agreed with the statement that there is a “reproducibility crisis” in science, with 50-80% (depending on the field) unable to reproduce an experiment yet very few trying or able to publish about it [12].

Even on the clinical side we know there are many problems with randomized controlled trials (RCTs). Some recent analyses have documented that RCTs do not always reflect the larger population that the sample is intended to represent [13,14], something we can now document with that growing quantity of EHR data [15]. Additional recent work has questioned the use of surrogate outcomes for cancer drugs, casting doubt on their validity as indicators of the drugs' efficacy [16,17]. Indeed, it has been shown in many areas of medicine that initial studies are overturned by later, usually larger, studies [18-20].

In addition, the problem is not limited to published literature. A recent study documented significant amounts of inaccuracy in drug compendia that are commonly used by clinicians [21].

My argument has always been that informatics interventions must prove their scientific mettle no differently than other interventions that claim to improve clinical practice, patient health, or other tasks for which we develop them. A few years back some colleagues and I raised some caveats about clinical data. For different reasons, there are challenges with scientific literature as well. Thus we should be wary of systems that claim to “ingest” scientific literature and perform machine learning from it. While it is important to continue this area of research, we must resist efforts to over-hype it and must also carry out research to validate its success.

References

1. Hersh, WR, Weiner, MG, et al. (2013). Caveats for the use of operational electronic health record data in comparative effectiveness research. Medical Care. 51(Suppl 3): S30-S37.
2. Hersh, WR, Cimino, JJ, et al. (2013). Recommendations for the use of operational electronic health record data in comparative effectiveness research. eGEMs (Generating Evidence & Methods to improve patient outcomes). 1: 14. http://repository.academyhealth.org/egems/vol1/iss1/14/.
3. Wright, A (2016). Reimagining search. Communications of the ACM. 59(6): 17-19.
4. Schank, R (2016). The fraudulent claims made by IBM about Watson and AI. They are not doing "cognitive computing" no matter how many times they say they are. Roger Schank. http://www.rogerschank.com/fraudulent-claims-made-by-IBM-about-Watson-and-AI.
5. Hersh, W (2013). What is a Thinking Informatician to Think of IBM's Watson? Informatics Professor. http://informaticsprofessor.blogspot.com/2013/06/what-is-thinking-informatician-to-think.html.
6. Kim, C (2015). How much has IBM’s Watson improved? Abstracts at 2015 ASCO. Health + Digital. http://healthplusdigital.chiweon.com/?p=83.
7. Ioannidis, JP (2005). Why most published research findings are false. PLoS Medicine. 2(8): e124. http://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124.
8. Young, NS, Ioannidis, JP, et al. (2008). Why current publication practices may distort science. PLoS Medicine. 5(10): e201. http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0050201.
9. Dwan, K, Gamble, C, et al. (2013). Systematic review of the empirical evidence of study publication bias and outcome reporting bias - an updated review. PLoS ONE. 8(7): e66844. http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0066844.
10. Begley, CG and Ellis, LM (2012). Raise standards for preclinical cancer research. Nature. 483: 531-533.
11. Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science. 349: aac4716. http://science.sciencemag.org/content/349/6251/aac4716.
12. Baker, M (2016). 1,500 scientists lift the lid on reproducibility. Nature. 533: 452-454.
13. Prieto-Centurion, V, Rolle, AJ, et al. (2014). Multicenter study comparing case definitions used to identify patients with chronic obstructive pulmonary disease. American Journal of Respiratory and Critical Care Medicine. 190: 989-995.
14. Geifman, N and Butte, AJ (2016). Do cancer clinical trial populations truly represent cancer patients? A comparison of open clinical trials to the Cancer Genome Atlas. Pacific Symposium on Biocomputing, 309-320. http://www.worldscientific.com/doi/10.1142/9789814749411_0029.
15. Weng, C, Li, Y, et al. (2014). A distribution-based method for assessing the differences between clinical trial target populations and patient populations in electronic health records. Applied Clinical Informatics. 5: 463-479.
16. Prasad, V, Kim, C, et al. (2015). The strength of association between surrogate end points and survival in oncology: a systematic review of trial-level meta-analyses. JAMA Internal Medicine. 175: 1389-1398.
17. Kim, C and Prasad, V (2015). Strength of validation for surrogate end points used in the US Food and Drug Administration's approval of oncology drugs. Mayo Clinic Proceedings. Epub ahead of print.
18. Ioannidis, JP (2005). Contradicted and initially stronger effects in highly cited clinical research. Journal of the American Medical Association. 294: 218-228.
19. Prasad, V, Vandross, A, et al. (2013). A decade of reversal: an analysis of 146 contradicted medical practices. Mayo Clinic Proceedings. 88: 790-798.
20. Prasad, VK and Cifu, AS (2015). Ending Medical Reversal: Improving Outcomes, Saving Lives. Baltimore, MD, Johns Hopkins University Press.
21. Randhawa, AS, Babalola, O, et al. (2016). A collaborative assessment among 11 pharmaceutical companies of misinformation in commonly used online drug information compendia. Annals of Pharmacotherapy. 50: 352-359.
22. Malin, JL (2013). Envisioning Watson as a rapid-learning system for oncology. Journal of Oncology Practice. 9: 155-157.

Monday, June 6, 2016

William Hersh
