Wednesday, September 18, 2019

Closing the Loops on Data Science and Informatics

One of the most highly viewed posts on this blog is a 2015 posting, What is the Difference (If Any) Between Informatics and Data Science. One critique I have had of data science is that most work focuses only on demonstrating prediction and not on implementing prescription. In other words, how do we take the predictive output in an ever-increasing number of areas of biomedicine and turn it into programs that actually improve outcomes, whether better patient care, improved healthcare delivery, or more effective research? Some recent publications bring this issue to light and show that we have some loops to close before we attain the value of data science in biomedicine.

A couple of recent perspective pieces bring the first of these loops into focus. One is from colleagues Philip Payne, Elmer Bernstam, and Justin Starren [1]. In a Perspective last year in JAMIA Open, they put forth a model that delineates the loop that must be closed, from the development of data science (and informatics) models and systems to the real-world work, familiar to most in the field, of implementing and evaluating systems with real users and organizations. A more recent paper from Lenert et al. notes that as predictive models are put into place and change outcomes, those changed outcomes will necessarily affect the models themselves, which will need to be adjusted to the new reality of their use [2].
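The feedback effect that Lenert et al. describe can be made concrete with a toy calculation. The sketch below is my own illustration, not a method from their paper: the two risk groups, their sizes, the predicted risks, and the 50% intervention effect are all invented numbers. It assumes a model that was well calibrated before deployment and an intervention that is applied only to the patients the model flags.

```python
from dataclasses import dataclass


@dataclass
class RiskGroup:
    name: str
    n_patients: int
    predicted_risk: float  # model output, learned before deployment
    flagged: bool          # True if the model triggers an intervention


def observed_rate(group: RiskGroup, intervention_effect: float) -> float:
    """Event rate seen after deployment.

    intervention_effect is a hypothetical relative risk reduction for
    flagged patients (0.0 = no effect, 1.0 = every event prevented).
    """
    if group.flagged:
        return group.predicted_risk * (1.0 - intervention_effect)
    return group.predicted_risk


def observed_to_expected(groups, intervention_effect: float) -> float:
    """Observed events divided by model-expected events (1.0 = calibrated)."""
    expected = sum(g.n_patients * g.predicted_risk for g in groups)
    observed = sum(
        g.n_patients * observed_rate(g, intervention_effect) for g in groups
    )
    return observed / expected


# Invented cohort: all numbers are illustrative assumptions.
groups = [
    RiskGroup("high risk", 200, 0.30, True),
    RiskGroup("low risk", 800, 0.05, False),
]

before = observed_to_expected(groups, intervention_effect=0.0)
after = observed_to_expected(groups, intervention_effect=0.5)

print(f"Observed/expected before deployment: {before:.2f}")
print(f"Observed/expected after deployment:  {after:.2f}")
```

In this made-up cohort, the ratio of observed to expected events falls below 1.0 once the intervention starts working, so a model that is doing its job begins to look miscalibrated when it is evaluated against post-deployment data. That is the sense in which the model needs to be adjusted to the new reality of its use.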

One aspect of this first loop to be closed is how we study data science and machine learning interventions in actual clinical practice. A pair of recently published papers demonstrates how models and systems can be built and validated, and then assessed in the clinical real world. A first paper by Barton et al. develops and evaluates a model for predicting sepsis from patient vital signs [3]. Sepsis is a medical problem of continued significance, and vital sign data are readily available. A subsequent paper by Shimabukuro et al. reports a randomized controlled trial in two medical intensive care units, finding a decrease in length of stay in the units from 13.0 to 10.3 days and a 12.4% reduction in in-hospital mortality [4].

Another recent study assessed the application of machine learning to detecting colonic polyps during colonoscopy [5]. While the machine learning system worked well, it was mostly effective at recognizing polyps that were unlikely to progress to cancer quickly, such as small adenomas and hyperplastic polyps. Nonetheless, recognizing such polyps improves the overall quality of the colonoscopy exam.

A second loop that will need to be closed to achieve the vision of widespread generalized application of data science is the generation of standardized EHR data for use across the healthcare system. A group of colleagues and I wrote about this in 2013 [6], as have many others, but some recent work documents that aspects of this problem are still not solved. Two recent analyses show variations in how physicians [7] and healthcare organizations [8] document patient care, which may lead to variation in data that is not due to underlying differences in patients.

The need to close these loops shows we are still in the early days of machine learning and predictive algorithms. While their impact in medicine will likely be enormous in the long run, much work remains to optimize their data and determine how they are most effectively used.


1. Payne P, Bernstam E, Starren J. Biomedical informatics meets data science: current state and future directions for interaction. JAMIA Open. 2018;1:136-41.
2. Lenert M, Matheny M, Walsh C. Prognostic models will be victims of their own success, unless. . . Journal of the American Medical Informatics Association. 2019; Epub ahead of print.
3. Barton C, Chettipally U, Zhou Y, Jiangce Z, Lynn-Palevsky A, Le S, et al. Evaluation of a machine learning algorithm for up to 48-hour advance prediction of sepsis using six vital signs. Computers in Biology and Medicine. 2019;109:79-84.
4. Shimabukuro D, Barton C, Feldman M, Mataraso S, Das R. Effect of a machine learning-based severe sepsis prediction algorithm on patient survival and hospital length of stay: a randomised clinical trial. BMJ Open Respiratory Research. 2019;4(1):e000234.
5. Wang P, Berzin T, Brown J, Bharadwa S, Becq A, Xiao X, et al. Real-time automatic detection system increases colonoscopic polyp and adenoma detection rates: a prospective randomised controlled study. Gut. 2019; Epub ahead of print.
6. Hersh W, Weiner M, Embi P, Logan J, Payne P, Bernstam E, et al. Caveats for the use of operational electronic health record data in comparative effectiveness research. Medical Care. 2013;51(Suppl 3):S30-S7.
7. Cohen G, Friedman C, Ryan A, Richardson C, Adler-Milstein J. Variation in physicians' electronic health record documentation and potential patient harm from that variation. Journal of General Internal Medicine. 2019; Epub ahead of print.
8. Glynn E, Hoffman M. Heterogeneity introduced by EHR system implementation in a de-identified data resource from 100 non-affiliated organizations. JAMIA Open. 2019; Epub ahead of print.