Tuesday, August 26, 2014

Beyond Prediction: Data Analytics/Data Science/Big Data Must Demonstrate Value

One of my ongoing concerns for data analytics/data science/Big Data in biomedicine and health is that despite the growth of articles and other writing, the accomplishments from using these tools, especially those documented in peer-reviewed journals, continue to be few. I am as enthusiastic as anyone about the prospects for harnessing the growing quantity of data in our operational electronic health record (EHR) and other systems for improving health, healthcare, and research. Yet I also believe that we need to be careful that our enthusiasm does not lead to overselling or outright hype, and that we must demonstrate the value of using data just as we would for any other clinical process or tool.

There have been some news reports of the value of using Big Data. However, it would be better to see peer-reviewed publication of such results. According to these reports, two states, Wyoming and Washington, have shown reduced emergency department visits using data-based methods, while Beth Israel Deaconess Hospital has used data as part of an effort that has helped reduce hospital readmissions by 25%. Another earlier news article reported that IBM Watson has learned from data how to diagnose cancer more accurately than physicians, although when I emailed the physician to whom that claim of success was attributed, he replied that he had never said it (Samuel Nussbaum, email communication, July 28, 2014).

There also continues to be a spate of well-done research demonstrating the predictive value of data. Just this past week, as I was preparing this post, two interesting and informative studies of prediction came across the wire, one looking at risk for metabolic syndrome in a database of 36,944 individuals maintained by a large health insurer [1] and another looking at prediction of hospital readmission [2]. These studies are important, but all of this must be followed by implementation of approaches that make use of data to show real benefit, such as improved patient outcomes, improved health, or even cost efficiency. I am aware of only one study that has shown benefit from the use of data analytic techniques, in that case a heart failure prediction algorithm [3]. Maybe I am wrong, and other studies applying Big Data techniques have been done and have shown benefit; I will certainly stand corrected if so.

Despite the lack of studies demonstrating benefit, there have been plenty of interesting writings about Big Data. Some publications have even devoted issues or volumes to the topic. One of these was the July issue of the health policy journal Health Affairs. There were a number of interesting articles in the issue, although none reported any research results demonstrating the value of Big Data. Among the interesting papers were:
  • Bates et al. detailing what they consider the six most important use cases for Big Data: high-cost patients, readmissions, triage, patient decompensation, adverse events, and treatment optimization for diseases affecting multiple organ systems [4]
  • Krumholz describing the need for new thinking and training (including informatics) in the application of Big Data [5]
  • Curtis et al. discussing four large national multi-purpose data networks that could have substantial impact [6]
  • Longhurst et al. presenting the concept of the "Green Button," a tool in the EHR that would aggregate data in an attempt to answer clinical questions for which no prior evidence existed [7]
Also appearing recently was the 2014 Yearbook of Medical Informatics, which is now available via open-access publishing and was devoted this year to the topic of Big Data. As with the Health Affairs issue, there were several interesting papers (including one I co-authored that focused on how informatics education must adapt to Big Data [8]) but none reporting patient or organizational benefits of Big Data.

There also continues to be a steady stream of other papers related to re-use of clinical data that provide insights or demonstrate the challenges of working with it. Two of these papers come from a recent special issue of the Journal of the American Medical Informatics Association (JAMIA) devoted to "high-throughput phenotyping." A paper by Richesson et al. documents the challenges in something so seemingly simple as definitively determining which patients have been diagnosed with diabetes mellitus [9]. Another paper by Pathak et al. documents the detailed work required to standardize and normalize data in the EHR for a single quality measure assessing serum cholesterol levels below 100 mg/dL for patients with diabetes mellitus [10]. Other recent papers in JAMIA have documented the challenges with the quality of diabetes-related data used for quality indicators in primary care [11] and the significant quantity of non-conformance with the details of the Consolidated Clinical Document Architecture (C-CDA) that undermines interoperability [12].

Despite the slow progress, I am still confident that we will see scientific advances around data analytics/data science/Big Data in biomedicine and health. I agree with Cathy O'Neil, who writes that we should be "skeptics, not cynics" about Big Data [13]. In other words, we should approach data, and the results obtained from it, with informed skepticism. I reiterate what I have written in the past: we must put data to use in ways that demonstrate benefit, apply a research mentality, and take into account the "provocations" of danah boyd, the most important of which is that we must not let the data define the questions we ask of it, and should instead seek data that will best answer our questions [14].


1. Steinberg, GB, Church, BW, et al. (2014). Novel predictive models for metabolic syndrome risk: a “big data” analytic approach. American Journal of Managed Care. 20: e221-e228.
2. Hebert, C, Shivade, C, et al. (2014). Diagnosis-specific readmission risk prediction using electronic health data: a retrospective cohort study. BMC Medical Informatics & Decision Making. 14: 65. http://www.biomedcentral.com/1472-6947/14/65.
3. Amarasingham, R, Patel, PC, et al. (2013). Allocating scarce resources in real-time to reduce heart failure readmissions: a prospective, controlled study. BMJ Quality & Safety. 22: 998-1005.
4. Bates, DW, Saria, S, et al. (2014). Big data in health care: using analytics to identify and manage high-risk and high-cost patients. Health Affairs. 33: 1123-1131.
5. Krumholz, HM (2014). Big data and new knowledge in medicine: the thinking, training, and tools needed for a learning health system. Health Affairs. 33: 1163-1170.
6. Curtis, LH, Brown, J, et al. (2014). Four health data networks illustrate the potential for a shared national multipurpose big-data network. Health Affairs. 33: 1178-1186.
7. Longhurst, CA, Harrington, RA, et al. (2014). A 'green button' for using aggregate patient data at the point of care. Health Affairs. 33: 1229-1235.
8. Otero, P, Hersh, W, et al. (2014). Big Data: Are Biomedical and Health Informatics Training Programs Ready? Yearbook of Medical Informatics 2014. C. Lehmann, B. Séroussi and M. Jaulent: 177-181.
9. Richesson, RL, Rusincovitch, SA, et al. (2013). A comparison of phenotype definitions for diabetes mellitus. Journal of the American Medical Informatics Association. 20: e319-e326.
10. Pathak, J, Bailey, KR, et al. (2013). Normalization and standardization of electronic health records for high-throughput phenotyping: the SHARPn consortium. Journal of the American Medical Informatics Association. 20: e341-e348.
11. Barkhuysen, P, deGrauw, W, et al. (2014). Is the quality of data in an electronic medical record sufficient for assessing the quality of primary care? Journal of the American Medical Informatics Association. 21: 692-698.
12. D'Amore, JD, Mandel, JC, et al. (2014). Are Meaningful Use Stage 2 certified EHRs ready for interoperability? Findings from the SMART C-CDA Collaborative. Journal of the American Medical Informatics Association. Epub ahead of print.
13. O'Neil, C (2013). On Being a Data Skeptic. Sebastopol, CA, O'Reilly. http://www.oreilly.com/data/free/being-a-data-skeptic.csp.
14. Boyd, D and Crawford, K (2011). Six Provocations for Big Data. Cambridge, MA, Microsoft Research. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1926431.
