I will not recapitulate the talk here, which covers the rationale, data, methods, and early results of the TREC Medical Records Track. (Details can be found in the video and slides from the talk.) I will, however, explore the relationship of what is increasingly called "big data" to biomedicine. One can easily find volumes of information on the Web about big data, but the vision is probably best articulated in the book The Fourth Paradigm: Data-Intensive Scientific Discovery, published in 2009 by Microsoft Research [3]. This book presents visionary essays on how the growing amount of big data, from EHRs to biomolecular data to patient-entered data, will facilitate the discovery of knowledge that conventional experiments will not. As the non-medical essays in the book show, this approach has led to many discoveries in other disciplines that use this form of eScience. We also know that businesses and others make productive use of the vast troves of data they collect from purchases, Web chatter, and other sources of information.
It is important to remember, however, that the existence of large volumes of electronic data does not guarantee that these data will automatically translate into knowledge. In my talk, I reviewed the unfortunately modest amount of literature on this topic. The bottom line, discussed and referenced in more detail below, is that medical records are not only incomplete but also often much less meticulously kept than research data. As I have said in the past, clinical documentation is often what stands between the clinician's daily work and his or her going home for dinner. Another problem with medical records, of course, is that the data are observational and not experimental, so confounding factors can influence any conclusions that might be drawn.
In preparing for this talk, I came across a somewhat obscure but well-written critique of big data [4]. As often happens, I found this paper almost by accident, having been pointed to it by one of the email lists to which I subscribe. The primary author of the paper is Danah Boyd, who is another member of Microsoft Research and is also Research Assistant Professor in Media, Culture, and Communication at New York University as well as Visiting Researcher at Harvard Law School. (The paper was delivered as a keynote address at the Oxford Internet Institute's A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society on September 21, 2011.)
Boyd and her co-author list six "provocations" for big data, which together amount to the best critique of big data I have seen. These provocations give us cause for concern and are all relevant to biomedicine. I list them here along with my commentary on their applicability to biomedicine and other general comments:
- Automating Research Changes the Definition of Knowledge - In all research, we tend to meld the question to the data we can obtain. This has certainly been true in biomedical research, where some have criticized researchers for answering questions that are either of interest to the researcher or expedient to answer [5, 6, 7]. We need to remember that the data available in electronic systems, big or small, similarly shape the questions we ask.
- Claims to Objectivity and Accuracy are Misleading - Just because data are collected in a disinterested way does not mean that bias does not occur. We certainly know from the clinical documentation setting (see above) that data entered by clinicians are not necessarily accurate, objective, or complete.
- Bigger Data are Not Always Better Data - This has long been known in medicine from the context of those who do "claims" research based on data collected for billing purposes, which usually consist of diagnosis and procedure codes. One argument for this type of research is the sheer volume of such data, but we also know that these data do not give a complete picture of the patient [9, 10].
- Not All Data Are Equivalent - We certainly know from the clinical setting that certain types of data (e.g., data collected by motivated researchers) are more likely to be of higher completeness and accuracy than others (e.g., clinical documentation).
- Just Because it is Accessible Doesn’t Make it Ethical - I agree with the authors that the use of Institutional Review Boards is important but has its limitations in keeping research ethical.
- Limited Access to Big Data Creates New Digital Divides - I have seen this issue play out in information retrieval research, where researchers from the big search engine companies have access to proprietary data, which makes peer review as well as reproducibility of the work difficult at best. I know Jimmy Lin personally, and it pains me to read his comment quoted in this paper.
1. Safran, C., Bloomrosen, M., et al. (2007). Toward a national framework for the secondary use of health data: an American Medical Informatics Association white paper. Journal of the American Medical Informatics Association, 14: 1-9.
2. Friedman, C., Wong, A., et al. (2010). Achieving a nationwide learning health system. Science Translational Medicine, 2(57): 57cm29.
3. Hey, T., Tansley, S., et al., eds. (2009). The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond, WA. Microsoft Research. http://research.microsoft.com/en-us/collaboration/fourthparadigm/.
4. Boyd, D. and Crawford, K. (2011). Six Provocations for Big Data. Cambridge, MA, Microsoft Research. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1926431.
5. Harari, E. (2001). Whose evidence? Lessons from the philosophy of science and the epistemology of medicine. Australian and New Zealand Journal of Psychiatry, 35: 724-730.
6. Cohen, A., Stavri, P., Hersh, W. (2004). A categorization and analysis of the criticisms of evidence-based medicine. International Journal of Medical Informatics, 73: 35-43.
7. Tunis, S., Stryer, D., et al. (2003). Practical clinical trials - increasing the value of clinical research for decision making in clinical and health policy. Journal of the American Medical Association, 290: 1624-1632.
8. Benin, A., Vitkauskas, G., et al. (2005). Validity of using an electronic medical record for assessing quality of care in an outpatient setting. Medical Care, 43: 691-698.
9. Jollis, J., Ancukiewicz, M., et al. (1993). Discordance of databases designed for claims payment versus clinical information systems: implications for outcomes research. Annals of Internal Medicine, 119: 844-850.
10. O'Malley, K., Cook, K., et al. (2005). Measuring diagnoses: ICD code accuracy. Health Services Research, 40: 1620-1639.
11. Berlin, J. and Stang, P. (2011). Clinical Data Sets That Need to Be Mined, 104-114, in Olsen, L., Grossman, C. and McGinnis, J., eds. Learning What Works: Infrastructure Required for Comparative Effectiveness Research. Washington, DC. National Academies Press.