Friday, April 26, 2013

The Road to Big Data Passes Through Informatics

I have written a number of postings over the last year about various aspects of electronic health record (EHR) data, from the transition of the work of informatics from implementation to analytics to the problems that still prevent us from making optimal use of data, such as the difficulties of data entry. One of my themes has been that knowledge will not just fall out of the data; we will need to improve the quality and completeness of data to learn from it. The requirements for getting better data include widespread adherence to data standards, engaging and motivating those who enter data to improve it, making it easier for those individuals to enter quality data, and evolving our healthcare system to value this data. If we are not able to meet these challenges with our current data, it is unlikely we will be able to do so when we have "big data," i.e., data that is orders of magnitude larger and more complex than what we have now. No field has devoted more thought, research, or evaluation to the challenges of clinical and health data than informatics. Thus, whether it is tackling issues of how to implement systems in complex clinical settings; meeting the needs of clinicians, patients, and others; or how to maximize the quality of data, the road to making the best use of (big or non-big) data must pass through informatics.

An example of the fact that knowledge will not just fall out of the data comes from some research activity I have been involved in over the last couple of years, which is the Text Retrieval Conference (TREC) Medical Records Track [1]. As those familiar with the field of information retrieval (IR) know, TREC is an annual "challenge evaluation," sponsored by the National Institute of Standards and Technology (NIST) [2]. Challenge evaluations bring research groups with common interests and use cases together to apply their systems to a common task or set of tasks, using a common data set, and to compare results using agreed-upon metrics (ideally in a scholarly and not an overly competitive forum). TREC operates on a yearly cycle, consisting of 5-7 "tracks" that each represent a specific focus of IR research. TREC began with the straightforward tasks of "ad hoc" retrieval (user entering queries into a search engine seeking relevant documents) and "routing" (user seeking relevant documents from a new stream of documents based on knowledge of previous relevant documents). In subsequent years, TREC evolved to its current state of diverse tracks representing newer problems in IR, such as Web search, video searching, question-answering, cross-language retrieval, and user studies. (Some of these tracks have spawned their own challenge evaluations, especially in the area of cross-language evaluation, an important issue in Europe and Asia.) Virtually all tracks have focused on generic content, typically newswire or Web content, with very few being "domain specific," although I have been involved in two domain-specific tracks in the areas of genomics literature [3] and medical records [1].

In TREC and IR jargon, test collections consist of an adequately large and realistic collection of content, such as documents, medical records, Web pages, etc. [2]. Test collections also include a set of topics, usually at least 25-50 for statistical reliability [4], that are instances of the task being studied. A final component is human relevance judgments or assessments over the content items, indicating which are relevant and should be retrieved for each topic. Success is usually measured by some sort of aggregate statistic that combines the base measures of recall (the proportion of relevant content items in the test collection that are retrieved) and precision (the proportion of retrieved content items that are relevant). (For those familiar with medical diagnostic test characteristics, these correspond to sensitivity and positive predictive value. The reciprocal of precision is also sometimes called number needed to retrieve, since it measures how many overall documents must be read or viewed for each relevant one retrieved.)
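For readers who prefer to see the definitions as code, here is a minimal sketch of recall, precision, and number needed to retrieve for a single topic. The function names and the visit identifiers are hypothetical and chosen only for illustration; they are not taken from the track data or its evaluation tools.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one topic.

    retrieved: identifiers returned by the search system
    relevant:  identifiers judged relevant by human assessors
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


def number_needed_to_retrieve(precision):
    """Reciprocal of precision: items viewed per relevant item found."""
    return float("inf") if precision == 0 else 1.0 / precision


# Hypothetical example: 4 visits retrieved, 5 visits judged relevant.
retrieved = ["v1", "v2", "v3", "v4"]
relevant = ["v2", "v4", "v7", "v8", "v9"]

recall, precision = recall_precision(retrieved, relevant)
nnr = number_needed_to_retrieve(precision)
print(recall, precision, nnr)  # prints: 0.4 0.5 2.0
```

In this made-up example, two of the five relevant visits are found (recall 0.4), half of what is retrieved is relevant (precision 0.5), and so two visits must be viewed for each relevant one.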

The use case for the TREC Medical Records Track was identifying patients from a collection of medical records who might be candidates for clinical studies. This is a real-world task for which automated retrieval systems could greatly aid the ability to carry out clinical research, quality measurement and improvement, or other "secondary uses" of clinical data [4]. The metric used to measure the systems was inferred normalized discounted cumulative gain (infNDCG), which takes into account some other factors, such as incomplete judgment of all documents retrieved by all research groups.
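To give a feel for the metric, the sketch below computes plain normalized discounted cumulative gain (NDCG) for one ranked list. It is deliberately not the inferred variant used in the track: infNDCG additionally estimates the measure from a sample of judged items, which is omitted here. The function names, the log2 rank discount, and the graded relevance values (0 = not relevant, 1 = partially relevant, 2 = relevant) are assumptions made for illustration.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_gains, judged_gains):
    """Plain NDCG for one topic (not the inferred variant).

    ranked_gains: relevance grades of the retrieved items, in ranked order
    judged_gains: relevance grades of all judged items for the topic
    """
    # The ideal ranking puts the highest-graded judged items first.
    ideal = sorted(judged_gains, reverse=True)[: len(ranked_gains)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked_gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical grades: the system ranked a nonrelevant visit first.
score = ndcg([0, 2, 1], judged_gains=[2, 1, 1, 0])
print(round(score, 3))
```

A system that ranks the most relevant visits first scores 1.0; placing a nonrelevant visit at the top, as in the example, lowers the score because gains at early ranks are discounted least.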

The data for the track was a corpus of de-identified medical records developed by the University of Pittsburgh Medical Center. Records containing data, text, and ICD-9 codes are grouped by "visits" or patient encounters with the health system. (Due to the de-identification process, it is impossible to know whether one or more visits might emanate from the same patient.) There were 93,551 documents mapped into 17,264 visits.

I was involved in a number of aspects of organizing this track. I contributed both by guiding the task (or use case) and by leading some of the track infrastructure activities, namely development of search topics and relevance assessments. This work has been aided greatly by students with medical and other expertise in the OHSU Biomedical Informatics Graduate Program.

The results of the TREC Medical Records Track provide a good example of why the road to big data passes through informatics, or in other words, why there is still considerable work to be done from an informatics standpoint before knowledge simply falls out of data. While the performance of systems in the track has been good from an IR standpoint, the results also show that these systems and approaches have a considerable way to go before we can just turn the data analytics crank and have medical knowledge emanate. The magnitude of how far we need to go comes from the precision at various levels of retrieval (e.g., precision at 10 retrieved, 50 retrieved, 100 retrieved, etc.), demonstrating how many nonrelevant visits are retrieved. In the case of typical ad hoc IR, we can probably quickly dispense with documents that are relatively easy to identify as not relevant. But this may be a more difficult task for complex patients and complex records.
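Precision at a given retrieval depth is simple to compute, as the following minimal sketch shows. The function name, the visit identifiers, and the convention of dividing by k even when fewer than k visits are returned are assumptions for illustration, not the track's official evaluation code.

```python
def precision_at_k(ranked_visits, relevant, k):
    """Fraction of the top k ranked visits that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_visits[:k]
    hits = sum(1 for visit in top_k if visit in relevant)
    return hits / k  # divide by k, not by the number actually returned


# Hypothetical ranking of visits and set of relevant visits.
ranked = ["v3", "v9", "v1", "v7", "v2"]
relevant = {"v9", "v2", "v5"}

print(precision_at_k(ranked, relevant, 2))  # prints: 0.5
print(precision_at_k(ranked, relevant, 5))  # prints: 0.4
```

Each nonrelevant visit in the top k is a record that someone screening for study candidates would have to read and discard.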

A failure analysis over the data from the 2011 track carried out at OHSU demonstrated why there are still many challenges that need to be overcome [5]. This analysis found a number of reasons why visits frequently retrieved were not relevant:
  • Notes contain very similar term confused with topic
  • Topic symptom/condition/procedure done in the past
  • Most, but not all, criteria present
  • All criteria present but not in the time/sequence specified by the topic description
  • Topic terms mentioned as future possibility
  • Topic terms not present--can't determine why record was captured
  • Irrelevant reference in record to topic terms
  • Topic terms denied or ruled out
The analysis also found reasons why visits rarely retrieved were actually relevant:
  • Topic terms present in record but overlooked in search
  • Visit notes used a synonym for topic terms
  • Topic terms not named and must be derived
  • Topic terms present in diagnosis list but not visit notes
A number of research groups used a variety of techniques, such as synonym and query expansion, machine learning algorithms, and matching against ICD-9 codes, but still had results that were not better than manually constructed queries (which also require a form of informatics expertise in knowing how to query the clinical domain). The results data also show this is a challenging task, as the performance of different systems varied widely on different topics.
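As a rough illustration of the simplest of these techniques, the sketch below expands query terms with synonyms before matching them against note text. The synonym table, function names, and sample note are hypothetical; participating groups used far richer resources and methods, and this sketch does nothing about the negation, timing, or sequence problems listed above.

```python
import re

# Hypothetical, hand-built synonym table for illustration only.
SYNONYMS = {
    "heart attack": ["myocardial infarction", "mi"],
    "high blood pressure": ["hypertension", "htn"],
}


def expand_query(terms, synonyms=SYNONYMS):
    """Return the original terms plus any known synonyms, without duplicates."""
    expanded = []
    for term in terms:
        key = term.lower()
        for candidate in [key] + synonyms.get(key, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def matches(note_text, terms):
    """True if any term appears in the note as a whole word or phrase."""
    text = note_text.lower()
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", text) for term in terms
    )


note = "Patient admitted with acute myocardial infarction."
print(matches(note, ["heart attack"]))                # prints: False
print(matches(note, expand_query(["heart attack"])))  # prints: True
```

The whole-word match keeps a short abbreviation such as "mi" from matching inside "admitted," but the same note would still be retrieved if the infarction had been ruled out or had occurred years earlier, which is exactly the kind of failure the analysis found.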

From my perspective, these results show that successful use of big data will not come just from smart algorithms and fast computer hardware. It will also require the informatics expertise to design and implement EHRs, high-quality and complete clinical data, and a proper understanding of the clinical/health domain to make the most effective use of the data. As such, achieving the value of big data passes through informatics.


1. Voorhees, E and Hersh, W (2012). Overview of the TREC 2012 Medical Records Track. The Twenty-First Text REtrieval Conference Proceedings (TREC 2012), Gaithersburg, MD. National Institute of Standards and Technology.

2. Voorhees, EM and Harman, DK, Eds. (2005). TREC: Experiment and Evaluation in Information Retrieval. Cambridge, MA, MIT Press.

3. Hersh, W and Voorhees, E (2009). TREC genomics special issue overview. Information Retrieval. 12: 1-15.

4. Buckley, C and Voorhees, E (2000). Evaluating evaluation measure stability. Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece. ACM Press. 33-40.

5. Edinger, T, Cohen, AM, et al. (2012). Barriers to retrieving patient information from electronic health record data: failure analysis from the TREC Medical Records Track. AMIA 2012 Annual Symposium, Chicago, IL, 180-188.

Monday, April 8, 2013

Biomedical and Health Informatics vs. Data Science, mHealth, etc. - New Disciplines or New Terminology?

When I entered the field of informatics in the 1980s, a great deal of the research was driven by "artificial intelligence" (AI). Many people were trying to build "rule-based expert systems," while those interested in knowledge representation were constructing "semantic networks." We rarely hear these terms in quotes these days, perhaps with the exception of AI, which one hears occasionally. It is not, however, that no one is trying to build systems that guide decision-making and represent knowledge in complex ways; we just use different terminology now, such as clinical decision support and ontologies.

Fast forward to the present, and we see the introduction of new terms, most prominently right now data science [1] and mHealth [2]. Many who are doing work in these areas talk of them as the primary focus of their work. I question, however, whether these are truly new disciplines, or just concentrations (at least for those working in health-related areas) within biomedical and health informatics [3].

I am most concerned about mHealth, when I see new people coming forward with brilliant ideas and truly innovative technologies, yet not incorporating the experiences from decades of work in informatics. I do not deny that some aspects of using mobile connected devices for health are truly novel, yet what I consider to be the basic principles of informatics still apply, namely things like scalability, interoperability, usability, and so forth. I just see nothing novel enough about mHealth to not call it part of informatics.

The same holds, in my opinion, for data science. There are certainly "computationalist" techniques in which many who work in informatics are not skilled. "Big data" applications will require specialized knowledge. But informatics is a broad field, and no one can master everything. There are other aspects of informatics, such as (I am repeating myself from the previous paragraph here) scalability, interoperability, usability, and so forth that must be married to the results of data science to make the latter's output truly usable. One case in point is the growing number of analyses that predict undesired outcomes, such as hospital readmissions [4]. I am as intellectually interested in these applications as anyone, but until it is shown these analyses can be actionable, they will mostly remain interesting theoretical exercises.

I am excited for mobile health applications and advanced uses of data techniques to improve health, healthcare, and research. I hope that those pursuing them do not lose sight of the larger picture of providing end-to-end value for the use of data, information, and knowledge in health-related endeavors, i.e., the goal of biomedical and health informatics [3].


1. Davenport, TH and Patil, DJ (2012). Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, October, 2012.
2. Krohn, R and Metcalf, D (2012). mHealth: From Smartphones to Smart Systems. Chicago, IL, Healthcare Information Management Systems Society.
3. Hersh, W (2009). A stimulus to define informatics and health information technology. BMC Medical Informatics & Decision Making. 9: 24.
4. Gildersleeve, R and Cooper, P (2013). Development of an automated, real time surveillance tool for predicting readmissions at a community hospital. Applied Clinical Informatics. 4: 153-169.