Sunday, June 23, 2013

The Limits to De-Identification of Clinical Data

There is a wealth of opportunity for putting digital clinical data to use for better understanding health and disease as well as improving healthcare delivery, consistent with the vision from the Institute of Medicine of the "learning health system" [1]. Yet as we have seen from the recent news around the US government monitoring of phone call metadata and Internet data, there are serious concerns about the misuse of digital data that could be a deal-breaker for health-related data if we do not address privacy and security head-on.

Concerns about privacy and security of health data are quite valid. Barely a day goes by without news of another data breach in a healthcare organization, with those large enough going on the "wall of shame" of the US Department of Health and Human Services Office for Civil Rights (OCR). These concerns are demonstrated well in the famous but eerie ACLU pizza video. Another recent study shows it is quite easy to discern attributes about people, including those that may be health-related, based on what they share on their Facebook wall [2].

If we want to get the health-related benefits of clinical data, we must address privacy and security issues. We need not only strict regulation of what can and cannot be done with data, but also an ethos around its responsible use. However, if we address those issues appropriately, then perhaps we should be, as pointed out by Zak Kohane recently, demanding "more surveillance" of medical records for health-related purposes.

When it comes to protecting data, we need to be realistic about what does and does not work. One solution commonly proposed is "de-identification" of data, i.e., removal of elements that identify individuals. There is certainly a role for the use of de-identified data in many types of analysis of clinical data. There are, however, limits to the use of de-identified data.

The problems of de-identified data are two-fold. First, as famously shown by Latanya Sweeney over a decade ago, data de-identified in one database can be combined with data in other sources to re-identify people, including the Governor of Massachusetts [3]. She recently demonstrated this again with a study of people who volunteered their data for the Personal Genome Project [4]. Cimino has shown that groups of lab test results (e.g., chemistry panels) can allow re-identification of people [5]. The bottom line is that we are awash in data that can allow re-identification of people, and it will only be exacerbated by the ultimate personal identifier that will soon be available, namely the variants in our own genome.
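To make the linkage problem concrete, here is a minimal sketch in Python. It is not the method from any of the cited papers: the field names (zip, birth_date, sex), the function names, and all of the records are hypothetical, chosen only to illustrate how a "de-identified" table can be joined to a named one on shared attributes, and how the k-anonymity idea [3] measures that exposure.

```python
from collections import Counter

# Hypothetical quasi-identifiers: attributes that are not names but can
# still single out a person when combined.
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")


def key_of(record, quasi_identifiers=QUASI_IDENTIFIERS):
    """Return the tuple of quasi-identifier values for one record."""
    return tuple(record[q] for q in quasi_identifiers)


def k_anonymity(records, quasi_identifiers=QUASI_IDENTIFIERS):
    """Size of the smallest group of records sharing the same quasi-identifiers."""
    groups = Counter(key_of(r, quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


def link(deidentified, identified, quasi_identifiers=QUASI_IDENTIFIERS):
    """Pair names with diagnoses where the match is unique in both datasets."""
    names_by_key = {}
    for person in identified:
        names_by_key.setdefault(key_of(person, quasi_identifiers), []).append(person["name"])
    record_counts = Counter(key_of(r, quasi_identifiers) for r in deidentified)
    matches = []
    for rec in deidentified:
        key = key_of(rec, quasi_identifiers)
        names = names_by_key.get(key, [])
        if record_counts[key] == 1 and len(names) == 1:
            matches.append((names[0], rec["diagnosis"]))
    return matches


# Made-up data for illustration only.
clinical = [
    {"zip": "00001", "birth_date": "1950-01-01", "sex": "M", "diagnosis": "hypertension"},
    {"zip": "00002", "birth_date": "1960-02-02", "sex": "F", "diagnosis": "asthma"},
    {"zip": "00002", "birth_date": "1960-02-02", "sex": "F", "diagnosis": "diabetes"},
]
public_list = [
    {"name": "Pat Example", "zip": "00001", "birth_date": "1950-01-01", "sex": "M"},
    {"name": "Sam Sample", "zip": "00002", "birth_date": "1960-02-02", "sex": "F"},
]

print(k_anonymity(clinical))
print(link(clinical, public_list))
```

In this toy data the first clinical record is the only one with its combination of attributes, so it can be tied to a name, while the two records that share a combination cannot be told apart by this join alone.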

But perhaps the more important limitation is that data that is truly de-identified, i.e., to the point it cannot be re-identified, also cannot be linked across sources, which leaves it incomplete for comprehensive use. This is mainly because people get healthcare at different places [5,6]. While we may be able to re-identify data within an organization, it is typically difficult when data goes into multi-organization repositories. Again, while this de-identified data may be perfectly fine for some purposes, it does not give us the longitudinal data of which we might want to ask more complex questions.

While recent events may give us a jaundiced view of the uses of data, hopefully that view will moderate as we start to see the benefits. Many of those benefits may emanate from healthcare, making it imperative that we both address privacy and security concerns seriously and put such data to beneficial use.


1. Smith M, Saunders R, Stuckhardt L, and McGinnis JM, Best Care at Lower Cost: The Path to Continuously Learning Health Care in America. 2012, Washington, DC: National Academies Press.
2. Kosinski M, Stillwell D, and Graepel T, Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 2013. 110: 5802-5805.
3. Sweeney L, k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002. 10: 557-570.
4. Sweeney L, Abu A, and Winn J, Identifying participants in the personal genome project by name. Social Science Research Network, 2013.
5. Cimino JJ, The false security of blind dates: chrononymization’s lack of impact on data privacy of laboratory data. Applied Clinical Informatics, 2012. 3(4): 392-403.

Saturday, June 8, 2013

What is a Thinking Informatician to Think of IBM's Watson?

One of the computer applications that has received the most attention in healthcare is Watson, the IBM system that achieved fame by beating humans at the television game show, Jeopardy!. Sometimes it seems there is such hype around Watson that people do not realize what the system actually does. Watson is a type of computer application known as a "question-answering system." It works similarly to a search engine, but instead of retrieving "documents" (e.g., articles, Web pages, images, etc.), it outputs "answers" (or at least short snippets of text that are likely to contain answers to questions posed to it).

As one who has done research in information retrieval (IR, also sometimes called "search") for over two decades, I am interested in how Watson works and how well it performs on the tasks for which it is used. As someone also interested in IR applied to health and biomedicine, I am even more curious about its healthcare applications. Since winning at Jeopardy!, Watson has "graduated medical school" and "started its medical career". The latter reference touts Watson as an alternative to the "meaningful use" program providing incentives for electronic health record (EHR) adoption, but I see Watson as a very different application, and one potentially benefitting from the growing quantity of clinical data, especially the standards-based data we will hopefully see in Stage 2 of the program. (I am also skeptical of some of these proposed uses of Watson, such as its "crunching" through EHR data to "learn" medicine. Those advocating Watson performing this task need to understand the limits to observational studies in medicine.)

One concern I have had about Watson is that the publicity around it has been mostly news articles and press releases. As an evidence-based informatician, I would like to see more scientific analysis, i.e., what does Watson do to improve healthcare and how successful is it at doing so? I was therefore pleased to come across a journal article evaluating Watson [1]. In this first evaluation in the medical domain, Watson was trained using several resources from internal medicine, such as ACP Medicine, PIER, Merck Manual, and MKSAP. Watson was applied, and further trained with 5000 questions, in Doctor's Dilemma, a competition somewhat like Jeopardy! that is run by the American College of Physicians and in which medical trainees participate each year. A sample question from the paper is, "Familial adenomatous polyposis is caused by mutations of this gene," with the answer being, "APC Gene." (Googling the text of the question gives the correct answer at the top of its ranking for this and the two other sample questions provided in the paper.)

Watson was evaluated on an additional 188 unseen questions [1]. The primary outcome measure was recall at 10, i.e., the proportion of questions for which the correct answer appeared among the top ten results shown, and performance varied from 0.49 for the baseline system to 0.77 for the fully adapted and trained system. In other words, for 77% of these 188 questions, the correct answer was somewhere in Watson's top ten answers.
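For readers unfamiliar with the measure, the following is a minimal Python sketch of recall at k as I have described it above: the fraction of questions whose correct answer appears anywhere in the top k ranked answers. It is not the evaluation code from the paper; the function name, the exact-string matching, and the toy data are all assumptions made for illustration. On that reading, 0.77 over 188 questions corresponds to roughly 145 questions with a correct answer in the top ten.

```python
def recall_at_k(ranked_answers, correct_answers, k=10):
    """Fraction of questions whose correct answer is in the top k ranked answers.

    ranked_answers: dict mapping question id -> list of answers, best first
    correct_answers: dict mapping question id -> the single correct answer
    Matching is by exact string equality, a simplifying assumption.
    """
    if not correct_answers:
        raise ValueError("need at least one question")
    hits = 0
    for question_id, correct in correct_answers.items():
        top_k = ranked_answers.get(question_id, [])[:k]
        if correct in top_k:
            hits += 1
    return hits / len(correct_answers)


# Toy data for illustration only; these are not Watson's actual outputs.
ranked = {
    "q1": ["APC gene", "other gene"],
    "q2": ["x", "y", "z"],
    "q3": ["m"],
    "q4": [],
}
gold = {"q1": "APC gene", "q2": "z", "q3": "n", "q4": "p"}

print(recall_at_k(ranked, gold, k=2))
print(recall_at_k(ranked, gold, k=3))
```

Note that this measure gives the same credit whether the correct answer is ranked first or tenth, which is one reason the comparison with other systems matters.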

We can debate whether or not this is good performance for a computer system, or a computer system being touted to provide knowledge to expert users. But more disappointing are the study's limitations, which I would have raised had I been asked to peer-review the paper.

The first question I had was, how does Watson's performance compare with other systems, including IR systems such as Google or PubMed? As noted above, for the three example questions provided in the paper, Google gave the answer in the snippet of text from the top-ranking Web page each time. It would be interesting to know how other online systems would compare with Watson's performance on the questions used in this study.

Another problem with the paper is that none of the 188 questions were provided, not even as an appendix. In all of the evaluation studies I have performed (e.g., [2-4]), I have always provided some or all of the questions used in the study so the reader could better assess the results.

A final concern was that Watson was not evaluated in the context of a real user. While systems usually need to be evaluated from the "system perspective" before being assessed with users, it would have been informative to see whether Watson provided novel information or altered decision-making in real-world clinical scenarios.

Nonetheless, I am encouraged that a study like this was done, and I hope that more comprehensive studies will be undertaken in the near future. I do maintain enthusiasm for systems like Watson and am confident they will find a role in medicine. But we need to be careful about hype and we must employ robust evaluation methods to test our claims as well as determine how they are best used.


1. Ferrucci, D, Levas, A, et al. (2012). Watson: Beyond Jeopardy! Artificial Intelligence. Epub ahead of print.
2. Hersh, WR, Pentecost, J, et al. (1996). A task-oriented approach to information retrieval evaluation. Journal of the American Society for Information Science. 47: 50-56.
3. Hersh, W, Turpin, A, et al. (2001). Challenging conventional assumptions of automated information retrieval with real users: Boolean searching and batch retrieval evaluations. Information Processing and Management. 37: 383-402.
4. Hersh, WR, Crabtree, MK, et al. (2002). Factors associated with success for searching MEDLINE and applying evidence to answer clinical questions. Journal of the American Medical Informatics Association. 9: 283-293.

Saturday, June 1, 2013

Can We Have an Informed Discussion About Healthcare Reform?

One of the responsibilities that I enjoy a great deal and take most seriously as an educator is getting people to think about and to look at issues from all perspectives. Yes, I do have my own opinions about many things, but I genuinely try to get people to form their own opinions based on an informed analysis of the facts. I am also willing to change my opinions when the facts no longer support them.

In this vein, I find the national debate in the US about healthcare reform very frustrating. While I admittedly support the Affordable Care Act (aka, Obamacare), I also know it is imperfect. If any healthcare policy wonk were designing healthcare reform from scratch, they would likely not come up with Obamacare. Yet I also recognize that healthcare reform is a political process, and that political outcomes, based on compromise and tradeoffs, never completely satisfy anyone. In addition, we cannot forget why we need healthcare reform in the first place, which is because our current healthcare system is wasteful, harmful, and not sustainable. Doing nothing is not an option, and I believe that Obamacare is preferable to maintaining the status quo.

One of the biggest ironies about Obamacare is that while most Americans oppose the overall law, they support most of the provisions in it, particularly the requirements prohibiting lifetime limits on coverage and exclusions for preexisting conditions. They also see changes to their own health insurance plans, changes that would have come with or without Obamacare and are usually at the behest of their employers facing continued premium increases, and blame them on Obamacare. (And clearly those who have fundamental disagreements with Obamacare, or just want to see the President fail no matter what, exploit this to their advantage.)

Probably the main reason why I find the healthcare reform debate so frustrating is that most Americans do not understand many of the core issues around healthcare delivery and finance. In particular, they do not understand the difference between health insurance and healthcare expenditures. Very few Americans, only the very wealthy, can afford to pay for all healthcare costs. Instead, we all pay for healthcare insurance. Furthermore, free markets do not really work in most areas of healthcare, and it is debatable whether we should even try to make them work, as I noted in this blog during the height of the debate over healthcare reform legislation.

This ignorance is best exemplified by postings such as one on an anti-Obamacare site. The quote at the top of the women's healthcare portion of the site (reproduced as a picture below in case the site changes) lays bare how badly people misunderstand health insurance (private or public): “I had a hysterectomy, I have no need for maternity coverage, but I have to now pay for it. I have to pay not only my own premium but I have to subsidize everybody else.” (Kudos to JD Kleinke for pointing out this site in one of his blog postings. I also agree with another posting of his that Obamacare is more conservative, i.e., less liberal, than Medicare.)

The person quoted on this site obviously misunderstands the concept of health insurance. How many people not needing a hysterectomy subsidized this woman's hysterectomy? She obviously does not understand that the whole idea behind insurance is that we "subsidize" each other's needed care, so that when we need it ourselves, it is available for us. If we start lopping off this condition or that procedure from health insurance, then we soon lose the whole concept of insurance. (This is one of the reasons why most Obamacare insurance exchange plans will be more expensive than cut-rate plans that offer meager coverage and can be terminated at any time.) Carrying this woman's logic to an extreme, does she now no longer support paying for women who need a hysterectomy, since she no longer needs one?

Another manifestation of this thinking concerns Medicare. The famous quote "keep your government hands off my Medicare" (last couple paragraphs of this Washington Post article) best demonstrates how little many people truly understand about Medicare. Less blatantly, however, many elderly people who think nothing of demanding anything and everything from Medicare are the same people who are opposed to other forms of government-run health insurance, especially Medicaid for the poor. Yet these seniors do not realize that they are getting severalfold more benefits from Medicare than the contributions they have made over their lives (Fried, J (2008). Democrats and Republicans - Rhetoric and Reality: Comparing the Voters in Statistics and Anecdotes. New York, NY, Algora Publishing.).

But I do agree with those who argue that we cannot provide unfettered access to any and all types of care to everyone, seniors or otherwise. We do need to make some decisions as a society about what constitutes adequate healthcare coverage, and who should pay what. There are some areas where competition and free markets work in healthcare, and those should be encouraged. But the notion that we can buy less costly insurance policies, covering only this or that, really does not make sense.

I am willing to explore all the possible options for healthcare reform. Some conservative ideas make sense. But before we can have those discussions, a good proportion of the population needs to understand some basic realities about healthcare and its financing, and be willing to have an honest discussion about them.