Saturday, June 8, 2013

What is a Thinking Informatician to Think of IBM's Watson?

One of the computer applications that has received the most attention in healthcare is Watson, the IBM system that achieved fame by beating humans at the television game show, Jeopardy!. Sometimes it seems there is such hype around Watson that people do not realize what the system actually does. Watson is a type of computer application known as a "question-answering system." It works similarly to a search engine, but instead of retrieving "documents" (e.g., articles, Web pages, images, etc.), it outputs "answers" (or at least short snippets of text that are likely to contain answers to questions posed to it).

As one who has done research in information retrieval (IR, also sometimes called "search") for over two decades, I am interested in how Watson works and how well it performs on the tasks for which it is used. As someone also interested in IR applied to health and biomedicine, I am even more curious about its healthcare applications. Since winning at Jeopardy!, Watson has "graduated medical school" and "started its medical career". The latter reference touts Watson as an alternative to the "meaningful use" program providing incentives for electronic health record (EHR) adoption, but I see Watson as a very different application, and one potentially benefitting from the growing quantity of clinical data, especially the standards-based data we will hopefully see in Stage 2 of the program. (I am also skeptical of some of the proposed uses of Watson, such as its "crunching" through EHR data to "learn" medicine. Those advocating that Watson perform this task need to understand the limits of observational studies in medicine.)

One concern I have had about Watson is that the publicity around it has consisted mostly of news articles and press releases. As an evidence-based informatician, I would like to see more scientific analysis, i.e., what does Watson do to improve healthcare and how successful is it at doing so? I was therefore pleased to come across a journal article evaluating Watson [1]. In this first evaluation in the medical domain, Watson was trained using several resources from internal medicine, such as ACP Medicine, PIER, the Merck Manual, and MKSAP. Watson was applied, and further trained with 5,000 questions, in Doctor's Dilemma, a competition somewhat like Jeopardy! that is run by the American College of Physicians and in which medical trainees participate each year. A sample question from the paper is, "Familial adenomatous polyposis is caused by mutations of this gene," with the answer being, "APC gene." (Googling the text of this question, and of the two other sample questions provided in the paper, returns the correct answer at the top of the ranking.)

Watson was evaluated on an additional 188 unseen questions [1]. The primary outcome measure was recall at ten results shown, i.e., whether the correct answer appeared among the top ten candidates returned. Performance varied from 0.49 for the baseline system to 0.77 for the fully adapted and trained system. In other words, for 77% of the 188 questions, the correct answer appeared somewhere in Watson's top ten answers.
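For readers unfamiliar with the metric, here is a minimal sketch of how recall at ten results might be computed over a question set. The function name, the toy questions, and the candidate lists are all hypothetical illustrations, not data from the paper.

```python
def recall_at_k(ranked_answers, correct_answer, k=10):
    """Return 1 if the correct answer appears among the top-k candidates, else 0."""
    return int(correct_answer in ranked_answers[:k])

# Hypothetical illustration: four questions, candidate lists truncated for brevity.
results = [
    (["APC gene", "TP53"], "APC gene"),   # correct answer ranked first -> hit
    (["BRCA1", "APC gene"], "APC gene"),  # correct answer within top ten -> hit
    (["TP53"], "APC gene"),               # correct answer absent -> miss
    (["APC gene"], "APC gene"),           # hit
]
hits = sum(recall_at_k(candidates, gold) for candidates, gold in results)
print(hits / len(results))  # 0.75
```

Under this definition, a score of 0.77 means the system surfaced the correct answer somewhere in its top ten candidates for 77% of the questions; it says nothing about where in those ten the answer was ranked.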

We can debate whether this is good performance for a computer system, particularly one being touted to provide knowledge to expert users. But a more disappointing aspect of the study is a set of limitations I would have raised had I been asked to peer-review the paper.

The first question I had was, how does Watson's performance compare with other systems, including IR systems such as Google or PubMed? As noted above, for the three example questions provided in the paper, Google gave the answer in the snippet of text from the top-ranking Web page each time. It would be interesting to know how other online systems would compare with Watson's performance on the questions used in this study.

Another problem with the paper is that none of the 188 questions were provided, not even as an appendix. In all of the evaluation studies I have performed (e.g., [2-4]), I have always provided some or all of the questions used in the study so the reader could better assess the results.

A final concern was that Watson was not evaluated in the context of a real user. While systems usually need to be evaluated from the "system perspective" before being assessed with users, it would have been informative to see whether Watson provided novel information or altered decision-making in real-world clinical scenarios.

Nonetheless, I am encouraged that a study like this was done, and I hope that more comprehensive studies will be undertaken in the near future. I do maintain enthusiasm for systems like Watson and am confident they will find a role in medicine. But we need to be careful about hype, and we must employ robust evaluation methods to test such claims as well as to determine how these systems are best used.


1. Ferrucci, D, Levas, A, et al. (2012). Watson: Beyond Jeopardy! Artificial Intelligence. Epub ahead of print.
2. Hersh, WR, Pentecost, J, et al. (1996). A task-oriented approach to information retrieval evaluation. Journal of the American Society for Information Science. 47: 50-56.
3. Hersh, W, Turpin, A, et al. (2001). Challenging conventional assumptions of automated information retrieval with real users:  Boolean searching and batch retrieval evaluations. Information Processing and Management. 37: 383-402.
4. Hersh, WR, Crabtree, MK, et al. (2002). Factors associated with success for searching MEDLINE and applying evidence to answer clinical questions. Journal of the American Medical Informatics Association. 9: 283-293.


  1. Very well thought out points on Watson. I agree with Dr. Hersh on the need for more peer-reviewed articles about Watson technology and applications.

    Watson falls under the category of cognitive computing, where the computer learns from an output feedback mechanism and is constantly tweaking itself.

    Watson's benefit over Google is in providing answers from validated medical publications, guidelines, etc., with a certain level of confidence, along with supporting documents.

  2. Well said! There is value even in taking Google's search capabilities and wrapping them in a nice usable package, but the big challenge is how to do more than concept matching (albeit complex). Can Watson understand the convoluted logic we physicians sometimes use in our documents and figure out how it relates to the question? Can it resolve conflicts between X medical papers whose abstracts all negate one another? What about taking patients' interviews/complaints and turning them into structured data that can be further analyzed by Watson (or any other system)?
    I know that it's not meant to do all of these yet, and therefore the "going to medical school" and other metaphors might be a bit too much.

  3. There is a huge difference between performing impressively on a televised quiz show and replicating that sort of performance in a scientifically controlled, double-blind test. After an extensive Internet search, I was unable to find any reference to any scientific paper on Watson at all, and I wonder if any such test has ever been performed. Watson certainly has access to an impressive database, and I have no doubt it can function very efficiently as a search mechanism, but I am extremely skeptical of its ability to parse the especially tricky sentences typical of Jeopardy "answers." Even if it had such capabilities, it is hard to understand how, in so many instances, it would just happen to come up with exactly the right "question" in cases where several different possibilities would seem equally likely.

    So, I'm wondering: has Watson ever been subjected to rigorous double-blind testing? And if so, where have the results been published?

    1. DocG,

      Reference #1 above (which has now been published: Ferrucci, D, Levas, A, et al. (2012). Watson: beyond Jeopardy! Artificial Intelligence. 199-200: 93-105.) is a test assessing how well Watson does with medical questions. As I note in my original posting, Watson is not compared in this study with other sources. The best study would be to compare Watson with real users in real clinical settings (or at least high-fidelity simulated situations) and compare it with other resources in giving actionable information. I hope someone does that study soon!

    2. Thanks for the prompt response. I skimmed the paper you referenced and found it interesting, but the testing focused more on how DeepQA (the "intelligence" software behind Watson) responds to questions posed in a standardized and thus very predictable manner, as opposed to the deliberately tricky questions posed on Jeopardy. My concern is not with the ability of Watson to function as a reasonably useful "expert system" in dealing with standard medical questions, but with its touted ability to accurately parse and thus "understand" complex and often ambiguous English sentences, to the point of being capable of choosing precisely the right type of database query in the face of so many other possibilities. And in order to prevail in Jeopardy, one would have to provide "questions" for those "answers" both very quickly and with a high degree of accuracy almost every time.

      I find myself in roughly the same situation as that of Mr. Markopolos when he questioned the validity of Bernie Madoff's consistent investment gains over so many years as extremely unlikely and thus almost certainly due to fraud. I find the ability of Watson to function so successfully on this quiz show to be comparably unlikely. And yet, as it would seem, these results have been accepted (just as Madoff's were) essentially without question, as though there were no difference between performance on a quiz show and a scientifically valid, controlled test. Why would medical professionals trust a system such as Watson unless it had been tested in a truly scientific manner? And my question remains: has it?

    3. I concur that Watson needs more research before being ready for prime time. Hopefully IBM will encourage and maybe even fund this.