Thursday, December 8, 2016

Coping With Adversarial Information Retrieval in Modern Times

When I first chose my area of research focus in my postdoctoral fellowship in biomedical informatics in the late 1980s, I was intrigued by information retrieval (IR; also known as search). While most in informatics were still focused on artificial intelligence and expert systems, I was fascinated by the notion that computers could provide information in response to users entering text. At that time, of course, there were only modest amounts of information to retrieve. The main source was bibliographic databases such as MEDLINE. While the full text of journals and even some textbooks was starting to become available, it was mostly text and not figures or images.

The world of search started to change with the advent of the World Wide Web in the early 1990s. I had actually been skeptical that the Web could even deliver more than text in real time, given how slow the Internet was at that time. This was also a time when my colleagues at Oregon Health & Science University (OHSU) started putting on continuing medical education (CME) courses for physicians about the growing amount of information available (including via CD-ROM drives). But when we taught about searching the Web, we presented many caveats, especially because there was no control over the quality of information [1].

A related development around the same time was the growth of spam email [2]. In the 1980s and even into the early 1990s, the only real users of Internet email were academics and techies. But as the Web and underlying Internet spread to broader populations, so did spam email, especially because it was so easy to reach massive numbers of people.

These developments all gave rise to the notion of “adversarial” IR, something that was initially difficult to fathom when we were trying to develop the most effective methods to provide access to the highest quality information available [3]. But as content emerged that we hoped users would not retrieve, an additional focus arose in IR: finding ways to avoid giving users the worst information.

One advance that improved the ability of Web searching to retrieve high-quality material was Google and its PageRank algorithm [4]. A major change pioneered by Google was to rank results based not on measures of similarity between words in the query and page, at the time considered to be our best approach, but instead by how many other pages pointed to them. While not perfect, the number of links to a page is indeed associated with its quality, e.g., more pages will point to those from the National Library of Medicine or Mayo Clinic than to a less credible site.
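
As a rough illustration of link-based ranking, here is a minimal Python sketch of the power-iteration idea described by Brin and Page [4]. In that formulation a link counts for more when it comes from a page that is itself highly ranked, so it is a refinement of a plain count of inbound links. The toy web, the page names, and the function name `pagerank` are assumptions made up for this example, and the code is in no way a description of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by simple power iteration.

    `links` maps each page name to the list of pages it links to.
    The damping value of 0.85 is a conventional choice, assumed here.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with rank spread evenly across all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its rank equally among its outbound links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: two credible sites that many pages point to,
# and a "dubious" page that nothing links to.
toy_web = {
    "nlm": ["mayo"],
    "mayo": ["nlm"],
    "blog_a": ["nlm", "mayo"],
    "blog_b": ["nlm"],
    "dubious": ["blog_a"],
}

scores = pagerank(toy_web)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{page:8s} {score:.3f}")
```

In this toy web the two heavily linked pages end up with most of the rank, while the page that nobody links to stays at the baseline. The same property is what SEO efforts try to exploit: anyone who can manufacture inbound links can push a page upward.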

Of course, this situation resulted in a number of other consequences, not the least of which was the emergence of search engine optimization (SEO), enabling people to fight against PageRank and related algorithms [5]. It also set off a tit-for-tat battle of search engine sites hiring armies of engineers to figure out how people were trying to game their systems [6]. In more recent years, the emergence of new information streams, most notably the Facebook newsfeed, has provided new opportunities and led to the proliferation of “fake news,” which some have blamed for influencing the recent US presidential election [7].

While technology will play some role in solving the adversarial IR problem, it will not succeed by itself. Clever programmers and others will likely always find ways to exploit approaches to limiting the spread of false or incorrect information. The sheer volume of such information makes human intervention an unlikely solution, and of course one person’s high-quality information is another person’s trash heap.

The main way to solve the problem, however, is through education. It is all part of the basic modern information literacy that everyone must have in the 21st century. Just as I have argued that statistics should be a topic taught in high school if not earlier, so should modern information literacy, including as it relates to health. While there will always be shades of gray in terms of information quality, people can and should be taught how to recognize that which is flagrantly false.

I hope we will learn from fake news, newer variants of spam email such as phishing, and other risks of the Internet era that we must train society to better understand our new information ecosystem, and how we can benefit from its value while minimizing its risk.

References

1. Hersh, WR, Gorman, PN, et al. (1998). Applicability and quality of information for answering clinical questions on the Web. Journal of the American Medical Association. 280: 1307-1308.
2. Goodman, J, Cormack, GV, et al. (2007). Spam and the ongoing battle for the inbox. Communications of the ACM. 50(2): 25-33.
3. Castillo, C and Davison, BD (2011). Adversarial Web Search. Delft, Netherlands, now Publishers.
4. Brin, S and Page, L (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems. 30: 107-117. http://infolab.stanford.edu/pub/papers/google.pdf.
5. Anonymous (2015). The Beginner's Guide to SEO. Seattle, WA, SEOmoz. http://moz.com/beginners-guide-to-seo.
6. Singhal, A (2004). Challenges in Running a Commercial Web Search Engine. Mountain View, CA, Google. http://www.research.ibm.com/haifa/Workshops/searchandcollaboration2004/papers/haifa.pdf.
7. Davis, W (2016). Fake Or Real? How To Self-Check The News And Get The Facts. Washington, DC, National Public Radio. http://www.npr.org/sections/alltechconsidered/2016/12/05/503581220/fake-or-real-how-to-self-check-the-news-and-get-the-facts.
