Thursday, April 16, 2020

TREC-COVID: A New Information Retrieval Challenge for Covid-19

Calling all information retrieval (IR) and biomedical informatics researchers! My colleagues and I are pleased to announce a new research challenge related to Covid-19. TREC-COVID aims to develop and evaluate methods to optimize search engines for the rapidly expanding body of scientific papers about Covid-19 and related topics. A group of IR researchers from the Allen Institute for Artificial Intelligence (AI2), the National Institute of Standards and Technology (NIST), the National Library of Medicine (NLM), Oregon Health and Science University (OHSU), and the University of Texas Health Science Center at Houston (UTHealth) has organized the challenge. A press release and official Web site for the project have been posted. Although it is not an official part of the project, I am also maintaining a page about it.

TREC-COVID applies well-known IR evaluation methods from the NIST Text Retrieval Conference (TREC), an annual challenge evaluation that assesses retrieval methods with data from news sources, Web sites, social media, and biomedical publications. In an IR challenge evaluation, there is typically a collection of documents or other content, a set of topics based on real-world information needs, and relevance assessments that determine which documents are relevant to each topic. Different research teams submit runs of the topics over the collection from their own search systems, and metrics derived from recall and precision are then calculated from those runs using the relevance judgments.
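To make the evaluation concrete, here is a minimal Python sketch of how precision and recall might be computed for a single topic from a ranked run and a set of relevance judgments. The run and judgments shown are hypothetical stand-ins, not real TREC-COVID data, and this is not the official TREC file format or evaluation tooling (NIST's trec_eval).

    # Minimal sketch of precision and recall for one topic. The run and
    # qrels below are hypothetical stand-ins, not real TREC-COVID data.

    def precision_at_k(ranked_docs, relevant_docs, k):
        """Fraction of the top-k retrieved documents that are relevant."""
        hits = sum(1 for doc in ranked_docs[:k] if doc in relevant_docs)
        return hits / k

    def recall(ranked_docs, relevant_docs):
        """Fraction of all relevant documents that were retrieved."""
        if not relevant_docs:
            return 0.0
        hits = sum(1 for doc in ranked_docs if doc in relevant_docs)
        return hits / len(relevant_docs)

    # One topic: a system's ranked results and the judged-relevant documents.
    run = ["doc3", "doc7", "doc1", "doc9", "doc4"]
    qrels = {"doc1", "doc4", "doc8"}

    print(precision_at_k(run, qrels, 5))  # 0.4 (2 of the top 5 are relevant)
    print(recall(run, qrels))             # 0.67 (2 of 3 relevant docs found)

In practice, such metrics are averaged over all topics, and TREC reports a range of them, but precision and recall are the basic building blocks.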

The document collection for TREC-COVID comes from AI2, which has created the COVID-19 Open Research Dataset (CORD-19), a free resource of scholarly articles about COVID-19 and other coronaviruses. CORD-19 is updated weekly, although a fixed version will be used for each round of TREC-COVID. It includes not only articles published in journals but also those posted on preprint servers such as bioRxiv and medRxiv.
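For readers who want to explore the collection, the sketch below shows one way to read the dataset's metadata file in Python. The filename (metadata.csv) and column names ("title", "abstract") reflect my understanding of early CORD-19 releases and should be treated as assumptions; check the version you download.

    import csv

    # Sketch of reading the CORD-19 metadata file. The filename and
    # column names are assumptions based on early releases of the dataset.
    with open("metadata.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            title = row.get("title", "")
            abstract = row.get("abstract", "")
            # A real search system would index these fields for retrieval.
            if "coronavirus" in title.lower():
                print(title)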

Because the dataset (along with the world's corpus of scientific literature on Covid-19) is being updated frequently, there will be multiple rounds of the challenge, with later ones focused on identifying newly emerging research. There may also be other IR-related tasks, such as question-answering and fact-checking. The search topics for the first round are based on questions submitted to a variety of sources and were developed by Kirk Roberts of UTHealth, Dina Demner-Fushman of NLM, and me. Relevance judgments will be made by people with medical expertise, such as medical students and NLM indexers. I am overseeing the initial relevance judging process, which is being carried out by OHSU medical students who are currently sidelined from clinical activities due to the Covid-19 crisis.
