The National Institutes of Health (NIH), the premier biomedical research organization in the US (and the world), has issued a Request for Information (RFI) that solicits input for its draft Strategic Plan for Data Science. As I did with the request for public input to the now-published Strategic Plan for the National Library of Medicine (NLM), I am posting my comments in this blog as well as submitting them through the formal collection process. I also made a similar posting with my comments on the NLM's RFI for promising directions and opportunities for next-generation data science challenges in health and biomedicine.
The draft NIH data science plan is a well-motivated and well-written overview of what NIH should be doing to ensure that the value of data science is leveraged to maximize its benefit to biomedical research and human health. The goals of connecting all NIH and other relevant data, modernizing the ecosystem, developing tools and the workforce skills to use it, and making it sustainable are all important and articulated well in the draft plan.
However, three additional aspects that are critical to achieving the value of data science in biomedical research are inadequately addressed in the draft. The first of these is the establishment of a research agenda around data science itself. We still do not understand all the best practices and other nuances around the optimal use of data science in biomedical research and human health. There are questions of how we best standardize data for use and re-use. What standards are needed for the best use of data? What gaps in current standards must be closed to improve the use of data in biomedical research, especially data that is not originally collected for research purposes, such as clinical data from the electronic health record and patient data that comes from wearables and sensors or is entered directly?
There also must be further research into the human factors around data use. How do we best organize workflows for optimal input, extraction, and utilization of data? What are the best human-computer interfaces for such work? How do we balance personal privacy and security versus the public good of learning from such data? What are the ethical issues that must be addressed?
The second inadequately addressed aspect concerns the workforce for data science. While the draft properly notes the critical need to train specialists in data science, there is no explicit mention of the discipline that has been at the forefront of “data science” since before the term came into widespread use. This is the field of biomedical informatics, whose education and training programs have been training a wide spectrum of those who work in data science, from the specialists who carry out the direct work to the applied professionals who work with researchers, the public, and others who implement the work of the specialists. NIH should acknowledge and leverage the wide spectrum of the workforce that will analyze and apply the results of data science work. The large number of biomedical (and related flavors of) informatics programs should expand their established role in translating data science from research to practice.
The final underspecified aspect concerns the organizational home for data science within NIH. The most logical home would be the National Library of Medicine (NLM), which is the new home of the Big Data to Knowledge (BD2K) program that was launched by NIH several years ago. The newly released NLM strategic plan is a logical complement to this plan. (Ideally, the NLM should be more appropriately named the National Institute for Biomedical Informatics and Data Science - NIBIDS - with the Library function being one of its critical functions.)
With the addition of these concerns, the NIH data science plan can make an important contribution to realizing the potential for data science in improving human health as well as preventing and treating disease.