I wrote recently that one of my concerns for data science is the Big Data over-emphasis on one of its four Vs, namely volume. Since then, I have been emailing with Dr. Shaun Grannis and other colleagues from the Indiana Health Information Exchange (IHIE). I asked them about the size of their data for the nearly 6 billion clinical observations from the 17 million patients in their system. I was somewhat surprised to hear that the structured data only takes up 26 terabytes. I joked that I almost have that much disk storage lying around my office and home. That is a huge amount of data, but some in data science seem to imply that data sizes that do not start with at least “peta-” are somehow not real data science.
Of course, imaging and other binary data add much more to the size of the IHIE data, as do the intermediate products of the various processing steps carried out during analysis. But it is clear that the information “density” or “value” contained in that 26 terabytes is probably much higher than in a comparable amount of binary (e.g., imaging or genome) data. This leads me to wonder whether we should be thinking about how we might measure the density or value of different types of biomedical and health information, especially if we are talking about the Vs of Big Data.
The measurement of information is decades old. Its origin is attributed to Shannon and Weaver from a seminal publication in 1949 [1]. They quantified information in terms of the number of forms a message could take, measured in bits as the base-2 logarithm of that number. As such, a coin flip has 1 bit of information (heads or tails), a single die has about 2.6 bits, and a letter in the English language has about 4.7 bits. This simple count is of course simplistic in that it assumes the value of each form in the message is equal. For this reason, others such as Bar-Hillel and Carnap began adding semantics (meaning) that, among other things, allowed differing values for each form [2].
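To make the arithmetic concrete, here is a minimal sketch of the measure, assuming the information in one choice among N equally likely forms is log2(N) bits. The function names are my own, and the second function (the standard entropy formula for forms that are not equally likely) is included only as an illustration of how unequal forms change the count.

```python
import math

def bits_uniform(n_forms):
    """Bits of information in one choice among n equally likely forms."""
    return math.log2(n_forms)

def entropy_bits(probabilities):
    """Shannon entropy in bits when the forms are not equally likely."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

coin = bits_uniform(2)      # heads or tails
die = bits_uniform(6)       # six faces
letter = bits_uniform(26)   # 26 letters, all treated as equally likely

# Hypothetical coin that lands heads 90% of the time
biased_coin = entropy_bits([0.9, 0.1])

print(f"coin: {coin:.2f} bits, die: {die:.2f} bits, letter: {letter:.2f} bits")
print(f"biased coin: {biased_coin:.2f} bits")
```

The biased coin carries less than 1 bit per flip, because an outcome that is mostly predictable tells us less than a fair one.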
We can certainly think of plenty of biomedical examples where the number of different forms that data can take yields widely divergent value of the information. For example, the human genome contains 3 billion nucleotide pairs, each of which can take 4 forms. Uncompressed, and not accounting for the fact that a large proportion is identical across all humans [3], this genome by Shannon and Weaver’s measure would have 6 billion bits of information (2 bits per pair). The real picture of human genomic variation is more complex (such as through copy number variations), but the point is that there is less information density in the huge amount of data in a genome than in, say, a short clinical fact, such as a physical exam finding or a diagnosis.
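The genome figure can be checked with a few lines of arithmetic. This is a rough sketch, assuming each nucleotide pair is an independent choice among 4 equally likely forms (so log2(4) = 2 bits each) and using the approximate count of 3 billion pairs; the variable names are illustrative.

```python
import math

BASE_PAIRS = 3_000_000_000   # approximate size of the human genome
FORMS_PER_PAIR = 4           # four possible forms per nucleotide pair

bits_per_pair = math.log2(FORMS_PER_PAIR)
total_bits = BASE_PAIRS * bits_per_pair
total_megabytes = total_bits / 8 / 1_000_000   # 8 bits per byte, 10^6 bytes per MB

print(f"{total_bits:,.0f} bits, about {total_megabytes:,.0f} MB")
# prints "6,000,000,000 bits, about 750 MB"
```

Even before accounting for the portion shared across all humans, the raw upper bound is well under a gigabyte per genome.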
By the same token, images also have different information density than clinical facts. This is especially so as the resolution of digital images continues to increase. There is certainly value in higher-resolution images, but there are also diminishing returns in terms of the information value. Doubling, quintupling, or otherwise increasing the number of pixels or their depth will create more information as measured by Shannon and Weaver’s formula, but will not necessarily add to the value of that information.
Even clinical data may have diminishing returns based on its size. Some interesting work from OHSU faculty Nicole Weiskopf and colleagues demonstrates an obvious finding but one that has numerous implications for secondary use of clinical data, which is that sicker patients have more data in the electronic health record (EHR) [4-5]. The importance of this is that sicker patients may be “oversampled” in clinical data sets and thus skew secondary analysis by over-representing patients who have received more healthcare.
There are a number of implications of increasing volumes of data that we must take into consideration, especially when using such data for purposes for which it was not collected. This is probably true for any Big Data endeavor, where the data may be biased by the frequency and depth of its measurement. The EHR in particular is not a continuous sampling of a patient’s course, but rather represents intermittent periods of sampling that course. With the EHR there is also the challenge that different individual clinicians collect and enter data differently.
Another implication of data volumes is their impact on statistical significance testing. Large samples can feed one form of what many criticize in science as “p-hacking,” where researchers modify the presentation of their data in order to achieve a certain value for the p statistic, which measures the probability of seeing differences at least as large as those observed if chance alone were at work [6]. Most researchers are well aware that their samples must be of sufficient size in order to achieve the statistical power to detect a significant difference. However, on the flip side, with very large quantities of data it is very easy to obtain a p value showing that small, perhaps meaningless, differences are statistically significant.
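A small sketch shows how sample size alone can push a trivial difference past the usual 0.05 threshold. The numbers here are hypothetical (a 0.5 mmHg difference in systolic blood pressure with a standard deviation of 15 mmHg), and the test is a simple two-sided z-test for two equal-sized groups with a known common standard deviation, chosen for brevity rather than as a recommendation.

```python
import math

def two_sample_p_value(mean_diff, sd, n_per_group):
    """Two-sided p value for a difference in means between two
    equal-sized groups, using a z-test with a known common SD."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    # For a standard normal, P(|Z| >= |z|) equals erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical values: same tiny difference, growing sample sizes
MEAN_DIFF = 0.5   # mmHg
SD = 15.0         # mmHg

for n in (100, 10_000, 1_000_000):
    p = two_sample_p_value(MEAN_DIFF, SD, n)
    print(f"n per group = {n:>9,}: p = {p:.3g}")
```

With 100 patients per group the difference is nowhere near significant, with 10,000 per group it falls below 0.05, and with a million per group the p value is vanishingly small, even though the clinical meaning of half a millimeter of mercury has not changed.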
The bottom line is that as we think about using data science, certainly in biomedicine and health, and about developing information systems to store and analyze data, we must consider the value of information. Just because data is big does not mean it is more important than data that is small. Data science needs to focus on all types and sizes of data.
1. Shannon, CE and Weaver, W (1949). The Mathematical Theory of Communication. Urbana, IL, University of Illinois Press.
2. Bar-Hillel, Y and Carnap, R (1953). Semantic information. British Journal for the Philosophy of Science. 4: 147-157.
3. Abecasis, GR, Auton, A, et al. (2012). An integrated map of genetic variation from 1,092 human genomes. Nature. 491: 56-65.
4. Weiskopf, NG, Rusanov, A, et al. (2013). Sick patients have more data: the non-random completeness of electronic health records. AMIA Annual Symposium Proceedings 2013, Washington, DC. 1472-1477.
5. Rusanov, A, Weiskopf, NG, et al. (2014). Hidden in plain sight: bias towards sick patients when sampling patients with sufficient electronic health record data for research. BMC Medical Informatics & Decision Making. 14: 51. http://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/1472-6947-14-51.
6. Head, ML, Holman, L, et al. (2015). The extent and consequences of p-hacking in science. PLoS Biology. 13: e1002106. http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002106.