Stop it people, the plural of anecdote IS data!

Some observations are better than other observations, but all are data if they’re accurate and used appropriately.

written Apr 6, 2023 • by Jon Sullivan • Category: Wild Soapbox

While we need more large datasets of clean structured data that are easy for data analysts to work with, we also need much bigger collections of anecdotal nature observations. They're all data.

There’s a phrase often quoted in data analysis circles: “the plural of anecdote is not data.” It’s a criticism of datasets created by amassing lots of casual unstructured observations (anecdotes). These kinds of data are something all sensible data analysts avoid whenever possible—it’s hard to analyse collections of anecdotes. They’re wrong though—the plural of anecdote is data—and that’s true for two quite different reasons.

The first reason, which has bothered me for a long time, is that it’s “baby and bath water” wrong. By dismissing collections of anecdotes as not being data, it discourages the collection of anecdotes. Yet, many important questions can be answered with collections of anecdotes. A big bath tub of anecdotes can contain a lot of babies, if you take the time to look carefully. I’ll dive into the details of that in a moment.

The second reason, which I was astonished to discover, is that the originator of this quotation knew all this too. The original quotation, by Ray Wolfinger, is that “the plural of anecdote is data”. Exactly! And, yes, that’s a complete reversal of meaning from its now more popular, mis-quoted mutation.

The mis-quote is currently on 5.2 times as many Google-indexed webpages (21,800) as the original quotation (4,150). Its popularity, I expect, is because it’s difficult to statistically reveal reliable patterns and trends from datasets of anecdotes, as they’re filled with biases. The mutant quotation seems to ring true to more people. Yet, importantly, much more can be revealed from data than just patterns and trends.

For me, the mutant quotation has the same snooty, counterproductive vibe as Ernest Rutherford’s famous put-down that all science except physics is stamp collecting. In both cases, the sometimes immense value of special cases is being ignored, or grossly undervalued, in the search for generalities.

One way to underscore the importance of collections of anecdotes is to consider some examples: natural history museums, herbaria, iNaturalist. All of these are largely made up of casually collected specimens and observations. Calculating patterns and trends in populations of species is difficult using these data, but that’s not really their point. The point is that these collections contain exceptionally important finds.

A display of pinned butterfly specimens from around the world, at the Canterbury Museum in Ōtautahi-Christchurch, NZ. When specimens are well-labelled and properly curated, they are a valuable source of knowledge about their species. In other words, collections of specimens are data.

I’ll be cheeky and also suggest that most of the scientific literature amounts to a large collection of carefully quantified anecdotes. That’s because it is strongly skewed by publication bias and a lack of study replication. Like the specimens in museum collections, a few scientific papers can be immensely important and influential. And, as with museum collections, meta-analyses of the scientific literature for bigger patterns and trends are complicated by the many difficult-to-quantify biases in the data.

So what babies can be found swimming in a large bath of anecdotal nature observations? Lots! Just one important observation can lead to the discovery of a new species, and allow it to be formally described. Species thought extinct can be rediscovered. Species’ distributions can be expanded by the discovery of new populations. The arrival of new pests can be promptly detected, and potentially eradicated.

More detailed observations can also reveal new behaviours and interactions, never before documented. Unusual genetics and chemistry can be described, some of which can be hugely valuable for their medical, industrial, or agricultural applications.

Groups of observations can also be important, even when collected in haphazard ways without proper survey methods. The known distributions of species can be mapped. Species taxonomy can be revised, by better describing and delimiting the variation within and between closely related species. Known locations can be correlated with environmental conditions and extrapolated with niche modelling statistics to predict the potential full distributions of species. Such models can predict where species should be, and could be, both now and under future scenarios of environmental change such as global warming, as well as in new countries should they invade and become pests. Detailed observations, or specimens, made in the same locations over many decades, can even reveal evolutionary changes in traits and genetics.
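
To make the niche modelling idea a little more concrete, here is a minimal Python sketch of the simplest kind of approach, a rectangular "climate envelope". It is only an illustration under stated assumptions: the variable names, the observation values, and the candidate sites are all made up, and real niche models use many more variables and far more careful statistics. The envelope is just the range of conditions at places where the species has been seen, and a new site counts as potentially suitable if its conditions fall inside that range.

```python
# A minimal climate-envelope sketch. All names and numbers are hypothetical.

def fit_envelope(occurrences):
    """Return the (min, max) of each environmental variable across observations."""
    variables = occurrences[0].keys()
    return {
        var: (min(rec[var] for rec in occurrences),
              max(rec[var] for rec in occurrences))
        for var in variables
    }

def is_suitable(envelope, site):
    """A site is inside the envelope if every variable lies within its range."""
    return all(low <= site[var] <= high for var, (low, high) in envelope.items())

# Made-up conditions at places where a species has been casually observed.
occurrences = [
    {"mean_temp_c": 11.5, "annual_rain_mm": 650},
    {"mean_temp_c": 12.8, "annual_rain_mm": 900},
    {"mean_temp_c": 10.2, "annual_rain_mm": 780},
]

envelope = fit_envelope(occurrences)

# Made-up sites where the species has not (yet) been recorded.
candidate_sites = {
    "site_a": {"mean_temp_c": 11.0, "annual_rain_mm": 700},
    "site_b": {"mean_temp_c": 15.5, "annual_rain_mm": 820},
}

predictions = {name: is_suitable(envelope, site)
               for name, site in candidate_sites.items()}
print(predictions)  # {'site_a': True, 'site_b': False}
```

Note that an envelope like this inherits whatever biases are in the observations: if nobody has looked in the colder or wetter places, the model will treat them as unsuitable. More observations, from more places, widen and sharpen the picture.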

All of this can be done with collections of anecdotal nature observations. These collections make knowledge. That surely meets the definition of data. More than that though, these uses scale with the size of the collections. The bigger the museum collection or citizen science project, the more of these important discoveries and uses will be found.

So, some data scientists still look at the masses of anecdotes in museum collections and iNaturalist and misquote that “the plural of anecdote is not data.” I say “No!” Not only is that wrong, but it’s very, very wrong. The plural of anecdote makes useful and important data. While we need more large datasets of clean structured data that are easy for data analysts to work with, we also need much bigger collections of anecdotal nature observations. They’re all data.