“Data” is a plural noun!

How many times have you seen (or heard someone say) something like, “The data shows that 40,000 women will die of breast cancer this year”?

Actually, “the data shows” nothing of the sort.

For a start, the word “data” is the plural of the singular word “datum” — from Latin, and the word “datum” originally meant a “gift” or “present.” Over the past 2,000 years, in many different European languages, the term “datum” came to mean things like a piece of information (e.g., my given name is Michael), or a date (e.g., May 2, 1955), or a fact that can be established based on direct observation (e.g., “This patient has a fracture of the right clavicle”). Thus, “data” — in English and American — has come to mean a set of facts. More recently, “big data” has come to mean the sort of vast set of facts that might include all sorts of information about people, their health, their genetic risks for groups of disorders, and how well they are being cared for. And so …

The correct statement (arguably) would have been, “The data show that ….”

But … even that statement can be misleading.

Data on their own may tell us very little. Here are some data:

  • The numbers of people drinking beer in the bar at 7:00 p.m. on Friday night for the past month were: 43, 8, 136, and 27.

I can inform you that these data are accurate, factual, and “established based on direct observation.” What do you think you can tell from these data?
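As a rough illustration of how little these four numbers say on their own, here is a minimal Python sketch that computes a few standard summaries. The variable names are my own; the counts are the ones listed above.

```python
from statistics import mean, median

# The four Friday-night head counts quoted above.
counts = [43, 8, 136, 27]

# Basic descriptive summaries: accurate, but they explain nothing
# about why the counts vary so widely.
summary = {
    "mean": mean(counts),
    "median": median(counts),
    "min": min(counts),
    "max": max(counts),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

The summaries are just as factual as the raw counts, yet they still don't tell us which bar this was, what happened on the night 136 people turned up, or whether next Friday will look anything like the last four.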

Interpretation of data requires insight. Insight can be used to convert data into knowledge. But the accuracy of that knowledge is not necessarily 100%. Here’s an example.

At the beginning of every year since 1951, the American Cancer Society (ACS) has issued a document called “Cancer facts and figures” (all such documents since 2005 are available online). These annual documents contain detailed estimates and projections of things like how many people will be diagnosed with and die from specific types of cancer here in America during the upcoming 12-month period. So let’s look at the latest breast cancer data provided by the ACS. The ACS projects that, in the US, in 2019, there will be 271,270 new diagnoses of breast cancer (268,600 in women; 2,670 in men) and that 42,260 people will die from breast cancer (41,760 women; 500 men) in the same year. (With rare exceptions, the people who get diagnosed with breast cancer in 2019 are not the ones who will die of breast cancer in 2019.)
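For readers who like to check the arithmetic, the short Python sketch below confirms that the quoted totals match their breakdowns by sex and works out the share of new diagnoses projected in men. The dictionary and variable names are illustrative; the figures are the ACS projections quoted above.

```python
# ACS projections for 2019, as quoted above.
diagnoses = {"women": 268_600, "men": 2_670}
deaths = {"women": 41_760, "men": 500}

total_diagnoses = sum(diagnoses.values())
total_deaths = sum(deaths.values())

# Share of projected new diagnoses that occur in men.
male_share = diagnoses["men"] / total_diagnoses

print(f"Total diagnoses: {total_diagnoses:,}")
print(f"Total deaths: {total_deaths:,}")
print(f"Share of diagnoses in men: {male_share:.2%}")
```

Note that dividing the projected deaths by the projected diagnoses would not give a meaningful "death rate," because the two groups are largely different people.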

However, if you read all the relevant information with care, it gradually becomes clear that we don’t actually know how many people get diagnosed with or die from breast cancer in any particular year. What we know are: (a) how many people are given a diagnosis of breast cancer and (b) how many people are classified as having died of breast cancer in representative samples of the population over specific time periods. These data are collected through the Surveillance, Epidemiology, and End Results (SEER) program. The most accurate estimates we have, based on the SEER data, are currently from the period from 2011 to 2015. Thus, the ACS projections for new diagnoses and deaths from breast cancer in 2019 are based on data that are roughly 4 to 8 years old.

Now, the projections made by the ACS each year are reasonably accurate. But they are still only estimates. Worse still, since we never really know exactly how many people get diagnosed with and actually die from breast cancer in America each year, even after the event, it is very difficult to tell quite how accurate these projections are. And that’s not even accounting for the fact that it can be incredibly difficult to know exactly what someone in their 80s with metastatic breast cancer actually died from.

What’s the point of all this?

Data and knowledge aren’t the same thing at all. Most people think that the numbers issued by the ACS at the beginning of each year are “real.” But they aren’t. They are sophisticated guesstimates based on mathematical models that have been refined over the years. And because the American population is still growing in size, there is a difference each year between the estimated numbers of people who will get diagnosed with and die from breast cancer (both of which are slowly going up) and the estimated, age-adjusted numbers of people per 100,000 in the population who will get diagnosed with and die from breast cancer (both of which now stay about the same from year to year).
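To see why the counts can creep upward while the rate stays flat, here is a minimal Python sketch. Everything in it is hypothetical: the rate of 130 per 100,000 and the round population figures are invented for illustration, and it uses a simple crude rate rather than the age-adjusted rates the ACS reports (age adjustment also weights each age group against a standard population, which is omitted here).

```python
def expected_cases(population: int, rate_per_100k: float) -> float:
    """Expected number of cases for a given population and rate per 100,000."""
    return population * rate_per_100k / 100_000


# Hypothetical, constant rate per 100,000 people (not an ACS figure).
rate = 130.0

# Hypothetical round population figures for a growing population.
populations = {2017: 325_000_000, 2018: 327_000_000, 2019: 329_000_000}

# Same rate every year, but a larger population each year.
cases = {year: expected_cases(pop, rate) for year, pop in populations.items()}

for year, n in cases.items():
    print(f"{year}: rate {rate} per 100,000 -> about {round(n):,} cases")
```

With the rate held constant, the expected count still rises every year simply because there are more people to apply it to.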

If you’re confused, don’t be surprised. The use of statistics to make particular points is rarely as straightforward as one might think.
