Most COVID Opinions Are Bullsh*t

Did you know that the number of epidemiologists grew exponentially after COVID started? Kidding. If you were a frequent social media user over the past two years, then you likely encountered an overwhelming number of people analyzing and interpreting COVID-related data. While I admire society's renewed interest in data science, it is unfortunate that knowledge of statistics and probability was often left at the door once these discussions began. One of the most dangerous consequences of the democratization of data is the misuse of data and the misunderstanding of its limitations.

Image by author

While the use (and misuse) of data is growing in society, data literacy is spreading much more slowly. Even though I have a strong background in statistics, I have been relatively quiet about COVID-related data on social media. Why do you think that is? In order to draw conclusions about COVID-related data, you need to consider many different questions. The study of statistics requires understanding uncertainty and how to draw inferences beyond what the data tells you. Below is the minimum set of questions you would need to answer to get close to a statistically sound conclusion regarding COVID data.

  • Is the frequency of testing consistent among states/cities?
  • What are the false positive and false negative rates of the different testing instruments?
  • What is the level of data quality coming from this source in comparison to others?
    • To what level are COVID rates being overcounted by including those in the hospital “with COVID” and not “from COVID”?
    • To what level are COVID rates being undercounted by self-testing?
    • To what level are COVID rates being undercounted by a lack of testing in some regions?
  • What is the efficacy of each type of vaccine?
    • Has this remained consistent over time?
  • What is the hospitalization rate in each state/city?
    • Are there seasonality factors in this data (e.g. higher hospitalization during flu season)?
  • How much does the density of population affect the spread of the virus?
  • How much does the local vaccine rate affect the spread of the virus?
  • To what degree is this data lagging in terms of data collection (e.g. hospitalizations lag behind virus spread)?
  • After how large a sample of vaccinations can we confidently say that the vaccine is a much lower risk than the virus?*
  • Will vaccines change the rate of virus mutations among society?
    • What is the typical behavior of virus mutations?
    • What is the level of data quality regarding prior viruses in society (e.g. Spanish flu)?
  • What is the level of uncertainty in each piece of data and each conclusion?
    • Is an overreaction to stopping the virus necessary given the high level of uncertainty?
    • Is an overreaction justified given the economic impacts on society?
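
The false positive question in the list is a good example of why raw positive counts are hard to interpret on their own. Here is a minimal sketch in Python of how the meaning of a positive test shifts with how common the virus is in the tested population. The sensitivity, specificity, and prevalence values are made-up numbers for illustration, not measurements of any real test, and the function name is my own.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability that someone who tests positive is actually infected (Bayes' rule)."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


# Hypothetical test: catches 90% of infections, wrongly flags 5% of healthy people.
for prevalence in (0.01, 0.10, 0.30):
    ppv = positive_predictive_value(prevalence, sensitivity=0.90, specificity=0.95)
    print(f"prevalence {prevalence:.0%}: P(infected | positive test) = {ppv:.1%}")
```

With these invented inputs, the same positive result means roughly a 15% chance of infection when 1% of those tested are infected, and roughly an 89% chance when 30% are. Comparing positive counts across regions without knowing who was tested, and with which instrument, can therefore mislead.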

Whenever you read your friends’, doctors’, family members’, data scientists’, or news stations’ conclusions based on COVID-related data, do you think they took into consideration all of the questions listed above? Do you think they considered even two of these questions? Given enough time and the right data set, I could answer some of these questions. However, I would never be able to answer all of them without the assistance of an actual epidemiologist and other statisticians to verify my findings. Regardless of how much analysis you do, your conclusions will never be 100 percent settled. Uncertainty surrounding data quality, bias, and variance will always be present no matter the methodology you choose. This uncertainty is why any trustworthy analysis or viewpoint needs to couple qualitative thinking with probabilistic thinking.
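
To make the point about variance concrete, here is a rough sketch of how much uncertainty sits behind a single observed rate. It uses the Wilson score interval, one common way to put an approximate 95% range around a proportion. The counts are invented, the function name is mine, and the interval only captures sampling variance; it says nothing about data quality or bias, which no formula fixes.

```python
import math


def wilson_interval(events, n, z=1.96):
    """Approximate 95% Wilson score interval for an observed proportion events / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = events / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# The same observed 10% rate, seen in a small and a large hypothetical sample.
for events, n in ((5, 50), (500, 5000)):
    low, high = wilson_interval(events, n)
    print(f"{events}/{n}: plausible rate between {low:.1%} and {high:.1%}")
```

The observed rate is identical in both cases, but the small sample is compatible with a far wider range of true rates. A headline number without its sample size tells you very little.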

It comes as no surprise that some of the best wisdom I have heard about the virus has come from Nassim Taleb, an expert on uncertainty. Next time you analyze data, I would caution you against being certain of your conclusion and swinging too far to the quantitative side of the pendulum. This becomes even more critical when the level of data quality is in question.

~ The Data Generalist

*I believe Taleb is saying that if vaccines were more dangerous than the virus, then there would have been millions of issues popping up fairly quickly. This would be the left tail of a probability distribution that describes all possible outcomes for taking the vaccine over a specific time period. Since that did not happen, the argument goes, the vaccine is less risky than the virus.

