By Joshua Anderson
Since the 1970s, the world has been in the “Information Age,” marked by massive advances in electronics, especially computers. The Information Age has given information technology immense economic and cultural value. In more recent years, the continual growth of computing power, along with the normalization of receiving huge amounts of information on a daily basis, has driven statistical and mathematical modeling into everyday conversations. Media outlets constantly discuss statistical insights about racial issues, the stock market, business decisions, and nearly any other topic that comes to mind. With this overflow of information comes an unprecedented quantity of false information. With the 2020 presidential election coming up, there is a clear mental tug-of-war between the political parties, which are using statistical information to convince voters to support their party. I want to discuss some areas of statistics and statistical modeling that are consistently misused, in the hope that you can critically evaluate the information presented to you and cast your vote with confidence for your candidate.
Unsurprisingly, even the most basic statistics can be presented in intentionally misleading ways. One of the most common is data visualization. Graphs are extremely useful and much more interesting to look at than raw data, yet they must be built with great care to present data truthfully. Although there are plenty of ways to skew a graph, one concrete tell of a misleading graph is its axes. Let us take a look at two different examples (note this is not real data):
The bar graph shows the number of new jobs and unemployment claims in the auto industry. You might look at this and think that there are significantly more unemployment claims than new jobs. The scatter plot, on the other hand, shows the relationship between price and mileage for modern and older cars. Looking at it, you might conclude that lower-mileage cars are priced increasingly higher, especially among older cars. Now let us look at these graphs with different axes:
These charts look very different, yet they use exactly the same data. Now the bar graph appears to show a drastic difference between the number of new jobs and the number of unemployment claims. The scatter plot now suggests there is hardly any relationship between mileage and price. Even though the same data is being presented, the apparent implications are vastly different.
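To see why the axes matter so much, here is a minimal Python sketch. It does not reproduce the charts above. The job and claim counts are made up for illustration, and `apparent_ratio` is a hypothetical helper that measures how much taller one bar is drawn than the other when the value axis starts somewhere other than zero.

```python
def apparent_ratio(smaller, larger, axis_min=0.0):
    """Ratio of the drawn bar heights when the value axis starts at axis_min."""
    if axis_min >= min(smaller, larger):
        raise ValueError("axis_min must be below both values")
    # A bar is only drawn from the axis minimum up to its value.
    return (larger - axis_min) / (smaller - axis_min)

# Made-up numbers, not real data.
new_jobs = 9_500
unemployment_claims = 10_000

# Axis starting at zero: the bars differ by about 5 percent.
full_axis = apparent_ratio(new_jobs, unemployment_claims, axis_min=0)

# Axis starting at 9,000: the same data is drawn with one bar twice as tall.
truncated_axis = apparent_ratio(new_jobs, unemployment_claims, axis_min=9_000)

print(f"axis from 0:     claims bar is {full_axis:.2f}x the jobs bar")
print(f"axis from 9,000: claims bar is {truncated_axis:.2f}x the jobs bar")
```

Nothing about the data changes between the two calls. Only the starting point of the axis does, and that is enough to turn a gap of roughly 5 percent into a bar that looks twice as tall.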
Politicians and political activists in particular have used strategies such as this to manipulate the facts to fit their narrative. They are indeed still using factual data, yet the data is presented in a light that changes the overall story. This kind of manipulation can take many forms other than visualizations. A few examples include inaccuracies and biases that come from how data is collected, ignoring the uneven distribution of certain categories, not citing a reliable source, and failing to state the specific conditions under which a statistic holds true.
Another brief example of simple statistics being used misleadingly comes from the last presidential debate. President Trump claimed that coronavirus has been at its worst in states that are predominantly led by Democrats, while former Vice President Biden claimed that coronavirus has been at its worst in predominantly Republican-led states. In fact, both were correct, but each refused to acknowledge the context referenced by the other. Trump was referring to the large portion of cases during the first two spikes in the U.S. that came from New York, Delaware, California, and Illinois. Biden was referring to the large portion of cases in this fall's third spike that came from midwestern states like Wisconsin and South Dakota, as well as southern states like Alabama. Both filtered the information to support their point. Other methods used to misrepresent statistics can be learned by taking a college-level introductory course in statistics.
What I find to be more difficult for people to understand and critically evaluate is statistical modeling. In recent years, many cases of uncertainty have been addressed using “models.” These are more advanced mathematical functions that try to make sense of the data presented to them. Typically, models are used to make predictions or to draw inferences about the data. We see them everywhere in media coverage of uncertain events, such as predictions for coronavirus, forecasts of the presidential election, and disparities in income based on gender or race. The issue with this relatively new method of understanding the world around us in the Information Age is how little accountability we demand of the people who make these models. Dr. Anthony Fauci recently expressed this sentiment at a press briefing on the pandemic, saying, “I know my modeling colleagues are not going to be happy with me, but models are as good as the assumptions you put into them.” This emphasizes the fact that most models are used in environments of uncertainty and have to change as we learn more about the problem.
By far, the most crucial mistake made by journalists and reporters in presenting these models is overlooking the fact that association does not imply causation. In most statistical models, a dataset consisting of some sample of a population is used to estimate the coefficients of a mathematical equation. Data scientists either use that model to make predictions, or use those coefficients to infer how a given variable relates to the outcome. Inference is where most non-technical people will misinterpret results. Along with the coefficients, these models output a metric called a p-value. In short, the p-value is the probability of seeing an association at least as strong as the one in the data if a given variable (the independent variable) actually had no relationship with what we are trying to predict (the dependent variable). When that number is low enough, statisticians will declare that the association between the independent and dependent variables is unlikely to be due to chance alone (i.e., they are associated).
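One way to make the p-value concrete is a permutation test, sketched below in Python. This is a simpler technique than the formulas most regression software uses, and it is swapped in here only for illustration. The mileage and price figures are made up, and `pearson_r` and `permutation_p_value` are hypothetical helper names. The idea is to shuffle one variable many times, which destroys any real relationship, and count how often chance alone produces a correlation at least as strong as the one observed.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def permutation_p_value(xs, ys, shuffles=5_000, seed=0):
    """Share of shuffles whose correlation is at least as strong as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    at_least_as_strong = 0
    for _ in range(shuffles):
        rng.shuffle(shuffled)  # break any real link between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            at_least_as_strong += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (at_least_as_strong + 1) / (shuffles + 1)

# Made-up data: mileage in thousands of miles, price in thousands of dollars.
mileage = [10, 25, 40, 55, 70, 85, 100, 115]
price = [28.0, 25.5, 24.1, 20.9, 19.0, 17.2, 14.8, 13.1]

r = pearson_r(mileage, price)
p = permutation_p_value(mileage, price)
print(f"correlation: {r:.3f}, permutation p-value: {p:.4f}")
```

A small p-value here only says that an association this strong would rarely appear by chance. It says nothing about whether mileage causes the price to change.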
The issue that arises when the media tries to explain this association is the assumption that, since the effect one factor has on our prediction is not likely due to chance, that factor must be the cause. This is absolutely and wholly incorrect. For example, if a group of individuals were to contract an illness, they would likely see a doctor. If that illness were severe enough, they would be admitted to the hospital. A statistical model that tries to predict whether an individual will be admitted to the hospital, and that uses visits to the doctor as an independent variable, will likely show the two are correlated. Using the media's logic, one could claim this model proves that if you visit the doctor, you are more likely to be hospitalized. Obviously this is false, but why? Take a look at this diagram:
Our predictive model only gives us statistical inference showing correlation. If we wish to prove an independent variable is the cause of our dependent variable, further analysis is needed. Intuitively we know that the illness is the cause of hospital admissions – not visits to the doctors – but in many instances, these models are used to find unknown relationships. These models often are used to infer causal relationships when all they have proven is association. This has been a major contributor to the spike in false information in recent years.
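The doctor-visit example can be sketched as a small simulation in Python. Every probability below is an assumption invented for illustration, not real medical data, and `simulate` and `hospital_rate` are hypothetical helper names. In this simulated world, illness severity drives both doctor visits and hospital admissions, and a doctor visit has no effect on admission at all.

```python
import random

def simulate(n=100_000, seed=0):
    """Return (severe, visited_doctor, hospitalized) for n simulated people."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        severe = rng.random() < 0.3  # assumed: 30% of illnesses are severe
        # Severity makes a doctor visit more likely.
        visited = rng.random() < (0.9 if severe else 0.2)
        # Admission depends only on severity, never on the visit.
        hospitalized = rng.random() < (0.5 if severe else 0.02)
        people.append((severe, visited, hospitalized))
    return people

def hospital_rate(people, visited, severe=None):
    """Hospitalization rate for a group, optionally within one severity level."""
    group = [h for s, v, h in people
             if v == visited and (severe is None or s == severe)]
    return sum(group) / len(group)

people = simulate()

# Ignoring severity, doctor visits look strongly tied to hospitalization.
overall_gap = hospital_rate(people, True) - hospital_rate(people, False)

# Comparing people with the same severity, the gap nearly disappears.
severe_gap = (hospital_rate(people, True, severe=True)
              - hospital_rate(people, False, severe=True))
mild_gap = (hospital_rate(people, True, severe=False)
            - hospital_rate(people, False, severe=False))

print(f"gap ignoring severity:  {overall_gap:+.3f}")
print(f"gap among severe cases: {severe_gap:+.3f}")
print(f"gap among mild cases:   {mild_gap:+.3f}")
```

The large gap in the first comparison is real association, yet by construction it is not causation. Once people with the same severity are compared, visiting the doctor makes almost no difference, which is the kind of further analysis a causal claim requires.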
Statistics is not an easy science, and much of the information I discussed may have been difficult to understand if you are not from a technical background. Most of what I study is the theory behind constructing these models. So, if there is anything you should take away from this, it is that there are countless examples of these models being misinterpreted, being deliberately misrepresented, or missing the whole picture, all of which badly distort our understanding of uncertainty. If we wish to find truth in the age of false information, we can no longer take information at face value. Instead, we should critically evaluate how it is presented to determine its honesty. When you fill out your ballot this year, I hope you consider how statistical persuasion may play a role in the information provided by political groups, so that you can make the best decision.
Joshua Anderson is a first-year graduate student at Chapman University studying Computational and Data Sciences. He is a technology columnist for The Hesperian.