Inferential Statistics 101: Hypothesis Testing

This article is an excerpt from the Shortform book guide to "Naked Statistics" by Charles Wheelan. Shortform has the world's best summaries and analyses of books you should be reading.


What is hypothesis testing? How do you know if a hypothesis is true?

Hypothesis testing is an inferential statistical method by which we test whether our tentative assumptions are supported by the data. Based on our statistical analyses, we can either accept these hypotheses or reject them, with varying degrees of certainty.

Let’s look at the common conventions around inferential statistics and hypothesis testing.

Hypothesis Testing

Inferential statistics test hypotheses, which are educated guesses about how the world works. There are common conventions around testing a hypothesis with inferential statistics. We’ll give a general overview of some of these conventions and apply them to an example in the following sections. 

A Good Hypothesis Takes Work

The word hypothesis is often used colloquially to mean a guess. But this colloquial use can create misconceptions about what a scientific hypothesis is. A scientific hypothesis is based on background subject knowledge and research, a review of related studies, and a sound understanding of any statistics that will be performed during the study. Therefore, when researchers arrive at a quality scientific hypothesis, they have already put in a great deal of time and work.

The Null and Alternative Hypotheses

In inferential statistics, hypothesis testing begins with a null hypothesis. A null hypothesis is a starting assumption about the relationship between two variables (typically, that there is no effect or no difference) that we’ll either accept or reject. Wheelan explains that the convention is to begin with a null hypothesis that we hope or expect to reject. For example, a vitamin company might hope to reject the null hypothesis that absorption of their new vitamin is no better than absorption of their previous formula.

Rejecting a null hypothesis in inferential statistics means “accepting” an alternative hypothesis. The alternative hypothesis is the logical opposite of the null hypothesis. If the vitamin company’s null hypothesis is that absorption of their new vitamin is no better than absorption of their previous formula, their alternative hypothesis is that absorption of their new vitamin is better than absorption of their previous formula.
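To make this concrete, here’s a minimal sketch of the vitamin example in Python (the absorption figures and the choice of a one-sided two-sample t-test are our own assumptions for illustration; they don’t come from Wheelan’s book):

```python
# A sketch of the vitamin example; data and test choice are assumed.
from scipy import stats

old_formula = [61, 58, 63, 60, 59, 62, 57, 60]  # hypothetical absorption (%)
new_formula = [64, 66, 61, 65, 63, 67, 62, 64]

# Null hypothesis: the new formula's mean absorption is no better than the old.
# Alternative hypothesis: the new formula's mean absorption is greater.
result = stats.ttest_ind(new_formula, old_formula, alternative="greater")
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```

A small p-value here would count as evidence against the null hypothesis, a point we’ll return to below.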

Why the Null Hypothesis?

Establishing both null and alternative hypotheses might seem like a redundant step because they are opposites of one another, but the null hypothesis upholds a fundamental rule in science: You can only disprove hypotheses, never prove them.

For example, say your hypothesis is that a certain species of bear only eats meat because that is all you have ever observed this type of bear eating. You can’t possibly collect data on every individual bear of that species and everything they have ever eaten to prove your hypothesis. However, if you observe one of the bears eating berries, your hypothesis is quickly disproven. 

While we can’t prove anything in science, we can collect data to support a hypothesis, which is the role of the alternative hypothesis. Therefore, the null hypothesis provides the opportunity to disprove an idea, and the alternative hypothesis provides the opportunity to use the results of our calculations to support an alternative idea.

Communicating Confidence in Statistics

Since there are no definitive answers in inferential statistics, we can never accept or reject a null hypothesis with complete certainty. Instead, we accept or reject a null hypothesis with a specified degree of confidence, known as the confidence level.

The person running the statistical test sets this threshold. Wheelan explains that commonly used values are .05, .01, and .1, but researchers can set the threshold wherever they choose. (Strictly speaking, these numbers are significance levels; the confidence level is their complement, so a .05 significance level corresponds to 95% confidence.)

The significance level represents the uncertainty that remains when we reject the null hypothesis. So at a .05 significance level, we can be 95% confident in rejecting the null hypothesis (.05 as a percentage is 5%, and 100 - 5 gives us 95%). At a .01 level, we can be 99% confident in rejecting the null hypothesis.
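In code, the conversion between the two is a one-line complement:

```python
alpha = 0.05                  # significance level: tolerated false-positive risk
confidence = 1 - alpha        # corresponding confidence level
print(f"{confidence:.0%}")    # prints: 95%
```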

In statistics, when we reject a null hypothesis, we accept the alternative hypothesis as (probably) true. For example, if the vitamin company rejects their null hypothesis at a .05 significance level, it means that results like theirs would occur less than 5% of the time if the new formula really were no better. They take this as strong evidence for the alternative hypothesis that their new formula has better absorption than their previous formula.

Type I and Type II Errors

Wheelan explains that another way to think about confidence levels is as the “burden of proof” that you’re putting on your statistical analyses. While a high burden of proof might seem intuitively desirable, there are trade-offs to consider when selecting confidence levels for statistical analyses. To understand these trade-offs, we need to understand Type I and Type II errors.

Type I errors are false positives, which means we reject a null hypothesis that is actually true (and accept an alternative hypothesis that is actually false). For example, if the absorption of our vitamin company’s new formula wasn’t any better than the previous formula, but their study led them to conclude that it was, they would be making a Type I error.

Type II errors are false negatives, which means we accept a null hypothesis that is actually false (and reject an alternative hypothesis that is actually true). For example, if the absorption of our vitamin company’s new formula was better than the previous formula, but their study led them to conclude that it wasn’t, they would be making a Type II error.

Setting your burden of proof high, say, a .001 significance level, makes it statistically more difficult to reject the null hypothesis because, at that level, results like yours would have to occur less than 0.1% of the time under the null before you could reject it. This makes you more likely to make a Type II error by accepting the null hypothesis as true when it’s not.

In contrast, setting the bar lower, say, a .1 significance level, reduces the burden of proof necessary to reject the null hypothesis because results that would occur up to 10% of the time under the null are still enough to reject it. This makes you more likely to make a Type I error by rejecting a null hypothesis that is actually true (and accepting an alternative hypothesis that is actually false).
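One way to see this trade-off is to simulate it. The sketch below (all numbers are assumptions, not from the book) runs thousands of mock studies, once with no real effect and once with a real effect, and counts how often each threshold gets it wrong:

```python
# Rough simulation of the Type I / Type II trade-off; setup is assumed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n = 2000, 30

def rejection_rate(effect, alpha):
    """Fraction of simulated studies that reject the null at this threshold."""
    rejections = 0
    for _ in range(n_trials):
        control = rng.normal(0.0, 1.0, n)
        treated = rng.normal(effect, 1.0, n)
        p = stats.ttest_ind(treated, control, alternative="greater").pvalue
        rejections += p < alpha
    return rejections / n_trials

for alpha in (0.1, 0.001):
    type_i = rejection_rate(effect=0.0, alpha=alpha)  # null actually true
    power = rejection_rate(effect=0.5, alpha=alpha)   # real effect exists
    print(f"alpha={alpha}: Type I rate ~{type_i:.3f}, Type II rate ~{1 - power:.3f}")
```

Run it and you should see the Type I rate track the threshold, while the Type II rate climbs as the threshold tightens.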

Reasons for Type I and Type II Errors

A small sample size can be the cause of some Type II errors. For example, say you were studying whether malaria is more common in mosquitoes in one region than another, but you only collect 20 mosquitoes from each location. It’s possible that you’d miss a real difference between the larger mosquito populations because you simply didn’t survey enough mosquitoes. Therefore, large sample sizes can help reduce Type II errors.
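A quick simulation (with invented infection rates) makes the sample-size point vivid: in the sketch below, region B’s mosquitoes really are infected twice as often, but 20 insects per region is usually too few to show it.

```python
# Sketch with assumed numbers: a real difference that small samples miss.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

for n in (20, 200):
    misses = 0
    for _ in range(2000):
        region_a = rng.binomial(1, 0.10, n)   # assumed 10% infection rate
        region_b = rng.binomial(1, 0.20, n)   # assumed 20% infection rate
        p = stats.ttest_ind(region_b, region_a).pvalue  # rough two-sample test
        misses += not (p < 0.05)              # failed to detect the real gap
    print(f"n={n} per region: Type II rate ~{misses / 2000:.2f}")
```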

A genuine relationship other than the one you’re studying (a confounding factor) can also be the cause of some Type I errors. For example, say you’re studying the effects of a healthy lunch initiative on school children’s attention levels in class. Your statistics show a significant relationship between increased attention levels and participation in the healthy lunch program. However, this increased attention may actually be caused (at least in part) by the new outdoor exercise program the school implemented at the same time, and students are focusing better in class after coming in from an outdoor walk. Therefore, careful consideration of possible confounding factors can help reduce some Type I errors.

Since you can never fully eliminate the possibility of making Type I and Type II errors, the circumstances around your research and statistical analysis will determine which one you are more willing to accept.

For example, as a medical researcher working on a new vaccine, you might have a very low tolerance for rejecting a true null hypothesis (Type I error). Since people’s health is at stake, you want to be very sure that your vaccine works. Therefore, you might opt for a very strict significance level, say, .001, meaning that you are 99.9% confident when you reject the null hypothesis that your vaccine is not effective against a certain malady. In this case, by demanding such a high level of confidence, you increase your chances of making a Type II error and concluding that your vaccine is not effective when, in fact, it is.

In contrast, you might be more inclined to accept a Type I error as a social sciences researcher looking to implement a recreational therapy program at a local senior center. You might even set your significance level at .25, meaning that you would be only 75% confident in rejecting the null hypothesis that your recreation program does not produce clinically significant results. In this case, your rationale might be that even if the results of the program are not clinically measurable, the community as a whole will still enjoy it, and there is little risk of doing harm.

Statistical Significance

The results of statistical analyses are often reported as being “statistically significant” or “not statistically significant.” When results are statistically significant, it means that your observed results would be unlikely to arise from random chance alone, so you can be reasonably confident they reflect the variable you are measuring.

Statistical significance is often reported with a statistic called the p-value. The p-value pinpoints how likely results at least as extreme as yours would be if the null hypothesis were true. In other words, it tells you the likelihood of getting results like yours if the variable you’re measuring really has no effect and chance alone influenced the data.

A small p-value adds confidence to rejecting the null hypothesis. When the p-value is less than your established significance level, you can report your results as statistically significant.

For example, say our vitamin company chooses a .05 significance level for their statistical analysis, meaning that they want to be 95% confident when they reject the null hypothesis and move forward with the new formula. Their statistical analysis shows a p-value of .01. This means that, if the null hypothesis were true, there would be only a 1% chance of getting results at least as extreme as theirs. Since this p-value falls below their .05 threshold, they report their results as “statistically significant.”
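In code, the final decision comes down to a single comparison (carrying over the assumed p-value of .01 from the example):

```python
alpha = 0.05    # significance threshold chosen before the analysis
p_value = 0.01  # assumed result of the company's analysis

if p_value < alpha:
    print("Statistically significant: reject the null hypothesis.")
else:
    print("Not significant: fail to reject the null hypothesis.")
```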

P-value Misconceptions

As Wheelan explains, the p-value is a tool for determining whether the values in our dataset are “significant” from a statistical perspective. Since the p-value can be tricky to grasp, people hold several common misconceptions about it. We’ll highlight two of them.

First, people mistakenly believe the p-value communicates how strong an effect one variable has on another. For example, if your data showed a statistically significant relationship between eating spinach and being able to lift heavier weights an hour later, and your p-value was .001, you might infer that spinach has a large impact on strength. However, a small p-value only means that results like yours would be unlikely if chance alone were at work; it says nothing about the size of the effect. The reality might be that participants were able to lift just an extra two ounces after eating spinach.
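A short simulation (numbers assumed) shows how this can happen: given a large enough sample, even a negligible real difference produces a tiny p-value.

```python
# Sketch: a trivial real difference plus a huge sample yields a tiny p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 200_000
control = rng.normal(100.00, 10.0, n)  # assumed lifting scores without spinach
spinach = rng.normal(100.15, 10.0, n)  # assumed tiny real improvement

result = stats.ttest_ind(spinach, control)
print(f"p = {result.pvalue:.1e}")  # very small p despite a negligible effect
```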

The second common misconception is that a small p-value means that you are likely to get the same result if you run the experiment again. In our above example, for instance, since your p-value of .001 indicates that results like yours would arise by chance alone only .1% of the time, you might infer that you don’t need to do the experiment again because your results are pretty conclusive. However, the p-value is specific to the dataset being analyzed and can vary widely even between different samples testing the same variable. This is especially true with smaller sample sizes.
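A final sketch (again with assumed numbers) shows how unstable the p-value can be: rerunning the same small experiment ten times, with a real but modest effect, produces p-values that swing from clearly significant to clearly not.

```python
# Sketch: p-values from ten reruns of the same small experiment vary widely.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
p_values = []
for _ in range(10):
    control = rng.normal(0.0, 1.0, 15)
    spinach = rng.normal(0.6, 1.0, 15)   # assumed modest real effect
    p_values.append(stats.ttest_ind(spinach, control).pvalue)

print([round(p, 3) for p in p_values])  # spread spans significant and not
```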
