
How to Write a Null Hypothesis (5 Examples)

A hypothesis test uses sample data to determine whether or not some claim about a population parameter is true.

Whenever we perform a hypothesis test, we always write a null hypothesis and an alternative hypothesis, which take the following forms:

H 0 (Null Hypothesis): Population parameter =, ≤, or ≥ some value

H A (Alternative Hypothesis): Population parameter <, >, or ≠ some value

Note that the null hypothesis always contains the equal sign.

We interpret the hypotheses as follows:

Null hypothesis: The sample data does not provide sufficient evidence to support the claim being made by an individual.

Alternative hypothesis: The sample data does provide sufficient evidence to support the claim being made by an individual.

For example, suppose it’s assumed that the average height of a certain species of plant is 20 inches tall. However, one botanist claims the true average height is greater than 20 inches.

To test this claim, she may go out and collect a random sample of plants. She can then use this sample data to perform a hypothesis test using the following two hypotheses:

H 0 : μ ≤ 20 (the true mean height of plants is less than or equal to 20 inches)

H A : μ > 20 (the true mean height of plants is greater than 20 inches)

If the sample data gathered by the botanist shows that the mean height of this species of plants is significantly greater than 20 inches, she can reject the null hypothesis and conclude that the mean height is greater than 20 inches.
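
To make this concrete, here is a minimal sketch of the botanist's one-sided test in Python (assuming SciPy ≥ 1.6, which added the `alternative` argument to `ttest_1samp`). The plant heights are invented for illustration, not data from this article.

```python
# A sketch of the botanist's one-sided test. The heights below are
# hypothetical illustration data, not measurements from the article.
from scipy import stats

heights = [20.9, 21.4, 19.8, 22.1, 20.5, 21.7, 20.2, 22.4, 21.1, 20.8]

# H0: mu <= 20 vs HA: mu > 20 (upper-tailed one-sample t-test)
t_stat, p_value = stats.ttest_1samp(heights, popmean=20, alternative="greater")

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0; evidence that mu > 20")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
```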

Read through the following examples to gain a better understanding of how to write a null hypothesis in different situations.

Example 1: Weight of Turtles

A biologist wants to test whether or not the true mean weight of a certain species of turtles is 300 pounds. To test this, he goes out and measures the weight of a random sample of 40 turtles.

Here is how to write the null and alternative hypotheses for this scenario:

H 0 : μ = 300 (the true mean weight is equal to 300 pounds)

H A : μ ≠ 300 (the true mean weight is not equal to 300 pounds)

Example 2: Height of Males

It’s assumed that the mean height of males in a certain city is 68 inches. However, an independent researcher believes the true mean height is greater than 68 inches. To test this, he goes out and collects the height of 50 males in the city.

H 0 : μ ≤ 68 (the true mean height is less than or equal to 68 inches)

H A : μ > 68 (the true mean height is greater than 68 inches)

Example 3: Graduation Rates

A university states that 80% of all students graduate on time. However, an independent researcher believes that less than 80% of all students graduate on time. To test this, she collects data on the proportion of students who graduated on time last year at the university.

H 0 : p ≥ 0.80 (the true proportion of students who graduate on time is 80% or higher)

H A : p < 0.80 (the true proportion of students who graduate on time is less than 80%)
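
As a sketch of how a claim like this could be tested once the data are in, the snippet below runs a one-proportion z-test by hand under the normal approximation. The counts (380 on-time graduates out of 500 students) are hypothetical.

```python
# A hand-rolled one-proportion z-test for a left-tailed alternative.
# The sample counts are invented for illustration.
from math import sqrt
from scipy.stats import norm

n, successes = 500, 380
p_hat = successes / n         # sample proportion = 0.76
p0 = 0.80                     # hypothesized value under H0

se = sqrt(p0 * (1 - p0) / n)  # SE uses p0 because we work under H0
z = (p_hat - p0) / se
p_value = norm.cdf(z)         # lower tail, since HA: p < 0.80

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```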

Example 4: Burger Weights

A food researcher wants to test whether or not the true mean weight of a burger at a certain restaurant is 7 ounces. To test this, he goes out and measures the weight of a random sample of 20 burgers from this restaurant.

H 0 : μ = 7 (the true mean weight is equal to 7 ounces)

H A : μ ≠ 7 (the true mean weight is not equal to 7 ounces)

Example 5: Citizen Support

A politician claims that less than 30% of citizens in a certain town support a certain law. To test this, he goes out and surveys 200 citizens on whether or not they support the law.

H 0 : p ≥ 0.30 (the true proportion of citizens who support the law is greater than or equal to 30%)

H A : p < 0.30 (the true proportion of citizens who support the law is less than 30%)

Additional Resources

  • Introduction to Hypothesis Testing
  • Introduction to Confidence Intervals
  • An Explanation of P-Values and Statistical Significance




Chapter 13: Inferential Statistics

Understanding Null Hypothesis Testing

Learning Objectives

  • Explain the purpose of null hypothesis testing, including the role of sampling error.
  • Describe the basic logic of null hypothesis testing.
  • Describe the role of relationship strength and sample size in determining statistical significance and make reasonable judgments about statistical significance based on these two factors.

The Purpose of Null Hypothesis Testing

As we have seen, psychological research typically involves measuring one or more variables for a sample and computing descriptive statistics for that sample. In general, however, the researcher’s goal is not to draw conclusions about that sample but to draw conclusions about the population that the sample was selected from. Thus researchers must use sample statistics to draw conclusions about the corresponding values in the population. These corresponding values in the population are called  parameters . Imagine, for example, that a researcher measures the number of depressive symptoms exhibited by each of 50 clinically depressed adults and computes the mean number of symptoms. The researcher probably wants to use this sample statistic (the mean number of symptoms for the sample) to draw conclusions about the corresponding population parameter (the mean number of symptoms for clinically depressed adults).

Unfortunately, sample statistics are not perfect estimates of their corresponding population parameters. This is because there is a certain amount of random variability in any statistic from sample to sample. The mean number of depressive symptoms might be 8.73 in one sample of clinically depressed adults, 6.45 in a second sample, and 9.44 in a third—even though these samples are selected randomly from the same population. Similarly, the correlation (Pearson’s  r ) between two variables might be +.24 in one sample, −.04 in a second sample, and +.15 in a third—again, even though these samples are selected randomly from the same population. This random variability in a statistic from sample to sample is called  sampling error . (Note that the term error  here refers to random variability and does not imply that anyone has made a mistake. No one “commits a sampling error.”)
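
A short simulation can make sampling error tangible. The population values below are invented for illustration; the point is only that random samples from the same population yield different sample means.

```python
# Sampling error in action: three random samples from one population.
import numpy as np

rng = np.random.default_rng(0)
population_mean, population_sd = 8.0, 3.0  # hypothetical values

for i in range(3):
    sample = rng.normal(population_mean, population_sd, size=50)
    print(f"Sample {i + 1}: mean = {sample.mean():.2f}")
# The means differ from sample to sample even though nothing went wrong;
# that variability is sampling error.
```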

One implication of this is that when there is a statistical relationship in a sample, it is not always clear that there is a statistical relationship in the population. A small difference between two group means in a sample might indicate that there is a small difference between the two group means in the population. But it could also be that there is no difference between the means in the population and that the difference in the sample is just a matter of sampling error. Similarly, a Pearson’s  r  value of −.29 in a sample might mean that there is a negative relationship in the population. But it could also be that there is no relationship in the population and that the relationship in the sample is just a matter of sampling error.

In fact, any statistical relationship in a sample can be interpreted in two ways:

  • There is a relationship in the population, and the relationship in the sample reflects this.
  • There is no relationship in the population, and the relationship in the sample reflects only sampling error.

The purpose of null hypothesis testing is simply to help researchers decide between these two interpretations.

The Logic of Null Hypothesis Testing

Null hypothesis testing  is a formal approach to deciding between two interpretations of a statistical relationship in a sample. One interpretation is called the   null hypothesis  (often symbolized  H 0  and read as “H-naught”). This is the idea that there is no relationship in the population and that the relationship in the sample reflects only sampling error. Informally, the null hypothesis is that the sample relationship “occurred by chance.” The other interpretation is called the  alternative hypothesis  (often symbolized as  H 1 ). This is the idea that there is a relationship in the population and that the relationship in the sample reflects this relationship in the population.

Again, every statistical relationship in a sample can be interpreted in either of these two ways: It might have occurred by chance, or it might reflect a relationship in the population. So researchers need a way to decide between them. Although there are many specific null hypothesis testing techniques, they are all based on the same general logic. The steps are as follows:

  • Assume for the moment that the null hypothesis is true. There is no relationship between the variables in the population.
  • Determine how likely the sample relationship would be if the null hypothesis were true.
  • If the sample relationship would be extremely unlikely, then reject the null hypothesis  in favour of the alternative hypothesis. If it would not be extremely unlikely, then  retain the null hypothesis .

Following this logic, we can begin to understand why Mehl and his colleagues concluded that there is no difference in talkativeness between women and men in the population. In essence, they asked the following question: “If there were no difference in the population, how likely is it that we would find a small difference of  d  = 0.06 in our sample?” Their answer to this question was that this sample relationship would be fairly likely if the null hypothesis were true. Therefore, they retained the null hypothesis—concluding that there is no evidence of a sex difference in the population. We can also see why Kanner and his colleagues concluded that there is a correlation between hassles and symptoms in the population. They asked, “If the null hypothesis were true, how likely is it that we would find a strong correlation of +.60 in our sample?” Their answer to this question was that this sample relationship would be fairly unlikely if the null hypothesis were true. Therefore, they rejected the null hypothesis in favour of the alternative hypothesis—concluding that there is a positive correlation between these variables in the population.

A crucial step in null hypothesis testing is finding the likelihood of the sample result if the null hypothesis were true. This probability is called the  p value . A low  p  value means that the sample result would be unlikely if the null hypothesis were true and leads to the rejection of the null hypothesis. A high  p  value means that the sample result would be likely if the null hypothesis were true and leads to the retention of the null hypothesis. But how low must the  p  value be before the sample result is considered unlikely enough to reject the null hypothesis? In null hypothesis testing, this criterion is called  α (alpha)  and is almost always set to .05. If there is less than a 5% chance of a result as extreme as the sample result if the null hypothesis were true, then the null hypothesis is rejected. When this happens, the result is said to be  statistically significant . If there is greater than a 5% chance of a result as extreme as the sample result when the null hypothesis is true, then the null hypothesis is retained. This does not necessarily mean that the researcher accepts the null hypothesis as true—only that there is not currently enough evidence to conclude that it is true. Researchers often use the expression “fail to reject the null hypothesis” rather than “retain the null hypothesis,” but they never use the expression “accept the null hypothesis.”
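
In code, the decision rule described here is a one-line comparison. The sketch below simply encodes the rule and the recommended wording, with the conventional α = .05 as the default.

```python
# The null hypothesis decision rule, phrased as the text recommends.
def decide(p_value: float, alpha: float = 0.05) -> str:
    if p_value < alpha:
        return "reject the null hypothesis (statistically significant)"
    # Note the wording: we fail to reject H0; we never "accept" it.
    return "fail to reject the null hypothesis"

print(decide(0.02))  # reject the null hypothesis (statistically significant)
print(decide(0.30))  # fail to reject the null hypothesis
```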

The Misunderstood  p  Value

The  p  value is one of the most misunderstood quantities in psychological research (Cohen, 1994) [1] . Even professional researchers misinterpret it, and it is not unusual for such misinterpretations to appear in statistics textbooks!

The most common misinterpretation is that the  p  value is the probability that the null hypothesis is true—that the sample result occurred by chance. For example, a misguided researcher might say that because the  p  value is .02, there is only a 2% chance that the result is due to chance and a 98% chance that it reflects a real relationship in the population. But this is incorrect . The  p  value is really the probability of a result at least as extreme as the sample result  if  the null hypothesis  were  true. So a  p  value of .02 means that if the null hypothesis were true, a sample result this extreme would occur only 2% of the time.

You can avoid this misunderstanding by remembering that the  p  value is not the probability that any particular  hypothesis  is true or false. Instead, it is the probability of obtaining the  sample result  if the null hypothesis were true.

Role of Sample Size and Relationship Strength

Recall that null hypothesis testing involves answering the question, “If the null hypothesis were true, what is the probability of a sample result as extreme as this one?” In other words, “What is the  p  value?” It can be helpful to see that the answer to this question depends on just two considerations: the strength of the relationship and the size of the sample. Specifically, the stronger the sample relationship and the larger the sample, the less likely the result would be if the null hypothesis were true. That is, the lower the  p  value. This should make sense. Imagine a study in which a sample of 500 women is compared with a sample of 500 men in terms of some psychological characteristic, and Cohen’s  d  is a strong 0.50. If there were really no sex difference in the population, then a result this strong based on such a large sample should seem highly unlikely. Now imagine a similar study in which a sample of three women is compared with a sample of three men, and Cohen’s  d  is a weak 0.10. If there were no sex difference in the population, then a relationship this weak based on such a small sample should seem likely. And this is precisely why the null hypothesis would be rejected in the first example and retained in the second.

Of course, sometimes the result can be weak and the sample large, or the result can be strong and the sample small. In these cases, the two considerations trade off against each other so that a weak result can be statistically significant if the sample is large enough and a strong relationship can be statistically significant even if the sample is small. Table 13.1 shows roughly how relationship strength and sample size combine to determine whether a sample result is statistically significant. The columns of the table represent the three levels of relationship strength: weak, medium, and strong. The rows represent four sample sizes that can be considered small, medium, large, and extra large in the context of psychological research. Thus each cell in the table represents a combination of relationship strength and sample size. If a cell contains the word Yes, then this combination would be statistically significant for both Cohen’s d and Pearson’s r. If it contains the word No, then it would not be statistically significant for either. There is one cell where the decision for d and r would be different and another where it might be different depending on some additional considerations, which are discussed in Section 13.2 “Some Basic Null Hypothesis Tests”.

Although Table 13.1 provides only a rough guideline, it shows very clearly that weak relationships based on medium or small samples are never statistically significant and that strong relationships based on medium or larger samples are always statistically significant. If you keep this lesson in mind, you will often know whether a result is statistically significant based on the descriptive statistics alone. It is extremely useful to be able to develop this kind of intuitive judgment. One reason is that it allows you to develop expectations about how your formal null hypothesis tests are going to come out, which in turn allows you to detect problems in your analyses. For example, if your sample relationship is strong and your sample is medium, then you would expect to reject the null hypothesis. If for some reason your formal null hypothesis test indicates otherwise, then you need to double-check your computations and interpretations. A second reason is that the ability to make this kind of intuitive judgment is an indication that you understand the basic logic of this approach in addition to being able to do the computations.

Statistical Significance Versus Practical Significance

Table 13.1 illustrates another extremely important point. A statistically significant result is not necessarily a strong one. Even a very weak result can be statistically significant if it is based on a large enough sample. This is closely related to Janet Shibley Hyde’s argument about sex differences (Hyde, 2007) [2] . The differences between women and men in mathematical problem solving and leadership ability are statistically significant. But the word  significant  can cause people to interpret these differences as strong and important—perhaps even important enough to influence the college courses they take or even who they vote for. As we have seen, however, these statistically significant differences are actually quite weak—perhaps even “trivial.”

This is why it is important to distinguish between the  statistical  significance of a result and the  practical  significance of that result.  Practical significance refers to the importance or usefulness of the result in some real-world context. Many sex differences are statistically significant—and may even be interesting for purely scientific reasons—but they are not practically significant. In clinical practice, this same concept is often referred to as “clinical significance.” For example, a study on a new treatment for social phobia might show that it produces a statistically significant positive effect. Yet this effect still might not be strong enough to justify the time, effort, and other costs of putting it into practice—especially if easier and cheaper treatments that work almost as well already exist. Although statistically significant, this result would be said to lack practical or clinical significance.

Key Takeaways

  • Null hypothesis testing is a formal approach to deciding whether a statistical relationship in a sample reflects a real relationship in the population or is just due to chance.
  • The logic of null hypothesis testing involves assuming that the null hypothesis is true, finding how likely the sample result would be if this assumption were correct, and then making a decision. If the sample result would be unlikely if the null hypothesis were true, then it is rejected in favour of the alternative hypothesis. If it would not be unlikely, then the null hypothesis is retained.
  • The probability of obtaining the sample result if the null hypothesis were true (the  p  value) is based on two considerations: relationship strength and sample size. Reasonable judgments about whether a sample relationship is statistically significant can often be made by quickly considering these two factors.
  • Statistical significance is not the same as relationship strength or importance. Even weak relationships can be statistically significant if the sample size is large enough. It is important to consider relationship strength and the practical significance of a result in addition to its statistical significance.
Exercises

  • Discussion: Imagine a study showing that people who eat more broccoli tend to be happier. Explain for someone who knows nothing about statistics why the researchers would conduct a null hypothesis test.
  • Practice: Use Table 13.1 to decide whether each of the following results is statistically significant.
      • The correlation between two variables is r = −.78 based on a sample size of 137.
      • The mean score on a psychological characteristic for women is 25 (SD = 5) and the mean score for men is 24 (SD = 5). There were 12 women and 10 men in this study.
      • In a memory experiment, the mean number of items recalled by the 40 participants in Condition A was 0.50 standard deviations greater than the mean number recalled by the 40 participants in Condition B.
      • In another memory experiment, the mean scores for participants in Condition A and Condition B came out exactly the same!
      • A student finds a correlation of r = .04 between the number of units the students in his research methods class are taking and the students’ level of stress.

References

  1. Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
  2. Hyde, J. S. (2007). New directions in the study of gender similarities and differences. Current Directions in Psychological Science, 16, 259–263.

Glossary

Parameters: Values in a population that correspond to variables measured in a study.

Sampling error: The random variability in a statistic from sample to sample.

Null hypothesis testing: A formal approach to deciding between two interpretations of a statistical relationship in a sample.

Null hypothesis: The idea that there is no relationship in the population and that the relationship in the sample reflects only sampling error.

Alternative hypothesis: The idea that there is a relationship in the population and that the relationship in the sample reflects this relationship in the population.

Reject the null hypothesis: When the relationship found in the sample would be extremely unlikely, the idea that the relationship occurred “by chance” is rejected.

Retain the null hypothesis: When the relationship found in the sample is likely to have occurred by chance, the null hypothesis is not rejected.

p value: The probability that, if the null hypothesis were true, the result found in the sample would occur.

α (alpha): How low the p value must be before the sample result is considered unlikely in null hypothesis testing.

Statistically significant: When there is less than a 5% chance of a result as extreme as the sample result occurring and the null hypothesis is rejected.

Research Methods in Psychology - 2nd Canadian Edition Copyright © 2015 by Paul C. Price, Rajiv Jhangiani, & I-Chant A. Chiang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.



4.4: Hypothesis Testing


David Diez, Christopher Barr, & Mine Çetinkaya-Rundel, OpenIntro Statistics


Is the typical US runner getting faster or slower over time? We consider this question in the context of the Cherry Blossom Run, comparing runners in 2006 and 2012. Technological advances in shoes, training, and diet might suggest runners would be faster in 2012. An opposing viewpoint might say that with the average body mass index on the rise, people tend to run slower. In fact, all of these components might be influencing run time.

In addition to considering run times in this section, we consider a topic near and dear to most students: sleep. A recent study found that college students average about 7 hours of sleep per night. 15 However, researchers at a rural college are interested in showing that their students sleep longer than seven hours on average. We investigate this topic in Section 4.3.4.

Hypothesis Testing Framework

The average time for all runners who finished the Cherry Blossom Run in 2006 was 93.29 minutes (93 minutes and about 17 seconds). We want to determine if the run10Samp data set provides strong evidence that the participants in 2012 were faster or slower than those runners in 2006, versus the other possibility that there has been no change. 16 We simplify these three options into two competing hypotheses:

  • H 0 : The average 10 mile run time was the same for 2006 and 2012.
  • H A : The average 10 mile run time for 2012 was different than that of 2006.

We call H 0 the null hypothesis and H A the alternative hypothesis.

Null and alternative hypotheses

  • The null hypothesis (H 0 ) often represents either a skeptical perspective or a claim to be tested.
  • The alternative hypothesis (H A ) represents an alternative claim under consideration and is often represented by a range of possible parameter values.

15 theloquitur.com/?p=1161

16 While we could answer this question by examining the entire population data (run10), we only consider the sample data (run10Samp), which is more realistic since we rarely have access to population data.

The null hypothesis often represents a skeptical position or a perspective of no difference. The alternative hypothesis often represents a new perspective, such as the possibility that there has been a change.

Hypothesis testing framework

The skeptic will not reject the null hypothesis (H 0 ), unless the evidence in favor of the alternative hypothesis (H A ) is so strong that she rejects H 0 in favor of H A .

The hypothesis testing framework is a very general tool, and we often use it without a second thought. If a person makes a somewhat unbelievable claim, we are initially skeptical. However, if there is sufficient evidence that supports the claim, we set aside our skepticism and reject the null hypothesis in favor of the alternative. The hallmarks of hypothesis testing are also found in the US court system.

Exercise \(\PageIndex{1}\)

A US court considers two possible claims about a defendant: she is either innocent or guilty. If we set these claims up in a hypothesis framework, which would be the null hypothesis and which the alternative? 17

Jurors examine the evidence to see whether it convincingly shows a defendant is guilty. Even if the jurors leave unconvinced of guilt beyond a reasonable doubt, this does not mean they believe the defendant is innocent. This is also the case with hypothesis testing: even if we fail to reject the null hypothesis, we typically do not accept the null hypothesis as true. Failing to find strong evidence for the alternative hypothesis is not equivalent to accepting the null hypothesis.

17 The jury considers whether the evidence is so convincing (strong) that there is no reasonable doubt regarding the person's guilt; in such a case, the jury rejects innocence (the null hypothesis) and concludes the defendant is guilty (alternative hypothesis).

In the example with the Cherry Blossom Run, the null hypothesis represents no difference in the average time from 2006 to 2012. The alternative hypothesis represents something new or more interesting: there was a difference, either an increase or a decrease. These hypotheses can be described in mathematical notation using \(\mu_{12}\) as the average run time for 2012:

  • H 0 : \(\mu_{12} = 93.29\)
  • H A : \(\mu_{12} \ne 93.29\)

where 93.29 minutes (93 minutes and about 17 seconds) is the average 10 mile time for all runners in the 2006 Cherry Blossom Run. Using this mathematical notation, the hypotheses can now be evaluated using statistical tools. We call 93.29 the null value since it represents the value of the parameter if the null hypothesis is true. We will use the run10Samp data set to evaluate the hypothesis test.

Testing Hypotheses using Confidence Intervals

We can start the evaluation of the hypothesis setup by comparing 2006 and 2012 run times using a point estimate from the 2012 sample: \(\bar {x}_{12} = 95.61\) minutes. This estimate suggests the average time is actually longer than the 2006 time, 93.29 minutes. However, to evaluate whether this provides strong evidence that there has been a change, we must consider the uncertainty associated with \(\bar {x}_{12}\).


We learned in Section 4.1 that there is fluctuation from one sample to another, and it is very unlikely that the sample mean will be exactly equal to our parameter; we should not expect \(\bar {x}_{12}\) to exactly equal \(\mu_{12}\). Given that \(\bar {x}_{12} = 95.61\), it might still be possible that the population average in 2012 has remained unchanged from 2006. The difference between \(\bar {x}_{12}\) and 93.29 could be due to sampling variation, i.e. the variability associated with the point estimate when we take a random sample.

In Section 4.2, confidence intervals were introduced as a way to find a range of plausible values for the population mean. Based on run10Samp, a 95% confidence interval for the 2012 population mean, \(\mu_{12}\), was calculated as

\[(92.45, 98.77)\]

Because the 2006 mean, 93.29, falls in the range of plausible values, we cannot say the null hypothesis is implausible. That is, we failed to reject the null hypothesis, H 0 .
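
In code, this confidence-interval check is just a containment test. The interval endpoints below are the ones reported above.

```python
# Evaluating H0 by checking whether the null value falls in the 95% CI.
null_value = 93.29                 # 2006 mean run time
ci_lower, ci_upper = 92.45, 98.77  # 95% CI for the 2012 mean, from the text

if ci_lower <= null_value <= ci_upper:
    print("Null value is plausible: fail to reject H0.")
else:
    print("Null value is implausible: reject H0.")
```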

Double negatives can sometimes be used in statistics

In many statistical explanations, we use double negatives. For instance, we might say that the null hypothesis is not implausible or we failed to reject the null hypothesis. Double negatives are used to communicate that while we are not rejecting a position, we are also not saying it is correct.

Example \(\PageIndex{1}\)

Next consider whether there is strong evidence that the average age of runners has changed from 2006 to 2012 in the Cherry Blossom Run. In 2006, the average age was 36.13 years, and in the 2012 run10Samp data set, the average was 35.05 years with a standard deviation of 8.97 years for 100 runners.

First, set up the hypotheses:

  • H 0 : The average age of runners has not changed from 2006 to 2012, \(\mu_{age} = 36.13.\)
  • H A : The average age of runners has changed from 2006 to 2012, \(\mu_{age} \ne 36.13.\)

We have previously verified conditions for this data set. The normal model may be applied to \(\bar {y}\) and the estimate of SE should be very accurate. Using the sample mean and standard error, we can construct a 95% confidence interval for \(\mu _{age}\) to determine if there is sufficient evidence to reject H 0 :

\[\bar{y} \pm 1.96 \times \dfrac {s}{\sqrt {100}} \rightarrow 35.05 \pm 1.96 \times 0.90 \rightarrow (33.29, 36.81)\]

This confidence interval contains the null value, 36.13. Because 36.13 is not implausible, we cannot reject the null hypothesis. We have not found strong evidence that the average age is different than 36.13 years.

Exercise \(\PageIndex{2}\)

Colleges frequently provide estimates of student expenses such as housing. A consultant hired by a community college claimed that the average student housing expense was $650 per month. What are the null and alternative hypotheses to test whether this claim is accurate? 18

Figure 4.11: Sample distribution of student housing expense. These data are moderately skewed, as roughly determined from the outliers on the right.

18 H 0 : The average cost is $650 per month, \(\mu\) = $650. H A : The average cost is different than $650 per month, \(\mu \ne\) $650.

19 Applying the normal model requires that certain conditions are met. Because the data are a simple random sample and the sample (presumably) represents no more than 10% of all students at the college, the observations are independent. The sample size is also sufficiently large (n = 75) and the data exhibit only moderate skew. Thus, the normal model may be applied to the sample mean.

Exercise \(\PageIndex{3}\)

The community college decides to collect data to evaluate the $650 per month claim. They take a random sample of 75 students at their school and obtain the data represented in Figure 4.11. Can we apply the normal model to the sample mean? 19


Example \(\PageIndex{2}\)

The sample mean for student housing is $611.63 and the sample standard deviation is $132.85. Construct a 95% confidence interval for the population mean and evaluate the hypotheses of Exercise 4.22.

The standard error associated with the mean may be estimated using the sample standard deviation divided by the square root of the sample size. Recall that n = 75 students were sampled.

\[ SE = \dfrac {s}{\sqrt {n}} = \dfrac {132.85}{\sqrt {75}} = 15.34\]

You showed in Exercise 4.23 that the normal model may be applied to the sample mean. This ensures a 95% confidence interval may be accurately constructed:

\[\bar {x} \pm z^{\star} \times SE \rightarrow 611.63 \pm 1.96 \times 15.34 \rightarrow (581.56, 641.70)\]

Because the null value $650 is not in the confidence interval, a true mean of $650 is implausible and we reject the null hypothesis. The data provide statistically significant evidence that the actual average housing expense is less than $650 per month.
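
The arithmetic in this example is easy to verify; the sketch below reproduces the standard error and the interval from the numbers stated above.

```python
# Reproducing the housing-expense interval from the example's numbers.
from math import sqrt

x_bar, s, n = 611.63, 132.85, 75
se = s / sqrt(n)           # about 15.34
lower = x_bar - 1.96 * se  # about 581.56
upper = x_bar + 1.96 * se  # about 641.70
print(f"SE = {se:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
# The null value 650 lies above the interval, so H0 is rejected.
```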

Decision Errors

Hypothesis tests are not flawless. Just think of the court system: innocent people are sometimes wrongly convicted and the guilty sometimes walk free. Similarly, we can make a wrong decision in statistical hypothesis tests. However, the difference is that we have the tools necessary to quantify how often we make such errors.

There are two competing hypotheses: the null and the alternative. In a hypothesis test, we make a statement about which one might be true, but we might choose incorrectly. There are four possible scenarios in a hypothesis test, which are summarized in Table 4.12.

A Type 1 Error is rejecting the null hypothesis when H0 is actually true. A Type 2 Error is failing to reject the null hypothesis when the alternative is actually true.

Exercise 4.25

In a US court, the defendant is either innocent (H 0 ) or guilty (H A ). What does a Type 1 Error represent in this context? What does a Type 2 Error represent? Table 4.12 may be useful.

If the court makes a Type 1 Error, this means the defendant is innocent (H 0 true) but wrongly convicted. A Type 2 Error means the court failed to reject H 0 (i.e. failed to convict the person) when she was in fact guilty (H A true).

Exercise 4.26

How could we reduce the Type 1 Error rate in US courts? What influence would this have on the Type 2 Error rate?

To lower the Type 1 Error rate, we might raise our standard for conviction from "beyond a reasonable doubt" to "beyond a conceivable doubt" so fewer people would be wrongly convicted. However, this would also make it more difficult to convict the people who are actually guilty, so we would make more Type 2 Errors.

Exercise 4.27

How could we reduce the Type 2 Error rate in US courts? What influence would this have on the Type 1 Error rate?

To lower the Type 2 Error rate, we want to convict more guilty people. We could lower the standards for conviction from "beyond a reasonable doubt" to "beyond a little doubt". Lowering the bar for guilt will also result in more wrongful convictions, raising the Type 1 Error rate.


Exercises 4.25-4.27 provide an important lesson:

If we reduce how often we make one type of error, we generally make more of the other type.

Hypothesis testing is built around rejecting or failing to reject the null hypothesis. That is, we do not reject H 0 unless we have strong evidence. But what precisely does strong evidence mean? As a general rule of thumb, for those cases where the null hypothesis is actually true, we do not want to incorrectly reject H 0 more than 5% of the time. This corresponds to a significance level of 0.05. We often write the significance level using \(\alpha\) (the Greek letter alpha): \(\alpha = 0.05.\) We discuss the appropriateness of different significance levels in Section 4.3.6.

If we use a 95% confidence interval to test a hypothesis where the null hypothesis is true, we will make an error whenever the point estimate is at least 1.96 standard errors away from the population parameter. This happens about 5% of the time (2.5% in each tail). Similarly, using a 99% confidence interval to evaluate a hypothesis is equivalent to a significance level of \(\alpha = 0.01\).
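
The critical values behind this equivalence come straight from the standard normal distribution; a quick check using SciPy's inverse CDF:

```python
# Critical values linking confidence levels to two-sided significance levels.
from scipy.stats import norm

print(round(norm.ppf(0.975), 2))  # 1.96: 95% CI <-> alpha = 0.05
print(round(norm.ppf(0.995), 2))  # 2.58: 99% CI <-> alpha = 0.01
```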

A confidence interval is, in one sense, simplistic in the world of hypothesis tests. Consider the following two scenarios:

  • The null value (the parameter value under the null hypothesis) is in the 95% confidence interval but just barely, so we would not reject H 0 . However, we might like to somehow say, quantitatively, that it was a close decision.
  • The null value is very far outside of the interval, so we reject H 0 . However, we want to communicate that, not only did we reject the null hypothesis, but it wasn't even close. Such a case is depicted in Figure 4.13.

In Section 4.3.4, we introduce a tool called the p-value that will be helpful in these cases. The p-value method also extends to hypothesis tests where confidence intervals cannot be easily constructed or applied.


Formal Testing using p-Values

The p-value is a way of quantifying the strength of the evidence against the null hypothesis and in favor of the alternative. Formally the p-value is a conditional probability.

definition: p-value

The p-value is the probability of observing data at least as favorable to the alternative hypothesis as our current data set, if the null hypothesis is true. We typically use a summary statistic of the data, in this chapter the sample mean, to help compute the p-value and evaluate the hypotheses.

A poll by the National Sleep Foundation found that college students average about 7 hours of sleep per night. Researchers at a rural school are interested in showing that students at their school sleep longer than seven hours on average, and they would like to demonstrate this using a sample of students. What would be an appropriate skeptical position for this research?

A skeptic would have no reason to believe that sleep patterns at this school are different than the sleep patterns at another school.



We can set up the null hypothesis for this test as a skeptical perspective: the students at this school average 7 hours of sleep per night. The alternative hypothesis takes a new form reflecting the interests of the research: the students average more than 7 hours of sleep. We can write these hypotheses as

  • H 0 : \(\mu\) = 7.
  • H A : \(\mu\) > 7.

Using \(\mu\) > 7 as the alternative is an example of a one-sided hypothesis test. In this investigation, there is no apparent interest in learning whether the mean is less than 7 hours. The choice of a one-sided alternative is entirely based on the interests of the researchers: had they been interested only in the opposite case - showing that their students actually average fewer than seven hours of sleep but not more than 7 hours - then the setup would have used the alternative \(\mu < 7\). Earlier we encountered a two-sided hypothesis where we looked for any clear difference, greater than or less than the null value.

Always use a two-sided test unless it was made clear prior to data collection that the test should be one-sided. Switching a two-sided test to a one-sided test after observing the data is dangerous because it can inflate the Type 1 Error rate.

TIP: One-sided and two-sided tests

If the researchers are only interested in showing an increase or a decrease, but not both, use a one-sided test. If the researchers would be interested in any difference from the null value - an increase or decrease - then the test should be two-sided.

TIP: Always write the null hypothesis as an equality

We will find it most useful if we always list the null hypothesis as an equality (e.g. \(\mu\) = 7) while the alternative always uses an inequality (e.g. \(\mu \ne 7\), \(\mu > 7\), or \(\mu < 7\)).

The researchers at the rural school conducted a simple random sample of n = 110 students on campus. They found that these students averaged 7.42 hours of sleep and the standard deviation of the amount of sleep for the students was 1.75 hours. A histogram of the sample is shown in Figure 4.14.

Before we can use a normal model for the sample mean or compute the standard error of the sample mean, we must verify conditions. (1) Because this is a simple random sample from less than 10% of the student body, the observations are independent. (2) The sample size in the sleep study is sufficiently large since it is greater than 30. (3) The data show moderate skew in Figure 4.14 and the presence of a couple of outliers. This skew and the outliers (which are not too extreme) are acceptable for a sample size of n = 110. With these conditions verified, the normal model can be safely applied to \(\bar {x}\) and the estimated standard error will be very accurate.

What is the standard deviation associated with \(\bar {x}\)? That is, estimate the standard error of \(\bar {x}\). 25

25 The standard error can be estimated from the sample standard deviation and the sample size: \(SE_{\bar {x}} = \dfrac {s_x}{\sqrt {n}} = \dfrac {1.75}{\sqrt {110}} = 0.17\).

The hypothesis test will be evaluated using a significance level of \(\alpha = 0.05\). We want to consider the data under the scenario that the null hypothesis is true. In this case, the sample mean is from a distribution that is nearly normal and has mean 7 and standard deviation of about 0.17. Such a distribution is shown in Figure 4.15.


The shaded tail in Figure 4.15 represents the chance of observing such a large mean, conditional on the null hypothesis being true. That is, the shaded tail represents the p-value. We shade all means larger than our sample mean, \(\bar {x} = 7.42\), because they are more favorable to the alternative hypothesis than the observed mean.

We compute the p-value by finding the tail area of this normal distribution, which we learned to do in Section 3.1. First compute the Z score of the sample mean, \(\bar {x} = 7.42\):

\[Z = \dfrac {\bar {x} - \text {null value}}{SE_{\bar {x}}} = \dfrac {7.42 - 7}{0.17} = 2.47\]

Using the normal probability table, the lower unshaded area is found to be 0.993. Thus the shaded area is 1 - 0.993 = 0.007. If the null hypothesis is true, the probability of observing such a large sample mean for a sample of 110 students is only 0.007. That is, if the null hypothesis is true, we would not often see such a large mean.

We evaluate the hypotheses by comparing the p-value to the significance level. Because the p-value is less than the significance level (p-value = 0.007 < 0.05 = \(\alpha\)), we reject the null hypothesis. What we observed is so unusual with respect to the null hypothesis that it casts serious doubt on H 0 and provides strong evidence favoring H A .
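
The whole calculation takes only a few lines to reproduce. Note that the text rounds the standard error to 0.17 before computing Z; carrying full precision gives Z ≈ 2.52 and a slightly smaller p-value, which does not change the conclusion.

```python
# The sleep-study Z score and one-sided p-value from the summary statistics.
from math import sqrt
from scipy.stats import norm

x_bar, s, n, null_value = 7.42, 1.75, 110, 7.0
se = s / sqrt(n)               # 0.1669..., rounded to 0.17 in the text
z = (x_bar - null_value) / se  # about 2.52 (2.47 with the rounded SE)
p_value = 1 - norm.cdf(z)      # upper tail, since HA: mu > 7
print(f"Z = {z:.2f}, p-value = {p_value:.4f}")  # about 0.006
```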

p-value as a tool in hypothesis testing

The p-value quantifies how strongly the data favor H A over H 0 . A small p-value (usually < 0.05) corresponds to sufficient evidence to reject H 0 in favor of H A .

TIP: It is useful to first draw a picture to find the p-value

It is useful to draw a picture of the distribution of \(\bar {x}\) as though H 0 was true (i.e. \(\mu\) equals the null value), and shade the region (or regions) of sample means that are at least as favorable to the alternative hypothesis. These shaded regions represent the p-value.

The ideas below review the process of evaluating hypothesis tests with p-values:

  • The null hypothesis represents a skeptic's position or a position of no difference. We reject this position only if the evidence strongly favors H A .
  • A small p-value means that if the null hypothesis is true, there is a low probability of seeing a point estimate at least as extreme as the one we saw. We interpret this as strong evidence in favor of the alternative.
  • We reject the null hypothesis if the p-value is smaller than the significance level, \(\alpha\), which is usually 0.05. Otherwise, we fail to reject H 0 .
  • We should always state the conclusion of the hypothesis test in plain language so non-statisticians can also understand the results.

The p-value is constructed in such a way that we can directly compare it to the significance level ( \(\alpha\)) to determine whether or not to reject H 0 . This method ensures that the Type 1 Error rate does not exceed the significance level standard.


If the null hypothesis is true, how often should the p-value be less than 0.05?

About 5% of the time. If the null hypothesis is true, then the data only has a 5% chance of being in the 5% of data most favorable to H A .


Exercise 4.31

Suppose we had used a significance level of 0.01 in the sleep study. Would the evidence have been strong enough to reject the null hypothesis? (The p-value was 0.007.) What if the significance level was \(\alpha = 0.001\)? 27

27 We reject the null hypothesis whenever p-value < \(\alpha\). Thus, we would still reject the null hypothesis if \(\alpha = 0.01\) but not if the significance level had been \(\alpha = 0.001\).

Exercise 4.32

Ebay might be interested in showing that buyers on its site tend to pay less than they would for the corresponding new item on Amazon. We'll research this topic for one particular product: a video game called Mario Kart for the Nintendo Wii. During early October 2009, Amazon sold this game for $46.99. Set up an appropriate (one-sided!) hypothesis test to check the claim that Ebay buyers pay less during auctions at this same time. 28

28 The skeptic would say the average is the same on Ebay, and we are interested in showing the average price is lower.

H 0 : The average auction price on Ebay is equal to (or more than) the price on Amazon. We write only the equality in the statistical notation: \(\mu_{ebay} = 46.99\).

H A : The average price on Ebay is less than the price on Amazon, \(\mu _{ebay} < 46.99\).

Exercise 4.33

During early October, 2009, 52 Ebay auctions were recorded for Mario Kart. 29 The total prices for the auctions are presented using a histogram in Figure 4.17, and we may like to apply the normal model to the sample mean. Check the three conditions required for applying the normal model: (1) independence, (2) at least 30 observations, and (3) the data are not strongly skewed. 30

29 These data were collected by OpenIntro staff.

30 (1) The independence condition is unclear. We will make the assumption that the observations are independent, which we should report with any final results. (2) The sample size is sufficiently large: \(n = 52 \ge 30\). (3) The data distribution is not strongly skewed; it is approximately symmetric.

Example 4.34

The average sale price of the 52 Ebay auctions for Wii Mario Kart was $44.17 with a standard deviation of $4.15. Does this provide sufficient evidence to reject the null hypothesis in Exercise 4.32? Use a significance level of \(\alpha = 0.01\).

The hypotheses were set up and the conditions were checked in Exercises 4.32 and 4.33. The next step is to find the standard error of the sample mean and produce a sketch to help find the p-value.


Because the alternative hypothesis says we are looking for a smaller mean, we shade the lower tail. We find this shaded area by using the Z score and normal probability table, with standard error \(SE = \dfrac {4.15}{\sqrt {52}} = 0.5755\): \(Z = \dfrac {44.17 - 46.99}{0.5755} = -4.90\), which has area less than 0.0002. The area is so small we cannot really see it on the picture. This lower tail area corresponds to the p-value.

Because the p-value is so small - specifically, smaller than \(\alpha = 0.01\) - this provides sufficiently strong evidence to reject the null hypothesis in favor of the alternative. The data provide statistically significant evidence that the average price on Ebay is lower than Amazon's asking price.

Two-sided hypothesis testing with p-values

We now consider how to compute a p-value for a two-sided test. In one-sided tests, we shade the single tail in the direction of the alternative hypothesis. For example, when the alternative had the form \(\mu\) > 7, then the p-value was represented by the upper tail (Figure 4.16). When the alternative was \(\mu\) < 46.99, the p-value was the lower tail (Exercise 4.32). In a two-sided test, we shade two tails since evidence in either direction is favorable to H A .

Exercise 4.35

Earlier we talked about a research group investigating whether the students at their school slept longer than 7 hours each night. Let's consider a second group of researchers who want to evaluate whether the students at their college differ from the norm of 7 hours. Write the null and alternative hypotheses for this investigation. 31

Example 4.36

The second college randomly samples 72 students and finds a mean of \(\bar {x} = 6.83\) hours and a standard deviation of s = 1.8 hours. Does this provide strong evidence against H 0 in Exercise 4.35? Use a significance level of \(\alpha = 0.05\).

First, we must verify assumptions. (1) A simple random sample of less than 10% of the student body means the observations are independent. (2) The sample size is 72, which is greater than 30. (3) Based on the earlier distribution and what we already know about college student sleep habits, the distribution is probably not strongly skewed.

Next we can compute the standard error \((SE_{\bar {x}} = \dfrac {s}{\sqrt {n}} = 0.21)\) of the estimate and create a picture to represent the p-value, shown in Figure 4.18. Both tails are shaded.

31 Because the researchers are interested in any difference, they should use a two-sided setup: H 0 : \(\mu\) = 7, H A : \(\mu \ne 7.\)


An estimate of 7.17 or more provides at least as strong of evidence against the null hypothesis and in favor of the alternative as the observed estimate, \(\bar {x} = 6.83\).

We can calculate the tail areas by first finding the lower tail corresponding to \(\bar {x}\):

\[Z = \dfrac {6.83 - 7.00}{0.21} = -0.81 \xrightarrow {table} \text {left tail} = 0.2090\]

Because the normal model is symmetric, the right tail will have the same area as the left tail. The p-value is found as the sum of the two shaded tails:

\[ \text {p-value} = \text {left tail} + \text {right tail} = 2 \times \text {(left tail)} = 0.4180\]

This p-value is relatively large (larger than \(\alpha = 0.05\)), so we should not reject H 0 . That is, if H 0 is true, it would not be very unusual to see a sample mean this far from 7 hours simply due to sampling variation. Thus, we do not have sufficient evidence to conclude that the mean is different than 7 hours.
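
A short sketch reproduces this two-sided p-value from the summary statistics (small differences from 0.4180 reflect rounding in the text):

```python
# Two-sided p-value for the second sleep study.
from math import sqrt
from scipy.stats import norm

x_bar, s, n, null_value = 6.83, 1.8, 72, 7.0
se = s / sqrt(n)                 # about 0.21
z = (x_bar - null_value) / se    # about -0.80
p_value = 2 * norm.cdf(-abs(z))  # shade both tails by symmetry
print(f"Z = {z:.2f}, p-value = {p_value:.4f}")  # about 0.42: retain H0
```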

Example 4.37

It is never okay to change two-sided tests to one-sided tests after observing the data. In this example we explore the consequences of ignoring this advice. Using \(\alpha = 0.05\), we show that freely switching from two-sided tests to one-sided tests will cause us to make twice as many Type 1 Errors as intended.

Suppose the sample mean was larger than the null value, \(\mu_0\) (e.g. \(\mu_0\) would represent 7 if H 0 : \(\mu\) = 7). Then if we can flip to a one-sided test, we would use H A : \(\mu > \mu_0\). Now if we obtain any observation with a Z score greater than 1.65, we would reject H 0 . If the null hypothesis is true, we incorrectly reject the null hypothesis about 5% of the time when the sample mean is above the null value, as shown in Figure 4.19.

Suppose the sample mean was smaller than the null value. Then if we change to a one-sided test, we would use H A : \(\mu < \mu_0\). If \(\bar {x}\) had a Z score smaller than -1.65, we would reject H 0 . If the null hypothesis is true, then we would observe such a case about 5% of the time.

By examining these two scenarios, we can determine that we will make a Type 1 Error 5% + 5% = 10% of the time if we are allowed to swap to the "best" one-sided test for the data. This is twice the error rate we prescribed with our significance level: \(\alpha = 0.05\) (!).
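
This doubling is easy to confirm by simulation. The sketch below draws many sample means under a true null hypothesis (reusing the sleep study's numbers and, for simplicity, treating the population standard deviation as known) and applies the "switch to whichever one-sided test the data favor" rule.

```python
# Simulating the Type 1 Error rate of post-hoc side switching.
import numpy as np

rng = np.random.default_rng(1)
trials, n, mu0, sigma = 100_000, 110, 7.0, 1.75
se = sigma / np.sqrt(n)

# Sample means drawn with H0 true: centered on the null value.
means = rng.normal(mu0, se, size=trials)
z = (means - mu0) / se

# Switching sides after seeing the data rejects whenever |Z| > 1.65,
# the one-sided 5% cutoff applied in the favorable direction.
rate = np.mean(np.abs(z) > 1.65)
print(f"Empirical Type 1 Error rate: {rate:.3f}")  # about 0.10, twice alpha
```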


Caution: One-sided hypotheses are allowed only before seeing data

After observing data, it is tempting to turn a two-sided test into a one-sided test. Avoid this temptation. Hypotheses must be set up before observing the data. If they are not, the test must be two-sided.

Choosing a Significance Level

Choosing a significance level for a test is important in many contexts, and the traditional level is 0.05. However, it is often helpful to adjust the significance level based on the application. We may select a level that is smaller or larger than 0.05 depending on the consequences of any conclusions reached from the test.

  • If making a Type 1 Error is dangerous or especially costly, we should choose a small significance level (e.g. 0.01). Under this scenario we want to be very cautious about rejecting the null hypothesis, so we demand very strong evidence favoring H A before we would reject H 0 .
  • If a Type 2 Error is relatively more dangerous or much more costly than a Type 1 Error, then we should choose a higher significance level (e.g. 0.10). Here we want to be cautious about failing to reject H 0 when the null is actually false. We will discuss this particular case in greater detail in Section 4.6.

Significance levels should reflect consequences of errors

The significance level selected for a test should reflect the consequences associated with Type 1 and Type 2 Errors.

Example 4.38

A car manufacturer is considering a higher quality but more expensive supplier for window parts in its vehicles. They sample a number of parts from their current supplier and also parts from the new supplier. They decide that if the high quality parts will last more than 12% longer, it makes financial sense to switch to this more expensive supplier. Is there good reason to modify the significance level in such a hypothesis test?

The null hypothesis is that the more expensive parts last no more than 12% longer while the alternative is that they do last more than 12% longer. This decision is just one of the many regular factors that have a marginal impact on the car and company. A significance level of 0.05 seems reasonable since neither a Type 1 nor a Type 2 Error would be dangerous or (relatively) much more expensive.

Example 4.39

The same car manufacturer is considering a slightly more expensive supplier for parts related to safety, not windows. If the durability of these safety components is shown to be better than the current supplier, they will switch manufacturers. Is there good reason to modify the significance level in such an evaluation?

The null hypothesis would be that the suppliers' parts are equally reliable. Because safety is involved, the car company should be eager to switch to the slightly more expensive manufacturer (reject H 0 ) even if the evidence of increased safety is only moderately strong. A slightly larger significance level, such as \(\alpha = 0.10\), might be appropriate.

Exercise 4.40

A part inside of a machine is very expensive to replace. However, the machine usually functions properly even if this part is broken, so the part is replaced only if we are extremely certain it is broken based on a series of measurements. Identify appropriate hypotheses for this test (in plain language) and suggest an appropriate significance level.


5.2 - Writing Hypotheses

The first step in conducting a hypothesis test is to write the hypothesis statements that are going to be tested. For each test you will have a null hypothesis (\(H_0\)) and an alternative hypothesis (\(H_a\)).

When writing hypotheses there are three things that we need to know: (1) the parameter that we are testing, (2) the direction of the test (non-directional, right-tailed, or left-tailed), and (3) the value of the hypothesized parameter.

  • At this point we can write hypotheses for a single mean (\(\mu\)), paired means (\(\mu_d\)), a single proportion (\(p\)), the difference between two independent means (\(\mu_1-\mu_2\)), the difference between two proportions (\(p_1-p_2\)), a simple linear regression slope (\(\beta\)), and a correlation (\(\rho\)).
  • The research question will give us the information necessary to determine if the test is two-tailed (e.g., "different from," "not equal to"), right-tailed (e.g., "greater than," "more than"), or left-tailed (e.g., "less than," "fewer than").
  • The research question will also give us the hypothesized parameter value. This is the number that goes in the hypothesis statements (i.e., \(\mu_0\) and \(p_0\)). For the difference between two groups, regression, and correlation, this value is typically 0.

Hypotheses are always written in terms of population parameters (e.g., \(p\) and \(\mu\)). The standard forms for each direction of the test are summarized below. Note that the null hypothesis always includes the equality (i.e., =).
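As a compact sketch of the standard pattern, the three possible pairs of hypotheses for a single mean \(\mu\) are shown here (the same pattern applies to the other parameters, with \(p_0\), 0, etc. in place of \(\mu_0\)):

\[ \begin{array}{lll} \text{Two-tailed:} & H_0\colon \mu = \mu_0 & H_a\colon \mu \ne \mu_0 \\ \text{Right-tailed:} & H_0\colon \mu = \mu_0 & H_a\colon \mu > \mu_0 \\ \text{Left-tailed:} & H_0\colon \mu = \mu_0 & H_a\colon \mu < \mu_0 \end{array} \]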

Null Hypothesis Examples


In statistical analysis, the null hypothesis assumes there is no meaningful relationship between two variables. Testing the null hypothesis can tell you whether your results are due to the effect of manipulating the independent variable or due to chance. It's often used in conjunction with an alternative hypothesis, which assumes there is, in fact, a relationship between two variables.

The null hypothesis is among the easiest hypotheses to test using statistical analysis, making it perhaps the most valuable hypothesis for the scientific method. By evaluating a null hypothesis in addition to another hypothesis, researchers can support their conclusions with a higher level of confidence. Below are examples of how you might formulate a null hypothesis to fit certain questions.

What Is the Null Hypothesis?

The null hypothesis states there is no relationship between the measured phenomenon (the dependent variable ) and the independent variable , which is the variable an experimenter typically controls or changes. You do not need to believe that the null hypothesis is true to test it. On the contrary, you will likely suspect there is a relationship between a set of variables. One way to support that suspicion is to reject the null hypothesis. Rejecting a hypothesis does not mean an experiment was "bad" or that it didn't produce results. In fact, it is often one of the first steps toward further inquiry.

To distinguish it from other hypotheses , the null hypothesis is written as H 0 (which is read as "H-nought," "H-null," or "H-zero"). A significance test is used to determine how likely results at least as extreme as those observed would be if the null hypothesis were true. A confidence level of 95% or 99% is common. Keep in mind that even at a high confidence level there is still a small chance of reaching the wrong conclusion, perhaps because the experimenter did not account for a critical factor or simply by chance. This is one reason why it's important to repeat experiments.

Examples of the Null Hypothesis

To write a null hypothesis, start by asking a question. Rephrase that question in a form that assumes no relationship between the variables. In other words, assume a treatment has no effect, and write your hypothesis in a way that reflects this. For example, the question "Does age affect mathematical ability?" becomes the null hypothesis "Age has no effect on mathematical ability."

Other Types of Hypotheses

In addition to the null hypothesis, the alternative hypothesis is also a staple in traditional significance tests . It's essentially the opposite of the null hypothesis because it assumes the claim in question is true. For the null hypothesis "Age has no effect on mathematical ability," for example, an alternative hypothesis might be "Age does have an effect on mathematical ability."

Key Takeaways

  • In hypothesis testing, the null hypothesis assumes no relationship between two variables, providing a baseline for statistical analysis.
  • Rejecting the null hypothesis suggests there is evidence of a relationship between variables.
  • By formulating a null hypothesis, researchers can systematically test assumptions and draw more reliable conclusions from their experiments.

What Is the Null Hypothesis & When Do You Reject the Null Hypothesis?


A null hypothesis is a statistical concept suggesting no significant difference or relationship between measured variables. It’s the default assumption unless empirical evidence proves otherwise.

The null hypothesis states no relationship exists between the two variables being studied (i.e., one variable does not affect the other).

The null hypothesis is the statement that a researcher or an investigator wants to disprove.

Testing the null hypothesis can tell you whether your results are due to the effects of manipulating the independent variable or due to random chance.

How to Write a Null Hypothesis

Null hypotheses (H0) start as research questions that the investigator rephrases as statements indicating no effect or relationship between the independent and dependent variables.

It is a default position that your research aims to challenge or confirm.

For example, if studying the impact of exercise on weight loss, your null hypothesis might be:

There is no significant difference in weight loss between individuals who exercise daily and those who do not.

Examples of Null Hypotheses

When Do We Reject the Null Hypothesis?

We reject the null hypothesis when the data provide strong enough evidence to conclude that it is likely incorrect. This often occurs when the p-value (probability of observing the data given the null hypothesis is true) is below a predetermined significance level.

If the collected data are inconsistent with what the null hypothesis predicts, a researcher can conclude that the data do not support the null hypothesis, and thus the null hypothesis is rejected.

Rejecting the null hypothesis means that a relationship does exist between a set of variables and the effect is statistically significant ( p ≤ 0.05).

If the data collected from the random sample are not statistically significant, then we fail to reject the null hypothesis, and the researchers can conclude that there is insufficient evidence of a relationship between the variables.

You need to perform a statistical test on your data in order to evaluate how consistent it is with the null hypothesis. A p-value is one statistical measurement used to validate a hypothesis against observed data.

Calculating the p-value is a critical part of null-hypothesis significance testing because it quantifies how strongly the sample data contradicts the null hypothesis.

The level of statistical significance is often expressed as a  p  -value between 0 and 1. The smaller the p-value, the stronger the evidence that you should reject the null hypothesis.


Usually, a researcher uses a significance level of 0.05 or 0.01 (corresponding to a confidence level of 95% or 99%) as a general guideline to decide whether to reject or keep the null.

When your p-value is less than or equal to your significance level, you reject the null hypothesis.

In other words, smaller p-values are taken as stronger evidence against the null hypothesis. Conversely, when the p-value is greater than your significance level, you fail to reject the null hypothesis.

In this case, the sample data provides insufficient data to conclude that the effect exists in the population.
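As an illustration of this decision rule, here is a minimal Python sketch using SciPy's one-sample t-test; the sample values and the null value of 7 are made up for the example:

from scipy import stats

sample = [6.2, 7.9, 7.1, 6.5, 8.3, 7.4, 6.8, 7.7]  # hypothetical data
alpha = 0.05

t_stat, p_value = stats.ttest_1samp(sample, popmean=7.0)

if p_value <= alpha:
    print(f"p = {p_value:.3f} <= {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.3f} > {alpha}: fail to reject the null hypothesis")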

Because you can never know with complete certainty whether there is an effect in the population, your inferences about a population will sometimes be incorrect.

When you incorrectly reject the null hypothesis, it’s called a type I error. When you incorrectly fail to reject it, it’s called a type II error.

Why Do We Never Accept The Null Hypothesis?

The reason we do not say “accept the null” is because we are always assuming the null hypothesis is true and then conducting a study to see if there is evidence against it. And, even if we don’t find evidence against it, a null hypothesis is not accepted.

A lack of evidence only means that you haven’t proven that something exists. It does not prove that something doesn’t exist. 

It is risky to conclude that the null hypothesis is true merely because we did not find evidence to reject it. It is always possible that researchers elsewhere have disproved the null hypothesis, so we cannot accept it as true, but instead, we state that we failed to reject the null. 

One can either reject the null hypothesis, or fail to reject it, but can never accept it.

Why Do We Use The Null Hypothesis?

We can never prove with 100% certainty that a hypothesis is true; we can only collect evidence that supports a theory. However, testing a hypothesis can set the stage for rejecting or retaining it within a certain confidence level.

The null hypothesis is useful because it can tell us whether the results of our study are due to random chance or the manipulation of a variable (with a certain level of confidence).

A null hypothesis is rejected if the measured data would be significantly unlikely to have occurred were the null true, and it is retained if the observed outcome is consistent with the position held by the null hypothesis.

Rejecting the null hypothesis sets the stage for further experimentation to see if a relationship between two variables exists. 

Hypothesis testing is a critical part of the scientific method as it helps decide whether the results of a research study support a particular theory about a given population. Hypothesis testing is a systematic way of backing up researchers’ predictions with statistical analysis.

It helps provide sufficient statistical evidence that either favors or rejects a certain hypothesis about the population parameter. 

Purpose of a Null Hypothesis 

  • The primary purpose of the null hypothesis is to provide a default assumption that the data can be used to challenge. 
  • Whether rejected or retained, the null hypothesis can help further progress a theory in many scientific cases.
  • A null hypothesis can be used to ascertain how consistent the outcomes of multiple studies are.

Do you always need both a Null Hypothesis and an Alternative Hypothesis?

The null (H0) and alternative (Ha or H1) hypotheses are two competing claims that describe the effect of the independent variable on the dependent variable. They are mutually exclusive, which means that only one of the two hypotheses can be true. 

While the null hypothesis states that there is no effect in the population, an alternative hypothesis states that there is statistical significance between two variables. 

The goal of hypothesis testing is to make inferences about a population based on a sample. In order to undertake hypothesis testing, you must express your research hypothesis as a null and alternative hypothesis. Both hypotheses are required to cover every possible outcome of the study. 

What is the difference between a null hypothesis and an alternative hypothesis?

The alternative hypothesis is the complement to the null hypothesis. The null hypothesis states that there is no effect or no relationship between variables, while the alternative hypothesis claims that there is an effect or relationship in the population.

It is the claim that you expect or hope will be true. The null hypothesis and the alternative hypothesis are always mutually exclusive, meaning that only one can be true at a time.

What are some problems with the null hypothesis?

One major problem with the null hypothesis is that researchers typically assume that retaining the null is a failure of the experiment. However, retaining or rejecting any hypothesis is a positive result. Even if the null is not refuted, the researchers will still learn something new.

Why can a null hypothesis not be accepted?

We can either reject or fail to reject a null hypothesis, but never accept it. If your test fails to detect an effect, this is not proof that the effect doesn’t exist. It just means that your sample did not have enough evidence to conclude that it exists.

We can't accept a null hypothesis because a lack of evidence for an effect does not prove that the effect does not exist. Instead, we fail to reject it.

Failing to reject the null indicates that the sample did not provide sufficient evidence to conclude that an effect exists.

If the p-value is greater than the significance level, then you fail to reject the null hypothesis.

Is a null hypothesis directional or non-directional?

A hypothesis test can contain either a directional alternative hypothesis or a non-directional alternative hypothesis. A directional hypothesis is one that contains the less than ("<") or greater than (">") sign.

A nondirectional hypothesis contains the not equal sign ("≠"). However, a null hypothesis is neither directional nor non-directional.

A null hypothesis is a prediction that there will be no change, relationship, or difference between two variables.

The directional hypothesis or nondirectional hypothesis would then be considered alternative hypotheses to the null hypothesis.



13.1 Understanding Null Hypothesis Testing

Learning Objectives

  • Explain the purpose of null hypothesis testing, including the role of sampling error.
  • Describe the basic logic of null hypothesis testing.
  • Describe the role of relationship strength and sample size in determining statistical significance and make reasonable judgments about statistical significance based on these two factors.

The Purpose of Null Hypothesis Testing

As we have seen, psychological research typically involves measuring one or more variables in a sample and computing descriptive statistics for that sample. In general, however, the researcher’s goal is not to draw conclusions about that sample but to draw conclusions about the population that the sample was selected from. Thus researchers must use sample statistics to draw conclusions about the corresponding values in the population. These corresponding values in the population are called  parameters . Imagine, for example, that a researcher measures the number of depressive symptoms exhibited by each of 50 adults with clinical depression and computes the mean number of symptoms. The researcher probably wants to use this sample statistic (the mean number of symptoms for the sample) to draw conclusions about the corresponding population parameter (the mean number of symptoms for adults with clinical depression).

Unfortunately, sample statistics are not perfect estimates of their corresponding population parameters. This is because there is a certain amount of random variability in any statistic from sample to sample. The mean number of depressive symptoms might be 8.73 in one sample of adults with clinical depression, 6.45 in a second sample, and 9.44 in a third—even though these samples are selected randomly from the same population. Similarly, the correlation (Pearson’s  r ) between two variables might be +.24 in one sample, −.04 in a second sample, and +.15 in a third—again, even though these samples are selected randomly from the same population. This random variability in a statistic from sample to sample is called  sampling error . (Note that the term error  here refers to random variability and does not imply that anyone has made a mistake. No one “commits a sampling error.”)

One implication of this is that when there is a statistical relationship in a sample, it is not always clear that there is a statistical relationship in the population. A small difference between two group means in a sample might indicate that there is a small difference between the two group means in the population. But it could also be that there is no difference between the means in the population and that the difference in the sample is just a matter of sampling error. Similarly, a Pearson’s  r  value of −.29 in a sample might mean that there is a negative relationship in the population. But it could also be that there is no relationship in the population and that the relationship in the sample is just a matter of sampling error.

In fact, any statistical relationship in a sample can be interpreted in two ways:

  • There is a relationship in the population, and the relationship in the sample reflects this.
  • There is no relationship in the population, and the relationship in the sample reflects only sampling error.

The purpose of null hypothesis testing is simply to help researchers decide between these two interpretations.

The Logic of Null Hypothesis Testing

Null hypothesis testing  is a formal approach to deciding between two interpretations of a statistical relationship in a sample. One interpretation is called the  null hypothesis  (often symbolized  H 0  and read as “H-naught”). This is the idea that there is no relationship in the population and that the relationship in the sample reflects only sampling error. Informally, the null hypothesis is that the sample relationship “occurred by chance.” The other interpretation is called the  alternative hypothesis  (often symbolized as  H 1 ). This is the idea that there is a relationship in the population and that the relationship in the sample reflects this relationship in the population.

Again, every statistical relationship in a sample can be interpreted in either of these two ways: It might have occurred by chance, or it might reflect a relationship in the population. So researchers need a way to decide between them. Although there are many specific null hypothesis testing techniques, they are all based on the same general logic. The steps are as follows:

  • Assume for the moment that the null hypothesis is true. There is no relationship between the variables in the population.
  • Determine how likely the sample relationship would be if the null hypothesis were true.
  • If the sample relationship would be extremely unlikely, then reject the null hypothesis  in favor of the alternative hypothesis. If it would not be extremely unlikely, then  retain the null hypothesis .

Following this logic, we can begin to understand why Mehl and his colleagues concluded that there is no difference in talkativeness between women and men in the population. In essence, they asked the following question: “If there were no difference in the population, how likely is it that we would find a small difference of  d  = 0.06 in our sample?” Their answer to this question was that this sample relationship would be fairly likely if the null hypothesis were true. Therefore, they retained the null hypothesis—concluding that there is no evidence of a sex difference in the population. We can also see why Kanner and his colleagues concluded that there is a correlation between hassles and symptoms in the population. They asked, “If the null hypothesis were true, how likely is it that we would find a strong correlation of +.60 in our sample?” Their answer to this question was that this sample relationship would be fairly unlikely if the null hypothesis were true. Therefore, they rejected the null hypothesis in favor of the alternative hypothesis—concluding that there is a positive correlation between these variables in the population.

A crucial step in null hypothesis testing is finding the likelihood of the sample result if the null hypothesis were true. This probability is called the  p value . A low  p  value means that the sample result would be unlikely if the null hypothesis were true and leads to the rejection of the null hypothesis. A p  value that is not low means that the sample result would be likely if the null hypothesis were true and leads to the retention of the null hypothesis. But how low must the  p  value be before the sample result is considered unlikely enough to reject the null hypothesis? In null hypothesis testing, this criterion is called  α (alpha)  and is almost always set to .05. If there is a 5% chance or less of a result as extreme as the sample result if the null hypothesis were true, then the null hypothesis is rejected. When this happens, the result is said to be  statistically significant . If there is greater than a 5% chance of a result as extreme as the sample result when the null hypothesis is true, then the null hypothesis is retained. This does not necessarily mean that the researcher accepts the null hypothesis as true—only that there is not currently enough evidence to reject it. Researchers often use the expression “fail to reject the null hypothesis” rather than “retain the null hypothesis,” but they never use the expression “accept the null hypothesis.”

The Misunderstood  p  Value

The  p  value is one of the most misunderstood quantities in psychological research (Cohen, 1994) [1] . Even professional researchers misinterpret it, and it is not unusual for such misinterpretations to appear in statistics textbooks!

The most common misinterpretation is that the  p  value is the probability that the null hypothesis is true—that the sample result occurred by chance. For example, a misguided researcher might say that because the  p  value is .02, there is only a 2% chance that the result is due to chance and a 98% chance that it reflects a real relationship in the population. But this is incorrect . The  p  value is really the probability of a result at least as extreme as the sample result  if  the null hypothesis  were  true. So a  p  value of .02 means that if the null hypothesis were true, a sample result this extreme would occur only 2% of the time.

You can avoid this misunderstanding by remembering that the  p  value is not the probability that any particular  hypothesis  is true or false. Instead, it is the probability of obtaining the  sample result  if the null hypothesis were true.
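One way to internalize the correct reading is a small simulation (a hypothetical sketch, not from the chapter): when the null hypothesis is true, p values are roughly uniform between 0 and 1, so a result with p ≤ .05 occurs about 5% of the time:

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
p_values = []
for _ in range(10_000):
    # Two groups drawn from the SAME population, so the null is true.
    a = rng.normal(size=30)
    b = rng.normal(size=30)
    p_values.append(stats.ttest_ind(a, b).pvalue)

print(np.mean(np.array(p_values) <= 0.05))   # close to 0.05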

[Image: "Null Hypothesis" retrieved from http://imgs.xkcd.com/comics/null_hypothesis.png (CC-BY-NC 2.5)]

Role of Sample Size and Relationship Strength

Recall that null hypothesis testing involves answering the question, “If the null hypothesis were true, what is the probability of a sample result as extreme as this one?” In other words, “What is the  p  value?” It can be helpful to see that the answer to this question depends on just two considerations: the strength of the relationship and the size of the sample. Specifically, the stronger the sample relationship and the larger the sample, the less likely the result would be if the null hypothesis were true. That is, the lower the  p  value. This should make sense. Imagine a study in which a sample of 500 women is compared with a sample of 500 men in terms of some psychological characteristic, and Cohen’s  d  is a strong 0.50. If there were really no sex difference in the population, then a result this strong based on such a large sample should seem highly unlikely. Now imagine a similar study in which a sample of three women is compared with a sample of three men, and Cohen’s  d  is a weak 0.10. If there were no sex difference in the population, then a relationship this weak based on such a small sample should seem likely. And this is precisely why the null hypothesis would be rejected in the first example and retained in the second.
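The following sketch makes these two considerations concrete for Pearson's r, using the standard t statistic for a correlation, \(t = r\sqrt{n-2}/\sqrt{1-r^2}\); the particular r and n values are hypothetical:

import math
from scipy import stats

def correlation_p_value(r: float, n: int) -> float:
    # Two-sided p value for a sample correlation r with sample size n.
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
    return 2 * stats.t.sf(abs(t), df=n - 2)

for r, n in [(0.50, 500), (0.50, 6), (0.10, 500), (0.10, 6)]:
    print(f"r = {r:.2f}, n = {n:3d}: p = {correlation_p_value(r, n):.4f}")

# Strong relationship, large sample: tiny p value. Weak relationship,
# small sample: large p value. The combinations in between trade off.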

Of course, sometimes the result can be weak and the sample large, or the result can be strong and the sample small. In these cases, the two considerations trade off against each other so that a weak result can be statistically significant if the sample is large enough and a strong relationship can be statistically significant even if the sample is small. Table 13.1 shows roughly how relationship strength and sample size combine to determine whether a sample result is statistically significant. The columns of the table represent the three levels of relationship strength: weak, medium, and strong. The rows represent four sample sizes that can be considered small, medium, large, and extra large in the context of psychological research. Thus each cell in the table represents a combination of relationship strength and sample size. If a cell contains the word  Yes , then this combination would be statistically significant for both Cohen’s  d  and Pearson’s  r . If it contains the word  No , then it would not be statistically significant for either. There is one cell where the decision for  d  and  r  would be different and another where it might be different depending on some additional considerations, which are discussed in Section 13.2 “Some Basic Null Hypothesis Tests”

Although Table 13.1 provides only a rough guideline, it shows very clearly that weak relationships based on medium or small samples are never statistically significant and that strong relationships based on medium or larger samples are always statistically significant. If you keep this lesson in mind, you will often know whether a result is statistically significant based on the descriptive statistics alone. It is extremely useful to be able to develop this kind of intuitive judgment. One reason is that it allows you to develop expectations about how your formal null hypothesis tests are going to come out, which in turn allows you to detect problems in your analyses. For example, if your sample relationship is strong and your sample is medium, then you would expect to reject the null hypothesis. If for some reason your formal null hypothesis test indicates otherwise, then you need to double-check your computations and interpretations. A second reason is that the ability to make this kind of intuitive judgment is an indication that you understand the basic logic of this approach in addition to being able to do the computations.

Statistical Significance Versus Practical Significance

Table 13.1 illustrates another extremely important point. A statistically significant result is not necessarily a strong one. Even a very weak result can be statistically significant if it is based on a large enough sample. This is closely related to Janet Shibley Hyde’s argument about sex differences (Hyde, 2007) [2] . The differences between women and men in mathematical problem solving and leadership ability are statistically significant. But the word  significant  can cause people to interpret these differences as strong and important—perhaps even important enough to influence the college courses they take or even who they vote for. As we have seen, however, these statistically significant differences are actually quite weak—perhaps even “trivial.”

This is why it is important to distinguish between the  statistical  significance of a result and the  practical  significance of that result.  Practical significance refers to the importance or usefulness of the result in some real-world context. Many sex differences are statistically significant—and may even be interesting for purely scientific reasons—but they are not practically significant. In clinical practice, this same concept is often referred to as “clinical significance.” For example, a study on a new treatment for social phobia might show that it produces a statistically significant positive effect. Yet this effect still might not be strong enough to justify the time, effort, and other costs of putting it into practice—especially if easier and cheaper treatments that work almost as well already exist. Although statistically significant, this result would be said to lack practical or clinical significance.

[Image: "Conditional Risk" retrieved from http://imgs.xkcd.com/comics/conditional_risk.png (CC-BY-NC 2.5)]

Key Takeaways

  • Null hypothesis testing is a formal approach to deciding whether a statistical relationship in a sample reflects a real relationship in the population or is just due to chance.
  • The logic of null hypothesis testing involves assuming that the null hypothesis is true, finding how likely the sample result would be if this assumption were correct, and then making a decision. If the sample result would be unlikely if the null hypothesis were true, then it is rejected in favor of the alternative hypothesis. If it would not be unlikely, then the null hypothesis is retained.
  • The probability of obtaining the sample result if the null hypothesis were true (the  p  value) is based on two considerations: relationship strength and sample size. Reasonable judgments about whether a sample relationship is statistically significant can often be made by quickly considering these two factors.
  • Statistical significance is not the same as relationship strength or importance. Even weak relationships can be statistically significant if the sample size is large enough. It is important to consider relationship strength and the practical significance of a result in addition to its statistical significance.
Exercises

  • Discussion: Imagine a study showing that people who eat more broccoli tend to be happier. Explain for someone who knows nothing about statistics why the researchers would conduct a null hypothesis test.
  • Practice: For each of the following results, use the relationship strength and the sample size to judge whether it is likely to be statistically significant.
      • The correlation between two variables is  r  = −.78 based on a sample size of 137.
      • The mean score on a psychological characteristic for women is 25 ( SD  = 5) and the mean score for men is 24 ( SD  = 5). There were 12 women and 10 men in this study.
      • In a memory experiment, the mean number of items recalled by the 40 participants in Condition A was 0.50 standard deviations greater than the mean number recalled by the 40 participants in Condition B.
      • In another memory experiment, the mean scores for participants in Condition A and Condition B came out exactly the same!
      • A student finds a correlation of  r  = .04 between the number of units the students in his research methods class are taking and the students' level of stress.

References

  • Cohen, J. (1994). The world is round: p < .05. American Psychologist, 49, 997–1003.
  • Hyde, J. S. (2007). New directions in the study of gender similarities and differences. Current Directions in Psychological Science, 16, 259–263.



PLOS ONE

Why we habitually engage in null-hypothesis significance testing: A qualitative study

Jonah Stunt

1 Department of Health Sciences, Section of Methodology and Applied Statistics, Vrije Universiteit, Amsterdam, The Netherlands

2 Department of Radiation Oncology, Erasmus Medical Center, Rotterdam, The Netherlands

Leonie van Grootel

3 Rathenau Institute, The Hague, The Netherlands

4 Department of Philosophy, Vrije Universiteit, Amsterdam, The Netherlands

5 Department of Epidemiology and Data Science, Amsterdam University Medical Centers, Amsterdam, The Netherlands

David Trafimow

6 Psychology Department, New Mexico State University, Las Cruces, New Mexico, United States of America

Trynke Hoekstra

Michiel de Boer

7 Department of General Practice and Elderly Care, University Medical Center Groningen, Groningen, The Netherlands


Null Hypothesis Significance Testing (NHST) is the most familiar statistical procedure for making inferences about population effects. Important problems associated with this method have been addressed and various alternatives that overcome these problems have been developed. Despite its many well-documented drawbacks, NHST remains the prevailing method for drawing conclusions from data. Reasons for this have been insufficiently investigated. Therefore, the aim of our study was to explore the perceived barriers and facilitators related to the use of NHST and alternative statistical procedures among relevant stakeholders in the scientific system.

Individual semi-structured interviews and focus groups were conducted with junior and senior researchers, lecturers in statistics, editors of scientific journals and program leaders of funding agencies. During the focus groups, important themes that emerged from the interviews were discussed. Data analysis was performed using the constant comparison method, allowing emerging (sub)themes to be fully explored. A theory substantiating the prevailing use of NHST was developed based on the main themes and subthemes we identified.

Twenty-nine interviews and six focus groups were conducted. Several interrelated facilitators and barriers associated with the use of NHST and alternative statistical procedures were identified. These factors were subsumed under three main themes: the scientific climate, scientific duty, and reactivity. As a result of these factors, most participants feel dependent in their actions upon others, have become reactive, and await action and initiatives from others. This may explain why NHST is still the standard and is ubiquitously used by almost everyone involved.

Our findings demonstrate how perceived barriers to shift away from NHST set a high threshold for actual behavioral change and create a circle of interdependency between stakeholders. By taking small steps it should be possible to decrease the scientific community’s strong dependence on NHST and p-values.

Introduction

Empirical studies often start from the idea that there might be an association between a specific factor and a certain outcome within a population. This idea is referred to as the alternative hypothesis (H1). Its complement, the null hypothesis (H0), typically assumes no association or effect (although it is possible to test other effect sizes than no effect with the null hypothesis). At the stage of data-analysis, the probability of obtaining the observed, or a more extreme, association is calculated under the assumption of no effect in the population (H0) and a number of inferential assumptions [ 1 ]. The probability of obtaining the observed, or more extreme, data is known as ‘the p-value’. The p-value demonstrates the compatibility between the observed data and the expected data under the null hypothesis, where 0 is complete incompatibility and 1 is perfect compatibility [ 2 ]. When the p-value is smaller than a prespecified value (labelled as alpha, usually set at 5% (0.05)), results are generally declared to be statistically significant. At this point, researchers commonly reject the null hypothesis and accept the alternative hypothesis [ 2 ]. Assessing statistical significance by means of contrasting the data with the null hypothesis is called Null Hypothesis Significance Testing (NHST). NHST is the best known and most widely used statistical procedure for making inferences about population effects. The procedure has become the prevailing paradigm in empirical science [ 3 ], and reaching and being able to report statistically significant results has become the ultimate goal for many researchers.

Despite its widespread use, NHST and the p-value have been criticized since its inception. Numerous publications have addressed problems associated with NHST and p-values. Arguably the most important drawback is the fact that NHST is a form of indirect or inverse inference: researchers usually want to know if the null or alternative hypothesis can be accepted and use NHST to conclude either way. But with NHST, the probability of a finding, or more extreme findings, given the null hypothesis is calculated [ 4 ]. Ergo, NHST doesn’t tell us what we want to know. In fact, p-values were never meant to serve as a basis to draw conclusions, but as a continuous measure of incompatibility between empirical findings and a statistical model [ 2 ]. Moreover, the procedure promotes a dichotomous way of thinking, by using the outcome of a significance test as a dichotomous indicator for an effect (p<0.05: effect, p>0.05: no effect). Reducing empirical findings to two categories also results in a great loss of information. Further, a significant outcome is often unjustly interpreted as relevant, but a p-value does not convey any information about the strength or importance of the association. Worse yet, the p-values on which NHST is based confound effect size and sample size. A trivial effect size may nevertheless result in statistical significance provided a sufficiently large sample size. Or an important effect size may fail to result in statistical significance if the sample size is too small. P-values do not validly index the size, relevance, or precision of an effect [ 5 ]. Furthermore, statistical models include not only null hypotheses, but additional assumptions, some of which are wrong, such as the ubiquitous assumption of random and independent sampling from a defined population [ 1 ]. Therefore, although p-values validly index the incompatibility of data with models, p-values do not validly index incompatibility of data with hypotheses that are embedded in wrong models. These are important drawbacks rendering NHST unsuitable as the default procedure for drawing conclusions from empirical data [ 2 , 3 , 5 – 13 ].

A number of alternatives have been developed that overcome these pitfalls, such as Bayesian inference methods [ 7 , 11 , 14 , 15 ], informative hypothesis testing [ 9 , 16 ] and a priori inferential statistics [ 4 , 17 ]. These alternatives build on the idea that research usually starts with a more informed research-question than one merely assuming the null hypothesis of no effect. These methods overcome the problem of inverse inference, although the first two might still lead to dichotomous thinking with the use of thresholds. Despite the availability of alternatives, statistical behavior in the research community has hardly changed. Researchers have been slow to adopt alternative methods and NHST is still the prevailing paradigm for making inferences about population effects [ 3 ].

Until now, reasons for the continued and ubiquitous use of NHST and the p-value have scarcely been investigated. One explanation is that NHST provides a very simple means for drawing conclusions from empirical data, usually based on the 5% cut-off. Secondly, most researchers are unaware of the pitfalls of NHST; it has been shown that NHST and the p-value are often misunderstood and misinterpreted [ 2 , 3 , 8 , 11 , 18 , 19 ]. Thirdly, NHST has a central role in most methods and statistics courses in higher education. Courses on alternative methods are increasingly being offered but are usually not mandatory. To our knowledge, there is a lack of in-depth, empirical research aimed at elucidating why NHST nevertheless remains the dominant approach, or what actions can be taken to shift the sciences away from NHST. Therefore, the aim of our study was to explore the perceived barriers and facilitators, as well as behavioral intentions, related to the use of NHST and alternative statistical procedures among all relevant stakeholders in the scientific system.

Theoretical framework

In designing our study, we used two theories. Firstly, we used the 'diffusion of innovation theory' of Rogers [ 20 ]. This theory describes the dissemination of an innovation as a process consisting of four elements: 1) an innovation is 2) communicated through certain channels 3) over time 4) among the members of a social system [ 20 ]. In the current study, the innovation consists of the idea that we should stop the default use of NHST and instead consider using alternative methods for drawing conclusions from empirical data. The science system forms the social structure in which the innovation should take place. The most important members, and potential adopters of the innovation, that we identified are researchers, lecturers, editors of scientific journals and representatives of funding agencies. Rogers describes phases in the adoption process, which coincide with characteristics of the (potential) adopters of the idea: 1) innovators, 2) early adopters, 3) early majority adopters, 4) late majority adopters and 5) laggards. Innovators are the first to adopt an innovation. There are few innovators, but these few are very important for bringing in new ideas. Early adopters form the second group to adopt an innovation. This group includes opinion leaders and role models for other stakeholders. The largest group consists of the early and late majority who follow the early adopters, and then there is a smaller group of laggards who resist the innovation until they are certain the innovation will not fail. The process of innovation adoption by individuals is described as a normal distribution ( Fig 1 ). For all five groups, the adoption of a new idea is influenced by the following five characteristics of the innovative idea: 1) its relative advantage, 2) its compatibility with current experiences, 3) its complexity, 4) its flexibility, and 5) its visibility [ 20 ]. Members of all four stakeholder groups could play an important role in the diffusion of the innovation of replacing NHST with its alternatives.

Fig 1. The innovativeness dimension, measured by the time at which an individual from an adopter category adopts an innovation. Each category is one or more standard deviations removed from the average time of adoption [ 20 ].

Another important theory for our study is the ‘theory of planned behavior’, that was developed in the 1960s [ 21 ]. This theory describes how human behavior in a certain context can be predicted and explained. The theory was updated in 2010, under the name ‘the reasoned action approach’ [ 22 ]. A central factor in this theory is the intention to perform a certain behavior, in this case, to change the default use of NHST. According to the theory, people’s intentions determine their behaviors. An intention indexes to what extent someone is motivated to perform the behavior. Intentions are determined by three independent determinants: the person’s attitudes toward the behavior—the degree to which a person sees the behavior as favorable or unfavorable, perceived subjective norms regarding the behavior—the perceived social pressure to perform the behavior or not, and perceptions of control regarding the behavior—the perceived ease or difficulty of performing the behavior. Underlying (i.e. responsible for) these three constructs are corresponding behavioral, normative, and control beliefs [ 21 , 22 ] (see Fig 2 ).

Fig 2. The reasoned action approach: behavioral, normative, and control beliefs underlie attitudes toward the behavior, perceived subjective norms, and perceptions of control, which together determine intentions and behavior [ 21 , 22 ].

Both theories served as a lens for data collection and analysis. We used sensitizing concepts [ 23 ] from both theories, within the framework of the grounded theory approach [ 24 ], as a starting point for this qualitative study and, more specifically, for the topic list for the interviews and focus groups, providing direction and guidance for data collection and analysis.

Many of the concepts of Rogers’ and Fishbein and Ajzen’s theory can be seen as facilitators and barriers for embracing and implementing innovation in the scientific system.

A qualitative study among stakeholders using semi-structured interviews and focus groups was performed. Data collection and analysis were guided by the principle of constant comparison traditional to the grounded theory approach we followed [ 24 ]. Grounded theory is a methodology that uses inductive reasoning and aims to construct a theory through the collection and analysis of data. Constant comparison is the iterative process whereby each part of the data that emerges from the data analysis is compared with other parts of the data to thoroughly explore and validate the data. Concepts that have been extracted from the data are tagged with codes that are grouped into categories. These categories constitute themes, which (may) become the basis for a new theory. Data collection and analysis were continued until no new information was gained and data saturation had likely occurred within the identified themes.

The target population consisted of stakeholders relevant to our topic: junior and senior researchers, lecturers in statistics, editors of scientific journals and program leaders of funding agencies (see Tables 1 and 2 ). We approached participants in the field of medical sciences, health- and life sciences and psychology. In line with the grounded theory approach, theoretical sampling was used to identify and recruit eligible participants. Theoretical sampling is a form of purposive sampling. This means that we aimed to purposefully select participants, based on their characteristics that fit the parameters of the research questions [ 25 ]. Recruitment took place by approaching persons in our professional networks and/or the networks of the approached persons.


Data collection

We conducted individual semi-structured interviews followed by focus groups. The aim of the interviews was to gain insight into the views of participants on the use of NHST and alternative methods and to examine potential barriers and facilitators related to these methods. The aim of the focus groups was to validate and further explore interview findings and to develop a comprehensive understanding of participants’ views and beliefs.

For the semi-structured interviews, we used a topic list (see Appendix 1 in S1 Appendix ). Questions addressed participants’ knowledge and beliefs about the concept of NHST, their familiarity with NHST, perceived attractiveness and drawbacks of the use of NHST, knowledge of the current NHST debate, knowledge of and views on alternative procedures and their views on the future of NHST. The topic list was slightly adjusted based on the interviews with editors and representatives from funding agencies (compared to the topic list for interviews with researchers and lecturers). Questions particularly focused on research and education were replaced by questions focused on policy (see Appendix 1 in S1 Appendix ).

The interviews were conducted between October 2017 and June 2018 by two researchers (L.v.G. and J.S.), both trained in qualitative research methods. Interviews lasted about one hour (range 31–86 minutes) and were voice-recorded. One interview was conducted by telephone; all others were face to face and took place at a location convenient for the participants, in most cases the participants’ work location.

Focus groups

During the focus groups, important themes that emerged from the interviews were discussed and explored. These include perceptions on NHST and alternatives and essential conditions to shift away from the default use of NHST.

Five focus groups included representatives from the different stakeholder groups. One focus group was homogeneous, including solely lecturers. The focus groups consisted of 'old' as well as 'new' participants; that is, some of the participants in the focus groups had also been interviewed. We also selected persons who were open to contributing further to the NHST debate and were willing to help think about (implementing) alternatives to NHST.

The focus groups were conducted between September and December 2018 by three researchers (L.v.G., J.S. and A.d.K.), all trained in qualitative research methods. The focus groups lasted about one-and-a-half hours (range 86–100 minutes).

Data analysis

All interviews and focus groups were transcribed verbatim. Atlas.ti 8.0 software was used for data management and analysis. All transcripts were read thoroughly several times to identify meaningful and relevant text fragments and analyzed by two researchers (J.S. and L.v.G.). Deductive predefined themes and theoretical concepts were used to guide the development of the topic list for the semi-structured interviews and focus groups, and were used as sensitizing concepts [ 23 ] in data collection and data analysis. Inductive themes were identified during the interview process and analysis of the data [ 26 ].

Transcripts were open-, axial- and selectively coded by two researchers (J.S. and L.v.G.). Open coding is the first step in the data-analysis, whereby phenomena found in the text are identified and named (coded). With axial coding, connections between codes are drawn. Selective coding is the process of selecting one central category and relating all other categories to that category, capturing the essence of the research. The constant comparison method [ 27 ] was applied allowing emerging (sub)themes to be fully explored. First, the two researchers independently developed a set of initial codes. Subsequently, findings were discussed until consensus was reached. Codes were then grouped into categories that were covered under subthemes, belonging to main themes. Finally, a theory substantiating the prevailing use of NHST was developed based on the main themes and subthemes.

Ethical issues

This research was conducted in accordance with the Dutch "General Data Protection Regulation" and the "Netherlands Code of Conduct for Research Integrity". The research protocol had been submitted for review and approved by the ethical review committee of the VU Faculty of Behavioral and Movement Sciences. In addition, the project had been submitted to the Medical Ethics Committee (METC) of the Amsterdam University Medical Centre, which decided that the project is not subject to the Medical Research (Human Subjects) Act (WMO). At the start of data collection, all participants signed an informed consent form.

A full study protocol, including a detailed data analysis plan, was preregistered ( https://osf.io/4qg38/ ). At the start of this study, preregistration forms for qualitative studies had not yet been developed; therefore, preregistration for this study is based on an outdated form. Presently, there is a preregistration form available for qualitative studies [ 28 ]. Information about data collection, data management, data sharing and data storage is described in a Data Management Plan. Sensitive data is stored in Darkstor, an offline archive for storing sensitive information or data (information that involves, e.g., privacy or copyright). As the recordings and transcripts of the interviews and focus groups contain privacy-sensitive data, these files are archived in Darkstor and can be accessed only on request by authorized individuals (i.e., the original researcher or a research coordinator) (data requests can be sent to ln.uv@mdr ). Non-sensitive data is stored in DANS ( https://doi.org/10.17026/dans-2at-nzfs ) (Data Archiving and Networked Services; the Netherlands institute for permanent access to digital research resources).

Participant characteristics

Twenty-nine individual interviews and six focus groups were conducted. The focus groups included four to six participants per session. A total of 47 participants were included in the study (13 researchers, 15 lecturers, 11 editors of scientific journals and 8 representatives of funding agencies). Twenty-nine participants were interviewed; twenty-seven participants took part in the focus groups. Nine of the twenty-seven were both interviewed and took part in a focus group. Some participants had multiple roles (i.e., editor and researcher, editor and lecturer, or lecturer and researcher) but were classified based on their primary role (assistant professors were classified as lecturers). The lecturers in statistics in our sample were not statisticians themselves: although they had all received training in statistics, they were primarily trained as psychologists, medical doctors, or health scientists. Some lecturers in our sample taught an applied subject with statistics as part of it; other lecturers taught Methodology and Statistics courses. Statistical skills and knowledge among lecturers varied from modest to quite advanced; among participants from the other stakeholder groups, they varied from poor to quite advanced. All participants were working in the Netherlands. A general overview of the participants is presented in Table 1 . Participant characteristics split up by interviews and focus groups are presented in Table 2 .

Three main themes with sub-themes and categories emerged ( Fig 3 ): the green-colored compartments hold the three main themes: The scientific climate , The scientific duty and Reactivity . Each of these three main themes consists of subthemes, depicted by the yellow-colored compartments. In turn, some (but not all) of the 9 subthemes also have categories. These ‘lower level’ findings are not included in the figure but will be mentioned in the elaboration on the findings and are depicted in Appendix 2 in S1 Appendix . Fig 3 shows how the themes are related to each other. The blue arrows indicate that the themes are interrelated: factors influence each other. The scientific climate affects the way stakeholders perceive and fulfil their scientific duty, and the way stakeholders give substance to their scientific duty in turn shapes and maintains the scientific climate. Together, the scientific duty and the scientific climate cause a state of reactivity. Many participants have adopted a ’wait and see’ attitude regarding behavioral changes with respect to statistical methods. They feel dependent on someone else’s action. This leads to a reactive (instead of a proactive) attitude and a low sense of responsibility. ‘Reactivity’ is the core theme, explaining the most critical problem with respect to the continuous and ubiquitous use of NHST.

Fig 3. Main themes and subthemes are numbered. Categories are mentioned in the body of the text in bold. ‘P’ stands for participant; ‘I’ stands for interviewer.

1. The scientific climate

The theme, ‘the scientific climate’, represents researchers’ (Dutch) perceptions of the many written and unwritten rules they face in the research environment. This theme concerns the opportunities and challenges participants encounter when working in the science system. Dutch academics feel pressured to publish fast and regularly, and to follow conventions and directions of those on whom they depend. They feel this comes at the expense of the quality of their work. Thus, the scientific climate in the Netherlands has a strong influence on the behavior of participants regarding how they set their priorities and control the quality of their work.

1 . 1 Quality control . Monitoring the quality of research is considered very important. Researchers, funding agencies and editors indicate that they rely on their own knowledge, expertise, and insight, and those of their colleagues, to guarantee this quality. However, editors and funding agencies are often left with little choice when it comes to compiling an evaluation committee or a review panel; the choice is often based on ‘like knows like’. Given the limited choice, they are forced to trust the opinion of their consultants, but the question is whether this trust is justified.

I: “The ones who evaluate the statistics, do they have sufficient statistical knowledge?” P: “Ehhr, no, I don’t think so.” I: “Okay, interesting. So, there are manuscripts published of which you afterwards might think….” P: “Yes yes.” (Interview 18; Professor/editor, Medical Sciences)

1 . 2 Convention . The scientific system is built on mores and conventions, as this participant describes:

P: “There is science, and there is the sociology of science, that is, how we talk to each other, what we believe, how we connect. And at some point, it was agreed upon that we would talk to each other in this way.” (Interview 28, researcher, Medical Sciences)

And to these conventions, one (naturally) conforms. Stakeholders copy behavior and actions of others within their discipline, thereby causing particular behaviors and values to become conventional or normative. One of those conventions is the use of NHST and p-values. Everyone is trained with NHST and is used to applying this method. Another convention is the fact that significant results mean ‘success’, in the sense of successful research and being a successful researcher. Everyone is aware that ‘p is smaller than 0.05’ means the desired results are achieved and that publication and citation chances are increased.

P: “You want to find a significant result so badly. (…) Because people constantly think: I must find a significant result, otherwise my study is worthless.” (Focus group 4, lecturer, Medical Sciences)

Stakeholders rigidly hold on to the above-mentioned conventions and are not inclined to deviate from existing norms; they are, in other words, quite conservative . ‘We don’t know any better’ has been brought up as a valid argument by participants from various stakeholder groups to stick to current rules and conventions. Consequently, the status quo in the scientific system is being maintained.

P: “People hold on to….” I: ‘Everyone maintains the system?’ P: ‘Yes, we kind of hang to the conservative manner. This is what we know, what someone, everyone, accepts.” (Interview 17, researcher, Health Sciences)

Everyone is trained with NHST and considers it an accessible and easy-to-interpret method. The familiarity and perceived simplicity of NHST, user-friendly software such as SPSS, and the clear cut-off value for significance are important facilitators of the use of NHST and, at the same time, barriers to starting to use alternative methods. Applied researchers stressed the importance of the accessibility of NHST as a method to test hypotheses and draw conclusions. This accessibility also justifies the use of NHST when researchers want to communicate their study results and messages in understandable ways to their readership.

P: “It is harder, also to explain, to use an alternative. So, I think, but maybe I’m overstepping, but if you want to go in that direction [alternative methods] it needs to be better facilitated for researchers. Because at the moment… I did some research, but, you know, there are those uncommon statistical packages.” (Interview 16, researcher/editor, Medical Sciences)

1 . 3 Publication pressure . Most researchers mentioned that they perceive publication pressure. This motivates them to use NHST and hope for significant results, as ‘significant p-values’ increase publication chances. They perceive a high workload and the way the scientific reward system is constructed as barriers for behavioral change pertaining to the use of statistical methods; potential negative consequences for publication and career chances prevent researchers from deviating from (un)written rules.

P: “I would like to learn it [alternative methods], but it might very well be that I will not be able to apply it, because I will not get my paper published. I find that quite tricky.” (Interview 1, Assistant Professor, Health Sciences)

2. The scientific duty

Throughout the interviews, participants reported a sense of duty in several variations. “What does it mean to be a scientific researcher?” seemed to be a question that was reflected upon during rather than prior to the interview, suggesting that many scientists had not really thought about the moral and professional obligations of being a scientist in general—let alone what that would mean for their use of NHST. Once they had given it some thought, the opinions concerning what constitutes the scientific duty varied to a large extent. Some participants attached great importance to issues such as reproducibility and transparency in scientific research and continuing education and training for researchers. For others, these topics seemed to play a less important role. A distinction was made between moral and professional obligations that participants described concerning their scientific duty.

2 . 1 Moral obligation . The moral obligation concerns issues such as doing research in a thorough and honest way, refraining from questionable research practices (QRPs) and investing in better research. It concerns tasks and activities that are not often rewarded or acknowledged.

Throughout the interviews and the focus groups, participants very frequently touched upon the responsibility they felt for doing ‘the right thing’ and making the right choices in doing research and, in particular, in using NHST. The extent to which they felt responsible varied among participants. When it comes to choices made while doing research—for example, drawing conclusions from data—participants felt a strong sense of responsibility to do this correctly. However, when it comes to innovation and new practices, and feeling responsible for one’s own research, let alone improving scientific practice in general, opinions differed. This quotation from one of the focus groups illustrates that:

P1: “If you people [statisticians, methodologists] want me to improve the statistics I use in my research, then you have to hand it to me. I am not going to make any effort to improve that myself.” P3: “No. It is your responsibility as an academic to keep growing and learning and so, also to start familiarizing yourself when you notice that your statistics might need improvement.” (Focus group 2, participant 1 (PhD researcher, Medical Sciences) and 3 (Associate Professor, Health Sciences))

The sense of responsibility for improving research practices regarding the use of NHST was strongly felt and emphasized by a small group of participants. They emphasized the responsibility of the researcher to think, interpret and be critical when interpreting the p -value in NHST. It was felt that you cannot leave that up to the reader. Moreover, scrutinizing and reflecting upon research results was considered a primary responsibility of a scientist, and failing to do so was seen as not living up to what your job demands of you:

P: “Yes, and if I want to be very provocative—and I often want that, because then people tend to wake up and react: then I say that hiding behind alpha.05 is just scientific laziness. Actually, it is worse: it is scientific cowardice. I would even say it is ‘relieving yourself from your duty’, but that may sound a bit harsh…” (Interview 2, Professor, Health Sciences)

These participants were convinced that scientists have a duty to keep scientific practice in general at the highest level possible.

The avoidance of questionable research practices (QRPs) was considered a means to keep scientific practice at a high level and was often touched upon during the interviews and focus groups as being part of the scientific duty. Statisticians saw NHST as directly facilitating QRPs and provided ample examples of how the use of NHST leads to QRPs, whereas most applied researchers perceived NHST as the common way of doing research and were not aware of the associated risks of QRPs. Participants did mention the violation of assumptions underlying NHST as a QRP. Participants also considered overinterpreting results a QRP, including exaggerating the degree of significance. Although participants stated they were careful about interpreting and reporting p-values, they ‘admitted’ that statistical significance was a starting point for them. Most researchers indicated they search for information that could get their study published, which usually includes a low p-value (this also relates to the theme ‘Scientific climate’).

P: “We all know that a lot of weight is given to the p-value. So, if it is not significant, then that’s the end of it. If it ís significant, it just begins.” (Interview 5, lecturer, Psychology)

The term ‘sloppy science’ was mentioned in relation to efforts by researchers to reduce the p -value (a.k.a. p-hacking, data-dredging, and HARKing. HARKing is an acronym that refers to the questionable research practice of Hypothesizing After the Results are Known: it occurs when researchers formulate a hypothesis after the data have been collected and analyzed, but present it as if it were an a priori hypothesis [ 29 ]). Preregistration and replication were mentioned as promising solutions for some of the problems caused by NHST.

2 . 2 . Professional obligation . The theme professional obligation reflects participants’ expressions about what methodological knowledge scientists should have about NHST. In contrast to moral obligations, there appeared to be some consensus about scientists’ professional obligations. Participants considered critical evaluation of research results a core professional obligation. Also, within all the stakeholder groups, participants agreed that sufficient statistical knowledge is required for using NHST, but they varied in their insight into the principles, potential and limitations of NHST. This also applied to the extent to which participants were aware of the current debate about NHST.

Participants considered critical thinking a requirement for fulfilling their professional obligation. It specifically refers to the process of interpreting outcomes and taking all relevant contextual information into consideration. Critical thinking was not only literally referred to by participants, but also emerged from interpreting text fragments about where participants place the emphasis within their research. Researchers differed quite strongly in where they thought the emphasis of their research outcomes should be put and what kind of information is required when reporting study results. Participants mentioned the proven effectiveness of a particular treatment, giving a summary of the research results, effect sizes, clinical relevance, p-values, or whether a considerable contribution to science or society has been made.

P: “I come back to the point where I said that people find it arbitrary to state that two points difference on a particular scale is relevant. They prefer to hide behind an alpha of 0.05, as if it is a God given truth, that it counts for one and for all. But it is just as well an invented concept and an invented guideline, an invented cut-off value, that isn’t more objective than other methods?” (Interview 2, Professor, Health Sciences)

For some participants, especially those representing funding agencies, critical thinking was primarily seen as a prerequisite for the utility of the research. The focus, when formulating the research question and interpreting the results, should be on practical relevance and the contribution the research makes to society.

The term ‘ignorance’ arose in the context of the participants’ concern regarding the level of statistical knowledge scientists and other stakeholders have versus what knowledge they should have to adequately apply statistical analysis in their research. The more statistically competent respondents in the sample felt quite strongly about how problematic the lack of knowledge about NHST is among those who regularly use it in their research, let alone the lack of knowledge about alternative methods. They felt that regularly retraining yourself in research methods is an essential part of the professional obligation one has. Applied researchers in the sample agreed that a certain level of background knowledge on NHST was required to apply it properly to research and acknowledged their own ignorance. However, they had different opinions about what level of knowledge is required. Moreover, not all of them regarded it as part of their scientific duty to be informed about all ins and outs of NHST; some saw it as the responsibility of statisticians to actively inform them (see also the subtheme periphery). Some participants were not aware of their ignorance, or stated that some of their colleagues are not aware of their ignorance, i.e., that they are unconsciously incompetent: without realizing it, they poorly understand what the p-value and associated outcome measures actually mean.

P: “The worst, and I honestly think that this is the most common, is unconsciously incompetent, people don’t even understand that…” I: “Ignorance.” P: “Yes, but worse, ignorant and not even knowing you are ignorant.” (Interview 2, Professor, Health Sciences)

The lack of proper knowledge about statistical procedures was especially prevalent in the medical sciences. Participants working in or with the medical sciences all confirmed that there is little room for proper statistical training for medical students and that the level of knowledge is fairly low. NHST is often used because of its simplicity. It is especially attractive for medical PhD students because they need their PhD to get ahead in their medical career instead of pursuing a scientific career.

P: “I am not familiar with other ways of doing research. I would really like to learn, but I do not know where I could go. And I do not know whether there are better ways. So sometimes I do read studies of which I think: ‘this is something I could investigate with a completely different test. Apparently, this is also possible, but I don’t know how.’ Yes, there are courses, but I do not know what they are. And here in the medical center, a lot of research is done by medical doctors and these people have hardly been taught any statistics. Maybe they will get one or two statistics courses, they know how to do a t-test and that is about it. (…) And the courses have a very low level of statistics, so to say.” (Interview 1, Assistant Professor, Health Sciences)

Also, the term ‘awareness’ arose. Firstly, it refers to being conscious of the limitations of NHST. Secondly, it refers to awareness of the ongoing discussions about NHST and, more broadly, about the replication crisis. The statisticians in the sample emphasized the importance of knowing that NHST has limitations and cannot be considered the holy grail of data analysis. They also emphasized the importance of being aware of the debate. A certain level of awareness was considered a necessary requirement for critical thinking. There was variation in that awareness: some participants were quite informed and fairly engaged in the discussion, whereas others were very new to the discussion and to larger contextual factors, such as the replication crisis.

I: “Are you aware of the debate going on in academia on this topic [NHST]?” P: “No, I occasionally see some article sent by a colleague passing by. I have the idea that something is going on, but I do not know how the debate is conducted and how advanced it is.” (Interview 6, lecturer, Psychology)

With respect to the theme, ‘the scientific duty’, participants differed in the extent to which they felt responsible for better and open science, for pioneering, for reviewing, and for growing and learning as a scientist. Participants had one commonality: although they strived to adhere to the norms of good research, the prevailing feeling is that this is very difficult, due to the scientific climate. Consequently, participants perceive an internal conflict: a discrepancy between what they want or believe, and what they do. Participants often found themselves struggling with the responsibility they felt they had. Making the scientifically most solid choice was often difficult due to feasibility, time constraints, or certain expectations from supervisors (this is also directly related to the themes ‘Scientific climate’ and ‘Reactivity’). Thus, the scientific climate strongly influences the behavior of scientists regarding how they set their priorities and fulfill their scientific duties. The strong sense of scientific duty was perceived by some participants as a facilitator and by others as a barrier for the use of alternative methods.

3. Reactivity

A consequence of the foregoing factors is that most stakeholders have adopted a reactive attitude and behave accordingly. People are disinclined to take responsibility and await external signals and initiatives of others. This might explain why NHST is being continuously used and remains the default procedure to make inferences about population effects.

The core theme ‘reactivity’ can be explained by the following subthemes and categories:

3 . 1 Periphery . The NHST-problem resides in the periphery in several ways. First, it is a subject that is not given much priority. Secondly, some applied researchers and editors believe that methodological knowledge, as it is not their field of expertise, should not be part of their job requirement. This also applies to the NHST debate. Thirdly, and partly related to the second point, there is a lack of cooperation within and between disciplines.

The term ‘ priority’ was mentioned often when participants were asked to what extent the topic of NHST was subject of discussion in their working environment. Participants indicated that (too) little priority is given to statistics and the problems related to the subject. There is simply a lot going on in their research field and daily work, so there are always more important or urgent issues on the agenda.

P: “Discussions take place in the periphery; many people find it complicated. Or are just a little too busy.” (Interview 5, lecturer, Psychology)

As the NHST debate is not prioritized, initiatives with respect to this issue are not forthcoming. Moreover, researchers and lecturers claim there is neither time nor money available for training in statistics in general or acquiring more insight and skills with respect to (the use of) alternative methods. Busy working schedules were mentioned as an important barrier for improving statistical knowledge and skills.

P: “Well you can use your time once, so it is an issue low on the priority list.” (Focus group 5, researcher, Medical Sciences)

The NHST debate is perceived as the domain of statisticians and methodologists. Also, cooperation between different domains and domain-specific experts is perceived as complicated, as different perceptions and ways of thinking can clash. Therefore, some participants feel that separate worlds should be kept separate; put another way: stick to what you know!

P: “This part is not our job. The editorial staff, we have the assignment to ensure that it is properly written down. But the discussion about that [alternatives], that is outside our territory.” (Interview 26, editor, Medical Sciences)

Within disciplines, individuals tend to act on their own, not being aware that others are working on the same subject and that it would be worthwhile to join forces. The interviews and focus groups revealed that a modest number of participants actively try to change the current situation, but in doing so, feel like lone voices in the wilderness.

P1: “I mean, you become a lone voice in the wilderness.” P2: “Indeed, you don’t want that.” P1: “I get it, but no one listens. There is no audience.” (Focus Group 3, P1: MD, lecturer, medical Sciences, P2: editor, Medical Sciences)

To succeed at positive change, participants emphasized that it is essential that people cooperate across disciplines and join forces, rather than operate at the individual level, focusing solely on their own working environment.

The caution people show with respect to taking initiative is reinforced by the fear of encountering resistance from their working environment when voicing that change regarding the use of NHST is needed. A condition that was mentioned as essential to bring about change was tactical implementation, that is, taking very small steps. As everyone is still using NHST, taking big steps brings the risk of losing especially the more conservative people along the way. Also, adjusting policy, guidelines and educational programs are processes that require time and scope.

P: “Everyone still uses it, so I think we have to be more critical, and I think we have to look at some kind of culture change, that means that we are going to let go of it (NHST) more and we will also use other tests, that in the long term will overthrow NHST. I: and what about alternatives? P: I think you should never be too fanatic in those discussion, because then you will provoke resistance. (…) That is not how it works in communication. You will touch them on a sore spot, and they will think: ‘and who are you?’ I: “and what works?” P: “well, gradualness. Tell them to use NHST, do not burn it to the ground, you do not want to touch peoples work, because it is close to their hearts. Instead, you say: ‘try to do another test next to NHST’. Be a pioneer yourself.” (Interview 5, lecturer, Psychology)

3 . 2 . Efficacy . Most participants stated they feel they are not in the position to initiate change. On the one hand, this feeling is related to their hierarchical positions within their working environments. On the other hand, the feeling is caused by the fact that statistics is perceived as a very complex field of expertise and people feel they lack sufficient knowledge and skills, especially about alternative methods.

Many participants stated they felt little sense of empowerment, or self-efficacy. The academic system is perceived as hierarchical, with an unequal balance of power. Most participants believe that it is not in their power to take the lead in innovative actions or to stand up against the establishment, and think that this responsibility lies with other stakeholders who have more status.

P: “Ideally, there would be a kind of an emergency letter from several people whose names open up doors, in which they indicate that in the medical sciences we are throwing away money because research is not being interpreted properly. Well, if these people that we listen to send such an emergency letter to the board of The Netherlands Organization for Health Research and Development [the largest Dutch funding agency for innovation and research in healthcare], I can imagine that this will initiate a discussion.” (…) I: “and with a big name you mean someone from within the science system? P: well, you know, ideally a chairman, or chairmen of the academic medical center. At that level. If they would put a letter together. Yes, that of course would have way more impact. Or some prominent medical doctors, yes, that would have more impact, than if some other person would send a letter yes.” (Interview 19, representative from funding agency, Physical Sciences)

Some participants indicated that they did try to make a difference but encountered too much resistance and therefore gave up their efforts. PhD students feel they have insufficient power to choose their own directions and make their own choices.

P: I am dependent on funding agencies and professors. In the end, I will write a grant application in that direction that gives me the greatest chance of eventually receiving that grant. Not primarily research that I think is the most optimal (…) If I know that reviewers believe the p-value is very important, well, of course I write down a method in which the p-value is central.” (Focus group 2, PhD-student, Medical Sciences)

With a sense of imperturbability, most participants accept that they cannot really change anything.

Lastly, the complexity of the subject is an obstacle for behavioral change. Statistics is perceived as a difficult subject. Participants indicate that they have a lack of knowledge and skills and that they are unsure about their own abilities. This applies to the ‘standard’ statistical methods (NHST), but to a greater extent to alternative methods. Many participants feel that they do not have the capacity to pursue a true understanding of (alternative) statistical methods.

P: “Statistics is just very hard. Time and again, research demonstrates that scientists, even the smartest, have a hard time with statistics.” (Focus group 3, PhD researcher, Psychology)

3 . 3 . Interdependency . As mentioned, participants feel they are not in a sufficiently strong position to take initiative or to behave in an anti-establishment manner. Therefore, they await external signals from people within the scientific system with more status, power, or knowledge. This can be people within their own stakeholder group, or from other stakeholder groups. As a consequence of this attitude, a situation arises in which peoples’ actions largely depend on others. That is, a complex state of interdependency evolves: scientists argue that if the reward system does not change, they are not able to alter their statistical behavior. According to researchers, editors and funding agencies are still very much focused on NHST and especially (significant) p-values, and thus, scientists wait for editors and funders to adjust their policy regarding statistics:

P: “I wrote an article and submitted it to an internal medicine journal. I only mentioned confidence intervals. Then I was asked to also write down the p-values. So, I had to do that. This is how they [editors] can use their power. They decide.” (Interview 1, Assistant Professor, Health Sciences)

Editors and funders in their turn claim they do not maintain a strict policy. Their main position is that scientists should reach consensus about the best statistical procedure, and they will then adjust their policy and guidelines.

P: “We actually believe that the research field itself should direct the quality of its research, and thus, also the discussions.” (Interview 22, representative from funding agency, Neurosciences)

Lecturers, for their part, argue that they cannot revise their educational programs because the academic system and university policies are adapted to NHST and p-values.

As most participants seem not to be aware of this process, a circle of interdependency arises that is difficult to break.

P: “Yes, the stupid thing about this perpetual circle is that you are educating people, let’s say in the department of cardiology. They must of course grow, and so they need to publish. If you want to publish you must meet the norms and values of the cardiology journals, so they will write down all those p-values. These people are trained and in twenty years they are on the editorial board of those journals, and then you never get rid of it [the p-value].” (Interview 18, Professor, editor, Medical Sciences)

3 . 4 . Degree of eagerness . Exerting certain behavior, or behavioral change, is (partly) determined by the extent to which people want to engage in that behavior: their behavioral intention [ 22 ]. Some participants indicated they are willing to change their behavior regarding the use of statistical methods, but only if it is absolutely necessary, imposed, or if they think that the current conventions have too many negative consequences. Thus, true, intrinsic willpower to change behavior is lacking among these participants. Instead, they have a rather opportunistic attitude, meaning that their behavior is mostly driven by circumstances, not by principles.

P: “If tomorrow an alternative is offered by people that make that call, then I will move along. But I am not the one calling the shots on this issue.” (Interview 26, editor, Medical Sciences)

In addition, pragmatism often outweighs the perceived urgency to change. Participants argue they ‘just want to do their jobs’ and mainly consider the practical consequences of their actions. This attitude creates a certain degree of inertia. Although participants claim they are willing to change their behavior, this would involve much more than ‘doing their jobs’, and thus, in the end, the NHST debate remains a subject of ‘coffee talk’. People are open to discussion, but when it comes to taking action (and motivating others to do so), no one steps up.

P: “The endless analysis of your data to get something with a p-value less than 0.05… There are people that are more critical about that, and there are people that are less critical. But that is a subject for during the coffee break.” (Interview 18, professor, editor, Medical Sciences)

The goal of our study was to acquire in-depth insight into the reasons why so many stakeholders in the scientific system keep using NHST as the default method to draw conclusions, despite its many well-documented drawbacks. Furthermore, we wanted to gain insight into the reasons for their reluctance to apply alternative methods. Using a theoretical framework [ 20 , 21 ], several interrelated facilitators and barriers associated with the use of NHST and alternative methods were identified. The identified factors are subsumed under three main themes: the scientific climate, the scientific duty and reactivity. The scientific climate is dominated by conventions, behavioral rules, and beliefs, of which the use of NHST and p-values is part. At the same time, stakeholders feel they have a (moral or professional) duty. For many participants, these two sides of the same coin are incompatible, leading to internal conflicts. There is a discrepancy between what participants want and what they do. As a result of these factors, the majority feel dependent on others and have thereby become reactive. Most participants are not inclined to take responsibility themselves but await action and initiatives from others. This may explain why NHST is still the standard and used by almost everyone involved.

The current study is closely related to the longstanding debate regarding NHST, which has recently intensified to a level not seen before. In 2015, the editors of the journal ‘Basic and Applied Social Psychology’ (BASP) prohibited the use of NHST (and p-values and confidence intervals) [ 30 ]. Subsequently, in 2016, the American Statistical Association published the so-called ‘Statement on p-values’ in the American Statistician. This statement consists of critical standpoints regarding the use of NHST and p-values and warns against abuse of the procedure. In 2019, the American Statistician devoted an entire edition to the implementation of reforms regarding the use of NHST; in more than forty articles, scientists debated statistical significance, advocated embracing uncertainty, and suggested alternatives such as the use of s-values, False Positive Risks, reporting results as effect sizes and confidence intervals, and more holistic approaches to p-values and outcome measures [ 31 ]. In addition, in the same year, several articles appeared in which an appeal was made to stop using statistical significance testing [ 32 , 33 ]. A number of counter-reactions were published [ 34 – 36 ], stating, for instance, that banning statistical significance and, with that, abandoning clear rules for statistical analyses may create new problems with regard to statistical interpretation, study interpretation and objectivity. Also, some methodologists expressed the view that under certain circumstances the use of NHST and p-values is not problematic and can in fact provide useful answers [ 37 ]. Until recently, the NHST debate was limited mainly to methodologists and statisticians. However, a growing number of scientists are getting involved in this lively debate and believe that a paradigm shift is desirable or even necessary.

The aforementioned publications have constructively contributed to this debate. In fact, since the publication of the special edition of the American Statistician, numerous scientific journals have published editorials or revised, to a greater or lesser extent, their author guidelines [ 38 – 45 ]. Furthermore, following the American Statistical Association (ASA), the National Institute of Statistical Sciences (NISS) in the United States has also taken up the reform issue. However, real changes are still barely visible. It takes a long time before these kinds of initiatives translate into behavioral changes, and widespread adoption by most of the scientific community is still far from accomplished. Debate alone will not lead to real changes, and therefore, our efforts to elucidate behavioral barriers and facilitators could provide a framework for potentially effective initiatives to reduce the default use of NHST. In fact, the debate could even counteract behavioral change: if there is no consensus among statisticians and methodologists (the innovators), changing behavior cannot be expected from stakeholders with less statistical and methodological expertise. In other words, without agreement among innovators, early adopters might be reluctant to adopt the innovation.

Research has recently been conducted to explore the potential of behavioral change to improve open science behaviors. The adoption of open science behavior has increased in recent years, but uptake has been slow, due to firm barriers such as a lack of awareness about the subject, concerns about constraining the creative process, worries about being “scooped” and holding on to existing working practices [ 46 ]. The developments regarding open science practices, and the parallels that line of research shows with the current study, might be of benefit in supporting behavioral change regarding the use of statistical methods.

The described obstacles to behavioral change are related to features of both the ‘innovative idea’ and the potential adopters of the idea. First, there are characteristics of ‘the innovation’ that form barriers. The first barrier is the complexity of the innovation: most participants perceive alternative methods as difficult to understand and to use. A second barrier concerns the feasibility of trying the innovation: most people do not feel flexible about trying out or experimenting with the new idea. There is a lack of time and monetary resources to get acquainted with alternative methods (for example, by following a course). Also, the possible negative consequences of the use of alternatives (lower publication chances, the chance that the statistical method and message are too complicated for one’s readership) hold people back from experimenting with these alternatives. And lastly, it is unclear to most participants how visible the results of the new idea would be. Up until now, the debate has mainly taken place among a small group of statisticians and methodologists. Many researchers are still not aware of the NHST debate and the idea of shifting away from NHST and using alternative methods instead. Therefore, the question is how easily the benefits of the innovation can be made visible to a larger part of the scientific community. Thus, our study shows that, although the innovation is largely compatible with existing values (participants are critical about (the use of) NHST and the p-value and believe that there are better alternatives to NHST), important attributes of the innovative idea negatively affect the rate of adoption and consequently the diffusion of the innovation.

Due to the barriers mentioned above, most stakeholders do not intend to change their behavior and adopt the innovative idea. From the theory of planned behavior [ 21 ], it is known that behavioral intentions directly relate to the performance of behaviors. The strength of the intention is shaped by attitudes, subjective norms, and perceived power. If people evaluate the suggested behavior as positive (attitude), and if they think others want them to perform the behavior (subjective norm), this leads to a stronger intention to perform that behavior. When individuals also perceive that they have enough control over the behavior, they are likely to perform it. Although most participants have a positive attitude towards the behavior, or the innovative idea at stake, many participants think that others in their working environment believe that they should not perform the behavior—i.e., that they do not approve of the use of alternative methods (social normative pressure). This is expressed, for example, in lower publication chances, negative judgements by supervisors or failing to meet the requirements imposed by funding agencies. Thus, the perception of a particular behavior—the use of alternative methods—is negatively influenced by the (perceived) judgment of others. Moreover, we found that many participants have low self-efficacy, meaning that there is a perceived lack of behavioral control, i.e., their perceived ability to engage in the behavior at issue is low. Also, participants feel a lack of authority (in the sense of knowledge and skills, but also power) to initiate behavioral change. The existing subjective norms and perceived behavioral control, and the negative attitudes towards performing the behavior, lead to a lower behavioral intention and, ultimately, a lower chance of performance of the actual behavior.

Several participants mentioned there is a need for people of stature (belonging to the group of early adopters) to take the lead and break down perceived barriers. Early adopters serve as role models, have opinion leadership, and form the next group (after the innovators, in this case statisticians and methodologists) to adopt an innovative idea [ 20 ] ( Fig 2 ). If early adopters stood up, conveying a positive attitude towards the innovation, breaking down the described perceived barriers and facilitating the use of alternatives (for example, by adjusting policy, guidelines and educational programs and by making financial resources available for further training), this could positively affect the perceived social norms and self-efficacy of the early and late majority and ultimately the laggards, which could in time lead to behavioral change among all stakeholders within the scientific community.

A strength of our study is that it is the first empirical study of views on the use of NHST, its alternatives, and the reasons for the prevailing use of NHST. Another strength is the method of coding, which corresponds to the thematic approach of Braun & Clarke [ 47 ] and allows the researcher to move beyond just categorizing and coding the data to also analyzing how the codes are related to each other [ 47 ]. It provides a rich description of what is studied, linked to theory, while also generating new hypotheses. Moreover, two independent researchers coded all transcripts, which adds to the credibility of the study. All findings and the coding scheme were discussed by the two researchers until consensus was reached. Also, interview results were further explored, enriched and validated by means of (mixed) focus groups. Important themes that emanated from the interviews, such as interdependency, perceptions of the scientific duty, perceived disadvantages of alternatives and the consequences of the current scientific climate, served as starting points and main subjects of the focus groups. This set-up provided more data, more insight into the data, and validation of the data. Lastly, the use of a theoretical framework [ 20 , 21 ] to develop the topic list, guide the interviews and focus groups, and guide their analysis is a strength, as it provides structure to the analysis and substantiation of the results.

A limitation of this study is its sampling method. Because we used the network of members of the project group, and because a relatively high proportion of those invited to participate declined on the grounds that they knew too little about the subject to be able to contribute, our sample was biased towards participants who are (somewhat) aware of the NHST debate. Our sample may also consist of people who are relatively critical of the use of NHST compared to the total population of researchers. It was not easy to include participants who were indifferent about or pro-NHST, as those were presumably less willing to make time and participate in this study. Even in our sample we found that the majority of our participants solely used NHST and perceived it as difficult if not impossible to change their behavior. These perceptions are thus probably even stronger in the target population. Another limitation, inherent to qualitative research, is the risk of interviewer bias: respondents may be unable, unwilling, or afraid to answer questions in good conscience, and instead provide socially desirable answers. In the context of our research, people are aware that, especially as a scientist, it does not look good to be conservative, complacent, or ignorant, or not to be open to innovation and new ideas. Therefore, some participants might have given a too favorable view of themselves. Interviewer bias can also run in the other direction, when values and expectations of the interviewer consciously or unconsciously influence the answers of the respondents. Although we tried to be as neutral and objective as possible in asking questions and interpreting answers, we cannot rule out the chance that our views and opinions on the use of NHST have at times steered the respondents somewhat, potentially leading to the aforementioned desirable answers.

Generalizability is a topic that is often debated in qualitative research methodology. Many researchers do not consider generalizability the purpose of qualitative research, but rather finding in-depth insights and explanations. However, this is an unjustified simplification, as generalizing findings from qualitative research is possible. Three types of generalization in qualitative research have been described: representational generalization (whether what is found in a sample can be generalized to the parent population of the sample), inferential generalization (whether findings from the study can be generalized to other settings), and theoretical generalization (where one draws theoretical statements from the findings of the study for more general application) [ 48 ]. The extent to which our results are generalizable is uncertain, as we used a theoretical sampling method and our study was conducted exclusively in the Netherlands. We expect that the generic themes (reactivity, the scientific duty and the scientific climate) are applicable to academia in many countries across the world (inferential generalization). However, some elements, such as the Dutch educational system, will differ to a greater or lesser extent from those in other countries (and thus can only be representationally generalized). In the Netherlands there is, for example, only one educational route after secondary school that has an academic orientation (scientific education, equivalent to US university-level education). This route consists of a bachelor’s program (typically 3 years) and a master’s program (typically 1, 2 or 3 years). Not every study program contains (compulsory) statistics courses, and statistics courses differ in depth and difficulty depending on the study program. Thus, not all of the results will hold for other parts of the world, and further investigation is required.

Our findings demonstrate how perceived barriers to shifting away from NHST set a high threshold for behavioral change and create a circle of interdependency. Behavioral change is a complex process. As ‘the stronger the intention to engage in a behavior, the more likely should be its performance’ [ 21 ], further research on this subject should focus on how to influence the intention of behavior, i.e., which perceived barriers to the use of alternatives are most promising to break down in order to increase the intention for behavioral change. The present study shows that negative normative beliefs and a lack of perceived behavioral control regarding the innovation among individuals in the scientific system are a substantial problem. When social norms change in favor of the innovation, and control over the behavior increases, the behavioral intention becomes a sufficient predictor of behavior [ 49 ]. An important follow-up question will therefore be: how can people be enthused and empowered to ultimately take up the use of alternative methods instead of NHST? Answering this question can, in the long run, lead to the diffusion of the innovation through the scientific system as a whole.

NHST has been the leading paradigm for many decades and is deeply rooted in our science system, despite longstanding criticism. The aim of this study was to gain insight into why we continue to use NHST. Our findings have demonstrated how perceived barriers to a shift away from NHST set a high threshold for actual behavioral change and create a circle of interdependency between stakeholders in the scientific system. Consequently, people find themselves in a state of reactivity, which limits behavioral change with respect to the use of NHST. The next step would be to gain more insight into ways to effectively remove barriers and thereby increase the intention to take a step back from NHST. A paradigm shift within a couple of years is not realistic. However, we believe that by taking small steps, one at a time, it is possible to decrease the scientific community’s strong dependence on NHST and p-values.

Supporting information

S1 Appendix.

Acknowledgments

The authors are grateful to Anja de Kruif for her contribution to the design of the study and for moderating one of the focus groups.

Funding Statement

This research was funded by the NWO (Nederlandse Organisatie voor Wetenschappelijk Onderzoek; Dutch Organization for Scientific Research) ( https://www.nwo.nl/ ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability


Null Hypothesis

In mathematics, statistics deals with the study of research and surveys based on numerical data. Before conducting a survey or an experiment, we have to define the hypotheses. Generally, there are two types of hypothesis: the null hypothesis and the alternative hypothesis.

In probability and statistics, the null hypothesis is a default statement that nothing is happening: for example, that there is no difference among groups or no association between two measured events. It is generally assumed that this hypothesis is true until other evidence is brought to light to deny it. Let us learn more here about its definition, symbol, principle, types and examples in this article.


Null Hypothesis Definition

The null hypothesis is a hypothesis about a population parameter whose validity is tested against the given experimental data. This hypothesis is either rejected or not rejected based on the evidence in the given population or sample. In other words, the null hypothesis states that the sample observations result purely from chance. It is the statement that the surveyors want to examine against the data. It is denoted by H 0 .

Null Hypothesis Symbol

In statistics, the null hypothesis is usually denoted by the letter H with subscript ‘0’ (zero), written H 0 . It is pronounced H-null, H-zero or H-nought. The alternative hypothesis, in contrast, expresses observations determined by a non-random cause. It is represented by H 1 or H a .

Null Hypothesis Principle

The principle of null hypothesis testing is as follows: collect the data and determine the chances of observing that data in a random sample, assuming that the null hypothesis is true. If the given data are strongly inconsistent with the null hypothesis, the researchers reject it. If the data do not provide strong evidence against the null hypothesis, it is not rejected, owing to insufficient evidence.
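To make this principle concrete, here is a minimal Python sketch (a hypothetical coin-flip example; the sample size and observed count are invented for illustration). It estimates the chance of data at least as extreme as the observed data, assuming the null hypothesis of a fair coin is true:

```python
import numpy as np

rng = np.random.default_rng(1)

# Null hypothesis: the coin is fair (p = 0.5).
# Hypothetical data: 60 heads observed in 100 flips.
n, observed = 100, 60

# Simulate many samples assuming the null hypothesis is true,
# then count how often a result at least as extreme occurs.
sims = rng.binomial(n, 0.5, size=100_000)
p_value = np.mean(np.abs(sims - n * 0.5) >= abs(observed - n * 0.5))

print(f"Approximate two-sided P-value: {p_value:.3f}")
```

If this estimated probability is small, the observed data would be rare under the null hypothesis, which counts as evidence against it.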

Null Hypothesis Formula

Here, the hypothesis test formulas are given below for reference.

The formula for the null hypothesis is:

H 0 :  p = p 0

The formula for the alternative hypothesis is:

H a : p > p 0 , p < p 0 , or p ≠ p 0

The formula for the test statistic is:

z = (p̂ − p 0 ) / √( p 0 (1 − p 0 ) / n )

Remember that p 0 is the value of the proportion under the null hypothesis, p̂ (p-hat) is the sample proportion, and n is the sample size.
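As an illustration, the following Python snippet computes this test statistic and the corresponding two-sided P-value (the values of p 0 , p̂ and n are hypothetical):

```python
from math import sqrt
from scipy import stats

p0 = 0.50      # proportion under the null hypothesis
n = 200        # sample size (hypothetical)
p_hat = 0.57   # observed sample proportion (hypothetical)

# z = (p-hat - p0) / sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Two-sided P-value from the standard normal distribution
p_value = 2 * stats.norm.sf(abs(z))

print(f"z = {z:.2f}, P-value = {p_value:.3f}")
```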


Types of Null Hypothesis

There are different types of null hypothesis. They are:

Simple Hypothesis

It completely specifies the population distribution. In this method, the sampling distribution is the function of the sample size.

Composite Hypothesis

The composite hypothesis is one that does not completely specify the population distribution.

Exact Hypothesis

An exact hypothesis defines the exact value of the parameter. For example, μ = 50.

Inexact Hypothesis

This type of hypothesis does not define the exact value of the parameter, but rather a specific range or interval. For example, 45 < μ < 60.

Null Hypothesis Rejection

Sometimes the null hypothesis is rejected. If it is rejected, the default assumption underlying the research could be invalid. Many researchers neglect this hypothesis, as it is merely the opposite of the alternative hypothesis, but it is better practice to state the hypothesis explicitly and test it. The goal of researchers is not to reject the hypothesis; indeed, a well-fitting statistical model is associated with a failure to reject the null hypothesis.

How do you Find the Null Hypothesis?

The null hypothesis says there is no correlation between the measured event (the dependent variable) and the independent variable. We do not have to believe that the null hypothesis is true to test it. On the contrary, you will probably suspect that there is a connection between the set of variables (dependent and independent).

When is Null Hypothesis Rejected?

The null hypothesis is rejected using the P-value approach. If the P-value is less than or equal to α (the significance level), the null hypothesis is rejected in favour of the alternative hypothesis. If the P-value is greater than α, the null hypothesis is not rejected.
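In code, this decision rule is a single comparison. A minimal sketch, assuming a significance level of α = 0.05 and a hypothetical P-value:

```python
alpha = 0.05     # chosen significance level
p_value = 0.048  # P-value from the test (hypothetical)

if p_value <= alpha:
    print("Reject the null hypothesis in favour of the alternative.")
else:
    print("Fail to reject the null hypothesis.")
```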

Null Hypothesis and Alternative Hypothesis

Now, let us discuss the difference between the null hypothesis and the alternative hypothesis.

Null Hypothesis Examples

Here, some of the examples of the null hypothesis are given below. Go through the below ones to understand the concept of the null hypothesis in a better way.

If a medicine reduces the risk of cardiac stroke, then the null hypothesis should be “the medicine does not reduce the chance of cardiac stroke”. This can be tested by administering the drug to a certain group of people in a controlled way. If the trial shows a significant change in those people, then the null hypothesis is rejected.

Few more examples are:

1) Is there a 100% chance of getting affected by dengue?

Ans: There is a chance of being affected by dengue, but not a 100% chance.

2) Do teenagers use mobile phones more than grown-ups to access the internet?

Ans: Age has no effect on the use of mobile phones to access the internet.

3) Does eating an apple daily keep fever away?

Ans: Eating an apple daily does not guarantee the absence of fever, but it may increase immunity against such diseases.

4) Are children better at mathematical calculations than grown-ups?

Ans: Age has no effect on mathematical skills.

In many common applications, the choice of the null hypothesis is not automated, though the testing and calculations may be. The choice of the null hypothesis is typically based on previous experience and sometimes inconsistent advice, and it can be more complicated depending on the variety of applications and the diversity of the objectives.

The main limitation on the choice of the null hypothesis is that a hypothesis suggested by the data rests on circular reasoning that proves nothing: if a hypothesis merely summarizes a particular data set, there is no value in testing that hypothesis on that same set of data.

Frequently Asked Questions on Null Hypothesis

What is meant by the null hypothesis?

In Statistics, a null hypothesis is a type of hypothesis which explains the population parameter whose purpose is to test the validity of the given experimental data.

What are the benefits of hypothesis testing?

Hypothesis testing is a form of inferential statistics that allows conclusions to be drawn about an entire population based on a representative sample.

When is a null hypothesis accepted or rejected?

The null hypothesis is rejected or not rejected based on the given data. If the P-value is less than α, the null hypothesis is rejected in favor of the alternative hypothesis; if the P-value is greater than α, the null hypothesis is not rejected.

Why is the null hypothesis important?

The importance of the null hypothesis is that it provides an approximate description of the phenomena of the given data. It allows the investigators to directly test the relational statement in a research study.

How to accept or reject the null hypothesis in the chi-square test?

If the chi-square statistic is larger than the critical value in the table, the data do not fit the model, and the null hypothesis is rejected.
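As a hedged illustration in R, the comparison can be made explicitly with a goodness-of-fit test; the observed counts and the uniform expected proportions below are made up.

```r
# Compare the chi-square statistic to the critical value (illustrative data)
observed <- c(18, 22, 25, 35)
test <- chisq.test(observed, p = rep(0.25, 4))    # H0: all categories equally likely
critical <- qchisq(0.95, df = length(observed) - 1)
unname(test$statistic) > critical                 # TRUE means reject H0 at the 5% level
```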


Miles D. Williams


Visiting Assistant Professor | Denison University


When the Research Hypothesis Is the Null

Posted on May 13, 2024 by Miles Williams in Methods   Statistics  


What should you do if your research hypothesis is the null hypothesis? In other words, how should you approach hypothesis testing if your theory predicts no effect between two variables? A coauthor and I are working on a paper where a couple of our proposed hypotheses look like this, and we got some pushback from a reviewer about it. This prompted me to go down a rabbit hole of journal articles and message boards to see how others handle this situation. I quickly found that I had waded into a contentious issue connected to a bigger philosophical debate about the merits of hypothesis testing in general and whether the null hypothesis in particular is even logically sound as a benchmark for hypothesis testing.

There’s too much to unpack with this debate for me to cover in a single blog post (and I’m sure I’d get some of the key points wrong anyway if I tried). The main issue I want to explore in this post is the practical problem of how to approach testing a null research hypothesis. From an applied perspective, this is a tricky problem that raises issues with how we calculate and interpret p-values. Thankfully, there is a sound solution for the null research hypothesis which I explore in greater detail below. It’s called a two one-sided test, and it’s easy to implement once you know what it is.

The usual approach

Most of the time when doing research, a scientist usually has a research hypothesis that goes something like X has a positive effect on Y . For example, a political scientist might propose that a get-out-the-vote (GOTV) campaign ( X ) will increase voter turnout ( Y ).

The typical approach for testing this claim might be to estimate a regression model with voter turnout as the outcome and the GOTV campaign as the explanatory variable of interest:

Y = α + β X + ε

If the parameter β > 0, this would support the hypothesis that GOTV campaigns improve voter turnout. To test this hypothesis, in practice the researcher would actually test a different hypothesis that we call the null hypothesis. This is the hypothesis that says there is no true effect of GOTV campaigns on voter turnout.

By proposing and testing the null, we now have a point of reference for calculating a measure of uncertainty—that is, the probability of observing an empirical effect of a certain magnitude or greater if the null hypothesis is true. This probability is called a p-value, and by convention if it is less than 0.05 we say that we can reject the null hypothesis.

For the hypothetical regression model proposed above, to get this p-value we'd estimate β, then calculate its standard error, and then take the ratio of the former to the latter, giving us what's called a t-statistic or t-value. Under the null hypothesis, the t-value has a known distribution, which makes it easy to map any t-value to a p-value. The figure below illustrates this using a hypothetical data sample of size N = 200. You can see that the t-statistic's distribution has a distinct bell shape centered around 0. You can also see, in blue, the range of t-values for which we'd fail to reject the null hypothesis at the p < 0.05 level. Values in gray are t-values that would lead us to reject the null hypothesis at the same level.

[Figure: the t-distribution under the null hypothesis (N = 200), with the fail-to-reject region in blue and the rejection regions in gray]
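For concreteness, here is a minimal base-R sketch of that calculation under the same setup (a simulated sample of size 200 with no true effect; the variable names are mine, not from the post):

```r
# Conventional two-sided test of beta = 0 (simulated, no true effect)
set.seed(1)
n <- 200
x <- rbinom(n, 1, 0.5)    # hypothetical binary "GOTV" treatment
y <- rnorm(n)             # turnout-style outcome, unrelated to x by design
s <- summary(lm(y ~ x))$coefficients
t_val <- s["x", "t value"]
p_val <- 2 * pt(abs(t_val), df = n - 2, lower.tail = FALSE)
c(t = t_val, p = p_val)   # matches the p-value reported by summary(lm(y ~ x))
```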

When the null is the research hypothesis we want to test

There’s nothing new or special here. If you have even a basic stats background (particularly with Frequentist statistics), the conventional approach to hypothesis testing is pretty ubiquitous. Things get more tricky when our research hypothesis is that there is no effect. Say for a certain set of theoretical reasons we think that GOTV campaigns are basically useless at increasing voter turnout. If this argument is true, then if we estimate the following regression model, we’d expect β = 0.

The problem here is that our substantive research hypothesis is also the one we want to find evidence against. We could proceed as usual and say that failing to reject the null counts as evidence in support of our theory, but failure to reject the null is not the same thing as finding support for the null hypothesis.

There are a few ideas in the literature for how we should approach this instead. Many of these approaches are Bayesian, but most of my research relies on Frequentist statistics, so these approaches were a no-go for me. However, there is one really simple approach that is consistent with the Frequentist paradigm: equivalence testing. The idea is simple. Propose some absolute effect size that is of minimal interest and then test whether an observed effect is different from it. This minimum effect is called the "smallest effect size of interest" (SESOI). I read about the approach in an article by Harms and Lakens (2018) in the Journal of Clinical and Translational Research.

Say, for example, that we deemed a t-value of +/-1.96 (the usual threshold for rejecting the null hypothesis) as extreme enough to constitute good evidence of a non-zero effect. We could make the appropriate adjustments to our t-distribution to identify a new range of t-values that would allow us to reject the hypothesis that an effect is non-zero. This is illustrated in the below figure. We can now see a range of t-values in the middle where we could reject the non-zero hypothesis at the p < 0.05 level. This distribution looks like it's been inverted relative to the usual null distribution. The reason is that with this approach we're conducting a pair of alternative one-tailed tests: we're testing both the hypothesis that β / se(β) - 1.96 > 0 and the hypothesis that β / se(β) + 1.96 < 0. In the Harms and Lakens paper cited above, they call this approach two one-sided tests or TOST (I'm guessing this is pronounced "toast").

[Figure: the TOST rejection region; only t-values in the middle of the distribution reject the hypothesis of a non-zero effect]
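A minimal base-R translation of the idea, with the SESOI fixed at +/-1.96 on the t scale as in the figure (my own sketch, not code from Harms and Lakens):

```r
# Two one-sided tests (TOST) on the t scale
tost_t <- function(t_val, df, bound = 1.96) {
  # one-sided test against the upper bound: H0 is "true effect >= +bound"
  p_upper <- pt(t_val - bound, df = df, lower.tail = TRUE)
  # one-sided test against the lower bound: H0 is "true effect <= -bound"
  p_lower <- pt(t_val + bound, df = df, lower.tail = FALSE)
  max(p_upper, p_lower)   # reject "non-zero effect" only if both tests reject
}
tost_t(t_val = 0, df = 198)   # ~0.026: even t = 0 only just rejects at 0.05
```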

Something to pay attention to with this approach is that the observed t-statistic needs to be very small in absolute magnitude for us to reject the hypothesis of a non-zero effect. This means the bar for testing a null research hypothesis is actually quite high. This is demonstrated using the following simulation in R. Using the {seerrr} package, I had R generate 1,000 random draws (each of size 200) for a pair of variables x and y, where the former is a binary "treatment" and the latter is a random normal "outcome." By design, there is no true causal relationship between these variables. Once I simulated the data, I generated a set of estimates of the effect of x on y for each simulated dataset and collected the results in an object called sim_ests. I then visualized two metrics calculated from the simulated results: (1) the rejection rate for the usual null hypothesis test and (2) the rejection rate for the two one-sided equivalence tests. As you can see, if we were to test a research null hypothesis the usual way, we'd expect to fail to reject the null about 95% of the time. Conversely, if we were to use the two one-sided equivalence tests, we'd expect to reject the non-zero alternative hypothesis only about 25% of the time. I tested a few additional simulations to see if a larger sample size would improve power (not shown), but no dice.

[Figure: rejection rates across 1,000 simulations for the usual null hypothesis test versus the two one-sided equivalence tests]
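The simulation logic can be re-created in base R without {seerrr}; this sketch is mine, not the post's code, but it reproduces the two rejection rates described above:

```r
# 1,000 simulated datasets of size 200 with no true effect of x on y
set.seed(42)
sims <- replicate(1000, {
  x <- rbinom(200, 1, 0.5)
  y <- rnorm(200)
  t_val <- summary(lm(y ~ x))$coefficients["x", "t value"]
  df <- 198
  null_rej <- abs(t_val) > 1.96                       # usual two-sided test
  p_tost <- max(pt(t_val - 1.96, df),                 # TOST, +/-1.96 bounds
                pt(t_val + 1.96, df, lower.tail = FALSE))
  c(null_rej = null_rej, tost_rej = p_tost < 0.05)
})
rowMeans(sims)   # roughly 0.05 for the null test, roughly 0.25 for TOST
```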

The two one-sided tests approach strikes me as a nice method when dealing with a null research hypothesis. It’s actually pretty easy to implement, too. The one downside is that this test is under-powered. If the null is true, it will only reject the alternative 25% of the time (though you could select a different non-zero alternative which would possibly give you more power). However, this isn’t all bad. The flip side of the coin is that this is a really conservative test, so if you can reject the alternative that puts you on solid rhetorical footing to show the data really do seem consistent with the null.

9.1 Null and Alternative Hypotheses

The actual test begins by considering two hypotheses . They are called the null hypothesis and the alternative hypothesis . These hypotheses contain opposing viewpoints.

H 0 : The null hypothesis: It is a statement of no difference between the variables—they are not related. This can often be considered the status quo; as a result, if you cannot accept the null, some action may be required.

H a : The alternative hypothesis: It is a claim about the population that is contradictory to H 0 and what we conclude when we reject H 0 . This is usually what the researcher is trying to prove.

Since the null and alternative hypotheses are contradictory, you must examine evidence to decide if you have enough evidence to reject the null hypothesis or not. The evidence is in the form of sample data.

After you have determined which hypothesis the sample supports, you make a decision. There are two options for a decision. They are "reject H 0 " if the sample information favors the alternative hypothesis or "do not reject H 0 " or "decline to reject H 0 " if the sample information is insufficient to reject the null hypothesis.

Mathematical Symbols Used in H 0 and H a :

H 0 always has a symbol with an equal in it. H a never has a symbol with an equal in it. The choice of symbol depends on the wording of the hypothesis test. However, be aware that many researchers (including one of the co-authors in research work) use = in the null hypothesis, even with > or < as the symbol in the alternative hypothesis. This practice is acceptable because we only make the decision to reject or not reject the null hypothesis.

Example 9.1

H 0 : No more than 30% of the registered voters in Santa Clara County voted in the primary election. p ≤ 0.30

H a : More than 30% of the registered voters in Santa Clara County voted in the primary election. p > 0.30

A medical trial is conducted to test whether or not a new medicine reduces cholesterol by 25%. State the null and alternative hypotheses.

Example 9.2

We want to test whether the mean GPA of students in American colleges is different from 2.0 (out of 4.0). The null and alternative hypotheses are: H 0 : μ = 2.0 H a : μ ≠ 2.0
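A quick R illustration of a test like Example 9.2, using made-up GPA data; t.test reports the t-statistic and two-sided p-value for H 0 : μ = 2.0.

```r
# Two-sided one-sample t-test of H0: mu = 2.0 (illustrative data)
gpa <- c(2.3, 1.9, 2.8, 2.1, 2.5, 1.7, 2.2, 2.6)
t.test(gpa, mu = 2.0, alternative = "two.sided")
```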

We want to test whether the mean height of eighth graders is 66 inches. State the null and alternative hypotheses. Fill in the correct symbol (=, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

  • H 0 : μ __ 66
  • H a : μ __ 66

Example 9.3

We want to test if college students take less than five years to graduate from college, on the average. The null and alternative hypotheses are: H 0 : μ ≥ 5 H a : μ < 5

We want to test if it takes fewer than 45 minutes to teach a lesson plan. State the null and alternative hypotheses. Fill in the correct symbol ( =, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

  • H 0 : μ __ 45
  • H a : μ __ 45

Example 9.4

In an issue of U. S. News and World Report , an article on school standards stated that about half of all students in France, Germany, and Israel take advanced placement exams and a third pass. The same article stated that 6.6% of U.S. students take advanced placement exams and 4.4% pass. Test if the percentage of U.S. students who take advanced placement exams is more than 6.6%. State the null and alternative hypotheses. H 0 : p ≤ 0.066 H a : p > 0.066
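In R, a test like Example 9.4 could be run with prop.test; the counts below (90 exam-takers out of 1,000 students) are illustrative, not from the article.

```r
# One-sided test of H0: p <= 0.066 against Ha: p > 0.066 (illustrative counts)
prop.test(x = 90, n = 1000, p = 0.066, alternative = "greater")
```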

On a state driver’s test, about 40% pass the test on the first try. We want to test if more than 40% pass on the first try. Fill in the correct symbol (=, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

  • H 0 : p __ 0.40
  • H a : p __ 0.40

Collaborative Exercise

Bring to class a newspaper, some news magazines, and some Internet articles . In groups, find articles from which your group can write null and alternative hypotheses. Discuss your hypotheses with the rest of the class.


Want to cite, share, or modify this book? This book uses the Creative Commons Attribution License and you must attribute OpenStax.

Access for free at https://openstax.org/books/introductory-statistics-2e/pages/1-introduction
  • Authors: Barbara Illowsky, Susan Dean
  • Publisher/website: OpenStax
  • Book title: Introductory Statistics 2e
  • Publication date: Dec 13, 2023
  • Location: Houston, Texas
  • Book URL: https://openstax.org/books/introductory-statistics-2e/pages/1-introduction
  • Section URL: https://openstax.org/books/introductory-statistics-2e/pages/9-1-null-and-alternative-hypotheses

© Dec 6, 2023 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.


Genetic effects and causal association analyses of 14 common conditions/diseases in multimorbidity patterns

Authors: Ting Fu, Yi-Qun Yang, Chang-Hua Tang, Pei He, Shu-Feng Lei

Affiliations: Collaborative Innovation Center for Bone and Immunology between Sihong Hospital and Soochow University, Center for Genetic Epidemiology and Genomics, School of Public Health, Suzhou Medical College of Soochow University, Suzhou, Jiangsu, P. R. China; Department of Orthopedics, Sihong Hospital, Suzhou, Jiangsu, P. R. China; Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Soochow University, Suzhou, Jiangsu, P. R. China; Changzhou Geriatric Hospital Affiliated to Soochow University, Changzhou, China

* E-mail: [email protected] (SFL); [email protected] (PH)

  • Published: May 16, 2024
  • https://doi.org/10.1371/journal.pone.0300740


Multimorbidity has become an important health challenge in the aging population. Accumulated evidence has shown that multimorbidity has complex association patterns, but the further mechanisms underlying the association patterns are largely unknown.

Summary statistics of 14 conditions/diseases were available from the genome-wide association study (GWAS). Linkage disequilibrium score regression analysis (LDSC) was applied to estimate the genetic correlations. Pleiotropic SNPs between two genetically correlated traits were detected using pleiotropic analysis under the composite null hypothesis (PLACO). PLACO-identified SNPs were mapped to genes by Functional Mapping and Annotation of Genome-Wide Association Studies (FUMA), and gene set enrichment analysis and tissue differential expression were performed for the pleiotropic genes. Two-sample Mendelian randomization analyses assessed the bidirectional causality between conditions/diseases.

LDSC analyses revealed the genetic correlations for 20 pairs based on different two-disease combinations of 14 conditions/diseases, and genetic correlations for 10 pairs were significant after Bonferroni adjustment ( P <0.05/91 = 5.49E-04). Significant pleiotropic SNPs were detected for 11 pairs of correlated conditions/diseases. The corresponding pleiotropic genes were differentially expressed in the brain, nerves, heart, and blood vessels and enriched in gluconeogenesis and drug metabolism, biotransformation, and neurons. Comprehensive causal analyses showed strong causality between hypertension, stroke, and high cholesterol, which drive the development of multiple diseases.

Conclusions

This study highlighted the complex mechanisms underlying the association patterns that include the shared genetic components and causal effects among the 14 conditions/diseases. These findings have important implications for guiding the early diagnosis, management, and treatment of comorbidities.

Citation: Fu T, Yang Y-Q, Tang C-H, He P, Lei S-F (2024) Genetic effects and causal association analyses of 14 common conditions/diseases in multimorbidity patterns. PLoS ONE 19(5): e0300740. https://doi.org/10.1371/journal.pone.0300740

Editor: Chunyu Liu, State University of New York Upstate Medical University, UNITED STATES

Received: October 24, 2023; Accepted: March 4, 2024; Published: May 16, 2024

Copyright: © 2024 Fu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the manuscript and its Supporting Information files.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Multimorbidity is the coexistence of two or more medical conditions/diseases [1, 2] and commonly occurs in older adults. In the UK, the prevalence of multimorbidity among adults aged 65 years and older is projected to reach 67.8% by 2035 [3]. These chronic diseases mainly include long-term mental conditions/diseases, physical noncommunicable diseases of long duration, and long-term chronic diseases of the musculoskeletal system. Multimorbidity substantially increases the medical and socioeconomic burden, and during the COVID-19 pandemic in particular, individuals with multimorbidity were at higher risk of infection and adverse outcomes [4].

The pathogenesis underlying multimorbidity is probably related to aging, heredity, and other factors (e.g., socioeconomic deprivation) [5]. Within comorbidity patterns, the correlations among diseases are complex, and some diseases may drive others. For example, some chronic diseases contribute to the progression of other conditions/diseases while producing only slight or undetectable clinical symptoms, especially in older persons [6]. Diabetes and hypertension are the most common chronic diseases in Spanish people aged over 65 years with dementia [7]. In an observational study, Bendayan et al. ranked 14 diseases common among older adults in descending order of prevalence, led by hypertension, heart disease, depression, and asthma [8]. A study that included 0.5 billion Chinese adults identified four multimorbidity patterns: cardiometabolic multimorbidity (diabetes, stroke, and hypertension), cancer, psychiatric problems (neurasthenia), and arthritis [9]. A review suggests that depression, hypertension, diabetes, arthritis, and asthma can easily co-exist with other diseases [10]. However, in traditional cohort studies, analyses of associations between diseases are easily affected by confounding factors that are unknown or difficult to control, leading to unreliable results. For example, Poblador-Plou et al. studied multimorbidity patterns in people over 65 years of age in Spain [7] and found that diabetes and hypertension were the most common chronic diseases in people with dementia. Yet the multimorbidity patterns they identified as associated with dementia did not include diabetes or hypertension, but instead Parkinson's disease, osteoporosis, thyroid disease, anxiety, and neurosis. Therefore, in multimorbidity patterns, the complex correlation patterns and their underlying mechanisms remain largely unknown.

Therefore, this study selected 14 conditions/diseases that are the main diseases experienced in middle and older age and the major sources of long-standing illness for people aged 65 years and older in national surveys (details can be found at https://www.elsa-project.ac.uk/data-and-documentation ). We performed systematic analyses: linkage disequilibrium score regression (LDSC) [11] to detect genetic correlations, pleiotropic analysis under the composite null hypothesis (PLACO) [12] with mapping and functional annotation of pleiotropic SNPs by FUMA [13] to detect the underlying pleiotropic genes, and Mendelian randomization [14] to infer causal relations and identify the driving conditions/diseases among the 14 conditions/diseases. Understanding the genetic components they share and the causal association patterns is essential for elucidating phenotypic relationships and disease pathology, which could provide basic data for the prevention, diagnosis, and treatment of these conditions/diseases.

Methods and materials

GWAS summary statistics for 14 conditions/diseases.

The selection of conditions/diseases followed the English Longitudinal Study of Ageing (ELSA) [8]. The study selected fourteen conditions/diseases: hypertension, type 2 diabetes, cancer, chronic obstructive pulmonary disease, heart problems, stroke, psychiatric problems (mental health problems ever diagnosed by a professional: anxiety, nerves, or generalized anxiety disorder), arthritis, asthma, high cholesterol, cataracts, Parkinson's disease, hip fracture, and depression. There are no ethical issues involved in this study, and "ethical approval and consent to participate" is not applicable. The summary statistics for the 14 conditions/diseases were downloaded from published GWASs (Table 1).

[Table 1. https://doi.org/10.1371/journal.pone.0300740.t001]

Genetic correlation analysis for each pair of condition/disease–LDSC.

LDSC can estimate heritability and the confounding bias affecting SNP-based estimates [15], and its principle and algorithm are well described [16]. We assessed the genetic correlation between each pair of phenotypes using LDSC ( https://github.com/bulik/ldsc ). LD scores were calculated from the European samples in the 1000 Genomes Project as a reference panel [17]. Quality control was applied before LDSC analysis to filter out unqualified SNPs (e.g., nonbiallelic SNPs, strand-ambiguous SNPs, duplicate SNPs, SNPs without rs numbers, SNPs with minor allele frequency < 0.01, SNPs located in the major histocompatibility complex (MHC) region with its complex LD structure (HLA region markers), and SNPs that were not in, or whose alleles did not match, Phase 3 of the 1000 Genomes Project) [18].

Pleiotropic analysis under the composite null hypothesis.

[The PLACO method description and equations are not reproduced here; see reference 12]

The functional annotation of pleiotropic SNPs

Afterward, we functionally mapped and annotated the PLACO-screened pleiotropic SNPs using the online tool Functional Mapping and Annotation (FUMA, https://fuma.ctglab.nl/ ) [13]. Independent significant SNPs were identified according to the default thresholds of P < 5E-08 and r2 < 0.6 in FUMA. These SNPs are further represented by lead SNPs, the subset of independent significant SNPs with r2 < 0.1 between them. We then used positional mapping of deleterious coding SNPs and expression quantitative trait loci (eQTLs) to map independent significant SNPs to genes. SNPs were mapped to genes if the combined annotation-dependent depletion (CADD) score was >12.37 [13, 20]; otherwise, they were filtered out. For eQTL mapping, all independent significant SNPs in LD were mapped to eQTLs in Genotype-Tissue Expression (GTEx) v8 tissue types.

Finally, gene set enrichment analysis (GSEA) was undertaken to probe the possible biological mechanisms of the pleiotropic genes. Hypergeometric tests were conducted to assess the possible biological functions of the genes of interest, using a series of pathway resources including the Molecular Signatures Database (MSigDB) [21] and KEGG [22]. The expression (transcripts per million) of prioritized genes in different tissues was estimated from GTEx v8 following winsorization at 50 and log2 transformation [23]. The mRNA expression data were first normalized, and then, for each gene, a two-sided Student's t test was performed per tissue against all other tissues. The fold change (FC) of the gene expression level was used to represent differential expression between tissues. Genes with an adjusted p value < 0.05 (P bon < 0.05) and an absolute log fold change ≥ 0.58 were defined as the differentially expressed gene (DEG) set in a given tissue.

Two-sample Mendelian randomization for inferring causal relationships among different conditions/diseases

Mendelian randomization (MR) uses genetic variants as instrumental variables to determine whether an observational association between a risk factor and an outcome is consistent with a causal effect [24]. Based on GWAS summary statistics, MR analyses were performed to detect potential causality between the 14 diseases. For instrumental variable (IV) selection, independent significant SNPs (P < 5E-08) with LD r2 < 0.01 within 10 Mb, based on the 1000 Genomes Project Phase 3 European panel, were kept [25]. In addition, we used the F statistic to exclude weak instruments (SNPs with F < 10 were excluded) [26]. The inverse variance weighted (IVW) method was applied to estimate the causal association between diseases (IVW P < 0.05), supplemented by the weighted median, weighted mode, and MR-Egger methods. To address multiple hypothesis testing, we estimated false discovery rate (FDR)-adjusted p values in the main IVW MR analyses using the sequential p value approach proposed by Benjamini & Hochberg [27]. For trait pairs with heterogeneity, we used IVW analysis with different models; Cochran's Q statistics were used to assess the robustness of the MR estimates [28]. All analyses were performed using the R package "TwoSampleMR" (version 0.5.6).
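As a rough illustration (not the authors' script), the TwoSampleMR package supports a workflow along these lines. The IEU OpenGWAS trait IDs below are placeholders, the calls query a remote API, and the clumping thresholds echo those stated above.

```r
# Hedged sketch of a TwoSampleMR (v0.5.x) workflow with placeholder trait IDs
library(TwoSampleMR)

exposure_dat <- extract_instruments(outcomes = "ieu-a-2",      # placeholder exposure
                                    p1 = 5e-08, r2 = 0.01, kb = 10000)
outcome_dat  <- extract_outcome_data(snps = exposure_dat$SNP,
                                     outcomes = "ieu-a-7")     # placeholder outcome
dat <- harmonise_data(exposure_dat, outcome_dat)

# IVW as the main analysis, with weighted median/mode and MR-Egger as checks
res <- mr(dat, method_list = c("mr_ivw", "mr_weighted_median",
                               "mr_weighted_mode", "mr_egger_regression"))
mr_heterogeneity(dat)      # Cochran's Q
mr_pleiotropy_test(dat)    # MR-Egger intercept
res
```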

Linkage disequilibrium score regression analysis (LDSC)

LDSC estimated the genetic correlations among the 14 phenotypes (S1 Table in S2 File). Twenty pairs of conditions/diseases showed significant genetic correlations (P < 0.05), with negative coefficients ranging from -0.20 to -0.08 and positive coefficients from 0.16 to 0.75 (Table 2). Among the 20 studied pairs, the genetic correlations for 10 pairs remained significant after Bonferroni adjustment (P < 0.05/91 = 5.49E-04) (Fig 1A). Hypertension had significant genetic correlations with five diseases/conditions: it was positively correlated with T2D (rg = 0.23, P = 1.3E-08), stroke (rg = 0.49, P = 1.2E-21), high cholesterol (rg = 0.62, P = 3.2E-39), and depression (rg = 0.37, P = 1.3E-07), and negatively correlated with asthma (rg = -0.15, P = 2.0E-04). High cholesterol was positively genetically correlated with T2D (rg = 0.21, P = 4.7E-06), stroke (rg = 0.26, P = 2.05E-08), and depression (rg = 0.23, P = 1.0E-04). Depression was positively genetically correlated with psychiatric problems (mainly anxiety and nerves) (rg = 0.75, P = 1.5E-16) and negatively correlated with asthma (rg = -0.20, P = 1.0E-04) (Table 2 and Fig 1A).


Fig 1. (A) LDSC results. Square color indicates r(g), blue indicates a positive correlation, and red indicates a negative correlation. Square size indicates Bonferroni-adjusted p value, * indicates P < 5.49E-04. Psychiatric problems: Mental health problems ever diagnosed by a professional: Anxiety, nerves or generalized anxiety disorder. (B) Differential expression levels of identified pleiotropic genes between 11 pairs of phenotypes in GTEx v8 54 tissue types. The -log10 (P value) in the graph refers to the probability of hypergeometric testing. Red bars denote tissues that remained significantly prominent after Bonferroni correction (P<0.05). (C) The shared enrichment pathways and/or GO terms. In the Sankey bubble diagram, the first column on the left side indicates each phenotypic pairwise combination. The second column represents the gene set enrichment analysis of the genes associated with the combination, in which different phenotypic combinations identified the same enrichment signal. The bubble diagram on the right side shows the genes enriched in the corresponding gene set. Positional gene sets are sets of genes that are categorized based on their different locations on the chromosome. (D) Network plot obtained from two-sample Mendelian randomization analysis for the 14 diseases/conditions (IVW P < 0.05). The starting point of the connecting line represents exposure, and the end point of the arrow indicates the outcome. COPD: Chronic obstructive pulmonary disease.

https://doi.org/10.1371/journal.pone.0300740.g001

[Table 2. https://doi.org/10.1371/journal.pone.0300740.t002]

Functional annotation for pleiotropic SNPs and corresponding genes

Furthermore, PLACO was used to investigate pleiotropic SNPs for the significant pairs in the LDSC analysis (i.e., the 20 condition pairs in Table 2). The results showed that 11 of the 20 pairs of correlated traits had pleiotropic SNPs (S2 Table in S2 File). We then applied FUMA for functional annotation of the relevant SNPs; for the 11 pairs, a total of 8,927 lead SNPs were selected (S3 Table in S2 File and S1 Fig in S1 File). For the pleiotropic genes of the 11 pairs, tissue differential expression analysis across the GTEx v8 54 tissue types showed that the pleiotropic genes, particularly the up-regulated ones, were mainly enriched in the brain, nerves, kidney, heart, and blood vessels (Fig 1B), indicating that these genes probably perform important biological functions. We also performed enrichment analysis for each significant pair (S4 Table in S2 File) and identified 21 gene sets shared by at least two pairs. Seven significant KEGG pathways were shared, such as drug metabolism by cytochrome P450 and porphyrin and chlorophyll metabolism (Fig 1C).

For a more detailed explanation of the pleiotropic SNPs and corresponding genes, we further divided these associated traits into five subgroups.

Hypertension-associated subgroup.

The hypertension-associated group consisted of hypertension-psychiatric problems (anxiety and nerves), hypertension-arthritis, hypertension-asthma, and hypertension-high cholesterol. The pleiotropic SNPs identified by PLACO analysis were mapped to 1,030 pleiotropic genes (S5 Table in S2 File). Among the gene ontology (GO) biological process (BP) terms, the gene sets showed remarkably strong enrichment signals in glucuronidation and related metabolism (e.g., GO_FLAVONOID_GLUCURONIDATION, P adj = 1.36E-04; GO_URONIC_ACID_METABOLIC_PROCESS, P adj = 5.58E-03). The KEGG gene sets were mainly enriched in glucuronate interconversions and carbohydrate metabolism (e.g., KEGG_PENTOSE_AND_GLUCURONIDATE_INTERCONVERSION, P adj = 5.99E-08; KEGG_STARCH_AND_SUCROSE_METABOLISM, P adj = 1.25E-04) (S6 Table in S2 File and S2 Fig in S1 File). The identified pleiotropic genes were significantly enriched in brain, kidney, heart, and nerve tissue (S3A Fig in S1 File).

T2D-associated subgroup. A total of 1,582 unique pleiotropic genes were identified in the T2D-related cluster, which included T2D-cancer, T2D-asthma, and T2D-high cholesterol ( S7 Table in S2 File ). We found that the 1,582 genes showed strong enrichment signals in gene sets related to oncogenic transcription factors and the biocarta pathway ( S8 Table in S2 File ). Pleiotropic genes were mainly concentrated in brain tissue, down-regulated in the kidney and spleen, and up-regulated in vascular tissue and muscle ( S3B Fig in S1 File ).

Psychiatric problems-related group.

The psychiatric problems-associated group mainly comprises phenotypes associated with anxiety and nerves, here paired with asthma and high cholesterol. A total of 1,455 unique pleiotropic genes were identified by positional mapping and eQTL mapping (S9 Table in S2 File). The biological process (BP) terms of GO were mainly associated with adhesion pathways (i.e., GO_BIOLOGICAL_ADHESION, P adj = 1.36E-04; GO_CELL_CELL_ADHESION, P adj = 4.22E-06), and the molecular function (MF) terms of GO were primarily involved in synapse and neurotransmission pathways (i.e., GO_SYNAPSE, P adj = 2.32E-03; GO_NEURON_PART, P adj = 1.72E-02) (S10 Table in S2 File and S4 Fig in S1 File); these genes accumulate in brain and nerve tissues (S3C Fig in S1 File).

High cholesterol-depression subgroup.

A total of 303 unique genes were identified by positional mapping and eQTL mapping between high cholesterol and depression (S11 Table in S2 File). These genes were enriched in diseases or mutants of signal transduction and were associated with apoptosis (S12 Table in S2 File and S5 Fig in S1 File). However, they did not show significant differential expression across the 30 tissues (S3D Fig in S1 File).

Parkinson’s disease-hip fracture subgroup.

There were 985 unique putative pleiotropic genes detected between Parkinson’s disease and hip fracture via two mapping methods ( S13 Table in S2 File ). The Gene Ontology (GO) gene sets showed that 985 unique genes had significant enrichment signals in neurons and cells (e.g., GO_REGULATION_OF_CELL_PROJECTION_ORGANIZATION, P adj = 1.15E-04; GO_CELL_PROJECTION_ORGANIZATION, P adj = 2.13E-04) (S14 Table in S2 File and S6 Fig in S1 File ) . Interestingly, differential expression of these unique genes was significantly enriched in the prostate (S3E Fig in S1 File ).

Causal associations among 14 conditions/diseases inferred by Mendelian randomization analysis

Finally, our study systematically analyzed the bidirectional causal relationships among the 14 conditions/diseases. Of all 182 examined associations, 82 pairs had instrumental variables with F values larger than 10 (S15 Table in S2 File). Among these 82 pairs, we found 13 with significant causal effects (IVW P < 0.05). Notably, IVW analysis demonstrated that three phenotypes (hypertension, stroke, and high cholesterol) not only displayed significant bidirectional causal associations among themselves but also showed unidirectional causal associations to other phenotypes, indicating that they increase the risk of other diseases (Figs 1D and 2). After adjustment of the P values, eight pairs of traits still had significant causal relationships: hypertension→stroke (P adj = 2.73E-22), hypertension→high cholesterol (P adj = 1.11E-19), T2D→high cholesterol (P adj = 1.67E-03), stroke→hypertension (P adj = 1.03E-02), high cholesterol→hypertension (P adj = 6.01E-10), high cholesterol→cancer (P adj = 4.20E-02), high cholesterol→stroke (P adj = 4.56E-04), and asthma→COPD (P adj = 2.75E-05). However, there was no significant causal relationship between T2D and hypertension (P adj = 5.26E-02), and when stroke was the exposure and high cholesterol the outcome, P adj = 9.68E-02, suggesting that the causality between high cholesterol and stroke may be unidirectional (S16 Table in S2 File). In addition, MR-Egger regression analyses showed no potential pleiotropy (all P intercept > 0.05), and Cochran's Q test indicated that heterogeneity was present in more than half (61.54%) of the 13 pairs. For disease pairs with heterogeneity (P heter < 0.05), the P value of IVW with random effects was used; for those without heterogeneity (P heter > 0.05), the P value of IVW with fixed effects was used (S17 Table in S2 File). Scatter plots of the MR results are provided in the supplementary materials (S7 Fig in S1 File).


Fig 2. Odds ratios (OR) of exposure to outcome with 95% confidence intervals (95% CI). Blue squares indicate ORs, and the area of the square indicates the size of the OR value; the larger the OR value, the larger the area of the square.

https://doi.org/10.1371/journal.pone.0300740.g002

Since the phenotypic correlations were probably caused by shared genetic effects and/or significant causal effects, we summarized the underlying mechanisms for the 24 studied pairs ( Table 3 ). The phenotypic correlations for six pairs (hypertension & T2D, hypertension & stroke, hypertension & high cholesterol, T2D & high cholesterol, stroke & high cholesterol, and stroke & Parkinson’s disease) were probably caused by both shared genetic effects and significant causal effects, but for other pairs, their correlations were probably caused by either shared genetic effects or significant causal effects. Hypertension, stroke, and high cholesterol have been frequently observed to be associated with other conditions/diseases.

[Table 3. https://doi.org/10.1371/journal.pone.0300740.t003]

This study performed systematic analyses to detect the shared genetic and causal effects of 14 common conditions/diseases in multimorbidity patterns. LDSC analysis showed that the 20 studied pairs were genetically correlated, indicating that genetic effects are shared among the 14 conditions/diseases. The pleiotropic SNPs identified by PLACO and the corresponding mapped genes overlapped among the studied phenotypic pairs and were enriched in pathways and/or GO terms with important biological functions. Compared with the other phenotypes, hypertension, stroke, and high cholesterol were causally/genetically associated with other phenotypes at a high frequency, and they most likely serve as the driving factors for other conditions/diseases.

Previous GWASs have shown that numerous genomic loci are significantly associated with a large diversity of traits [29]. Extensive pleiotropic effects were identified as a possible explanation for comorbidity, pointing to potential common genetic mechanisms between diseases. Most importantly, we found that hypertension, stroke, and high cholesterol were each associated with more than one phenotype, emphasizing their prominence in the comorbidity pattern [30]. Among the ~380 chronic disease combinations identified in older adults, hypertension was present in almost every combination of prevalent diseases [31].

Functional annotation analysis showed that the putative pleiotropic genes of the 11 trait pairs were involved in biological processes such as biotransformation (e.g., the glucuronate pathway and drug metabolism). The glucuronate pathway is indispensable for normal physiological function, supporting detoxification and excretion, and the metabolism of many drugs depends on it [32]. Several chronic inflammatory diseases (such as metabolic syndrome, T2D, and cardiovascular disease) can affect glucose metabolism [33]. We identified 985 pleiotropic genes for Parkinson's disease and hip fracture that were involved in the development of neurons. Previous studies have shown that neuronal damage in Parkinson's disease leads to symptoms such as tremors and muscle stiffness, which contribute to an increased number of falls and are associated with fracture risk [34]. These findings suggest that neuronal regulation plays an important role in the pathophysiology of both Parkinson's disease and hip fracture.

The pleiotropic genes were mainly enriched in the brain, nerves, kidney, heart, and blood vessels. Previous evidence supports that vascular dysfunction is a common feature of several chronic diseases [ 35 ]. For example, hypertension, T2D, and kidney diseases were proven to be associated with cerebrovascular damage [ 36 ]. Chronic inflammatory diseases could induce higher chemokine production and activate intrarenal macrophages and dendritic cells in rats [ 37 ]. This suggested that the kidneys may be involved. In addition, we found that some disease pairs shared common biological functions. Seven KEGG pathways were associated with different trait pairs. The pathways involved in drug metabolism by cytochrome p450 and the metabolism of xenobiotics by cytochrome p450 were related to the balance between the oxidative stress system and antioxidant capacity and might play a vital role in hypertension-related conditions/diseases [ 38 ]. In addition, another study found that both hypertension and high cholesterol in animal models lead to increased systemic and vascular oxidative stress, resulting in reduced vasodilatation [ 39 ]. The oxidative system is closely associated with hypertension and T2D [ 40 ]. Based on the above evidence, we speculated that the dysregulation of the oxidative stress system may represent the major characteristics of the pathological process of hypertension-related diseases. The metabolism of porphyrin and chlorophyll and the metabolism of ascorbic acid and erythrosine in two trait pairs (hypertension—psychiatric problems and psychiatric problems—high cholesterol) may be newly discovered pathways because they have not been previously reported (psychiatric problems mainly refer to anxiety and nerves).

The LDSC analysis showed that hypertension was genetically related to T2D, and Mendelian randomization suggested that T2D had a unidirectional causal effect on hypertension. Previous MR analyses have shown similar results: Zhu et al., using community-based disease GWAS data, found no causal relationship from hypertension to T2D (p = 0.44 for SBP→T2D and p = 0.20 for DBP→T2D) [41], in line with our unadjusted results. After correction, however, there was no causal effect of T2D on hypertension either; thus, caution should be exercised when considering the causality between these two phenotypes. A cross-sectional study showed that clinically significant depressive symptoms were associated with elevated blood pressure but could not determine any causal relationship between hypertension and depression [42]. Our MR analysis likewise showed no causal association between them, whereas LDSC showed a strong genetic correlation (r(g) = 0.37, P = 1.30E-07). Evidence suggests that endothelial dysfunction plays an important role in the pathobiology of depression [43], and the vascular endothelium regulates vascular homeostasis and is a key factor in vascular health [44]. Therefore, we speculate that hypertension could lead to decreased vascular function and endothelial damage, which increases the risk of developing depression.

Hypertension and high cholesterol are common clinical comorbidities that have been previously described [ 45 ]. Hernández et al. analyzed the chronic combination pattern of 6,101 older adults aged 50 years and older and found that hypertension and high cholesterol were the most common coexisting conditions/diseases [ 46 ]. The 267 pleiotropic genes shared between them were mainly enriched in the synthesis and degradation of fibrinogen complexes, which have been reported to play an important role in high cholesterol patients with hypertension [ 47 ]. There also existed bidirectional causality between hypertension and stroke. Both the LDSC and MR results suggest that there are shared genetic determinants between the two phenotypes. A meta-analysis revealed that for every 10 mmHg increase in systolic blood pressure, the risk of stroke increased by 22%, and mortality increased by 56% [ 48 ]. This may be attributed to the possibility that hypertension accelerates the atherosclerotic process, thus increasing the probability of stroke. We confirmed the association from a genetic perspective.

We noticed that no pleiotropic SNPs/genes were found between hypertension and stroke or between stroke and high cholesterol. A possible reason is that we filtered SNPs under a strict threshold (P < 5E-08), which increased the stringency of the test but may also have removed real pleiotropic SNPs. Moreover, the pleiotropy of complex traits/diseases is an essential basis for genetic correlations between diseases [49]. Hypertension, T2D, and high cholesterol seem to occupy an influential position in comorbidity patterns and are presumably the dominant diseases from which to study the patterns of other conditions/diseases. Because it is uncertain whether all confounding factors have been identified, observational studies of associations between diseases are susceptible to interference by many external factors, such as medications [50]. For example, some disease co-occurrences are adverse outcomes of treatment for a specific disease (e.g., clozapine and olanzapine, used for treating psychiatric disorders, can alter insulin sensitivity and lipid metabolism, which increases the risk of diabetes and cardiovascular disease) [50]. Such co-occurrence is not a genetic association between the diseases themselves but rather an iatrogenic condition resulting from psychiatric medications. Therefore, even when a strong statistical association between exposure and outcome is observed, causal conclusions should be drawn with caution. The present study assessed the relationships between diseases using associated genetic variation, which is generally independent of environmental or lifestyle factors and thus strengthens causal inference. An in-depth analysis of the correlations between the conditions/diseases from the perspective of genetic variation, together with the identification of pleiotropic genes, biological pathways, and causal effects, was conducted to provide genetic insights for observational studies.

However, there are two limitations to our study. First, the summary data in our study only represent European populations; thus, caution should be taken when generalizing the findings to other ethnic groups (e.g., East Asians). Second, although we used multiple MR approaches to mitigate confounding due to pleiotropy, residual bias cannot be entirely ruled out, as it is an established limitation of the MR methodology [ 51 ].

The high phenotypic correlations for the 14 conditions/diseases in the multimorbidity patterns were caused by shared genetic effects and/or significant causal effects. Compared to others, hypertension, stroke, and high cholesterol were most frequently observed to be causally/genetically associated with other phenotypes, and they most likely serve as the driving factors for other conditions/diseases. These findings may provide basic data for the prevention, diagnosis, and treatment of these conditions/diseases.

Supporting information

S1 Checklist. Human participants research checklist.

https://doi.org/10.1371/journal.pone.0300740.s001

https://doi.org/10.1371/journal.pone.0300740.s002

https://doi.org/10.1371/journal.pone.0300740.s003

Acknowledgments

We thank all the studies for making the summary association statistics data publicly available.



Appendix A: Statistical Considerations

At a glance.

  • This appendix provides guidance on epidemiologic and descriptive statistical methods to assess cancer occurrences.
  • The standardized incidence ratio (SIR) is often used to assess if there is an excess number of cancer cases.
  • Alpha, beta, and statistical power relate to the types of errors that can occur during hypothesis testing.


This section provides general guidance regarding epidemiologic and descriptive statistical methods most commonly used to assess occurrences of cancer. Frequencies, proportions, rates, and other descriptive statistics are useful first steps in evaluating the suspected unusual pattern of cancer. These statistics can be calculated by geographical location (e.g., census tracts) and by demographic variables such as age category, race, ethnicity, and sex. Comparisons can then be made across different stratifications using statistical summaries such as ratios.

Standardized incidence ratio

The standardized incidence ratio (SIR) is often used to assess whether there is an excess number of cancer cases, considering what is “expected” to occur within an area over time given existing knowledge of the type of cancer and the local population at risk. The SIR is a ratio of the number of observed cancer cases in the study population compared to the number that would be expected if the study population experienced the same cancer rates as a selected reference population. Typically, the state as a whole is used as a reference population. The equation is as follows:

SIR = observed number of cancer cases / expected number of cancer cases

Adjusting for factors

The SIR can be adjusted for factors such as age, sex, race, or ethnicity, but it is most commonly adjusted for differences in age between two populations. In cancer analyses, adjusting for age is important because age is a risk factor for many cancers, and the population in an area of interest could be, on average, younger or older than the reference population 1 2 . In these instances, comparing the crude counts or rates would present a biased comparison.

For more guidance, this measure is explained in many epidemiologic textbooks, sometimes under standardized mortality ratio, which uses the same method but measures mortality instead of incidence rates 3 4 5 6 7 8 9 . Two ways are generally used to adjust via standardization, an indirect and a direct method. An example of one method is shown below, but a discussion of other methods is provided in several epidemiologic textbooks 3 and reference manuals 10 .

An example is provided in the table below, adjusting for age groups. The second column, denoted "O," is the observed number of cases in the area of interest, which in this example is a particular county within the state. The third column shows the population totals for each age group within the county of interest, designated "A." The state age-specific cancer rates are shown in the fourth column, denoted "B." To get the expected number of cases in the fifth column, A and B are multiplied for each row. The total observed cases and the total expected cases are then summed.

*Number of cases in a specified time frame. † Number of cases in the state divided by the state population for the specified time frame. Rates are typically expressed per 100,000 or 1,000,000 population.

The number of observed cancer cases can then be compared to the expected. The SIR is calculated using the formula below.

SIR = total observed cases (O) / total expected cases
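A hedged base-R sketch of the arithmetic: the expected count is the sum over age groups of the county population (column A) times the state age-specific rate (column B), and the SIR is observed over expected. All numbers below are illustrative placeholders, since the original table is not reproduced here.

```r
# Indirect standardization: expected cases and the SIR (illustrative numbers)
county_pop <- c("0-39" = 30000, "40-64" = 20000, "65+" = 10000)    # column A
state_rate <- c("0-39" = 10, "40-64" = 60, "65+" = 300) / 100000   # column B, per 100,000
observed   <- 25                                                   # total of column O
expected   <- sum(county_pop * state_rate)                         # 45 here
c(expected = expected, SIR = observed / expected)                  # SIR ~ 0.56
```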

Confidence intervals

A confidence interval (CI) is one of the most important statistics to be calculated, as it helps to provide understanding of both statistical significance and precision of the estimate. The narrower the confidence interval, the more precise the estimate 4 .

A common way of calculating confidence intervals for the SIR is shown below 4 :

[Formula for the confidence interval of the SIR; see reference 4]

Using the example above produces this result:

[Worked example: confidence interval for the SIR based on the table above]

If the confidence interval for the SIR includes 1.0, the SIR is not considered statistically significant. However, there are many considerations when using the SIR. Because the statistic can be affected by small case counts, the proportion of the population within the area of interest, and other factors, the significance of the SIR should not be used as the sole metric to determine further assessment in the investigation of unusual patterns of cancer. Additionally, when samples are small, exact statistical methods, which are calculated directly from data probabilities (e.g., Fisher's exact test rather than the chi-square approximation), can be considered. These calculations can be performed using software such as R, Microsoft Excel, SAS, and STATA 9 . A few additional topics regarding the SIR are summarized below.
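As one concrete (and hedged) example of an exact method, base R's poisson.test gives an exact Poisson confidence interval for the SIR when the expected count is supplied as the time base; the numbers continue the illustrative sketch above.

```r
# Exact 95% CI for O/E via the Poisson distribution (illustrative numbers)
poisson.test(x = 25, T = 45)$conf.int   # interval for the SIR = 25/45
```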

Reference population

Decisions about the reference population should be made prior to calculating the SIR. The reference population could be people in the surrounding census tracts, other counties in the state, or the entire state. Selecting the appropriate reference population depends on the hypothesis being tested, and the population should be large enough to provide relatively stable reference rates. One issue to consider is the size of the study population relative to the reference population. If the study population is small relative to the overall state population, including the study population in the reference population calculation will not yield substantially different results; however, excluding the study population from the reference population may reduce bias. If the reference population is smaller than the state as a whole (such as another county), it should be similar to the study population in terms of factors that could be confounders (such as age distribution, socioeconomic status, and environmental exposures other than the exposure of interest). However, the reference population should not be selected to be similar to the study population in terms of the exposure of interest. Appropriate comparisons may also better address issues of environmental justice and health equity. Ultimately, careful consideration of the reference population is necessary, since the choice can affect the interpretation of findings and can introduce biases that reduce the precision of the estimate.

Limitations and further considerations for the SIR

One difficulty in community cancer investigations is that the population under study is generally a community or part of a community, so a relatively small number of individuals comprises the total population (e.g., a small denominator for rate calculations). Small denominators frequently yield wide confidence intervals, meaning that estimates like the SIR may be imprecise 5 . Other methods, such as qualitative analyses or geospatial/spatial statistics, can support further examination of the cancer and area of concern to better discern associations. Further epidemiologic studies may support additional analyses, such as logistic regression or Poisson regression. These methods are described in Appendix B. Other resources provide additional guidance on the use of p-values, confidence intervals, and statistical tests 3 4 9 11 12 .

Alpha, beta, and statistical power

Another important consideration in community cancer investigations is the types of errors that can occur during hypothesis testing and the related concepts of alpha, beta, and statistical power. A type I error occurs when the null hypothesis (H0) is rejected but is actually true (e.g., concluding that there is a difference in cancer rates between the study population and the reference population when there is actually no difference). The probability of a type I error is often referred to as alpha or α 13.

α = P(type I error) = P(rejecting H0 when H0 is actually true)

A type II error occurs when the null hypothesis is not rejected when it should have been (e.g., concluding that there is no difference in cancer rates when there actually is a difference). The probability of a type II error is often referred to as beta or β.

β = P(type II error) = P(failing to reject H0 when H0 is actually false)
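To make these error rates concrete, here is a minimal simulation sketch (not from the source; the data are synthetic): when the null hypothesis is true, a test run at alpha = 0.05 falsely rejects in roughly 5% of trials, which is exactly what alpha describes.

    # Minimal sketch: simulate the type I error rate. Both samples come
    # from the same population, so the null hypothesis is true by design.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_trials, rejections = 10_000, 0
    for _ in range(n_trials):
        a = rng.normal(0, 1, 50)
        b = rng.normal(0, 1, 50)
        _, p = stats.ttest_ind(a, b)
        rejections += p < 0.05

    print(f"False-rejection rate: {rejections / n_trials:.3f}")  # ~0.05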

Power is the probability of rejecting the null hypothesis when the null hypothesis is actually false (e.g., concluding that there is a difference in cancer rates between the study population and the reference population when there actually is a difference). Power is equal to 1 − beta and is related to the sample size of the study: the larger the sample size, the greater the power. Power also depends on several other factors, including the following:

  • The size of the effect (e.g., rate ratio or rate difference) to be detected
  • The probability of incorrectly rejecting the null hypothesis (alpha)
  • Other features related to the study design, such as the distribution and variability of the outcome measure

As with other epidemiologic analyses, in community cancer investigations a power analysis can be conducted to estimate the minimum number of people (sample size) needed in a study to detect an effect (e.g., rate ratio or rate difference) of a given size with a specified level of power (1 − beta) and a specified probability of rejecting the null hypothesis when it is true (alpha), given an assumed distribution for the outcome. Typically, a power value of 0.8 (equivalent to a beta value of 0.2) and an alpha value of 0.05 are used. An alpha value of 0.05 corresponds to a 95% confidence interval.

Selecting an alpha value larger than 0.05 (e.g., 0.10, corresponding to a 90% confidence interval) increases the probability of concluding that there is a difference when there is actually no difference (a type I error). Selecting a smaller alpha value (e.g., 0.01, corresponding to a 99% confidence interval) decreases that risk and is sometimes considered when many SIRs are computed, because some statistically significant apparent associations are expected just by chance. As the number of SIRs examined increases, the number that are statistically significant by chance alone also increases; if alpha is 0.05, then 5% of the results are expected to be statistically significant by chance alone. However, one may account for this when interpreting results rather than lowering the alpha value 14. Decreasing the alpha value also decreases the power to detect differences between the population of interest and the reference population.

In many investigations of suspected unusual patterns of cancer, the number of people in the study population is determined by factors that may prevent the selection of a sample size sufficient to detect statistically significant differences. In these situations, a power analysis can instead be used to estimate the power of the study to detect a difference in rates of a given magnitude (see the sketch below). This information can help determine whether, and what type of, statistical analysis is appropriate; the results of a power calculation can therefore be informative about how best to move forward.
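The sketch below illustrates both uses of a power analysis described above, using statsmodels with hypothetical incidence proportions: solving for the sample size needed to reach 80% power at alpha = 0.05, and solving for the achievable power when the number of people is fixed.

    # Minimal sketch: power analysis for comparing two proportions.
    # Incidence proportions and group sizes are hypothetical.
    import statsmodels.stats.api as sms

    p_study, p_reference = 0.0020, 0.0010            # a rate ratio of 2.0
    effect_size = sms.proportion_effectsize(p_study, p_reference)  # Cohen's h

    analysis = sms.NormalIndPower()

    # Sample size per group needed for power = 0.8 at alpha = 0.05
    n_needed = analysis.solve_power(effect_size=effect_size, alpha=0.05,
                                    power=0.8, ratio=1.0,
                                    alternative='two-sided')
    print(f"People needed per group: {n_needed:.0f}")

    # Achievable power when only 5,000 people per group are available
    power = analysis.solve_power(effect_size=effect_size, nobs1=5000,
                                 alpha=0.05, ratio=1.0,
                                 alternative='two-sided')
    print(f"Power with 5,000 per group: {power:.2f}")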

Additional Contributing Authors:

Andrea Winquist, Angela Werner

References

1. Waller LA, Gotway CA. Applied spatial statistics for public health data. New York, NY: John Wiley and Sons; 2004.
2. National Cancer Institute. Cancer Incidence Statistics [Internet]. 2021 [cited 2022 Jan 7]. Available from: https://surveillance.cancer.gov/statistics/types/incidence.html
3. Gordis L. Epidemiology. Philadelphia, PA: Elsevier Saunders; 2014.
4. Merrill R. Environmental Epidemiology, Principles and Methods. Sudbury, MA: Jones and Bartlett Publishers, Inc.; 2008.
5. Kelsey JL, Whittemore AS, Evans AS, Thompson WD. Methods in observational epidemiology. 2nd ed. New York, NY: Oxford University Press; 1996.
6. Sahai H, Khurshid A. Statistics in epidemiology: methods, techniques, and applications. Boca Raton, FL: CRC; 1996.
7. Selvin S. Statistical analysis of epidemiologic data. New York, NY: Oxford University Press; 1996.
8. Breslow NE, Day NE. Statistical methods in cancer research. Volume I – The analysis of case-control studies. IARC Sci Publ. 1980;(32):5–338.
9. Rothman KJ, Greenland S, Lash TL. Modern epidemiology. 3rd ed. Philadelphia, PA: Lippincott Williams & Wilkins; 2008.
10. United Kingdom and Ireland Association of Cancer Registries. Standard Operating Procedure: Investigating and Analysing Small-Area Cancer Clusters [Internet]. 2015. Available from: Cancer Cluster SOP_0.pdf (ukiacr.org)
11. Greenland S, Senn S, Rothman KJ, Carlin J, Poole C, Goodman S, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31(4):337–50.
12. Amrhein V, Greenland S, McShane B. Scientists rise up against statistical significance. Nature. 2019;567:305–7.
13. Pagano M, Gauvreau K. Principles of Biostatistics. 2nd ed. Pacific Grove, CA: Duxbury Thomson Learning; 2000.
14. Rothman KJ. No adjustments are needed for multiple comparisons. Epidemiology. 1990;1(1):43–6.
