Statology

Statistics Made Easy

Hypothesis Test vs. Confidence Interval: What’s the Difference?

Two of the most commonly used procedures in statistics are hypothesis tests and confidence intervals.

Here’s the difference between the two:

  • A hypothesis test is a formal statistical test that is used to determine if some hypothesis about a population parameter is true.
  • A confidence interval is a range of values that is likely to contain a population parameter with a certain level of confidence.

This tutorial shares a brief overview of each method along with their similarities and differences.

The Basics of Hypothesis Tests

A hypothesis test is used to test whether or not some hypothesis about a population parameter is true.

To perform a hypothesis test in the real world, researchers will obtain a random sample from the population and perform a hypothesis test on the sample data, using a null and alternative hypothesis:

  • Null Hypothesis (H0): The sample data occurs purely from chance.
  • Alternative Hypothesis (HA): The sample data is influenced by some non-random cause.

If the p-value of the hypothesis test is less than some significance level (e.g., α = .05), then we can reject the null hypothesis and conclude that we have sufficient evidence to support the alternative hypothesis.

Hypothesis Test Example

Suppose a manufacturing facility wants to test whether or not some new method changes the number of defective widgets produced per month, which is currently 250.

To test this, they may measure the mean number of defective widgets produced before and after using the new method for one month.

They can perform a hypothesis test using the following hypotheses:

  • H0: μ_after = μ_before (the mean number of defective widgets produced is the same before and after using the new method)
  • HA: μ_after ≠ μ_before (the mean number of defective widgets produced is different before and after using the new method)

Suppose they perform a one sample t-test and end up with a p-value of .0032.

Since this p-value is less than α = .05, the facility can reject the null hypothesis and conclude that the new method leads to a change in the number of defective widgets produced per month.

The Basics of Confidence Intervals

To calculate a confidence interval in the real world, researchers will obtain a random sample from the population and use the following formula to calculate a confidence interval for the population mean:

Confidence Interval = x̄ ± z*(s/√n)

  • x̄: sample mean
  • z: the chosen z-value
  • s: sample standard deviation
  • n: sample size

The z-value that you use depends on the confidence level that you choose. The z-values that correspond to popular confidence level choices are:

  • 90% confidence level: z = 1.645
  • 95% confidence level: z = 1.96
  • 99% confidence level: z = 2.576

Confidence Interval Example

Suppose a biologist wants to estimate the mean weight of turtles in a certain population so she collects a random sample of turtles with the following information:

  • Sample size  n = 25
  • Sample mean weight x  = 300
  • Sample standard deviation  s = 18.5

Here is how to calculate the 90% confidence interval for the true population mean weight:

90% Confidence Interval: 300 ± 1.645*(18.5/√25) = [293.91, 306.09]

The biologist can be 90% confident that the true mean weight of a turtle in this population is between 293.91 pounds and 306.09 pounds.
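The turtle calculation above can be reproduced with a few lines of Python. This is a minimal sketch that uses only the standard library; the function name `z_confidence_interval` is made up for this illustration, and the z-value is looked up from the standard normal distribution rather than from a table.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(mean, sd, n, confidence):
    """Return (lower, upper) for a z-based confidence interval for a mean."""
    # Two-sided interval: half of (1 - confidence) goes in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sd / sqrt(n)
    return mean - margin, mean + margin


# Values from the turtle example: n = 25, sample mean = 300, s = 18.5.
lower, upper = z_confidence_interval(mean=300, sd=18.5, n=25, confidence=0.90)
print(f"90% CI: [{lower:.2f}, {upper:.2f}]")
```

Changing `confidence` to 0.95 or 0.99 widens the interval, because a larger z-value is used for the same sample.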

Hypothesis Test vs. Confidence Interval: When to Use Each

The decision to use a hypothesis test or a confidence interval depends on the question you’re attempting to answer.

You should use a confidence interval when you want to estimate the value of a population parameter.

You should use a hypothesis test when you want to determine if some hypothesis about a population parameter is likely true or not.

To test your knowledge of when to use each procedure, consider the following scenarios.

Scenario 1: Hours Spent Studying

Suppose an academic researcher wants to measure the mean number of hours that college students spend studying per week.

Which procedure should she use to answer this question?

She should use a confidence interval because she’s interested in estimating the value of a population parameter.

Scenario 2: New Medication

Suppose a doctor wants to test whether or not a new medication is able to reduce blood pressure more than the current standard medication.

Which procedure should he use to answer this question?

He should use a hypothesis test because he’s interested in understanding whether or not a specific assumption about a population parameter is true.



Hey there. My name is Zach Bobbitt. I have a Masters of Science degree in Applied Statistics and I’ve worked on machine learning algorithms for professional businesses in both healthcare and retail. I’m passionate about statistics, machine learning, and data visualization and I created Statology to be a resource for both students and teachers alike.  My goal with this site is to help you learn statistics through using simple terms, plenty of real-world examples, and helpful illustrations.


6.6 - Confidence Intervals & Hypothesis Testing

Confidence intervals and hypothesis tests are similar in that they are both inferential methods that rely on an approximated sampling distribution. Confidence intervals use data from a sample to estimate a population parameter. Hypothesis tests use data from a sample to test a specified hypothesis. Hypothesis testing requires that we have a hypothesized parameter. 

The simulation methods used to construct bootstrap distributions and randomization distributions are similar. One primary difference is a bootstrap distribution is centered on the observed sample statistic while a randomization distribution is centered on the value in the null hypothesis. 

In Lesson 4, we learned confidence intervals contain a range of reasonable estimates of the population parameter. All of the confidence intervals we constructed in this course were two-tailed. These two-tailed confidence intervals go hand-in-hand with the two-tailed hypothesis tests we learned in Lesson 5. The conclusion drawn from a two-tailed confidence interval is usually the same as the conclusion drawn from a two-tailed hypothesis test. In other words, if the 95% confidence interval contains the hypothesized parameter, then a hypothesis test at the 0.05 \(\alpha\) level will almost always fail to reject the null hypothesis. If the 95% confidence interval does not contain the hypothesized parameter, then a hypothesis test at the 0.05 \(\alpha\) level will almost always reject the null hypothesis.
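The agreement between a two-tailed interval and a two-tailed test can be checked numerically. The sketch below is an illustration under stated assumptions, not the course's simulation method: it uses a normal (z) approximation instead of bootstrap and randomization distributions, the summary statistics are hypothetical, and the function names are made up for the example. With a z-based interval and test the two conclusions match exactly; with simulation-based methods they match "almost always," as described above.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(xbar, s, n, alpha=0.05):
    """Two-sided (1 - alpha) z-interval for a mean."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * s / sqrt(n)
    return xbar - margin, xbar + margin


def two_sided_p_value(xbar, s, n, mu0):
    """Two-sided z-test p-value for H0: mu = mu0."""
    z = (xbar - mu0) / (s / sqrt(n))
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical summary statistics, chosen only for illustration.
xbar, s, n, alpha = 98.26, 0.77, 50, 0.05
lower, upper = z_interval(xbar, s, n, alpha)

for mu0 in (98.0, 98.2, 98.4, 98.6):
    inside = lower <= mu0 <= upper
    p = two_sided_p_value(xbar, s, n, mu0)
    decision = "fail to reject H0" if p > alpha else "reject H0"
    print(f"mu0={mu0}: inside CI={inside}, p={p:.4f}, {decision}")
```

Every hypothesized value inside the interval gives p > 0.05, and every value outside it gives p < 0.05.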

Example: Mean

This example uses the Body Temperature dataset built in to StatKey for constructing a bootstrap confidence interval and conducting a randomization test.

Let's start by constructing a 95% confidence interval using the percentile method in StatKey.

The 95% confidence interval for the mean body temperature in the population is [98.044, 98.474].

Now, what if we want to know if there is enough evidence that the mean body temperature is different from 98.6 degrees? We can conduct a hypothesis test. Because 98.6 is not contained within the 95% confidence interval, it is not a reasonable estimate of the population mean. We should expect to have a p value less than 0.05 and to reject the null hypothesis.

\(H_0: \mu=98.6\)

\(H_a: \mu \ne 98.6\)

\(p = 2*0.00080=0.00160\)

\(p \leq 0.05\), reject the null hypothesis

There is evidence that the population mean is different from 98.6 degrees. 
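The two simulation methods used in this example can be sketched in Python. This is only an illustration: the data are randomly generated stand-ins (not the StatKey Body Temperature dataset), the helper `resample_means` is made up for the example, and shifting the sample so that its mean equals the null value is one common way to build a randomization distribution for a single mean. The point to notice is where each distribution is centered.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical sample (made-up values, not the StatKey dataset).
sample = [random.gauss(98.3, 0.7) for _ in range(50)]
null_value = 98.6  # hypothesized mean under H0


def resample_means(data, reps=2000):
    """Means of `reps` resamples drawn with replacement from `data`."""
    n = len(data)
    return [mean(random.choices(data, k=n)) for _ in range(reps)]


# Bootstrap distribution: resample the observed data as-is,
# so it is centered on the observed sample mean.
boot_means = resample_means(sample)

# Randomization distribution: shift the data so its mean equals the
# null value, then resample, so it is centered on the null value.
shift = null_value - mean(sample)
rand_means = resample_means([x + shift for x in sample])

# 95% percentile interval: the middle 95% of the bootstrap means.
boot_sorted = sorted(boot_means)
ci = (boot_sorted[int(0.025 * len(boot_sorted))],
      boot_sorted[int(0.975 * len(boot_sorted)) - 1])

# Two-sided p-value: share of randomization means at least as far
# from the null value as the observed sample mean.
observed_gap = abs(mean(sample) - null_value)
extreme = sum(1 for m in rand_means if abs(m - null_value) >= observed_gap)
p_value = extreme / len(rand_means)

print("bootstrap center:", round(mean(boot_means), 3))
print("randomization center:", round(mean(rand_means), 3))
print("95% percentile CI:", ci, "p-value:", p_value)
```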

Selecting the Appropriate Procedure

The decision of whether to use a confidence interval or a hypothesis test depends on the research question. If we want to estimate a population parameter, we use a confidence interval. If we are given a specific population parameter (i.e., hypothesized value), and want to determine the likelihood that a population with that parameter would produce a sample as different as our sample, we use a hypothesis test. Below are a few examples of selecting the appropriate procedure. 

Example: Cheese Consumption

Research question: How much cheese (in pounds) does an average American adult consume annually? 

What is the appropriate inferential procedure? 

Cheese consumption, in pounds, is a quantitative variable. We have one group: American adults. We are not given a specific value to test, so the appropriate procedure here is a confidence interval for a single mean.

Example: Age

Research question:  Is the average age in the population of all STAT 200 students greater than 30 years?

There is one group: STAT 200 students. The variable of interest is age in years, which is quantitative. The research question includes a specific population parameter to test: 30 years. The appropriate procedure is a hypothesis test for a single mean.

Try it!

For each research question, identify the variables and the parameter of interest, and decide on the appropriate inferential procedure.

Research question:  How strong is the correlation between height (in inches) and weight (in pounds) in American teenagers?

There are two variables of interest: (1) height in inches and (2) weight in pounds. Both are quantitative variables. The parameter of interest is the correlation between these two variables.

We are not given a specific correlation to test. We are being asked to estimate the strength of the correlation. The appropriate procedure here is a confidence interval for a correlation.

Research question:  Are the majority of registered voters planning to vote in the next presidential election?

The parameter that is being tested here is a single proportion. We have one group: registered voters. "The majority" would be more than 50%, or p>0.50. This is a specific parameter that we are testing. The appropriate procedure here is a hypothesis test for a single proportion.

Research question:  On average, are STAT 200 students younger than STAT 500 students?

We have two independent groups: STAT 200 students and STAT 500 students. We are comparing them in terms of average (i.e., mean) age.

If STAT 200 students are younger than STAT 500 students, that translates to \(\mu_{200}<\mu_{500}\) which is an alternative hypothesis. This could also be written as \(\mu_{200}-\mu_{500}<0\), where 0 is a specific population parameter that we are testing. 

The appropriate procedure here is a hypothesis test for the difference in two means.

Research question:  On average, how much taller are adult male giraffes compared to adult female giraffes?

There are two groups: males and females. The response variable is height, which is quantitative. We are not given a specific parameter to test; instead, we are asked to estimate "how much" taller males are than females. The appropriate procedure is a confidence interval for the difference in two means.

Research question:  Are STAT 500 students more likely than STAT 200 students to be employed full-time?

There are two independent groups: STAT 500 students and STAT 200 students. The response variable is full-time employment status which is categorical with two levels: yes/no.

If STAT 500 students are more likely than STAT 200 students to be employed full-time, that translates to \(p_{500}>p_{200}\) which is an alternative hypothesis. This could also be written as \(p_{500}-p_{200}>0\), where 0 is a specific parameter that we are testing. The appropriate procedure is a hypothesis test for the difference in two proportions.

Research question: Is there a relationship between outdoor temperature (in Fahrenheit) and coffee sales (in cups per day)?

There are two variables here: (1) temperature in Fahrenheit and (2) cups of coffee sold in a day. Both variables are quantitative. The parameter of interest is the correlation between these two variables.

If there is a relationship between the variables, that means that the correlation is different from zero. This is a specific parameter that we are testing. The appropriate procedure is a hypothesis test for a correlation.


8.6 Relationship Between Confidence Intervals and Hypothesis Tests

Confidence intervals (CI) and hypothesis tests should give consistent results: we should not reject [latex]H_0[/latex] at the significance level [latex]\alpha[/latex] if the corresponding [latex](1 - \alpha) \times 100\%[/latex] confidence interval contains the hypothesized value [latex]\mu_0[/latex]. Two-sided confidence intervals correspond to two-tailed tests, upper-tailed confidence intervals correspond to right-tailed tests, and lower-tailed confidence intervals correspond to left-tailed tests.

A [latex](1 - \alpha) \times 100\%[/latex] two-sided [latex]t[/latex] confidence interval is given in the form [latex](\bar{x} - t_{\alpha / 2} \frac{s}{\sqrt{n}}, \bar{x} + t_{\alpha / 2} \frac{s}{\sqrt{n}})[/latex]. A [latex](1 - \alpha) \times 100\%[/latex] upper-tailed t confidence interval is given by [latex](\bar{x} - t_{\alpha} \frac{s}{\sqrt{n}}, \infty)[/latex] and the number [latex]\bar{x} - t_{\alpha} \frac{s}{\sqrt{n}}[/latex] is called the lower bound of the interval. A [latex](1 - \alpha) \times 100\%[/latex] lower-tailed t confidence interval is given by [latex](- \infty, \bar{x} + t_{\alpha} \frac{s}{\sqrt{n}})[/latex] and the number [latex]\bar{x} + t_{\alpha} \frac{s}{\sqrt{n}}[/latex] is called the upper bound of the interval. We can also use confidence intervals to make conclusions about hypothesis tests: reject the null hypothesis [latex]H_0[/latex] at the significance level [latex]\alpha[/latex] if the corresponding [latex](1 - \alpha) \times 100\%[/latex] confidence interval does not contain the hypothesized value [latex]\mu_0[/latex]. The relationship is summarized in the following table.

Table 8.3 : Relationship Between Confidence Interval and Hypothesis Test

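  • Two-tailed test ([latex]H_a: \mu \neq \mu_0[/latex]): two-sided confidence interval [latex](\bar{x} - t_{\alpha / 2} \frac{s}{\sqrt{n}}, \bar{x} + t_{\alpha / 2} \frac{s}{\sqrt{n}})[/latex]
  • Right-tailed test ([latex]H_a: \mu > \mu_0[/latex]): upper-tailed confidence interval [latex](\bar{x} - t_{\alpha} \frac{s}{\sqrt{n}}, \infty)[/latex]
  • Left-tailed test ([latex]H_a: \mu < \mu_0[/latex]): lower-tailed confidence interval [latex](- \infty, \bar{x} + t_{\alpha} \frac{s}{\sqrt{n}})[/latex]
  • In each case, reject [latex]H_0[/latex] if [latex]\mu_0[/latex] is outside the interval.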

Here is the reason we should reject [latex]H_0[/latex] if [latex]\mu_0[/latex] is outside the corresponding confidence interval.

Take the right-tailed test as an example. We should reject [latex]H_0[/latex] if the observed test statistic [latex]t_o[/latex] falls in the rejection region, that is, if [latex]t_o \geq t_{\alpha}[/latex]. This implies [latex]t_o = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \geq t_{\alpha} \Longrightarrow \mu_0 \leq \bar{x} - t_{\alpha} \frac{s}{\sqrt{n}}.[/latex] Given that the upper-tailed confidence interval for a right-tailed test is [latex](\bar{x} - t_{\alpha} \frac{s}{\sqrt{n}}, \infty)[/latex], [latex]\mu_0 \leq \bar{x} - t_{\alpha} \frac{s}{\sqrt{n}}[/latex] means the value of [latex]\mu_0[/latex] is outside the confidence interval. The same rationale applies to two-tailed and left-tailed tests. Therefore, we can reject [latex]H_0[/latex] at the significance level [latex]\alpha[/latex] if [latex]\mu_0[/latex] is outside the corresponding [latex](1 - \alpha) \times 100\%[/latex] confidence interval.

Example: Relationship Between Confidence Intervals and Hypothesis Tests

The ankle-brachial index (ABI) compares the blood pressure of a patient’s arm to the blood pressure of the patient’s leg. The ABI can be an indicator of different diseases, including arterial diseases. A healthy (or normal) ABI is 0.9 or greater. Researchers obtained the ABI of 100 women with peripheral arterial disease and obtained a mean ABI of 0.64 with a standard deviation of 0.15.

  • Set up the hypotheses: [latex]H_0: \mu \geq 0.9[/latex] versus [latex]H_a: \mu < 0.9[/latex].
  • The significance level is [latex]\alpha = 0.05[/latex].
  • Compute the value of the test statistic: [latex]t_o = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} = \frac{0.64 - 0.9}{0.15 / \sqrt{100}} = \frac{-0.26}{0.015} = -17.333[/latex] with [latex]df = n-1 = 100 -1 = 99[/latex] (not given in Table IV, use 95, the closest one smaller than 99).
  • Find the P-value. For a left-tailed test, the P-value is the area to the left of the observed test statistic [latex]t_o[/latex]. [latex]\mbox{P-value} = P(t \leq t_o) = P(t \leq -17.333) = P(t \geq 17.333) < 0.005[/latex], since [latex]17.333 > 2.629 (t_{0.005})[/latex].
  • Decision: Since the P-value [latex]< 0.005 < 0.05(\alpha)[/latex], we should reject the null hypothesis [latex]H_0[/latex].
  • Conclusion: At the 5% significance level, the data provide sufficient evidence that, on average, women with peripheral arterial disease have an unhealthy ABI.

The corresponding 95% lower-tailed confidence interval (part b) is [latex]\left( - \infty, \bar{x} + t_{\alpha} \frac{s}{\sqrt{n}} \right)= \left( - \infty, 0.64 + 1.661 \times \frac{0.15}{\sqrt{100}} \right) = (- \infty , 0.665)[/latex].

  • Does the interval in part b) support the conclusion in part a)? In part a), we reject [latex]H_0[/latex] and claim that the mean ABI is below 0.9 for women with peripheral arterial disease. In part b), we are 95% confident that the mean ABI is less than 0.9 since the entire confidence interval is below 0.9. In other words, the hypothesized value 0.9 is outside the corresponding confidence interval, so we should reject the null. Therefore, the results obtained in parts a) and b) are consistent.
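The arithmetic in this example can be checked with a short Python sketch. It uses the summary statistics given above and the table value [latex]t_{0.05} = 1.661[/latex] (df = 95) from the text; the variable names are chosen for this illustration only.

```python
from math import sqrt

# Summary statistics from the ABI example.
xbar, mu0, s, n = 0.64, 0.9, 0.15, 100
t_crit = 1.661  # t_{0.05} with df = 95, the table value used in the text

se = s / sqrt(n)            # standard error of the mean
t_obs = (xbar - mu0) / se   # observed test statistic

# Left-tailed test: reject H0 when t_obs falls at or below -t_crit.
reject_h0 = t_obs <= -t_crit

# 95% lower-tailed confidence interval: (-infinity, upper_bound).
upper_bound = xbar + t_crit * se
mu0_outside_interval = mu0 >= upper_bound

print(f"t_obs = {t_obs:.3f}, reject H0: {reject_h0}")
print(f"upper bound = {upper_bound:.3f}, mu0 outside CI: {mu0_outside_interval}")
```

Both checks lead to the same decision, which is the consistency described in the last part of the example.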

Introduction to Applied Statistics Copyright © 2024 by Wanhua Su is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.


NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 Jan-.


Hypothesis testing, p values, confidence intervals, and significance.

Jacob Shreffler; Martin R. Huecker.

Last Update: March 13, 2023.

  • Definition/Introduction

Medical providers often rely on evidence-based medicine to guide decision-making in practice. Often a research hypothesis is tested with results provided, typically with p values, confidence intervals, or both. Additionally, statistical or research significance is estimated or determined by the investigators. Unfortunately, healthcare providers may have different comfort levels in interpreting these findings, which may affect the adequate application of the data.

  • Issues of Concern

A lack of foundational understanding of hypothesis testing, p values, confidence intervals, and the difference between statistical and clinical significance may affect healthcare providers' ability to make clinical decisions without relying purely on the research investigators' deemed level of significance. Therefore, an overview of these concepts is provided to allow medical professionals to use their expertise to determine if results are reported sufficiently and if the study outcomes are clinically appropriate to be applied in healthcare practice.

Hypothesis Testing

Investigators conducting studies need research questions and hypotheses to guide analyses. Starting with broad research questions (RQs), investigators then identify a gap in current clinical practice or research. Any research problem or statement is grounded in a better understanding of relationships between two or more variables. For this article, we will use the following research question example:

Research Question: Is Drug 23 an effective treatment for Disease A?

Research questions do not directly imply specific guesses or predictions; we must formulate research hypotheses. A hypothesis is a predetermined declaration regarding the research question in which the investigator(s) makes a precise, educated guess about a study outcome. This is sometimes called the alternative hypothesis and ultimately allows the researcher to take a stance based on experience or insight from medical literature. An example of a hypothesis is below.

Research Hypothesis: Drug 23 will significantly reduce symptoms associated with Disease A compared to Drug 22.

The null hypothesis states that there is no statistical difference between groups based on the stated research hypothesis.

Researchers should be aware of journal recommendations when considering how to report p values, and manuscripts should remain internally consistent.

Regarding p values, as the number of individuals enrolled in a study (the sample size) increases, the likelihood of finding a statistically significant effect increases. With very large sample sizes, the p-value can be very low even for small differences between groups.

Null Hypothesis: There are no significant differences in the reduction of symptoms for Disease A between Drug 23 and Drug 22.

The null hypothesis is deemed true until a study presents significant data to support rejecting the null hypothesis. Based on the results, the investigators will either reject the null hypothesis (if they found significant differences or associations) or fail to reject the null hypothesis (they could not provide proof that there were significant differences or associations).
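The effect of sample size on the p-value can be illustrated with a small calculation. The sketch below is an assumption-laden illustration, not an analysis from any study: it uses a two-sided z-test for a difference in two group means, a made-up difference of 0.1 with a standard deviation of 1.0 in each group, and a hypothetical helper named `two_sided_p`.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(diff, sd, n_per_group):
    """Two-sided z-test p-value for a difference in two group means."""
    se = sd * sqrt(2 / n_per_group)  # standard error of the difference
    z = diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical values: the same small difference at every sample size.
diff, sd = 0.1, 1.0
sample_sizes = (50, 500, 2000, 5000)
p_values = [two_sided_p(diff, sd, n) for n in sample_sizes]

for n, p in zip(sample_sizes, p_values):
    print(f"n per group = {n:>5}: p = {p:.6f}")
```

The difference between groups never changes, yet the p-value drops below 0.05 once the groups are large enough, which is why a small p-value alone says little about the size of an effect.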

To test a hypothesis, researchers obtain data on a representative sample to determine whether to reject or fail to reject a null hypothesis. In most research studies, it is not feasible to obtain data for an entire population. Using a sampling procedure allows for statistical inference, though this involves a certain possibility of error. [1]  When determining whether to reject or fail to reject the null hypothesis, mistakes can be made: Type I and Type II errors. Though it is impossible to ensure that these errors have not occurred, researchers should limit the possibilities of these faults. [2]

Significance

Significance is a term to describe the substantive importance of medical research. Statistical significance is the likelihood of results due to chance. [3]  Healthcare providers should always delineate statistical significance from clinical significance; conflating the two is a common error when reviewing biomedical research. [4]  When conceptualizing findings reported as either significant or not significant, healthcare providers should not simply accept researchers' results or conclusions without considering the clinical significance. Healthcare professionals should consider the clinical importance of findings and understand both p values and confidence intervals so they do not have to rely on the researchers to determine the level of significance. [5]  One criterion often used to determine statistical significance is the utilization of p values.

P values are used in research to determine whether the sample estimate is significantly different from a hypothesized value. The p-value is the probability that an effect at least as large as the one observed in the study would have occurred by chance if, in reality, there was no true effect. Conventionally, data yielding a p<0.05 or p<0.01 is considered statistically significant. While some have debated that the 0.05 level should be lowered, it is still widely practiced. [6]  Hypothesis testing allows us to determine whether an effect is statistically detectable, but not, on its own, the size of the effect.

Examples of findings reported with p values are below:

Statement: Drug 23 reduced patients' symptoms compared to Drug 22. Patients who received Drug 23 (n=100) were 2.1 times less likely than patients who received Drug 22 (n = 100) to experience symptoms of Disease A, p<0.05.

Statement: Individuals who were prescribed Drug 23 experienced fewer symptoms (M = 1.3, SD = 0.7) compared to individuals who were prescribed Drug 22 (M = 5.3, SD = 1.9). This finding was statistically significant, p = 0.02.

For either statement, if the threshold had been set at 0.05, the null hypothesis (that there was no relationship) should be rejected, and we should conclude significant differences. Noticeably, as can be seen in the two statements above, some researchers will report findings with < or > and others will provide an exact p-value (e.g., 0.000001) but never zero. [6]  When examining research, readers should understand how p values are reported. The best practice is to report all p values for all variables within a study design, rather than only providing p values for variables with significant findings. [7]  The inclusion of all p values provides evidence for study validity and limits suspicion for selective reporting/data mining.

While researchers have historically used p values, experts who find p values problematic encourage the use of confidence intervals. [8]  P-values alone do not allow us to understand the size or the extent of the differences or associations. [3]  In March 2016, the American Statistical Association (ASA) released a statement on p values, noting that scientific decision-making and conclusions should not be based on a fixed p-value threshold (e.g., 0.05). They recommend focusing on the significance of results in the context of study design, quality of measurements, and validity of data. Ultimately, the ASA statement noted that in isolation, a p-value does not provide strong evidence. [9]

When conceptualizing clinical work, healthcare professionals should consider p values with a concurrent appraisal of study design validity. For example, a p-value from a double-blinded randomized clinical trial (designed to minimize bias) should be weighted higher than one from a retrospective observational study. [7]  The p-value debate has smoldered since the 1950s, [10]  and replacement with confidence intervals has been suggested since the 1980s. [11]

Confidence Intervals

A confidence interval provides a range of values that, with a given level of confidence (e.g., 95%), includes the true value of the statistical parameter within a targeted population. [12]  Most research uses a 95% CI, but investigators can set any level (e.g., 90% CI, 99% CI). [13]  A CI provides a range with the lower bound and upper bound limits of a difference or association that would be plausible for a population. [14]  Therefore, a CI of 95% indicates that if a study were to be carried out 100 times, the range would contain the true value in about 95 of them. [15]  Confidence intervals provide more evidence regarding the precision of an estimate compared to p-values. [6]
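The repeated-study interpretation of a 95% CI can be illustrated with a simulation. This sketch rests on assumptions made only for the illustration: the population is normal with a made-up true mean and a known standard deviation, each simulated "study" draws 30 observations, and a z-based interval is used.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population and study design.
true_mean, sigma, n, trials = 10.0, 2.0, 30, 2000

# With a known standard deviation, the 95% z-interval is mean +/- margin.
z = NormalDist().inv_cdf(0.975)
margin = z * sigma / sqrt(n)

hits = 0
for _ in range(trials):
    # One simulated study: draw a sample and build its interval.
    xbar = mean(random.gauss(true_mean, sigma) for _ in range(n))
    if xbar - margin <= true_mean <= xbar + margin:
        hits += 1

coverage = hits / trials
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The share should land close to 0.95, though not exactly on it, because the simulation itself is subject to random variation.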

In consideration of the similar research example provided above, one could make the following statement with 95% CI:

Statement: Individuals who were prescribed Drug 23 had no symptoms after three days, which was significantly faster than those prescribed Drug 22; the mean difference in days to recovery between the two groups was 4.2 days (95% CI: 1.9 – 7.8).

It is important to note that the width of the CI is affected by the standard error and the sample size; reducing the sample size of a study will result in less precision of the CI (a larger width). [14]  A larger width indicates a smaller sample size or a larger variability. [16]  A researcher would want to increase the precision of the CI. For example, a 95% CI of 1.43 – 1.47 is much more precise than the one provided in the example above. In research and clinical practice, CIs provide valuable information on whether the interval includes or excludes any clinically significant values. [14]
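How sample size drives the width of a CI can be shown with a short calculation. The sketch below assumes a z-based interval for a mean and a made-up standard deviation; `ci_width` is a hypothetical helper written for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def ci_width(sd, n, confidence=0.95):
    """Full width of a z-based confidence interval for a mean."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return 2 * z * sd / sqrt(n)


sd = 1.9  # hypothetical standard deviation
widths = {n: ci_width(sd, n) for n in (25, 100, 400)}

for n, width in widths.items():
    print(f"n = {n:>3}: width = {width:.3f}")
```

Because the width is proportional to 1/√n, quadrupling the sample size halves the width of the interval.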

Null values are sometimes used for differences with CI (zero for differential comparisons and 1 for ratios). However, CIs provide more information than that. [15]  Consider this example: A hospital implements a new protocol that reduced wait time for patients in the emergency department by an average of 25 minutes (95% CI: -2.5 – 41 minutes). Because the range crosses zero, implementing this protocol in different populations could result in longer wait times; however, the range is much higher on the positive side. Thus, while the p-value used to detect statistical significance for this may result in "not significant" findings, individuals should examine this range, consider the study design, and weigh whether or not it is still worth piloting in their workplace.

Similarly to p-values, 95% CIs cannot control for researchers' errors (e.g., study bias or improper data analysis). [14]  In consideration of whether to report p-values or CIs, researchers should examine journal preferences. When in doubt, reporting both may be beneficial. [13]  An example is below:

Reporting both: Individuals who were prescribed Drug 23 had no symptoms after three days, which was significantly faster than those prescribed Drug 22, p = 0.009. The mean difference in days to recovery between the two groups was 4.2 days (95% CI: 1.9 – 7.8).

  • Clinical Significance

Recall that clinical significance and statistical significance are two different concepts. Healthcare providers should remember that a study with statistically significant differences and large sample size may be of no interest to clinicians, whereas a study with smaller sample size and statistically non-significant results could impact clinical practice. [14]  Additionally, as previously mentioned, a non-significant finding may reflect the study design itself rather than relationships between variables.

Healthcare providers using evidence-based medicine to inform practice should use clinical judgment to determine the practical importance of studies through careful evaluation of the design, sample size, power, likelihood of type I and type II errors, data analysis, and reporting of statistical findings (p values, 95% CI or both). [4]  Interestingly, some experts have called for "statistically significant" or "not significant" to be excluded from work as statistical significance never has been and never will be equivalent to clinical significance. [17]

The decision on what is clinically significant can be challenging, depending on the providers' experience and especially the severity of the disease. Providers should use their knowledge and experiences to determine the meaningfulness of study results and make inferences based not only on significant or insignificant results by researchers but through their understanding of study limitations and practical implications.

  • Nursing, Allied Health, and Interprofessional Team Interventions

All physicians, nurses, pharmacists, and other healthcare professionals should strive to understand the concepts in this chapter. These individuals should maintain the ability to review and incorporate new literature for evidence-based and safe care. 


Disclosure: Jacob Shreffler declares no relevant financial relationships with ineligible companies.

Disclosure: Martin Huecker declares no relevant financial relationships with ineligible companies.

This book is distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) ( http://creativecommons.org/licenses/by-nc-nd/4.0/ ), which permits others to distribute the work, provided that the article is not altered or used commercially. You are not required to obtain permission to distribute this article, provided that you credit the author and journal.

  • Cite this Page Shreffler J, Huecker MR. Hypothesis Testing, P Values, Confidence Intervals, and Significance. [Updated 2023 Mar 13]. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2024 Jan-.


Understanding Hypothesis Tests: Confidence Intervals and Confidence Levels

Topics: Hypothesis Testing , Data Analysis , Statistics

In this series of posts, I show how hypothesis tests and confidence intervals work by focusing on concepts and graphs rather than equations and numbers.  

Previously, I used graphs to show what statistical significance really means . In this post, I’ll explain both confidence intervals and confidence levels, and how they’re closely related to P values and significance levels.

How to Correctly Interpret Confidence Intervals and Confidence Levels

A confidence interval is a range of values that is likely to contain an unknown population parameter. If you draw a random sample many times, a certain percentage of the confidence intervals will contain the population mean. This percentage is the confidence level.

Most frequently, you’ll use confidence intervals to bound the mean or standard deviation, but you can also obtain them for regression coefficients, proportions, rates of occurrence (Poisson), and for the differences between populations.

Just as there is a common misconception of how to interpret P values , there’s a common misconception of how to interpret confidence intervals. In this case, the confidence level is not the probability that a specific confidence interval contains the population parameter.

The confidence level represents the theoretical ability of the analysis to produce accurate intervals if you are able to assess many intervals and you know the value of the population parameter. For a specific confidence interval from one study, the interval either contains the population value or it does not—there’s no room for probabilities other than 0 or 1. And you can't choose between these two possibilities because you don’t know the value of the population parameter.

"The parameter is an unknown constant and no probability statement concerning its value may be made."  —Jerzy Neyman, original developer of confidence intervals.

This will be easier to understand after we discuss the graph below . . .

With this in mind, how do you interpret confidence intervals?

Confidence intervals serve as good estimates of the population parameter because the procedure tends to produce intervals that contain the parameter. Confidence intervals are comprised of the point estimate (the most likely value) and a margin of error around that point estimate. The margin of error indicates the amount of uncertainty that surrounds the sample estimate of the population parameter.

In this vein, you can use confidence intervals to assess the precision of the sample estimate. For a specific variable, a narrower confidence interval [90 110] suggests a more precise estimate of the population parameter than a wider confidence interval [50 150].

Confidence Intervals and the Margin of Error

Let’s move on to see how confidence intervals account for that margin of error. To do this, we’ll use the same tools that we’ve been using to understand hypothesis tests. I’ll create a sampling distribution using probability distribution plots , the t-distribution , and the variability in our data. We'll base our confidence interval on the energy cost data set that we've been using.

When we looked at significance levels , the graphs displayed a sampling distribution centered on the null hypothesis value, and the outer 5% of the distribution was shaded. For confidence intervals, we need to shift the sampling distribution so that it is centered on the sample mean and shade the middle 95%.

Probability distribution plot that illustrates how a confidence interval works

The shaded area shows the range of sample means that you’d obtain 95% of the time using our sample mean as the point estimate of the population mean. This range [267 394] is our 95% confidence interval.

Using the graph, it’s easier to understand how a specific confidence interval represents the margin of error, or the amount of uncertainty, around the point estimate. The sample mean is the most likely value for the population mean given the information that we have. However, the graph shows it would not be unusual at all for other random samples drawn from the same population to obtain different sample means within the shaded area. These other likely sample means all suggest different values for the population mean. Hence, the interval represents the inherent uncertainty that comes with using sample data.

You can use these graphs to calculate probabilities for specific values. However, notice that you can’t place the population mean on the graph because that value is unknown. Consequently, you can’t calculate probabilities for the population mean, just as Neyman said!

Why P Values and Confidence Intervals Always Agree About Statistical Significance

You can use either P values or confidence intervals to determine whether your results are statistically significant. If a hypothesis test produces both, these results will agree.

The confidence level is equivalent to 1 – the alpha level. So, if your significance level is 0.05, the corresponding confidence level is 95%.

  • If the P value is less than your significance (alpha) level, the hypothesis test is statistically significant.
  • If the confidence interval does not contain the null hypothesis value, the results are statistically significant.
  • If the P value is less than alpha, the confidence interval will not contain the null hypothesis value.

For our example, the P value (0.031) is less than the significance level (0.05), which indicates that our results are statistically significant. Similarly, our 95% confidence interval [267 394] does not include the null hypothesis mean of 260 and we draw the same conclusion.

To understand why the results always agree, let’s recall how both the significance level and confidence level work.

  • The significance level defines the distance the sample mean must be from the null hypothesis to be considered statistically significant.
  • The confidence level defines the distance between the confidence limits and the sample mean.

Both the significance level and the confidence level define a distance from a limit to a mean. Guess what? The distances in both cases are exactly the same!

The distance equals the critical t-value * standard error of the mean . For our energy cost example data, the distance works out to be $63.57.

Imagine this discussion between the null hypothesis mean and the sample mean:

Null hypothesis mean, hypothesis test representative : Hey buddy! I’ve found that you’re statistically significant because you’re more than $63.57 away from me!

Sample mean, confidence interval representative : Actually, I’m significant because you’re more than $63.57 away from me !

Very agreeable aren’t they? And, they always will agree as long as you compare the correct pairs of P values and confidence intervals. If you compare the incorrect pair, you can get conflicting results, as shown by common mistake #1 in this post .
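The agreement can be checked with simple arithmetic using the numbers from the energy cost example. This is only a minimal sketch: the sample mean of 330.5 is an assumption, taken as the midpoint of the reported interval [267 394], and the variable names are illustrative.

```python
# Values quoted in the energy cost example.
null_mean = 260.0
margin = 63.57            # critical t-value * standard error of the mean
ci_low, ci_high = 267.0, 394.0

# Assumption: the sample mean is the midpoint of the reported interval.
sample_mean = (ci_low + ci_high) / 2

# Hypothesis-test view: is the sample mean farther than the margin from the null?
significant_by_test = abs(sample_mean - null_mean) > margin

# Confidence-interval view: does the interval around the sample mean exclude the null?
rebuilt_low = sample_mean - margin
rebuilt_high = sample_mean + margin
significant_by_ci = not (rebuilt_low <= null_mean <= rebuilt_high)

print(sample_mean, significant_by_test, significant_by_ci)
```

Both checks compare the same distance against the same margin, which is why they cannot disagree.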

Closing Thoughts

In statistical analyses, there tends to be a greater focus on P values and simply detecting a significant effect or difference. However, a statistically significant effect is not necessarily meaningful in the real world. For instance, the effect might be too small to be of any practical value.

It’s important to pay attention to both the magnitude and the precision of the estimated effect. That’s why I'm rather fond of confidence intervals. They allow you to assess these important characteristics along with the statistical significance. You'd like to see a narrow confidence interval where the entire range represents an effect that is meaningful in the real world.

If you like this post, you might want to read the previous posts in this series that use the same graphical framework:

  • Part One: Why We Need to Use Hypothesis Tests
  • Part Two: Significance Levels (alpha) and P values

For more about confidence intervals, read my post where I compare them to tolerance intervals and prediction intervals .

If you'd like to see how I made the probability distribution plot, please read: How to Create a Graphical Version of the 1-sample t-Test .

Hypothesis Testing

After completing this reading, you should be able to:

  • Construct an appropriate null hypothesis and alternative hypothesis and distinguish between the two.
  • Construct and apply confidence intervals for one-sided and two-sided hypothesis tests, and interpret the results of hypothesis tests with a specific level of confidence.
  • Differentiate between a one-sided and a two-sided test and identify when to use each test.
  • Explain the difference between Type I and Type II errors and how these relate to the size and power of a test.
  • Understand how a hypothesis test and a confidence interval are related.
  • Explain what the p-value of a hypothesis test measures.
  • Interpret the results of hypothesis tests with a specific level of confidence.
  • Identify the steps to test a hypothesis about the difference between two population means.
  • Explain the problem of multiple testing and how it can bias results.

Hypothesis testing is the process of determining whether a hypothesis about a population is consistent with the sample data. It starts by stating the null hypothesis and the alternative hypothesis. The null hypothesis is an assumption about the population parameter. The alternative hypothesis, on the other hand, specifies the parameter values for which the null hypothesis should be rejected. The critical values that separate rejection from non-rejection are determined by the distribution of the test statistic (when the null hypothesis is true) and the size of the test (the probability of rejecting the null hypothesis when it is true).

Components of a Hypothesis Test

The elements of a hypothesis test include:

  • The null hypothesis.
  • The alternative hypothesis.
  • The test statistic.
  • The size of the hypothesis test and errors
  • The critical value.
  • The decision rule.

The Null Hypothesis

As stated earlier, the first stage of the hypothesis test is the statement of the null hypothesis. The null hypothesis is the statement concerning the population parameter values. It conveys the notion that “there is nothing unusual about the data.”

The null hypothesis, denoted as H 0 , represents the current state of knowledge about the population parameter that’s the subject of the test. In other words, it represents the “status quo.” For example, the U.S. Food and Drug Administration may walk into a cooking oil manufacturing plant intending to confirm that each 1 kg oil package has, say, 0.15% cholesterol and not more. The inspectors will formulate a hypothesis like:

H 0 : Each 1 kg package has 0.15% cholesterol.

A test would then be carried out to confirm or reject the null hypothesis.

Other typical statements of H 0 include:

$$H_0:\mu={\mu}_0$$

$$H_0:\mu≤{\mu}_0$$

Where:

\(μ\) = the true population mean, and

\(μ_0\) = the hypothesized population mean.

The Alternative Hypothesis

The alternative hypothesis, denoted H 1 , contradicts the null hypothesis. The alternative hypothesis determines the values of the population parameter at which the null hypothesis is rejected. Thus, rejecting H 0 lends support to H 1 : we accept the alternative hypothesis when the “status quo” is discredited by the data.

Using our FDA example above, the alternative hypothesis would be:

H 1 : Each 1 kg package does not have 0.15% cholesterol.

The typical statements of H 1 include:

$$H_1:\mu \neq {\mu}_0$$

$$H_1:\mu > {\mu}_0$$

Note that we have stated the alternative hypothesis, which contradicted the above statement of the null hypothesis.

The Test Statistic

A test statistic is a standardized value computed from sample information when testing hypotheses. It compares the given data with what we would expect under the null hypothesis. Thus, it is a major determinant when deciding whether to reject H 0 , the null hypothesis.

We use the test statistic to gauge the degree of agreement between sample data and the null hypothesis. Analysts use the following formula when calculating the test statistic.

$$ \text{Test Statistic}= \frac{(\text{Sample Statistic–Hypothesized Value})}{(\text{Standard Error of the Sample Statistic})}$$

The test statistic is a random variable that changes from one sample to another. Test statistics can follow a variety of distributions. We shall focus on normally distributed test statistics because they are used in hypothesis tests concerning means, regression coefficients, and other econometric models.

We shall consider the hypothesis test on the mean. Consider a null hypothesis \(H_0:μ=μ_0\). Assume that the data are iid and that the sample mean is asymptotically normally distributed as:

$$\sqrt{n} (\hat{\mu}-\mu) \sim N(0, {\sigma}^2)$$

Where \({\sigma}^2\) is the variance of the sequence of the iid random variable used. The asymptotic distribution leads to the test statistic:

$$T=\frac{\hat{\mu}-{\mu}_0}{\sqrt{\frac{\hat{\sigma}^2}{n}}}\sim N(0,1)$$

Note this is consistent with our initial definition of the test statistic.

The following table  gives a brief outline of the various test statistics used regularly, based on the distribution that the data is assumed to follow:

$$\begin{array}{ll} \textbf{Hypothesis Test} & \textbf{Test Statistic}\\ \text{Z-test} & \text{z-statistic} \\ \text{Chi-Square Test} & \text{Chi-Square statistic}\\ \text{t-test} & \text{t-statistic} \\ \text{ANOVA} & \text{F-statistic}\\ \end{array}$$

We can subdivide the set of values that can be taken by the test statistic into two regions: the non-rejection region, which is consistent with H 0 , and the rejection region (critical region), which is inconsistent with H 0 . If the test statistic has a value found within the critical region, we reject H 0 .

Just like with any other statistic, the distribution of the test statistic must be specified entirely under H 0  when H 0  is true.

The Size of the Hypothesis Test and the Type I and Type II Errors

While using sample statistics to draw conclusions about the parameters of the population as a whole, there is always the possibility that the sample collected does not accurately represent the population. Consequently, statistical tests carried out using such sample data may yield incorrect results that may lead to erroneous rejection (or lack thereof) of the null hypothesis. We have two types of errors:

Type I Error

Type I error occurs when we reject a true null hypothesis. For example, a type I error would manifest in the form of rejecting H 0 : μ = 0 when μ is actually zero.

Type II Error

Type II error occurs when we fail to reject a false null hypothesis. In such a scenario, the test provides insufficient evidence to reject the null hypothesis when it’s false.

The level of significance, denoted by α, represents the probability of making a type I error, i.e., rejecting the null hypothesis when, in fact, it’s true. The probability of making a type II error is denoted by β. The two are not complements of each other, but for a fixed sample size they trade off: lowering α generally raises β. The ideal but practically impossible statistical test would be one that simultaneously minimizes α and β. We use α to determine critical values that subdivide the distribution into the rejection and the non-rejection regions.
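The meaning of α as a type I error probability can be illustrated with a small simulation. This is a rough sketch under stated assumptions: the data are drawn from a normal distribution with a known standard deviation, the null hypothesis μ = 0 is true by construction, and the sample size, number of trials, and seed are arbitrary illustrative choices.

```python
import random
from statistics import NormalDist, mean

random.seed(0)  # fixed seed so the run is repeatable

ALPHA = 0.05
N = 30          # observations per simulated sample (illustrative)
TRIALS = 5000   # number of simulated samples (illustrative)
SIGMA = 1.0     # population standard deviation, assumed known here

# Two-sided critical value from the standard normal distribution (about 1.96).
critical = NormalDist().inv_cdf(1 - ALPHA / 2)

rejections = 0
for _ in range(TRIALS):
    # Every sample is drawn with true mean 0, so any rejection is a type I error.
    sample = [random.gauss(0.0, SIGMA) for _ in range(N)]
    z = mean(sample) / (SIGMA / N ** 0.5)
    if abs(z) > critical:
        rejections += 1

type_i_rate = rejections / TRIALS
print(f"Observed type I error rate: {type_i_rate:.3f}")
```

The observed rejection rate should land close to the chosen α, since every rejection here is a false one.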

The Critical Value and the Decision Rule

The decision to reject or not to reject the null hypothesis is based on the distribution assumed by the test statistic. If the test statistic follows a normal distribution, we use the level of significance (α) of the test to obtain critical values from the standard normal distribution.

The decision rule combines the critical value (denoted by \(C_α\)), the alternative hypothesis, and the test statistic (T). It determines whether to reject the null hypothesis in favor of the alternative hypothesis or to fail to reject the null hypothesis.

The decision rule depends on the alternative hypothesis. When testing against a two-sided alternative, reject the null hypothesis if \(|T|>C_α\), that is, if the absolute value of the test statistic is greater than the critical value. When testing against a one-sided alternative, reject the null hypothesis if \(T<C_α\) for a one-sided lower alternative (where \(C_α\) is the negative, lower-tail critical value) and if \(T>C_α\) for a one-sided upper alternative. When a null hypothesis is rejected at an α significance level, we say that the result is significant at the α significance level.

Note that prior to decision-making, one must decide whether the test should be one-tailed or two-tailed. The following is a brief summary of the decision rules under different scenarios:

Left One-tailed Test

H 1 : parameter < X

Decision rule: Reject H 0  if the test statistic is less than the critical value. Otherwise,  do not reject  H 0.

Right One-tailed Test

H 1 : parameter > X

Decision rule: Reject H 0  if the test statistic is greater than the critical value. Otherwise,  do not reject  H 0.

Two-tailed Test

H 1 : parameter  ≠  X (not equal to X)

Decision rule: Reject H 0  if the test statistic is greater than the upper critical value or less than the lower critical value.

One-tailed Tests

When the alternative is one-sided upper, the rejection region lies in the right tail, and the hypotheses are stated as:

H 0 : μ ≤ μ 0 vs. H 1 : μ > μ 0.

When the alternative is one-sided lower, the rejection region lies in the left tail, and the hypotheses are stated as:

H 0 : μ ≥ μ 0 vs. H 1 : μ < μ 0.
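The three decision rules can be sketched as a single function. This is a minimal illustration that assumes the test statistic is standard normal under the null hypothesis; the function name `reject_null` and the labels for the alternatives are hypothetical choices made for this example.

```python
from statistics import NormalDist


def reject_null(t_stat: float, alpha: float, alternative: str) -> bool:
    """Return True if H0 is rejected, using standard normal critical values."""
    z = NormalDist()
    if alternative == "two-sided":
        # Split alpha across both tails.
        return abs(t_stat) > z.inv_cdf(1 - alpha / 2)
    if alternative == "upper":
        # All of alpha sits in the right tail.
        return t_stat > z.inv_cdf(1 - alpha)
    if alternative == "lower":
        # All of alpha sits in the left tail; the critical value is negative.
        return t_stat < z.inv_cdf(alpha)
    raise ValueError("alternative must be 'two-sided', 'upper', or 'lower'")


# The same test statistic can lead to different decisions.
print(reject_null(1.80, 0.05, "two-sided"))
print(reject_null(1.80, 0.05, "upper"))
print(reject_null(1.80, 0.05, "lower"))
```

A statistic of 1.80 falls short of the two-sided critical value (about 1.96) but exceeds the one-sided upper critical value (about 1.645), which is why the choice between a one-tailed and a two-tailed test has to be made before looking at the data.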

Example: Hypothesis Test on the Mean

Consider the returns from a portfolio \(X=(x_1,x_2,\dots, x_n)\) from 1980 through 2020. The approximated mean of the returns is 7.50%, with a standard deviation of 17%. We wish to determine whether the expected value of the return is different from 0 at a 5% significance level.

We start by stating the two-sided hypothesis test:

H 0 : μ =0 vs. H 1 : μ ≠ 0

The test statistic is:

$$T=\frac{\hat{\mu}-{\mu}_0}{\sqrt{\frac{\hat{\sigma}^2}{n}}} \sim N(0,1)$$

In this case, we have,

\(\hat{μ}=0.075\)

\(\hat{\sigma}^2=0.17^2\)

\(n=40\)

$$T=\frac{0.075-0}{\sqrt{\frac{0.17^2}{40}}} \approx 2.79$$

At the significance level \(α=5\%\), the critical values are \(±1.96\). Since this is a two-sided test, the rejection regions are ( \(-\infty,-1.96\) ) and (\(1.96, \infty \) ) as shown in the diagram below:

Rejection Regions - Two-Sided Test

Since the test statistic (2.79) is greater than 1.96, it falls in the rejection region, and we reject the null hypothesis at the 5% significance level.
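The calculation in this example can be reproduced in a few lines. This is a sketch using the figures quoted above (mean 7.50%, standard deviation 17%, n = 40); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu_hat = 0.075      # sample mean return
mu_0 = 0.0          # hypothesized mean under H0
sigma_hat = 0.17    # sample standard deviation
n = 40              # number of observations used in the example

# Test statistic: (estimate - hypothesized value) / standard error.
t_stat = (mu_hat - mu_0) / sqrt(sigma_hat ** 2 / n)

# Two-sided critical value at the 5% significance level.
critical = NormalDist().inv_cdf(1 - 0.05 / 2)

print(round(t_stat, 2), round(critical, 2), abs(t_stat) > critical)
```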

The example above is a Z-test, which is the main focus of this chapter and follows directly from the central limit theorem (CLT). However, we can use the Student’s t-distribution if the random variables are iid and normally distributed and the sample size is small (n<30).

With the Student’s t-distribution, we use the unbiased estimator of the variance. That is:

$$s^2=\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\hat{\mu}\right)^2$$

Therefore the test statistic for \(H_0:μ=μ_0\) is given by:

$$T=\frac{\hat{\mu}-{\mu}_0}{\sqrt{\frac{s^2}{n}}} \sim t_{n-1}$$

The Type II Error and the Test Power

The power of a test is the complement of the probability of a type II error. While the level of significance gives us the probability of rejecting the null hypothesis when it’s, in fact, true, the power of a test gives the probability of correctly rejecting the null hypothesis when it is false. Denoting the probability of a type II error by \(\beta\), the power of a test is given by:

$$ \text{Power of a Test}=1-\beta $$

The power of a test measures the likelihood that a false null hypothesis is rejected. It is influenced by the sample size, the distance between the hypothesized parameter and the true value, and the size of the test.
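The three influences on power can be seen numerically. The sketch below computes the power of a two-sided Z-test on the mean, assuming a known standard deviation and a normally distributed test statistic; the function name and the "true" mean of 5% are hypothetical values chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sided_z(true_mean, null_mean, sigma, n, alpha=0.05):
    """Probability of rejecting H0: mu = null_mean when the mean is true_mean."""
    z = NormalDist()
    c = z.inv_cdf(1 - alpha / 2)
    # How far the test statistic's distribution is shifted away from zero.
    shift = (true_mean - null_mean) / (sigma / sqrt(n))
    # Mass of the shifted distribution above +c plus the mass below -c.
    return (1 - z.cdf(c - shift)) + z.cdf(-c - shift)


base = power_two_sided_z(0.05, 0.0, 0.17, 40)
more_data = power_two_sided_z(0.05, 0.0, 0.17, 160)          # larger sample
larger_gap = power_two_sided_z(0.10, 0.0, 0.17, 40)          # true value farther away
smaller_size = power_two_sided_z(0.05, 0.0, 0.17, 40, 0.01)  # stricter test size

print(round(base, 3), round(more_data, 3), round(larger_gap, 3), round(smaller_size, 3))
```

Power rises with the sample size and with the distance between the hypothesized and true values, and falls when the test size is reduced. When the true mean equals the null value, the function returns α itself, since every rejection is then a type I error.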

Confidence Intervals

A confidence interval is a range of parameter values that is expected to contain the true parameter at a given confidence level. For instance, a 95% confidence interval consists of the set of parameter values for which the null hypothesis cannot be rejected when using a 5% test size. In general, a 1-α confidence interval contains the values that cannot be rejected at a test size of α.

It is important to note that the confidence interval depends on the alternative hypothesis statement in the test. Let us start with the two-sided test alternatives.

$$ H_0:μ=0$$

$$H_1:μ≠0$$

Then the \(1-α\) confidence interval is given by:

$$\left[\hat{\mu} -C_{\alpha} \times \frac{\hat {\sigma}}{\sqrt{n}} ,\hat{\mu} + C_{\alpha} \times \frac{\hat {\sigma}}{\sqrt{n}} \right]$$

\(C_α\) is the critical value at \(α\) test size.

Example: Calculating Two-Sided Alternative Confidence Intervals

Consider the returns from a portfolio \(X=(x_1,x_2,…, x_n)\) from 1980 through 2020. The approximated mean of the returns is 7.50%, with a standard deviation of 17%. Calculate the 95% confidence interval for the portfolio return.

The \(1-\alpha\) confidence interval is given by:

$$\begin{align*}&\left[\hat{\mu}-C_{\alpha} \times \frac{\hat {\sigma}}{\sqrt{n}} ,\hat{\mu} + C_{\alpha} \times \frac{\hat {\sigma}}{\sqrt{n}} \right]\\& =\left[0.0750-1.96 \times \frac{0.17}{\sqrt{40}}, 0.0750+1.96 \times \frac{0.17}{\sqrt{40}} \right]\\&=[0.02232,0.1277]\end{align*}$$

Thus, the confidence intervals imply any value of the null between 2.23% and 12.77% cannot be rejected against the alternative.
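The interval above can be recomputed directly, and doing so also shows the link to the earlier hypothesis test. This is a sketch using the figures from the example; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu_hat, sigma_hat, n = 0.075, 0.17, 40
alpha = 0.05

# Two-sided critical value (about 1.96) and the resulting margin of error.
c_alpha = NormalDist().inv_cdf(1 - alpha / 2)
half_width = c_alpha * sigma_hat / sqrt(n)

lower = mu_hat - half_width
upper = mu_hat + half_width
print(round(lower, 4), round(upper, 4))

# A hypothesized mean of 0 lies outside the interval, matching the earlier
# rejection of H0: mu = 0 at the 5% level.
zero_inside = lower <= 0.0 <= upper
print(zero_inside)
```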

One-Sided Alternative

For the one-sided alternative, the confidence interval is given by either:

$$\left(-\infty ,\hat{\mu} +C_{\alpha} \times \frac{\hat{\sigma}}{\sqrt{n}} \right )$$

for the lower alternative, or

$$\left ( \hat{\mu} -C_{\alpha} \times \frac{\hat{\sigma}}{\sqrt{n}},\infty \right )$$

for the upper alternative. In both cases, \(C_α\) is the positive one-sided critical value, and the interval collects the null values that the corresponding one-sided test would not reject.

Example: Calculating the One-Sided Alternative Confidence Interval

Assume that we were conducting the following one-sided test:

\(H_0:μ≤0\)

\(H_1:μ>0\)

Since this is the upper alternative, the 95% confidence interval for the portfolio return is:

$$\begin{align*}&\left(\hat{\mu} -C_{\alpha} \times \frac{\hat{\sigma}}{\sqrt{n}}, \infty \right )\\&=\left(0.0750-1.645\times \frac{0.17}{\sqrt{40}}, \infty\right)\\&=(0.0308, \infty)\end{align*}$$

On the other hand, if the hypothesis test was:

\(H_0:μ≥0\)

\(H_1:μ<0\)

The 95% confidence interval would be:

$$\begin{align*}&\left(-\infty ,\hat{\mu} +C_{\alpha} \times \frac{\hat{\sigma}}{\sqrt{n}} \right )\\&=\left(-\infty ,0.0750+1.645\times \frac{0.17}{\sqrt{40}}\right)\\&=(-\infty, 0.1192)\end{align*}$$

Note that the critical value decreased from 1.96 to 1.645 because a one-sided test places the entire 5% in a single tail.

The p-Value

When carrying out a statistical test with a fixed value of the significance level (α), we merely compare the observed test statistic with some critical value. For example, we might “reject H 0  using a 5% test” or “reject H 0 at 1% significance level”. The problem with this ‘classical’ approach is that it does not give us details about the  strength of the evidence  against the null hypothesis.

Determination of the p-value gives statisticians a more informative approach to hypothesis testing. The p-value is the lowest significance level at which we can reject H 0 . This means that the strength of the evidence against H 0 increases as the p-value becomes smaller. How the p-value is calculated depends on the alternative.

The p-Value for One-Tailed Test Alternative

For one-tailed tests, the  p-value  is given by the probability that lies below the calculated test statistic for left-tailed tests. Similarly, the likelihood that lies above the test statistic in right-tailed tests gives the  p-value.

Denoting the test statistic by T, the p-value for \(H_1:μ>0\) is given by:

$$P(Z>T)=1-P(Z≤T)=1- \Phi (T) $$

Conversely, for \(H_1:μ<0\) the p-value is given by:

$$ P(Z≤T)= \Phi (T)$$

Where Z is a standard normal random variable and \(\Phi\) is its cumulative distribution function.

The p-Value for Two-Tailed Test Alternative

If the test is two-tailed, the p-value is given by the sum of the probabilities in the two tails. We start by determining the probability lying below the negative value of the test statistic. Then, we add this to the probability lying above the positive value of the test statistic. That is, the p-value for the two-tailed hypothesis test is given by:

$$2\left[1-\Phi (|T|)\right]$$

The absolute value of T ensures that the tail areas are measured correctly whether T is negative or positive.

Example 1: p-Value for One-Sided Alternative

Let θ represent the probability of obtaining a head when a coin is tossed. Suppose we toss the coin 200 times, and heads come up in 85 of the trials. Test the following hypothesis at 5% level of significance.

H 0 : θ = 0.5

H 1 : θ < 0.5

First, note that the number of heads in repeated coin tosses follows a binomial distribution.

Our p-value is given by P(X ≤ 85), where X ~ Binomial(200, 0.5) with mean 100 (np = 200 × 0.5), assuming H 0 is true. Using the normal approximation with a continuity correction:

$$\begin{align*}P\left [ Z< \frac{85.5-100}{\sqrt{50}} \right]&=P(Z<-2.05)\\&=1-0.97982=0.02018 \end{align*}$$

Recall that for a binomial distribution, the variance is given by:

$$np(1-p)=200(0.5)(1-0.5)=50$$

(We have applied the Central Limit Theorem by taking the binomial distribution as approx. normal)

Since the p-value (0.02018) is less than 0.05, we reject H 0 at the 5% level of significance: the data provide evidence against H 0 that favors H 1 . Thus, clearly expressing this result, we could say:

“There is evidence, at the 5% level of significance, against the hypothesis that the coin is fair. We, therefore, conclude that the coin is biased against heads.”

Remember, failure to reject H 0  does not mean it’s true. It means there’s insufficient evidence to justify rejecting H 0,  given a certain level of significance.
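As a cross-check on Example 1, the normal approximation can be compared with the exact binomial tail probability. This is a sketch using only the numbers from the example (200 tosses, 85 heads, θ = 0.5 under the null); the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, heads, theta_0 = 200, 85, 0.5

# Exact lower-tail probability P(X <= 85) under H0, summed term by term.
p_exact = sum(
    comb(n, k) * theta_0 ** k * (1 - theta_0) ** (n - k)
    for k in range(heads + 1)
)

# Normal approximation with a continuity correction, as in the example.
mean = n * theta_0
variance = n * theta_0 * (1 - theta_0)
z = (heads + 0.5 - mean) / sqrt(variance)
p_approx = NormalDist().cdf(z)

print(round(z, 2), round(p_approx, 4), round(p_exact, 4))
```

The two probabilities should be close, and both fall below the 5% level of significance, so the decision to reject H 0 does not depend on the approximation.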

Example 2:  p-Value for Two-Sided Alternative

A CFA candidate conducts a statistical test about the mean value of a random variable X.

H 0 : μ = μ 0  vs. H 1 : μ  ≠  μ 0

She obtains a test statistic of 2.2. Given a 5% significance level, determine and interpret the p-value.

$$ \text{P-value}=2P(Z>2.2)=2[1–P(Z≤2.2)]  =1.39\%×2=2.78\%$$

(We have multiplied by two since this is a two-tailed test)

Example - Two-Sided Test

The p-value (2.78%) is less than the level of significance (5%). Therefore, we have sufficient evidence to reject H 0 . In fact, the evidence is strong enough that we would also reject H 0 at significance levels of 4% and 3%. However, at significance levels of 2% or 1%, we would not reject H 0 since the p-value exceeds these values.
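The p-value in Example 2, and the decisions at each significance level, can be checked with a short sketch. It assumes, as the example does, that the test statistic is standard normal under the null hypothesis; the variable names are illustrative.

```python
from statistics import NormalDist

t_stat = 2.2

# Two-tailed p-value: twice the area beyond |T| in one tail.
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
print(f"p-value = {p_value:.4f}")

# Reject H0 whenever the p-value is below the chosen significance level.
decisions = {alpha: p_value < alpha for alpha in (0.05, 0.04, 0.03, 0.02, 0.01)}
for alpha, reject in decisions.items():
    print(alpha, "reject H0" if reject else "do not reject H0")
```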

Hypothesis about the Difference between Two Population Means

It’s common for analysts to be interested in establishing whether there exists a significant difference between the means of two different populations. For instance, they might want to know whether the average returns for two subsidiaries of a given company exhibit  significant  differences.

Now, consider a bivariate random variable:

$$W_i=[X_i,Y_i]$$

Assume that the components \(X_i\) and \(Y_i\) are both iid and are correlated. That is: \(\text{Corr} (X_i,Y_i )≠0\)

Now, suppose that we want to test the hypothesis that:

$$H_0:μ_X=μ_Y$$

$$H_1:μ_X≠μ_Y$$

In other words, we want to test whether the constituent random variables have equal means. Note that the hypothesis statement above can be written as:

$$H_0:μ_X-μ_Y=0$$

$$H_1:μ_X-μ_Y≠0$$

To execute this test, consider the variable:

$$Z_i=X_i-Y_i$$

Therefore, considering the above random variable, if the null hypothesis is correct then,

$$E(Z_i)=E(X_i)-E(Y_i)=μ_X-μ_Y=0$$

Intuitively, this can be considered as a standard hypothesis test of

H 0 : μ Z =0 vs. H 1 : μ Z  ≠ 0.

The test statistic is given by:

$$T=\frac{\hat{\mu}_z}{\sqrt{\frac{\hat{\sigma}^2_z}{n}}} \sim N(0,1)$$

Note that the test statistic formula accounts for the correlation between \(X_i \) and \(Y_i\). It is easy to see that:

$$V(Z_i)=V(X_i )+V(Y_i)-2COV(X_i, Y_i)$$

In terms of sample estimates, this can be written as:

$$\hat{\sigma}^2_z =\hat{\sigma}^2_X +\hat{\sigma}^2_Y - 2\hat{\sigma}_{XY}$$

$$ \hat{\mu}_z =\hat{\mu}_X-\hat{\mu}_Y $$

And thus the test statistic formula can be written as:

$$T=\frac{\hat{\mu}_X -\hat{\mu}_Y}{\sqrt{\frac{\hat{\sigma}^2_X +\hat{\sigma}^2_Y - 2\hat{\sigma}_{XY}}{n}}}$$

This formula indicates that correlation plays a crucial role in determining the magnitude of the test statistic.

A special case of the test statistic arises when \(X_i\) and \(Y_i\) are iid and independent of each other. The test statistic is then given by:

$$T=\frac{\hat{\mu}_X -\hat{\mu}_Y}{\sqrt{\frac{\hat{\sigma}^2_X}{n_X}+\frac{\hat{\sigma}^2_Y}{n_Y}}}$$

Where \(n_X\)  and \(n_Y\)  are the sample sizes of \(X_i\), and \(Y_i\) respectively.

Example: Hypothesis Test on Two Means

An investment analyst wants to test whether there is a significant difference between the means of two portfolios at a 95% confidence level. The first portfolio X consists of 30 government-issued bonds and has a mean of 10% and a standard deviation of 2%. The second portfolio Y consists of 30 private bonds with a mean of 14% and a standard deviation of 3%. The correlation between the two portfolios is 0.7. Calculate the test statistic and state whether the null hypothesis is rejected.

The hypothesis statement is given by:

H 0 : μ X – μ Y =0 vs. H 1 : μ X – μ Y ≠ 0.

Note that this is a two-tailed test. At the 95% confidence level, the test size is α=5%, and thus the critical values are \(C_α=±1.96\).

Recall that:

$$Cov(X, Y)=σ_{XY}=ρ_{XY} σ_X σ_Y$$

Where ρ_XY  is the correlation coefficient between X and Y.

Now the test statistic is given by:

$$T=\frac{{\mu}_X -{\mu}_Y}{\sqrt{\frac{\hat{\sigma}^2_X +\hat{\sigma}^2_Y - 2{\sigma}_{XY}}{n}}}=\frac{{\mu}_X -{\mu}_Y}{\sqrt{\frac{\hat{\sigma}^2_X +\hat{\sigma}^2_Y - 2{\rho}_{XY} {\sigma}_X {\sigma}_Y}{n}}}$$

$$=\frac{0.10-0.14}{\sqrt{\frac{0.02^2 +0.03^2-2\times 0.7 \times 0.02 \times 0.03}{30}}}=-10.215$$

The test statistic is far less than -1.96. Therefore, the null hypothesis is rejected at the 95% confidence level (5% test size).
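As a quick arithmetic check, the calculation above can be reproduced in a few lines of Python. This is a minimal sketch: the function name `paired_mean_test_stat` is our own label, not something from the reading, and the inputs are simply the figures quoted in the example.

```python
import math

def paired_mean_test_stat(mean_x, mean_y, sd_x, sd_y, rho, n):
    """Test statistic for H0: mu_X - mu_Y = 0 with correlated samples of common size n."""
    cov_xy = rho * sd_x * sd_y                  # Cov(X, Y) = rho * sigma_X * sigma_Y
    var_z = sd_x**2 + sd_y**2 - 2.0 * cov_xy    # V(Z) = V(X) + V(Y) - 2 Cov(X, Y)
    return (mean_x - mean_y) / math.sqrt(var_z / n)

# Figures from the example: means 10% and 14%, standard deviations 2% and 3%,
# correlation 0.7, and n = 30 observations in each portfolio.
t_stat = paired_mean_test_stat(0.10, 0.14, 0.02, 0.03, rho=0.7, n=30)
reject = abs(t_stat) > 1.96    # two-tailed test with size 5%
print(round(t_stat, 3), reject)
```

Setting `rho=0.0` in the same call shows how much smaller in magnitude the statistic becomes when the positive correlation is ignored.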

The Problem of Multiple Testing

Multiple testing occurs when multiple hypothesis tests are conducted on the same data set. Reusing the data in this way produces spurious results and unreliable conclusions that do not hold up to scrutiny. The fundamental problem with multiple testing is that the test size (i.e., the probability that a true null is rejected) applies only to a single test. Repeated testing produces an overall test size that is much larger than the assumed size α, and therefore increases the probability of a Type I error.

Several methods have been developed to address multiple testing. These include the Bonferroni correction and procedures that control the False Discovery Rate (FDR) or the Familywise Error Rate (FWER).
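To see how quickly the overall test size grows, the sketch below computes the probability of at least one false rejection across m tests. It assumes the tests are independent and that every null is true, which is a simplification we introduce for illustration; the function names are ours.

```python
def familywise_error_rate(alpha, m):
    """P(at least one false rejection) across m independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_alpha(alpha, m):
    """Per-test size under the Bonferroni correction: alpha divided by the number of tests."""
    return alpha / m

for m in (1, 5, 20):
    uncorrected = familywise_error_rate(0.05, m)
    corrected = familywise_error_rate(bonferroni_alpha(0.05, m), m)
    print(m, round(uncorrected, 3), round(corrected, 3))
```

With 20 tests at α = 5%, the uncorrected familywise rate is roughly 64%, while the Bonferroni-corrected rate stays below 5%.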

Practice Question

An experiment was done to find out the number of hours that candidates spend preparing for the FRM part 1 exam. For a sample of 10 students, the average study time was found to be 312.7 hours, with a standard deviation of 7.2 hours. What is the 95% confidence interval for the mean study time of all candidates?

A. [307.5, 317.9]

B. [310, 317]

C. [300, 317]

D. [307.5, 312.2]

The correct answer is A.

To calculate the 95% confidence interval for the mean study time of all candidates, we can use the formula for the confidence interval when the population variance is unknown:

\[\text{Confidence Interval} = \bar{X} \pm t_{1-\frac{\alpha}{2}} \times \frac{s}{\sqrt{n}}\]

Where:

  • \(\bar{X}\) is the sample mean
  • \(t_{1-\frac{\alpha}{2}}\) is the t-score corresponding to the desired confidence level and degrees of freedom
  • \(s\) is the sample standard deviation
  • \(n\) is the sample size

In this case:

  • \(\bar{X} = 312.7\) hours (the average study time)
  • \(s = 7.2\) hours (the standard deviation of study time)
  • \(n = 10\) students (the sample size)

To find the t-score (\(t_{1-\frac{\alpha}{2}}\)), we look at the t-table for the 95% confidence level (which corresponds to \(\alpha = 0.05\)) and 9 degrees of freedom (\(n - 1 = 10 - 1 = 9\)). The t-score is 2.262.

Now, we can plug these values into the confidence interval formula:

\[\text{Confidence Interval} = 312.7 \pm 2.262 \times \frac{7.2}{\sqrt{10}}\]

Calculating the margin of error:

\[\text{Margin of Error} = 2.262 \times \frac{7.2}{\sqrt{10}} \approx 5.2\]

So the confidence interval is:

\[\text{Confidence Interval} = 312.7 \pm 5.2 = [307.5, 317.9]\]

Therefore, the 95% confidence interval for the mean study time of all candidates is [307.5, 317.9] hours.
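The same interval can be checked numerically. The sketch below takes the t-score of 2.262 from the solution as given (it is passed in rather than computed, since the Python standard library has no t-distribution quantile function); the helper name is our own.

```python
import math

def t_confidence_interval(mean, sd, n, t_crit):
    """Two-sided interval: mean +/- t_crit * sd / sqrt(n)."""
    margin = t_crit * sd / math.sqrt(n)
    return mean - margin, mean + margin

# Values from the practice question; 2.262 is the table value for 9 degrees of freedom.
low, high = t_confidence_interval(312.7, 7.2, 10, t_crit=2.262)
print(round(low, 1), round(high, 1))
```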


Unit 12: Significance tests (hypothesis testing)

About this unit.

Significance tests give us a formal process for using sample data to evaluate the likelihood of some claim about a population value. Learn how to conduct significance tests and calculate p-values to see how likely a sample result is to occur by random chance. You'll also see how we use p-values to make conclusions about hypotheses.

The idea of significance tests

  • Simple hypothesis testing
  • Idea behind hypothesis testing
  • Examples of null and alternative hypotheses
  • P-values and significance tests
  • Comparing P-values to different significance levels
  • Estimating a P-value from a simulation
  • Using P-values to make conclusions
  • Practice: Simple hypothesis testing
  • Practice: Writing null and alternative hypotheses
  • Practice: Estimating P-values from simulations

Error probabilities and power

  • Introduction to Type I and Type II errors
  • Type 1 errors
  • Examples identifying Type I and Type II errors
  • Introduction to power in significance tests
  • Examples thinking about power in significance tests
  • Consequences of errors and significance
  • Practice: Type I vs Type II error
  • Practice: Error probabilities and power

Tests about a population proportion

  • Constructing hypotheses for a significance test about a proportion
  • Conditions for a z test about a proportion
  • Reference: Conditions for inference on a proportion
  • Calculating a z statistic in a test about a proportion
  • Calculating a P-value given a z statistic
  • Making conclusions in a test about a proportion
  • Practice: Writing hypotheses for a test about a proportion
  • Practice: Conditions for a z test about a proportion
  • Practice: Calculating the test statistic in a z test for a proportion
  • Practice: Calculating the P-value in a z test for a proportion
  • Practice: Making conclusions in a z test for a proportion

Tests about a population mean

  • Writing hypotheses for a significance test about a mean
  • Conditions for a t test about a mean
  • Reference: Conditions for inference on a mean
  • When to use z or t statistics in significance tests
  • Example calculating t statistic for a test about a mean
  • Using TI calculator for P-value from t statistic
  • Using a table to estimate P-value from t statistic
  • Comparing P-value from t statistic to significance level
  • Free response example: Significance test for a mean
  • Practice: Writing hypotheses for a test about a mean
  • Practice: Conditions for a t test about a mean
  • Practice: Calculating the test statistic in a t test for a mean
  • Practice: Calculating the P-value in a t test for a mean
  • Practice: Making conclusions in a t test for a mean

More significance testing videos

  • Hypothesis testing and p-values
  • One-tailed and two-tailed tests
  • Z-statistics vs. T-statistics
  • Small sample hypothesis test
  • Large sample proportion hypothesis testing

Confidence intervals and hypothesis testing

  • Understand the t value and Pr(>|t|) fields in the output of lm
  • Be able to think critically about the meaning and limitations of strict hypothesis tests

Confidence intervals and hypothesis tests

T-statistics.

Suppose we’re interested in the value \(\beta_k\), the \(k\)-th entry of \(\betav\) in some regression \(\y_n \sim \betav^\trans \xv_n\). Recall that we have been finding \(\v\) such that

\[ \sqrt{N} (\betahat_k - \beta_k) \rightarrow \gauss{0, \v}. \]

For example, under homoskedastic assumptions with \(\y_n = \xv_n^\trans \betav + \res_n\), we have

\[ \begin{aligned} \v =& \sigma^2 (\Xcov^{-1})_{kk} \textrm{ where } \\ \Xcov =& \lim_{N \rightarrow \infty} \frac{1}{N} \X^\trans \X \textrm{ and } \\ \sigma^2 =& \var{\res_n}. \end{aligned} \]

Typically we don’t know \(\v\), but have \(\hat\v\) such that \(\hat\v \rightarrow \v\) as \(N \rightarrow \infty\). Again, under homoskedastic assumptions,

\[ \begin{aligned} \hat\v =& \hat\sigma^2 \left(\left(\frac{1}{N} \X^\trans \X \right)^{-1}\right)_{kk} \textrm{ where } \\ \hat\sigma^2 =& \frac{1}{N-P} \sumn \reshat_n^2. \end{aligned} \]

Putting all this together, the quantity

\[ \t = \frac{\sqrt{N} (\betahat_k - \beta_k)}{\sqrt{\hat\v}} = \frac{\betahat_k - \beta_k}{\sqrt{\hat\v / N}} \]

has an approximately standard normal distribution for large \(N\) .

Quantities of this form are called “T–statistics,” since, under our normal assumptions, we have shown that

\[ \t \sim \studentt{N-P}, \]

exactly for all \(N\). Despite its name, it’s worth remembering that a T–statistic is actually not Student T distributed in general; it is asymptotically normal. Recall that for large \(N\), the Student T and standard normal distributions coincide.

Plugging in values for \(\beta_k\)

However, there’s something funny about a “T-statistic” — as written, you cannot compute it, because you don’t know \(\beta_k\) . In fact, finding what values \(\beta_k\) might plausibly take is the whole point of statistical inference.

So what good is a T–statistic? Informally, one way to reason about it is as follows. Let’s take some concrete values for an example. Suppose we guess that \(\beta_k^0\) is the value, and compute

\[ \betahat_k = 2 \quad\textrm{and}\quad \sqrt{\hat\v / N} = 3 \quad\textrm{so}\quad \t = \frac{2 - \beta_k^0}{3}. \]

We use the superscript \(0\) to indicate that \(\beta_k^0\) is our guess, not necessarily the true value.

Suppose we plug in some particular value, such as \(\beta_k^0 = 32\) . Using this value, we compute our T–statistic, and find that it’s very large — in our example, we would have \(\t = (2 - 32) / 3 = -30\) . It’s very unlikely to get a standard normal (or Student T) draw this large. Therefore, either:

  • We got a very (very very very very) unusual draw of our standard normal or
  • We guessed wrong, i.e.  \(\beta_k \ne \beta_k^0 = 32\) .

In this way, we might consider it plausible to “reject” the hypothesis that \(\beta_k = 32\) .

There’s a subtle problem with the preceding reasoning, however. Suppose we do the same calculation with \(\beta_k^0 = 1\) . Then \(\t = (2 - 1) / 3 = 1/3\) . This is a much more typical value for a standard normal distribution. However, the probability of getting exactly \(1/3\) — or, indeed, any particular value — is zero, since the normal distribution is continuous valued. (This problem is easiest to see with continuous random variables, but the same basic problem will occur when the distribution is discrete but spread over a large number of possible values.)

Rejection regions

To resolve this problem, we can specify regions that we consider implausible. That is, suppose we take a region \(R\) such that, if \(\t\) is standard normal (or Student-T), then

\[ \prob{\t \in R} \le \alpha \quad\textrm{for some small }\alpha. \]

For example, we might take \(\Phi^{-1}(\cdot)\) to be the inverse CDF of \(\t\) if \(\beta_k = \beta_k^0\) . Then we can take

\[ R_{ts} = \{\t: \abs{\t} \ge q \} \quad\textrm{where } q = \Phi^{-1}(1 - \alpha / 2) \]

where \(q\) is the \(1 - \alpha / 2\) quantile of the distribution of \(\t\), so that each tail of the region has probability \(\alpha / 2\). But there are other choices, such as

\[ \begin{aligned} R_{u} ={}& \{\t: \t \ge q \} \quad\textrm{where } q = \Phi^{-1}(1 - \alpha) \\ R_{l} ={}& \{\t: \t \le q \} \quad\textrm{where } q = \Phi^{-1}(\alpha) \\ R_{m} ={}& \{\t: \abs{\t} \le q \} \quad\textrm{where } q = \Phi^{-1}(0.5 + \alpha / 2) \quad\textrm{(!!!)}\\ R_{\infty} ={}& \begin{cases} (-\infty,\infty) & \textrm{ with independent probability } \alpha \\ \emptyset & \textrm{ with independent probability } 1 - \alpha \\ \end{cases} \quad\textrm{(!!!)} \end{aligned} \]

The last two may seem silly, but they are still rejection regions into which \(\t\) is unlikely to fall if it has a standard normal distribution.
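As a numerical sanity check, the sketch below computes the probability that a standard normal draw lands in each deterministic region. It uses Python's standard-library `NormalDist` and takes the two-sided cutoff to be the positive \(1 - \alpha/2\) quantile; the variable names are our own.

```python
from statistics import NormalDist

Z = NormalDist()      # standard normal distribution of t under the null
alpha = 0.05

q_ts = Z.inv_cdf(1 - alpha / 2)     # cutoff for the two-sided region R_ts
q_u = Z.inv_cdf(1 - alpha)          # cutoff for the upper region R_u
q_l = Z.inv_cdf(alpha)              # cutoff for the lower region R_l
q_m = Z.inv_cdf(0.5 + alpha / 2)    # cutoff for the "silly" middle region R_m

prob_ts = 2 * (1 - Z.cdf(q_ts))        # P(|t| >= q_ts)
prob_u = 1 - Z.cdf(q_u)                # P(t >= q_u)
prob_l = Z.cdf(q_l)                    # P(t <= q_l)
prob_m = Z.cdf(q_m) - Z.cdf(-q_m)      # P(|t| <= q_m)

for name, p in [("R_ts", prob_ts), ("R_u", prob_u), ("R_l", prob_l), ("R_m", prob_m)]:
    print(name, round(p, 4))
```

Each probability comes out equal to \(\alpha\), which is all that the definition of a rejection region requires. The randomized region \(R_\infty\) is omitted because its rejection probability is \(\alpha\) by construction.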

How can we think about \(\alpha\) , and about the choice of the region? Recall that

  • If \(\t \in R\) , we “reject” the proposed value of \(\beta_k^0\)
  • If \(\t \notin R\) , we “fail to reject” the given value of \(\beta_k^0\) .

Of course, we don’t “accept” the value of \(\beta_k^0\) in the sense of believing that \(\beta_k^0 = \beta_k\) — if nothing else, there will always be multiple values of \(\beta_k^0\) that we do not reject, and \(\beta_k\) cannot be equal to all of them.

So there are two ways to make an error:

  • Type I error: We are correct and \(\beta_k = \beta_k^0\) , but \(\t \in R\) and we reject
  • Type II error: We are incorrect and \(\beta_k \ne \beta_k^0\) , but \(\t \notin R\) and we fail to reject

By definition of the region \(R\) , we have that

\[ \prob{\textrm{Type I error}} \le \alpha. \]

This is true for all the regions above, including the silly ones!

What about the Type II error? It must depend on the “true” value of \(\beta_k\) , and on the shape of the rejection region we choose. Note that

\[ \t = \frac{\betahat_k - \beta_k^0}{\sqrt{\hat\v / N}} = \frac{\betahat_k - \beta_k}{\sqrt{\hat\v / N}} + \frac{\beta_k - \beta_k^0}{\sqrt{\hat\v / N}} \]

So if the true value \(\beta_k \gg \beta_k^0\) , then our \(\t\) statistic is too large, and so on.

For example:

  • If \(\beta_k \gg \beta_k^0\):
      • Then \(\t\) is too large and positive.
      • \(R_u\) and \(R_{ts}\) will reject, but \(R_l\) will not.
      • The Type II error of \(R_u\) will be lowest, then \(R_{ts}\), then \(R_l\).
      • \(R_l\) actually has greater Type II error than the silly regions, \(R_\infty\) and \(R_m\).
  • If \(\beta_k \ll \beta_k^0\):
      • Then \(\t\) is too large and negative.
      • \(R_l\) and \(R_{ts}\) will reject, but \(R_u\) will not.
      • The Type II error of \(R_l\) will be lowest, then \(R_{ts}\), then \(R_u\).
      • \(R_u\) actually has greater Type II error than the silly regions, \(R_\infty\) and \(R_m\).
  • If \(\beta_k \approx \beta_k^0\):
      • Then \(\t\) has about the same distribution as when \(\beta_k^0 = \beta_k\).
      • All the regions reject just about as often as we commit a Type I error, that is, a proportion \(\alpha\) of the time.

Thus the shape of the region determines which alternatives you are able to reject. The probability of “rejecting” under a particular alternative is called the “power” of a test; the power is one minus the Type II error rate.
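The ordering of the regions can be illustrated numerically. The sketch below treats \(\t\) as exactly normal with unit variance and mean equal to the shift \((\beta_k - \beta_k^0) / \sqrt{\hat\v / N}\); this is an idealization that ignores the randomness in \(\hat\v\), and the function and region labels are our own.

```python
from statistics import NormalDist

Z = NormalDist()

def power(region, shift, alpha=0.05):
    """P(t in R) when t ~ N(shift, 1); shift = 0 corresponds to a true null."""
    if region == "upper":        # R_u
        return 1 - Z.cdf(Z.inv_cdf(1 - alpha) - shift)
    if region == "lower":        # R_l
        return Z.cdf(Z.inv_cdf(alpha) - shift)
    if region == "two-sided":    # R_ts
        q = Z.inv_cdf(1 - alpha / 2)
        return (1 - Z.cdf(q - shift)) + Z.cdf(-q - shift)
    if region == "middle":       # R_m, the "silly" region around zero
        q = Z.inv_cdf(0.5 + alpha / 2)
        return Z.cdf(q - shift) - Z.cdf(-q - shift)
    raise ValueError(f"unknown region: {region}")

# A large positive shift: the true beta_k is well above the guess beta_k^0.
# The randomized region R_inf is not listed; it rejects with probability alpha at any shift.
for region in ("upper", "two-sided", "middle", "lower"):
    print(region, power(region, shift=3.0))
```

With a shift of 3, the upper region has the highest power, followed by the two-sided region, while the lower region rejects even less often than the middle region.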

The null and alternative

Statistics has some formal language to distinguish between the “guess” \(\beta_k^0\) and other values.

  • The guess \(\beta_k = \beta_k^0\) is called the null hypothesis
  • The other values, \(\beta_k \ne \beta_k^0\), are called the alternative
  • Falsely rejecting the null hypothesis is called a Type I error
  • By construction, Type I errors occur with probability at most \(\alpha\)
  • Falsely failing to reject the null hypothesis is called a Type II error
  • The probability of a Type II error depends on the alternative(s) and the shape of the rejection region.

The choice of a test statistic (here, \(\t\) ), together with a rejection region (here, \(R\) ) constitute a “test” of the null hypothesis. In general, one can imagine constructing many different tests, with different theoretical guarantees and power.

Confidence intervals

Often in applied statistics, a big deal is made about a single hypothesis test, particularly the null that \(\beta_k^0 = 0\) . Often this is not a good idea. Typically, we do not care whether \(\beta_k\) is precisely zero; rather, we care about the set of plausible values \(\beta_k\) might take. The distinction can be expressed as the difference between statistical and practical significance:

  • Statistical significance is the size of an effect relative to sampling variability
  • Practical significance is the size of the effect in terms of its effect on reality.

For example, suppose that \(\beta_k\) is nonzero but very small, but \(\sqrt{\hat\v / N}\) is very small, too. We might reject the null hypothesis \(\beta_k^0 = 0\) with a high degree of certainty, and call our result statistically significant . However, a small value of \(\beta_k\) may still not be a meaningful effect size for the problem at hand, i.e., it may not be practically significant .

A remedy is confidence intervals, which are actually closely related to our hypothesis tests. Recall that we have been constructing intervals \(I\) satisfying

\[ \prob{\beta_k \in I} \ge 1-\alpha \]

of the form

\[ I = \left(\betahat_k \pm q \sqrt{\hat\v / N}\right), \]

where \(q = \Phi^{-1}(1 - \alpha / 2)\), and \(\Phi\) is the CDF of either the standard normal or Student T distribution. It turns out that \(I\) is precisely the set of values that we would not reject with region \(R_{ts}\). And, indeed, given a confidence interval, a valid test of the hypothesis \(\beta_k^0\) is given by rejecting if and only if \(\beta_k^0 \notin I\).

This duality is entirely general:

  • The set of values that a valid test does not reject is a valid confidence interval
  • Checking whether a value falls in a valid confidence interval is a valid test
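The duality can be checked directly on the running example (\(\betahat_k = 2\), standard error 3). This is a sketch using the normal approximation; the helper names and the list of candidate null values are our own choices for illustration.

```python
from statistics import NormalDist

def two_sided_interval(beta_hat, se, alpha=0.05):
    """Interval beta_hat +/- q * se, with q the 1 - alpha/2 standard normal quantile."""
    q = NormalDist().inv_cdf(1 - alpha / 2)
    return beta_hat - q * se, beta_hat + q * se

def two_sided_test_rejects(beta_hat, se, beta_null, alpha=0.05):
    """Reject when |t| >= q, i.e. when t falls in the two-sided rejection region."""
    q = NormalDist().inv_cdf(1 - alpha / 2)
    return abs((beta_hat - beta_null) / se) >= q

beta_hat, se = 2.0, 3.0          # estimate and standard error from the running example
low, high = two_sided_interval(beta_hat, se)
checks = []
for beta_null in (32.0, 1.0, 0.0, -5.0):     # candidate null values (illustrative)
    in_interval = low < beta_null < high
    rejected = two_sided_test_rejects(beta_hat, se, beta_null)
    checks.append((beta_null, in_interval, rejected))
    print(beta_null, in_interval, rejected)
```

For every candidate value, being inside the interval coincides with not being rejected, and vice versa.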


Statistics LibreTexts

2.9: Confidence intervals and bootstrapping


  • Mark Greenwood
  • Montana State University


Up to this point the focus has been on hypotheses, p-values, and estimates of the size of differences, but we have not yet explored inference techniques for the size of the difference. Confidence intervals provide an interval where we are __% confident that the true parameter lies. The idea of “confidence” is that if we repeatedly took random samples from the same population and made a confidence interval from each in the same way, the specified percentage (usually 95%) of these intervals would contain the true parameter. We only get to make one interval and so it either has the true parameter in it or not, and we don’t know the truth in real situations.

Confidence intervals can be constructed with parametric and nonparametric approaches. The nonparametric approach uses what is called bootstrapping, which draws its name from “pull yourself up by your bootstraps” where you improve your situation based on your own efforts. In statistics, we make our situation or inferences better by re-using the observations we have by assuming that the sample represents the population. Since each observation represents other similar observations in the population that we didn’t get to measure, if we sample with replacement to generate a new data set of size \(n\) from our data set (also of size \(n\)) it mimics the process of taking repeated random samples of size \(n\) from our population of interest. This process also ends up giving us useful sampling distributions of statistics even when our standard normality assumption is violated, similar to what we encountered in the permutation tests. Bootstrapping is especially useful in situations where we are interested in statistics other than the mean (say we want a confidence interval for a median or a standard deviation) or when we consider functions of more than one parameter and don’t want to derive the distribution of the statistic (say the difference in two medians). Here, bootstrapping is used to provide more trustworthy inferences when some of our assumptions (especially normality) might be violated for our parametric confidence interval procedure.

To perform bootstrapping, the resample function from the mosaic package will be used. We can apply this function to a data set and get a new version of the data set by sampling new observations with replacement from the original one 52 . The new, bootstrapped version of the data set (called dsample_BTS below) contains a new variable called orig.id which is the number of the subject from the original data set. By summarizing how often each of these id’s occurred in a bootstrapped data set, we can see how the re-sampling works. The table function will count up how many times each observation was used in the bootstrap sample, providing a row with the id followed by a row with the count 53 . In the first bootstrap sample shown, the 1st, 14th, and 26th observations were sampled twice, the 9th and 28th observations were sampled four times, and the 4th, 5th, 6th, and many others were not sampled at all. Bootstrap sampling thus picks some observations multiple times and to do that it has to ignore some 54 observations.

Like in permutations, one randomization isn’t enough. A second bootstrap sample is also provided to help you get a sense of what bootstrap data sets contain. It did not select observations two through five but did select eight others more than once. You can see other variations in the resulting re-sampling of subjects with the most sampled observation used four times. With \(n = 30\), the chance of selecting any observation for any slot in the new data set is \(1/30\) and the expected or mean number of appearances we expect to see for an observation is the number of random draws times the probability of selection on each, so \(30 \times 1/30 = 1\). So we expect to see each observation in the bootstrap sample on average once but random variability in the samples then creates the possibility of seeing it more than once or not at all.
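The counting argument can be illustrated with a small simulation. The sketch below uses plain Python rather than the R `resample` and `table` functions described in the text; the seed and the helper name `bootstrap_counts` are arbitrary choices of ours.

```python
import random
from collections import Counter

def bootstrap_counts(n, rng):
    """Draw n indices with replacement from range(n) and count each index's appearances."""
    draws = rng.choices(range(n), k=n)
    counts = Counter(draws)
    return [counts[i] for i in range(n)]   # Counter returns 0 for indices never drawn

n = 30
rng = random.Random(1)                     # seed chosen arbitrarily for reproducibility
counts = bootstrap_counts(n, rng)

mean_appearances = sum(counts) / n         # always exactly 1: n draws spread over n observations
never_sampled = counts.count(0)            # varies from one bootstrap sample to the next
prob_unsampled = (1 - 1 / n) ** n          # chance that a given observation is missed entirely
print(mean_appearances, never_sampled, round(prob_unsampled, 3))
```

The last quantity, \((29/30)^{30} \approx 0.36\), is the probability that any particular observation does not appear in a given bootstrap sample, which is why a sizeable share of observations is left out each time.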

We can use the two results to get an idea of distribution of results in terms of number of times observations might be re-sampled when sampling with replacement and the variation in those results, as shown in Figure 2.22. We could also derive the expected counts for each number of times of re-sampling when we start with all observations having an equal chance and sampling with replacement but this isn’t important for using bootstrapping methods.

Figure 2.22: Counts of the number of times each observation was re-sampled (0 means not sampled) for two bootstrap samples.

The main point of this exploration was to see that each run of the resample function provides a new version of the data set. Repeating this \(B\) times using another for loop, we will track our quantity of interest, say \(T\), in all these new “data sets” and call those results \(T^*\). The distribution of the bootstrapped \(T^*\) statistics tells us about the range of results to expect for the statistic. The middle __% of the \(T^*\)’s provides a __% bootstrap confidence interval 55 for the true parameter – here the difference in the two population means.
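The procedure just described can be sketched in plain Python. This is an illustrative analogue, not the R workflow used in the text: the data values below are made up (they are not the passing-distance sample), the function name is ours, and the percentiles come from the standard library's `statistics.quantiles`, whose interpolation rule may differ slightly from other software.

```python
import random
import statistics

def bootstrap_diff_ci_95(rows, group_a, group_b, B=1000, seed=0):
    """95% percentile bootstrap interval for mean(group_a) - mean(group_b).

    rows is a list of (group_label, value) pairs; whole rows are resampled with replacement.
    """
    rng = random.Random(seed)
    t_star = []
    while len(t_star) < B:
        sample = rng.choices(rows, k=len(rows))      # one bootstrap data set of size n
        a = [v for g, v in sample if g == group_a]
        b = [v for g, v in sample if g == group_b]
        if not a or not b:                           # a group vanished from this resample; redraw
            continue
        t_star.append(statistics.mean(a) - statistics.mean(b))
    cuts = statistics.quantiles(t_star, n=40)        # 39 cut points at 2.5%, 5%, ..., 97.5%
    return cuts[0], cuts[-1]

# Hypothetical data, invented for this sketch.
commute = [100, 110, 95, 120, 105, 98, 115, 102, 108, 112]
casual = [125, 130, 118, 140, 122, 135, 128, 132, 138, 120]
rows = [("commute", v) for v in commute] + [("casual", v) for v in casual]

observed = statistics.mean(commute) - statistics.mean(casual)
low, high = bootstrap_diff_ci_95(rows, "commute", "casual")
print(observed, low, high)
```

Changing `seed` changes the interval slightly, which mirrors the run-to-run variation in bootstrap intervals.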

To make this concrete, we can revisit our previous examples, starting with the dsample data created before and our interest in comparing the mean passing distances for the commuter and casual outfit groups in the \(n = 30\) stratified random sample that was extracted. The bootstrapping code is very similar to the permutation code except that we apply the resample function to the entire data set used in lm as opposed to the shuffle function that was applied only to the explanatory variable.

Figure 2.23: Histogram and density curve of the bootstrap distribution of the difference in sample mean Distances, with a vertical line for the observed difference in the means of -25.933.

In this situation, the observed difference in the mean passing distances is -25.933 cm ( commute - casual ), which is the bold vertical line in Figure 2.23. The bootstrap distribution shows the results for the difference in the sample means when fake data sets are re-constructed by sampling from the original data set with replacement. The bootstrap distribution is approximately centered at the observed value (difference in the sample means) and is relatively symmetric.

The permutation distribution in the same situation (Figure 2.10) had a similar shape but was centered at 0. Permutations create sampling distributions based on assuming the null hypothesis is true, which is useful for hypothesis testing. Bootstrapping creates distributions centered at the observed result, which is the sampling distribution “under the alternative” or when no null hypothesis is assumed; bootstrap distributions are useful for generating confidence intervals for the true parameter values.

To create a 95% bootstrap confidence interval for the difference in the true mean distances (\(\mu_\text{commute}-\mu_\text{casual}\)), select the middle 95% of results from the bootstrap distribution. Specifically, find the 2.5th percentile and the 97.5th percentile (values that put 2.5% and 97.5% of the results to the left) in the bootstrap distribution, which leaves 95% in the middle for the confidence interval. To find percentiles in a distribution in R, functions are of the form q[Name of distribution], with the function qt extracting percentiles from a \(t\)-distribution (examples below). From the bootstrap results, use the qdata function on the Tstar results that contain the bootstrap distribution of the statistic of interest.

These results tell us that the 2.5th percentile of the bootstrap distribution is at -50.006 cm and the 97.5th percentile is at -2.249 cm. We can combine these results to provide a 95% confidence interval for \(\mu_\text{commute}-\mu_\text{casual}\) that is between -50.01 and -2.25 cm. This interval is interpreted as with any confidence interval, that we are 95% confident that the difference in the true mean distances (commute minus casual groups) is between -50.01 and -2.25 cm. Or we can switch the direction of the comparison and say that we are 95% confident that the difference in the true means is between 2.25 and 50.01 cm (casual minus commute). This result would be incorporated into step 5 of the hypothesis testing protocol to accompany discussing the size of the estimated difference in the groups or used as a result of interest in itself. Both percentiles can be obtained in one line of code using:

Figure 2.24 displays those same percentiles on the bootstrap distribution residing in Tstar .

Figure 2.24: Histogram and density curve of the bootstrap distribution with the 95% bootstrap confidence interval limits displayed (bold, dashed vertical lines).

Although confidence intervals can exist without referencing hypotheses, we can revisit our previous hypotheses and see what this confidence interval tells us about the test of \(H_0: \mu_\text{commute} = \mu_\text{casual}\). This null hypothesis is equivalent to testing \(H_0: \mu_\text{commute} - \mu_\text{casual} = 0\), that the difference in the true means is equal to 0 cm. And the difference in the means was the scale for our confidence interval, which did not contain 0 cm. The 0 cm value is an interesting reference value for the confidence interval, because here it is the value where the true means are equal to each other (have a difference of 0 cm). In general, if our confidence interval does not contain 0, then it is saying that 0 is not one of the likely values for the difference in the true means at the selected confidence level. This implies that we should reject a claim that they are equal. This provides the same inferences for the hypotheses that we considered previously using both parametric and permutation approaches using a fixed \(\alpha\) approach where \(\alpha\) = 1 - confidence level.

The general summary is that we can use confidence intervals to test hypotheses by assessing whether the reference value under the null hypothesis is in the confidence interval (suggests insufficient evidence against \(H_0\) to reject it, at least at the \(\alpha\) level and equivalent to having a p-value larger than \(\alpha\) ) or outside the confidence interval (sufficient evidence against \(H_0\) to reject it and equivalent to having a p-value that is less than \(\alpha\) ). P-values are more informative about hypotheses (measure of evidence against the null hypothesis) but confidence intervals are more informative about the size of differences, so both offer useful information and, as shown here, can provide consistent conclusions about hypotheses. But it is best practice to use p-values to assess evidence against null hypotheses and confidence intervals to do inferences for the size of differences.

As in the previous situation, we also want to consider the parametric approach for comparison purposes and to have that method available, especially to help us understand some methods where we will only consider parametric inferences in later chapters. The parametric confidence interval is called the equal variance, two-sample t confidence interval and additionally assumes that the populations being sampled from are normally distributed instead of just that they have similar shapes in the bootstrap approach. The parametric method leads to using a \(t\)-distribution to form the interval with the degrees of freedom for the \(t\)-distribution of \(n-2\), although we can obtain it without direct reference to this distribution using the confint function applied to the lm model. This function generates two confidence intervals and the one in the second row is the one we are interested in, as it pertains to the difference in the true means of the two groups. The parametric 95% confidence interval here is from -51.6 to -0.26 cm, which is a bit different in width from the nonparametric bootstrap interval that was from -50.01 to -2.25 cm.

The bootstrap interval was narrower by almost 4 cm and its upper limit was much further from 0. The bootstrap CI can vary depending on the random number seed used and additional runs of the code produced intervals of (-49.6, -2.8), (-48.3, -2.5), and (-50.9, -1.1), so the differences between the parametric and nonparametric approaches were not just due to an unusual bootstrap distribution. It is not entirely clear why the two intervals differ but there are slightly more results in the left tail of Figure 2.24 than in the right tail and this shifts the 95% confidence interval slightly away from 0 as compared to the parametric approach. All intervals have the same interpretation; only the methods for calculating the intervals and the assumptions differ. Specifically, the bootstrap interval can tolerate different distribution shapes other than normal and still provide intervals that work well 56 . The other assumptions are all the same as for the hypothesis test, where we continue to assume that we have independent observations with equal variances for the two groups and maintain concerns about inferences here due to the violation of independence in these responses.

The formula that lm is using to calculate the parametric equal variance, two-sample \(t\) -based confidence interval is:

\[\bar{x}_1 - \bar{x}_2 \mp t^*_{df}s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}\]

In this situation, the df is again \(n_1+n_2-2\) (the total sample size - 2) and \(s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}\) . The \(t^*_{df}\) is a multiplier that comes from finding the percentile from the \(t\) -distribution that puts \(C\) % in the middle of the distribution with \(C\) being the confidence level. It is important to note that this \(t^*\) has nothing to do with the previous test statistic \(t\) . It is confusing and students first engaging these two options often happily take the result from a test statistic calculation and use it for a multiplier in a \(t\) -based confidence interval – try to focus on which \(t\) you are interested in before you use either. Figure 2.25 shows the \(t\) -distribution with 28 degrees of freedom and the cut-offs that put 95% of the area in the middle.
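As a minimal sketch of the formula above, the interval can be computed from summary statistics in base R. The function name pooled_t_ci and the input values are hypothetical and chosen only for illustration; they are not the values from the example in the text.

```r
# Equal-variance (pooled) two-sample t confidence interval from summary statistics.
# pooled_t_ci is a hypothetical helper name, not a function from any package.
pooled_t_ci <- function(xbar1, xbar2, s1, s2, n1, n2, conf = 0.95) {
  df <- n1 + n2 - 2
  sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df)  # pooled standard deviation
  se <- sp * sqrt(1 / n1 + 1 / n2)                       # SE of the difference in means
  tstar <- qt(1 - (1 - conf) / 2, df = df)               # t* multiplier
  (xbar1 - xbar2) + c(-1, 1) * tstar * se                # lower, then upper bound
}

# Hypothetical inputs, for illustration only
pooled_t_ci(xbar1 = 10, xbar2 = 12, s1 = 3, s2 = 4, n1 = 15, n2 = 15)
```

With raw data, the same interval should agree with t.test(x, y, var.equal = TRUE)$conf.int, which is a convenient way to check a hand calculation.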

Figure 2.25: Plot of \(t(28)\) with cut-offs for putting 95% of the distribution in the middle that delineate the \(t^*\) multiplier to make a 95% confidence interval.

For 95% confidence intervals, the multiplier is going to be close to 2 and anything else is a likely indication of a mistake. We can use R to get the multipliers for confidence intervals using the qt function in a similar fashion to how qdata was used in the bootstrap results, except that this new value must be used in the previous confidence interval formula. This function produces values for requested percentiles, so if we want to put 95% in the middle, we place 2.5% in each tail of the distribution and need to request the 97.5th percentile. Because the \(t\)-distribution is always symmetric around 0, we merely need to look up the value for the 97.5th percentile and know that the multiplier for the 2.5th percentile is just \(-t^*\). The \(t^*\) multiplier to form the confidence interval is 2.0484 for a 95% confidence interval when \(df = 28\), based on the results from qt.
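The original output is not reproduced here. A call consistent with the description, using only the base R function qt and the 28 degrees of freedom stated in the text, would look like this:

```r
tstar <- qt(0.975, df = 28)  # 97.5th percentile of the t-distribution with 28 df
tstar                         # about 2.0484
qt(0.025, df = 28)            # about -2.0484, the same value with the sign flipped
```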

Note that the 2.5th percentile is just the negative of this value due to symmetry, which is the real source of the minus in the minus/plus in the formula for the confidence interval.

We can also re-write the confidence interval formula in a slightly more general form as

\[\bar{x}_1 - \bar{x}_2 \mp t^*_{df}SE_{\bar{x}_1 - \bar{x}_2}\ \text{ OR }\ \bar{x}_1 - \bar{x}_2 \mp ME\]

where \(SE_{\bar{x}_1 - \bar{x}_2} = s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}\) and \(ME = t^*_{df}SE_{\bar{x}_1 - \bar{x}_2}\) . The SE is available in the lm model summary for the line related to the difference in groups in the “Std. Error” column. In some situations, researchers will report the standard error (SE) or margin of error (ME) as a method of quantifying the uncertainty in a statistic. The SE is an estimate of the standard deviation of the statistic (here \(\bar{x}_1 - \bar{x}_2\) ) and the ME is an estimate of the precision of a statistic that can be used to directly form a confidence interval. The ME depends on the choice of confidence level although 95% is almost always selected.

To finish this example, R can be used to help you do calculations much like a calculator except with much more power “under the hood”. You have to be careful to use ( ) to group items and remember that the asterisk (*) is used for multiplication. To calculate the confidence interval “by hand” using R, we need the pertinent information, which is available from the favstats output repeated below.

Start with typing the following command to calculate \(s_p\) and store it in a variable named sp :

Then calculate the confidence interval that confint provided using:

Or using the information from the model summary:

The previous results all use c(-1, 1) times the margin of error to subtract and add the ME to the difference in the sample means ( \(109.8667 - 135.8\) ), which generates the lower and then upper bounds of the confidence interval. If desired, we can also use just the last portion of the calculation to find the margin of error, which is 25.675 here.
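The last step can be sketched using only the numbers quoted in the text (the two sample means, the margin of error, and 28 degrees of freedom). The variable names are illustrative, and the implied standard error is derived here from those numbers rather than quoted from the model summary.

```r
diff_means <- 109.8667 - 135.8    # difference in sample means from the text
ME <- 25.675                      # margin of error reported in the text
ci <- diff_means + c(-1, 1) * ME  # subtract, then add, the margin of error
ci                                # roughly -51.61 and -0.26

# Standard error implied by ME = t* x SE with 28 df (derived, not quoted)
SE_implied <- ME / qt(0.975, df = 28)
SE_implied
```

The result agrees, up to rounding, with the parametric interval of -51.6 to -0.26 cm reported earlier.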

For the entire \(n = 1,636\) data set for these two groups, the results are obtained using the following code. The estimated difference in the means is -3 cm ( commute minus casual ). The \(t\) -based 95% confidence interval is from -5.89 to -0.11.

The bootstrap 95% confidence interval is from -5.816 to -0.076. With this large data set, the differences between the parametric and bootstrap approaches decrease, and they are essentially equivalent here. The bootstrap distribution (not displayed) for the differences in the sample means is relatively symmetric and centered around the estimated difference of -3 cm. So using all the observations we would be 95% confident that the true mean difference in overtake distances ( commute - casual ) is between -5.82 and -0.08 cm, providing additional information about the estimated difference in the sample means of -3 cm.

Mathematics LibreTexts

11: Hypothesis Testing and Confidence Intervals with Two Samples

You have learned to conduct hypothesis tests on single means and single proportions. You will expand upon that in this chapter. You will compare two means or two proportions to each other. The general procedure is still the same, just expanded. To compare two means or two proportions, you work with two groups. The groups are classified either as independent or matched pairs. Independent groups consist of two samples that are independent, that is, sample values selected from one population are not related in any way to sample values selected from the other population. Matched pairs consist of two samples that are dependent. The parameter tested using matched pairs is the population mean. The parameters tested using independent groups are either population means or population proportions.

  • 11.1: Prelude to Hypothesis Testing with Two Samples This chapter deals with the following hypothesis tests: for independent groups (samples are independent), the test of two population means and the test of two population proportions; for matched or paired samples (samples are dependent), the test of two population means by testing one population mean of differences.
  • 11.2: Two Population Means with Unknown Standard Deviations The comparison of two population means is very common. A difference between the two samples depends on both the means and the standard deviations. Very different means can occur by chance if there is great variation among the individual samples.
  • 11.3: Two Population Means with Known Standard Deviations Even though this situation is not likely (knowing the population standard deviations is not likely), the following example illustrates hypothesis testing for independent means, known population standard deviations.
  • 11.4: Comparing Two Independent Population Proportions Comparing two proportions, like comparing two means, is common. If two estimated proportions are different, it may be due to a difference in the populations or it may be due to chance. A hypothesis test can help determine if a difference in the estimated proportions reflects a difference in the population proportions.
  • 11.5: Matched or Paired Samples When using a hypothesis test for matched or paired samples, the following characteristics should be present: Simple random sampling is used. Sample sizes are often small. Two measurements (samples) are drawn from the same pair of individuals or objects. Differences are calculated from the matched or paired samples. The differences form the sample that is used for the hypothesis test. Either the matched pairs have differences that come from a population that is normal or the number of difference
  • 11.6: Hypothesis Testing for Two Means and Two Proportions (Worksheet) A statistics Worksheet: The student will select the appropriate distributions to use in each case. The student will conduct hypothesis tests and interpret the results.
  • 11.7: Hypothesis Testing with Two Samples (Exercises) These are homework exercises to accompany the Textmap created for "Introductory Statistics" by OpenStax.

Hypothesis Testing, P Values, Confidence Intervals, and Significance

Definition/Introduction

Medical providers often rely on evidence-based medicine to guide decision-making in practice. Often a research hypothesis is tested with results provided, typically with p values, confidence intervals, or both. Additionally, statistical or research significance is estimated or determined by the investigators. Unfortunately, healthcare providers may have different comfort levels in interpreting these findings, which may affect the adequate application of the data.

Issues of Concern

Without a foundational understanding of hypothesis testing, p values, confidence intervals, and the difference between statistical and clinical significance, healthcare providers may be unable to make clinical decisions without relying purely on the level of significance deemed by the research investigators. Therefore, an overview of these concepts is provided to allow medical professionals to use their expertise to determine if results are reported sufficiently and if the study outcomes are clinically appropriate to be applied in healthcare practice.

Hypothesis Testing

Investigators conducting studies need research questions and hypotheses to guide analyses. Starting with broad research questions (RQs), investigators then identify a gap in current clinical practice or research. Any research problem or statement is grounded in a better understanding of relationships between two or more variables. For this article, we will use the following research question example:

Research Question: Is Drug 23 an effective treatment for Disease A?

Research questions do not directly imply specific guesses or predictions; we must formulate research hypotheses. A hypothesis is a predetermined declaration regarding the research question in which the investigator(s) makes a precise, educated guess about a study outcome. This is sometimes called the alternative hypothesis and ultimately allows the researcher to take a stance based on experience or insight from medical literature. An example of a hypothesis is below.

Research Hypothesis: Drug 23 will significantly reduce symptoms associated with Disease A compared to Drug 22.

The null hypothesis states that there is no statistical difference between groups based on the stated research hypothesis. An example of a null hypothesis is below.

Null Hypothesis: There are no significant differences in the reduction of symptoms for Disease A between Drug 23 and Drug 22.

The null hypothesis is deemed true until a study presents significant data to support rejecting the null hypothesis. Based on the results, the investigators will either reject the null hypothesis (if they found significant differences or associations) or fail to reject the null hypothesis (they could not provide evidence that there were significant differences or associations).

Regarding p values, as the number of individuals enrolled in a study (the sample size) increases, the likelihood of finding a statistically significant effect increases. With very large sample sizes, the p-value can be very low. Researchers should be aware of journal recommendations when considering how to report p values, and manuscripts should remain internally consistent.
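The point about sample size and p values can be illustrated with a small base R sketch. It assumes a two-sample z-test with equal group sizes and a fixed, small standardized difference; the function name p_for_n and the effect size d = 0.1 are hypothetical choices for illustration.

```r
# Two-sided p-value for a two-sample z-test with equal group sizes,
# when the observed standardized difference is held fixed at d (hypothetical value).
p_for_n <- function(n_per_group, d = 0.1) {
  z <- d * sqrt(n_per_group / 2)  # z statistic for a standardized difference of d
  2 * pnorm(-abs(z))              # two-sided p-value
}

n <- c(50, 500, 5000, 50000)
data.frame(n_per_group = n, p_value = p_for_n(n))
```

The same small difference moves from clearly non-significant to highly significant as the groups grow, which is why a low p-value alone says little about the size of an effect.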

To test a hypothesis, researchers obtain data on a representative sample to determine whether to reject or fail to reject a null hypothesis. In most research studies, it is not feasible to obtain data for an entire population. Using a sampling procedure allows for statistical inference, though this involves a certain possibility of error. [1]  When determining whether to reject or fail to reject the null hypothesis, mistakes can be made: Type I and Type II errors. Though it is impossible to ensure that these errors have not occurred, researchers should limit the possibilities of these faults. [2]
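A Type I error can be made concrete with a short simulation, sketched here in base R under assumed settings (normally distributed data, 20 observations per group, an arbitrary seed). Both groups are drawn from the same population, so every rejection is a false positive.

```r
set.seed(123)  # arbitrary seed for reproducibility
alpha <- 0.05

# Both groups come from the same population, so the null hypothesis is true
p_values <- replicate(2000, t.test(rnorm(20), rnorm(20), var.equal = TRUE)$p.value)

mean(p_values < alpha)  # proportion of false rejections, expected to be near alpha
```

The rejection rate should land close to the chosen significance level, which is the sense in which alpha limits the chance of a Type I error.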

Significance

Significance is a term to describe the substantive importance of medical research. Statistical significance concerns the likelihood that results are due to chance. [3]  Healthcare providers should always distinguish statistical significance from clinical significance; conflating the two is a common error when reviewing biomedical research. [4]  When conceptualizing findings reported as either significant or not significant, healthcare providers should not simply accept researchers' results or conclusions without considering the clinical significance. Healthcare professionals should consider the clinical importance of findings and understand both p values and confidence intervals so they do not have to rely on the researchers to determine the level of significance. [5]  One criterion often used to determine statistical significance is the utilization of p values.

P values are used in research to determine whether the sample estimate is significantly different from a hypothesized value. The p-value is the probability of observing an effect at least as large as the one in the study by chance if, in reality, there was no true effect. Conventionally, data yielding a p<0.05 or p<0.01 is considered statistically significant. While some have argued that the 0.05 level should be lowered, it is still widely used. [6]  Hypothesis testing allows us to determine whether an effect is likely to be present; it does not, by itself, tell us the size of the effect.

Examples of findings reported with p values are below:

Statement: Drug 23 reduced patients' symptoms compared to Drug 22. Patients who received Drug 23 (n=100) were 2.1 times less likely than patients who received Drug 22 (n = 100) to experience symptoms of Disease A, p<0.05.

Statement: Individuals who were prescribed Drug 23 experienced fewer symptoms (M = 1.3, SD = 0.7) compared to individuals who were prescribed Drug 22 (M = 5.3, SD = 1.9). This finding was statistically significant, p = 0.02.

For either statement, if the threshold had been set at 0.05, the null hypothesis (that there was no relationship) should be rejected, and we should conclude significant differences. As the two statements above show, some researchers report findings with < or > while others provide an exact p-value (e.g., 0.000001), but never zero. [6]  When examining research, readers should understand how p values are reported. The best practice is to report all p values for all variables within a study design, rather than only providing p values for variables with significant findings. [7]  The inclusion of all p values provides evidence for study validity and limits suspicion of selective reporting or data mining.

While researchers have historically used p values, experts who find p values problematic encourage the use of confidence intervals. [8]  P-values alone do not allow us to understand the size or the extent of the differences or associations. [3]  In March 2016, the American Statistical Association (ASA) released a statement on p values, noting that scientific decision-making and conclusions should not be based on a fixed p-value threshold (e.g., 0.05). They recommend focusing on the significance of results in the context of study design, quality of measurements, and validity of data. Ultimately, the ASA statement noted that in isolation, a p-value does not provide strong evidence. [9]

When conceptualizing clinical work, healthcare professionals should consider p values with a concurrent appraisal of study design validity. For example, a p-value from a double-blinded randomized clinical trial (designed to minimize bias) should be weighted higher than one from a retrospective observational study. [7]  The p-value debate has smoldered since the 1950s, [10] and replacement with confidence intervals has been suggested since the 1980s. [11]

Confidence Intervals

A confidence interval (CI) provides a range of values that, with a given level of confidence (e.g., 95%), includes the true value of the statistical parameter within a targeted population. [12]  Most research uses a 95% CI, but investigators can set any level (e.g., 90% CI, 99% CI). [13]  A CI provides a range with the lower bound and upper bound limits of a difference or association that would be plausible for a population. [14]  Therefore, a 95% CI indicates that if a study were to be carried out 100 times, the computed range would be expected to contain the true value in about 95 of them. [15]  Confidence intervals provide more evidence regarding the precision of an estimate compared to p-values. [6]
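The repeated-study interpretation of a 95% CI can be checked by simulation. The base R sketch below assumes normally distributed data with a hypothetical true mean of 10 and standard deviation of 3, and counts how often the t-based interval contains the true mean.

```r
set.seed(42)     # arbitrary seed for reproducibility
true_mean <- 10  # hypothetical population mean

covers <- replicate(2000, {
  x <- rnorm(25, mean = true_mean, sd = 3)   # one simulated study of 25 subjects
  ci <- t.test(x)$conf.int                   # its 95% confidence interval
  ci[1] <= true_mean && true_mean <= ci[2]   # does the interval contain the truth?
})

mean(covers)     # proportion of intervals containing the true mean, near 0.95
```

Any single interval either contains the true value or does not; the 95% refers to the long-run performance of the procedure across repeated studies.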

In consideration of the similar research example provided above, one could make the following statement with 95% CI:

Statement: Individuals who were prescribed Drug 23 had no symptoms after three days, which was significantly faster than those prescribed Drug 22; the mean difference in days to recovery between the two groups was 4.2 days (95% CI: 1.9 – 7.8).

It is important to note that the width of the CI is affected by the standard error and the sample size; reducing the sample size will result in less precision of the CI (increase the width). [14]  A larger width indicates a smaller sample size or a larger variability. [16]  A researcher would want to increase the precision of the CI. For example, a 95% CI of 1.43 – 1.47 is much more precise than the one provided in the example above. In research and clinical practice, CIs provide valuable information on whether the interval includes or excludes any clinically significant values. [14]
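How the width responds to sample size and variability can be sketched for the simplest case, a t-based 95% CI for a single mean, whose width is 2 x t* x s / sqrt(n). The function name ci_width and the standard deviation values are hypothetical choices for illustration.

```r
# Width of a t-based confidence interval for a single mean: 2 * t* * s / sqrt(n)
ci_width <- function(n, s = 2, conf = 0.95) {
  2 * qt(1 - (1 - conf) / 2, df = n - 1) * s / sqrt(n)
}

n <- c(10, 40, 160, 640)
data.frame(n = n,
           width_s2 = round(ci_width(n, s = 2), 3),  # smaller variability
           width_s4 = round(ci_width(n, s = 4), 3))  # larger variability
```

Quadrupling the sample size roughly halves the width, while doubling the standard deviation doubles it.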

Null values are sometimes used for differences with CI (zero for differential comparisons and 1 for ratios). However, CIs provide more information than that. [15]  Consider this example: A hospital implements a new protocol that reduced wait time for patients in the emergency department by an average of 25 minutes (95% CI: -2.5 – 41 minutes). Because the range crosses zero, implementing this protocol in different populations could result in longer wait times; however, the range is much higher on the positive side. Thus, while the p-value used to detect statistical significance for this may result in "not significant" findings, individuals should examine this range, consider the study design, and weigh whether or not it is still worth piloting in their workplace.
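The link between a CI crossing the null value and a "not significant" p-value can be demonstrated for a one-sample t-test, where a 95% CI excludes zero exactly when the two-sided p-value is below 0.05. The data below are simulated with hypothetical settings and are not taken from the wait-time example.

```r
set.seed(7)                         # arbitrary seed for reproducibility
x <- rnorm(30, mean = 0.3, sd = 1)  # simulated differences, hypothetical effect

res <- t.test(x, mu = 0)            # test against a null value of zero
ci_excludes_zero <- res$conf.int[1] > 0 || res$conf.int[2] < 0
significant <- res$p.value < 0.05

c(ci_excludes_zero = ci_excludes_zero, significant = significant)  # the two agree
```

The interval still carries more information than the yes/no verdict, since it shows how far the plausible values extend on each side of the null value.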

Similarly to p-values, 95% CIs cannot control for researchers' errors (e.g., study bias or improper data analysis). [14]  In consideration of whether to report p-values or CIs, researchers should examine journal preferences. When in doubt, reporting both may be beneficial. [13]  An example is below:

Reporting both: Individuals who were prescribed Drug 23 had no symptoms after three days, which was significantly faster than those prescribed Drug 22, p = 0.009. The mean difference in days to recovery between the two groups was 4.2 days (95% CI: 1.9 – 7.8).

Clinical Significance

Recall that clinical significance and statistical significance are two different concepts. Healthcare providers should remember that a study with statistically significant differences and large sample size may be of no interest to clinicians, whereas a study with smaller sample size and statistically non-significant results could impact clinical practice. [14]  Additionally, as previously mentioned, a non-significant finding may reflect the study design itself rather than relationships between variables.

Healthcare providers using evidence-based medicine to inform practice should use clinical judgment to determine the practical importance of studies through careful evaluation of the design, sample size, power, likelihood of type I and type II errors, data analysis, and reporting of statistical findings (p values, 95% CI or both). [4]  Interestingly, some experts have called for "statistically significant" or "not significant" to be excluded from work as statistical significance never has and will never be equivalent to clinical significance. [17]

The decision on what is clinically significant can be challenging, depending on the providers' experience and especially the severity of the disease. Providers should use their knowledge and experiences to determine the meaningfulness of study results and make inferences based not only on significant or insignificant results by researchers but through their understanding of study limitations and practical implications.

Nursing, Allied Health, and Interprofessional Team Interventions

All physicians, nurses, pharmacists, and other healthcare professionals should strive to understand the concepts in this chapter. These individuals should maintain the ability to review and incorporate new literature for evidence-based and safe care. 

References

[1] Jones M, Gebski V, Onslow M, Packman A. Statistical power in stuttering research: a tutorial. Journal of speech, language, and hearing research : JSLHR. 2002 Apr:45(2):243-55 [PubMed PMID: 12003508]
[2] Sedgwick P. Pitfalls of statistical hypothesis testing: type I and type II errors. BMJ (Clinical research ed.). 2014 Jul 3:349():g4287. doi: 10.1136/bmj.g4287. Epub 2014 Jul 3 [PubMed PMID: 24994622]
[3] Fethney J. Statistical and clinical significance, and how to use confidence intervals to help interpret both. Australian critical care : official journal of the Confederation of Australian Critical Care Nurses. 2010 May:23(2):93-7. doi: 10.1016/j.aucc.2010.03.001. Epub 2010 Mar 29 [PubMed PMID: 20347326]
[4] Hayat MJ. Understanding statistical significance. Nursing research. 2010 May-Jun:59(3):219-23. doi: 10.1097/NNR.0b013e3181dbb2cc. Epub [PubMed PMID: 20445438]
[5] Ferrill MJ, Brown DA, Kyle JA. Clinical versus statistical significance: interpreting P values and confidence intervals related to measures of association to guide decision making. Journal of pharmacy practice. 2010 Aug:23(4):344-51. doi: 10.1177/0897190009358774. Epub 2010 Apr 13 [PubMed PMID: 21507834]
[6] Infanger D, Schmidt-Trucksäss A. P value functions: An underused method to present research results and to promote quantitative reasoning. Statistics in medicine. 2019 Sep 20:38(21):4189-4197. doi: 10.1002/sim.8293. Epub 2019 Jul 3 [PubMed PMID: 31270842]
[7] Dorey F. Statistics in brief: Interpretation and use of p values: all p values are not equal. Clinical orthopaedics and related research. 2011 Nov:469(11):3259-61. doi: 10.1007/s11999-011-2053-1. Epub [PubMed PMID: 21918804]
[8] Liu XS. Implications of statistical power for confidence intervals. The British journal of mathematical and statistical psychology. 2012 Nov:65(3):427-37. doi: 10.1111/j.2044-8317.2011.02035.x. Epub 2011 Oct 25 [PubMed PMID: 22026811]
[9] Tijssen JG, Kolm P. Demystifying the New Statistical Recommendations: The Use and Reporting of p Values. Journal of the American College of Cardiology. 2016 Jul 12:68(2):231-3. doi: 10.1016/j.jacc.2016.05.026. Epub [PubMed PMID: 27386779]
[10] Spanos A. Recurring controversies about P values and confidence intervals revisited. Ecology. 2014 Mar:95(3):645-51 [PubMed PMID: 24804448]
[11] Freire APCF, Elkins MR, Ramos EMC, Moseley AM. Use of 95% confidence intervals in the reporting of between-group differences in randomized controlled trials: analysis of a representative sample of 200 physical therapy trials. Brazilian journal of physical therapy. 2019 Jul-Aug:23(4):302-310. doi: 10.1016/j.bjpt.2018.10.004. Epub 2018 Oct 16 [PubMed PMID: 30366845]
[12] Dorey FJ. In brief: statistics in brief: Confidence intervals: what is the real result in the target population? Clinical orthopaedics and related research. 2010 Nov:468(11):3137-8. doi: 10.1007/s11999-010-1407-4. Epub [PubMed PMID: 20532716]
[13] Porcher R. Reporting results of orthopaedic research: confidence intervals and p values. Clinical orthopaedics and related research. 2009 Oct:467(10):2736-7. doi: 10.1007/s11999-009-0952-1. Epub 2009 Jun 30 [PubMed PMID: 19565303]
[14] Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. British medical journal (Clinical research ed.). 1986 Mar 15:292(6522):746-50 [PubMed PMID: 3082422]
[15] Cooper RJ, Wears RL, Schriger DL. Reporting research results: recommendations for improving communication. Annals of emergency medicine. 2003 Apr:41(4):561-4 [PubMed PMID: 12658257]
[16] Doll H, Carney S. Statistical approaches to uncertainty: P values and confidence intervals unpacked. Equine veterinary journal. 2007 May:39(3):275-6 [PubMed PMID: 17520981]
[17] Colquhoun D. The reproducibility of research and the misinterpretation of p-values. Royal Society open science. 2017 Dec:4(12):171085. doi: 10.1098/rsos.171085. Epub 2017 Dec 6 [PubMed PMID: 29308247]

Use the mouse wheel to zoom in and out, click and drag to pan the image

IMAGES

  1. PPT

    hypothesis testing vs confidence interval

  2. The Relationship Between Confidence Intervals and Hypothesis Tests

    hypothesis testing vs confidence interval

  3. Understanding Hypothesis Tests: Confidence Intervals and Confidence Levels

    hypothesis testing vs confidence interval

  4. PPT

    hypothesis testing vs confidence interval

  5. Hypothesis Testing and Confidence Intervals

    hypothesis testing vs confidence interval

  6. Hypothesis Testing

    hypothesis testing vs confidence interval

VIDEO

  1. Statistics Webinar on Hypothesis Tests and Confidence Intervals for 2 Proportions and Means

  2. hypothesis tests confidence intervals and the p-value

  3. Relationship between Hypothesis testing and Confidence Interval

  4. CSE567-13-13A: Comparing Computer Systems Using Sample Data

  5. 114 Testing of Hypothesis & Confidence Interval: Lecture V

  6. Hypothesis Testing/Confidence Interval Inferences About Two Proportions pt3

COMMENTS

  1. Hypothesis Test vs. Confidence Interval: What's the Difference?

    Here's the difference between the two: A hypothesis test is a formal statistical test that is used to determine if some hypothesis about a population parameter is true. A confidence interval is a range of values that is likely to contain a population parameter with a certain level of confidence. This tutorial shares a brief overview of each ...

  2. 6.6

    In other words, if the the 95% confidence interval contains the hypothesized parameter, then a hypothesis test at the 0.05 \(\alpha\) level will almost always fail to reject the null hypothesis. If the 95% confidence interval does not contain the hypothesize parameter, then a hypothesis test at the 0.05 \(\alpha\) level will almost always ...

  3. Hypothesis Testing and Confidence Intervals

    The relationship between the confidence level and the significance level for a hypothesis test is as follows: Confidence level = 1 - Significance level (alpha) For example, if your significance level is 0.05, the equivalent confidence level is 95%. Both of the following conditions represent statistically significant results: The P-value in a ...

  4. 8.6 Relationship Between Confidence Intervals and Hypothesis Tests

    8.6 Relationship Between Confidence Intervals and Hypothesis Tests Confidence intervals (CI) and hypothesis tests should give consistent results: we should not reject [latex]H_0[/latex] at the significance level [latex]\alpha[/latex] if the corresponding [latex](1 - \alpha) \times 100\%[/latex] confidence interval contains the hypothesized value [latex]\mu_0[/latex].

  5. The Relationship Between Hypothesis Testing and Confidence Intervals

    Both confidence intervals and hypothesis intervals can be used in tandem to help support our conclusions! References: Vital Signs: Predicted Heart Age and Racial Disparities in Heart Age Among U.S. Adults at the State Level; Hypothesis Test vs. Confidence Interval | Statistics Tutorial #15 | MarinStatsLectures

  6. Hypothesis Testing, P Values, Confidence Intervals, and Significance

    Definition/Introduction. Medical providers often rely on evidence-based medicine to guide decision-making in practice. Often a research hypothesis is tested with results provided, typically with p values, confidence intervals, or both. Additionally, statistical or research significance is estimated or determined by the investigators.

  7. Confidence Intervals: Interpreting, Finding & Formulas

    Confidence intervals are similarly helpful for understanding an effect size. For example, if you assess a treatment and control group, the mean difference between these groups is the estimated effect size. A 2-sample t-test can construct a confidence interval for the mean difference.

  8. Understanding Hypothesis Tests: Confidence Intervals and ...

    You can use either P values or confidence intervals to determine whether your results are statistically significant. If a hypothesis test produces both, these results will agree. The confidence level is equivalent to 1 - the alpha level. So, if your significance level is 0.05, the corresponding confidence level is 95%.

  9. The Ultimate Guide to Hypothesis Testing and Confidence Intervals in

    We can test whether this sample is drawn from a population with mean equals to μ by checking whether Ᾱ differs significantly from μ. We can also estimation a 95% confidence interval for the population mean where this sample is drawn from. Hypothesis Testing. Here are the steps for conducting hypothesis testing: Step 1: Set up the null ...

  10. PDF Hypothesis Testing and Confidence Intervals

    • Exceeds the critical value (|-5.86| > 2.10), so we still reject the null hypothesis. • But remember that we had to assume CR4 in order to perform this hypothesis test and construct the confidence interval. • If there's any reason to think that the population errors are not normally distributed

  11. Hypothesis Testing and Confidence Intervals

    A confidence interval can be defined as a range of parameter values within which the true parameter is expected to lie at a given confidence level. For instance, a 95% confidence interval consists of the set of parameter values for which the null hypothesis cannot be rejected by a test of size 5%.

  12. Significance tests (hypothesis testing)

    Confidence intervals. Unit 12. Significance tests (hypothesis testing). Unit 13. Two-sample inference for the difference between groups. ... Large sample proportion hypothesis testing.

  13. Confidence intervals and hypothesis testing

    And, indeed, given a confidence interval I, a valid test of the hypothesis β_k = β_k^0 is given by rejecting if and only if β_k^0 ∉ I. This duality is entirely general: The set of values that a valid test does not reject is a valid confidence interval. Checking whether a value falls in a valid confidence interval is a valid test.

  14. Understanding Confidence Intervals

    To calculate the 95% confidence interval, we can simply plug the values into the formula. For the USA, the lower and upper bounds of the 95% confidence interval are 34.02 and 35.98. For GB, the lower and upper bounds of the 95% confidence interval are 33.04 and 36.96.

  15. Hypothesis testing and confidence intervals

    Hypothesis testing and confidence intervals are intrinsically related. This chapter discusses how to test statistical hypotheses, and then focuses on interval estimation. Special attention is given to explanation of major statistical concepts, such as the p-value, in layman's terms. The chapter provides two ad hoc examples where hypothesis ...

  16. 11.8: Significance Testing and Confidence Intervals

    There is a close relationship between confidence intervals and significance tests. Specifically, if a statistic is significantly different from 0 at the 0.05 level, then the 95% confidence interval will not contain 0. All values in the confidence interval are plausible values for the parameter, whereas values outside the interval ...

  17. 12: Confidence Intervals and Hypothesis Tests

    12.1: Confidence Intervals. In this chapter, you will learn to construct and interpret confidence intervals. You will also learn a new distribution, the Student's-t, and how it is used with these intervals. Throughout the chapter, it is important to keep in mind that the confidence interval is a random variable.

  18. 2.9: Confidence intervals and bootstrapping

    To create a 95% bootstrap confidence interval for the difference in the true mean distances (μ_commute − μ_casual), select the middle 95% of results from the bootstrap distribution. Specifically, find the 2.5th percentile and the 97.5th percentile (values that put 2.5% and 97.5% of the results to the left) in the ...

  19. What is the difference between confidence intervals and hypothesis testing?

    You can use a confidence interval (CI) for hypothesis testing. In the typical case, if the CI for an effect does not span 0 then you can reject the null hypothesis. But a CI can be used for more, whereas reporting whether it has been passed is the limit of the usefulness of a test. The reason you're recommended to use CI instead of just a t ...

  20. Hypothesis Test vs. Confidence Interval

    Hypothesis Testing (p value) and Confidence Interval: Comparing and contrasting hypothesis testing and confidence interval in research and statistics with ex...

  21. PDF Lecture 10: Confidence intervals & Hypothesis testing

    Testing claims based on a confidence interval (cont.) Using a confidence interval for hypothesis testing might be insufficient in some cases since it gives a yes/no (reject/don't reject) answer, as opposed to quantifying our decision with a probability. Formal hypothesis testing allows us to report a probability along with our decision.

  22. 11: Hypothesis Testing and Confidence Intervals with Two Samples

    11.5: Matched or Paired Samples. When using a hypothesis test for matched or paired samples, the following characteristics should be present: Simple random sampling is used. Sample sizes are often small. Two measurements (samples) are drawn from the same pair of individuals or objects. Differences are calculated from the matched or paired samples.

  23. Hypothesis Testing, P Values, Confidence Intervals, and Significance

    The p-value debate has smoldered since the 1950s, and replacement with confidence intervals has been suggested since the 1980s. Confidence Intervals. A confidence interval provides a range of values that, at a given confidence level (e.g., 95%), is expected to include the true value of the statistical parameter in a targeted population.

  24. Confidence Intervals: Key to BI Hypothesis Testing

    Hypothesis testing is a statistical method used to determine if there is enough evidence to reject a null hypothesis. The confidence interval (CI) is a range of values, derived from sample data ...
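
Several of the snippets above make the same point: a two-sided test at significance level α and a (1 − α) confidence interval built from the same data reach the same decision. The sketch below illustrates that with a one-sample z-test. It is only an illustration under stated assumptions: the widget counts, the known population standard deviation `sigma`, and the function name `z_test_and_ci` are all hypothetical and do not come from any of the sources listed.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_test_and_ci(sample, mu0, sigma, alpha=0.05):
    """Two-sided one-sample z-test plus the matching (1 - alpha) interval.

    Assumes the population standard deviation `sigma` is known, which is
    what allows the normal distribution to be used instead of Student's t.
    """
    n = len(sample)
    xbar = mean(sample)
    se = sigma / sqrt(n)  # standard error of the sample mean

    # Test: how far is the sample mean from the hypothesized mean mu0?
    z = (xbar - mu0) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Interval: sample mean plus or minus the critical value times se.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (xbar - z_crit * se, xbar + z_crit * se)
    return p_value, ci


# Hypothetical monthly defect counts (made up for illustration only).
sample = [248, 251, 239, 244, 246, 242, 250, 241, 245, 243]
alpha = 0.05

for mu0 in (250, 246):
    p_value, (low, high) = z_test_and_ci(sample, mu0, sigma=4.0, alpha=alpha)
    reject = p_value < alpha
    outside = not (low <= mu0 <= high)
    # The two decisions agree: reject exactly when mu0 is outside the interval.
    print(f"mu0={mu0}: p={p_value:.4f}, CI=({low:.2f}, {high:.2f}), "
          f"reject={reject}, mu0 outside CI={outside}")
```

The interval does not depend on `mu0`, so one interval answers the test for every candidate value at once, while the p-value adds a measure of how strong the evidence is against one particular value.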