Module 10: Hypothesis Testing With Two Samples

Putting it together: let's summarize hypothesis testing with two samples.

  • The steps for performing a hypothesis test for two population means with unknown standard deviations are generally the same as the steps for conducting a hypothesis test for one population mean with unknown standard deviation, using a t-distribution.
  • Because the population standard deviations are not known, the sample standard deviations are used for calculations.
  • When the sum of the sample sizes is more than 30, a normal distribution can be used to approximate the Student's t-distribution.
  • The difference of two proportions is approximately normal if there are at least five successes and five failures in each sample.
  • When conducting a hypothesis test for a difference of two proportions, the random samples must be independent and the population must be at least ten times the sample size.
  • When calculating the standard error for the difference in sample proportions, the pooled proportion must be used.
  • When two measurements (samples) are drawn from the same pair of individuals or objects, the differences from the sample are used to conduct the hypothesis test.
  • The distribution that is used to conduct the hypothesis test on the differences is a t-distribution.
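To make the proportions bullets concrete, here is a minimal Python sketch of a two-proportion z-test that uses the pooled proportion for the standard error, as described above. The counts are invented for illustration and are not from any example in this module:

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical counts: successes (x) and sample sizes (n) for two independent samples,
# each with at least five successes and five failures.
x1, n1 = 52, 120
x2, n2 = 38, 115

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                        # pooled proportion
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))  # standard error using the pooled proportion

z = (p1 - p2) / se                                    # test statistic for H0: p1 = p2
p_value = 2 * norm.sf(abs(z))                         # two-tailed p-value (normal approximation)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```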

t-test Calculator


Welcome to our t-test calculator! Here you can not only easily perform one-sample t-tests , but also two-sample t-tests , as well as paired t-tests .

Do you prefer to find the p-value from t-test, or would you rather find the t-test critical values? Well, this t-test calculator can do both! 😊

What does a t-test tell you? Take a look at the text below, where we explain what actually gets tested when various types of t-tests are performed. Also, we explain when to use t-tests (in particular, whether to use the z-test vs. t-test) and what assumptions your data should satisfy for the results of a t-test to be valid. If you've ever wanted to know how to do a t-test by hand, we provide the necessary t-test formula, as well as tell you how to determine the number of degrees of freedom in a t-test.

A t-test is one of the most popular statistical tests for location , i.e., it deals with the population(s) mean value(s).

There are different types of t-tests that you can perform:

  • A one-sample t-test;
  • A two-sample t-test; and
  • A paired t-test.

In the next section , we explain when to use which. Remember that a t-test can only be used for one or two groups . If you need to compare three (or more) means, use the analysis of variance ( ANOVA ) method.

The t-test is a parametric test, meaning that your data has to fulfill some assumptions :

  • The data points are independent; AND
  • The data, at least approximately, follow a normal distribution .

If your sample doesn't fit these assumptions, you can resort to nonparametric alternatives. Visit our Mann–Whitney U test calculator or the Wilcoxon rank-sum test calculator to learn more. Other possibilities include the Wilcoxon signed-rank test or the sign test.

Your choice of t-test depends on whether you are studying one group or two groups:

One-sample t-test

Choose the one-sample t-test to check if the mean of a population is equal to some pre-set hypothesized value .

The average volume of a drink sold in 0.33 l cans — is it really equal to 330 ml?

The average weight of people from a specific city — is it different from the national average?

Choose the two-sample t-test to check if the difference between the means of two populations is equal to some pre-determined value when the two samples have been chosen independently of each other.

In particular, you can use this test to check whether the two groups are different from one another .

The average difference in weight gain in two groups of people: one group was on a high-carb diet and the other on a high-fat diet.

The average difference in the results of a math test from students at two different universities.

This test is sometimes referred to as an independent samples t-test , or an unpaired samples t-test .

A paired t-test is used to investigate the change in the mean of a population before and after some experimental intervention , based on a paired sample, i.e., when each subject has been measured twice: before and after treatment.

In particular, you can use this test to check whether, on average, the treatment has had any effect on the population .

The change in student test performance before and after taking a course.

The change in blood pressure in patients before and after administering some drug.

So, you've decided which t-test to perform. These next steps will tell you how to calculate the p-value from t-test or its critical values, and then which decision to make about the null hypothesis.

Decide on the alternative hypothesis :

Use a two-tailed t-test if you only care whether the population's mean (or, in the case of two populations, the difference between the populations' means) agrees or disagrees with the pre-set value.

Use a one-tailed t-test if you want to test whether this mean (or difference in means) is greater/less than the pre-set value.

Compute your T-score value :

Formulas for the test statistic in t-tests include the sample size , as well as its mean and standard deviation . The exact formula depends on the t-test type — check the sections dedicated to each particular test for more details.

Determine the degrees of freedom for the t-test:

The degrees of freedom are the number of observations in a sample that are free to vary as we estimate statistical parameters. In the simplest case, the number of degrees of freedom equals your sample size minus the number of parameters you need to estimate . Again, the exact formula depends on the t-test you want to perform — check the sections below for details.

The degrees of freedom are essential, as they determine the distribution followed by your T-score (under the null hypothesis). If there are d degrees of freedom, then the distribution of the test statistic is the t-Student distribution with d degrees of freedom. This distribution has a shape similar to N(0,1) (bell-shaped and symmetric) but has heavier tails. If the number of degrees of freedom is large (>30), which generally happens for large samples, the t-Student distribution is practically indistinguishable from N(0,1).

💡 The t-Student distribution owes its name to William Sealy Gosset, who, in 1908, published his paper on the t-test under the pseudonym "Student". Gosset worked at the famous Guinness Brewery in Dublin, Ireland, and devised the t-test as an economical way to monitor the quality of beer. Cheers! 🍺🍺🍺

Recall that the p-value is the probability (calculated under the assumption that the null hypothesis is true) that the test statistic will produce values at least as extreme as the T-score produced for your sample. As probabilities correspond to areas under the density function, the p-value from a t-test can be pictured as the tail area(s) under the t-distribution's density curve beyond your T-score.

The following formulae say how to calculate p-value from t-test. By cdf_{t,d} we denote the cumulative distribution function of the t-Student distribution with d degrees of freedom:

p-value from left-tailed t-test:

p-value = cdf_{t,d}(t_score)

p-value from right-tailed t-test:

p-value = 1 − cdf_{t,d}(t_score)

p-value from two-tailed t-test:

p-value = 2 × cdf_{t,d}(−|t_score|)

or, equivalently: p-value = 2 − 2 × cdf_{t,d}(|t_score|)

However, the cdf of the t-distribution is given by a somewhat complicated formula. To find the p-value by hand, you would need to resort to statistical tables, where approximate cdf values are collected, or to specialized statistical software. Fortunately, our t-test calculator determines the p-value from t-test for you in the blink of an eye!
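If you would like to reproduce these three formulas yourself, here is a short sketch using SciPy's t-distribution cdf; the T-score and degrees of freedom are arbitrary illustrative values:

```python
from scipy.stats import t

t_score, d = 2.1, 15                    # illustrative T-score and degrees of freedom

p_left = t.cdf(t_score, d)              # left-tailed:  cdf_{t,d}(t_score)
p_right = 1 - t.cdf(t_score, d)         # right-tailed: 1 - cdf_{t,d}(t_score)
p_two = 2 * t.cdf(-abs(t_score), d)     # two-tailed:   2 * cdf_{t,d}(-|t_score|)

print(p_left, p_right, p_two)
```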

Recall that in the critical values approach to hypothesis testing, you need to set a significance level, α, before computing the critical values, which in turn give rise to critical regions (a.k.a. rejection regions).

Formulas for critical values employ the quantile function of t-distribution, i.e., the inverse of the cdf :

Critical value for left-tailed t-test: cdf_{t,d}^{-1}(α)

critical region: (−∞, cdf_{t,d}^{-1}(α)]

Critical value for right-tailed t-test: cdf_{t,d}^{-1}(1 − α)

critical region: [cdf_{t,d}^{-1}(1 − α), ∞)

Critical values for two-tailed t-test: ±cdf_{t,d}^{-1}(1 − α/2)

critical region: (−∞, −cdf_{t,d}^{-1}(1 − α/2)] ∪ [cdf_{t,d}^{-1}(1 − α/2), ∞)

To decide the fate of the null hypothesis, just check if your T-score lies within the critical region:

If your T-score belongs to the critical region , reject the null hypothesis and accept the alternative hypothesis.

If your T-score is outside the critical region , then you don't have enough evidence to reject the null hypothesis.
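The same recipe can be scripted with the quantile function; a sketch in Python, where α, the degrees of freedom, and the T-score are placeholders:

```python
from scipy.stats import t

alpha, d = 0.05, 15

crit_left = t.ppf(alpha, d)            # left-tailed critical value,  cdf_{t,d}^{-1}(alpha)
crit_right = t.ppf(1 - alpha, d)       # right-tailed critical value, cdf_{t,d}^{-1}(1 - alpha)
crit_two = t.ppf(1 - alpha / 2, d)     # two-tailed critical values are +/- this number

t_score = 2.4                          # example T-score
reject = abs(t_score) >= crit_two      # two-tailed decision rule
print(crit_left, crit_right, crit_two, reject)
```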

How to use our t-test calculator

Choose the type of t-test you wish to perform:

  • A one-sample t-test (to test the mean of a single group against a hypothesized mean);
  • A two-sample t-test (to compare the means of two groups); or
  • A paired t-test (to check how the mean from the same group changes after some intervention).

Decide on the alternative hypothesis:

  • Two-tailed;
  • Left-tailed; or
  • Right-tailed.

This t-test calculator allows you to use either the p-value approach or the critical regions approach to hypothesis testing!

Enter your T-score and the number of degrees of freedom . If you don't know them, provide some data about your sample(s): sample size, mean, and standard deviation, and our t-test calculator will compute the T-score and degrees of freedom for you .

Once all the parameters are present, the p-value, or critical region, will immediately appear underneath the t-test calculator, along with an interpretation!

One-sample t-test

The null hypothesis is that the population mean is equal to some value μ₀.

The alternative hypothesis is that the population mean is:

  • different from μ₀;
  • smaller than μ₀; or
  • greater than μ₀.

One-sample t-test formula:

t = (x̄ − μ₀) / (s / √n)

where:

  • μ₀ — Mean postulated in the null hypothesis;
  • n — Sample size;
  • x̄ — Sample mean; and
  • s — Sample standard deviation.

Number of degrees of freedom in t-test (one-sample) = n − 1.
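A minimal sketch of the one-sample formula in Python, using the 0.33 l can example above with invented summary statistics:

```python
from math import sqrt
from scipy.stats import t

mu0 = 330.0       # hypothesized mean (the 0.33 l can example from the text)
n = 25            # sample size (hypothetical)
xbar = 328.4      # sample mean (hypothetical)
s = 3.1           # sample standard deviation (hypothetical)

t_score = (xbar - mu0) / (s / sqrt(n))
df = n - 1                              # one-sample degrees of freedom
p_two = 2 * t.sf(abs(t_score), df)      # two-tailed p-value
print(f"t = {t_score:.3f}, df = {df}, p = {p_two:.4f}")
```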

Two-sample t-test

The null hypothesis is that the actual difference between the two groups' means, μ₁ and μ₂, is equal to some pre-set value, Δ.

The alternative hypothesis is that the difference μ₁ − μ₂ is:

  • different from Δ;
  • smaller than Δ; or
  • greater than Δ.

In particular, if this pre-determined difference is zero (Δ = 0):

The null hypothesis is that the population means are equal.

The alternative hypothesis is that:

  • μ₁ and μ₂ are different from one another;
  • μ₁ is smaller than μ₂; or
  • μ₁ is greater than μ₂.

Formally, to perform a t-test, we should additionally assume that the variances of the two populations are equal (this assumption is called the homogeneity of variance ).

There is a version of a t-test that can be applied without the assumption of homogeneity of variance: it is called a Welch's t-test . For your convenience, we describe both versions.

Two-sample t-test if variances are equal

Use this test if you know that the two populations' variances are the same (or very similar).

Two-sample t-test formula (with equal variances):

t = (x̄₁ − x̄₂ − Δ) / (s_p · √(1/n₁ + 1/n₂))

where s_p is the so-called pooled standard deviation, which we compute as:

s_p = √[ ((n₁ − 1)s₁² + (n₂ − 1)s₂²) / (n₁ + n₂ − 2) ]

and:

  • Δ — Mean difference postulated in the null hypothesis;
  • n₁ — First sample size;
  • x̄₁ — Mean for the first sample;
  • s₁ — Standard deviation in the first sample;
  • n₂ — Second sample size;
  • x̄₂ — Mean for the second sample; and
  • s₂ — Standard deviation in the second sample.

Number of degrees of freedom in t-test (two samples, equal variances) = n₁ + n₂ − 2.
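SciPy can run this equal-variance (pooled) test directly from summary statistics; a sketch with invented numbers and Δ = 0:

```python
from scipy.stats import ttest_ind_from_stats

# Hypothetical summary statistics for two independent samples
stat, p = ttest_ind_from_stats(mean1=5.4, std1=1.2, nobs1=20,
                               mean2=4.9, std2=1.1, nobs2=22,
                               equal_var=True)          # pooled standard deviation
print(f"t = {stat:.3f}, p = {p:.4f}")                   # df = 20 + 22 - 2 = 40
```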

Two-sample t-test if variances are unequal (Welch's t-test)

Use this test if the variances of your populations are different.

Two-sample Welch's t-test formula (variances unequal):

t = (x̄₁ − x̄₂ − Δ) / √(s₁²/n₁ + s₂²/n₂)

where the symbols have the same meaning as above; in particular:

  • s₁ — Standard deviation in the first sample; and
  • s₂ — Standard deviation in the second sample.

The number of degrees of freedom in a Welch's t-test (two-sample t-test with unequal variances) is very difficult to count exactly. We can approximate it with the help of the following Satterthwaite formula:

df ≈ (s₁²/n₁ + s₂²/n₂)² / [ (s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1) ]

Alternatively, you can take the smaller of n₁ − 1 and n₂ − 1 as a conservative estimate for the number of degrees of freedom.

🔎 The Satterthwaite formula for the degrees of freedom can be rewritten as a scaled weighted harmonic mean of the degrees of freedom of the respective samples, n₁ − 1 and n₂ − 1, with weights proportional to (s₁²/n₁)² and (s₂²/n₂)², respectively.
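Here is a sketch of the Welch statistic and the Satterthwaite approximation computed by hand with invented numbers; you can cross-check it against scipy.stats.ttest_ind_from_stats(..., equal_var=False):

```python
from math import sqrt
from scipy.stats import t

# Hypothetical summary statistics; Delta = 0
n1, xbar1, s1 = 12, 10.3, 2.1
n2, xbar2, s2 = 30, 9.1, 4.7

a, b = s1**2 / n1, s2**2 / n2
t_score = (xbar1 - xbar2) / sqrt(a + b)          # Welch test statistic

# Satterthwaite approximation for the degrees of freedom
df = (a + b)**2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))

p_two = 2 * t.sf(abs(t_score), df)               # two-tailed p-value
print(f"t = {t_score:.3f}, df = {df:.1f}, p = {p_two:.4f}")
```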

Paired t-test

As we commonly perform a paired t-test when we have data about the same subjects measured twice (before and after some treatment), let us adopt the convention of referring to the samples as the pre-group and post-group.

The null hypothesis is that the true difference between the means of the pre- and post-populations is equal to some pre-set value, Δ.

The alternative hypothesis is that the actual difference between these means is:

  • different from Δ;
  • smaller than Δ; or
  • greater than Δ.

Typically, this pre-determined difference is zero. We can then reformulate the hypotheses as follows:

The null hypothesis is that the pre- and post-means are the same, i.e., the treatment has no impact on the population .

The alternative hypothesis:

  • The pre- and post-means are different from one another (treatment has some effect);
  • The pre-mean is smaller than the post-mean (treatment increases the result); or
  • The pre-mean is greater than the post-mean (treatment decreases the result).

Paired t-test formula

In fact, a paired t-test is technically the same as a one-sample t-test! Let us see why. Let x₁, ..., xₙ be the pre observations and y₁, ..., yₙ the respective post observations; that is, xᵢ and yᵢ are the before and after measurements of the i-th subject.

For each subject, compute the difference dᵢ := xᵢ − yᵢ. All that happens next is just a one-sample t-test performed on the sample of differences d₁, ..., dₙ. Take a look at the formula for the T-score:

t = (x̄ − Δ) / (s / √n)

where:

  • Δ — Mean difference postulated in the null hypothesis;
  • n — Size of the sample of differences, i.e., the number of pairs;
  • x̄ — Mean of the sample of differences; and
  • s — Standard deviation of the sample of differences.

Number of degrees of freedom in t-test (paired): n − 1.
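The "paired test is a one-sample test on the differences" point is easy to verify in code; the before/after values below are invented:

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# Hypothetical before/after measurements on the same eight subjects
pre = np.array([140, 132, 128, 150, 146, 139, 134, 142])
post = np.array([135, 130, 129, 143, 141, 138, 130, 137])

print(ttest_rel(pre, post))           # paired t-test
print(ttest_1samp(pre - post, 0.0))   # same t and p-value (Delta = 0)
```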

t-test vs. Z-test

We use a Z-test when we want to test the population mean of a normally distributed dataset with a known population variance. If the number of degrees of freedom is large, then the t-Student distribution is very close to N(0,1).

Hence, if there are many data points (at least 30), you may swap a t-test for a Z-test, and the results will be almost identical. However, for small samples with unknown variance, remember to use the t-test because, in such cases, the t-Student distribution differs significantly from the N(0,1)!

🙋 Have you concluded you need to perform the z-test? Head straight to our z-test calculator !
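A quick numerical illustration of how close the t-Student distribution gets to N(0,1) as the degrees of freedom grow (comparing two-tailed 5% critical values):

```python
from scipy.stats import norm, t

for df in (5, 10, 30, 100):
    print(df, round(t.ppf(0.975, df), 3), "vs. normal", round(norm.ppf(0.975), 3))
# the t critical value shrinks toward 1.96 as df increases
```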

What is a t-test?

A t-test is a widely used statistical test that analyzes the means of one or two groups of data. For instance, a t-test is performed on medical data to determine whether a new drug really helps.

What are different types of t-tests?

Different types of t-tests are:

  • One-sample t-test;
  • Two-sample t-test; and
  • Paired t-test.

How to find the t value in a one sample t-test?

To find the t-value:

  • Subtract the null hypothesis mean from the sample mean value.
  • Divide the difference by the standard deviation of the sample.
  • Multiply the result by the square root of the sample size.



The Two-Sample t -Test

What is the two-sample t -test?

The two-sample t -test (also known as the independent samples t -test) is a method used to test whether the unknown population means of two groups are equal or not.

Is this the same as an A/B test?

Yes, a two-sample t -test is used to analyze the results from A/B tests.

When can I use the test?

You can use the test when your data values are independent, are randomly sampled from two normal populations and the two independent groups have equal variances.

What if I have more than two groups?

Use a multiple comparison method. Analysis of variance (ANOVA) is one such method. Other multiple comparison methods include the Tukey-Kramer test of all pairwise differences, analysis of means (ANOM) to compare group means to the overall mean or Dunnett’s test to compare each group mean to a control mean.

What if the variances for my two groups are not equal?

You can still use the two-sample t- test. You use a different estimate of the standard deviation. 

What if my data isn’t nearly normally distributed?

If your sample sizes are very small, you might not be able to test for normality. You might need to rely on your understanding of the data. When you cannot safely assume normality, you can perform a nonparametric test that doesn’t assume normality.


Using the two-sample t -test

The sections below discuss what is needed to perform the test, checking our data, how to perform the test and statistical details.

What do we need?

For the two-sample t -test, we need two variables. One variable defines the two groups. The second variable is the measurement of interest.

We also have an idea, or hypothesis, that the means of the underlying populations for the two groups are different. Here are a couple of examples:

  • We have students who speak English as their first language and students who do not. All students take a reading test. Our two groups are the native English speakers and the non-native speakers. Our measurements are the test scores. Our idea is that the mean test scores for the underlying populations of native and non-native English speakers are not the same. We want to know if the mean score for the population of native English speakers is different from the mean score for the population of people who learned English as a second language.
  • We measure the grams of protein in two different brands of energy bars. Our two groups are the two brands. Our measurement is the grams of protein for each energy bar. Our idea is that the mean grams of protein for the underlying populations for the two brands may be different. We want to know if we have evidence that the mean grams of protein for the two brands of energy bars is different or not.

Two-sample t -test assumptions

To conduct a valid test:

  • Data values must be independent. Measurements for one observation do not affect measurements for any other observation.
  • Data in each group must be obtained via a random sample from the population.
  • Data in each group are normally distributed .
  • Data values are continuous.
  • The variances for the two independent groups are equal.

For very small groups of data, it can be hard to test these requirements. Below, we'll discuss how to check the requirements using software and what to do when a requirement isn’t met.

Two-sample t -test example

One way to measure a person’s fitness is to measure their body fat percentage. Average body fat percentages vary by age, but according to some guidelines, the normal range for men is 15-20% body fat, and the normal range for women is 20-25% body fat.

Our sample data is from a group of men and women who did workouts at a gym three times a week for a year. Then, their trainer measured the body fat. The table below shows the data.

Table 1: Body fat percentage data grouped by gender

You can clearly see some overlap in the body fat measurements for the men and women in our sample, but also some differences. Just by looking at the data, it's hard to draw any solid conclusions about whether the underlying populations of men and women at the gym have the same mean body fat. That is the value of statistical tests – they provide a common, statistically valid way to make decisions, so that everyone makes the same decision on the same set of data values.

Checking the data

Let’s start by answering: Is the two-sample t -test an appropriate method to evaluate the difference in body fat between men and women?

  • The data values are independent. The body fat for any one person does not depend on the body fat for another person.
  • We assume the people measured represent a simple random sample from the population of members of the gym.
  • We assume the data are normally distributed, and we can check this assumption.
  • The data values are body fat measurements. The measurements are continuous.
  • We assume the variances for men and women are equal, and we can check this assumption.

Before jumping into analysis, we should always take a quick look at the data. The figure below shows histograms and summary statistics for the men and women.

Histogram and summary statistics for the body fat data

The two histograms are on the same scale. From a quick look, we can see that there are no very unusual points, or outliers . The data look roughly bell-shaped, so our initial idea of a normal distribution seems reasonable.

Examining the summary statistics, we see that the standard deviations are similar. This supports the idea of equal variances. We can also check this using a test for variances.

Based on these observations, the two-sample t -test appears to be an appropriate method to test for a difference in means.

How to perform the two-sample t -test

For each group, we need the average, standard deviation and sample size. These are shown in the table below.

Table 2: Average, standard deviation and sample size statistics grouped by gender

Without doing any testing, we can see that the averages for men and women in our samples are not the same. But how different are they? Are the averages “close enough” for us to conclude that mean body fat is the same for the larger population of men and women at the gym? Or are the averages too different for us to make this conclusion?

We'll further explain the principles underlying the two sample t -test in the statistical details section below, but let's first proceed through the steps from beginning to end. We start by calculating our test statistic. This calculation begins with finding the difference between the two averages:

$ 22.29 - 14.95 = 7.34 $

This difference in our samples estimates the difference between the population means for the two groups.

Next, we calculate the pooled standard deviation. This builds a combined estimate of the overall standard deviation. The estimate adjusts for different group sizes. First, we calculate the pooled variance:

$ s_p^2 = \frac{((n_1 - 1)s_1^2) + ((n_2 - 1)s_2^2)} {n_1 + n_2 - 2} $

$ s_p^2 = \frac{((10 - 1)5.32^2) + ((13 - 1)6.84^2)}{(10 + 13 - 2)} $

$ = \frac{(9\times28.30) + (12\times46.82)}{21} $

$ = \frac{(254.7 + 561.85)}{21} $

$ =\frac{816.55}{21} = 38.88 $

Next, we take the square root of the pooled variance to get the pooled standard deviation. This is:

$ \sqrt{38.88} = 6.24 $

We now have all the pieces for our test statistic. We have the difference of the averages, the pooled standard deviation and the sample sizes.  We calculate our test statistic as follows:

$ t = \frac{\text{difference of group averages}}{\text{standard error of difference}} = \frac{7.34}{(6.24\times \sqrt{(1/10 + 1/13)})} = \frac{7.34}{2.62} = 2.80 $

To evaluate the difference between the means in order to make a decision about our gym programs, we compare the test statistic to a theoretical value from the t- distribution. This activity involves four steps:

  • We decide on the risk we are willing to take for declaring a significant difference. For the body fat data, we decide that we are willing to take a 5% risk of saying that the unknown population means for men and women are not equal when they really are. In statistics-speak, the significance level, denoted by α, is set to 0.05. It is a good practice to make this decision before collecting the data and before calculating test statistics.
  • We calculate a test statistic. Our test statistic is 2.80.
  • We find the theoretical value from the t- distribution based on our null hypothesis which states that the means for men and women are equal. Most statistics books have look-up tables for the t- distribution. You can also find tables online. The most likely situation is that you will use software and will not use printed tables. To find this value, we need the significance level (α = 0.05) and the degrees of freedom . The degrees of freedom ( df ) are based on the sample sizes of the two groups. For the body fat data, this is: $ df = n_1 + n_2 - 2 = 10 + 13 - 2 = 21 $ The t value with α = 0.05 and 21 degrees of freedom is 2.080.
  • We compare the value of our statistic (2.80) to the t value. Since 2.80 > 2.080, we reject the null hypothesis that the mean body fat for men and women are equal, and conclude that we have evidence body fat in the population is different between men and women.
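The calculation above can be reproduced in a few lines from the summary statistics. This sketch assumes, consistent with the pooled-variance step, that the group with mean 22.29 has n = 10 and s = 5.32 and the group with mean 14.95 has n = 13 and s = 6.84:

```python
from scipy.stats import t, ttest_ind_from_stats

stat, p = ttest_ind_from_stats(mean1=22.29, std1=5.32, nobs1=10,
                               mean2=14.95, std2=6.84, nobs2=13,
                               equal_var=True)          # pooled two-sample t-test
df = 10 + 13 - 2
print(f"t = {stat:.2f}, df = {df}, two-sided p = {p:.4f}")        # roughly t = 2.80, p = 0.011
print("t(0.05, 21) critical value:", round(t.ppf(0.975, df), 3))  # about 2.080
```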

Statistical details

Let’s look at the body fat data and the two-sample t -test using statistical terms.

Our null hypothesis is that the underlying population means are the same. The null hypothesis is written as:

$ H_o:  \mathrm{\mu_1} =\mathrm{\mu_2} $

The alternative hypothesis is that the means are not equal. This is written as:

$ H_a:  \mathrm{\mu_1} \neq \mathrm{\mu_2} $

We calculate the average for each group, and then calculate the difference between the two averages. This is written as:

$\overline{x_1} -  \overline{x_2} $

We calculate the pooled standard deviation. This assumes that the underlying population variances are equal. The pooled variance formula is written as:

$ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} $

The formula shows the sample size for the first group as n 1 and the second group as n 2 . The standard deviations for the two groups are s 1 and s 2 . This estimate allows the two groups to have different numbers of observations. The pooled standard deviation is the square root of the variance and is written as s p .

What if your sample sizes for the two groups are the same? In this situation, the pooled estimate of variance is simply the average of the variances for the two groups:

$ s_p^2 = \frac{(s_1^2 + s_2^2)}{2} $

The test statistic is calculated as:

$ t = \frac{(\overline{x_1} -\overline{x_2})}{s_p\sqrt{1/n_1 + 1/n_2}} $

The numerator of the test statistic is the difference between the two group averages. It estimates the difference between the two unknown population means. The denominator is an estimate of the standard error of the difference between the two unknown population means. 

Technical Detail: For a single mean, the standard error is $ s/\sqrt{n} $  . The formula above extends this idea to two groups that use a pooled estimate for s (standard deviation), and that can have different group sizes.

We then compare the test statistic to a t value with our chosen alpha value and the degrees of freedom for our data. Using the body fat data as an example, we set α = 0.05. The degrees of freedom ( df ) are based on the group sizes and are calculated as:

$ df = n_1 + n_2 - 2 = 10 + 13 - 2 = 21 $

The formula shows the sample size for the first group as n 1 and the second group as n 2 .  Statisticians write the t value with α = 0.05 and 21 degrees of freedom as:

$ t_{0.05,21} $

The t value with α = 0.05 and 21 degrees of freedom is 2.080. There are two possible results from our comparison:

  • The test statistic is lower than the t value. You fail to reject the hypothesis of equal means. You conclude that the data support the assumption that the men and women have the same average body fat.
  • The test statistic is higher than the t value. You reject the hypothesis of equal means. You do not conclude that men and women have the same average body fat.

t -Test with unequal variances

When the variances for the two groups are not equal, we cannot use the pooled estimate of standard deviation. Instead, we take the standard error for each group separately. The test statistic is:

$ t = \frac{ (\overline{x_1} -  \overline{x_2})}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} $

The numerator of the test statistic is the same. It is the difference between the averages of the two groups. The denominator is an estimate of the overall standard error of the difference between means. It is based on the separate standard error for each group.

The degrees of freedom calculation for the t value is more complex with unequal variances than equal variances and is usually left up to statistical software packages. The key point to remember is that if you cannot use the pooled estimate of standard deviation, then you cannot use the simple formula for the degrees of freedom.

Testing for normality

The normality assumption is more important   when the two groups have small sample sizes than for larger sample sizes.

Normal distributions are symmetric, which means they are “even” on both sides of the center. Normal distributions do not have extreme values, or outliers. You can check these two features of a normal distribution with graphs. Earlier, we decided that the body fat data was “close enough” to normal to go ahead with the assumption of normality. The figure below shows a normal quantile plot for men and women, and supports our decision.

 Normal quantile plot of the body fat measurements for men and women

You can also perform a formal test for normality using software. The figure above shows results of testing for normality with JMP software. We test each group separately. Both the test for men and the test for women show that we cannot reject the hypothesis of a normal distribution. We can go ahead with the assumption that the body fat data for men and for women are normally distributed.

Testing for unequal variances

Testing for unequal variances is complex. We won’t show the calculations in detail, but will show the results from JMP software. The figure below shows results of a test for unequal variances for the body fat data.

Test for unequal variances for the body fat data

Without diving into details of the different types of tests for unequal variances, we will use the F test. Before testing, we decide to accept a 10% risk of concluding the variances are equal when they are not. This means we have set α = 0.10.

Like most statistical software, JMP shows the p -value for a test. This is the likelihood of finding a more extreme value for the test statistic than the one observed. It’s difficult to calculate by hand. For the figure above, with the F test statistic of 1.654, the p- value is 0.4561. This is larger than our α value: 0.4561 > 0.10. We fail to reject the hypothesis of equal variances. In practical terms, we can go ahead with the two-sample t -test with the assumption of equal variances for the two groups.

Understanding p-values

Using a visual, you can check to see if your test statistic is a more extreme value in the distribution. The figure below shows a t- distribution with 21 degrees of freedom.

t-distribution with 21 degrees of freedom and α = .05

Since our test is two-sided and we have set α = .05, the figure shows that the value of 2.080 “cuts off” 2.5% of the data in each of the two tails. Only 5% of the data overall is further out in the tails than 2.080. Because our test statistic of 2.80 is beyond the cut-off point, we reject the null hypothesis of equal means.

Putting it all together with software

The figure below shows results for the two-sample t -test for the body fat data from JMP software.

Results for the two-sample t-test from JMP software

The results for the two-sample t -test that assumes equal variances are the same as our calculations earlier. The test statistic is 2.79996. The software shows results for a two-sided test and for one-sided tests. The two-sided test is what we want (Prob > |t|). Our null hypothesis is that the mean body fat for men and women is equal. Our alternative hypothesis is that the mean body fat is not equal. The one-sided tests are for one-sided alternative hypotheses – for example, an alternative hypothesis that mean body fat for men is less than that for women.

We can reject the hypothesis of equal mean body fat for the two groups and conclude that we have evidence body fat differs in the population between men and women. The software shows a p -value of 0.0107. We decided on a 5% risk of concluding the mean body fat for men and women are different, when they are not. It is important to make this decision before doing the statistical test.

The figure also shows the results for the t- test that does not assume equal variances. This test does not use the pooled estimate of the standard deviation. As was mentioned above, this test also has a complex formula for degrees of freedom. You can see that the degrees of freedom are 20.9888. The software shows a p- value of 0.0086. Again, with our decision of a 5% risk, we can reject the null hypothesis of equal mean body fat for men and women.

Other topics

If you have more than two independent groups, you cannot use the two-sample t- test. You should use a multiple comparison   method. ANOVA, or analysis of variance, is one such method. Other multiple comparison methods include the Tukey-Kramer test of all pairwise differences, analysis of means (ANOM) to compare group means to the overall mean or Dunnett’s test to compare each group mean to a control mean.

What if my data are not from normal distributions?

If your sample size is very small, it might be hard to test for normality. In this situation, you might need to use your understanding of the measurements. For example, for the body fat data, the trainer knows that the underlying distribution of body fat is normally distributed. Even for a very small sample, the trainer would likely go ahead with the t -test and assume normality.

What if you know the underlying measurements are not normally distributed? Or what if your sample size is large and the test for normality is rejected? In this situation, you can use nonparametric analyses. These types of analyses do not depend on an assumption that the data values are from a specific distribution. For the two-sample t -test, the Wilcoxon rank sum test is a nonparametric test that could be used.

Two Sample t-test: Definition, Formula, and Example

A two sample t-test is used to determine whether or not two population means are equal.

This tutorial explains the following:

  • The motivation for performing a two sample t-test.
  • The formula to perform a two sample t-test.
  • The assumptions that should be met to perform a two sample t-test.
  • An example of how to perform a two sample t-test.

Two Sample t-test: Motivation

Suppose we want to know whether or not the mean weight between two different species of turtles is equal. Since there are thousands of turtles in each population, it would be too time-consuming and costly to go around and weigh each individual turtle.

Instead, we might take a simple random sample of 15 turtles from each population and use the mean weight in each sample to determine if the mean weight is equal between the two populations:

Two sample t-test example

However, it’s virtually guaranteed that the mean weight between the two samples will be at least a little different. The question is whether or not this difference is statistically significant . Fortunately, a two sample t-test allows us to answer this question.

Two Sample t-test: Formula

A two-sample t-test always uses the following null hypothesis:

  • H 0 : μ 1  = μ 2 (the two population means are equal)

The alternative hypothesis can be either two-tailed, left-tailed, or right-tailed:

  • H 1 (two-tailed): μ 1  ≠ μ 2 (the two population means are not equal)
  • H 1 (left-tailed): μ 1 < μ 2 (population 1 mean is less than population 2 mean)
  • H 1 (right-tailed):  μ 1 > μ 2  (population 1 mean is greater than population 2 mean)

We use the following formula to calculate the test statistic t:

Test statistic: t = (x̄1 – x̄2) / (sp · √(1/n1 + 1/n2))

where x̄1 and x̄2 are the sample means, n1 and n2 are the sample sizes, and sp is the pooled standard deviation, calculated as:

sp = √[ ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) ]

where s1² and s2² are the sample variances.

If the p-value that corresponds to the test statistic t with (n1 + n2 − 2) degrees of freedom is less than your chosen significance level (common choices are 0.10, 0.05, and 0.01), then you can reject the null hypothesis.

Two Sample t-test: Assumptions

For the results of a two sample t-test to be valid, the following assumptions should be met:

  • The observations in one sample should be independent of the observations in the other sample.
  • The data should be approximately normally distributed.
  • The two samples should have approximately the same variance. If this assumption is not met, you should instead perform Welch’s t-test .
  • The data in both samples was obtained using a random sampling method .

Two Sample t-test : Example

Suppose we want to know whether or not the mean weight between two different species of turtles is equal. To test this, we will perform a two sample t-test at significance level α = 0.05 using the following steps:

Step 1: Gather the sample data.

Suppose we collect a random sample of turtles from each population with the following information:

  • Sample size n 1 = 40
  • Sample mean weight  x 1  = 300
  • Sample standard deviation s 1 = 18.5
  • Sample size n 2 = 38
  • Sample mean weight  x 2  = 305
  • Sample standard deviation s 2 = 16.7

Step 2: Define the hypotheses.

We will perform the two sample t-test with the following hypotheses:

  • H 0 :  μ 1  = μ 2 (the two population means are equal)
  • H 1 :  μ 1  ≠ μ 2 (the two population means are not equal)

Step 3: Calculate the test statistic  t .

First, we will calculate the pooled standard deviation s p :

sp = √[ ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) ] = √[ ((40 − 1)·18.5² + (38 − 1)·16.7²) / (40 + 38 − 2) ] = 17.647

Next, we will calculate the test statistic  t :

t = (x̄1 – x̄2) / (sp · √(1/n1 + 1/n2)) = (300 − 305) / (17.647 · √(1/40 + 1/38)) = −1.2508

Step 4: Calculate the p-value of the test statistic  t .

According to the T Score to P Value Calculator , the p-value associated with t = -1.2508 and degrees of freedom = n 1 +n 2 -2 = 40+38-2 = 76 is  0.21484 .

Step 5: Draw a conclusion.

Since this p-value is not less than our significance level α = 0.05, we fail to reject the null hypothesis. We do not have sufficient evidence to say that the mean weight of turtles between these two populations is different.

Note:  You can also perform this entire two sample t-test by simply using the Two Sample t-test Calculator .
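For completeness, the turtle example can also be reproduced from the summary statistics with SciPy (a sketch, not part of the original tutorial):

```python
from scipy.stats import ttest_ind_from_stats

stat, p = ttest_ind_from_stats(mean1=300, std1=18.5, nobs1=40,
                               mean2=305, std2=16.7, nobs2=38,
                               equal_var=True)            # pooled (equal-variance) t-test
print(f"t = {stat:.4f}, p = {p:.5f}")                     # roughly t = -1.2508, p = 0.215
```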

Additional Resources

The following tutorials explain how to perform a two-sample t-test using different statistical programs:

  • How to Perform a Two Sample t-test in Excel
  • How to Perform a Two Sample t-test in SPSS
  • How to Perform a Two Sample t-test in Stata
  • How to Perform a Two Sample t-test in R
  • How to Perform a Two Sample t-test in Python
  • How to Perform a Two Sample t-test on a TI-84 Calculator




Inference for Comparing 2 Population Means (HT for 2 Means, independent samples)

More of the good stuff! We will need to know how to label the null and alternative hypothesis, calculate the test statistic, and then reach our conclusion using the critical value method or the p-value method.

The Test Statistic for a Test of 2 Means from Independent Samples:

[latex]t = \displaystyle \frac{(\bar{x_1} - \bar{x_2}) - (\mu_1 - \mu_2)}{\sqrt{\displaystyle \frac{s_1^2}{n_1} + \displaystyle \frac{s_2^2}{n_2}}}[/latex]

What the different symbols mean:

[latex]n_1[/latex] is the sample size for the first group

[latex]n_2[/latex] is the sample size for the second group

[latex]df[/latex], the degrees of freedom, is the smaller of [latex]n_1 - 1[/latex] and [latex]n_2 - 1[/latex]

[latex]\mu_1[/latex] is the population mean from the first group

[latex]\mu_2[/latex] is the population mean from the second group

[latex]\bar{x_1}[/latex] is the sample mean for the first group

[latex]\bar{x_2}[/latex] is the sample mean for the second group

[latex]s_1[/latex] is the sample standard deviation for the first group

[latex]s_2[/latex] is the sample standard deviation for the second group

[latex]\alpha[/latex] is the significance level , usually given within the problem, or if not given, we assume it to be 5% or 0.05

Assumptions when conducting a Test for 2 Means from Independent Samples:

  • We do not know the population standard deviations, and we do not assume they are equal
  • The two samples or groups are independent
  • Both samples are simple random samples
  • Both populations are Normally distributed OR both samples are large ([latex]n_1 > 30[/latex] and [latex]n_2 > 30[/latex])

Steps to conduct the Test for 2 Means from Independent Samples:

  • Identify all the symbols listed above (all the stuff that will go into the formulas). This includes [latex]n_1[/latex] and [latex]n_2[/latex], [latex]df[/latex], [latex]\mu_1[/latex] and [latex]\mu_2[/latex], [latex]\bar{x_1}[/latex] and [latex]\bar{x_2}[/latex], [latex]s_1[/latex] and [latex]s_2[/latex], and [latex]\alpha[/latex]
  • Identify the null and alternative hypotheses
  • Calculate the test statistic, [latex]t = \displaystyle \frac{(\bar{x_1} - \bar{x_2}) - (\mu_1 - \mu_2)}{\sqrt{\displaystyle \frac{s_1^2}{n_1} + \displaystyle \frac{s_2^2}{n_2}}}[/latex]
  • Find the critical value(s) OR the p-value OR both
  • Apply the Decision Rule
  • Write up a conclusion for the test

Example 1: Study on the effectiveness of stents for stroke patients [1]

In this study , researchers randomly assigned stroke patients to two groups: one received the current standard care (control) and the other received a stent surgery in addition to the standard care (stent treatment). If the stents work, the treatment group should have a lower average disability score . Do the results give convincing statistical evidence that the stent treatment reduces the average disability from stroke?

Since we are being asked for convincing statistical evidence, a hypothesis test should be conducted. In this case, we are dealing with averages from two samples or groups (the patients with stent treatment and patients receiving the standard care), so we will conduct a Test of 2 Means.

  • [latex]n_1 = 98[/latex] is the sample size for the first group
  • [latex]n_2 = 93[/latex] is the sample size for the second group
  • [latex]df[/latex], the degrees of freedom, is the smaller of [latex]98 - 1 = 97[/latex] and [latex]93 - 1 = 92[/latex], so [latex]df = 92[/latex]
  • [latex]\bar{x_1} = 2.26[/latex] is the sample mean for the first group
  • [latex]\bar{x_2} = 3.23[/latex] is the sample mean for the second group
  • [latex]s_1 = 1.78[/latex] is the sample standard deviation for the first group
  • [latex]s_2 = 1.78[/latex] is the sample standard deviation for the second group
  • [latex]\alpha = 0.05[/latex] (we were not told a specific value in the problem, so we are assuming it is 5%)
  • One additional assumption we extend from the null hypothesis is that [latex]\mu_1 - \mu_2 = 0[/latex]; this means that in our formula, those variables cancel out
  • [latex]H_{0}: \mu_1 = \mu_2[/latex]
  • [latex]H_{A}: \mu_1 < \mu_2[/latex]
  • [latex]t = \displaystyle \frac{(\bar{x_1} - \bar{x_2}) - (\mu_1 - \mu_2)}{\sqrt{\displaystyle \frac{s_1^2}{n_1} + \displaystyle \frac{s_2^2}{n_2}}} = \displaystyle \frac{(2.26 - 3.23) - 0}{\sqrt{\displaystyle \frac{1.78^2}{98} + \displaystyle \frac{1.78^2}{93}}} = -3.76[/latex]
  • StatDisk : We can conduct this test using StatDisk. The nice thing about StatDisk is that it will also compute the test statistic. From the main menu above we click on Analysis, Hypothesis Testing, and then Mean Two Independent Samples. From there enter the 0.05 significance, along with the specific values as outlined in the picture below in Step 2. Notice the alternative hypothesis is the [latex]<[/latex] option. Enter the sample size, mean, and standard deviation for each group, and make sure that unequal variances is selected. Now we click on Evaluate. If you check the values, the test statistic is reported in the Step 3 display, as well as the P-Value of 0.00011.
  • Applying the Decision Rule: We now compare this to our significance level, which is 0.05. If the p-value is smaller or equal to the alpha level, we have enough evidence for our claim, otherwise we do not. Here, [latex]p-value = 0.00011[/latex], which is definitely smaller than [latex]\alpha = 0.05[/latex], so we have enough evidence for the alternative hypothesis…but what does this mean?
  • Conclusion: Because our p-value  of [latex]0.00011[/latex] is less than our [latex]\alpha[/latex] level of [latex]0.05[/latex], we reject [latex]H_{0}[/latex]. We have convincing statistical evidence that the stent treatment reduces the average disability from stroke.
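The StatDisk steps can be mirrored in Python from the same summary statistics; a sketch using SciPy's unequal-variances option (as selected in StatDisk) and the one-sided alternative, assuming a SciPy version recent enough to accept the alternative argument:

```python
from scipy.stats import ttest_ind_from_stats

# Stent-treatment group (1) vs. standard-care group (2); H_A: mu1 < mu2
stat, p = ttest_ind_from_stats(mean1=2.26, std1=1.78, nobs1=98,
                               mean2=3.23, std2=1.78, nobs2=93,
                               equal_var=False,           # unequal variances (Welch)
                               alternative='less')        # one-sided alternative
print(f"t = {stat:.2f}, one-sided p = {p:.5f}")           # roughly t = -3.76, p = 0.0001
```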

Example 2: Home Run Distances

In 1998, Sammy Sosa and Mark McGwire (2 players in Major League Baseball) were on pace to set a new home run record. At the end of the season McGwire ended up with 70 home runs, and Sosa ended up with 66. The home run distances were recorded and compared (sometimes a player’s home run distance is used to measure their “power”). Do the results give convincing statistical evidence that the home run distances are different from each other? Who would you say “hit the ball farther” in this comparison?

Since we are being asked for convincing statistical evidence, a hypothesis test should be conducted. In this case, we are dealing with averages from two samples or groups (the home run distances), so we will conduct a Test of 2 Means.

  • [latex]n_1 = 70[/latex] is the sample size for the first group
  • [latex]n_2 = 66[/latex] is the sample size for the second group
  • [latex]df[/latex], the degrees of freedom, is the smaller of [latex]70 - 1 = 69[/latex] and [latex]66 - 1 = 65[/latex], so [latex]df = 65[/latex]
  • [latex]\bar{x_1} = 418.5[/latex] is the sample mean for the first group
  • [latex]\bar{x_2} = 404.8[/latex] is the sample mean for the second group
  • [latex]s_1 = 45.5[/latex] is the sample standard deviation for the first group
  • [latex]s_2 = 35.7[/latex] is the sample standard deviation for the second group
  • [latex]H_{0}: \mu_1 = \mu_2[/latex]
  • [latex]H_{A}: \mu_1 \neq \mu_2[/latex]
  • [latex]t = \displaystyle \frac{(\bar{x_1} - \bar{x_2}) - (\mu_1 - \mu_2)}{\sqrt{\displaystyle \frac{s_1^2}{n_1} + \displaystyle \frac{s_2^2}{n_2}}} = \displaystyle \frac{(418.5 - 404.8) - 0}{\sqrt{\displaystyle \frac{45.5^2}{70} + \displaystyle \frac{35.7^2}{65}}} = 1.95[/latex]
  • StatDisk : We can conduct this test using StatDisk. The nice thing about StatDisk is that it will also compute the test statistic. From the main menu above we click on Analysis, Hypothesis Testing, and then Mean Two Independent Samples. From there enter the 0.05 significance, along with the specific values as outlined in the picture below in Step 2. Notice the alternative hypothesis is the [latex]\neq[/latex] option. Enter the sample size, mean, and standard deviation for each group, and make sure that unequal variances is selected. Now we click on Evaluate. If you check the values, the test statistic is reported in the Step 3 display, as well as the P-Value of 0.05221.
  • Applying the Decision Rule: We now compare this to our significance level, which is 0.05. If the p-value is smaller or equal to the alpha level, we have enough evidence for our claim, otherwise we do not. Here, [latex]p-value = 0.05221[/latex], which is larger than [latex]\alpha = 0.05[/latex], so we do not have enough evidence for the alternative hypothesis…but what does this mean?
  • Conclusion: Because our p-value  of [latex]0.05221[/latex] is larger than our [latex]\alpha[/latex] level of [latex]0.05[/latex], we fail to reject [latex]H_{0}[/latex]. We do not have convincing statistical evidence that the home run distances are different.
  • Follow-up commentary: But what does this mean? There actually was a difference, right? If we take McGwire’s average and subtract Sosa’s average we get a difference of 13.7. What this result indicates is that the difference is not statistically significant; it could be due more to random chance than something meaningful. Other factors, such as sample size, could also be a determining factor (with a larger sample size, the difference may have been more meaningful).
  • Adapted from the Skew The Script curriculum ( skewthescript.org ), licensed under CC BY-NC-Sa 4.0 ↵


Power and Sample Size Determination

Lisa Sullivan, PhD

Professor of Biostatistics

Boston University School of Public Health


Introduction

A critically important aspect of any study is determining the appropriate sample size to answer the research question. This module will focus on formulas that can be used to estimate the sample size needed to produce a confidence interval estimate with a specified margin of error (precision) or to ensure that a test of hypothesis has a high probability of detecting a meaningful difference in the parameter.

Studies should be designed to include a sufficient number of participants to adequately address the research question. Studies that have either an inadequate number of participants or an excessively large number of participants are both wasteful in terms of participant and investigator time, resources to conduct the assessments, analytic efforts and so on. These situations can also be viewed as unethical as participants may have been put at risk as part of a study that was unable to answer an important question. Studies that are much larger than they need to be to answer the research questions are also wasteful.

The formulas presented here generate estimates of the necessary sample size(s) required based on statistical criteria. However, in many studies, the sample size is determined by financial or logistical constraints. For example, suppose a study is proposed to evaluate a new screening test for Down Syndrome.  Suppose that the screening test is based on analysis of a blood sample taken from women early in pregnancy. In order to evaluate the properties of the screening test (e.g., the sensitivity and specificity), each pregnant woman will be asked to provide a blood sample and in addition to undergo an amniocentesis. The amniocentesis is included as the gold standard and the plan is to compare the results of the screening test to the results of the amniocentesis. Suppose that the collection and processing of the blood sample costs $250 per participant and that the amniocentesis costs $900 per participant. These financial constraints alone might substantially limit the number of women that can be enrolled. Just as it is important to consider both statistical and clinical significance when interpreting results of a statistical analysis, it is also important to weigh both statistical and logistical issues in determining the sample size for a study.

Learning Objectives

After completing this module, the student will be able to:

  • Provide examples demonstrating how the margin of error, effect size and variability of the outcome affect sample size computations.
  • Compute the sample size required to estimate population parameters with precision.
  • Interpret statistical power in tests of hypothesis.
  • Compute the sample size required to ensure high power when hypothesis testing.

Issues in Estimating Sample Size for Confidence Interval Estimates

The module on confidence intervals provided methods for estimating confidence intervals for various parameters (e.g., μ, p, (μ₁ − μ₂), μ_d, (p₁ − p₂)). Confidence intervals for every parameter take the following general form:

Point Estimate ± Margin of Error

In the module on confidence intervals we derived the formula for the confidence interval for μ as:

x̄ ± Z (σ/√n)

In practice we use the sample standard deviation to estimate the population standard deviation. Note that there is an alternative formula for estimating the mean of a continuous outcome in a single population, and it is used when the sample size is small (n<30). It involves a value from the t distribution, as opposed to one from the standard normal distribution, to reflect the desired level of confidence. When performing sample size computations, we use the large sample formula shown here. [Note: The resultant sample size might be small, and in the analysis stage, the appropriate confidence interval formula must be used.]

The point estimate for the population mean is the sample mean, and the margin of error is:

E = Z (σ/√n)

In planning studies, we want to determine the sample size needed to ensure that the margin of error is sufficiently small to be informative. For example, suppose we want to estimate the mean weight of female college students. We conduct a study and generate a 95% confidence interval as follows 125 + 40 pounds, or 85 to 165 pounds. The margin of error is so wide that the confidence interval is uninformative. To be informative, an investigator might want the margin of error to be no more than 5 or 10 pounds (meaning that the 95% confidence interval would have a width (lower limit to upper limit) of 10 or 20 pounds). In order to determine the sample size needed, the investigator must specify the desired margin of error . It is important to note that this is not a statistical issue, but a clinical or a practical one. For example, suppose we want to estimate the mean birth weight of infants born to mothers who smoke cigarettes during pregnancy. Birth weights in infants clearly have a much more restricted range than weights of female college students. Therefore, we would probably want to generate a confidence interval for the mean birth weight that has a margin of error not exceeding 1 or 2 pounds.

The margin of error in the one-sample confidence interval for μ can therefore be written as E = Z·σ/√n.

Our goal is to determine the sample size, n, that ensures that the margin of error, " E ," does not exceed a specified value. We can take the formula above and, with some algebra, solve for n :

First, multiply both sides of the equation by the square root of n, then cancel out the square root of n from the numerator and denominator on the right side of the equation (since any number divided by itself is equal to 1). This leaves:

E·√n = Z·σ

Now divide both sides by E and cancel out E from the numerator and denominator on the left side. This leaves:

√n = Z·σ/E

Finally, square both sides of the equation to get:

n = (Z·σ/E)²

This formula generates the sample size, n , required to ensure that the margin of error, E , does not exceed a specified value. To solve for n , we must input " Z ," " σ ," and " E ."  

  • Z is the value from the table of probabilities of the standard normal distribution for the desired confidence level (e.g., Z = 1.96 for 95% confidence)
  • E is the margin of error that the investigator specifies as important from a clinical or practical standpoint.
  • σ is the standard deviation of the outcome of interest.

Sometimes it is difficult to estimate σ . When we use the sample size formula above (or one of the other formulas that we will present in the sections that follow), we are planning a study to estimate the unknown mean of a particular outcome variable in a population. It is unlikely that we would know the standard deviation of that variable. In sample size computations, investigators often use a value for the standard deviation from a previous study or a study done in a different, but comparable, population. The sample size computation is not an application of statistical inference and therefore it is reasonable to use an appropriate estimate for the standard deviation. The estimate can be derived from a different study that was reported in the literature; some investigators perform a small pilot study to estimate the standard deviation. A pilot study usually involves a small number of participants (e.g., n=10) who are selected by convenience, as opposed to by random sampling. Data from the participants in the pilot study can be used to compute a sample standard deviation, which serves as a good estimate for σ in the sample size formula. Regardless of how the estimate of the variability of the outcome is derived, it should always be conservative (i.e., as large as is reasonable), so that the resultant sample size is not too small.

Sample Size for One Sample, Continuous Outcome

In studies where the plan is to estimate the mean of a continuous outcome variable in a single population, the formula for determining sample size is: n = (Z·σ/E)²

where Z is the value from the standard normal distribution reflecting the confidence level that will be used (e.g., Z = 1.96 for 95%), σ is the standard deviation of the outcome variable and E is the desired margin of error. The formula above generates the minimum number of subjects required to ensure that the margin of error in the confidence interval for μ does not exceed E .  

An investigator wants to estimate the mean systolic blood pressure in children with congenital heart disease who are between the ages of 3 and 5. How many children should be enrolled in the study? The investigator plans on using a 95% confidence interval (so Z=1.96) and wants a margin of error of 5 units. The standard deviation of systolic blood pressure is unknown, but the investigators conduct a literature search and find that the standard deviation of systolic blood pressures in children with other cardiac defects is between 15 and 20. To estimate the sample size, we consider the larger standard deviation in order to obtain the most conservative (largest) sample size. 

In order to ensure that the 95% confidence interval estimate of the mean systolic blood pressure in children between the ages of 3 and 5 with congenital heart disease is within 5 units of the true mean, a sample of size 62 is needed. [ Note : We always round up; the sample size formulas always generate the minimum number of subjects needed to ensure the specified precision.] Had we assumed a standard deviation of 15, the sample size would have been n=35. Because the estimates of the standard deviation were derived from studies of children with other cardiac defects, it would be advisable to use the larger standard deviation and plan for a study with 62 children. Selecting the smaller sample size could potentially produce a confidence interval estimate with a larger margin of error. 
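As a quick check of the arithmetic, the computation is easy to script. Below is a minimal sketch (the function and variable names are ours, not part of the module) that reproduces the two sample sizes just discussed:

import math

def n_one_mean(z, sigma, e):
    # Minimum n so that the margin of error of the confidence interval for a mean is at most e
    return math.ceil((z * sigma / e) ** 2)

print(n_one_mean(1.96, 20, 5))   # 62, using the more conservative standard deviation of 20
print(n_one_mean(1.96, 15, 5))   # 35, using a standard deviation of 15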

An investigator wants to estimate the mean birth weight of infants born full term (approximately 40 weeks gestation) to mothers who are 19 years of age and under. The mean birth weight of infants born full-term to mothers 20 years of age and older is 3,510 grams with a standard deviation of 385 grams. How many women 19 years of age and under must be enrolled in the study to ensure that a 95% confidence interval estimate of the mean birth weight of their infants has a margin of error not exceeding 100 grams? Try to work through the calculation before you look at the answer.

Sample Size for One Sample, Dichotomous Outcome 

In studies where the plan is to estimate the proportion of successes in a dichotomous outcome variable (yes/no) in a single population, the formula for determining sample size is: n = p(1 - p)(Z/E)²

where Z is the value from the standard normal distribution reflecting the confidence level that will be used (e.g., Z = 1.96 for 95%) and E is the desired margin of error. p is the proportion of successes in the population. Here we are planning a study to generate a 95% confidence interval for the unknown population proportion, p. The equation for the sample size needed to estimate p seems to require knowledge of p, but this is obviously a circular argument, because if we knew the proportion of successes in the population, then a study would not be necessary! What we really need is an approximate or anticipated value of p. The range of p is 0 to 1, and therefore the range of p(1-p) is 0 to 0.25. The value of p that maximizes p(1-p) is p=0.5. Consequently, if there is no information available to approximate p, then p=0.5 can be used to generate the most conservative, or largest, sample size.

Example 2:  

An investigator wants to estimate the proportion of freshmen at his University who currently smoke cigarettes (i.e., the prevalence of smoking). How many freshmen should be involved in the study to ensure that a 95% confidence interval estimate of the proportion of freshmen who smoke is within 5% of the true proportion?

Because we have no information on the proportion of freshmen who smoke, we use 0.5 to estimate the sample size as follows: n = 0.5(1 - 0.5)(1.96/0.05)² = 384.2, which rounds up to 385.

In order to ensure that the 95% confidence interval estimate of the proportion of freshmen who smoke is within 5% of the true proportion, a sample of size 385 is needed.

Suppose that a similar study was conducted 2 years ago and found that the prevalence of smoking was 27% among freshmen. If the investigator believes that this is a reasonable estimate of prevalence 2 years later, it can be used to plan the next study. Using this estimate of p, what sample size is needed (assuming that again a 95% confidence interval will be used and we want the same level of precision)?

An investigator wants to estimate the prevalence of breast cancer among women who are between 40 and 45 years of age living in Boston. How many women must be involved in the study to ensure that the estimate is precise? National data suggest that 1 in 235 women are diagnosed with breast cancer by age 40. This translates to a proportion of 0.0043 (0.43%) or a prevalence of 43 per 10,000 women. Suppose the investigator wants the estimate to be within 10 per 10,000 women (E = 0.0010) with 95% confidence. The sample size is computed as follows: n = 0.0043(1 - 0.0043)(1.96/0.0010)² = 16,447.9, which rounds up to 16,448.

A sample of size n=16,448 will ensure that a 95% confidence interval estimate of the prevalence of breast cancer is within 0.0010 (or to within 10 women per 10,000) of its true value. This is a situation where investigators might decide that a sample of this size is not feasible. Suppose that the investigators thought a sample of size 5,000 would be reasonable from a practical point of view. How precisely can we estimate the prevalence with a sample of size n=5,000? Recall that the confidence interval formula to estimate prevalence is p̂ ± Z √(p̂(1 - p̂)/n).

Assuming that the prevalence of breast cancer in the sample will be close to that based on national data, we would expect the margin of error to be approximately E = 1.96 √(0.0043(1 - 0.0043)/5,000) = 0.0018.

Thus, with n=5,000 women, a 95% confidence interval would be expected to have a margin of error of 0.0018 (or 18 per 10,000). The investigators must decide if this would be sufficiently precise to answer the research question. Note that the above is based on the assumption that the prevalence of breast cancer in Boston is similar to that reported nationally. This may or may not be a reasonable assumption. In fact, it is the objective of the current study to estimate the prevalence in Boston. The research team, with input from clinical investigators and biostatisticians, must carefully evaluate the implications of selecting a sample of size n = 5,000, n = 16,448 or any size in between.
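The proportion computations above can be checked the same way. This is a minimal sketch (function names are ours) that reproduces the smoking and breast cancer examples:

import math

def n_one_proportion(z, p, e):
    # Minimum n so that the margin of error of the confidence interval for a proportion is at most e
    return math.ceil(p * (1 - p) * (z / e) ** 2)

def margin_of_error_proportion(z, p, n):
    # Expected margin of error for a proportion estimated from a sample of size n
    return z * math.sqrt(p * (1 - p) / n)

print(n_one_proportion(1.96, 0.5, 0.05))       # 385 when no prior estimate of p is available
print(n_one_proportion(1.96, 0.0043, 0.0010))  # 16448 for the breast cancer prevalence example
print(round(margin_of_error_proportion(1.96, 0.0043, 5000), 4))  # 0.0018 with n = 5,000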

Sample Sizes for Two Independent Samples, Continuous Outcome

In studies where the plan is to estimate the difference in means between two independent populations, the formula for determining the sample sizes required in each comparison group is: n i = 2(Z·σ/E)²

where n i is the sample size required in each group (i=1,2), Z is the value from the standard normal distribution reflecting the confidence level that will be used and E is the desired margin of error. σ again reflects the standard deviation of the outcome variable. Recall from the module on confidence intervals that, when we generated a confidence interval estimate for the difference in means, we used Sp, the pooled estimate of the common standard deviation, as a measure of variability in the outcome (based on pooling the data), where Sp is computed as follows: Sp = √[((n 1 - 1)s 1 ² + (n 2 - 1)s 2 ²) / (n 1 + n 2 - 2)]

If data are available on variability of the outcome in each comparison group, then Sp can be computed and used in the sample size formula. However, it is more often the case that data on the variability of the outcome are available from only one group, often the untreated (e.g., placebo control) or unexposed group. When planning a clinical trial to investigate a new drug or procedure, data are often available from other trials that involved a placebo or an active control group (i.e., a standard medication or treatment given for the condition under study). The standard deviation of the outcome variable measured in patients assigned to the placebo, control or unexposed group can be used to plan a future trial, as illustrated below.  

Note that the formula for the sample size generates sample size estimates for samples of equal size. If a study is planned where different numbers of patients will be assigned or different numbers of patients will comprise the comparison groups, then alternative formulas can be used.  

An investigator wants to plan a clinical trial to evaluate the efficacy of a new drug designed to increase HDL cholesterol (the "good" cholesterol). The plan is to enroll participants and to randomly assign them to receive either the new drug or a placebo. HDL cholesterol will be measured in each participant after 12 weeks on the assigned treatment. Based on prior experience with similar trials, the investigator expects that 10% of all participants will be lost to follow up or will drop out of the study over 12 weeks. A 95% confidence interval will be estimated to quantify the difference in mean HDL levels between patients taking the new drug as compared to placebo. The investigator would like the margin of error to be no more than 3 units. How many patients should be recruited into the study?  

The sample sizes are computed using the formula shown above.

A major issue is determining the variability in the outcome of interest (σ), here the standard deviation of HDL cholesterol. To plan this study, we can use data from the Framingham Heart Study. In participants who attended the seventh examination of the Offspring Study and were not on treatment for high cholesterol, the standard deviation of HDL cholesterol is 17.1. We will use this value and the other inputs to compute the sample sizes as follows: n i = 2(1.96 × 17.1/3)² = 249.6, which rounds up to 250 per group.

Samples of size n 1 =250 and n 2 =250 will ensure that the 95% confidence interval for the difference in mean HDL levels will have a margin of error of no more than 3 units. Again, these sample sizes refer to the numbers of participants with complete data. The investigators hypothesized a 10% attrition (or drop-out) rate (in both groups). In order to ensure that the total sample size of 500 is available at 12 weeks, the investigator needs to recruit more participants to allow for attrition.  

N (number to enroll) * (% retained) = desired sample size

Therefore N (number to enroll) = desired sample size/(% retained)

N = 500/0.90 = 556

If they anticipate a 10% attrition rate, the investigators should enroll 556 participants. This will ensure N=500 with complete data at the end of the trial.
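A short script (names are ours) reproduces the HDL example, including the attrition adjustment:

import math

def n_per_group_two_means(z, sigma, e):
    # Minimum n per group so the confidence interval for a difference in means
    # has a margin of error of at most e
    return math.ceil(2 * (z * sigma / e) ** 2)

n_per_group = n_per_group_two_means(1.96, 17.1, 3)  # 250 per group with complete data
to_enroll = math.ceil(2 * n_per_group / 0.90)       # 556 enrolled to allow for 10% attrition
print(n_per_group, to_enroll)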

An investigator wants to compare two diet programs in children who are obese. One diet is a low fat diet, and the other is a low carbohydrate diet. The plan is to enroll children and weigh them at the start of the study. Each child will then be randomly assigned to either the low fat or the low carbohydrate diet. Each child will follow the assigned diet for 8 weeks, at which time they will again be weighed. The number of pounds lost will be computed for each child. Based on data reported from diet trials in adults, the investigator expects that 20% of all children will not complete the study. A 95% confidence interval will be estimated to quantify the difference in weight lost between the two diets and the investigator would like the margin of error to be no more than 3 pounds. How many children should be recruited into the study?  

Again the issue is determining the variability in the outcome of interest (σ), here the standard deviation in pounds lost over 8 weeks. To plan this study, investigators use data from a published study in adults. Suppose one such study compared the same diets in adults and involved 100 participants in each diet group. The study reported a standard deviation in weight lost over 8 weeks on a low fat diet of 8.4 pounds and a standard deviation in weight lost over 8 weeks on a low carbohydrate diet of 7.7 pounds. These data can be used to estimate the common standard deviation in weight lost as follows: Sp = √[(99(8.4)² + 99(7.7)²) / (100 + 100 - 2)] = √64.9 = 8.1.

We now use this value and the other inputs to compute the sample sizes: n i = 2(1.96 × 8.1/3)² ≈ 56, so 56 children per group are needed with complete data.

Samples of size n 1 =56 and n 2 =56 will ensure that the 95% confidence interval for the difference in weight lost between diets will have a margin of error of no more than 3 pounds. Again, these sample sizes refer to the numbers of children with complete data. The investigators anticipate a 20% attrition rate. In order to ensure that the total sample size of 112 is available at 8 weeks, the investigator needs to recruit more participants to allow for attrition.  

N = 112/0.80 = 140
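The pooling step and the resulting sample size can also be scripted. This sketch (names are ours) mirrors the diet example:

import math

def pooled_sd(n1, s1, n2, s2):
    # Pooled estimate of a common standard deviation from two groups
    return math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))

sp = pooled_sd(100, 8.4, 100, 7.7)                 # about 8.1 pounds
n_per_group = math.ceil(2 * (1.96 * sp / 3) ** 2)  # 56 children per group with complete data
to_enroll = math.ceil(2 * n_per_group / 0.80)      # 140 enrolled to allow for 20% attrition
print(round(sp, 1), n_per_group, to_enroll)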

Sample Size for Matched Samples, Continuous Outcome

In studies where the plan is to estimate the mean difference of a continuous outcome based on matched data, the formula for determining sample size is: n = (Z·σ d /E)²

where Z is the value from the standard normal distribution reflecting the confidence level that will be used (e.g., Z = 1.96 for 95%), E is the desired margin of error, and σ d is the standard deviation of the difference scores. It is extremely important that the standard deviation of the difference scores (e.g., the difference based on measurements over time or the difference between matched pairs) is used here to appropriately estimate the sample size.    

Sample Sizes for Two Independent Samples, Dichotomous Outcome

In studies where the plan is to estimate the difference in proportions between two independent populations (i.e., to estimate the risk difference), the formula for determining the sample sizes required in each comparison group is: n i = [p 1 (1 - p 1 ) + p 2 (1 - p 2 )](Z/E)²

where n i is the sample size required in each group (i=1,2), Z is the value from the standard normal distribution reflecting the confidence level that will be used (e.g., Z = 1.96 for 95%), and E is the desired margin of error. p 1 and p 2 are the proportions of successes in each comparison group. Again, here we are planning a study to generate a 95% confidence interval for the difference in unknown proportions, and the formula to estimate the sample sizes needed requires p 1 and p 2 . In order to estimate the sample size, we need approximate values of p 1 and p 2 . The values of p 1 and p 2 that maximize the sample size are p 1 =p 2 =0.5. Thus, if there is no information available to approximate p 1 and p 2 , then 0.5 can be used to generate the most conservative, or largest, sample sizes.    

Similar to the situation for two independent samples and a continuous outcome at the top of this page, it may be the case that data are available on the proportion of successes in one group, usually the untreated (e.g., placebo control) or unexposed group. If so, the known proportion can be used for both p 1 and p 2 in the formula shown above. The formula shown above generates sample size estimates for samples of equal size. If a study is planned where different numbers of patients will be assigned or different numbers of patients will comprise the comparison groups, then alternative formulas can be used. Interested readers can see Fleiss for more details. 4

An investigator wants to estimate the impact of smoking during pregnancy on premature delivery. Normal pregnancies last approximately 40 weeks and premature deliveries are those that occur before 37 weeks. The 2005 National Vital Statistics report indicates that approximately 12% of infants are born prematurely in the United States. 5 The investigator plans to collect data through medical record review and to generate a 95% confidence interval for the difference in proportions of infants born prematurely to women who smoked during pregnancy as compared to those who did not. How many women should be enrolled in the study to ensure that the 95% confidence interval for the difference in proportions has a margin of error of no more than 4%?

The sample sizes (i.e., numbers of women who smoked and did not smoke during pregnancy) can be computed using the formula shown above. National data suggest that 12% of infants are born prematurely. We will use that estimate for both groups in the sample size computation: n i = [0.12(0.88) + 0.12(0.88)](1.96/0.04)² = 507.1, which rounds up to 508 per group.

Samples of size n 1 =508 women who smoked during pregnancy and n 2 =508 women who did not smoke during pregnancy will ensure that the 95% confidence interval for the difference in proportions who deliver prematurely will have a margin of error of no more than 4%.
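This sketch (names are ours) reproduces the computation for the premature delivery example:

import math

def n_per_group_risk_difference(z, p1, p2, e):
    # Minimum n per group so the confidence interval for a risk difference
    # has a margin of error of at most e
    return math.ceil((p1 * (1 - p1) + p2 * (1 - p2)) * (z / e) ** 2)

print(n_per_group_risk_difference(1.96, 0.12, 0.12, 0.04))  # 508 women per group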

Is attrition an issue here? 

Issues in Estimating Sample Size for Hypothesis Testing

In the module on hypothesis testing for means and proportions, we introduced techniques for means, proportions, differences in means, and differences in proportions. While each test involved details that were specific to the outcome of interest (e.g., continuous or dichotomous) and to the number of comparison groups (one, two, more than two), there were common elements to each test. For example, in each test of hypothesis, there are two errors that can be committed. The first is called a Type I error and refers to the situation where we incorrectly reject H 0 when in fact it is true. In the first step of any test of hypothesis, we select a level of significance, α, and α = P(Type I error) = P(Reject H 0 | H 0 is true). Because we purposely select a small value for α, we control the probability of committing a Type I error. The second type of error is called a Type II error and it is defined as the probability we do not reject H 0 when it is false. The probability of a Type II error is denoted β, and β = P(Type II error) = P(Do not Reject H 0 | H 0 is false). In hypothesis testing, we usually focus on power, which is defined as the probability that we reject H 0 when it is false, i.e., power = 1- β = P(Reject H 0 | H 0 is false). Power is the probability that a test correctly rejects a false null hypothesis. A good test is one with a low probability of committing a Type I error (i.e., small α) and high power (i.e., small β).

Here we present formulas to determine the sample size required to ensure that a test has high power. The sample size computations depend on the level of significance, α, the desired power of the test (equivalent to 1-β), the variability of the outcome, and the effect size. The effect size is the difference in the parameter of interest that represents a clinically meaningful difference. Similar to the margin of error in confidence interval applications, the effect size is determined based on clinical or practical criteria and not statistical criteria.

The concept of statistical power can be difficult to grasp. Before presenting the formulas to determine the sample sizes required to ensure high power in a test, we will first discuss power from a conceptual point of view.  

Suppose we want to test the following hypotheses at α=0.05: H 0 : μ = 90 versus H 1 : μ ≠ 90. To test the hypotheses, suppose we select a sample of size n=100. For this example, assume that the standard deviation of the outcome is σ=20. We compute the sample mean and then must decide whether the sample mean provides evidence to support the alternative hypothesis or not. This is done by computing a test statistic and comparing the test statistic to an appropriate critical value. If the null hypothesis is true (μ=90), then we are likely to select a sample whose mean is close in value to 90. However, it is also possible to select a sample whose mean is much larger or much smaller than 90. Recall from the Central Limit Theorem (see the module on Probability) that for large n (here n=100 is sufficiently large), the distribution of the sample mean is approximately normal with mean μ = 90 and standard error σ/√n = 20/√100 = 2.

If the null hypothesis is true, it is possible to observe any sample mean shown in the figure below; all are possible under H 0 : μ = 90.  

Distribution of the sample mean when μ = 90: a bell-shaped (normal) curve centered at 90.

Rejection Region for Test H 0 : μ = 90 versus H 1 : μ ≠ 90 at α =0.05

The rejection region consists of the two tails of this distribution, at the extremes above and below 90. If the alpha level is 0.05, then each tail accounts for an area of 0.025.

The areas in the two tails of the curve represent the probability of a Type I Error, α= 0.05. This concept was discussed in the module on Hypothesis Testing.  

Now, suppose that the alternative hypothesis, H 1 , is true (i.e., μ ≠ 90) and that the true mean is actually 94. The figure below shows the distributions of the sample mean under the null and alternative hypotheses. The values of the sample mean are shown along the horizontal axis.

Two overlapping normal distributions: one depicting the distribution of the sample mean under the null hypothesis (mean 90) and the other under the alternative hypothesis (mean 94).

If the true mean is 94, then the alternative hypothesis is true. In our test, we selected α = 0.05 and reject H 0 if the observed sample mean exceeds 93.92 (focusing on the upper tail of the rejection region for now). The critical value (93.92) is indicated by the vertical line. The probability of a Type II error is denoted β, and β = P(Do not Reject H 0 | H 0 is false), i.e., the probability of not rejecting the null hypothesis when the null hypothesis is actually false. β is shown in the figure above as the area under the rightmost curve (H 1 ) to the left of the vertical line (where we do not reject H 0 ). Power is defined as 1- β = P(Reject H 0 | H 0 is false) and is shown in the figure as the area under the rightmost curve (H 1 ) to the right of the vertical line (where we reject H 0 ).

Note that β and power are related to α, the variability of the outcome and the effect size. From the figure above we can see what happens to β and power if we increase α. Suppose, for example, we increase α to α = 0.10. The upper critical value would then be 93.29 (i.e., 90 + 1.645 × 2) instead of 93.92. The vertical line would shift to the left, increasing α, decreasing β and increasing power. While a better test is one with higher power, it is not advisable to increase α as a means to increase power. Nonetheless, there is a direct relationship between α and power (as α increases, so does power).

β and power are also related to the variability of the outcome and to the effect size. The effect size is the difference in the parameter of interest (e.g., μ) that represents a clinically meaningful difference. The figure above graphically displays α, β, and power when the difference in the mean under the null as compared to the alternative hypothesis is 4 units (i.e., 90 versus 94). The figure below shows the same components for the situation where the mean under the alternative hypothesis is 98.

Two overlapping bell-shaped distributions: one with a mean of 90 (null hypothesis) and the other with a mean of 98 (alternative hypothesis).

Notice that there is much higher power when there is a larger difference between the mean under H 0 as compared to H 1 (i.e., 90 versus 98). A statistical test is much more likely to reject the null hypothesis in favor of the alternative if the true mean is 98 than if the true mean is 94. Notice also in this case that there is little overlap in the distributions under the null and alternative hypotheses. If a sample mean of 97 or higher is observed, it is very unlikely that it came from a distribution whose mean is 90. In the previous figure for H 0 : μ = 90 and H 1 : μ = 94, if we observed a sample mean of 93, for example, it would not be as clear whether it came from a distribution whose mean is 90 or one whose mean is 94.

Ensuring That a Test Has High Power

In designing studies, investigators typically aim for power of 80% or 90% (just as we generally use 95% as the confidence level for confidence interval estimates). The inputs for the sample size formulas include the desired power, the level of significance and the effect size. The effect size is selected to represent a clinically meaningful or practically important difference in the parameter of interest, as we will illustrate.

The formulas we present below produce the minimum sample size to ensure that the test of hypothesis will have a specified probability of rejecting the null hypothesis when it is false (i.e., a specified power). In planning studies, investigators again must account for attrition or loss to follow-up. The formulas shown below produce the number of participants needed with complete data, and we will illustrate how attrition is addressed in planning studies.

In studies where the plan is to perform a test of hypothesis comparing the mean of a continuous outcome variable in a single population to a known mean, the hypotheses of interest are:

H 0 : μ = μ 0 and H 1 : μ ≠ μ 0 where μ 0 is the known mean (e.g., a historical control). The formula for determining sample size to ensure that the test has a specified power is: n = ((Z 1-α/2 + Z 1-β )/ES)²

where α is the selected level of significance and Z 1-α /2 is the value from the standard normal distribution holding 1- α/2 below it. For example, if α=0.05, then 1- α/2 = 0.975 and Z=1.960. 1- β is the selected power, and Z 1-β is the value from the standard normal distribution holding 1- β below it. Sample size estimates for hypothesis testing are often based on achieving 80% or 90% power. The Z 1-β values for these popular scenarios are given below:

  • For 80% power Z 0.80 = 0.84
  • For 90% power Z 0.90 =1.282

ES is the effect size, defined as follows: ES = | μ 1 - μ 0 | / σ

where μ 0 is the mean under H 0 , μ 1 is the mean under H 1 and σ is the standard deviation of the outcome of interest. The numerator of the effect size, the absolute value of the difference in means | μ 1 - μ 0 |, represents what is considered a clinically meaningful or practically important difference in means. Similar to the issue we faced when planning studies to estimate confidence intervals, it can sometimes be difficult to estimate the standard deviation. In sample size computations, investigators often use a value for the standard deviation from a previous study or a study performed in a different but comparable population. Regardless of how the estimate of the variability of the outcome is derived, it should always be conservative (i.e., as large as is reasonable), so that the resultant sample size will not be too small.

Example 7:  

An investigator hypothesizes that in people free of diabetes, fasting blood glucose, a risk factor for coronary heart disease, is higher in those who drink at least 2 cups of coffee per day. A cross-sectional study is planned to assess the mean fasting blood glucose levels in people who drink at least two cups of coffee per day. The mean fasting blood glucose level in people free of diabetes is reported as 95.0 mg/dL with a standard deviation of 9.8 mg/dL. 7 If the mean blood glucose level in people who drink at least 2 cups of coffee per day is 100 mg/dL, this would be important clinically. How many patients should be enrolled in the study to ensure that the power of the test is 80% to detect this difference? A two sided test will be used with a 5% level of significance.  

The effect size is computed as: ES = |100 - 95| / 9.8 = 0.51.

The effect size represents the meaningful difference in the population mean - here 95 versus 100, or 0.51 standard deviation units different. We now substitute the effect size and the appropriate Z values for the selected α and power to compute the sample size: n = ((1.96 + 0.84)/0.51)² = 30.1, which rounds up to 31.

Therefore, a sample of size n=31 will ensure that a two-sided test with α =0.05 has 80% power to detect a 5 mg/dL difference in mean fasting blood glucose levels.

In the planned study, participants will be asked to fast overnight and to provide a blood sample for analysis of glucose levels. Based on prior experience, the investigators hypothesize that 10% of the participants will fail to fast or will refuse to follow the study protocol. Therefore, a total of 35 participants will be enrolled in the study to ensure that 31 are available for analysis (see below).

N (number to enroll) * (% following protocol) = desired sample size

N = 31/0.90 = 35.
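The fasting glucose example can be reproduced with a short script (names are ours):

import math

def n_one_mean_power(z_alpha, z_beta, mu0, mu1, sigma):
    # Minimum n for a two-sided one-sample test of a mean with the chosen alpha and power
    es = abs(mu1 - mu0) / sigma
    return math.ceil(((z_alpha + z_beta) / es) ** 2)

n = n_one_mean_power(1.96, 0.84, 95.0, 100.0, 9.8)  # 31 participants with complete data
to_enroll = math.ceil(n / 0.90)                     # 35 enrolled to allow for 10% non-adherence
print(n, to_enroll)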

Sample Size for One Sample, Dichotomous Outcome

In studies where the plan is to perform a test of hypothesis comparing the proportion of successes in a dichotomous outcome variable in a single population to a known proportion, the hypotheses of interest are:

H 0 : p = p 0 versus H 1 : p ≠ p 0 , where p 0 is the known proportion (e.g., a historical control). The formula for determining the sample size to ensure that the test has a specified power is: n = ((Z 1-α/2 + Z 1-β )/ES)²

where α is the selected level of significance and Z 1-α/2 is the value from the standard normal distribution holding 1- α/2 below it. 1- β is the selected power and Z 1-β is the value from the standard normal distribution holding 1- β below it, and ES is the effect size, defined as follows: ES = |p 1 - p 0 | / √(p 0 (1 - p 0 ))

where p 0 is the proportion under H 0 and p 1 is the proportion under H 1 . The numerator of the effect size, the absolute value of the difference in proportions |p 1 -p 0 |, again represents what is considered a clinically meaningful or practically important difference in proportions.  

Example 8:  

A recent report from the Framingham Heart Study indicated that 26% of people free of cardiovascular disease had elevated LDL cholesterol levels, defined as LDL > 159 mg/dL. 9 An investigator hypothesizes that a higher proportion of patients with a history of cardiovascular disease will have elevated LDL cholesterol. How many patients should be studied to ensure that the power of the test is 90% to detect a 5% difference in the proportion with elevated LDL cholesterol? A two sided test will be used with a 5% level of significance.  

We first compute the effect size: ES = |0.31 - 0.26| / √(0.26(1 - 0.26)) = 0.05/0.44 = 0.11.

We now substitute the effect size and the appropriate Z values for the selected α and power to compute the sample size: n = ((1.96 + 1.282)/0.11)² = 868.6, which rounds up to 869.

A sample of size n=869 will ensure that a two-sided test with α =0.05 has 90% power to detect a 5% difference in the proportion of patients with a history of cardiovascular disease who have an elevated LDL cholesterol level.

A medical device manufacturer produces implantable stents. During the manufacturing process, approximately 10% of the stents are deemed to be defective. The manufacturer wants to test whether the proportion of defective stents is more than 10%. If the process produces more than 15% defective stents, then corrective action must be taken. Therefore, the manufacturer wants the test to have 90% power to detect a difference in proportions of this magnitude. How many stents must be evaluated? For your computations, use a two-sided test with a 5% level of significance. (Do the computation yourself, before looking at the answer.)

In studies where the plan is to perform a test of hypothesis comparing the means of a continuous outcome variable in two independent populations, the hypotheses of interest are:

H 0 : μ 1 = μ 2 versus H 1 : μ 1 ≠ μ 2 , where μ 1 and μ 2 are the means in the two comparison populations. The formula for determining the sample sizes to ensure that the test has a specified power is: n i = 2((Z 1-α/2 + Z 1-β )/ES)²

where n i is the sample size required in each group (i=1,2), α is the selected level of significance and Z 1-α/2 is the value from the standard normal distribution holding 1- α/2 below it, and 1- β is the selected power and Z 1-β is the value from the standard normal distribution holding 1- β below it. ES is the effect size, defined as: ES = | μ 1 - μ 2 | / σ

where | μ 1 - μ 2 | is the absolute value of the difference in means between the two groups expected under the alternative hypothesis, H 1 . σ is the standard deviation of the outcome of interest. Recall from the module on Hypothesis Testing that, when we performed tests of hypothesis comparing the means of two independent groups, we used Sp, the pooled estimate of the common standard deviation, as a measure of variability in the outcome.

Sp is computed as follows: Sp = √[((n 1 - 1)s 1 ² + (n 2 - 1)s 2 ²) / (n 1 + n 2 - 2)]

If data are available on variability of the outcome in each comparison group, then Sp can be computed and used to generate the sample sizes. However, it is more often the case that data on the variability of the outcome are available from only one group, usually the untreated (e.g., placebo control) or unexposed group. When planning a clinical trial to investigate a new drug or procedure, data are often available from other trials that may have involved a placebo or an active control group (i.e., a standard medication or treatment given for the condition under study). The standard deviation of the outcome variable measured in patients assigned to the placebo, control or unexposed group can be used to plan a future trial, as illustrated.  

 Note also that the formula shown above generates sample size estimates for samples of equal size. If a study is planned where different numbers of patients will be assigned or different numbers of patients will comprise the comparison groups, then alternative formulas can be used (see Howell 3 for more details).

An investigator is planning a clinical trial to evaluate the efficacy of a new drug designed to reduce systolic blood pressure. The plan is to enroll participants and to randomly assign them to receive either the new drug or a placebo. Systolic blood pressures will be measured in each participant after 12 weeks on the assigned treatment. Based on prior experience with similar trials, the investigator expects that 10% of all participants will be lost to follow up or will drop out of the study. If the new drug shows a 5 unit reduction in mean systolic blood pressure, this would represent a clinically meaningful reduction. How many patients should be enrolled in the trial to ensure that the power of the test is 80% to detect this difference? A two sided test will be used with a 5% level of significance.  

In order to compute the effect size, an estimate of the variability in systolic blood pressures is needed. Analysis of data from the Framingham Heart Study showed that the standard deviation of systolic blood pressure was 19.0. This value can be used to plan the trial.  

The effect size is: ES = 5/19.0 = 0.26.

Samples of size n 1 =232 and n 2 =232 will ensure that the test of hypothesis will have 80% power to detect a 5 unit difference in mean systolic blood pressures in patients receiving the new drug as compared to patients receiving the placebo. However, the investigators hypothesized a 10% attrition rate (in both groups), and to ensure that 232 participants per group (464 in total) are available at 12 weeks, they need to allow for attrition.

N = 232/0.90 = 258 per group.

The investigator must enroll 258 participants per group (516 in total) to be randomly assigned to receive either the new drug or placebo.
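A sketch of the same calculation (names are ours; note that the module rounds the effect size to two decimal places before substituting, which is reproduced here):

import math

def n_per_group_two_means_power(z_alpha, z_beta, diff, sigma):
    # Minimum n per group for a two-sided two-sample test of means with the chosen alpha and power
    es = round(abs(diff) / sigma, 2)  # effect size rounded to two decimals, as in the module
    return math.ceil(2 * ((z_alpha + z_beta) / es) ** 2)

n_per_group = n_per_group_two_means_power(1.96, 0.84, 5, 19.0)  # 232 per group with complete data
to_enroll_per_group = math.ceil(n_per_group / 0.90)             # 258 per group (516 in total)
print(n_per_group, to_enroll_per_group)

Using the unrounded effect size (5/19 ≈ 0.263) gives a slightly smaller figure of about 227 per group; the module's rounding is the more conservative choice.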

An investigator is planning a study to assess the association between alcohol consumption and grade point average among college seniors. The plan is to categorize students as heavy drinkers or not using 5 or more drinks on a typical drinking day as the criterion for heavy drinking. Mean grade point averages will be compared between students classified as heavy drinkers versus not using a two independent samples test of means. The standard deviation in grade point averages is assumed to be 0.42 and a meaningful difference in grade point averages (relative to drinking status) is 0.25 units. How many college seniors should be enrolled in the study to ensure that the power of the test is 80% to detect a 0.25 unit difference in mean grade point averages? Use a two-sided test with a 5% level of significance.  

Answer  

In studies where the plan is to perform a test of hypothesis on the mean difference in a continuous outcome variable based on matched data, the hypotheses of interest are:

H 0 : μ d = 0 versus H 1 : μ d ≠ 0, where μ d is the mean difference in the population. The formula for determining the sample size to ensure that the test has a specified power is: n = ((Z 1-α/2 + Z 1-β )/ES)²

where α is the selected level of significance and Z 1-α/2 is the value from the standard normal distribution holding 1- α/2 below it, 1- β is the selected power and Z 1-β is the value from the standard normal distribution holding 1- β below it and ES is the effect size, defined as follows: ES = μ d / σ d

where μ d is the mean difference expected under the alternative hypothesis, H 1 , and σ d is the standard deviation of the difference in the outcome (e.g., the difference based on measurements over time or the difference between matched pairs).    

   

Example 10:

An investigator wants to evaluate the efficacy of an acupuncture treatment for reducing pain in patients with chronic migraine headaches. The plan is to enroll patients who suffer from migraine headaches. Each will be asked to rate the severity of the pain they experience with their next migraine before any treatment is administered. Pain will be recorded on a scale of 1-100 with higher scores indicative of more severe pain. Each patient will then undergo the acupuncture treatment. On their next migraine (post-treatment), each patient will again be asked to rate the severity of the pain. The difference in pain will be computed for each patient. A two sided test of hypothesis will be conducted, at α =0.05, to assess whether there is a statistically significant difference in pain scores before and after treatment. How many patients should be involved in the study to ensure that the test has 80% power to detect a difference of 10 units on the pain scale? Assume that the standard deviation in the difference scores is approximately 20 units.    

First compute the effect size: ES = 10/20 = 0.50.

Then substitute the effect size and the appropriate Z values for the selected α and power to compute the sample size: n = ((1.96 + 0.84)/0.50)² = 31.4, which rounds up to 32.

A sample of size n=32 patients with migraine will ensure that a two-sided test with α =0.05 has 80% power to detect a mean difference of 10 points in pain before and after treatment, assuming that all 32 patients complete the treatment.
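A minimal sketch (names are ours) for the matched-pairs calculation in this example:

import math

def n_pairs_power(z_alpha, z_beta, mean_diff, sd_diff):
    # Minimum number of pairs for a two-sided paired test with the chosen alpha and power
    es = abs(mean_diff) / sd_diff
    return math.ceil(((z_alpha + z_beta) / es) ** 2)

print(n_pairs_power(1.96, 0.84, 10, 20))  # 32 patients (pre/post pairs)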

Sample Sizes for Two Independent Samples, Dichotomous Outcomes

In studies where the plan is to perform a test of hypothesis comparing the proportions of successes in two independent populations, the hypotheses of interest are:

H 0 : p 1 = p 2 versus H 1 : p 1 ≠ p 2

where p 1 and p 2 are the proportions in the two comparison populations. The formula for determining the sample sizes to ensure that the test has a specified power is: n i = 2((Z 1-α/2 + Z 1-β )/ES)²

where n i is the sample size required in each group (i=1,2), α is the selected level of significance and Z 1-α/2 is the value from the standard normal distribution holding 1- α/2 below it, and 1- β is the selected power and Z 1-β is the value from the standard normal distribution holding 1- β below it. ES is the effect size, defined as follows: ES = |p 1 - p 2 | / √(p(1 - p))

where |p 1 - p 2 | is the absolute value of the difference in proportions between the two groups expected under the alternative hypothesis, H 1 , and p is the overall proportion, based on pooling the data from the two comparison groups (p can be computed by taking the mean of the proportions in the two comparison groups, assuming that the groups will be of approximately equal size).  

Example 11: 

An investigator hypothesizes that there is a higher incidence of flu among students who use their athletic facility regularly than their counterparts who do not. The study will be conducted in the spring. Each student will be asked if they used the athletic facility regularly over the past 6 months and whether or not they had the flu. A test of hypothesis will be conducted to compare the proportion of students who used the athletic facility regularly and got flu with the proportion of students who did not and got flu. During a typical year, approximately 35% of the students experience flu. The investigators feel that a 30% increase in flu among those who used the athletic facility regularly would be clinically meaningful. How many students should be enrolled in the study to ensure that the power of the test is 80% to detect this difference in the proportions? A two sided test will be used with a 5% level of significance.  

We first compute the effect size by substituting the proportions of students in each group who are expected to develop flu, p 1 =0.46 (i.e., 0.35 × 1.30 ≈ 0.46), and p 2 =0.35, and the overall proportion, p=0.41 (i.e., (0.46+0.35)/2): ES = |0.46 - 0.35| / √(0.41(1 - 0.41)) = 0.11/0.49 = 0.22.

We now substitute the effect size and the appropriate Z values for the selected α and power to compute the sample size: n i = 2((1.96 + 0.84)/0.22)² = 323.97, which rounds up to 324 per group.

Samples of size n 1 =324 and n 2 =324 will ensure that the test of hypothesis will have 80% power to detect a 30% difference in the proportions of students who develop flu between those who do and do not use the athletic facilities regularly.

Donor Feces? Really? Clostridium difficile (also referred to as "C. difficile" or "C. diff.") is a bacterial species that can be found in the colon of humans, although its numbers are kept in check by other normal flora in the colon. Antibiotic therapy sometimes diminishes the normal flora in the colon to the point that C. difficile flourishes and causes infection with symptoms ranging from diarrhea to life-threatening inflammation of the colon. Illness from C. difficile most commonly affects older adults in hospitals or in long term care facilities and typically occurs after use of antibiotic medications. In recent years, C. difficile infections have become more frequent, more severe and more difficult to treat. Ironically, C. difficile is first treated by discontinuing antibiotics, if they are still being prescribed. If that is unsuccessful, the infection has been treated by switching to another antibiotic. However, treatment with another antibiotic frequently does not cure the C. difficile infection. There have been sporadic reports of successful treatment by infusing feces from healthy donors into the duodenum of patients suffering from C. difficile. (Yuk!) This re-establishes the normal microbiota in the colon, and counteracts the overgrowth of C. diff. The efficacy of this approach was tested in a randomized clinical trial reported in the New England Journal of Medicine (Jan. 2013). The investigators planned to randomly assign patients with recurrent C. difficile infection to either antibiotic therapy or to duodenal infusion of donor feces. In order to estimate the sample size that would be needed, the investigators assumed that the feces infusion would be successful 90% of the time, and antibiotic therapy would be successful in 60% of cases. How many subjects will be needed in each group to ensure that the power of the study is 80% with a level of significance α = 0.05?

Determining the appropriate design of a study is more important than the statistical analysis; a poorly designed study can never be salvaged, whereas a poorly analyzed study can be re-analyzed. A critical component in study design is the determination of the appropriate sample size. The sample size must be large enough to adequately answer the research question, yet not too large so as to involve too many patients when fewer would have sufficed. The determination of the appropriate sample size involves statistical criteria as well as clinical or practical considerations. Sample size determination involves teamwork; biostatisticians must work closely with clinical investigators to determine the sample size that will address the research question of interest with adequate precision or power to produce results that are clinically meaningful.

The following table summarizes the sample size formulas for each scenario described here. The formulas are organized by the proposed analysis, a confidence interval estimate or a test of hypothesis.
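In brief, restating the formulas developed in the sections above (E is the margin of error, ES the effect size, and Z, Z 1-α/2 , and Z 1-β the standard normal values defined earlier):

  • Confidence interval, one mean: n = (Zσ/E)²
  • Confidence interval, one proportion: n = p(1-p)(Z/E)²
  • Confidence interval, two independent means: n i = 2(Zσ/E)² per group
  • Confidence interval, matched samples: n = (Zσ d /E)²
  • Confidence interval, two independent proportions: n i = [p 1 (1-p 1 ) + p 2 (1-p 2 )](Z/E)² per group
  • Hypothesis test, one mean: n = ((Z 1-α/2 + Z 1-β )/ES)² with ES = |μ 1 - μ 0 |/σ
  • Hypothesis test, one proportion: n = ((Z 1-α/2 + Z 1-β )/ES)² with ES = |p 1 - p 0 |/√(p 0 (1-p 0 ))
  • Hypothesis test, two independent means: n i = 2((Z 1-α/2 + Z 1-β )/ES)² per group with ES = |μ 1 - μ 2 |/σ
  • Hypothesis test, matched samples: n = ((Z 1-α/2 + Z 1-β )/ES)² with ES = μ d /σ d
  • Hypothesis test, two independent proportions: n i = 2((Z 1-α/2 + Z 1-β )/ES)² per group with ES = |p 1 - p 2 |/√(p(1-p))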

  • 1. Buschman NA, Foster G, Vickers P. Adolescent girls and their babies: achieving optimal birth weight. Gestational weight gain and pregnancy outcome in terms of gestation at delivery and infant birth weight: a comparison between adolescents under 16 and adult women. Child: Care, Health and Development. 2001; 27(2): 163-171.
  • 2. Feuer EJ, Wun LM. DEVCAN: Probability of Developing or Dying of Cancer, Version 4.0. Bethesda, MD: National Cancer Institute, 1999.
  • 3. Howell DC. Statistical Methods for Psychology. Boston, MA: Duxbury Press, 1982.
  • 4. Fleiss JL. Statistical Methods for Rates and Proportions. New York, NY: John Wiley and Sons, Inc., 1981.
  • 5. National Center for Health Statistics. Health, United States, 2005 with Chartbook on Trends in the Health of Americans. Hyattsville, MD: US Government Printing Office; 2005.
  • 6. Plaskon LA, Penson DF, Vaughan TL, Stanford JL. Cigarette smoking and risk of prostate cancer in middle-aged men. Cancer Epidemiology Biomarkers & Prevention. 2003; 12: 604-609.
  • 7. Rutter MK, Meigs JB, Sullivan LM, D'Agostino RB, Wilson PW. C-reactive protein, the metabolic syndrome and prediction of cardiovascular events in the Framingham Offspring Study. Circulation. 2004; 110: 380-385.
  • 8. Ramachandran V, Sullivan LM, Wilson PW, Sempos CT, Sundstrom J, Kannel WB, Levy D, D'Agostino RB. Relative importance of borderline and elevated levels of coronary heart disease risk factors. Annals of Internal Medicine. 2005; 142: 393-402.
  • 9. Wechsler H, Lee JE, Kuo M, Lee H. College binge drinking in the 1990s: a continuing problem. Results of the Harvard School of Public Health 1999 College Alcohol Study. Journal of American College Health. 2000; 48: 199-210.

Answers to Selected Problems

Answer to the Birth Weight Question

An investigator wants to estimate the mean birth weight of infants born full term (approximately 40 weeks gestation) to mothers who are 19 years of age and under. The mean birth weight of infants born full-term to mothers 20 years of age and older is 3,510 grams with a standard deviation of 385 grams. How many women 19 years of age and under must be enrolled in the study to ensure that a 95% confidence interval estimate of the mean birth weight of their infants has a margin of error not exceeding 100 grams?

The required sample size is n = (Zσ/E)² = (1.96 × 385/100)² = 56.9, which rounds up to 57. In order to ensure that the 95% confidence interval estimate of the mean birthweight is within 100 grams of the true mean, a sample of size 57 is needed. In planning the study, the investigator must consider the fact that some women may deliver prematurely. If women are enrolled into the study during pregnancy, then more than 57 women will need to be enrolled so that after excluding those who deliver prematurely, 57 with outcome information will be available for analysis. For example, if 5% of the women are expected to deliver prematurely (i.e., 95% will deliver full term), then 60 women must be enrolled to ensure that 57 deliver full term. The number of women that must be enrolled, N, is computed as follows:

                                                        N (number to enroll) * (% retained) = desired sample size

                                                        N (0.95) = 57

                                                        N = 57/0.95 = 60.

Answer to the Freshmen Smoking Question

Using p = 0.27, n = 0.27(1 - 0.27)(1.96/0.05)² = 302.9, which rounds up to 303. In order to ensure that the 95% confidence interval estimate of the proportion of freshmen who smoke is within 5% of the true proportion, a sample of size 303 is needed. Notice that this sample size is substantially smaller than the one estimated above. Having some information on the magnitude of the proportion in the population will always produce a sample size that is less than or equal to the one based on a population proportion of 0.5. However, the estimate must be realistic.

Answer to the Medical Device Problem

A medical device manufacturer produces implantable stents. During the manufacturing process, approximately 10% of the stents are deemed to be defective. The manufacturer wants to test whether the proportion of defective stents is more than 10%. If the process produces more than 15% defective stents, then corrective action must be taken. Therefore, the manufacturer wants the test to have 90% power to detect a difference in proportions of this magnitude. How many stents must be evaluated? For your computations, use a two-sided test with a 5% level of significance.

We first compute the effect size: ES = |0.15 - 0.10| / √(0.10(1 - 0.10)) = 0.05/0.30 = 0.17. Then substitute the effect size and the appropriate Z values for the selected α and power to compute the sample size: n = ((1.96 + 1.282)/0.17)² = 363.7, which rounds up to 364.

A sample size of 364 stents will ensure that a two-sided test with α=0.05 has 90% power to detect a 0.05, or 5%, difference in the proportion of defective stents produced.

Answer to the Alcohol and GPA Question

An investigator is planning a study to assess the association between alcohol consumption and grade point average among college seniors. The plan is to categorize students as heavy drinkers or not using 5 or more drinks on a typical drinking day as the criterion for heavy drinking. Mean grade point averages will be compared between students classified as heavy drinkers versus not using a two independent samples test of means. The standard deviation in grade point averages is assumed to be 0.42 and a meaningful difference in grade point averages (relative to drinking status) is 0.25 units. How many college seniors should be enrolled in the study to ensure that the power of the test is 80% to detect a 0.25 unit difference in mean grade point averages? Use a two-sided test with a 5% level of significance.

First compute the effect size: ES = 0.25/0.42 ≈ 0.6.

Now substitute the effect size and the appropriate Z values for α and power to compute the sample size: n i = 2((1.96 + 0.84)/0.6)² = 43.6, which rounds up to 44 per group.

Sample sizes of n i =44 heavy drinkers and 44 students who drink fewer than five drinks per typical drinking day will ensure that the test of hypothesis has 80% power to detect a 0.25 unit difference in mean grade point averages.

Answer to the Donor Feces Question

We first compute the effect size by substituting the proportions of patients expected to be cured with each treatment, p 1 =0.6 and p 2 =0.9, and the overall proportion, p=0.75: ES = |0.9 - 0.6| / √(0.75(1 - 0.75)) = 0.3/0.433 = 0.69.

We now substitute the effect size and the appropriate Z values for the selected α and power to compute the sample size: n i = 2((1.96 + 0.84)/0.69)² = 32.9, which rounds up to 33 per group.

Samples of size n 1 =33 and n 2 =33 will ensure that the test of hypothesis will have 80% power to detect this difference in the proportions of patients who are cured of C. diff. by feces infusion versus antibiotic therapy.
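The same calculation, scripted (names are ours):

import math

def n_per_group_two_proportions_power(z_alpha, z_beta, p1, p2):
    # Minimum n per group for a two-sided test of two proportions with the chosen alpha and power
    p_bar = (p1 + p2) / 2  # overall proportion, assuming roughly equal group sizes
    es = abs(p1 - p2) / math.sqrt(p_bar * (1 - p_bar))
    return math.ceil(2 * ((z_alpha + z_beta) / es) ** 2)

print(n_per_group_two_proportions_power(1.96, 0.84, 0.60, 0.90))  # 33 patients per group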

In fact, the investigators enrolled 38 into each group to allow for attrition. Nevertheless, the study was stopped after an interim analysis. Of 16 patients in the infusion group, 13 (81%) had resolution of C. difficile–associated diarrhea after the first infusion. The 3 remaining patients received a second infusion with feces from a different donor, with resolution in 2 patients. Resolution of C. difficile infection occurred in only 4 of 13 patients (31%) receiving the antibiotic vancomycin.

5.5 - Hypothesis Testing for Two-Sample Proportions

We are now going to develop the hypothesis test for the difference of two proportions for independent samples. The hypothesis test follows the same steps as one group.

These notes are going to go into a little bit of math and formulas to help demonstrate the logic behind hypothesis testing for two groups. If this starts to get a little confusing, just skim over it for a general understanding! Remember, we can rely on the software to do the calculations for us, but it is good to have a basic understanding of the logic!

We will use the sampling distribution of \(\hat{p}_1-\hat{p}_2\) as we did for the confidence interval.

For a test for two proportions, we are interested in the difference between two groups. If the difference is zero, then they are not different (i.e., they are equal). Therefore, the null hypothesis will always be:

\(H_0\colon p_1-p_2=0\)

Another way to look at it is \(H_0\colon p_1=p_2\). This is worth stopping to think about. Remember, in hypothesis testing, we assume the null hypothesis is true. In this case, it means that \(p_1\) and \(p_2\) are equal. Under this assumption, then \(\hat{p}_1\) and \(\hat{p}_2\) are both estimating the same proportion. Think of this proportion as \(p^*\).

Therefore, the sampling distribution of both proportions, \(\hat{p}_1\) and \(\hat{p}_2\), will, under certain conditions, be approximately normal centered around \(p^*\), with standard error \(\sqrt{\dfrac{p^*(1-p^*)}{n_i}}\), for \(i=1, 2\).

We take this into account by finding an estimate for this \(p^*\) using the two-sample proportions. We can calculate an estimate of \(p^*\) using the following formula:

\(\hat{p}^*=\dfrac{x_1+x_2}{n_1+n_2}\)

This value is the total number in the desired categories \((x_1+x_2)\) from both samples over the total number of sampling units in the combined sample \((n_1+n_2)\).

Putting everything together, if we assume \(p_1=p_2\), then the sampling distribution of \(\hat{p}_1-\hat{p}_2\) will be approximately normal with mean 0 and standard error of \(\sqrt{p^*(1-p^*)\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}\), under certain conditions.

\(z^*=\dfrac{(\hat{p}_1-\hat{p}_2)-0}{\sqrt{\hat{p}^*(1-\hat{p}^*)\left(\dfrac{1}{n_1}+\dfrac{1}{n_2}\right)}}\)

...will follow a standard normal distribution.

Finally, we can develop our hypothesis test for \(p_1-p_2\).

Hypothesis Testing for Two-Sample Proportions

Conditions :

\(n_1\hat{p}_1\), \(n_1(1-\hat{p}_1)\), \(n_2\hat{p}_2\), and \(n_2(1-\hat{p}_2)\) are all greater than five

Test Statistic:

\(z^*=\dfrac{\hat{p}_1-\hat{p}_2-0}{\sqrt{\hat{p}^*(1-\hat{p}^*)\left(\dfrac{1}{n_1}+\dfrac{1}{n_2}\right)}}\)

...where \(\hat{p}^*=\dfrac{x_1+x_2}{n_1+n_2}\).

The critical values, p-values, and decisions will all follow the same steps as those from a hypothesis test for a one-sample proportion.
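To make the procedure concrete, here is a small sketch with made-up counts (the numbers below are purely hypothetical and are not from these notes):

from statistics import NormalDist

# Hypothetical data: x successes out of n sampling units in each group
x1, n1 = 52, 400
x2, n2 = 64, 350

p1_hat, p2_hat = x1 / n1, x2 / n2
p_star = (x1 + x2) / (n1 + n2)  # pooled proportion (p-hat-star)
se = (p_star * (1 - p_star) * (1 / n1 + 1 / n2)) ** 0.5
z = (p1_hat - p2_hat - 0) / se  # test statistic z*
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
print(round(z, 3), round(p_value, 4))

With these counts, n1·p̂1 = 52, n1(1 - p̂1) = 348, n2·p̂2 = 64, and n2(1 - p̂2) = 286, so the conditions listed above are met.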


9.2: Comparing Two Independent Population Means (Hypothesis test)


  • The two independent samples are simple random samples from two distinct populations.
  • if the sample sizes are small, the distributions are important (should be normal)
  • if the sample sizes are large, the distributions are not important (need not be normal)

The test comparing two independent population means with unknown and possibly unequal population standard deviations is called the Aspin-Welch \(t\)-test. The degrees of freedom formula was developed by Aspin-Welch.

The comparison of two population means is very common. A difference between the two samples depends on both the means and the standard deviations. Very different means can occur by chance if there is great variation among the individual samples. In order to account for the variation, we take the difference of the sample means, \(\bar{X}_{1} - \bar{X}_{2}\), and divide by the standard error in order to standardize the difference. The result is a t-score test statistic.

Because we do not know the population standard deviations, we estimate them using the two sample standard deviations from our independent samples. For the hypothesis test, we calculate the estimated standard deviation, or standard error , of the difference in sample means , \(\bar{X}_{1} - \bar{X}_{2}\).

The standard error is:

\[\sqrt{\dfrac{(s_{1})^{2}}{n_{1}} + \dfrac{(s_{2})^{2}}{n_{2}}}\]

The test statistic ( t -score) is calculated as follows:

\[t = \dfrac{(\bar{x}_{1}-\bar{x}_{2}) - (\mu_{1} - \mu_{2})}{\sqrt{\dfrac{(s_{1})^{2}}{n_{1}} + \dfrac{(s_{2})^{2}}{n_{2}}}}\]

  • \(s_{1}\) and \(s_{2}\), the sample standard deviations, are estimates of \(\sigma_{1}\) and \(\sigma_{2}\), respectively.
  • \(\sigma_{1}\) and \(\sigma_{2}\) are the unknown population standard deviations.
  • \(\bar{x}_{1}\) and \(\bar{x}_{2}\) are the sample means. \(\mu_{1}\) and \(\mu_{2}\) are the population means.

The number of degrees of freedom (\(df\)) requires a somewhat complicated calculation. However, a computer or calculator calculates it easily. The \(df\) are not always a whole number. The test statistic calculated previously is approximated by the Student's t -distribution with \(df\) as follows:

Degrees of freedom

\[df = \dfrac{\left(\dfrac{(s_{1})^{2}}{n_{1}} + \dfrac{(s_{2})^{2}}{n_{2}}\right)^{2}}{\left(\dfrac{1}{n_{1}-1}\right)\left(\dfrac{(s_{1})^{2}}{n_{1}}\right)^{2} + \left(\dfrac{1}{n_{2}-1}\right)\left(\dfrac{(s_{2})^{2}}{n_{2}}\right)^{2}}\]

We can also use a conservative estimate of the degrees of freedom by taking \(df\) to be the smaller of \(n_{1}-1\) and \(n_{2}-1\).

When both sample sizes \(n_{1}\) and \(n_{2}\) are five or larger, the Student's t approximation is very good. Notice that the sample variances \((s_{1})^{2}\) and \((s_{2})^{2}\) are not pooled. (If the question comes up, do not pool the variances.)

It is not necessary to compute the degrees of freedom by hand. A calculator or computer easily computes it.

Example \(\PageIndex{1}\): Independent groups

The average amount of time boys and girls aged seven to 11 spend playing sports each day is believed to be the same. A study is done and data are collected, resulting in the data in Table \(\PageIndex{1}\). Each population has a normal distribution.

Is there a difference in the mean amount of time boys and girls aged seven to 11 play sports each day? Test at the 5% level of significance.

The population standard deviations are not known. Let g be the subscript for girls and b be the subscript for boys. Then, \(\mu_{g}\) is the population mean for girls and \(\mu_{b}\) is the population mean for boys. This is a test of two independent groups, two population means.

Random variable: \(\bar{X}_{g} - \bar{X}_{b} =\) difference in the sample mean amount of time girls and boys play sports each day.

  • \(H_{0}: \mu_{g} = \mu_{b}\)  
  • \(H_{0}: \mu_{g} - \mu_{b} = 0\)
  • \(H_{a}: \mu_{g} \neq \mu_{b}\)  
  • \(H_{a}: \mu_{g} - \mu_{b} \neq 0\)

The words "the same" tell you \(H_{0}\) has an "=". Since there are no other words to indicate \(H_{a}\), assume it says "is different." This is a two-tailed test.

Distribution for the test: Use \(t_{df}\) where \(df\) is calculated using the \(df\) formula for independent groups, two population means. Using a calculator, \(df\) is approximately 18.8462. Do not pool the variances.

Calculate the p -value using a Student's t -distribution: \(p\text{-value} = 0.0054\)

Figure: distribution curve for the difference in mean daily playing times, centered at zero; the shaded regions to the left of \(-1.2\) and to the right of \(1.2\) together represent the \(p\text{-value}\).

\[s_{g} = 0.866\]

\[s_{b} = 1\]

\[\bar{x}_{g} - \bar{x}_{b} = 2 - 3.2 = -1.2\]

Half the \(p\text{-value}\) is below –1.2 and half is above 1.2.

Make a decision: Since \(\alpha > p\text{-value}\), reject \(H_{0}\). This means you reject \(\mu_{g} = \mu_{b}\). The means are different.

Press STAT . Arrow over to TESTS and press 4:2-SampTTest . Arrow over to Stats and press ENTER . Arrow down and enter 2 for the first sample mean, 0.866 for Sx1, 9 for n1, 3.2 for the second sample mean, 1 for Sx2, and 16 for n2. Arrow down to μ1: and arrow to does not equal μ2. Press ENTER . Arrow down to Pooled: and No . Press ENTER . Arrow down to Calculate and press ENTER . The \(p\text{-value}\) is \(p = 0.0054\), the \(df\) is approximately 18.8462, and the test statistic is \(-3.14\). Do the procedure again, but instead of Calculate choose Draw.

Conclusion: At the 5% level of significance, the sample data show there is sufficient evidence to conclude that the mean number of hours that girls and boys aged seven to 11 play sports per day is different (mean number of hours boys aged seven to 11 play sports per day is greater than the mean number of hours played by girls OR the mean number of hours girls aged seven to 11 play sports per day is greater than the mean number of hours played by boys).
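For readers working outside the TI calculator, the same test can be reproduced in R directly from the summary statistics by coding the standard-error and degrees-of-freedom formulas given above. This is only an illustrative sketch, not part of the original example.

```r
# Welch (Aspin-Welch) t-test from summary statistics: girls vs. boys example
xbar1 <- 2;   s1 <- 0.866; n1 <- 9    # girls
xbar2 <- 3.2; s2 <- 1;     n2 <- 16   # boys

se <- sqrt(s1^2 / n1 + s2^2 / n2)     # standard error of xbar1 - xbar2
t_stat <- (xbar1 - xbar2 - 0) / se    # test statistic under H0: mu1 - mu2 = 0

# Welch-Satterthwaite degrees of freedom (the variances are not pooled)
df <- (s1^2 / n1 + s2^2 / n2)^2 /
      ((s1^2 / n1)^2 / (n1 - 1) + (s2^2 / n2)^2 / (n2 - 1))

p_value <- 2 * pt(-abs(t_stat), df)   # two-tailed p-value

c(t = t_stat, df = df, p = p_value)   # roughly -3.14, 18.85, and 0.0054
```

If the raw data were available in two vectors, t.test(girls, boys) would carry out the same Welch test directly; R does not pool the variances unless var.equal = TRUE is specified.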

Exercise \(\PageIndex{1}\)

Two samples are shown in Table. Both have normal distributions. The means for the two populations are thought to be the same. Is there a difference in the means? Test at the 5% level of significance.

The \(p\text{-value}\) is \(0.4125\), which is much higher than 0.05, so we decline to reject the null hypothesis. There is not sufficient evidence to conclude that the means of the two populations are not the same.

When the sum of the sample sizes is larger than 30 \((n_{1} + n_{2} > 30)\), you can use the normal distribution to approximate the Student's \(t\).

Example \(\PageIndex{2}\)

A study is done by a community group in two neighboring colleges to determine which one graduates students with more math classes. College A samples 11 graduates. Their average is four math classes with a standard deviation of 1.5 math classes. College B samples nine graduates. Their average is 3.5 math classes with a standard deviation of one math class. The community group believes that a student who graduates from college A has taken more math classes, on the average. Both populations have a normal distribution. Test at a 1% significance level. Answer the following questions.

  • a. Is this a test of two means or two proportions?
  • b. Are the population standard deviations known or unknown?
  • c. Which distribution do you use to perform the test?
  • d. What is the random variable?
  • e. What are the null and alternate hypotheses? Write the null and alternate hypotheses in words and in symbols.
  • f. Is this test right-, left-, or two-tailed?
  • g. What is the \(p\text{-value}\)?
  • h. Do you reject or not reject the null hypothesis?

Answers (selected parts):

  • c. Student's \(t\)
  • d. \(\bar{X}_{A} - \bar{X}_{B}\)
  • e. \(H_{0}: \mu_{A} \leq \mu_{B}\) and \(H_{a}: \mu_{A} > \mu_{B}\)


  • h. Do not reject.
  • i. At the 1% level of significance, from the sample data, there is not sufficient evidence to conclude that a student who graduates from college A has taken more math classes, on the average, than a student who graduates from college B.

Exercise \(\PageIndex{2}\)

A study is done to determine if Company A retains its workers longer than Company B. Company A samples 15 workers, and their average time with the company is five years with a standard deviation of 1.2. Company B samples 20 workers, and their average time with the company is 4.5 years with a standard deviation of 0.8. The populations are normally distributed.

  • Are the population standard deviations known?
  • Conduct an appropriate hypothesis test. At the 5% significance level, what is your conclusion?
  • They are unknown.
  • The \(p\text{-value} = 0.0878\). At the 5% level of significance, there is insufficient evidence to conclude that the workers of Company A stay longer with the company.
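This exercise can be checked with the same summary-statistic formulas; because the claim is that Company A retains its workers longer, the test is right-tailed. The following R snippet is only an illustrative sketch.

```r
# Welch t-test from summary statistics: Company A vs. Company B (right-tailed)
xbar1 <- 5;   s1 <- 1.2; n1 <- 15   # Company A
xbar2 <- 4.5; s2 <- 0.8; n2 <- 20   # Company B

se <- sqrt(s1^2 / n1 + s2^2 / n2)
t_stat <- (xbar1 - xbar2) / se
df <- (s1^2 / n1 + s2^2 / n2)^2 /
      ((s1^2 / n1)^2 / (n1 - 1) + (s2^2 / n2)^2 / (n2 - 1))

pt(t_stat, df, lower.tail = FALSE)  # one-tailed p-value, close to the 0.0878 reported
```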

Example \(\PageIndex{3}\)

A professor at a large community college wanted to determine whether there is a difference in the means of final exam scores between students who took his statistics course online and the students who took his face-to-face statistics class. He believed that the mean of the final exam scores for the online class would be lower than that of the face-to-face class. Was the professor correct? The randomly selected 30 final exam scores from each group are listed in Table \(\PageIndex{3}\) and Table \(\PageIndex{4}\).

Is the mean of the Final Exam scores of the online class lower than the mean of the Final Exam scores of the face-to-face class? Test at a 5% significance level. Answer the following questions:

  • Are the population standard deviations known or unknown?
  • What are the null and alternative hypotheses? Write the null and alternative hypotheses in words and in symbols.
  • Is this test right, left, or two tailed?
  • At the ___ level of significance, from the sample data, there ______ (is/is not) sufficient evidence to conclude that ______.

(See the conclusion in Example, and write yours in a similar fashion)

Be careful not to mix up the information for Group 1 and Group 2!

  • Student's \(t\)
  • \(\bar{X}_{1} - \bar{X}_{2}\)
  • \(H_{0}: \mu_{1} = \mu_{2}\) Null hypothesis: the means of the final exam scores are equal for the online and face-to-face statistics classes.
  • \(H_{a}: \mu_{1} < \mu_{2}\) Alternative hypothesis: the mean of the final exam scores of the online class is less than the mean of the final exam scores of the face-to-face class.
  • left-tailed

Figure \(\PageIndex{3}\): distribution curve centered at zero; the shaded region in the left tail represents the \(p\text{-value} = 0.0011\).

  • Reject the null hypothesis

At the 5% level of significance, from the sample data, there is sufficient evidence to conclude that the mean of the final exam scores for the online class is less than the mean of the final exam scores of the face-to-face class.

First put the data for each group into two lists (such as L1 and L2). Press STAT. Arrow over to TESTS and press 4:2SampTTest. Make sure Data is highlighted and press ENTER. Arrow down and enter L1 for the first list and L2 for the second list. Arrow down to \(\mu_{1}\): and arrow to \(< \mu_{2}\) (less than). Press ENTER. Arrow down to Pooled: No. Press ENTER. Arrow down to Calculate and press ENTER.

Cohen's Standards for Small, Medium, and Large Effect Sizes

Cohen's \(d\) is a measure of effect size based on the differences between two means. Cohen’s \(d\), named for United States statistician Jacob Cohen, measures the relative strength of the differences between the means of two populations based on sample data. The calculated value of effect size is then compared to Cohen’s standards of small, medium, and large effect sizes.

Cohen's \(d\) is the measure of the difference between two means divided by the pooled standard deviation: \(d = \dfrac{\bar{x}_{1}-\bar{x}_{2}}{s_{\text{pooled}}}\) where \(s_{\text{pooled}} = \sqrt{\dfrac{(n_{1}-1)s^{2}_{1} + (n_{2}-1)s^{2}_{2}}{n_{1}+n_{2}-2}}\)

Example \(\PageIndex{4}\)

Calculate Cohen’s d for Example. Is the size of the effect small, medium, or large? Explain what the size of the effect means for this problem.

\(\mu_{1} = 4,\ s_{1} = 1.5,\ n_{1} = 11\)

\(\mu_{2} = 3.5,\ s_{2} = 1,\ n_{2} = 9\)

\(d = 0.384\)

The effect is small because 0.384 is between Cohen’s value of 0.2 for small effect size and 0.5 for medium effect size. The size of the differences of the means for the two colleges is small indicating that there is not a significant difference between them.
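A quick R check of this effect-size calculation, using the pooled-standard-deviation formula above with the two-college summary statistics (an illustrative sketch only):

```r
# Cohen's d for the two-college example
xbar1 <- 4;   s1 <- 1.5; n1 <- 11   # College A
xbar2 <- 3.5; s2 <- 1;   n2 <- 9    # College B

s_pooled <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
d <- (xbar1 - xbar2) / s_pooled
round(d, 3)   # 0.384, a small effect by Cohen's standards
```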

Example \(\PageIndex{5}\)

Calculate Cohen’s \(d\) for Example. Is the size of the effect small, medium or large? Explain what the size of the effect means for this problem.

\(d = 0.834\); Large, because 0.834 is greater than Cohen’s 0.8 for a large effect size. The size of the differences between the means of the Final Exam scores of online students and students in a face-to-face class is large indicating a significant difference.

Example \(\PageIndex{6}\)

Weighted alpha is a measure of risk-adjusted performance of stocks over a period of a year. A high positive weighted alpha signifies a stock whose price has risen while a small positive weighted alpha indicates an unchanged stock price during the time period. Weighted alpha is used to identify companies with strong upward or downward trends. The weighted alpha for the top 30 stocks of banks in the northeast and in the west as identified by Nasdaq on May 24, 2013 are listed in Table and Table, respectively.

Is there a difference in the weighted alpha of the top 30 stocks of banks in the northeast and in the west? Test at a 5% significance level. Answer the following questions:

  • Calculate Cohen’s d and interpret it.
  • Student’s-t
  • \(H_{0}: \mu_{1} = \mu_{2}\) Null hypothesis: the means of the weighted alphas are equal.
  • \(H_{a}: \mu_{1} \neq \mu_{2}\) Alternative hypothesis : the means of the weighted alphas are not equal.
  • \(p\text{-value} = 0.8787\)
  • Do not reject the null hypothesis

Figure \(\PageIndex{4}\): distribution curve centered at zero with both tails shaded; each shaded tail represents half of the \(p\text{-value}\), 0.4394.

  • \(d = 0.040\), Very small, because 0.040 is less than Cohen’s value of 0.2 for small effect size. The size of the difference of the means of the weighted alphas for the two regions of banks is small indicating that there is not a significant difference between their trends in stocks.
  • Data from Graduating Engineer + Computer Careers. Available online at www.graduatingengineer.com
  • Data from Microsoft Bookshelf .
  • Data from the United States Senate website, available online at www.Senate.gov (accessed June 17, 2013).
  • “List of current United States Senators by Age.” Wikipedia. Available online at en.Wikipedia.org/wiki/List_of...enators_by_age (accessed June 17, 2013).
  • “Sectoring by Industry Groups.” Nasdaq. Available online at www.nasdaq.com/markets/barcha...&base=industry (accessed June 17, 2013).
  • “Strip Clubs: Where Prostitution and Trafficking Happen.” Prostitution Research and Education, 2013. Available online at www.prostitutionresearch.com/ProsViolPosttrauStress.html (accessed June 17, 2013).
  • “World Series History.” Baseball-Almanac, 2013. Available online at http://www.baseball-almanac.com/ws/wsmenu.shtml (accessed June 17, 2013).

Two population means from independent samples where the population standard deviations are not known

  • Random Variable: \(\bar{X}_{1} - \bar{X}_{2} =\) the difference of the sample means
  • Distribution: Student's \(t\)-distribution with the degrees of freedom given in the Formula Review below (variances not pooled)

Formula Review

Standard error: \[SE = \sqrt{\dfrac{(s_{1}^{2})}{n_{1}} + \dfrac{(s_{2}^{2})}{n_{2}}}\]

Test statistic ( t -score): \[t = \dfrac{(\bar{x}_{1}-\bar{x}_{2}) - (\mu_{1}-\mu_{2})}{\sqrt{\dfrac{(s_{1})^{2}}{n_{1}} + \dfrac{(s_{2})^{2}}{n_{2}}}}\]

Degrees of freedom:

\[df = \dfrac{\left(\dfrac{(s_{1})^{2}}{n_{1}} + \dfrac{(s_{2})^{2}}{n_{2}}\right)^{2}}{\left(\dfrac{1}{n_{1} - 1}\right)\left(\dfrac{(s_{1})^{2}}{n_{1}}\right)^{2} + \left(\dfrac{1}{n_{2} - 1}\right)\left(\dfrac{(s_{2})^{2}}{n_{2}}\right)^{2}}\]

  • \(s_{1}\) and \(s_{2}\) are the sample standard deviations, and \(n_{1}\) and \(n_{2}\) are the sample sizes.
  • \(\bar{x}_{1}\) and \(\bar{x}_{2}\) are the sample means.

Or use the conservative estimate: take \(df\) to be the smaller of \(n_{1}-1\) and \(n_{2}-1\).

Cohen’s \(d\) is the measure of effect size:

\[d = \dfrac{\bar{x}_{1} - \bar{x}_{2}}{s_{\text{pooled}}}\]

\[s_{\text{pooled}} = \sqrt{\dfrac{(n_{1} - 1)s^{2}_{1} + (n_{2} - 1)s^{2}_{2}}{n_{1} + n_{2} - 2}}\]



How to Perform a t-test with Unequal Sample Sizes

One question students often have in statistics is:

Is it possible to perform a t-test when the sample sizes of each group are not equal?

The short answer:

Yes, you can perform a t-test when the sample sizes are not equal. Equal sample sizes is not one of the assumptions made in a t-test.

The real issues arise when the two samples do not have equal variances, which is one of the assumptions made in a t-test.

When this occurs, it’s recommended that you use Welch’s t-test instead, which does not make the assumption of equal variances.

The following examples demonstrate how to perform t-tests with unequal sample sizes when the variances are equal and when they’re not equal.

Example 1: Unequal Sample Sizes and Equal Variances

Suppose we administer two programs designed to help students score higher on some exam.

The results are as follows:

Program 1:

  • n (sample size): 500
  • x (sample mean): 80
  • s (sample standard deviation): 5

Program 2:

  • n (sample size): 20
  • x (sample mean): 85

The following code shows how to create a boxplot in R to visualize the distribution of exam scores for each program:

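The article's original code is not reproduced here. A minimal R sketch that simulates scores consistent with the summary statistics above might look like the following; the random seed, the simulated values, and the assumption that Program 2's standard deviation is also 5 are illustrative choices, not details from the article.

```r
# Simulate exam scores consistent with the stated summary statistics (illustration only)
set.seed(1)                                # arbitrary seed (assumption)
program1 <- rnorm(500, mean = 80, sd = 5)  # Program 1: n = 500, mean 80, sd 5
program2 <- rnorm(20,  mean = 85, sd = 5)  # Program 2: n = 20, mean 85; sd = 5 is an assumption

# Side-by-side boxplots of the two programs' exam scores
boxplot(program1, program2,
        names = c("Program 1", "Program 2"),
        ylab = "Exam score")
```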

The mean exam score for Program 2 appears to be higher, but the variance of exam scores between the two programs is roughly equal. 

The following code shows how to perform an independent samples t-test along with a Welch’s t-test:
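Again as a sketch rather than the article's exact code, both tests can be run on simulated vectors with t.test(), toggling the var.equal argument. Because the data here are re-simulated under assumed parameters, the p-values will only be in the neighborhood of the values reported below, not identical to them.

```r
# Re-create the simulated scores so this block runs on its own (assumed parameters)
set.seed(1)
program1 <- rnorm(500, mean = 80, sd = 5)
program2 <- rnorm(20,  mean = 85, sd = 5)

# Independent samples t-test (assumes equal variances)
t.test(program1, program2, var.equal = TRUE)

# Welch's t-test (does not assume equal variances; this is R's default)
t.test(program1, program2, var.equal = FALSE)
```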

The independent samples t-test returns a p-value of .0009 and Welch’s t-test returns a p-value of .0029 .

Since the p-value of each test is less than .05, we would reject the null hypothesis in each test and conclude that there is a statistically significant difference in mean exam scores between the two programs.

Even though the sample sizes are unequal, the independent samples t-test and Welch’s t-test both return similar results since the two samples had equal variances.

Example 2: Unequal Sample Sizes and Unequal Variances

  • s (sample standard deviation) for Program 1: 25


The mean exam score for Program 2 appears to be higher, but the variance of exam scores for Program 1 is much higher than Program 2.
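A parallel sketch for this case simply gives Program 1 a much larger spread; the other parameters are assumed to match Example 1, so the resulting p-values will only be roughly comparable to those reported below.

```r
# Unequal variances: Program 1 now has a much larger spread (sd = 25, as stated)
set.seed(1)
program1 <- rnorm(500, mean = 80, sd = 25)
program2 <- rnorm(20,  mean = 85, sd = 5)    # Program 2 parameters are assumed

t.test(program1, program2, var.equal = TRUE)   # independent samples t-test
t.test(program1, program2, var.equal = FALSE)  # Welch's t-test
```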

The independent samples t-test returns a p-value of .5496  and Welch’s t-test returns a p-value of .0361 .

The independent samples t-test is not able to detect a difference in mean exam scores, but the Welch’s t-test is able to detect a statistically significant difference.

Since the two samples had unequal variances, only Welch's t-test was able to detect the statistically significant difference in mean exam scores, because this test does not make the assumption of equal variances between samples.

Additional Resources

The following tutorials provide additional information about t-tests:

  • Introduction to the One Sample t-test
  • Introduction to the Two Sample t-test
  • Introduction to the Paired Samples t-test


Published by Zach



Original research article: Validity and reliability of the 2-min step test in individuals with stroke and lower-limb musculoskeletal disorders


  • 1 Department of Physical Therapy, Faculty of Rehabilitation Sciences, Nagoya Gakuin University, Aichi, Japan
  • 2 Department of Physical Therapy, Faculty of Nursing and Rehabilitation, Konan Women’s University, Hyogo, Japan
  • 3 Department of Rehabilitation, Senri-Chuo Hospital, Osaka, Japan
  • 4 Department of Physical Therapy, Faculty of Health and Medical Care, Saitama Medical University, Saitama, Japan
  • 5 Department of Rehabilitation, Nishiyamato Rehabilitation Hospital, Nara, Japan

Introduction: We investigated the reliability and validity of the 2-min step test (2MST) for assessing the exercise endurance of individuals with stroke and lower-limb musculoskeletal disorders.

Participants and methods: The participants were 39 individuals with stroke and 42 with lower-limb musculoskeletal disorders (mainly hip fractures) from the convalescent rehabilitation wards of four hospitals. The concurrent validity and congruence between the 2MST and the 6-min walk test (6MWT), as well as construct validity by hypotheses testing (including mobility and lower-limb muscle strength), were evaluated. A subset of participants (stroke-group, n = 15; musculoskeletal-group, n = 19) underwent a retest 2MST for our evaluation of relative and absolute reliability using the intraclass correlation coefficient (ICC 1,1 ) and Bland–Altman plot.

Results: Both groups showed a moderate correlation between the 2MST and 6MWT ( ρ  = 0.55–0.60), but the congruence was not sufficient. The 6MWT was correlated with mobility in both groups and with muscle strength in the stroke group, whereas the 2MST did not show a significant correlation with mobility. The relative reliability was excellent in both groups (ICC 1,1  > 0.9). In terms of absolute reliability, the width of the limit of agreement was 18.8% for the stroke group and 15.4% for the musculoskeletal group, relative to their respective sample means of 2MST. A fixed bias was identified in the stroke group, in which step counts increased by 6.5 steps upon retesting.

Discussion: Our analyses revealed that the 2MST is a valid and reliable tool for assessing the exercise endurance of individuals with stroke or lower-limb musculoskeletal disorders. However, it is necessary to validate the absolute reliability observed herein by using a larger sample size. In addition, when assessing the exercise endurance of individuals with stroke, it may be necessary to consider the potential bias of an increased step count during retesting.

1 Introduction

Exercise capacity is a defining factor of physical fitness and a crucial determinant of successful aging ( 1 ). Numerous studies have shown a dose-response relationship between increased exercise capacity and reduced morbidity and mortality in older adults ( 2 ). Consequently, the World Health Organization's physical activity guidelines recommend aerobic exercise for a variety of individuals, including adults, older adults, and those with chronic disease and disability ( 3 ). It is therefore essential to evaluate individuals' exercise capacity properly to determine the advantages of aerobic exercise. Exercise capacity is typically assessed by using the testee's maximal oxygen uptake, which can be measured using direct or indirect methods. The direct method involves analyzing exhaled gas during exercise on a treadmill or bicycle ergometer, which requires specialized equipment, space, and trained professionals. In contrast, the indirect method estimates maximal oxygen uptake based on the amount of exercise (i.e., exercise endurance) that can be performed within a time limit. Although direct methods can accurately assess exercise capacity, indirect methods based on exercise endurance are often used in clinical settings due to their broad and simple applicability. The most common indirect method is the 6-min walk test (6MWT), which measures the distance a subject can walk in a 6-min period ( 4 ). The 6MWT is a standard clinical assessment recommended in practice guidelines or evidence reviews for evaluating exercise endurance. This test is applicable to conditions such as stroke ( 5 ) and lower-limb musculoskeletal disorders (LMSD), including hip fractures ( 6 ), knee or hip osteoarthritis, and total knee or hip arthroplasty ( 7 ), which may affect the activities of daily living of older adults. However, due to the requirement of a long walkway, it may not be feasible to perform the 6MWT in clinics, homes, or other clinical settings with limited space. In Japan, where the population is the most aged among major industrialized countries ( 8 ), stroke and LMSD (including fractures, falls, and joint diseases) are the major causes of the need for long-term care ( 9 ). Against the backdrop of the aging population in Japan, the government is promoting a shift in medical and nursing care (including rehabilitation services) from hospitals to homes as a matter of policy ( 10 ). In other words, there is a need to establish a simple method to assess the exercise endurance of individuals with stroke or LMSD in home and community settings, which are more environmentally constrained than in hospitals. Moreover, aging is a global concern that is not unique to Japan ( 11 ), and evidence for telerehabilitation performed in the home setting has been building as a matter of global concern ( 12 , 13 ). Therefore, addressing this issue holds significance not only for Japan but worldwide.

An alternative to the 6MWT, which requires less space, is the 2-min step test (2MST) ( 14 ). The 2MST was developed as a subtest of the Senior Fitness Test and is a method for assessing exercise endurance ( 14 – 16 ). In the 2MST, the subject assumes a standing position and performs as many marching movements as possible for 2 min on the spot. Performance on the 2MST is defined by measuring the number of unilateral (usually right-sided) steps taken in the standing position to a height midway between the patella and iliac crest, with a higher number indicating greater exercise endurance. The 2MST was originally designed for older adults, but recent studies have shown its validity as an exercise endurance assessment tool in various populations, including older adults ( 14 , 17 ) and those with cardiovascular diseases ( 18 – 20 ), Parkinson's disease ( 21 ), symptomatic peripheral artery disease ( 22 ), type 2 diabetes ( 23 ), hypertension ( 24 ), and morbid obesity ( 25 ). The inter-and intra-rater reliabilities of the 2MST have been reported in various populations, including older adults ( 14 , 16 ), young to middle-aged adults ( 26 ), and individuals with cardiovascular diseases ( 20 ), symptomatic peripheral arterial diseases ( 22 ), chronic low back pain ( 27 ), and knee osteoarthritis ( 28 ).

However, the reliability and validity of the 2MST in individuals with stroke and LMSD, including hip fracture and knee or hip arthroplasty, have not been adequately investigated. We conducted the present study to investigate the reliability and validity of the 2MST as an assessment of the exercise endurance of individuals with stroke and LMSD.

2 Participants and methods

2.1 Study design, ethics, and reporting guideline

This study was a multicenter, cross-sectional survey. The study was approved by the Medical Ethics Committee of Nagoya Gakuin University (approval no. 2020-28). The study complied with the Declaration of Helsinki, and all participants provided written informed consent. This study evaluated the measurement properties of the 2MST according to the taxonomy developed by the Consensus-based Standards for the selection of health Measurement INstruments (COSMIN) initiative and reported them in accordance with the reporting guidelines developed by COSMIN ( 29 , 30 ).

2.2 Study setting and participants

This study was conducted in the convalescent rehabilitation wards of four hospitals in Japan from August 2021 to August 2022, where volunteers were recruited to participate. The study targeted individuals with first-time stroke (infarction or hemorrhage) or an LMSD (hip or femoral fracture, hip or knee osteoarthritis, or total knee arthroplasty). The inclusion criteria were: ( i ) age ≥45 years, ( ii ) ≥60 days post-stroke onset and ≥45 days post-onset of injury or hospitalization due to an LMSD, ( iii ) overall stable health condition with no exercise restrictions imposed by the attending physician related to the expected exercise load in this study, and ( iv ) ability to walk with supervision using a walking aid or lower-limb orthosis. The exclusion criteria were: ( i ) comorbidity requiring the management of cardiac or respiratory illnesses; ( ii ) the presence of acute pain; and ( iii ) cognitive impairments, consciousness disorders, or mental illnesses that would hinder participation in the study.

To calculate the sample size, we determined the concurrent validity based on the correlation coefficient between the 2MST and 6MWT. In reference to a study of individuals with heart failure reporting a correlation coefficient of 0.44 between the 2MST and the 6MWT ( 18 ), a sample size of 38 participants for each of the present groups (stroke and LMSD) was calculated, considering a significance level of 0.05 and a power of 0.80. During the planning phase of the research proposal, this study ( 18 ) was the only one that validated the correlations between 2MST and 6MWT and peak oxygen uptake among middle-aged and older individuals with diseases. Therefore, although the disease differs from that in our study, it was used as a reference value to calculate the sample size. The minimum sample size was 46 participants per group, with a 20% anticipated data loss. This calculation was performed using G*Power 3.1.9.6 [test family: exact; statistical test: correlation (bivariate normal model)] ( 31 ). Reliability data were randomly selected from the participants who provided data for validity. Reliability was determined based on an intra-rater reliability coefficient [intraclass correlation coefficient (ICC 1,1 )] of 0.7, using the test-retest method, with a significance level of 0.05, and power of 0.80. This resulted in a required sample size of 12 participants for each group. Assuming a 30% data loss, a minimum sample size of 16 participants was planned for each group. The R package (ICC.Sample.Size) was used for this calculation ( 32 ). Participant recruitment was stopped early when sufficient valid data were obtained for the calculated minimum sample size.
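As an illustration only (not the authors' actual G*Power or ICC.Sample.Size scripts), the correlation-based sample size can be approximated in R with the pwr package. pwr.r.test() uses the Fisher-z approximation rather than the exact bivariate-normal computation reported in the text, so it returns a figure close to, but not necessarily identical to, the 38 participants per group stated above.

```r
# Approximate sample size to detect rho = 0.44 with alpha = 0.05 and power = 0.80
library(pwr)
pwr.r.test(r = 0.44, sig.level = 0.05, power = 0.80)
```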

2.3 Data collection

The 2MST and 6MWT were conducted on different days for each participant, ranging from ≥1 day to <7 days apart. The examiner was given discretion to choose which test to perform first. Only randomly selected participants, chosen for the examination of reliability, underwent the 2MST again within a 7-day period. The researchers, who are licensed physiotherapists, agreed that neither of the day intervals would result in changes in the participants' conditions that could influence the test results. Data were collected by physiotherapists who were informed of the purpose, content, and methods of the study. There were no restrictions on data collection by the physiotherapists who handled the patients during their regular clinical duties. Demographic and clinical characteristics were collected on the day the 2MST or 6MWT was conducted for the first time. The participants were not blinded to their 2MST or 6MWT results.

2.3.1 Demographics and clinical characteristics

Data on age, sex, body mass index (BMI), days from onset, type of stroke or LMSD, affected side, and site(s) were collected from the participants' medical records. Comorbidities contributing to mortality were evaluated and scored using the Charlson Comorbidity Index ( 33 ); the CCI scoring used an updated version of the index rather than the original version ( 34 ). The scores range from 0 to 24 points, with higher scores indicating a greater impact of comorbid conditions and an increased risk of mortality.

2.3.2 Ambulation ability and mobility

We also evaluated the ambulation ability and mobility. Ambulation ability was assessed using the Functional Ambulation Categories (FAC) scale, which ranges from non-functional ambulator to independent ambulator, with six stages (0–5) ( 35 ). A higher FAC stage indicates greater ambulation ability. The Japanese version of the Rivermead Mobility Index (RMI) was used to assess mobility. The RMI evaluates independence in 15 aspects of mobility, including bed mobility, transfers, walking, bathing, and stairs, and is scored from 0 (poorest) to 15 (best) ( 36 , 37 ).

2.3.3 Physical functions

Pain intensity during walking was evaluated using a Numeric Rating Scale (NRS) ranging from 0 (no pain) to 10 (maximum pain) ( 38 ). Lower-limb muscle strength, specifically hip flexion and knee extension, was assessed using manual muscle testing on both sides in six stages ( 39 ). However, for the stroke participants, the muscle strength of the affected side (including hip flexion, knee extension, and ankle dorsiflexion) was assessed using the Motricity Index ( 40 ), which comprises six stages for each muscle; the total scores were calculated on a scale ranging from 1 (poorest) to 100 (best) ( 41 ).

2.3.4 Exercise endurance

The 6MWT was conducted in accord with the guidelines of the American Thoracic Society, using a 30-meter walkway ( 42 ). The participant was instructed to walk as far as possible within 6 min, with breaks allowed as needed during the test. When taking a break, the participants were encouraged to resume walking as quickly as possible. The total distance walked was recorded.

The 2MST was conducted in accord with the procedures of the Senior Fitness Test ( 14 ), and the participant was instructed to march as many steps as possible for 2 min on the spot. To set the elevation height of the lower limbs, the midpoint between the patella and the anterior superior iliac spine of each participant was identified. If a participant had difficulty raising the affected limb or the more severely affected side to the standard height, they were instructed to raise it to the best of their ability. The number of steps taken over a period of 2 min was then measured based on the non-affected or less affected limb. We excluded individuals who were unable to raise the limb to the set height due to physical limitations. To ensure the safety of participants with balance disorders and to maintain uniform testing conditions, all tests were conducted with the participants holding onto a handrail with one hand. The original manual for the 2MST also mentions the option of allowing the use of a handrail ( 14 ).

For both the 6MWT and 2MST, the % Heart Rate Reserve (%HRR) was calculated by measuring the participant's heart rate before and after completing the exercise tasks. The modified Borg scale (0–10) was used to assess the rate of perceived exertion (RPE) following the exercise tasks ( 43 ).

2.4 Statistical analyses

Statistical analyses were conducted separately for the stroke and LMSD groups. To understand the characteristics of the sample, we calculated descriptive statistics based on the scale properties of each variable and data distribution. Normality was examined using histograms, Q–Q plots, and the Shapiro–Wilk test. Means and standard deviations were used to describe interval scale variables that were confirmed to be normal, whereas medians and first and third quartiles were used for those that were not confirmed to be normal. Nominal scale variables are presented as frequencies and percentages.

We investigated the validity and reliability of the 2MST as a measure of exercise endurance. For validity, both the concurrent validity and agreement between the 2MST and the 6MWT as well as construct validity by hypotheses testing were evaluated. Reliability was assessed by an examination of the intra-rater reliability using the test-retest method, focusing on both relative and absolute reliability. The statistical analyses were performed with R 4.3.1 (CRAN) using the Shrout method for the ICC ( 44 ) and the Stratford method for the standard error of measurement (SEM) ( 45 ), with a significance level of 5%.

2.4.1 Validity

We used Spearman's rank correlation coefficient to examine the concurrent validity of the 2MST and 6MWT. A non-parametric method was employed for this purpose to align with the methodology used in the subsequent analyses of construct validity. In addition, to quantitatively assess the congruence between the 2MST and 6MWT, a simple regression analysis was performed to predict the 6MWT results from the 2MST results, and a 95% prediction interval at the mean value of the 2MST was determined ( 46 ).

To verify and compare the construct validity of the 2MST and 6MWT, we examined their relationships with mobility (RMI), pain intensity (NRS), and the strength of the affected lower limb (MMT and Motricity Index) using Spearman's rank correlation coefficient for each group. The construct validity hypothesis was as follows: the 2MST, an on-the-spot marching exercise performed while holding a handrail and counting movements of the unaffected side, was hypothesized to have little or no correlation with mobility, pain intensity, or the strength of the affected lower limb. In contrast, the 6MWT, which involves walking, was expected to have a higher correlation with mobility, the strength of the affected lower limb, and pain during walking. In other words, the 2MST was hypothesized to be less influenced by walking or walking-related physical functions in its assessment of exercise endurance. Because we conducted multiple correlation analyses for both concurrent and construct validity, the probability ( p )-values were adjusted using the Holm method to account for the risk of alpha error. The interpretation of the correlation coefficient was defined as follows: 0.0 to ±0.1 as negligible, ±0.1 to ±0.39 as weak, ±0.4 to ±0.69 as moderate, ±0.7 to ±0.89 as strong, and ±0.9 to ±1.0 as very strong ( 47 ).
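The validity analyses described in this subsection can be sketched in R as follows. The data frame and its column names (dat, two_mst, six_mwt, rmi, pain_nrs, motricity) are hypothetical placeholders, and the simulated values exist only so the sketch runs; neither reflects the authors' actual variable names, scripts, or data.

```r
# Fake illustrative data so the sketch is runnable (not the study data)
set.seed(42)
dat <- data.frame(two_mst = rpois(40, 90))
dat$six_mwt   <- 3 * dat$two_mst + rnorm(40, sd = 40)
dat$rmi       <- sample(5:15, 40, replace = TRUE)
dat$pain_nrs  <- sample(0:3, 40, replace = TRUE)
dat$motricity <- sample(50:100, 40, replace = TRUE)

# Concurrent validity: Spearman correlation between the 2MST and the 6MWT
cor.test(dat$two_mst, dat$six_mwt, method = "spearman")

# Construct validity: Spearman correlations of the 2MST with mobility, pain, and
# strength, with Holm adjustment for the multiple correlation tests
construct_vars <- c("rmi", "pain_nrs", "motricity")
p_raw <- sapply(construct_vars, function(v)
  cor.test(dat$two_mst, dat[[v]], method = "spearman")$p.value)
p.adjust(p_raw, method = "holm")

# Congruence: simple regression of 6MWT on 2MST and the 95% prediction interval
# at the mean value of the 2MST
fit <- lm(six_mwt ~ two_mst, data = dat)
predict(fit, newdata = data.frame(two_mst = mean(dat$two_mst)),
        interval = "prediction", level = 0.95)
```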

2.4.2 Reliability

To evaluate the relative reliability of the 2MST, we used the ICC 1,1 to analyze the correlation coefficient between the initial test and retest, and the SEM was also determined. The interpretation of ICC was defined as follows: <0.5 as poor, 0.5–0.75 as moderate, 0.75–0.9 as good, and >0.90 as excellent reliability ( 48 ).

As a secondary outcome of reliability, we examined absolute reliability, with the aim of providing reference values for future research. The systematic error between the initial test and retest in the 2MST was assessed using Bland–Altman plots ( 49 ). Following the reporting framework ( 50 ) recommended in a recent review ( 51 ), our analysis confirmed the normality of the mean and the difference between two values (initial and retest) using Q–Q plots and the Shapiro–Wilk test. Subsequently, we calculated the mean of the differences with their 95% confidence intervals (CIs), as well as the limits of agreement (LoA) and their upper and lower 95% CIs. Fixed bias was examined using the mean of the difference, 95% CIs, and a one-sample t -test, whereas proportional bias was assessed based on the significance of the Pearson's product-moment correlation coefficient. These analyses based on Bland–Altman plots were performed using the web tool provided by Olofsen et al. ( 52 ), with a detailed methodology described in their paper ( 53 ). The LoA for the 2MST performed by individuals with stroke or LMSD has not yet been reported; therefore, as an alternative, in the present study we assumed that the LoA for the 6MWT of the participants with stroke or hip fracture (ranges corresponding to ±35% and ±18% of the sample mean, respectively) were within acceptable ranges ( 54 , 55 ).
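A sketch of these reliability computations is shown below: ICC(1,1) obtained from a one-way ANOVA (following the Shrout and Fleiss formulation) and the basic Bland–Altman quantities. The test1 and test2 vectors are simulated stand-ins for first-test and retest step counts, not the study data, and dedicated packages such as irr or psych could be used instead of the manual ICC calculation.

```r
# Simulated test-retest step counts (illustration only, not the study data)
set.seed(7)
test1 <- round(rnorm(15, mean = 95, sd = 15))
test2 <- test1 + round(rnorm(15, mean = 5, sd = 8))

# ICC(1,1) from a one-way random-effects ANOVA (k = 2 measurements per subject)
scores  <- c(test1, test2)
subject <- factor(rep(seq_along(test1), times = 2))
aov_tab <- anova(lm(scores ~ subject))
msb <- aov_tab["subject", "Mean Sq"]      # between-subject mean square
msw <- aov_tab["Residuals", "Mean Sq"]    # within-subject mean square
k <- 2
icc_1_1 <- (msb - msw) / (msb + (k - 1) * msw)

# Bland-Altman quantities: fixed bias, limits of agreement, proportional bias
diffs <- test2 - test1
means <- (test1 + test2) / 2
bias  <- mean(diffs)                          # mean difference (fixed bias)
loa   <- bias + c(-1.96, 1.96) * sd(diffs)    # limits of agreement
t.test(diffs)                                 # one-sample t-test for fixed bias
cor.test(means, diffs)                        # Pearson test for proportional bias

c(ICC_1_1 = icc_1_1, bias = bias, LoA_lower = loa[1], LoA_upper = loa[2])
```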

3 Results

3.1 Characteristics of the participants

In the stroke group, 43 individuals who met the criteria participated in the study, but four were excluded due to an improper administration of either the 6MWT or 2MST. The final sample consisted of 39 individuals for the validity analysis and 15 for the reliability analysis. In the LMSD group, 42 individuals who met the criteria participated. Individuals who had undergone a total hip arthroplasty did not participate in this study. One participant was excluded due to an improper administration of 2MST. The final sample consisted of 42 individuals for validity and 19 individuals for reliability.

The descriptive statistics for each dataset in both groups are presented in Tables 1 , 2 . The stroke group consisted of middle-aged to older adults, with a slightly higher number of males suffering from cerebral infarction. At least 70% of the participants in the stroke group were able to walk independently within the hospital (FAC ≥ 4). The LMSD group mostly included older females with hip fractures who were almost (≥95%) independently ambulatory within the hospital (FAC ≥ 4). In both the stroke and LMSD groups, for the validity and reliability datasets, the walking distance in the 6MWT was approx. 320 m. In contrast, in the 2MST, the average number of steps for the stroke group in the validity data was 78, compared with 91 in the LMSD group, which was slightly higher. However, in the reliability dataset, both groups exhibited an average of 94–100 steps.


Table 1 . Characteristics of the participants in the stroke group.


Table 2 . Characteristics of the participants in the lower-limb musculoskeletal disorders (LMSD) group.

3.2 Validity and congruence

Regarding concurrent validity, significant moderate correlations between the 2MST and the 6MWT were observed in both groups (stroke ρ = 0.55, p < 0.01; LMSD ρ = 0.60, p < 0.01) ( Figures 1A,B and Table 3 ). Table 4 presents the results of the simple regression analysis estimating 6MWT from 2MST for each group. For the congruence between the 2MST and 6MWT, in the stroke group, the 95% prediction interval for the mean value of the 2MST was between a lower bound of 177.1 m and an upper bound of 462.9 m, with a range of 285.7 m ( Figure 2A ). In the LMSD group, the 95% prediction interval for the mean value of the 2MST was between a lower limit of 128.1 m and an upper limit of 499.9 m, with a range of 371.8 m ( Figure 2B ). The range of the predictive interval was wide in both groups, ranging from ±45% to 58% of mean 6MWT. Regarding construct validity, the 6MWT showed a significant moderate correlation with the RMI (mobility) result in both groups (stroke ρ = 0.51, p < 0.01; LMSD, ρ = 0.43, p < 0.01). In the stroke group, a significant moderate correlation was observed between the 6MWT and the Motricity Index (the strength of the affected lower limb) ( ρ = 0.67, p < 0.01). However, the 2MST did not show any significant correlation with these variables ( Table 3 ). It should be noted that because of the very low incidence of pain in the stroke group, pain was not included in the analysis for this group.


Figure 1 . Scatterplots of the participants' results on the 2-min step test (2MST) and 6-min walk test (6MWT) in ( A ) the group with stroke and ( B ) the group with lower-limb musculoskeletal disorders.


Table 3 . The 2MST and the 6MWT and correlations between each variable.


Table 4 . Results of single regression analysis to estimate 6MWT from 2MST.


Figure 2 . ( A ) stroke group: 95% prediction interval = 371.8 m (128.1–499.9), ( B ) lower-limb musculoskeletal disorders group: 95% prediction interval = 285.7 m (177.1–462.9).

3.3 Reliability

Both groups demonstrated excellent results with ICC 1,1 above 0.9 (stroke: 0.93, LMSD: 0.97). The SEM was 6.4 (95% CI: 4.7–10.1) in the stroke group and 5.3 (95% CI: −5.1–2.2) in the LMSD group ( Table 5 ). From the Bland–Altman plots, a significant fixed error of approx. 6.5 steps increase on the retest was observed in the stroke group, although proportional errors were not significant, with the LoA ranging from −24.2 to 11.3 ( Figure 3A and Table 5 ). The LoA for the stroke group in the 2MST was within an error width of ±19%, relative to the sample mean of 94.4 steps. No systematic error was observed in the LMSD group, and the LoA ranged from −16.2 to 13.3 ( Figure 3B and Table 5 ). The LoA for the LMSD group in the 2MST was within an error width of ±15%, relative to the sample mean of 95.7 steps. The 95% CIs for the upper and lower bounds in the LoA for both groups were wide, and the estimates of the population parameters were not stable ( Table 5 ).


Table 5 . Results of relative and absolute reliability.


Figure 3 . Bland–Altman plots of the test–retest 2MST. Solid line : the mean of the difference, dotted line : range of limit of agreement, chain line : 95% CI of the lower and upper limits of agreement. ( A ) stroke group, ( B ) lower–limb musculoskeletal disorders group.

4 Discussion

This study aimed to assess the validity and reliability of the 2MST as a tool for measuring the exercise endurance of individuals with stroke or an LMSD. The results indicated a moderate correlation between the 2MST and 6MWT in both groups, but the degree of congruence was insufficient. Although mobility and the 6MWT were correlated in both groups, no correlations with the 2MST were observed. The ICC for the 2MST was excellent in both groups, but only the stroke group exhibited a fixed bias of increased step count at retest. Based on these results, we assert that the 2MST is a valid and reliable tool for assessing the exercise endurance of individuals with stroke or LMSD. However, it is important to consider the potential for increased step count bias during retesting when assessing the exercise endurance in individuals with stroke.

The concurrent validity of the 2MST and the 6MWT has already been confirmed in other diseases and populations ( 17 – 24 ), and our present findings extend the applicability of the 2MST as an assessment of exercise endurance. These results were obtained presumably because the participants were at least able to walk under supervision (FAC ≥ 3) and met the minimum unilateral lower-limb muscle strength required to perform the 2MST (capable of anti-gravity movements). Although the disease differs, it is known that some individuals with Parkinson's disease are unable to complete 2 min of marching ( 56 ), while those with mild walking disorders classified as Hoehn and Yahr stages I and II showed a correlation between the 2MST and 6MWT, and no correlation was observed in those with more severe walking disorders classified as stages III and IV ( 21 ).

These findings suggest that the present participants were appropriate for examining the concurrent validity of the 2MST and the 6MWT. However, our analyses revealed that the predicted ranges of the 6MWT estimated from the 2MST were wide, with 371.8 meters (±45%) for the stroke group and 285.7 (±58%) meters for the LMSD group, indicating insufficient congruence between the 2MST and 6MWT. This difference corresponds to the variations in the construct validity between the 2MST and 6MWT, which will be discussed later. In summary, although both the 2MST and 6MWT measure exercise endurance, they are performance tests that reflect different physical functions; therefore, the congruence between the 2MST and 6MWT is considered insufficient.

Interestingly, we observed that the 6MWT was associated with mobility and affected-limb muscle strength in the stroke group as well as mobility in the individuals with LMSD. The 2MST did not demonstrate a significant relationship. The 6MWT involves walking and is thus influenced by walking ability and other contributing factors, such as the muscle strength of the affected limb. In other words, it reflects not only exercise endurance but also walking ability and walking-related physical function. However, as the 2MST was not associated with mobility or muscle strength of the affected limb in this study, this result can be interpreted as an assessment focused on exercise endurance, independent of walking ability.

Previous research demonstrated that the 2MST is associated with the modified Rankin Scale, walking speed, and muscle strength in individuals with stroke ( 57 ). In individuals with knee osteoarthritis, pain intensity and physical function are associated with the 2MST ( 28 ). These findings are not in agreement with our present results, and there are several possible explanations for this discrepancy. In our study, the 2MST was administered in a stable environment with handrails, making it less likely that variations in mobility, physical function, and pain associated with stepping would affect the participants' test performance. We also observed that the pain intensity during walking was almost nonexistent in the individuals with stroke (NRS median, 0) and minimal in those with LMSD (NRS median, 1). An earlier investigation of individuals with knee osteoarthritis found pain to be more severe (NRS mean 8.12) ( 28 ). We thus propose that the 2MST performed with handrails is a method that easily cancels the influence of physical functions related to mobility and pain. Based on these considerations, we argue that the 6MWT and the 2MST should be selectively used depending on the situation and purpose. Given that the 6MWT is an established instrument with substantial evidence available, it should be prioritized when possible. However, when environmental constraints or other factors make conducting the 6MWT challenging, the use of the 2MST is justified. Furthermore, the 6MWT is appropriate for evaluating exercise endurance, including walking ability, whereas the 2MST is more suitable for evaluating exercise endurance with reduced influence from walking ability.

The relative reliability was excellent in both the present stroke and LMSD groups, comparable to or even better than that reported in previous studies (ICC: 0.83–0.945) that documented intra-rater reliability ( 14 , 16 , 20 , 22 , 26 , 27 ). A notable point is that the range of the LoA in absolute reliability (±18% for stroke and ±15% for LMSD) was smaller than the values set alternatively by the 6MWT, i.e., ±35% for stroke ( 54 ) and ±18% for LMSD ( 55 ). Moreover, although there is limited evidence, recent investigations of the absolute reliability of the 2MST reported the range of LoA to be approx. ±32% for individuals with symptomatic peripheral artery disease ( 22 ) and approx. ±30% for individuals post-coronary revascularization ( 20 ). The LoA in our present study was superior for a performance test of exercise endurance. The high reliability of the assessment may be due to the well-trained physiotherapists and to the fact that the 2MST was conducted in a stable environment using handrails. It is also possible that allowing assessment by physiotherapists familiar with the patients' conditions contributed to the high reliability. However, this approach may introduce examiner bias, and caution should be exercised in this regard. We also detected a fixed bias with an increase of 6.5 steps (∼8%) during the retest for the individuals with stroke, which could be interpreted as a learning effect. Because the participants in this study were not blinded to their 2MST results, a learning effect is more likely to be induced at retest. It is known that for older adults, the number of steps in the 2MST significantly increases in the third test compared with the first ( 16 ). Other studies of the absolute reliability of the 2MST described no systematic error in individuals with symptomatic peripheral artery disease ( 22 ). However, there was an increase of 7.5–7.7 steps (∼11%) on retest for individuals post-coronary revascularization ( 20 ). Similarly, learning effects upon retesting have been suggested in the 6MWT in individuals with stroke and hip fracture ( 54 , 55 ). It remains unclear which participant characteristics are more likely to produce learning effects, but at least for individuals with stroke undergoing the 2MST, a careful interpretation of results considering fixed bias is warranted. Note that, according to the risk-of-bias assessment tool for reliability and measurement error developed by COSMIN, not blinding the examiner and the participants to the test results introduces a risk of bias ( 58 ).

This study has several limitations. We did not examine the concurrent validity of exercise capacity by investigating its relationship with maximal or peak oxygen uptake. The %HRR in both tests was between 15% and 20%, indicating a low exercise load. In individuals with heart failure and morbid obesity, the concurrent validity between the peak oxygen uptake and the 2MST has been reported ( 18 , 25 ). To examine the validity of the 2MST as a more rigorous assessment of exercise capacity based on exercise endurance, future studies including exhaled gas analyses are needed. Additionally, the assessments that we used for structural validity were mostly simple ones, and a replication study using more sensitive interval scales (such as walking speed or handheld dynamometry) is needed. Moreover, the absolute reliability remains a preliminary result due to the small sample size, and the 95% CI for the LoA was large. Although the sample size for Bland–Altman analyses remains a topic of debate ( 51 ), sample sizes of 100 or 200 are traditionally considered necessary to reflect population characteristics ( 46 ).

Lastly, a sensitivity analysis was not performed in this study. Constructing subgroups from a larger sample size and performing a sensitivity analysis would be desirable to examine the consistency of our results and provide more clinically interpretable and concrete findings. Despite these limitations, a strength of this study is that it provides externally valid results from multicenter collaborative data. This is the first study to examine the validity and reliability of the 2MST as an assessment of exercise endurance in individuals with stroke or LMSD, offering evidence to promote the clinical application of this convenient test. Systematic reviews of the 2MST have indicated a lack of evidence of reliability, particularly absolute reliability ( 59 ), and our present study provides valuable foundational knowledge for future research.

5 Conclusions

Our research findings demonstrated that the 2MST is a valid and reliable method for assessing the exercise endurance of individuals with stroke or an LMSD. It is important to validate absolute reliability using a larger sample size, and when testing individuals with stroke, it may be necessary to consider the potential bias of increased step counts during retesting.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics statement

The studies involving humans were approved by Medical Ethics Committee of Nagoya Gakuin University. The studies were conducted in accordance with the local legislation and institutional requirements. The participants provided their written informed consent to participate in this study.

Author contributions

TI: Writing – review & editing, Writing – original draft, Visualization, Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Funding acquisition, Formal Analysis, Data curation, Conceptualization. HK: Writing – review & editing, Writing – original draft, Resources, Methodology, Investigation, Data curation, Conceptualization. KY: Writing – review & editing, Writing – original draft, Resources, Methodology, Investigation, Data curation, Conceptualization. NS: Writing – review & editing, Writing – original draft, Resources, Methodology, Investigation, Data curation, Conceptualization. TO: Writing – review & editing, Writing – original draft, Supervision, Resources, Methodology, Investigation, Data curation, Conceptualization.

Funding

The authors declare financial support was received for the research, authorship, and/or publication of this article.

This study was conducted with the support of the Nagoya Gakuin University Research Grant for the fiscal years 2021–2022.

Acknowledgments

We express our deepest gratitude to the staff of the collaborating hospitals (Itami Kousei Neurosurgical Hospital, Senri-chuo Hospital, Hatsudai Rehabilitation Hospital, and Nishiyamato Rehabilitation Hospital) and for the support from the Nagoya Gakuin University Research Grant (2021–2022).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

1. Burtscher M. Exercise limitations by the oxygen delivery and utilization systems in aging and disease: coordinated adaptation and deadaptation of the lung-heart muscle axis—a mini-review. Gerontology . (2013) 59:289–96. doi: 10.1159/000343990

2. Kokkinos P, Sheriff H, Kheirbek R. Physical inactivity and mortality risk. Cardiol Res Pract . (2011) 2011:924945. doi: 10.4061/2011/924945

3. Bull FC, Al-Ansari SS, Biddle S, Borodulin K, Buman MP, Cardon G, et al. World health organization 2020 guidelines on physical activity and sedentary behaviour. Br J Sports Med . (2020) 54:1451–62. doi: 10.1136/bjsports-2020-102955

4. Butland RJ, Pang J, Gross ER, Woodcock AA, Geddes DM. Two-, six-, and 12-minute walking tests in respiratory disease. Br Med J . (1982) 284:1607–8. doi: 10.1136/bmj.284.6329.1607

5. Sullivan JE, Crowner BE, Kluding PM, Nichols D, Rose DK, Yoshida R, et al. Outcome measures for individuals with stroke: process and recommendations from the American physical therapy association neurology section task force. Phys Ther . (2013) 93:1383–96. doi: 10.2522/ptj.20120492

6. McDonough CM, Harris-Hayes M, Kristensen MT, Overgaard JA, Herring TB, Kenny AM, et al. Physical therapy management of older adults with hip fracture. J Orthop Sports Phys Ther . (2021) 51:CPG1–81. doi: 10.2519/jospt.2021.0301

7. Coleman G, Dobson F, Hinman RS, Bennell K, White DK. Measures of physical performance. Arthritis Care Res . (2020) 72(Suppl 10):452–85. doi: 10.1002/acr.24373

8. Cabinet Office, Government of Japan. Annual Report on Aging Society 2023. Cabinet Office, Government of Japan . Available online at: https://www8.cao.go.jp/kourei/whitepaper/w-2023/html/zenbun/s1_1_2.html (accessed December 1, 2023).

9. Cabinet Office, Government of Japan. Annual Report on Aging Society 2022. Cabinet Office, Government of Japan . Available online at: https://www8.cao.go.jp/kourei/whitepaper/w-2022/html/zenbun/s1_2_2.html (accessed December 1, 2023).

10. Yamada M, Arai H. Long-term care system in Japan. Ann Geriatr Med Res . (2020) 24:174–80. doi: 10.4235/agmr.20.0037

11. Population Division, United Nations. World population prospects 2022. World Population Prospects 2022. Available online at: https://population.un.org/wpp/ (accessed December 1, 2023).

12. Seron P, Oliveros M-J, Gutierrez-Arias R, Fuentes-Aspe R, Torres-Castro RC, Merino-Osorio C, et al. Effectiveness of telerehabilitation in physical therapy: a rapid overview. Phys Ther . (2021) 101:1–18. doi: 10.1093/ptj/pzab053

13. Zheng J, Hou M, Liu L, Wang X. Knowledge structure and emerging trends of telerehabilitation in recent 20 years: a bibliometric analysis via CiteSpace. Front Public Health . (2022) 10:904855. doi: 10.3389/fpubh.2022.904855

14. Rikli RE, Jessie Jones C. Development and validation of a functional fitness test for community-residing older adults. J Aging Phys Act . (1999) 7:129–61. doi: 10.1123/japa.7.2.129

15. Rikli RE, Jessie Jones C. Functional fitness normative scores for community-residing older adults, ages 60–94. J Aging Phys Act . (1999) 7:162–81. doi: 10.1123/japa.7.2.162

16. Miotto JM, Chodzko-Zajko WJ, Reich JL, Supler MM. Reliability and validity of the fullerton functional fitness test: an independent replication study. J Aging Phys Act . (1999) 7:339–53. doi: 10.1123/japa.7.4.339

17. Berlanga LA, Matos-Duarte M, Abdalla P, Alves E, Mota J, Bohn L. Validity of the two-minute step test for healthy older adults. Geriatr Nurs . (2023) 51:415–21. doi: 10.1016/j.gerinurse.2023.04.009

18. Węgrzynowska-Teodorczyk K, Mozdzanowska D, Josiak K, Siennicka A, Nowakowska K, Banasiak W, et al. Could the two-minute step test be an alternative to the six-minute walk test for patients with systolic heart failure? Eur J Prev Cardiol . (2016) 23:1307–13. doi: 10.1177/2047487315625235

19. Oliveros MJ, Seron P, Román C, Gálvez M, Navarro R, Latin G, et al. Two-minute step test as a complement to six-minute walk test in subjects with treated coronary artery disease. Front Cardiovasc Med . (2022) 9:848589. doi: 10.3389/fcvm.2022.848589

20. Chow JJL, Fitzgerald C, Rand S. The 2 min step test: a reliable and valid measure of functional capacity in older adults post coronary revascularisation. Physiother Res Int . (2023) 28:e1984. doi: 10.1002/pri.1984

21. Mollinedo-Cardalda I, Cancela-Carral JM. The 2-minute step test: its applicability in the evaluation of balance in patients diagnosed with Parkinson’s disease. Top Geriatr Rehabil . (2022) 38:42–8. doi: 10.1097/TGR.0000000000000341

22. Braghieri HA, Kanegusuku H, Corso SD, Cucato GG, Monteiro F, Wolosker N, et al. Validity and reliability of 2-min step test in patients with symptomatic peripheral artery disease. J Vasc Nurs . (2021) 39:33–8. doi: 10.1016/j.jvn.2021.02.004

23. Srithawong A, Poncumhak P, Manoy P, Kumfu S, Promsrisuk T, Prasertsri P, et al. The optimal cutoff score of the 2-min step test and its association with physical fitness in type 2 diabetes mellitus. J Exerc Rehabil . (2022) 18:214–21. doi: 10.12965/jer.2244232.116

24. Pedrosa R, Holanda G. Correlation between the walk, 2-minute step and TUG tests among hypertensive older women. Braz J Phys Ther . (2009) 13:252–6. doi: 10.1590/S1413-35552009005000030

25. Ricci PA, Cabiddu R, Jürgensen SP, André LD, Oliveira CR, Di Thommazo-Luporini L, et al. Validation of the two-minute step test in obese with comorbidities and morbidly obese patients. Braz J Med Biol Res . (2019) 52:e8402. doi: 10.1590/1414-431X20198402

26. Nogueira MA, Almeida TDN, Andrade GS, Ribeiro AS, Rêgo AS, Dias RdS, et al. Reliability and accuracy of 2-minute step test in active and sedentary lean adults. J Manipulative Physiol Ther . (2021) 44:120–7. doi: 10.1016/j.jmpt.2020.07.013

27. de Jesus SFC, Bassi-Dibai D, Pontes-Silva A, da Silva de Araujo A, de Freitas Faria Silva S, Veneroso CE, et al. Construct validity and reliability of the 2-minute step test (2MST) in individuals with low back pain. BMC Musculoskelet Disord . (2022) 23:1062. doi: 10.1186/s12891-022-06050-w

28. de Morais Almeida TF, Dibai-Filho AV, de Freitas Thomaz F, Lima EAA, Cabido CET. Construct validity and reliability of the 2-minute step test in patients with knee osteoarthritis. BMC Musculoskelet Disord . (2022) 23:159. doi: 10.1186/s12891-022-05114-1

29. Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, et al. The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. J Clin Epidemiol . (2010) 63:737–45. doi: 10.1016/j.jclinepi.2010.02.006

30. Gagnier JJ, Lai J, Mokkink LB, Terwee CB. COSMIN reporting guideline for studies on measurement properties of patient-reported outcome measures. Qual Life Res . (2021) 30:2197–218. doi: 10.1007/s11136-021-02822-4

31. Faul F, Erdfelder E, Buchner A, Lang A-G. Statistical power analyses using G*power 3.1: tests for correlation and regression analyses. Behav Res Methods . (2009) 41:1149–60. doi: 10.3758/BRM.41.4.1149

32. Zou GY. Sample size formulas for estimating intraclass correlation coefficients with precision and assurance. Stat Med . (2012) 31:3972–81. doi: 10.1002/sim.5466

33. Charlson ME, Pompei P, Ales KL, MacKenzie CR. A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J Chronic Dis . (1987) 40:373–83. doi: 10.1016/0021-9681(87)90171-8

34. Quan H, Li B, Couris CM, Fushimi K, Graham P, Hider P, et al. Updating and validating the Charlson comorbidity index and score for risk adjustment in hospital discharge abstracts using data from 6 countries. Am J Epidemiol . (2011) 173:676–82. doi: 10.1093/aje/kwq433

35. Holden MK, Gill KM, Magliozzi MR, Nathan J, Piehl-Baker L. Clinical gait assessment in the neurologically impaired. Reliability and meaningfulness. Phys Ther . (1984) 64:35–40. doi: 10.1093/ptj/64.1.35

36. Collen FM, Wade DT, Robb GF, Bradshaw CM. The rivermead mobility index: a further development of the rivermead motor assessment. Int Disabil Stud . (1991) 13:50–4. doi: 10.3109/03790799109166684

37. Maeshima S, Yuzuki O, Kobayashi T, Koyama A, Moriyasu M, Osawa A. [Reliability and validity of the Japanese version of rivermead mobility index] rivermead mobility index nihongoban no sakusei to sono shiyou nituite (in Japanese). Sogo Rehabil . (2005) 33:875–9. doi: 10.11477/mf.1552100180

38. Williamson A, Hoggart B. Pain: a review of three commonly used pain rating scales. J Clin Nurs . (2005) 14:798–804. doi: 10.1111/j.1365-2702.2005.01121.x

39. Kleyweg RP, van der Meché FG, Schmitz PI. Interobserver agreement in the assessment of muscle strength and functional abilities in Guillain-Barré syndrome. Muscle Nerve . (1991) 14:1103–9. doi: 10.1002/mus.880141111

40. Demeurisse G, Demol O, Robaye E. Motor evaluation in vascular hemiplegia. Eur Neurol . (1980) 19:382–9. doi: 10.1159/000115178

41. Collin C, Wade D. Assessing motor impairment after stroke: a pilot reliability study. J Neurol Neurosurg Psychiatry . (1990) 53:576–9. doi: 10.1136/jnnp.53.7.576

42. ATS Committee on Proficiency Standards for Clinical Pulmonary Function Laboratories. ATS statement: guidelines for the six-minute walk test. Am J Respir Crit Care Med . (2002) 166:111–7. doi: 10.1164/ajrccm.166.1.at1102

43. Borg GA. Psychophysical bases of perceived exertion. Med Sci Sports Exerc . (1982) 14:377–81. doi: 10.1249/00005768-198205000-00012

44. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull . (1979) 86:420–8. doi: 10.1037//0033-2909.86.2.420

45. Stratford PW, Goldsmith CH. Use of the standard error as a reliability index of interest: an applied example using elbow flexor strength data. Phys Ther . (1997) 77:745–50. doi: 10.1093/ptj/77.7.745

46. Bland M. Frequently asked questions on the design and analysis of measurement studies. Martin Bland’s Home Page . Available online at: https://www-users.york.ac.uk/~mb55/meas/comfaq.htm (accessed January 15, 2024).

47. Schober P, Boer C, Schwarte LA. Correlation coefficients: appropriate use and interpretation. Anesth Analg . (2018) 126:1763–8. doi: 10.1213/ANE.0000000000002864

48. Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med . (2016) 15:155–63. doi: 10.1016/j.jcm.2016.02.012

49. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet . (1986) 1:307–10. doi: 10.1016/S0140-6736(86)90837-8

50. Abu-Arafeh A, Jordan H, Drummond G. Reporting of method comparison studies: a review of advice, an assessment of current practice, and specific suggestions for future reports. Br J Anaesth . (2016) 117:569–75. doi: 10.1093/bja/aew320

51. Gerke O. Reporting standards for a Bland–Altman agreement analysis: a review of methodological reviews. Diagnostics . (2020) 10:334. doi: 10.3390/diagnostics10050334

52. Olofsen E. Webpage for Bland–Altman analysis. Department of Anesthesiology of the LUMC Available online at: https://sec.lumc.nl/method_agreement_analysis/ (accessed January 10, 2023).

53. Olofsen E, Dahan A, Borsboom G, Drummond G. Improvements in the application and reporting of advanced Bland–Altman methods of comparison. J Clin Monit Comput . (2015) 29:127–39. doi: 10.1007/s10877-014-9577-3

54. Liu J, Drutz C, Kumar R, McVicar L, Weinberger R, Brooks D, et al. Use of the six-minute walk test poststroke: is there a practice effect? Arch Phys Med Rehabil . (2008) 89:1686–92. doi: 10.1016/j.apmr.2008.02.026

55. Overgaard JA, Larsen CM, Holtze S, Ockholm K, Kristensen MT. Interrater reliability of the 6-minute walk test in women with hip fracture. J Geriatr Phys Ther . (2017) 40:158–66. doi: 10.1519/JPT.0000000000000088

56. Cancela JM, Ayán C, Gutiérrez-Santiago A, Prieto I, Varela S. The senior fitness test as a functional measure in Parkinson’s disease: a pilot study. Parkinsonism Relat Disord . (2012) 18:170–3. doi: 10.1016/j.parkreldis.2011.09.016

57. Taylor-Piliae RE, Latt LD, Hepworth JT, Coull BM. Predictors of gait velocity among community-dwelling stroke survivors. Gait Posture . (2012) 35:395–9. doi: 10.1016/j.gaitpost.2011.10.358

58. Mokkink LB, Boers M, van der Vleuten CPM, Bouter LM, Alonso J, Patrick DL, et al. COSMIN risk of bias tool to assess the quality of studies on reliability or measurement error of outcome measurement instruments: a delphi study. BMC Med Res Methodol . (2020) 20:293. doi: 10.1186/s12874-020-01179-5

59. Bohannon RW, Crouch RH. Two-minute step test of exercise capacity: systematic review of procedures, performance, and clinimetric properties. J Geriatr Phys Ther . (2019) 42:105–12. doi: 10.1519/JPT.0000000000000164

Keywords: stroke, lower-limb musculoskeletal disorder, exercise capacity, exercise endurance, two-minute step test

Citation: Ishigaki T, Kubo H, Yoshida K, Shimizu N and Ogawa T (2024) Validity and reliability of the 2-min step test in individuals with stroke and lower-limb musculoskeletal disorders. Front. Rehabil. Sci. 5:1384369. doi: 10.3389/fresc.2024.1384369

Received: 9 February 2024; Accepted: 1 April 2024; Published: 16 April 2024.

© 2024 Ishigaki, Kubo, Yoshida, Shimizu and Ogawa. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Tomoya Ishigaki [email protected]

8.4: Small Sample Tests for a Population Mean

Learning Objectives

  • To learn how to apply the five-step test procedure for tests of hypotheses concerning a population mean when the sample size is small.

In the previous section hypothesis testing for population means was described in the case of large samples. The statistical validity of the tests was ensured by the Central Limit Theorem, with essentially no assumptions on the distribution of the population. When sample sizes are small, as is often the case in practice, the Central Limit Theorem does not apply. One must then impose stricter assumptions on the population to give statistical validity to the test procedure. One common assumption is that the population from which the sample is taken has a normal probability distribution to begin with. Under such circumstances, if the population standard deviation is known, then the test statistic

\[\frac{(\bar{x}-\mu _0)}{\sigma /\sqrt{n}} \nonumber \]

still has the standard normal distribution, as in the previous two sections. If \(\sigma\) is unknown and is approximated by the sample standard deviation \(s\), then the resulting test statistic

\[\dfrac{(\bar{x}-\mu _0)}{s/\sqrt{n}} \nonumber \]

follows Student’s \(t\)-distribution with \(n-1\) degrees of freedom.

Standardized Test Statistics for Small Sample Hypothesis Tests Concerning a Single Population Mean

 If \(\sigma\) is known: \[Z=\frac{\bar{x}-\mu _0}{\sigma /\sqrt{n}} \nonumber \]

If \(\sigma\) is unknown: \[T=\frac{\bar{x}-\mu _0}{s /\sqrt{n}} \nonumber \]

  • The first test statistic (\(\sigma\) known) has the standard normal distribution.
  • The second test statistic (\(\sigma\) unknown) has Student’s \(t\)-distribution with \(n-1\) degrees of freedom.
  • The population must be normally distributed.

The distribution of the second standardized test statistic (the one containing \(s\)) and the corresponding rejection region for each form of the alternative hypothesis (left-tailed, right-tailed, or two-tailed) are shown in Figure \(\PageIndex{1}\). This is just like Figure 8.2.1 except that now the critical values are from the \(t\)-distribution. Figure 8.2.1 still applies to the first standardized test statistic (the one containing \(\sigma\)) since it follows the standard normal distribution.

Figure \(\PageIndex{1}\): Distribution of the Standardized Test Statistic and the Rejection Region

The \(p\)-value of a test of hypotheses for which the test statistic has Student’s \(t\)-distribution can be computed using statistical software, but it is impractical to do so using tables, since that would require \(30\) tables analogous to Figure 7.1.5, one for each degree of freedom from \(1\) to \(30\). Figure 7.1.6 can be used to approximate the \(p\)-value of such a test, and this is typically adequate for making a decision using the \(p\)-value approach to hypothesis testing, although not always. For this reason the tests in the two examples in this section will be made following the critical value approach to hypothesis testing summarized at the end of Section 8.1, but after each one we will show how the \(p\)-value approach could have been used.
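
As a minimal illustration of the software route (assuming SciPy is available; the summary statistics below are hypothetical, not taken from the examples that follow), the exact \(p\)-value can be read directly from the CDF of Student's \(t\)-distribution, whereas a printed table would only bracket it between two tail areas:

    # Hedged sketch: exact p-values for the small-sample statistic T via
    # Student's t-distribution (hypothetical summary statistics).
    from scipy import stats

    xbar, s, n, mu0 = 24.3, 4.1, 12, 26.0     # assumed values, illustration only

    T = (xbar - mu0) / (s / n**0.5)           # standardized test statistic
    df = n - 1

    p_left = stats.t.cdf(T, df)               # left-tailed p-value
    p_right = stats.t.sf(T, df)               # right-tailed p-value
    p_two = 2 * stats.t.sf(abs(T), df)        # two-tailed p-value

    print(f"T = {T:.3f} with df = {df}")
    print(f"left-tailed p = {p_left:.4f}, two-tailed p = {p_two:.4f}")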

Example \(\PageIndex{1}\)

The price of a popular tennis racket at a national chain store is \(\$179\). Portia bought five of the same racket at an online auction site for the following prices:

\[155\; 179\; 175\; 175\; 161 \nonumber \]

Assuming that the auction prices of rackets are normally distributed, determine whether there is sufficient evidence in the sample, at the \(5\%\) level of significance, to conclude that the average price of the racket is less than \(\$179\) if purchased at an online auction.

  • Step 1 . The assertion for which evidence must be provided is that the average online price \(\mu\) is less than the average price in retail stores, so the hypothesis test is \[H_0: \mu =179\\ \text{vs}\\ H_a: \mu <179\; @\; \alpha =0.05 \nonumber \]
  • Step 2 . The sample is small and the population standard deviation is unknown. Thus the test statistic is \[T=\frac{\bar{x}-\mu _0}{s /\sqrt{n}} \nonumber \] and has the Student \(t\)-distribution with \(n-1=5-1=4\) degrees of freedom.
  • Step 3 . From the data we compute \(\bar{x}=169\) and \(s=10.39\). Inserting these values into the formula for the test statistic gives \[T=\frac{\bar{x}-\mu _0}{s /\sqrt{n}}=\frac{169-179}{10.39/\sqrt{5}}=-2.152 \nonumber \]
  • Step 4 . Since the symbol in \(H_a\) is “\(<\)” this is a left-tailed test, so there is a single critical value, \(-t_\alpha =-t_{0.05}[df=4]\). Reading from the row labeled \(df=4\) in Figure 7.1.6 its value is \(-2.132\). The rejection region is \((-\infty ,-2.132]\).
  • Step 5 . As shown in Figure \(\PageIndex{2}\) the test statistic falls in the rejection region. The decision is to reject \(H_0\). In the context of the problem our conclusion is:

The data provide sufficient evidence, at the \(5\%\) level of significance, to conclude that the average price of such rackets purchased at online auctions is less than \(\$179\).

Figure \(\PageIndex{2}\): Rejection Region and Test Statistic for Example \(\PageIndex{1}\)

To perform the test in Example \(\PageIndex{1}\) using the \(p\)-value approach, look in the row in Figure 7.1.6 with the heading \(df=4\) and search for the two \(t\)-values that bracket the unsigned value \(2.152\) of the test statistic. They are \(2.132\) and \(2.776\), in the columns with headings \(t_{0.050}\) and \(t_{0.025}\). They cut off right tails of area \(0.050\) and \(0.025\), so because \(2.152\) is between them it must cut off a tail of area between \(0.050\) and \(0.025\). By symmetry \(-2.152\) cuts off a left tail of area between \(0.050\) and \(0.025\), hence the \(p\)-value corresponding to \(t=-2.152\) is between \(0.025\) and \(0.05\). Although its precise value is unknown, it must be less than \(\alpha =0.05\), so the decision is to reject \(H_0\).
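
The same conclusion can be checked in software. The sketch below (assuming SciPy 1.6 or later for the alternative keyword) reproduces Example \(\PageIndex{1}\): it recovers \(t\approx -2.152\), a \(p\)-value between \(0.025\) and \(0.05\), and the critical value \(-2.132\).

    # Hedged sketch reproducing Example 1 with SciPy (requires >= 1.6 for the
    # `alternative` keyword): left-tailed one-sample t-test on the auction prices.
    import numpy as np
    from scipy import stats

    prices = np.array([155, 179, 175, 175, 161])
    mu0, alpha = 179, 0.05

    t_stat, p_value = stats.ttest_1samp(prices, popmean=mu0, alternative="less")
    t_crit = stats.t.ppf(alpha, df=len(prices) - 1)    # left-tail critical value

    print(f"t = {t_stat:.3f}, p = {p_value:.4f}, critical value = {t_crit:.3f}")
    # t ~ -2.152 falls in the rejection region (-inf, -2.132], and p < 0.05,
    # so both approaches reject H0, matching the worked solution above.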

Example \(\PageIndex{2}\)

A small component in an electronic device has two small holes where another tiny part is fitted. In the manufacturing process the average distance between the two holes must be tightly controlled at \(0.02\) mm, else many units would be defective and wasted. Many times throughout the day quality control engineers take a small sample of the components from the production line, measure the distance between the two holes, and make adjustments if needed. Suppose at one time four units are taken and the distances are measured as

\[0.021\; 0.019\; 0.023\; 0.020 \nonumber \]

Determine, at the \(1\%\) level of significance, if there is sufficient evidence in the sample to conclude that an adjustment is needed. Assume the distances of interest are normally distributed.

  • Step 1 . The assumption is that the process is under control unless there is strong evidence to the contrary. Since a deviation of the average distance to either side is undesirable, the relevant test is \[H_0: \mu =0.02\\ \text{vs}\\ H_a: \mu \neq 0.02\; @\; \alpha =0.01 \nonumber \] where \(\mu\) denotes the mean distance between the holes.
  • Step 2 . The sample is small and the population standard deviation is unknown. Thus the test statistic is \[T=\frac{\bar{x}-\mu _0}{s /\sqrt{n}} \nonumber \] and has the Student \(t\)-distribution with \(n-1=4-1=3\) degrees of freedom.
  • Step 3 . From the data we compute \(\bar{x}=0.02075\) and \(s=0.00171\). Inserting these values into the formula for the test statistic gives \[T=\frac{\bar{x}-\mu _0}{s /\sqrt{n}}=\frac{0.02075-0.02}{0.00171/\sqrt{4}}=0.877 \nonumber \]
  • Step 4 . Since the symbol in \(H_a\) is “\(\neq\)” this is a two-tailed test, so there are two critical values, \(\pm t_{\alpha/2} =\pm t_{0.005}[df=3]\). Reading from the row in Figure 7.1.6 labeled \(df=3\) their values are \(\pm 5.841\). The rejection region is \((-\infty ,-5.841]\cup [5.841,\infty )\).
  • Step 5 . As shown in Figure \(\PageIndex{3}\) the test statistic does not fall in the rejection region. The decision is not to reject \(H_0\). In the context of the problem our conclusion is:

The data do not provide sufficient evidence, at the \(1\%\) level of significance, to conclude that the mean distance between the holes in the component differs from \(0.02\) mm.

Figure \(\PageIndex{3}\): Rejection Region and Test Statistic for Example \(\PageIndex{2}\)

To perform the test in "Example \(\PageIndex{2}\)" using the \(p\)-value approach, look in the row in Figure 7.1.6 with the heading \(df=3\) and search for the two \(t\)-values that bracket the value \(0.877\) of the test statistic. Actually \(0.877\) is smaller than the smallest number in the row, which is \(0.978\), in the column with heading \(t_{0.200}\). The value \(0.978\) cuts off a right tail of area \(0.200\), so because \(0.877\) is to its left it must cut off a tail of area greater than \(0.200\). Thus the \(p\)-value, which is the double of the area cut off (since the test is two-tailed), is greater than \(0.400\). Although its precise value is unknown, it must be greater than \(\alpha =0.01\), so the decision is not to reject \(H_0\).

Key Takeaway

  • There are two formulas for the test statistic in testing hypotheses about a population mean with small samples. One test statistic follows the standard normal distribution, the other Student’s \(t\)-distribution.
  • The population standard deviation is used if it is known, otherwise the sample standard deviation is used.
  • Either five-step procedure, critical value or \(p\)-value approach, is used with either test statistic (a short computational sketch follows this list).
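
The sketch below pulls these points together in a single helper (the function name and interface are illustrative, assuming SciPy is available): it chooses the \(Z\) or \(T\) statistic according to whether \(\sigma\) is known and reports the critical value, the \(p\)-value, and the resulting decision.

    # Illustrative helper (hypothetical name/interface): one-population-mean test
    # that uses Z when sigma is known and Student's t otherwise.
    import math
    from scipy import stats

    def one_mean_test(xbar, n, mu0, alpha, s=None, sigma=None,
                      alternative="two-sided"):
        if sigma is not None:                    # sigma known: standard normal
            stat, dist = (xbar - mu0) / (sigma / math.sqrt(n)), stats.norm
        else:                                    # sigma unknown: t with n-1 df
            stat, dist = (xbar - mu0) / (s / math.sqrt(n)), stats.t(df=n - 1)

        if alternative == "less":
            crit, p = dist.ppf(alpha), dist.cdf(stat)
            reject = stat <= crit
        elif alternative == "greater":
            crit, p = dist.ppf(1 - alpha), dist.sf(stat)
            reject = stat >= crit
        else:                                    # two-sided
            crit, p = dist.ppf(1 - alpha / 2), 2 * dist.sf(abs(stat))
            reject = abs(stat) >= crit

        return stat, crit, p, reject

    # Example 1 revisited from its summary statistics (left-tailed, alpha = 0.05):
    print(one_mean_test(xbar=169, n=5, mu0=179, alpha=0.05,
                        s=10.39, alternative="less"))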
