
Chapter 9 Hypothesis testing

The first unit was designed to prepare you for hypothesis testing. In the first chapter we discussed the three major goals of statistics:

  • Describe: connects to unit 1 with descriptive statistics and graphing
  • Decide: connects to unit 1 (knowing your data) and to hypothesis testing
  • Predict: connects to hypothesis testing and unit 3

The remaining chapters will cover many different kinds of hypothesis tests connected to different inferential statistics. Needless to say, hypothesis testing is the central topic of this course. This lesson is important, but important does not mean difficult. There is a lot of new language to learn for conducting a hypothesis test, but some of the components of a hypothesis test are topics we are already familiar with:

  • Test statistics
  • Probability
  • Distribution of sample means

Hypothesis testing is an inferential procedure that uses data from a sample to draw a general conclusion about a population. It is a formal approach and a statistical method that uses sample data to evaluate hypotheses about a population. When interpreting a research question and statistical results, a natural question arises as to whether the finding could have occurred by chance. Hypothesis testing is a statistical procedure for testing whether chance (random events) is a reasonable explanation of an experimental finding. Once you have mastered the material in this lesson you will be used to solving hypothesis testing problems and the rest of the course will seem much easier. In this chapter, we will introduce the ideas behind the use of statistics to make decisions – in particular, decisions about whether a particular hypothesis is supported by the data.

Logic and Purpose of Hypothesis Testing

The statistician Ronald Fisher explained the concept of hypothesis testing with a story of a lady tasting tea. Fisher was a statistician from London and is noted as the first person to formalize the process of hypothesis testing. His elegantly simple “Lady Tasting Tea” experiment demonstrated the logic of the hypothesis test.


Figure 1. A depiction of the lady tasting tea.

Fisher would often have afternoon tea during his studies. He usually took tea with a woman who claimed to be a tea expert. In particular, she told Fisher that she could tell which was poured first in the teacup, the milk or the tea, simply by tasting the cup. Fisher, being a scientist, decided to put this rather bizarre claim to the test. The lady accepted his challenge. Fisher brought her 8 cups of tea in succession; 4 cups would be prepared with the milk added first, and 4 with the tea added first. The cups would be presented in a random order unknown to the lady.

The lady would take a sip of each cup as it was presented and report which ingredient she believed was poured first. Using the laws of probability, Fisher determined that the chance of her guessing all 8 cups correctly was 1/70, or about 1.4%. In other words, if the lady was indeed guessing, there was a 1.4% chance of her getting all 8 cups correct. On the day of the experiment, Fisher had 8 cups prepared just as he had requested. The lady drank each cup and made her decisions for each one.

After the experiment, it was revealed that the lady got all 8 cups correct! Remember, had she been truly guessing, the chance of getting this result was 1.4%. Since this probability was so low, Fisher instead concluded that the lady could indeed differentiate between the milk or the tea being poured first. Fisher’s original hypothesis that she was just guessing was demonstrated to be false and was therefore rejected. The alternative hypothesis, that the lady could truly tell the cups apart, was then accepted as true.

This story demonstrates many components of hypothesis testing in a very simple way. For example, Fisher started with a hypothesis that the lady was guessing. He then determined that if she was indeed guessing, the probability of guessing all 8 right was very small, just 1.4%. Since that probability was so tiny, when she did get all 8 cups right, Fisher determined it was extremely unlikely she was guessing. A more reasonable conclusion was that the lady had the skill to tell the cups apart.
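The 1/70 figure can be checked with a short calculation. This is a minimal sketch that assumes the lady knew exactly 4 of the 8 cups had milk poured first, so that a pure guess amounts to picking one of the equally likely ways to choose those 4 cups. The variable names are illustrative only.

```python
from math import comb

# Number of ways to choose which 4 of the 8 cups had milk poured first.
total_arrangements = comb(8, 4)

# Only one of those arrangements matches the true preparation exactly,
# so a pure guess is completely right with probability 1 / total.
p_all_correct = 1 / total_arrangements

print(total_arrangements)        # 70
print(round(p_all_correct, 4))   # 0.0143
```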

In hypothesis testing, we always begin by setting up a particular hypothesis to test. We then use probability to determine how likely our observed results would be if that hypothesis were correct. If it appears our original hypothesis was wrong, we reject it and accept the alternative hypothesis. The alternative hypothesis is usually the opposite of our original hypothesis. In Fisher’s case, his original hypothesis was that the lady was guessing. His alternative hypothesis was that the lady was not guessing.

Consider a second example. Suppose James Bond claims he can tell whether a martini was shaken or stirred, and he is correct on 13 of 16 taste tests. This result does not prove that he can; it could be he was just lucky and guessed right 13 out of 16 times. But how plausible is the explanation that he was just lucky? To assess its plausibility, we determine the probability that someone who was just guessing would be correct 13/16 times or more. This probability can be computed to be 0.0106. This is a pretty low probability, and therefore someone would have to be very lucky to be correct 13 or more times out of 16 if they were just guessing. A low probability gives us more confidence that Bond can tell whether the drink was shaken or stirred. There is also still a chance that Mr. Bond was very lucky (more on this later!). The hypothesis that he was guessing is not proven false, but considerable doubt is cast on it. Therefore, there is strong evidence that Mr. Bond can tell whether a drink was shaken or stirred.
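The value 0.0106 can be reproduced with the binomial distribution. The sketch below assumes each taste test is an independent guess with a 50% chance of being correct. The function name `binomial_tail` is made up for this illustration.

```python
from math import comb

def binomial_tail(successes: int, trials: int, p: float = 0.5) -> float:
    """Probability of at least `successes` correct answers in `trials` guesses."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Probability of 13 or more correct out of 16 if Mr. Bond were only guessing.
p_value = binomial_tail(13, 16)
print(round(p_value, 4))  # 0.0106
```

The sum runs over 13, 14, 15, and 16 correct answers because the question is about a result at least as good as the one observed, not exactly 13.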

You may notice some patterns here:

  • We have 2 hypotheses: the original (the starting assumption, such as “the lady was guessing”) and the alternative
  • We collect data
  • We determine how likely or unlikely the observed data would be if the original hypothesis were true, based on probability.
  • We determine whether we have enough evidence to reject the original hypothesis and draw conclusions.

Now let’s bring in some specific terminology:

Null hypothesis: In general, the null hypothesis, written H0 (“H-naught”), is the idea that nothing is going on: there is no effect of our treatment, no relation between our variables, and no difference in our sample mean from what we expected about the population mean. The null hypothesis indicates that an apparent effect is due to chance. This is always our baseline starting assumption, and it is what we (typically) seek to reject. In mathematical notation, the null hypothesis is written with an equals sign (=).

Alternative hypothesis: If the null hypothesis is rejected, then we will need some other explanation, which we call the alternative hypothesis, HA or H1. The alternative hypothesis is simply the reverse of the null hypothesis. Thus, our alternative hypothesis is the mathematical way of stating our research question. In general, the alternative hypothesis states that there is an effect of treatment, a relation between variables, or a difference between a sample mean and a population mean. Support for the alternative hypothesis is evidence that the findings are not due to chance. It is also called the research hypothesis, as this is the most common outcome a researcher is looking for: evidence of change, differences, or relationships. There are three options for setting up the alternative hypothesis, depending on where we expect the difference to lie. The alternative hypothesis always involves some kind of inequality (≠, >, or <).

  • If we expect a specific direction of change, difference, or relationship, we use a directional hypothesis, and the form of the alternative hypothesis follows from the research question itself. For example, one would expect a decrease in depression from taking an anti-depressant (<), or an increase in exam scores after completing a student success exam preparation module (>). These two directions make up 2 of the 3 alternative hypothesis options. The third option is to state that there is a difference, change, or relationship without predicting its direction. This is a non-directional alternative hypothesis (typically written with ≠ in mathematical notation).
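The null hypothesis and the three forms of the alternative hypothesis can be written compactly. The notation below is a sketch that assumes the hypothesis concerns a population mean; the symbols μ (the population mean) and μ0 (the value expected under the null hypothesis) are introduced here for illustration and may differ from the notation used in later chapters.

```latex
\begin{align*}
H_0 &: \mu = \mu_0 \\
H_A &: \mu \neq \mu_0 && \text{(non-directional)} \\
H_A &: \mu > \mu_0 && \text{(directional, increase expected)} \\
H_A &: \mu < \mu_0 && \text{(directional, decrease expected)}
\end{align*}
```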

Probability value (p-value): the probability of a certain outcome assuming a certain state of the world. In statistics, it is conventional to refer to possible states of the world as hypotheses since they are hypothesized states of the world. Using this terminology, the probability value is the probability of an outcome given the hypothesis. It is not the probability of the hypothesis given the outcome. It is very important to understand precisely what the probability values mean. In the James Bond example, the computed probability of 0.0106 is the probability he would be correct on 13 or more taste tests (out of 16) if he were just guessing. It is easy to mistake this probability of 0.0106 as the probability he cannot tell the difference. This is not at all what it means. The probability of 0.0106 is the probability of a certain outcome (13 or more out of 16) assuming a certain state of the world (James Bond was only guessing).

A low probability value casts doubt on the null hypothesis. How low must the probability value be in order to conclude that the null hypothesis is false? Although there is clearly no right or wrong answer to this question, it is conventional to conclude the null hypothesis is false if the probability value is less than 0.05 (p < .05). More conservative researchers conclude the null hypothesis is false only if the probability value is less than 0.01 (p < .01). When a researcher concludes that the null hypothesis is false, the researcher is said to have rejected the null hypothesis. The probability value below which the null hypothesis is rejected is called the α level or simply α (“alpha”). It is also called the significance level. If α is not explicitly specified, assume that α = 0.05.

Decision-making is part of the process and we have some language that goes along with that. Importantly, null hypothesis testing operates under the assumption that the null hypothesis is true unless the evidence shows otherwise. We (typically) seek to reject the null hypothesis, giving us evidence to support the alternative hypothesis. If the probability of the outcome given the hypothesis is sufficiently low, we have evidence that the null hypothesis is false. Note that all probability calculations for all hypothesis tests center on the null hypothesis. In the James Bond example, the null hypothesis is that he cannot tell the difference between shaken and stirred martinis. If that were true, the probability of correctly identifying 13 or more of 16 martinis would be low (0.0106), which provides evidence that he can tell the difference. Note that we have not computed the probability that he can tell the difference.

The type of hypothesis testing reviewed here is known as null hypothesis statistical testing (NHST). We can break the process of null hypothesis testing down into a number of steps a researcher would use.

  • Formulate a hypothesis that embodies our prediction (before seeing the data)
  • Specify null and alternative hypotheses
  • Collect some data relevant to the hypothesis
  • Compute a test statistic
  • Identify the criterion probability (or compute the probability of the observed value of that statistic), assuming that the null hypothesis is true
  • Assess the “statistical significance” of the result and draw conclusions

Steps in hypothesis testing

Step 1: formulate a hypothesis of interest.

As a running example, consider the Physicians’ Reactions study, in which researchers hypothesized that physicians spend less time with obese patients. The researchers’ hypothesis is derived from an identified population. In creating a research hypothesis, we also have to decide whether we want to test a directional or a non-directional hypothesis. Researchers typically select a non-directional hypothesis for a more conservative approach, particularly when the outcome is unknown (more about why this is later).

Step 2: Specify the null and alternative hypotheses

Can you set up the null and alternative hypotheses for the Physician’s Reaction Experiment?

Step 3: Determine the alpha level.

For this course, alpha will be given to you as .05 or .01. Researchers decide on alpha and then determine the value of the test statistic associated with that alpha for the sample. Researchers in the Physicians’ Reactions study might set alpha at .05 and identify the value of the test statistic associated with .05 for their sample size. Researchers might take extra precautions to be more confident in their findings (more on this later).

Step 4: Collect some data

For this course, the data will be given to you.  Researchers collect the data and then start to summarize it using descriptive statistics. The mean time physicians reported that they would spend with obese patients was 24.7 minutes as compared to a mean of 31.4 minutes for normal-weight patients.

Step 5: Compute a test statistic

We next want to use the data to compute a statistic that will ultimately let us decide whether the null hypothesis is rejected or not. We can think of the test statistic as providing a measure of the size of the effect compared to the variability in the data. In general, this test statistic will have a probability distribution associated with it, because that allows us to determine how likely our observed value of the statistic is under the null hypothesis.

To assess the plausibility of the hypothesis that the difference in mean times is due to chance, we compute the probability of getting a difference as large or larger than the observed difference (31.4 – 24.7 = 6.7 minutes) if the difference were, in fact, due solely to chance.

Step 6: Determine the probability of the observed result under the null hypothesis 

Using methods presented in later chapters, the probability associated with the observed difference between the two groups in the Physicians’ Reactions study was computed to be 0.0057. Since this is such a low probability, we have confidence that the difference in times is related to the patient’s weight (obese or not) and is not due to chance. We can then reject the null hypothesis (that there are no differences, or that any differences seen are due to chance).

Keep in mind that the null hypothesis is typically the opposite of the researcher’s hypothesis. In the Physicians’ Reactions study, the researchers hypothesized that physicians would expect to spend less time with obese patients. The null hypothesis is that the two types of patients are treated identically. If the null hypothesis were true, a difference as large or larger than the sample difference of 6.7 minutes would be very unlikely to occur. Therefore, the researchers rejected the null hypothesis of no difference and concluded that in the population, physicians intend to spend less time with obese patients.

This is the step where NHST starts to violate our intuition. Rather than determining the likelihood that the null hypothesis is true given the data, we instead determine the likelihood under the null hypothesis of observing a statistic at least as extreme as one that we have observed — because we started out by assuming that the null hypothesis is true! To do this, we need to know the expected probability distribution for the statistic under the null hypothesis, so that we can ask how likely the result would be under that distribution. This will be determined from a table we use for reference or calculated in a statistical analysis program. Note that when I say “how likely the result would be”, what I really mean is “how likely the observed result or one more extreme would be”. We need to add this caveat as we are trying to determine how weird our result would be if the null hypothesis were true, and any result that is more extreme will be even more weird, so we want to count all of those weirder possibilities when we compute the probability of our result under the null hypothesis.

Let’s review some considerations for null hypothesis statistical testing (NHST)!

Null hypothesis statistical testing (NHST) is commonly used in many fields. If you pick up almost any scientific or biomedical research publication, you will see NHST being used to test hypotheses, and in their introductory psychology textbook, Gerrig & Zimbardo (2002) referred to NHST as the “backbone of psychological research”. Thus, learning how to use and interpret the results from hypothesis testing is essential to understand the results from many fields of research.

It is also important for you to know, however, that NHST is flawed, and that many statisticians and researchers think that it has been the cause of serious problems in science, which we will discuss further in this unit. NHST is also widely misunderstood, largely because it violates our intuitions about how statistical hypothesis testing should work. Let’s look at an example to see this.

There is great interest in the use of body-worn cameras by police officers, which are thought to reduce the use of force and improve officer behavior. However, in order to establish this we need experimental evidence, and it has become increasingly common for governments to use randomized controlled trials to test such ideas. A randomized controlled trial of the effectiveness of body-worn cameras was performed by the Washington, DC government and DC Metropolitan Police Department in 2015-2016. Officers were randomly assigned to wear a body-worn camera or not, and their behavior was then tracked over time to determine whether the cameras resulted in less use of force and fewer civilian complaints about officer behavior.

Before we get to the results, let’s ask how you would think the statistical analysis might work. Let’s say we want to specifically test the hypothesis of whether the use of force is decreased by the wearing of cameras. The randomized controlled trial provides us with the data to test the hypothesis – namely, the rates of use of force by officers assigned to either the camera or control groups. The next obvious step is to look at the data and determine whether they provide convincing evidence for or against this hypothesis. That is: What is the likelihood that body-worn cameras reduce the use of force, given the data and everything else we know?

It turns out that this is not how null hypothesis testing works. Instead, we first take our hypothesis of interest (i.e. that body-worn cameras reduce use of force), and flip it on its head, creating a null hypothesis. In this case, the null hypothesis would be that cameras do not reduce use of force. Importantly, we then assume that the null hypothesis is true. We then look at the data, and determine how likely the data would be if the null hypothesis were true. If the data are sufficiently unlikely under the null hypothesis, we can reject the null in favor of the alternative hypothesis, which is our hypothesis of interest. If there is not sufficient evidence to reject the null, then we say that we retain (or “fail to reject”) the null, sticking with our initial assumption that the null is true.

Understanding some of the concepts of NHST, particularly the notorious “p-value”, is invariably challenging the first time one encounters them, because they are so counter-intuitive. As we will see later, there are other approaches that provide a much more intuitive way to address hypothesis testing (but have their own complexities).

Step 7: Assess the “statistical significance” of the result. Draw conclusions.

The next step is to determine whether the p-value that results from the previous step is small enough that we are willing to reject the null hypothesis and conclude instead that the alternative is true. In the Physicians’ Reactions study, the probability value is 0.0057. Therefore, the effect of obesity is statistically significant and the null hypothesis that obesity makes no difference is rejected. It is very important to keep in mind that statistical significance means only that the null hypothesis of exactly no effect is rejected; it does not mean that the effect is important, which is what “significant” usually means. When an effect is significant, you can have confidence the effect is not exactly zero. Finding that an effect is significant does not tell you about how large or important the effect is.

How much evidence do we require and what considerations are needed to better understand the significance of the findings? This is one of the most controversial questions in statistics, in part because it requires a subjective judgment – there is no “correct” answer.
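The decision rule itself is simple to write down. The sketch below applies it to the two p-values reported in this chapter, using the conventional α levels of .05 and .01. The function name `decide` is made up for this illustration.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the NHST decision for a given p-value and alpha level."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# p-values reported in this chapter
examples = {"James Bond taste test": 0.0106, "Physicians' Reactions": 0.0057}

for name, p in examples.items():
    for alpha in (0.05, 0.01):
        print(f"{name}: p = {p}, alpha = {alpha} -> {decide(p, alpha)}")
```

The James Bond result is rejected at α = .05 but not at α = .01, because 0.0106 is slightly above .01. The same data can lead to different decisions depending on the threshold chosen in advance.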

What does a statistically significant result mean?

There is a great deal of confusion about what p-values actually mean (Gigerenzer, 2004). Let’s say that we do an experiment comparing the means between conditions, and we find a difference with a p-value of .01. There are a number of possible interpretations that one might entertain.

Does it mean that the probability of the null hypothesis being true is .01? No. Remember that in null hypothesis testing, the p-value is the probability of the data given the null hypothesis. It does not warrant conclusions about the probability of the null hypothesis given the data.

Does it mean that the probability that you are making the wrong decision is .01? No. Remember as above that p-values are probabilities of data under the null, not probabilities of hypotheses.

Does it mean that if you ran the study again, you would obtain the same result 99% of the time? No. The p-value is a statement about the likelihood of a particular dataset under the null; it does not allow us to make inferences about the likelihood of future events such as replication.

Does it mean that you have found a practically important effect? No. There is an essential distinction between statistical significance and practical significance. As an example, let’s say that we performed a randomized controlled trial to examine the effect of a particular diet on body weight, and we find a statistically significant effect at p<.05. What this doesn’t tell us is how much weight was actually lost, which we refer to as the effect size (to be discussed in more detail). If we think about a study of weight loss, then we probably don’t think that the loss of one ounce (i.e. the weight of a few potato chips) is practically significant. Let’s look at our ability to detect a significant difference of 1 ounce as the sample size increases.

A statistically significant result is not necessarily a strong one. Even a very weak result can be statistically significant if it is based on a large enough sample. This is why it is important to distinguish between the statistical significance of a result and the practical significance of that result. Practical significance refers to the importance or usefulness of the result in some real-world context and is often described in terms of the effect size.
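To see how sample size alone can drive significance, the sketch below holds a small observed difference fixed and recomputes the p-value at several sample sizes. It assumes a two-group z-test with equal group sizes and a known standard deviation, which is a simplification. The difference of 0.05 standard deviations is a hypothetical value chosen for illustration, not a figure from this chapter.

```python
from math import erfc, sqrt

def two_sided_p(d: float, n_per_group: int) -> float:
    """Two-sided p-value of a z-test for an observed standardized difference d
    between two equal-sized groups, assuming a known standard deviation."""
    z = d * sqrt(n_per_group / 2)
    return erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))

d = 0.05  # hypothetical small difference, in standard-deviation units
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(two_sided_p(d, n), 4))
```

The observed difference never changes, yet the p-value shrinks as the groups get larger and eventually falls below .05.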

Many differences are statistically significant—and may even be interesting for purely scientific reasons—but they are not practically significant. In clinical practice, this same concept is often referred to as “clinical significance.” For example, a study on a new treatment for social phobia might show that it produces a statistically significant positive effect. Yet this effect still might not be strong enough to justify the time, effort, and other costs of putting it into practice—especially if easier and cheaper treatments that work almost as well already exist. Although statistically significant, this result would be said to lack practical or clinical significance.

Be aware that the term effect size can be misleading because it suggests a causal relationship—that the difference between the two means is an “effect” of being in one group or condition as opposed to another. In other words, simply calling the difference an “effect size” does not make the relationship a causal one.

Figure 2 shows how the proportion of significant results increases as the sample size increases, such that with a very large sample size (about 262,000 total subjects), we will find a significant result in more than 90% of studies when there is a 1 ounce difference in weight loss between the diets. While these results are statistically significant, most physicians would not consider a weight loss of one ounce to be practically or clinically significant. We will explore this relationship in more detail when we return to the concept of statistical power in Chapter X, but it should already be clear from this example that statistical significance is not necessarily indicative of practical significance.

Figure 2: The proportion of significant results for a very small change (1 ounce, which is about .001 standard deviations) as a function of sample size.

Challenges with using p-values

Historically, the most common answer to the question of how much evidence we require has been that we should reject the null hypothesis if the p-value is less than 0.05. This comes from the writings of Ronald Fisher, who has been referred to as “the single most important figure in 20th century statistics” (Efron, 1998):

“If P is between .1 and .9 there is certainly no reason to suspect the hypothesis tested. If it is below .02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at .05 … it is convenient to draw the line at about the level at which we can say: Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials” (Fisher, 1925)

Fisher never intended p < 0.05 to be a fixed rule:

“no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas” (Fisher, 1956)

Instead, it is likely that p < .05 became a ritual due to the reliance upon tables of p-values that were used before computing made it easy to compute p values for arbitrary values of a statistic. All of the tables had an entry for 0.05, making it easy to determine whether one’s statistic exceeded the value needed to reach that level of significance. Although we use tables in this class, statistical software examines the specific probability value for the calculated statistic.

Assessing Error Rate: Type I and Type II Error

Although there are challenges with p-values for decision making, we will examine a way we can think about hypothesis testing in terms of its error rate.  This was proposed by Jerzy Neyman and Egon Pearson:

“no test based upon a theory of probability can by itself provide any valuable evidence of the truth or falsehood of a hypothesis. But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not often be wrong” (Neyman & Pearson, 1933)

That is: We can’t know which specific decisions are right or wrong, but if we follow the rules, we can at least know how often our decisions will be wrong in the long run.

To understand the decision-making framework that Neyman and Pearson developed, we first need to discuss statistical decision-making in terms of the kinds of outcomes that can occur. There are two possible states of reality (H0 is true, or H0 is false), and two possible decisions (reject H0, or retain H0). There are two ways in which we can make a correct decision:

  • We can reject H0 when it is false (in the language of signal detection theory, we call this a hit)
  • We can retain H0 when it is true (somewhat confusingly in this context, this is called a correct rejection)

There are also two kinds of errors we can make:

  • We can reject H0 when it is actually true (we call this a false alarm, or Type I error). A Type I error means that we have concluded that there is a relationship in the population when in fact there is not. Type I errors occur because even when there is no relationship in the population, sampling error alone will occasionally produce an extreme result.
  • We can retain H0 when it is actually false (we call this a miss, or Type II error). A Type II error means that we have concluded that there is no relationship in the population when in fact there is.

Summing up, when you perform a hypothesis test, there are four possible outcomes depending on the actual truth (or falseness) of the null hypothesis H0 and the decision to reject or not. The outcomes are summarized in the following table:

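Table 1. The four possible outcomes in hypothesis testing.

Decision            | H0 is true         | H0 is false
Do not reject H0    | Correct decision   | Type II error
Reject H0           | Type I error       | Correct decision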

  • The decision is not to reject H0 when H0 is true (correct decision).
  • The decision is to reject H0 when H0 is true (incorrect decision known as a Type I error ).
  • The decision is not to reject H0 when, in fact, H0 is false (incorrect decision known as a Type II error ).
  • The decision is to reject H0 when H0 is false ( correct decision ).

Neyman and Pearson coined two terms to describe the probability of these two types of errors in the long run:

  • P(Type I error) = α (alpha)
  • P(Type II error) = β (beta)

That is, if we set α to .05, then in the long run we should make a Type I error 5% of the time when the null hypothesis is true. α is the significance level against which the p-value is compared, and it is common to set α to .05. When the null hypothesis is true and α is .05, we will mistakenly reject the null hypothesis 5% of the time. (This is why α is sometimes referred to as the “Type I error rate.”) In principle, it is possible to reduce the chance of a Type I error by setting α to something less than .05. Setting it to .01, for example, would mean that if the null hypothesis is true, then there is only a 1% chance of mistakenly rejecting it. But making it harder to reject true null hypotheses also makes it harder to reject false ones and therefore increases the chance of a Type II error.
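The long-run meaning of α can be illustrated with a small simulation. This is a sketch under stated assumptions: every simulated study draws data from a population where the null hypothesis is true (mean 0, known standard deviation 1), and each study is analyzed with a two-sided z-test. The function name, sample size, number of studies, and seed are arbitrary choices for illustration.

```python
import random
from math import erfc, sqrt

def type_i_error_rate(alpha=0.05, n=25, n_studies=10_000, seed=1):
    """Fraction of simulated studies that reject H0 when H0 is actually true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        # H0 is true: the population mean really is 0 (standard deviation 1).
        sample = [rng.gauss(0, 1) for _ in range(n)]
        z = (sum(sample) / n) * sqrt(n)    # z statistic with known sigma = 1
        p_value = erfc(abs(z) / sqrt(2))   # two-sided p-value
        if p_value < alpha:
            rejections += 1
    return rejections / n_studies

print(type_i_error_rate(alpha=0.05))
print(type_i_error_rate(alpha=0.01))
```

Under these assumptions, the proportion of false rejections should land close to whichever α was chosen, which is what calling α the Type I error rate means.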

In practice, Type II errors occur primarily because the research design lacks adequate statistical power to detect the relationship (e.g., the sample is too small). Statistical power is the complement of the Type II error probability (power = 1 − β). We will have more to say about statistical power shortly. The standard value for an acceptable level of β (beta) is .2; that is, we are willing to accept that 20% of the time we will fail to detect a true effect when it truly exists. It is possible to reduce the chance of a Type II error by setting α to something greater than .05 (e.g., .10). But making it easier to reject false null hypotheses also makes it easier to reject true ones and therefore increases the chance of a Type I error. This provides some insight into why the convention is to set α to .05. There is some agreement among researchers that this level of α keeps the rates of both Type I and Type II errors at acceptable levels.
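The trade-off between α and β can be made concrete with a short calculation. This sketch assumes a two-sided one-sample z-test with a known standard deviation. The effect size (0.5 standard deviations) and sample size (30) are hypothetical values chosen for illustration, and the function name is made up.

```python
from math import sqrt
from statistics import NormalDist

def beta_two_sided_z(d: float, n: int, alpha: float) -> float:
    """Type II error probability of a two-sided one-sample z-test
    for a true standardized effect d and sample size n."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n)
    # Power: probability the z statistic falls beyond either critical value.
    power = (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)
    return 1 - power

d, n = 0.5, 30  # hypothetical effect size and sample size
for alpha in (0.01, 0.05, 0.10):
    beta = beta_two_sided_z(d, n, alpha)
    print(f"alpha = {alpha:.2f}  beta = {beta:.3f}  power = {1 - beta:.3f}")
```

With the effect and sample size held fixed, a stricter α gives a larger β, and a more lenient α gives a smaller β.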

The possibility of committing Type I and Type II errors has several important implications for interpreting the results of our own and others’ research. One is that we should be cautious about interpreting the results of any individual study because there is a chance that it reflects a Type I or Type II error. This is why researchers consider it important to replicate their studies. Each time researchers replicate a study and find a similar result, they rightly become more confident that the result represents a real phenomenon and not just a Type I or Type II error.

Test Statistic Assumptions

The last consideration, which we will revisit with each test statistic (e.g., t-test, z-test, and ANOVA) in the coming chapters, is the set of assumptions behind the test. There are four main assumptions. These assumptions are often taken for granted when using the prescribed data for the course. In the real world, these assumptions would need to be examined, often by testing them with statistical software.

  • Assumption of random sampling. A sample is random when each individual (person or animal) in your population has an equal chance of being included in the sample; therefore selection of any individual happens by chance, rather than by choice. This reduces the chance that differences in materials, characteristics or conditions may bias results. Remember that random samples are more likely to be representative of the population, so researchers can be more confident interpreting the results. Note: there is no test that statistical software can perform which assures random sampling has occurred, but following good sampling techniques helps to ensure your samples are random.
  • Assumption of Independence. Statistical independence is a critical assumption for many statistical tests, including the 2-sample t-test and ANOVA. It is often assumed that observations are independent of each other, but often this assumption is not met. Independence means the value of one observation does not influence or affect the value of other observations. Independent data items are not connected with one another in any way (unless you account for it in your study). Even the smallest dependence in your data can turn into heavily biased results (which may be undetectable) if you violate this assumption. Note: there is no test statistical software can perform that assures independence of the data because this should be addressed during the research planning phase. Using a non-parametric test is often recommended if a researcher is concerned this assumption has been violated.
  • Assumption of Normality. Normality assumes that the continuous variables (dependent variable) used in the analysis are normally distributed. Normal distributions are symmetric around the center (the mean) and form a bell-shaped distribution. Normality is violated when sample data are skewed. With large enough sample sizes (n > 30) the violation of the normality assumption should not cause major problems (remember the central limit theorem) but there is a feature in most statistical software that can alert researchers to an assumption violation.
  • Assumption of Equal Variance. Variance refers to the spread of scores around the mean. Many statistical tests assume that although different samples can come from populations with different means, they have the same variance. Equality of variance (i.e., homogeneity of variance) is violated when variances across different groups or samples are significantly different. Note: there is a feature in most statistical software to test for this.
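As a rough illustration of what examining the normality and equal-variance assumptions can look like, the sketch below computes two simple descriptive checks: the ratio of the largest to the smallest sample variance, and a moment-based skewness value. These are informal heuristics written for this example, not the formal tests that statistical software provides, and the scores are made up.

```python
from statistics import mean, pstdev, variance

def variance_ratio(*groups):
    """Largest sample variance divided by the smallest (1.0 means identical spread)."""
    variances = [variance(g) for g in groups]
    return max(variances) / min(variances)

def sample_skewness(values):
    """Simple moment-based skewness; close to 0 for symmetric data."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

group_a = [12, 15, 14, 10, 13, 16, 11, 14]   # made-up scores
group_b = [22, 9, 30, 5, 18, 27, 2, 25]      # made-up scores with a wider spread

print("variance ratio:", round(variance_ratio(group_a, group_b), 2))
print("skewness of group_a:", round(sample_skewness(group_a), 2))
```

A variance ratio far above 1 or a skewness far from 0 would be a prompt to look more closely, for example with the assumption checks built into the software used for the analysis.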

We will use 4 main steps for hypothesis testing:

  • Step 1: State the hypotheses. Usually the hypotheses concern population parameters and predict the characteristics that a sample should have.
      • Null: The null hypothesis (H0) states that there is no difference, no effect, or no change between the population mean and the sample mean.
      • Alternative: The alternative hypothesis (H1 or HA) states that there is a difference or a change between the population and the sample. It is the opposite of the null hypothesis.
  • Step 2: Set criteria for a decision. In this step we must determine the boundary of our distribution at which the null hypothesis will be rejected. Researchers usually use either a 5% (.05) or a 1% (.01) critical boundary. Recall from our earlier story about Ronald Fisher that the lower the probability, the more confident he was that the Tea Lady was not guessing. We will apply this to z in the next chapter.
  • Step 3: Compare sample and population to decide if the hypothesis has support.
  • Step 4: Make a decision. When a researcher uses hypothesis testing, the individual is making a decision about whether the data collected are sufficient to state that the population parameters are significantly different.
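The four steps above can be sketched in code. This is a minimal illustration, assuming a two-tailed z-test with a known population standard deviation; the function name and all the numbers (a sample mean of 104 from 36 people, a population mean of 100, a standard deviation of 15) are hypothetical.

```python
from statistics import NormalDist

def z_test_four_steps(sample_mean, mu0, sigma, n, alpha=0.05):
    # Step 1: state the hypotheses. H0: mu = mu0; H1: mu != mu0 (two-tailed).
    # Step 2: set the decision criterion, the critical z for the chosen alpha.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Step 3: compare the sample with the population through the test statistic.
    z = (sample_mean - mu0) / (sigma / n ** 0.5)
    # Step 4: make a decision.
    decision = "reject H0" if abs(z) > z_crit else "fail to reject H0"
    return z, z_crit, decision

z, z_crit, decision = z_test_four_steps(sample_mean=104, mu0=100, sigma=15, n=36)
print(f"z = {z:.2f}, critical z = ±{z_crit:.2f}: {decision}")
```

Note that the criterion in Step 2 is fixed before the test statistic is computed, which mirrors the order of the steps.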

Further considerations

1. The probability value is the probability of a result as extreme or more extreme given that the null hypothesis is true. It is the probability of the data given the null hypothesis. It is not the probability that the null hypothesis is false.

2. A low probability value indicates that the sample outcome (or one more extreme) would be very unlikely if the null hypothesis were true. We will learn more about assessing effect size later in this unit.

3. A non-significant outcome means that the data do not conclusively demonstrate that the null hypothesis is false. There is always a chance of error, and there are four possible outcomes associated with hypothesis testing.

4. It is important to take into account the assumptions for each test statistic.
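To make the definition of the probability value concrete, the sketch below converts a z statistic into a p-value as a tail area under the standard normal curve. It assumes a z-test; the function name and the `tail` argument are illustrative choices, not notation from this chapter.

```python
from statistics import NormalDist

def p_value_from_z(z, tail="two"):
    """Probability of a z at least this extreme, assuming the null hypothesis is true."""
    cdf = NormalDist().cdf
    if tail == "right":
        return 1 - cdf(z)          # area above z
    if tail == "left":
        return cdf(z)              # area below z
    return 2 * (1 - cdf(abs(z)))   # both tails beyond |z|

print(round(p_value_from_z(1.96), 3))
print(round(p_value_from_z(1.96, tail="right"), 3))
```

The result describes how surprising the data would be if the null hypothesis were true; it says nothing directly about the probability that the null hypothesis itself is true or false.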

Learning objectives

Having read the chapter, you should be able to:

  • Identify the components of a hypothesis test, including the parameter of interest, the null and alternative hypotheses, and the test statistic.
  • State the hypotheses and identify appropriate critical areas depending on how hypotheses are set up.
  • Describe the proper interpretations of a p-value as well as common misinterpretations.
  • Distinguish between the two types of error in hypothesis testing, and the factors that determine them.
  • Describe the main criticisms of null hypothesis statistical testing.
  • Identify the purpose of effect size and power.

Exercises – Ch. 9

  • In your own words, explain what the null hypothesis is.
  • What are Type I and Type II Errors?
  • Why do we phrase null and alternative hypotheses with population parameters and not sample means?
  • If our null hypothesis is “H0: μ = 40”, what are the three possible alternative hypotheses?
  • Why do we state our hypotheses and decision criteria before we collect our data?
  • When and why do you calculate an effect size?

Answers to Odd-Numbered Exercises – Ch. 9

1. Your answer should include mention of the baseline assumption of no difference between the sample and the population.

3. Alpha is the significance level. It is the criterion we use when deciding whether to reject or fail to reject the null hypothesis, corresponding to a given proportion of the area under the normal distribution and a probability of finding extreme scores assuming the null hypothesis is true.

5. μ > 40; μ < 40; μ ≠ 40

7. We calculate effect size to determine the strength of the finding. Effect size should always be calculated when we have rejected the null hypothesis. Effect size can be calculated for non-significant findings as a possible indicator of Type II error.

Introduction to Statistics for Psychology Copyright © 2021 by Alisa Beyer is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.

The random variable is the mean Internet speed in megabits per second.

The random variable is the mean number of children an American family has.

The random variable is the proportion of people picked at random in Times Square visiting the city.

  • H 0 : p = 0.42
  • H a : p < 0.42
  • H 0 : μ = 15
  • H a : μ ≠ 15

Type I: The mean price of mid-sized cars is $32,000, but we conclude that it is not $32,000.

Type II: The mean price of mid-sized cars is not $32,000, but we conclude that it is $32,000.

α = the probability that you think the bag cannot withstand –15 degrees F, when, in fact, it can.

β = the probability that you think the bag can withstand –15 degrees F, when, in fact, it cannot.

Type I: The procedure will go well, but the doctors think it will not.

Type II: The procedure will not go well, but the doctors think it will.

A normal distribution or a Student’s t -distribution

Use a Student’s t -distribution

a normal distribution for a single population mean

It must be approximately normally distributed.

They must both be greater than five.

binomial distribution

The outcome of winning is very unlikely.

H 0 : μ ≥ 73; H a : μ < 73. The p -value is almost zero, which means there is sufficient data to conclude that the mean height of high school students who play basketball on the school team is less than 73 inches at the 5 percent level. The data do support the claim.

The shaded region shows a low p -value.

Do not reject H 0 .

the mean time spent on homework for 26 students

X̄ ~ N(2.5, 1.5/√26)

This is a left-tailed test.

This is a two-tailed test.

a right-tailed test

a left-tailed test

  • H 0 : μ = 34; H a : μ ≠ 34
  • H 0 : p ≤ 0.60; H a : p > 0.60
  • H 0 : μ ≥ 100,000; H a : μ < 100,000
  • H 0 : p = 0.29; H a : p ≠ 0.29
  • H 0 : p = 0.05; H a : p < 0.05
  • H 0 : μ ≤ 10; H a : μ > 10
  • H 0 : p = 0.50; H a : p ≠ 0.50
  • H 0 : μ = 6; H a : μ ≠ 6
  • H 0 : p ≥ 0.11; H a : p < 0.11
  • H 0 : μ ≤ 20,000; H a : μ > 20,000
  • Type I error: We conclude that the mean is not 34 years, when it really is 34 years. Type II error: We conclude that the mean is 34 years, when in fact it really is not 34 years.
  • Type I error: We conclude that more than 60 percent of Americans vote in presidential elections, when the actual percentage is at most 60 percent. Type II error: We conclude that at most 60 percent of Americans vote in presidential elections when, in fact, more than 60 percent do.
  • Type I error: We conclude that the mean starting salary is less than $100,000, when it really is at least $100,000. Type II error: We conclude that the mean starting salary is at least $100,000 when, in fact, it is less than $100,000.
  • Type I error: We conclude that the proportion of high school seniors who take physical education daily is not 29%, when it really is 29%. Type II error: We conclude that the proportion of high school seniors who take physical education daily is 29% when, in fact, it is not 29%.
  • Type I error: We conclude that fewer than 5 percent of adults ride the bus to work in Los Angeles, when the percentage that do is really 5 percent. Type II error: We conclude that 5 percent of adults ride the bus to work in Los Angeles when, in fact, fewer than 5 percent do.
  • Type I error: We conclude that the mean number of cars a person owns in his or her lifetime is more than 10, when in reality it is not more than 10. Type II error: We conclude that the mean number of cars a person owns in his or her lifetime is not more than 10 when, in fact, it is more than 10.
  • Type I error: We conclude that the proportion of Americans who prefer to live away from cities is not about half, though the actual proportion is about half. Type II error: We conclude that the proportion of Americans who prefer to live away from cities is half when, in fact, it is not half.
  • Type I error: We conclude that the duration of paid vacations each year for Europeans is not six weeks, when in fact it is six weeks. Type II error: We conclude that the duration of paid vacations each year for Europeans is six weeks when, in fact, it is not.
  • Type I error: We conclude that the proportion is less than 11 percent, when it is really at least 11 percent. Type II error: We conclude that the proportion of women who develop breast cancer is at least 11 percent, when in fact it is less than 11 percent.
  • Type I error: We conclude that the average tuition cost at private universities is more than $20,000, though in reality it is at most $20,000. Type II error: We conclude that the average tuition cost at private universities is at most $20,000 when, in fact, it is more than $20,000.
  • H 0 : μ ≥ 50,000
  • H a : μ < 50,000
  • Let X̄ = the average lifespan of a brand of tires.
  • normal distribution
  • p -value = 0.0103
  • Check student’s solution.
  • Alpha: 0.05
  • Decision: Reject the null hypothesis.
  • Reason for decision: The p -value is less than 0.05.
  • Conclusion: There is sufficient evidence to conclude that the mean lifespan of the tires is less than 50,000 miles.
  • (43,537, 49,463)
  • H 0 : μ ≥ 35.5
  • H a : μ < 35.5
  • Let x̄ = the average mpg for the sample of cars and trucks in the fleet.
  • p -value = 0.2578
  • Decision: Do not reject the null hypothesis.
  • Reason for decision: The p-value is greater than 0.05.
  • Conclusion: There is insufficient evidence to conclude that the manufacturer’s fleet fails to meet the fuel economy standards in the 2016 policy.
  • (31.88 mpg, 37.32 mpg)
  • H 0 : μ = $1.00
  • H a : μ ≠ $1.00
  • Let x̄ = the average cost of a daily newspaper.
  • p -value = 0.3865
  • Alpha: 0.01
  • Reason for decision: The p -value is greater than 0.01.
  • Conclusion: There is insufficient evidence to conclude that the mean cost of daily papers is not $1. The mean cost could be $1.
  • ($0.84, $1.06)
  • H 0 : μ = 10
  • H a : μ ≠ 10
  • Let X̄ = the mean number of sick days an employee takes per year.
  • Student’s t -distribution
  • p -value = 0.300
  • Reason for decision: The p -value is greater than 0.05.
  • Conclusion: At the 5 percent significance level, there is insufficient evidence to conclude that the mean number of sick days is not 10.
  • (4.9443, 11.806)
  • H 0 : p ≥ 0.6
  • H a : p < 0.6
  • Let P′ = the proportion of students who feel more enriched as a result of taking elementary statistics.
  • normal for a single proportion
  • p -value = 0.1308
  • Conclusion: There is insufficient evidence to conclude that less than 60 percent of her students feel more enriched.
  • Confidence interval: (0.409, 0.654) The “plus-4s” confidence interval is (0.411, 0.648)
  • H 0 : μ = 4
  • H a : μ ≠ 4
  • Let X̄ = the average IQ of a set of brown trout.
  • two-tailed Student's t -test
  • p -value = 0.076
  • Reason for decision: The p -value is greater than 0.05
  • Conclusion: There is insufficient evidence to conclude that the average IQ of brown trout is not four.
  • (3.8865, 5.9468)
  • H 0 : p ≥ 0.13
  • H a : p < 0.13
  • Let P′ = the proportion of Americans who have the disease
  • p -value = 0.0036
  • Conclusion: There is sufficient evidence to conclude that the percentage of Americans who have been diagnosed with the disease is less than 13 percent.
  • (0, 0.0623). The plus-4s confidence interval is (0.0022, 0.0978)
  • H 0 : μ ≥ 129
  • H a : μ < 129
  • Let X̄ = the average time in seconds that Terri finishes Lap 4.
  • Student's t -distribution
  • Conclusion: There is insufficient evidence to conclude that Terri’s mean lap time is less than 129 seconds.
  • (128.63, 130.37)
  • H 0 : p = 0.60
  • H a : p < 0.60
  • Let P′ = the proportion of family members who shed tears at a reunion.
  • Reason for decision: p -value < alpha
  • Conclusion: At the 5 percent significance level, there is sufficient evidence to conclude that the proportion of family members who shed tears at a reunion is less than 0.60. However, the test is weak because the p -value and alpha are quite close, so other tests should be done.
  • We are 95 percent confident that between 38.29 percent and 61.71 percent of family members will shed tears at a family reunion. (0.3829, 0.6171). The plus-4s confidence interval (see chapter 8) is (0.3861, 0.6139)

Note that here the large-sample 1-PropZTest provides the approximate p -value of 0.0438. Whenever a p -value based on a normal approximation is close to the level of significance, the exact p -value based on binomial probabilities should be calculated whenever possible. This is beyond the scope of this course.
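For readers who want to see the difference, the sketch below compares the large-sample normal approximation with an exact binomial left-tail probability. The sample size n = 70 and count x = 35 are assumptions: they are not stated in this solution, but were chosen because they are consistent with the reported confidence interval and approximate p -value. The function names are hypothetical, and this is plain arithmetic, not a reproduction of any calculator routine.

```python
from math import comb
from statistics import NormalDist

def exact_left_tail_p(x, n, p0):
    """Exact P(X <= x) when X ~ Binomial(n, p0)."""
    return sum(comb(n, k) * p0 ** k * (1 - p0) ** (n - k) for k in range(x + 1))

def normal_approx_left_tail_p(x, n, p0):
    """Large-sample z approximation for a left-tailed test (no continuity correction)."""
    z = (x / n - p0) / (p0 * (1 - p0) / n) ** 0.5
    return NormalDist().cdf(z)

n, x, p0 = 70, 35, 0.60   # n and x are assumed values, not given in the text
approx_p = normal_approx_left_tail_p(x, n, p0)
exact_p = exact_left_tail_p(x, n, p0)
print("normal approximation:", round(approx_p, 4))
print("exact binomial:      ", round(exact_p, 4))
```

When the two values fall on opposite sides of the significance level, the decision depends on which one is used, which is why the note above recommends the exact calculation in borderline cases.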

  • H 0 : μ ≥ 22
  • H a : μ < 22
  • Let X̄ = the mean number of bubbles per blow.
  • p -value = 0.00486
  • Conclusion: There is sufficient evidence to conclude that the mean number of bubbles per blow is less than 22.
  • (18.501, 21.499)
  • H 0 : μ ≤ 1
  • H a : μ > 1
  • Let X̄ = the mean cost in dollars of macaroni and cheese in a certain town.
  • p -value = 0.36756
  • Conclusion: The mean cost could be $1, or less. At the 5 percent significance level, there is insufficient evidence to conclude that the mean price of a box of macaroni and cheese is more than $1.
  • (0.8291, 1.241)
  • H 0 : p = 0.01
  • H a : p > 0.01
  • Let P′ = the proportion of errors generated
  • Normal for a single proportion
  • Conclusion: At the 5 percent significance level, there is sufficient evidence to conclude that the proportion of errors generated is more than 0.01.
  • Confidence interval: (0, 0.094). The plus-4s confidence interval is (0.004, 0.144).
  • H 0 : p = 0.50
  • H a : p < 0.50
  • Let P′ = the proportion of friends that has a pierced ear.
  • p -value = 0.0448
  • Reason for decision: The p -value is less than 0.05. (However, they are very close.)
  • Conclusion: There is sufficient evidence to support the claim that less than 50 percent of his friends have pierced ears.
  • Confidence interval: (0.245, 0.515): The plus-4s confidence interval is (0.259, 0.519).
  • H 0 : p = 0.40
  • H a : p < 0.40
  • Let P′ = the proportion of schoolmates who fear public speaking.
  • p -value = 0.1563
  • Conclusion: There is insufficient evidence to support the claim that less than 40 percent of students at the school fear public speaking.
  • Confidence interval: (0.3241, 0.4240): The plus-4s confidence interval is (0.3257, 0.4250).
  • H 0 : p = 0.14
  • H a : p < 0.14
  • Let P′ = the proportion of nursing home residents that have the disease.
  • p -value = 0.3914
  • At the 5 percent significance level, there is insufficient evidence to conclude that the proportion of nursing home residents that have the disease is less than 0.14.
  • Confidence interval: (0.0502, 0.2070): The plus-4s confidence interval (see chapter 8) is (0.0676, 0.2297).
  • H 0 : μ = 69,110
  • H a : μ > 69,110
  • Let X̄ = the mean salary in dollars for California registered nurses.
  • p -value: 0.0466
  • Conclusion: At the 5 percent significance level, there is sufficient evidence to conclude that the mean salary of California registered nurses exceeds $69,110.
  • ($68,757, $73,485)
  • H 0 : p ≥ 0.14, H a : p < 0.14
  • p -value < 0.0002
  • Reject the null hypothesis.
  • At the 5 percent significance level, there is sufficient evidence to conclude that the proportion of Harleys stolen is significantly less than their share of all motorcycles. (conclusion a)
  • H 0 : p = 0.488 H a : p ≠ 0.488
  • p -value = 0.0114
  • alpha = 0.05
  • At the 5 percent level of significance, there is enough evidence to conclude that the proportion of families that own stocks is not 48.8 percent.
  • The survey does not appear to be accurate.
  • H 0 : p = 0.517 H a : p ≠ 0.517
  • p -value = 0.9203.
  • alpha = 0.05.
  • Do not reject the null hypothesis.
  • At the 5 percent significance level, there is not enough evidence to conclude that the proportion of homes in Kentucky that are heated by natural gas differs from 0.517.
  • However, we cannot generalize this result to the entire nation. First, the sample’s population is only the state of Kentucky. Second, it is reasonable to assume that homes in the extreme north and south will have extreme high usage and low usage, respectively. We would need to expand our sample base to include these possibilities if we wanted to generalize this claim to the entire nation.
  • H 0 : µ ≥ 11.52 H a : µ < 11.52
  • p -value = 0.000002 which is almost 0.
  • At the 5 percent significance level, there is enough evidence to conclude that the mean amount of summer rain in the northeastern US is less than 11.52 inches, on average.
  • We would make the same conclusion if alpha was 1 percent because the p -value is almost 0.
  • H 0 : µ ≤ 5.8 H a : µ > 5.8
  • p -value = 0.9987
  • At the 5 percent level of significance, there is not enough evidence to conclude that a woman visits her doctor, on average, more than 5.8 times a year.
  • H 0 : µ ≥ 150 H a : µ < 150
  • p -value = 0.0622
  • alpha = 0.01
  • At the 1 percent significance level, there is not enough evidence to conclude that freshmen students study less than 2.5 hours per day, on average.
  • The student academic group’s claim appears to be correct.

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution License and you must attribute Texas Education Agency (TEA). The original material is available at: https://www.texasgateway.org/book/tea-statistics . Changes were made to the original material, including updates to art, structure, and other content updates.

Access for free at https://openstax.org/books/statistics/pages/1-introduction
  • Authors: Barbara Illowsky, Susan Dean
  • Publisher/website: OpenStax
  • Book title: Statistics
  • Publication date: Mar 27, 2020
  • Location: Houston, Texas
  • Book URL: https://openstax.org/books/statistics/pages/1-introduction
  • Section URL: https://openstax.org/books/statistics/pages/9-solutions

© Jan 23, 2024 Texas Education Agency (TEA). The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.

Statistics LibreTexts

9.1: Introduction to Hypothesis Testing

  • Kyle Siegrist
  • University of Alabama in Huntsville via Random Services

Basic Theory

Preliminaries.

As usual, our starting point is a random experiment with an underlying sample space and a probability measure \(\P\). In the basic statistical model, we have an observable random variable \(\bs{X}\) taking values in a set \(S\). In general, \(\bs{X}\) can have quite a complicated structure. For example, if the experiment is to sample \(n\) objects from a population and record various measurements of interest, then \[ \bs{X} = (X_1, X_2, \ldots, X_n) \] where \(X_i\) is the vector of measurements for the \(i\)th object. The most important special case occurs when \((X_1, X_2, \ldots, X_n)\) are independent and identically distributed. In this case, we have a random sample of size \(n\) from the common distribution.

The purpose of this section is to define and discuss the basic concepts of statistical hypothesis testing . Collectively, these concepts are sometimes referred to as the Neyman-Pearson framework, in honor of Jerzy Neyman and Egon Pearson, who first formalized them.

A statistical hypothesis is a statement about the distribution of \(\bs{X}\). Equivalently, a statistical hypothesis specifies a set of possible distributions of \(\bs{X}\): the set of distributions for which the statement is true. A hypothesis that specifies a single distribution for \(\bs{X}\) is called simple ; a hypothesis that specifies more than one distribution for \(\bs{X}\) is called composite .

In hypothesis testing , the goal is to see if there is sufficient statistical evidence to reject a presumed null hypothesis in favor of a conjectured alternative hypothesis . The null hypothesis is usually denoted \(H_0\) while the alternative hypothesis is usually denoted \(H_1\).

An hypothesis test is a statistical decision ; the conclusion will either be to reject the null hypothesis in favor of the alternative, or to fail to reject the null hypothesis. The decision that we make must, of course, be based on the observed value \(\bs{x}\) of the data vector \(\bs{X}\). Thus, we will find an appropriate subset \(R\) of the sample space \(S\) and reject \(H_0\) if and only if \(\bs{x} \in R\). The set \(R\) is known as the rejection region or the critical region . Note the asymmetry between the null and alternative hypotheses. This asymmetry is due to the fact that we assume the null hypothesis, in a sense, and then see if there is sufficient evidence in \(\bs{x}\) to overturn this assumption in favor of the alternative.

An hypothesis test is a statistical analogy to proof by contradiction, in a sense. Suppose for a moment that \(H_1\) is a statement in a mathematical theory and that \(H_0\) is its negation. One way that we can prove \(H_1\) is to assume \(H_0\) and work our way logically to a contradiction. In an hypothesis test, we don't prove anything of course, but there are similarities. We assume \(H_0\) and then see if the data \(\bs{x}\) are sufficiently at odds with that assumption that we feel justified in rejecting \(H_0\) in favor of \(H_1\).

Often, the critical region is defined in terms of a statistic \(w(\bs{X})\), known as a test statistic , where \(w\) is a function from \(S\) into another set \(T\). We find an appropriate rejection region \(R_T \subseteq T\) and reject \(H_0\) when the observed value \(w(\bs{x}) \in R_T\). Thus, the rejection region in \(S\) is then \(R = w^{-1}(R_T) = \left\{\bs{x} \in S: w(\bs{x}) \in R_T\right\}\). As usual, the use of a statistic often allows significant data reduction when the dimension of the test statistic is much smaller than the dimension of the data vector.

The ultimate decision may be correct or may be in error. There are two types of errors, depending on which of the hypotheses is actually true.

Types of errors:

  • A type 1 error is rejecting the null hypothesis \(H_0\) when \(H_0\) is true.
  • A type 2 error is failing to reject the null hypothesis \(H_0\) when the alternative hypothesis \(H_1\) is true.

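Similarly, there are two ways to make a correct decision: we could reject \(H_0\) when \(H_1\) is true or we could fail to reject \(H_0\) when \(H_0\) is true. The possibilities are summarized in the following table:

State | Decision: fail to reject \(H_0\) | Decision: reject \(H_0\)
\(H_0\) is true | Correct | Type 1 error
\(H_1\) is true | Type 2 error | Correct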

Of course, when we observe \(\bs{X} = \bs{x}\) and make our decision, either we will have made the correct decision or we will have committed an error, and usually we will never know which of these events has occurred. Prior to gathering the data, however, we can consider the probabilities of the various errors.

If \(H_0\) is true (that is, the distribution of \(\bs{X}\) is specified by \(H_0\)), then \(\P(\bs{X} \in R)\) is the probability of a type 1 error for this distribution. If \(H_0\) is composite, then \(H_0\) specifies a variety of different distributions for \(\bs{X}\) and thus there is a set of type 1 error probabilities.

The maximum probability of a type 1 error, over the set of distributions specified by \( H_0 \), is the significance level of the test or the size of the critical region.

The significance level is often denoted by \(\alpha\). Usually, the rejection region is constructed so that the significance level is a prescribed, small value (typically 0.1, 0.05, 0.01).

If \(H_1\) is true (that is, the distribution of \(\bs{X}\) is specified by \(H_1\)), then \(\P(\bs{X} \notin R)\) is the probability of a type 2 error for this distribution. Again, if \(H_1\) is composite then \(H_1\) specifies a variety of different distributions for \(\bs{X}\), and thus there will be a set of type 2 error probabilities. Generally, there is a tradeoff between the type 1 and type 2 error probabilities. If we reduce the probability of a type 1 error, by making the rejection region \(R\) smaller, we necessarily increase the probability of a type 2 error because the complementary region \(S \setminus R\) is larger.

The extreme cases can give us some insight. First consider the decision rule in which we never reject \(H_0\), regardless of the evidence \(\bs{x}\). This corresponds to the rejection region \(R = \emptyset\). A type 1 error is impossible, so the significance level is 0. On the other hand, the probability of a type 2 error is 1 for any distribution defined by \(H_1\). At the other extreme, consider the decision rule in which we always reject \(H_0\) regardless of the evidence \(\bs{x}\). This corresponds to the rejection region \(R = S\). A type 2 error is impossible, but now the probability of a type 1 error is 1 for any distribution defined by \(H_0\). In between these two worthless tests are meaningful tests that take the evidence \(\bs{x}\) into account.

If \(H_1\) is true, so that the distribution of \(\bs{X}\) is specified by \(H_1\), then \(\P(\bs{X} \in R)\), the probability of rejecting \(H_0\) is the power of the test for that distribution.

Thus the power of the test for a distribution specified by \( H_1 \) is the probability of making the correct decision.

Suppose that we have two tests, corresponding to rejection regions \(R_1\) and \(R_2\), respectively, each having significance level \(\alpha\). The test with region \(R_1\) is uniformly more powerful than the test with region \(R_2\) if \[ \P(\bs{X} \in R_1) \ge \P(\bs{X} \in R_2) \text{ for every distribution of } \bs{X} \text{ specified by } H_1 \]

Naturally, in this case, we would prefer the first test. Often, however, two tests will not be uniformly ordered; one test will be more powerful for some distributions specified by \(H_1\) while the other test will be more powerful for other distributions specified by \(H_1\).

If a test has significance level \(\alpha\) and is uniformly more powerful than any other test with significance level \(\alpha\), then the test is said to be a uniformly most powerful test at level \(\alpha\).

Clearly a uniformly most powerful test is the best we can do.

\(P\)-value

In most cases, we have a general procedure that allows us to construct a test (that is, a rejection region \(R_\alpha\)) for any given significance level \(\alpha \in (0, 1)\). Typically, \(R_\alpha\) decreases (in the subset sense) as \(\alpha\) decreases.

The \(P\)-value of the observed value \(\bs{x}\) of \(\bs{X}\), denoted \(P(\bs{x})\), is defined to be the smallest \(\alpha\) for which \(\bs{x} \in R_\alpha\); that is, the smallest significance level for which \(H_0\) is rejected, given \(\bs{X} = \bs{x}\).

Knowing \(P(\bs{x})\) allows us to test \(H_0\) at any significance level for the given data \(\bs{x}\): If \(P(\bs{x}) \le \alpha\) then we would reject \(H_0\) at significance level \(\alpha\); if \(P(\bs{x}) \gt \alpha\) then we fail to reject \(H_0\) at significance level \(\alpha\). Note that \(P(\bs{X})\) is a statistic . Informally, \(P(\bs{x})\) can often be thought of as the probability of an outcome as or more extreme than the observed value \(\bs{x}\), where extreme is interpreted relative to the null hypothesis \(H_0\).
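The definition of the \(P\)-value as the smallest significance level at which \(H_0\) is rejected can be checked numerically. The sketch below is only an illustration under assumed conditions: it uses a right-tailed test based on a standard normal test statistic, with rejection regions \(R_\alpha = \{z: z \gt z_{1-\alpha}\}\), and the observed value 2.1 is hypothetical. It searches for the smallest rejecting \(\alpha\) by bisection and compares it with the upper tail probability.

```python
from statistics import NormalDist

STD = NormalDist()

def in_rejection_region(z, alpha):
    """Right-tailed z-test: R_alpha = {z : z > z_(1 - alpha)}."""
    return z > STD.inv_cdf(1 - alpha)

def smallest_rejecting_alpha(z, tol=1e-10):
    """Bisect for the smallest alpha in (0, 1) at which z falls in R_alpha."""
    lo, hi = 1e-12, 1 - 1e-12   # for moderate z: no rejection at lo, rejection at hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if in_rejection_region(z, mid):
            hi = mid            # still rejecting, so try a smaller alpha
        else:
            lo = mid            # no longer rejecting, so alpha must be larger
    return hi

z_obs = 2.1                     # hypothetical observed test statistic
p_by_search = smallest_rejecting_alpha(z_obs)
p_by_tail = 1 - STD.cdf(z_obs)
print(p_by_search, p_by_tail)
```

The bisection works because the regions \(R_\alpha\) are nested: once \(z\) is rejected at some level, it is rejected at every larger level. The two printed numbers should agree to within the search tolerance, matching the informal description of the \(P\)-value as the probability of an outcome as or more extreme than the one observed.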

Analogy with Justice Systems

There is a helpful analogy between statistical hypothesis testing and the criminal justice system in the US and various other countries. Consider a person charged with a crime. The presumed null hypothesis is that the person is innocent of the crime; the conjectured alternative hypothesis is that the person is guilty of the crime. The test of the hypotheses is a trial with evidence presented by both sides playing the role of the data. After considering the evidence, the jury delivers the decision as either not guilty or guilty . Note that innocent is not a possible verdict of the jury, because it is not the point of the trial to prove the person innocent. Rather, the point of the trial is to see whether there is sufficient evidence to overturn the null hypothesis that the person is innocent in favor of the alternative hypothesis that the person is guilty. A type 1 error is convicting a person who is innocent; a type 2 error is acquitting a person who is guilty. Generally, a type 1 error is considered the more serious of the two possible errors, so in an attempt to hold the chance of a type 1 error to a very low level, the standard for conviction in serious criminal cases is beyond a reasonable doubt .

Tests of an Unknown Parameter

Hypothesis testing is a very general concept, but an important special class occurs when the distribution of the data variable \(\bs{X}\) depends on a parameter \(\theta\) taking values in a parameter space \(\Theta\). The parameter may be vector-valued, so that \(\bs{\theta} = (\theta_1, \theta_2, \ldots, \theta_k)\) and \(\Theta \subseteq \R^k\) for some \(k \in \N_+\). The hypotheses generally take the form \[ H_0: \theta \in \Theta_0 \text{ versus } H_1: \theta \notin \Theta_0 \] where \(\Theta_0\) is a prescribed subset of the parameter space \(\Theta\). In this setting, the probabilities of making an error or a correct decision depend on the true value of \(\theta\). If \(R\) is the rejection region, then the power function \( Q \) is given by \[ Q(\theta) = \P_\theta(\bs{X} \in R), \quad \theta \in \Theta \] The power function gives a lot of information about the test.

The power function satisfies the following properties:

  • \(Q(\theta)\) is the probability of a type 1 error when \(\theta \in \Theta_0\).
  • \(\max\left\{Q(\theta): \theta \in \Theta_0\right\}\) is the significance level of the test.
  • \(1 - Q(\theta)\) is the probability of a type 2 error when \(\theta \notin \Theta_0\).
  • \(Q(\theta)\) is the power of the test when \(\theta \notin \Theta_0\).
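The properties above can be checked numerically in a simple case. The following is a sketch under assumptions not made in the text: the data are \(n\) independent normal observations with unknown mean \(\theta\) and known standard deviation \(\sigma\), the hypotheses are \(H_0: \theta \le \theta_0\) versus \(H_1: \theta \gt \theta_0\), and \(H_0\) is rejected when the sample mean exceeds \(\theta_0 + z_{1 - \alpha} \sigma / \sqrt{n}\). Under these assumptions \(Q(\theta) = 1 - \Phi\left(z_{1 - \alpha} - (\theta - \theta_0) \sqrt{n} / \sigma\right)\). The function name and default values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution

def power(theta, theta0=0.0, sigma=1.0, n=25, alpha=0.05):
    """Power function Q(theta) of the right-tailed z test described above."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)        # critical value z_{1 - alpha}
    shift = (theta - theta0) * sqrt(n) / sigma     # standardized distance from theta0
    return 1 - STD_NORMAL.cdf(z_crit - shift)

# On the null side, Q(theta) is the probability of a type 1 error,
# and its largest value (at theta = theta0) is the significance level.
print(power(-0.2), power(0.0))

# On the alternative side, Q(theta) is the power and 1 - Q(theta)
# is the probability of a type 2 error.
print(power(0.5), 1 - power(0.5))
```

Two tests of the same level could be compared in the same way, by evaluating both power functions over a grid of parameter values in the alternative.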

If we have two tests, we can compare them by means of their power functions.

Suppose that we have two tests, corresponding to rejection regions \(R_1\) and \(R_2\), respectively, each having significance level \(\alpha\). The test with rejection region \(R_1\) is uniformly more powerful than the test with rejection region \(R_2\) if \( Q_1(\theta) \ge Q_2(\theta)\) for all \( \theta \notin \Theta_0 \).

Most hypothesis tests of an unknown real parameter \(\theta\) fall into three special cases:

Suppose that \( \theta \) is a real parameter and \( \theta_0 \in \Theta \) a specified value. The tests below are respectively the two-sided test, the left-tailed test, and the right-tailed test.

  • \(H_0: \theta = \theta_0\) versus \(H_1: \theta \ne \theta_0\)
  • \(H_0: \theta \ge \theta_0\) versus \(H_1: \theta \lt \theta_0\)
  • \(H_0: \theta \le \theta_0\) versus \(H_1: \theta \gt \theta_0\)

Thus the tests are named after the conjectured alternative. Of course, there may be other unknown parameters besides \(\theta\) (known as nuisance parameters).

Equivalence Between Hypothesis Test and Confidence Sets

There is an equivalence between hypothesis tests and confidence sets for a parameter \(\theta\).

Suppose that \(C(\bs{x})\) is a \(1 - \alpha\) level confidence set for \(\theta\). The following test has significance level \(\alpha\) for the hypothesis \( H_0: \theta = \theta_0 \) versus \( H_1: \theta \ne \theta_0 \): Reject \(H_0\) if and only if \(\theta_0 \notin C(\bs{x})\)

By definition, \(\P[\theta \in C(\bs{X})] = 1 - \alpha\). Hence if \(H_0\) is true so that \(\theta = \theta_0\), then the probability of a type 1 error is \(\P[\theta_0 \notin C(\bs{X})] = \alpha\).

Equivalently, we fail to reject \(H_0\) at significance level \(\alpha\) if and only if \(\theta_0\) is in the corresponding \(1 - \alpha\) level confidence set. In particular, this equivalence applies to interval estimates of a real parameter \(\theta\) and the common tests for \(\theta\) given above.

In each case below, the confidence interval has confidence level \(1 - \alpha\) and the test has significance level \(\alpha\).

  • Suppose that \(\left[L(\bs{X}), U(\bs{X})\right]\) is a two-sided confidence interval for \(\theta\). Reject \(H_0: \theta = \theta_0\) versus \(H_1: \theta \ne \theta_0\) if and only if \(\theta_0 \lt L(\bs{X})\) or \(\theta_0 \gt U(\bs{X})\).
  • Suppose that \(L(\bs{X})\) is a confidence lower bound for \(\theta\). Reject \(H_0: \theta \le \theta_0\) versus \(H_1: \theta \gt \theta_0\) if and only if \(\theta_0 \lt L(\bs{X})\).
  • Suppose that \(U(\bs{X})\) is a confidence upper bound for \(\theta\). Reject \(H_0: \theta \ge \theta_0\) versus \(H_1: \theta \lt \theta_0\) if and only if \(\theta_0 \gt U(\bs{X})\).
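The two-sided case can be illustrated with a short sketch. The model is an assumption made only for this example: \(n\) independent normal observations with unknown mean \(\theta\) and known standard deviation \(\sigma\), so that the usual interval is the sample mean plus or minus \(z_{1 - \alpha/2} \, \sigma / \sqrt{n}\). The sketch compares the decision made from the interval with the decision made from the standardized statistic \((\bar{x} - \theta_0) \sqrt{n} / \sigma\). All function names and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_interval(xbar, sigma, n, alpha):
    """Level 1 - alpha confidence interval for a normal mean with known sigma."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

def reject_by_interval(theta0, xbar, sigma, n, alpha):
    """Reject H0: theta = theta0 exactly when theta0 lies outside the interval."""
    lower, upper = two_sided_interval(xbar, sigma, n, alpha)
    return theta0 < lower or theta0 > upper

def reject_by_statistic(theta0, xbar, sigma, n, alpha):
    """Reject H0: theta = theta0 when the standardized statistic is too large."""
    z_stat = (xbar - theta0) / (sigma / sqrt(n))
    return abs(z_stat) > NormalDist().inv_cdf(1 - alpha / 2)

# Assumed summary data: sample mean 10.3 from n = 40 observations, sigma = 2.
for theta0 in (9.0, 9.5, 10.0, 10.5, 11.0, 11.5):
    print(theta0,
          reject_by_interval(theta0, 10.3, 2.0, 40, 0.05),
          reject_by_statistic(theta0, 10.3, 2.0, 40, 0.05))
```

The two rules agree for every hypothesized value in the loop, which is the equivalence stated above in this particular model.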

Pivot Variables and Test Statistics

Recall that confidence sets of an unknown parameter \(\theta\) are often constructed through a pivot variable, that is, a random variable \(W(\bs{X}, \theta)\) that depends on the data vector \(\bs{X}\) and the parameter \(\theta\), but whose distribution does not depend on \(\theta\) and is known. In this case, a natural test statistic for the basic tests given above is \(W(\bs{X}, \theta_0)\).

Statistics with R

Student resources, chapter 9: hypothesis tests: introduction, basic concepts, and an example.

The following exercises test your understanding of hypothesis testing in the context of the triangle taste test described in Chapter 9. In this case, the test consists of 14 identical trials on which the subject attempts to identify the odd sample on each trial. Assume that there are two possible rejection regions: RR 8 = {8, 9, 10, 11, 12, 13, 14} and RR 10 = {10, 11, 12, 13, 14}.

1. With a rejection region of RR 8 = {8, 9, 10, 11, 12, 13, 14}, what is the probability of a Type I error? Recall that a Type I error can occur only when the subject has no taste-discrimination ability, so p = 1/3.

Answer: 0.05762. This is equal to the probability of 8 or more correct identifications if the subject has no taste discrimination ability (and p = 1 / 3).

1 - pbinom(7, 14, 1/3)

## [1] 0.0576163

sum(dbinom(8 : 14, 14, 1/3))

## [1] 0.0576163

2. With a rejection region of RR 8 = {8, 9, 10, 11, 12, 13, 14}, what is the probability of a Type II error, if the subject has a probability of p = 0.80 of identifying the odd sample?

Answer: 0.01161; this is the probability of 7 or fewer correct identifications if the subject is able to identify the odd sample with 0.80 probability.

pbinom(7, 14, 0.80)

## [1] 0.01160991

sum(dbinom(0 : 7, 14, 0.80))

3. With a rejection region of RR 10 = {10, 11, 12, 13, 14}, what is the probability of a Type I error?

Answer: 0.00404. This is equal to the probability of 10 or more correct identifications if the subject has no taste discrimination ability (and p = 1 / 3).

1 - pbinom(9, 14, 1/3)

## [1] 0.004039541

sum(dbinom(10 : 14, 14, 1/3))

## [1] 0.004039541

4. With a rejection region of RR 10 = {10, 11, 12, 13, 14}, what is the probability of a Type II error, if the subject has a probability of p = 0.80 of identifying the odd sample?

Answer: 0.1298. This is the probability of 9 or fewer correct identifications if the subject is able to identify the odd sample with 0.80 probability.

pbinom(9, 14, 0.80)

## [1] 0.1298396

sum(dbinom(0 : 9, 14, 0.80))

5. Please answer the following questions about this triangle taste test.

(a) Which of the rejection regions should we prefer: RR 8 or RR 10 ? Why?

Answer: We would prefer RR 10 since the probability of a Type I error is considerably lower.

(b) In general, a hypothesis test can result in two different types of errors. Describe those two errors in this case where we are attempting to identify someone with a high degree of taste-discrimination ability. Which is more serious?

Answer: In the case of the triangle taste test, a Type I error occurs when we reject the null hypothesis when it is true; that is, when we conclude that someone with no taste-discrimination ability actually has it. A Type II error occurs when we do not reject the null hypothesis when it is false; in other words, when we conclude that someone who has the ability does not have it. The consequences of committing a Type I error are more serious: the person hired as a taster is less likely to discern problems with the product. The consequences of committing a Type II error are less serious, since they mean only that we fail to identify a subject who has taste-discrimination ability and must continue interviewing and testing until the next qualified person applies for the position. Whereas the brewery can withstand a Type II error, it might not survive a Type I error.

(c) We saw in Chapter 9 that α and β are usually in a trade-off relationship. That is, if we select one of two possible rejection regions, such as either RR 8 or RR 10, we can reduce α only if we are willing to have a higher β. Can you think of anything we might do to reduce both α and β simultaneously? What would that be?

Answer: It is possible to reduce both α and β simultaneously by increasing the sample size n. In this case, it would involve increasing the number of trials.
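The four error probabilities in exercises 1 through 4 can also be cross-checked outside R. The following is a minimal Python sketch using only the standard library; the helper names are hypothetical, and the defaults (14 trials, p = 1/3 under the null hypothesis, p = 0.80 under the alternative) are the values given in the exercises.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X binomial(n, p); plays the role of pbinom(k, n, p) in R."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def error_rates(cutoff, n=14, p_null=1/3, p_alt=0.80):
    """Return (alpha, beta) for the rejection region {cutoff, ..., n}.

    alpha is the probability of cutoff or more correct identifications
    when p = p_null; beta is the probability of fewer than cutoff
    correct identifications when p = p_alt.
    """
    alpha = 1 - binom_cdf(cutoff - 1, n, p_null)
    beta = binom_cdf(cutoff - 1, n, p_alt)
    return alpha, beta

for cutoff in (8, 10):
    alpha, beta = error_rates(cutoff)
    print(f"RR {cutoff}: alpha = {alpha:.5f}, beta = {beta:.5f}")
```

Calling `error_rates` with a larger `n` and a correspondingly larger `cutoff` is one way to explore the answer to part (c) numerically.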

Chapter 9: Hypothesis Testing with Two Samples

Chapter 9 Homework

9.2 Homework

DIRECTIONS: For each of the word problems, use a solution sheet to do the hypothesis test. The solution sheet is found in Appendix E. Please feel free to make copies of the solution sheets. For the online version of the book, it is suggested that you copy the .doc or the .pdf files.

If you are using a Student’s t-distribution for a homework problem in what follows, including for paired data, you may assume that the underlying population is normally distributed. (When using these tests in a real situation, you must first prove that assumption, however.)

The mean number of English courses taken in a two–year time period by male and female college students is believed to be about the same. An experiment is conducted and data are collected from 29 males and 16 females. The males took an average of three English courses with a standard deviation of 0.8. The females took an average of four English courses with a standard deviation of 1.0. Are the means statistically the same?

A student at a four-year college claims that mean enrollment at four–year colleges is higher than at two–year colleges in the United States. Two surveys are conducted. Of the 35 two–year colleges surveyed, the mean enrollment was 5,068 with a standard deviation of 4,777. Of the 35 four-year colleges surveyed, the mean enrollment was 5,466 with a standard deviation of 8,191.

Subscripts: 1: two-year colleges; 2: four-year colleges

  • H 0 : μ 1 ≥ μ 2
  • H a : μ 1 < μ 2
  • [latex]{\overline{X}}_{1}-{\overline{X}}_{2}[/latex] is the difference between the mean enrollments of the two-year colleges and the four-year colleges.
  • Student’s t
  • test statistic: -0.2480
  • p -value: 0.4019
  • Alpha: 0.05
  • Decision: Do not reject the null hypothesis.
  • Reason for Decision: p -value > alpha
  • Conclusion: At the 5% significance level, there is insufficient evidence to conclude that the mean enrollment at four-year colleges is higher than at two-year colleges.
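The test statistic in this solution can be reproduced from the summary statistics alone. The sketch below assumes the unpooled (Welch) form of the two-sample statistic with the Welch-Satterthwaite approximation for the degrees of freedom; the solution does not say which form was used, so treat that choice, and the function name, as assumptions.

```python
from math import sqrt

def welch_from_summary(mean1, sd1, n1, mean2, sd2, n2):
    """Unpooled two-sample t statistic and approximate degrees of freedom."""
    v1 = sd1**2 / n1   # estimated variance of the first sample mean
    v2 = sd2**2 / n2   # estimated variance of the second sample mean
    t_stat = (mean1 - mean2) / sqrt(v1 + v2)
    df = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t_stat, df

# Group 1: two-year colleges; group 2: four-year colleges.
t_stat, df = welch_from_summary(5068, 4777, 35, 5466, 8191, 35)
print(round(t_stat, 3), round(df, 1))
```

The statistic agrees with the listed value to about three decimal places. Turning it into a p-value needs the Student's t distribution function, which is not in the Python standard library; a statistics package such as SciPy would be one option.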

At Rachel’s 11th birthday party, eight girls were timed to see how long (in seconds) they could hold their breath in a relaxed position. After a two-minute rest, they timed themselves while jumping. The girls thought that the mean difference between their jumping and relaxed times would be zero. Test their hypothesis.

Mean entry-level salaries for college graduates with mechanical engineering degrees and electrical engineering degrees are believed to be approximately the same. A recruiting office thinks that the mean mechanical engineering salary is actually lower than the mean electrical engineering salary. The recruiting office randomly surveys 50 entry-level mechanical engineers and 60 entry-level electrical engineers. Their mean salaries were $46,100 and $46,700, respectively. Their standard deviations were $3,450 and $4,210, respectively. Conduct a hypothesis test to determine if you agree that the mean entry-level mechanical engineering salary is lower than the mean entry-level electrical engineering salary.

Subscripts: 1: mechanical engineering; 2: electrical engineering

  • H 0 : µ 1 ≥ µ 2
  • H a : µ 1 < µ 2
  • [latex]{\overline{X}}_{1}-{\overline{X}}_{2}[/latex] is the difference between the mean entry level salaries of mechanical engineers and electrical engineers.
  • test statistic: t = –0.82
  • p -value: 0.2061
  • Decision: Do not reject the null hypothesis.
  • Conclusion: At the 5% significance level, there is insufficient evidence to conclude that the mean entry-level salary of mechanical engineers is lower than that of electrical engineers.

Marketing companies have collected data implying that teenage girls use more ringtones on their cellular phones than teenage boys do. In one particular study of 40 randomly chosen teenage girls and boys (20 of each) with cellular phones, the mean number of ringtones for the girls was 3.2 with a standard deviation of 1.5. The mean for the boys was 1.7 with a standard deviation of 0.8. Conduct a hypothesis test to determine if the means are approximately the same or if the girls’ mean is higher than the boys’ mean.

Use the information from [link] to answer the next four exercises.

Using the data from Lap 1 only, conduct a hypothesis test to determine if the mean time for completing a lap in races is the same as it is in practices.

  • H 0 : µ 1 = µ 2
  • H a : µ 1 ≠ µ 2
  • [latex]{\overline{X}}_{1}-{\overline{X}}_{2}[/latex] is the difference between the mean times for completing a lap in races and in practices.
  • test statistic: –4.70
  • p -value: 0.0001
  • Decision: Reject the null hypothesis.
  • Reason for Decision: p -value < alpha
  • Conclusion: At the 5% significance level, there is sufficient evidence to conclude that the mean time for completing a lap in races is different from that in practices.

Repeat the test in Exercise 10.83, but use Lap 5 data this time.

Repeat the test in Exercise 10.83, but this time combine the data from Laps 1 and 5.

  • [latex]{\overline{X}}_{1}-{\overline{X}}_{2}[/latex] is the difference between the mean times for completing a lap in races and in practices.
  • test statistic: –5.08
  • p -value: zero

In two to three complete sentences, explain in detail how you might use Terri Vogel’s data to answer the following question. “Does Terri Vogel drive faster in races than she does in practices?”

Use the following information to answer the next two exercises. The Eastern and Western Major League Soccer conferences have a new Reserve Division that allows new players to develop their skills. Data for a randomly picked date showed the following annual goals.

Conduct a hypothesis test to answer the next two exercises.

The exact distribution for the hypothesis test is:

  • the normal distribution
  • the Student’s t -distribution
  • the uniform distribution
  • the exponential distribution

If the level of significance is 0.05, the conclusion is:

  • There is sufficient evidence to conclude that the W Division teams score fewer goals, on average, than the E teams.
  • There is insufficient evidence to conclude that the W Division teams score more goals, on average, than the E teams.
  • There is insufficient evidence to conclude that the W teams score fewer goals, on average, than the E teams score.
  • Unable to determine

Suppose a statistics instructor believes that there is no significant difference between the mean class scores of statistics day students on Exam 2 and statistics night students on Exam 2. She takes random samples from each of the populations. The mean and standard deviation for 35 statistics day students were 75.86 and 16.91. The mean and standard deviation for 37 statistics night students were 75.41 and 19.73. The “day” subscript refers to the statistics day students. The “night” subscript refers to the statistics night students. A concluding statement is:

  • There is sufficient evidence to conclude that statistics night students’ mean on Exam 2 is better than the statistics day students’ mean on Exam 2.
  • There is insufficient evidence to conclude that the statistics day students’ mean on Exam 2 is better than the statistics night students’ mean on Exam 2.
  • There is insufficient evidence to conclude that there is a significant difference between the means of the statistics day students and night students on Exam 2.
  • There is sufficient evidence to conclude that there is a significant difference between the means of the statistics day students and night students on Exam 2.

Researchers interviewed street prostitutes in Canada and the United States. The mean age of the 100 Canadian prostitutes upon entering prostitution was 18 with a standard deviation of six. The mean age of the 130 United States prostitutes upon entering prostitution was 20 with a standard deviation of eight. Is the mean age of entering prostitution in Canada lower than the mean age in the United States? Test at a 1% significance level.

Test: two independent sample means, population standard deviations unknown.

Random variable: [latex]{\overline{X}}_{1}-{\overline{X}}_{2}[/latex]

Distribution:

  • H 0 : μ 1 = μ 2
  • H a : μ 1 < μ 2

The mean age of entering prostitution in Canada is lower than the mean age in the United States.

This is a normal distribution curve with mean equal to zero. A vertical line near the tail of the curve to the left of zero extends from the axis to the curve. The region under the curve to the left of the line is shaded representing p-value = 0.0157.

Graph: left-tailed

p -value : 0.0151

Decision: Do not reject H 0 .

Conclusion: At the 1% level of significance, from the sample data, there is not sufficient evidence to conclude that the mean age of entering prostitution in Canada is lower than the mean age in the United States.

A powder diet is tested on 49 people, and a liquid diet is tested on 36 different people. Of interest is whether the liquid diet yields a higher mean weight loss than the powder diet. The powder diet group had a mean weight loss of 42 pounds with a standard deviation of 12 pounds. The liquid diet group had a mean weight loss of 45 pounds with a standard deviation of 14 pounds.

Suppose a statistics instructor believes that there is no significant difference between the mean class scores of statistics day students on Exam 2 and statistics night students on Exam 2. She takes random samples from each of the populations. The mean and standard deviation for 35 statistics day students were 75.86 and 16.91, respectively. The mean and standard deviation for 37 statistics night students were 75.41 and 19.73. The “day” subscript refers to the statistics day students. The “night” subscript refers to the statistics night students. An appropriate alternative hypothesis for the hypothesis test is:

  • μ day > μ night
  • μ day < μ night
  • μ day = μ night
  • μ day ≠ μ night

DIRECTIONS: For each of the word problems, use a solution sheet to do the hypothesis test. The solution sheet is found in [link] . Please feel free to make copies of the solution sheets. For the online version of the book, it is suggested that you copy the .doc or the .pdf files.

If you are using a Student’s t -distribution for one of the following homework problems, including for paired data, you may assume that the underlying population is normally distributed. (When using these tests in a real situation, you must first prove that assumption, however.)

A study is done to determine if students in the California state university system take longer to graduate, on average, than students enrolled in private universities. One hundred students from both the California state university system and private universities are surveyed. Suppose that from years of research, it is known that the population standard deviations are 1.5811 years and 1 year, respectively. The following data are collected. The California state university system students took on average 4.5 years with a standard deviation of 0.8. The private university students took on average 4.1 years with a standard deviation of 0.3.

Parents of teenage boys often complain that auto insurance costs more, on average, for teenage boys than for teenage girls. A group of concerned parents examines a random sample of insurance bills. The mean annual cost for 36 teenage boys was $679. For 23 teenage girls, it was $559. From past years, it is known that the population standard deviation for each group is $180. Determine whether or not you believe that the mean cost for auto insurance for teenage boys is greater than that for teenage girls.

Subscripts: 1 = boys, 2 = girls

  • H 0 : µ 1 ≤ µ 2
  • H a : µ 1 > µ 2
  • The random variable is the difference in the mean auto insurance costs for boys and girls.
  • test statistic: z = 2.50
  • p -value: 0.0062
  • Check student’s solution.
  • Conclusion: At the 5% significance level, there is sufficient evidence to conclude that the mean cost of auto insurance for teenage boys is greater than that for girls.
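Because the population standard deviations are treated as known here, the statistic is a two-sample z statistic and its right-tailed p-value comes from the standard normal distribution. The following is a minimal sketch of that calculation; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean1, mean2, sigma1, sigma2, n1, n2):
    """z statistic for a difference of two means with known standard deviations."""
    standard_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    return (mean1 - mean2) / standard_error

# Group 1: boys (n = 36); group 2: girls (n = 23); sigma = 180 for both.
z = two_sample_z(679, 559, 180, 180, 36, 23)

# Right-tailed test, because H a is mu 1 > mu 2.
p_value = 1 - NormalDist().cdf(z)
print(round(z, 2), p_value)
```

The statistic rounds to the listed 2.50. The p-value computed from the unrounded statistic may differ from the listed 0.0062 in the fourth decimal place, which does not affect the decision at the 5% level.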

A group of transfer-bound students wondered if they will spend the same mean amount on texts and supplies each year at their four-year university as they have at their community college. They conducted a random survey of 54 students at their community college and 66 students at their local four-year university. The sample means were $947 and $1,011, respectively. The population standard deviations are known to be $254 and $87, respectively. Conduct a hypothesis test to determine if the means are statistically the same.

Some manufacturers claim that non-hybrid sedan cars have a lower mean miles-per-gallon (mpg) than hybrid ones. Suppose that consumers test 21 hybrid sedans and get a mean of 31 mpg with a standard deviation of seven mpg. Thirty-one non-hybrid sedans get a mean of 22 mpg with a standard deviation of four mpg. Suppose that the population standard deviations are known to be six and three, respectively. Conduct a hypothesis test to evaluate the manufacturers’ claim.

Subscripts: 1 = non-hybrid sedans, 2 = hybrid sedans

  • The random variable is the difference in the mean miles per gallon of non-hybrid sedans and hybrid sedans.
  • test statistic: –6.36
  • p -value: 0
  • Reason for decision: p -value < alpha
  • Conclusion: At the 5% significance level, there is sufficient evidence to conclude that the mean miles per gallon of non-hybrid sedans is less than that of hybrid sedans.

A baseball fan wanted to know if there is a difference between the number of games played in a World Series when the American League won the series versus when the National League won the series. From 1922 to 2012, the population standard deviation of games won by the American League was 1.14, and the population standard deviation of games won by the National League was 1.11. Of 19 randomly selected World Series games won by the American League, the mean number of games won was 5.76. The mean number of 17 randomly selected games won by the National League was 5.42. Conduct a hypothesis test.

One of the questions in a study of marital satisfaction of dual-career couples was to rate the statement “I’m pleased with the way we divide the responsibilities for childcare.” The ratings went from one (strongly agree) to five (strongly disagree). [link] contains ten of the paired responses for husbands and wives. Conduct a hypothesis test to see if the mean difference in the husband’s versus the wife’s satisfaction level is negative (meaning that, within the partnership, the husband is happier than the wife).

  • H 0 : µ d = 0
  • H a : µ d < 0
  • The random variable X d is the average difference between husband’s and wife’s satisfaction level.
  • test statistic: t = –1.86
  • p -value: 0.0479
  • Check student’s solution
  • Decision: Reject the null hypothesis, but run another test.
  • Conclusion: This is a weak test because alpha and the p -value are close. However, at the 5% significance level there is sufficient evidence to conclude that the mean difference is negative.

9.3 Homework

If you are using a Student’s t -distribution for one of the following homework problems, including for paired data, you may assume that the underlying population is normally distributed. (In general, you must first prove that assumption, however.)

A recent drug survey showed an increase in the use of drugs and alcohol among local high school seniors as compared to the national percent. Suppose that a survey of 100 local seniors and 100 national seniors is conducted to see if the proportion of drug and alcohol use is higher locally than nationally. Locally, 65 seniors reported using drugs or alcohol within the past month, while 60 national seniors reported using them.

We are interested in whether the proportions of female suicide victims aged 15 to 24 are the same for white and black females in the United States. We randomly pick one year, 1992, to compare the two groups. The number of suicides estimated in the United States in 1992 for white females is 4,930, of whom 580 were aged 15 to 24. The estimate for black females is 330, of whom 40 were aged 15 to 24. We will let female suicide victims be our population.

  • H 0 : P W = P B
  • H a : P W ≠ P B
  • The random variable is the difference in the proportions of white and black suicide victims, aged 15 to 24.
  • normal for two proportions
  • test statistic: –0.1944
  • p -value: 0.8458
  • Reason for decision: p -value > alpha
  • Conclusion: At the 5% significance level, there is insufficient evidence to conclude that the proportions of white and black female suicide victims, aged 15 to 24, are different.
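The listed statistic and p-value are consistent with the usual pooled two-proportion z test, sketched below. The pooled standard error is an assumption about how the solution was computed, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic for H 0: p 1 = p 2, from success counts and sample sizes."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)   # common proportion estimated under H 0
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error

# Group 1: white females (580 of 4,930); group 2: black females (40 of 330).
z = two_proportion_z(580, 4930, 40, 330)

# Two-tailed test, because H a is P W not equal to P B.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(round(z, 4), round(p_value, 4))
```

The same function applies to the other two-proportion exercises in this section; for a one-tailed alternative, only the p-value line changes.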

Elizabeth Mjelde, an art history professor, was interested in whether the value from the Golden Ratio formula, [latex]\left(\frac{\text{larger + smaller dimension}}{\text{larger dimension}}\right)[/latex] was the same in the Whitney Exhibit for works from 1900 to 1919 as for works from 1920 to 1942. Thirty-seven early works were sampled, averaging 1.74 with a standard deviation of 0.11. Sixty-five of the later works were sampled, averaging 1.746 with a standard deviation of 0.1064. Do you think that there is a significant difference in the Golden Ratio calculation?

A recent year was randomly picked from 1985 to the present. In that year, there were 2,051 Hispanic students at Cabrillo College out of a total of 12,328 students. At Lake Tahoe College, there were 321 Hispanic students out of a total of 2,441 students. In general, do you think that the percent of Hispanic students at the two colleges is basically the same or different?

Subscripts: 1 = Cabrillo College, 2 = Lake Tahoe College

  • H 0 : p 1 = p 2
  • H a : p 1 ≠ p 2
  • The random variable is the difference between the proportions of Hispanic students at Cabrillo College and Lake Tahoe College.
  • test statistic: 4.29
  • p -value: 0.00002
  • Conclusion: There is sufficient evidence to conclude that the proportions of Hispanic students at Cabrillo College and Lake Tahoe College are different.

Use the following information to answer the next three exercises. Neuroinvasive West Nile virus is a severe disease that affects a person’s nervous system. It is spread by the Culex species of mosquito. In the United States in 2010 there were 629 reported cases of neuroinvasive West Nile virus out of a total of 1,021 reported cases, and in 2011 there were 486 neuroinvasive reported cases out of a total of 712 reported cases. Is the 2011 proportion of neuroinvasive West Nile virus cases more than the 2010 proportion of neuroinvasive West Nile virus cases? Using a 1% level of significance, conduct an appropriate hypothesis test.

The appropriate test is:

  • a test of two proportions
  • a test of two independent means
  • a test of a single mean
  • a test of matched pairs.

An appropriate null hypothesis is:

  • p 2011 ≤ p 2010
  • p 2011 ≥ p 2010
  • μ 2011 ≤ μ 2010
  • p 2011 > p 2010

The p -value is 0.0022. At a 1% level of significance, the appropriate conclusion is

  • There is sufficient evidence to conclude that the proportion of people in the United States in 2011 who contracted neuroinvasive West Nile disease is less than the proportion of people in the United States in 2010 who contracted neuroinvasive West Nile disease.
  • There is insufficient evidence to conclude that the proportion of people in the United States in 2011 who contracted neuroinvasive West Nile disease is more than the proportion of people in the United States in 2010 who contracted neuroinvasive West Nile disease.
  • There is insufficient evidence to conclude that the proportion of people in the United States in 2011 who contracted neuroinvasive West Nile disease is less than the proportion of people in the United States in 2010 who contracted neuroinvasive West Nile disease.
  • There is sufficient evidence to conclude that the proportion of people in the United States in 2011 who contracted neuroinvasive West Nile disease is more than the proportion of people in the United States in 2010 who contracted neuroinvasive West Nile disease.

Researchers conducted a study to find out if there is a difference in the use of eReaders by different age groups. Randomly selected participants were divided into two age groups. In the 16- to 29-year-old group, 7% of the 628 surveyed use eReaders, while 11% of the 2,309 participants 30 years old and older use eReaders.

Test: two independent sample proportions.

Random variable: p ′ 1 – p ′ 2

Distribution:

The proportion of eReader users is different for the 16- to 29-year-old users from that of the 30 and older users.

Graph: two-tailed

This is a normal distribution curve with mean equal to zero. Both the right and left tails of the curve are shaded. Each tail represents 1/2(p-value) = 0.0017.

p -value : 0.0033

Conclusion: At the 5% level of significance, from the sample data, there is sufficient evidence to conclude that the proportion of eReader users 16 to 29 years old is different from the proportion of eReader users 30 and older.

Adults aged 18 years old and older were randomly selected for a survey on obesity. Adults are considered obese if their body mass index (BMI) is at least 30. The researchers wanted to determine if the proportion of women who are obese in the south is less than the proportion of southern men who are obese. The results are shown in [link] . Test at the 1% level of significance.

Two computer users were discussing tablet computers. A higher proportion of people ages 16 to 29 use tablets than the proportion of people age 30 and older. [link] details the number of tablet owners for each age group. Test at the 1% level of significance.

Test: two independent sample proportions

Random variable: p′ 1 − p′ 2

  • H a : p 1 > p 2

A higher proportion of tablet owners are aged 16 to 29 years old than are 30 years old and older.

Graph: right-tailed

This is a normal distribution curve with mean equal to zero. A vertical line near the tail of the curve to the right of zero extends from the axis to the curve. The region under the curve to the right of the line is shaded representing p-value = 0.2354.

p -value: 0.2354

Decision: Do not reject the H 0 .

Conclusion: At the 1% level of significance, from the sample data, there is not sufficient evidence to conclude that a higher proportion of tablet owners are aged 16 to 29 years old than are 30 years old and older.

A group of friends debated whether more men use smartphones than women. They consulted a research study of smartphone use among adults. The results of the survey indicate that of the 973 men randomly sampled, 379 use smartphones. For women, 404 of the 1,304 who were randomly sampled use smartphones. Test at the 5% level of significance.

While her husband spent 2½ hours picking out new speakers, a statistician decided to determine whether the percentage of men who enjoy shopping for electronic equipment is higher than the percentage of women who enjoy shopping for electronic equipment. The population was Saturday afternoon shoppers. Out of 67 men, 24 said they enjoyed the activity. Eight of the 24 women surveyed claimed to enjoy the activity. Interpret the results of the survey.

Subscripts: 1: men; 2: women

  • H 0 : p 1 ≤ p 2
  • [latex]{{P}^{\prime }}_{1}-{{P}^{\prime }}_{2}[/latex] is the difference between the proportions of men and women who enjoy shopping for electronic equipment.
  • test statistic: 0.22
  • p -value: 0.4133
  • Conclusion: At the 5% significance level, there is insufficient evidence to conclude that the proportion of men who enjoy shopping for electronic equipment is more than the proportion of women.
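The test statistic and p-value in this solution can be retraced with a short script. This is a minimal sketch, assuming the pooled two-proportion z-test is the intended method; the helper name `two_proportion_z` is hypothetical and is not part of the textbook.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic for H0: p1 = p2 (hypothetical helper name)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)          # common proportion assumed under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


# Subscript 1: men (24 of 67); subscript 2: women (8 of 24).
z = two_proportion_z(24, 67, 8, 24)
p_value = 1 - NormalDist().cdf(z)           # right-tailed, because Ha: p1 > p2

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

The printed values should agree, after rounding, with the test statistic 0.22 and the p-value 0.4133 listed above. Because the p-value is far larger than 0.05, the null hypothesis is not rejected.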

We are interested in whether children’s educational computer software costs less, on average, than children’s entertainment software. Thirty-six educational software titles were randomly picked from a catalog. The mean cost was $31.14 with a standard deviation of $4.69. Thirty-five entertainment software titles were randomly picked from the same catalog. The mean cost was $33.86 with a standard deviation of $10.87. Decide whether children’s educational software costs less, on average, than children’s entertainment software.
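One way to start this exercise is to compute the test statistic from the summary numbers. The sketch below assumes two independent samples with unequal, unknown population variances (the Welch form of the two-sample t-test). The exercise does not name that choice, so treat it as an assumption; the variable names are illustrative.

```python
from math import sqrt

# Summary statistics from the exercise (subscript 1: educational, 2: entertainment).
n1, mean1, sd1 = 36, 31.14, 4.69
n2, mean2, sd2 = 35, 33.86, 10.87

v1 = sd1 ** 2 / n1          # estimated variance of the first sample mean
v2 = sd2 ** 2 / n2          # estimated variance of the second sample mean

t_stat = (mean1 - mean2) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom.
df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

A negative statistic points in the direction of the alternative that educational software costs less. The p-value is then the left-tail area of a Student's t distribution with the computed degrees of freedom, which can be read from a table or a calculator.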

Joan Nguyen recently claimed that the proportion of college-age males with at least one pierced ear is as high as the proportion of college-age females. She conducted a survey in her classes. Out of 107 males, 20 had at least one pierced ear. Out of 92 females, 47 had at least one pierced ear. Do you believe that the proportion of males has reached the proportion of females?

  • [latex]{{P}^{\prime }}_{1}-{{P}^{\prime }}_{2}[/latex] is the difference between the proportions of men and women that have at least one pierced ear.
  • test statistic: –4.82
  • Conclusion: At the 5% significance level, there is sufficient evidence to conclude that the proportions of males and females with at least one pierced ear are different.
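The reported test statistic can be retraced by hand. The worked calculation below assumes subscript 1 denotes males and subscript 2 denotes females, and that the pooled proportion (written here as [latex]{p}_{c}[/latex], a symbol introduced only for this sketch) is used in the standard error:

[latex]{p}_{c}=\frac{20+47}{107+92}=\frac{67}{199}\approx 0.3367[/latex]

[latex]z=\frac{\frac{20}{107}-\frac{47}{92}}{\sqrt{{p}_{c}\left(1-{p}_{c}\right)\left(\frac{1}{107}+\frac{1}{92}\right)}}\approx \frac{0.1869-0.5109}{0.0672}\approx -4.82[/latex]

A statistic this far from zero leaves very little area in the tails, which is consistent with rejecting the null hypothesis at the 5% level.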

Use the data sets found in [link] to answer this exercise. Is the proportion of race laps Terri completes slower than 130 seconds less than the proportion of practice laps she completes slower than 135 seconds?

“To Breakfast or Not to Breakfast?” by Richard Ayore

In American society, birthdays are one of those days that everyone looks forward to. People of different ages and peer groups gather to mark the 18th, 20th, …, birthdays. During this time, one looks back to see what he or she has achieved for the past year and also focuses ahead for more to come.

If, by any chance, I am invited to one of these parties, my experience is always different. Instead of dancing around with my friends while the music is booming, I get carried away by memories of my family back home in Kenya. I remember the good times I had with my brothers and sister while we did our daily routine.

Every morning, I remember we went to the shamba (garden) to weed our crops. I remember one day arguing with my brother as to why he always remained behind just to join us an hour later. In his defense, he said that he preferred waiting for breakfast before he came to weed. He said, “This is why I always work more hours than you guys!”

And so, to prove him wrong or right, we decided to give it a try. One day we went to work as usual without breakfast, and recorded the time we could work before getting tired and stopping. On the next day, we all ate breakfast before going to work. We recorded how long we worked again before getting tired and stopping. Of interest was our mean increase in work time. Though not sure, my brother insisted that it was more than two hours. Using the data in [link] , solve our problem.

  • H a : µ d > 0
  • The random variable [latex]{\overline{X}}_{d}[/latex] is the mean difference in work times on days when eating breakfast and on days when not eating breakfast.
  • test statistic: 4.8963

p -value: 0.0004

  • Conclusion: At the 5% level of significance, there is sufficient evidence to conclude that the mean difference in work times on days when eating breakfast and on days when not eating breakfast has increased.

9.4 Homework

If you are using a Student’s t -distribution for the homework problems, including for paired data, you may assume that the underlying population is normally distributed. (When using these tests in a real situation, you must first prove that assumption, however.)

Ten individuals went on a low–fat diet for 12 weeks to lower their cholesterol. The data are recorded in [link] . Do you think that their cholesterol levels were significantly lowered?

p -value = 0.1494

At the 5% significance level, there is insufficient evidence to conclude that the low-fat diet lowered cholesterol levels after 12 weeks.

Use the following information to answer the next two exercises. A new AIDS prevention drug was tried on a group of 224 HIV positive patients. Forty-five patients developed AIDS after four years. In a control group of 224 HIV positive patients, 68 developed AIDS after four years. We want to test whether the method of treatment reduces the proportion of patients that develop AIDS after four years or if the proportions of the treated group and the untreated group stay the same.

Let the subscript t = treated patient and ut = untreated patient.

The appropriate hypotheses are:

  • H 0 : p t < p ut and H a : p t ≥ p ut
  • H 0 : p t ≤ p ut and H a : p t > p ut
  • H 0 : p t = p ut and H a : p t ≠ p ut
  • H 0 : p t = p ut and H a : p t < p ut

If the p -value is 0.0062 what is the conclusion (use α = 0.05)?

  • The method has no effect.
  • There is sufficient evidence to conclude that the method reduces the proportion of HIV positive patients who develop AIDS after four years.
  • There is sufficient evidence to conclude that the method increases the proportion of HIV positive patients who develop AIDS after four years.
  • There is insufficient evidence to conclude that the method reduces the proportion of HIV positive patients who develop AIDS after four years.
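The p-value quoted in this exercise can be checked from the counts given. This is a sketch under two assumptions: the test is left-tailed (the treatment is expected to lower the proportion), and the pooled standard error is used. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Counts from the exercise: 45 of 224 treated and 68 of 224 untreated
# patients developed AIDS after four years.
x_t, n_t = 45, 224
x_ut, n_ut = 68, 224

pooled = (x_t + x_ut) / (n_t + n_ut)
se = sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_ut))
z = (x_t / n_t - x_ut / n_ut) / se

p_value = NormalDist().cdf(z)    # left-tail area under the standard normal curve
alpha = 0.05
reject_h0 = p_value < alpha      # decision rule: reject H0 when p-value < alpha

print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

Under these assumptions the p-value rounds to the 0.0062 stated in the exercise, which is smaller than α = 0.05.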

Use the following information to answer the next two exercises. An experiment is conducted to show that blood pressure can be consciously reduced in people trained in a “biofeedback exercise program.” Six subjects were randomly selected and blood pressure measurements were recorded before and after the training. The difference between blood pressures was calculated (after – before) producing the following results: [latex]{\overline{x}}_{d}[/latex] = −10.2, [latex]{s}_{d}[/latex] = 8.4. Using the data, test the hypothesis that the blood pressure has decreased after the training.

The distribution for the test is:

  • N (−10.2, 8.4)
  • N(−10.2, [latex]\frac{8.4}{\sqrt{6}}[/latex])

If α = 0.05, the p -value and the conclusion are

  • 0.0014; There is sufficient evidence to conclude that the blood pressure decreased after the training.
  • 0.0014; There is sufficient evidence to conclude that the blood pressure increased after the training.
  • 0.0155; There is sufficient evidence to conclude that the blood pressure decreased after the training.
  • 0.0155; There is sufficient evidence to conclude that the blood pressure increased after the training.
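As a check on the arithmetic, the test statistic for the six paired differences can be worked out as follows. The sketch assumes the paired t-test with [latex]n-1=5[/latex] degrees of freedom and a null value of zero for the mean difference:

[latex]t=\frac{{\overline{x}}_{d}-0}{{s}_{d}/\sqrt{n}}=\frac{-10.2}{8.4/\sqrt{6}}\approx \frac{-10.2}{3.429}\approx -2.97[/latex]

The left-tail area below −2.97 for a Student's t distribution with 5 degrees of freedom is roughly 0.0155, which is smaller than α = 0.05.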

A golf instructor is interested in determining if her new technique for improving players’ golf scores is effective. She takes four new students. She records their 18-hole scores before learning the technique and then after having taken her class. She conducts a hypothesis test. The data are as follows.

The correct decision is:

  • Reject H 0 .
  • Do not reject the H 0 .

A local cancer support group believes that the estimate for new female breast cancer cases in the south is higher in 2013 than in 2012. The group compared the estimates of new female breast cancer cases by southern state in 2012 and in 2013. The results are in [link] .

Test: two matched pairs or paired samples ( t -test)

Random variable: [latex]{\overline{X}}_{d}[/latex]

Distribution: t 12

H 0 : μ d = 0 H a : μ d > 0

The mean of the differences of new female breast cancer cases in the south between 2013 and 2012 is greater than zero. The estimate for new female breast cancer cases in the south is higher in 2013 than in 2012.

This is a normal distribution curve with mean equal to zero. A vertical line near the tail of the curve to the right of zero extends from the axis to the curve. The region under the curve to the right of the line is shaded representing p-value = 0.0004.

Decision: Reject H 0

Conclusion: At the 5% level of significance, from the sample data, there is sufficient evidence to conclude that there was a higher estimate of new female breast cancer cases in 2013 than in 2012.

A traveler wanted to know if the prices of hotels are different in the ten cities that he visits the most often. The list of the cities with the corresponding hotel prices for his two favorite hotel chains is in [link] . Test at the 1% level of significance.

A politician asked his staff to determine whether the underemployment rate in the northeast decreased from 2011 to 2012. The results are in [link] .

Test: matched or paired samples ( t -test)

Difference data: {–0.9, –3.7, –3.2, –0.5, 0.6, –1.9, –0.5, 0.2, 0.6, 0.4, 1.7, –2.4, 1.8}

Random Variable: [latex]{\overline{X}}_{d}[/latex]

Distribution: t 12

H 0 : μ d = 0 H a : μ d < 0

The mean of the differences of the rate of underemployment in the northeastern states between 2012 and 2011 is less than zero. The underemployment rate went down from 2011 to 2012.

Graph: left-tailed.

This is a normal distribution curve with mean equal to zero. A vertical line near the tail of the curve to the left of zero extends from the axis to the curve. The region under the curve to the left of the line is shaded representing p-value = 0.1207.

p -value: 0.1207

Conclusion: At the 5% level of significance, from the sample data, there is not sufficient evidence to conclude that there was a decrease in the underemployment rates of the northeastern states from 2011 to 2012.
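The listed difference data are enough to retrace this solution. The sketch below uses only the Python standard library, so the Student's t tail area is approximated by integrating the density numerically with Simpson's rule. The helpers `t_pdf` and `t_cdf` are hypothetical names written for this illustration, not functions from a statistics package.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev

# Difference data copied from the solution above.
diffs = [-0.9, -3.7, -3.2, -0.5, 0.6, -1.9, -0.5, 0.2, 0.6, 0.4, 1.7, -2.4, 1.8]


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] and the symmetry of the density."""
    a = abs(x)
    h = a / steps                             # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                      # P(0 <= T <= |x|)
    return 0.5 + area if x >= 0 else 0.5 - area


n = len(diffs)
t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))   # H0: mu_d = 0
p_value = t_cdf(t_stat, n - 1)                    # left-tailed, because Ha: mu_d < 0

print(f"t = {t_stat:.4f}, df = {n - 1}, p-value = {p_value:.4f}")
```

With 13 differences the test uses 12 degrees of freedom, and the computed p-value should agree with the 0.1207 reported above to within rounding. In practice a statistics package or calculator would supply the tail area directly; the numerical integration here only keeps the example self-contained.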

Bringing It Together

Use the following information to answer the next ten exercises. Indicate which of the following choices best identifies the hypothesis test.

  • independent group means, population standard deviations and/or variances known
  • independent group means, population standard deviations and/or variances unknown
  • matched or paired samples
  • single mean
  • two proportions
  • single proportion

A powder diet is tested on 49 people, and a liquid diet is tested on 36 different people. The population standard deviations are two pounds and three pounds, respectively. Of interest is whether the liquid diet yields a higher mean weight loss than the powder diet.

A new chocolate bar is taste-tested on consumers. Of interest is whether the proportion of children who like the new chocolate bar is greater than the proportion of adults who like it.

The mean number of English courses taken in a two–year time period by male and female college students is believed to be about the same. An experiment is conducted and data are collected from nine males and 16 females.

A football league reported that the mean number of touchdowns per game was five. A study is done to determine if the mean number of touchdowns has decreased.

A study is done to determine if students in the California state university system take longer to graduate than students enrolled in private universities. One hundred students from both the California state university system and private universities are surveyed. From years of research, it is known that the population standard deviations are 1.5811 years and one year, respectively.

According to a YWCA Rape Crisis Center newsletter, 75% of rape victims know their attackers. A study is done to verify this.

According to a recent study, U.S. companies have a mean maternity-leave of six weeks.

A recent drug survey showed an increase in use of drugs and alcohol among local high school students as compared to the national percent. Suppose that a survey of 100 local youths and 100 national youths is conducted to see if the proportion of drug and alcohol use is higher locally than nationally.

A new SAT study course is tested on 12 individuals. Pre-course and post-course scores are recorded. Of interest is the mean increase in SAT scores. The following data are collected:

University of Michigan researchers reported in the Journal of the National Cancer Institute that quitting smoking is especially beneficial for those under age 49. In this American Cancer Society study, the risk (probability) of dying of lung cancer was about the same as for those who had never smoked.

Lesley E. Tan investigated the relationship between left-handedness vs. right-handedness and motor competence in preschool children. Random samples of 41 left-handed preschool children and 41 right-handed preschool children were given several tests of motor skills to determine if there is evidence of a difference between the children based on this experiment. The experiment produced the means and standard deviations shown in [link] . Determine the appropriate test and best distribution to use for that test.

  • Two independent means, normal distribution
  • Two independent means, Student’s-t distribution
  • Matched or paired samples, Student’s-t distribution
  • Two population proportions, normal distribution

A golf instructor is interested in determining if her new technique for improving players’ golf scores is effective. She takes four (4) new students. She records their 18-hole scores before learning the technique and then after having taken her class. She conducts a hypothesis test. The data are in [link] .

  • a test of two independent means.
  • a test of two proportions.
  • a test of a single mean.
  • a test of a single proportion.

Introductory Statistics Copyright © 2024 by LOUIS: The Louisiana Library Network is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License , except where otherwise noted.
