Understanding Hypothesis Testing

Hypothesis testing involves formulating assumptions about population parameters based on sample statistics and rigorously evaluating these assumptions against empirical evidence. This article sheds light on the significance of hypothesis testing and the critical steps involved in the process.

What is Hypothesis Testing?

Hypothesis testing is a statistical method used to make decisions about a population using experimental or sample data. At its core is an assumption we make about a population parameter. The test evaluates two mutually exclusive statements about the population to determine which statement is better supported by the sample data.

Example: You claim that the average height in the class is 30, or that a particular boy is taller than a girl. These are assumptions we are making, and we need a statistical way to prove or disprove them; we need a mathematical basis for concluding whether what we are assuming is true.

Defining Hypotheses

  • Null hypothesis (H0): The default assumption that there is no effect or no difference, e.g., the population mean \mu equals a specified value \mu_0 (H0: \mu = \mu_0).
  • Alternative hypothesis (H1): The claim that contradicts the null hypothesis, e.g., H1: \mu \neq \mu_0.

Key Terms of Hypothesis Testing

  • Significance level (\alpha): The probability of rejecting the null hypothesis when it is actually true (a Type I error). Commonly chosen values are 0.05 and 0.01.

  • P-value: The P-value, or calculated probability, is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis (H0) is true. If the P-value is less than the chosen significance level, you reject the null hypothesis, i.e., you accept that the sample supports the alternative hypothesis.
  • Test Statistic: The test statistic is a numerical value calculated from sample data during a hypothesis test, used to determine whether to reject the null hypothesis. It is compared to a critical value or p-value to make decisions about the statistical significance of the observed results.
  • Critical value : The critical value in statistics is a threshold or cutoff point used to determine whether to reject the null hypothesis in a hypothesis test.
  • Degrees of freedom: Degrees of freedom reflect the number of independent pieces of information available when estimating a parameter. They are related to the sample size and determine the shape of the sampling distribution (for example, of the t-distribution).

Why do we use Hypothesis Testing?

Hypothesis testing is an important procedure in statistics. It evaluates two mutually exclusive statements about a population to determine which is better supported by sample data. When we say that findings are statistically significant, it is hypothesis testing that lets us say so.

One-Tailed and Two-Tailed Test

A one-tailed test focuses on one direction, either greater than or less than a specified value. We use a one-tailed test when there is a clear directional expectation based on prior knowledge or theory. The critical region is located on only one side of the distribution curve. If the sample falls into this critical region, the null hypothesis is rejected in favor of the alternative hypothesis.

One-Tailed Test

There are two types of one-tailed test:

  • Left-tailed test: The alternative hypothesis states that the parameter is less than the hypothesized value, e.g., H0: \mu \geq 50 versus H1: \mu < 50.
  • Right-tailed test: The alternative hypothesis states that the parameter is greater than the hypothesized value, e.g., H0: \mu \leq 50 versus H1: \mu > 50.

Two-Tailed Test

A two-tailed test considers both directions, greater than and less than a specified value. We use a two-tailed test when there is no specific directional expectation and we want to detect any significant difference, e.g., H0: \mu = 50 versus H1: \mu \neq 50.

What are Type 1 and Type 2 errors in Hypothesis Testing?

In hypothesis testing, Type I and Type II errors are two possible errors that researchers can make when drawing conclusions about a population based on a sample of data. These errors are associated with the decisions made regarding the null hypothesis and the alternative hypothesis.

  • Type I error (\alpha): Rejecting the null hypothesis when it is actually true. The probability of this error is the significance level \alpha.
  • Type II error (\beta): Failing to reject the null hypothesis when it is actually false. The probability of this error is denoted \beta.

How does Hypothesis Testing work?

Step 1 – Define Null and Alternative Hypotheses

We first identify the problem about which we want to make an assumption, and state the null hypothesis (H0) and the alternative hypothesis (H1) so that they contradict each other. For the tests below, the data are assumed to be (approximately) normally distributed.

Step 2 – Choose the Significance Level

Select the significance level (\alpha), commonly 0.05 or 0.01. It is the probability of rejecting a true null hypothesis (Type I error) that we are willing to accept.

Step 3 – Collect and Analyze Data

Gather relevant data through observation or experimentation. Analyze the data using appropriate statistical methods to obtain a test statistic.

Step 4 – Calculate the Test Statistic

In this step the data are evaluated and summarized into a score based on their characteristics. The choice of the test statistic depends on the type of hypothesis test being conducted.

There are various hypothesis tests, each appropriate for a different goal and data type. The test statistic could come from a Z-test, Chi-square test, T-test, and so on.

  • Z-test : Used when the population mean and standard deviation are known.
  • t-test : Used when the population standard deviation is unknown and/or the sample size is small.
  • Chi-square test : Used for categorical data, e.g., testing independence in contingency tables.
  • F-test : Often used in analysis of variance (ANOVA) to compare variances or test the equality of means across multiple groups.

In our running example the dataset is small, so the T-test is the more appropriate choice for testing our hypothesis.

T-statistic is a measure of the difference between the means of two groups relative to the variability within each group. It is calculated as the difference between the sample means divided by the standard error of the difference. It is also known as the t-value or t-score.

Step 5 – Comparing Test Statistic:

In this stage, we decide whether to accept or reject the null hypothesis. There are two ways to make this decision.

Method A: Using Critical Values

Comparing the test statistic and tabulated critical value we have,

  • If Test Statistic>Critical Value: Reject the null hypothesis.
  • If Test Statistic≤Critical Value: Fail to reject the null hypothesis.

Note: Critical values are predetermined threshold values used to make a decision in hypothesis testing. To determine them, we typically refer to a statistical distribution table, such as the normal distribution or t-distribution table, depending on the test statistic being used.

Method B: Using P-values

We can also come to a conclusion using the p-value:

  • If p \leq \alpha: Reject the null hypothesis.
  • If p > \alpha: Fail to reject the null hypothesis.

Note: The p-value is the probability of obtaining a test statistic as extreme as, or more extreme than, the one observed in the sample, assuming the null hypothesis is true. It is typically obtained from a statistical distribution table or software, such as the normal distribution or t-distribution table, depending on the test statistic being used.
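As an illustration, here is a minimal sketch of how the critical value and the two-tailed p-value could be obtained with scipy; the observed t-statistic and degrees of freedom are taken from the drug example later in this article:

```python
# Critical value and p-value for a two-tailed t-test using scipy.stats
from scipy import stats

alpha = 0.05      # significance level
df = 9            # degrees of freedom (n - 1 for a paired test with n = 10)
t_obs = -9.0      # observed t-statistic

t_crit = stats.t.ppf(1 - alpha / 2, df)   # tabulated critical value (≈ 2.26)
p_value = 2 * stats.t.sf(abs(t_obs), df)  # two-tailed p-value

print(f"reject H0 by critical value: {abs(t_obs) > t_crit}")
print(f"reject H0 by p-value: {p_value <= alpha}")
```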

Step 6 – Interpret the Results

At last, we can conclude our experiment using method A or B.

Calculating test statistic

To validate our hypothesis about a population parameter, we use statistical functions. For normally distributed data, we use the z-score (or t-score), the p-value, and the level of significance (alpha) to build evidence for or against our hypothesis.

1. Z-statistics:

Used when the population mean and standard deviation are known, the z-statistic is:

z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}

  • x̄ is the sample mean,
  • μ represents the population mean,
  • σ is the population standard deviation, and
  • n is the size of the sample.

2. T-Statistics

The t-test is typically used when the sample size is small (n < 30) and the population standard deviation is unknown.

The t-statistic is given by:

t = \frac{\bar{x} - \mu}{s/\sqrt{n}}

  • t = t-score,
  • x̄ = sample mean
  • μ = population mean,
  • s = standard deviation of the sample,
  • n = sample size

3. Chi-Square Test

The Chi-square test of independence is used for categorical data (non-normally distributed), arranged in a contingency table:

\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

  • i and j are the row and column indices respectively,
  • O_{ij} is the observed frequency in cell (i, j), and
  • E_{ij} is the expected frequency in cell (i, j) under the null hypothesis of independence.
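As a quick illustration, here is a minimal sketch of a chi-square test of independence with scipy; the contingency table is hypothetical:

```python
# Chi-square test of independence on a hypothetical 2x2 contingency table
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[20, 30],
                     [25, 25]])  # made-up observed frequencies

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, p-value = {p:.3f}, dof = {dof}")
print("expected frequencies under H0:\n", expected)
```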

Real-Life Hypothesis Testing Examples

Let's examine hypothesis testing using two real-life situations.

Case A: Does a New Drug Affect Blood Pressure?

Imagine a pharmaceutical company has developed a new drug that they believe can effectively lower blood pressure in patients with hypertension. Before bringing the drug to market, they need to conduct a study to assess its impact on blood pressure.

  • Before Treatment: 120, 122, 118, 130, 125, 128, 115, 121, 123, 119
  • After Treatment: 115, 120, 112, 128, 122, 125, 110, 117, 119, 114

Step 1: Define the Hypothesis

  • Null Hypothesis (H0): The new drug has no effect on blood pressure.
  • Alternate Hypothesis (H1): The new drug has an effect on blood pressure.

Step 2: Define the Significance Level

Let's set the significance level at 0.05, meaning we will reject the null hypothesis if the evidence suggests less than a 5% chance of observing these results due to random variation alone.

Step 3: Compute the Test Statistic

Using a paired T-test, we analyze the data to obtain a test statistic and a p-value.

The test statistic (e.g., T-statistic) is calculated based on the differences between blood pressure measurements before and after treatment.

t = \frac{m}{s/\sqrt{n}}

  • m = mean of the differences d_i = X_{after,i} − X_{before,i},
  • s = standard deviation of the differences d_i,
  • n = sample size.

Here, m = -3.9, s ≈ 1.37, and n = 10.

Plugging these into the paired t-test formula gives a T-statistic of approximately -9.

Step 4: Find the p-value

With a calculated t-statistic of -9 and degrees of freedom df = 9, you can find the p-value using statistical software or a t-distribution table.

Thus, p-value ≈ 8.54e-06.

Step 5: Result

  • If the p-value is less than or equal to 0.05, the researchers reject the null hypothesis.
  • If the p-value is greater than 0.05, they fail to reject the null hypothesis.

Conclusion: Since the p-value (≈ 8.54e-06) is less than the significance level (0.05), the researchers reject the null hypothesis. There is statistically significant evidence that the average blood pressure before and after treatment with the new drug is different.

Python Implementation of Hypothesis Testing

Let's carry out the hypothesis test in Python, testing whether the new drug affects blood pressure. For this example, we will use a paired T-test from the scipy.stats library.

SciPy is a scientific computing library in Python that is widely used for mathematical and statistical computations.

We will now implement our first real-life problem (Case A) in Python.
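A minimal sketch, using scipy.stats.ttest_rel on the before/after measurements listed earlier (the output formatting is illustrative):

```python
# Paired t-test for Case A: does the new drug affect blood pressure?
import numpy as np
from scipy import stats

before = np.array([120, 122, 118, 130, 125, 128, 115, 121, 123, 119])
after = np.array([115, 120, 112, 128, 122, 125, 110, 117, 119, 114])

t_stat, p_value = stats.ttest_rel(after, before)  # paired (dependent) t-test

alpha = 0.05
print(f"T-statistic: {t_stat:.2f}")  # ≈ -9.0
print(f"P-value: {p_value:.2e}")     # ≈ 8.54e-06
if p_value <= alpha:
    print("Reject the null hypothesis: the drug has an effect on blood pressure.")
else:
    print("Fail to reject the null hypothesis.")
```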

In the above example, given the T-statistic of approximately -9 and an extremely small p-value, the results indicate a strong case to reject the null hypothesis at a significance level of 0.05. 

  • The results suggest that the new drug, treatment, or intervention has a significant effect on lowering blood pressure.
  • The negative T-statistic indicates that the mean blood pressure after treatment is significantly lower than the assumed population mean before treatment.

Case B: Cholesterol Level in a Population

Data: A sample of 25 individuals is taken, and their cholesterol levels are measured.

Cholesterol Levels (mg/dL): 205, 198, 210, 190, 215, 205, 200, 192, 198, 205, 198, 202, 208, 200, 205, 198, 205, 210, 192, 205, 198, 205, 210, 192, 205.

Population Mean (μ): 200 mg/dL

Population Standard Deviation (σ): 5 mg/dL (given for this problem)

Step 1: Define the Hypothesis

  • Null Hypothesis (H0): The average cholesterol level in the population is 200 mg/dL.
  • Alternate Hypothesis (H1): The average cholesterol level in the population is different from 200 mg/dL.

Step 2: Define the Significance Level

As the direction of deviation is not given, we assume a two-tailed test at a significance level of 0.05. From the z-table, the critical values are approximately -1.96 and 1.96.

Step 3: Compute the Test Statistic

The sample mean of the 25 measurements is 202.04 mg/dL, so

z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{202.04 - 200}{5/\sqrt{25}} = 2.04

Step 4: Result

Since the absolute value of the test statistic (2.04) is greater than the critical value (1.96), we reject the null hypothesis and conclude that there is statistically significant evidence that the average cholesterol level in the population is different from 200 mg/dL.
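A minimal sketch of this z-test in Python, using the 25 measurements above and scipy's standard normal distribution:

```python
# One-sample z-test for Case B: is the mean cholesterol level 200 mg/dL?
import numpy as np
from scipy.stats import norm

levels = np.array([205, 198, 210, 190, 215, 205, 200, 192, 198, 205,
                   198, 202, 208, 200, 205, 198, 205, 210, 192, 205,
                   198, 205, 210, 192, 205])
mu0, sigma = 200, 5  # hypothesized mean and known population std dev

z = (levels.mean() - mu0) / (sigma / np.sqrt(len(levels)))  # ≈ 2.04
p_value = 2 * norm.sf(abs(z))                               # two-tailed p-value

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("Reject H0" if abs(z) > norm.ppf(0.975) else "Fail to reject H0")
```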

Limitations of Hypothesis Testing

  • Although a useful technique, hypothesis testing does not offer a comprehensive grasp of the topic being studied. Without fully reflecting the intricacy or whole context of the phenomena, it concentrates on certain hypotheses and statistical significance.
  • The accuracy of hypothesis testing results is contingent on the quality of available data and the appropriateness of statistical methods used. Inaccurate data or poorly formulated hypotheses can lead to incorrect conclusions.
  • Relying solely on hypothesis testing may cause analysts to overlook significant patterns or relationships in the data that are not captured by the specific hypotheses being tested. This limitation underscores the importance of complementing hypothesis testing with other analytical approaches.

Hypothesis testing stands as a cornerstone in statistical analysis, enabling data scientists to navigate uncertainties and draw credible inferences from sample data. By systematically defining null and alternative hypotheses, choosing significance levels, and leveraging statistical tests, researchers can assess the validity of their assumptions. The article also elucidates the critical distinction between Type I and Type II errors, providing a comprehensive understanding of the nuanced decision-making process inherent in hypothesis testing. The real-life example of testing a new drug’s effect on blood pressure using a paired T-test showcases the practical application of these principles, underscoring the importance of statistical rigor in data-driven decision-making.

Frequently Asked Questions (FAQs)

1. What are the 3 types of hypothesis tests?

There are three types of hypothesis tests: right-tailed, left-tailed, and two-tailed. Right-tailed tests assess if a parameter is greater, left-tailed if lesser. Two-tailed tests check for non-directional differences, greater or lesser.

2. What are the 4 components of hypothesis testing?

Null Hypothesis (H0): No effect or difference exists. Alternative Hypothesis (H1): An effect or difference exists. Significance Level (α): The risk of rejecting the null hypothesis when it is true (Type I error). Test Statistic: A numerical value representing the observed evidence against the null hypothesis.

3. What is hypothesis testing in ML?

Hypothesis testing in ML is a statistical method to evaluate the performance and validity of machine learning models. It tests specific hypotheses about model behavior, such as whether features influence predictions or whether a model generalizes well to unseen data.

4. What is the difference between Pytest and Hypothesis in Python?

Pytest is a general-purpose testing framework for Python code, while Hypothesis is a property-based testing framework for Python that generates test cases from specified properties of the code.

Hypothesis Testing – A Deep Dive into Hypothesis Testing, The Backbone of Statistical Inference

  • September 21, 2023

Explore the intricacies of hypothesis testing, a cornerstone of statistical analysis. Dive into methods, interpretations, and applications for making data-driven decisions.


In this blog post we will learn:

  • What is Hypothesis Testing?
  • Steps in Hypothesis Testing
      • 2.1. Set up Hypotheses: Null and Alternative
      • 2.2. Choose a Significance Level (α)
      • 2.3. Calculate a test statistic and P-Value
      • 2.4. Make a Decision
  • Example: Testing a new drug
  • Example in Python

1. What is Hypothesis Testing?

In simple terms, hypothesis testing is a method used to make decisions or inferences about population parameters based on sample data. Imagine being handed a dice and asked if it’s biased. By rolling it a few times and analyzing the outcomes, you’d be engaging in the essence of hypothesis testing.

Think of hypothesis testing as the scientific method of the statistics world. Suppose you hear claims like “This new drug works wonders!” or “Our new website design boosts sales.” How do you know if these statements hold water? Enter hypothesis testing.

2. Steps in Hypothesis Testing

  • Set up Hypotheses : Begin with a null hypothesis (H0) and an alternative hypothesis (Ha).
  • Choose a Significance Level (α) : Typically 0.05, this is the probability of rejecting the null hypothesis when it’s actually true. Think of it as the chance of accusing an innocent person.
  • Calculate Test statistic and P-Value : Gather evidence (data) and calculate a test statistic.
  • p-value : This is the probability of observing the data, given that the null hypothesis is true. A small p-value (typically ≤ 0.05) suggests the data is inconsistent with the null hypothesis.
  • Decision Rule : If the p-value is less than or equal to α, you reject the null hypothesis in favor of the alternative.

2.1. Set up Hypotheses: Null and Alternative

Before diving into testing, we must formulate hypotheses. The null hypothesis (H0) represents the default assumption, while the alternative hypothesis (H1) challenges it.

For instance, in drug testing, H0: “The new drug is no better than the existing one,” H1: “The new drug is superior.”

2.2. Choose a Significance Level (α)

You collect and analyze data to test the H0 and H1 hypotheses. Based on your analysis, you decide whether to reject the null hypothesis in favor of the alternative, or fail to reject it.

The significance level, often denoted by $α$, represents the probability of rejecting the null hypothesis when it is actually true.

In other words, it’s the risk you’re willing to take of making a Type I error (false positive).

Type I Error (False Positive) :

  • Symbolized by the Greek letter alpha (α).
  • Occurs when you incorrectly reject a true null hypothesis . In other words, you conclude that there is an effect or difference when, in reality, there isn’t.
  • The probability of making a Type I error is denoted by the significance level of a test. Commonly, tests are conducted at the 0.05 significance level , which means there’s a 5% chance of making a Type I error .
  • Commonly used significance levels are 0.01, 0.05, and 0.10, but the choice depends on the context of the study and the level of risk one is willing to accept.

Example : If a drug is not effective (truth), but a clinical trial incorrectly concludes that it is effective (based on the sample data), then a Type I error has occurred.

Type II Error (False Negative) :

  • Symbolized by the Greek letter beta (β).
  • Occurs when you fail to reject a false null hypothesis . This means you conclude there is no effect or difference when, in reality, there is.
  • The probability of making a Type II error is denoted by β. The power of a test (1 – β) represents the probability of correctly rejecting a false null hypothesis.

Example : If a drug is effective (truth), but a clinical trial incorrectly concludes that it is not effective (based on the sample data), then a Type II error has occurred.

Balancing the Errors :


In practice, there’s a trade-off between Type I and Type II errors. Reducing the risk of one typically increases the risk of the other. For example, if you want to decrease the probability of a Type I error (by setting a lower significance level), you might increase the probability of a Type II error unless you compensate by collecting more data or making other adjustments.

It’s essential to understand the consequences of both types of errors in any given context. In some situations, a Type I error might be more severe, while in others, a Type II error might be of greater concern. This understanding guides researchers in designing their experiments and choosing appropriate significance levels.

2.3. Calculate a test statistic and P-Value

Test statistic : A test statistic is a single number that helps us understand how far our sample data is from what we’d expect under the null hypothesis (the basic assumption we’re testing against). Generally, the larger the test statistic is in magnitude, the more evidence we have against the null hypothesis. It helps us decide whether the differences we observe in our data are due to random chance or reflect an actual effect.

P-value : The P-value tells us how likely we would be to get our observed results (or something more extreme) if the null hypothesis were true. It’s a value between 0 and 1.

  • A smaller P-value (typically below 0.05) means the observation is rare under the null hypothesis, so we might reject the null hypothesis.
  • A larger P-value suggests that what we observed could easily happen by random chance, so we might not reject the null hypothesis.

2.4. Make a Decision

Relationship between $α$ and P-Value

When conducting a hypothesis test, we first choose a significance level $α$. We then calculate the p-value from our sample data and the test statistic.

Finally, we compare the p-value to our chosen $α$:

  • If $p−value≤α$: We reject the null hypothesis in favor of the alternative hypothesis. The result is said to be statistically significant.
  • If $p−value>α$: We fail to reject the null hypothesis. There isn’t enough statistical evidence to support the alternative hypothesis.

3. Example: Testing a new drug

Imagine we are investigating whether a new drug relieves headaches faster than a placebo.

Setting Up the Experiment : You gather 100 people who suffer from headaches. Half of them (50 people) are given the new drug (let’s call this the ‘Drug Group’), and the other half are given a sugar pill, which doesn’t contain any medication.

  • Set up Hypotheses : Before starting, you make a prediction:
  • Null Hypothesis (H0): The new drug has no effect. Any difference in healing time between the two groups is just due to random chance.
  • Alternative Hypothesis (H1): The new drug does have an effect. The difference in healing time between the two groups is significant and not just by chance.

Calculate Test statistic and P-Value : After the experiment, you analyze the data. The “test statistic” is a number that helps you understand the difference between the two groups in terms of standard units.

For instance, let’s say:

  • The average healing time in the Drug Group is 2 hours.
  • The average healing time in the Placebo Group is 3 hours.

The test statistic helps you understand how significant this 1-hour difference is. If the groups are large and the spread of healing times in each group is small, then this difference might be significant. But if there’s a huge variation in healing times, the 1-hour difference might not be so special.

Imagine the P-value as answering this question: “If the new drug had NO real effect, what’s the probability that I’d see a difference as extreme (or more extreme) as the one I found, just by random chance?”

For instance:

  • P-value of 0.01 means there’s a 1% chance that the observed difference (or a more extreme difference) would occur if the drug had no effect. That’s pretty rare, so we might consider the drug effective.
  • P-value of 0.5 means there’s a 50% chance you’d see this difference just by chance. That’s pretty high, so we might not be convinced the drug is doing much.
  • If the P-value is less than ($α$) 0.05: the results are “statistically significant,” and they might reject the null hypothesis , believing the new drug has an effect.
  • If the P-value is greater than ($α$) 0.05: the results are not statistically significant, and they don’t reject the null hypothesis , remaining unsure if the drug has a genuine effect.

4. Example in Python

For simplicity, let’s say we’re using a t-test (common for comparing means). Let’s dive into Python:
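A minimal sketch under the setup above, using scipy.stats.ttest_ind; the healing times are randomly generated purely for illustration:

```python
# Independent two-sample t-test: Drug Group vs. Placebo Group
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
drug_group = rng.normal(loc=2.0, scale=0.5, size=50)     # mean ≈ 2 hours
placebo_group = rng.normal(loc=3.0, scale=0.5, size=50)  # mean ≈ 3 hours

alpha = 0.05
t_stat, p_value = stats.ttest_ind(drug_group, placebo_group)

print(f"t = {t_stat:.2f}, p-value = {p_value:.2e}")
if p_value < alpha:
    print("The results are statistically significant! The drug seems to have an effect!")
else:
    print("Looks like the drug isn't as miraculous as we thought.")
```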

Making a Decision : If the p-value < 0.05, we conclude: “The results are statistically significant! The drug seems to have an effect!” If not, we’d say, “Looks like the drug isn’t as miraculous as we thought.”

5. Conclusion

Hypothesis testing is an indispensable tool in data science, allowing us to make data-driven decisions with confidence. By understanding its principles, conducting tests properly, and considering real-world applications, you can harness the power of hypothesis testing to unlock valuable insights from your data.

Hypothesis Testing | A Step-by-Step Guide with Easy Examples

Published on November 8, 2019 by Rebecca Bevans . Revised on June 22, 2023.

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics . It is most often used by scientists to test specific predictions, called hypotheses, that arise from theories.

There are 5 main steps in hypothesis testing:

  • State your research hypothesis as a null hypothesis and alternate hypothesis (H0 and Ha or H1).
  • Collect data in a way designed to test the hypothesis.
  • Perform an appropriate statistical test .
  • Decide whether to reject or fail to reject your null hypothesis.
  • Present the findings in your results and discussion section.

Though the specific details might vary, the procedure you will use when testing a hypothesis will always follow some version of these steps.

Table of contents

  • Step 1: State your null and alternate hypothesis
  • Step 2: Collect data
  • Step 3: Perform a statistical test
  • Step 4: Decide whether to reject or fail to reject your null hypothesis
  • Step 5: Present your findings
  • Other interesting articles
  • Frequently asked questions about hypothesis testing

Step 1: State your null and alternate hypothesis

After developing your initial research hypothesis (the prediction that you want to investigate), it is important to restate it as a null (H0) and alternate (Ha) hypothesis so that you can test it mathematically.

The alternate hypothesis is usually your initial hypothesis that predicts a relationship between variables. The null hypothesis is a prediction of no relationship between the variables you are interested in.

  • H0: Men are, on average, not taller than women. Ha: Men are, on average, taller than women.


Step 2: Collect data

For a statistical test to be valid, it is important to perform sampling and collect data in a way that is designed to test your hypothesis. If your data are not representative, then you cannot make statistical inferences about the population you are interested in.

Step 3: Perform a statistical test

There are a variety of statistical tests available, but they are all based on the comparison of within-group variance (how spread out the data is within a category) versus between-group variance (how different the categories are from one another).

If the between-group variance is large enough that there is little or no overlap between groups, then your statistical test will reflect that by showing a low p -value . This means it is unlikely that the differences between these groups came about by chance.

Alternatively, if there is high within-group variance and low between-group variance, then your statistical test will reflect that with a high p -value. This means it is likely that any difference you measure between groups is due to chance.

Your choice of statistical test will be based on the type of variables and the level of measurement of your collected data .

In the height example, performing a t-test will give you:

  • an estimate of the difference in average height between the two groups, and
  • a p-value showing how likely you are to see this difference if the null hypothesis of no difference is true (see the sketch below).
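As an illustration only, here is a minimal sketch of such a test with scipy; the heights are invented for the example:

```python
# Two-sample t-test: are men, on average, taller than women?
import numpy as np
from scipy import stats

men = np.array([178.0, 182.5, 175.0, 180.1, 185.2, 177.3, 183.0, 179.4])
women = np.array([165.2, 170.1, 168.3, 172.0, 166.5, 169.9, 171.2, 167.0])

# One-sided test matching Ha: men are, on average, taller than women
t_stat, p_value = stats.ttest_ind(men, women, alternative="greater")
print(f"estimated difference: {men.mean() - women.mean():.1f} cm")
print(f"one-sided p-value: {p_value:.4f}")
```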

Step 4: Decide whether to reject or fail to reject your null hypothesis

Based on the outcome of your statistical test, you will have to decide whether to reject or fail to reject your null hypothesis.

In most cases you will use the p -value generated by your statistical test to guide your decision. And in most cases, your predetermined level of significance for rejecting the null hypothesis will be 0.05 – that is, when there is a less than 5% chance that you would see these results if the null hypothesis were true.

In some cases, researchers choose a more conservative level of significance, such as 0.01 (1%). This minimizes the risk of incorrectly rejecting the null hypothesis ( Type I error ).


Step 5: Present your findings

The results of hypothesis testing will be presented in the results and discussion sections of your research paper, dissertation or thesis.

In the results section you should give a brief summary of the data and a summary of the results of your statistical test (for example, the estimated difference between group means and associated p -value). In the discussion , you can discuss whether your initial hypothesis was supported by your results or not.

In the formal language of hypothesis testing, we talk about rejecting or failing to reject the null hypothesis. You will probably be asked to do this in your statistics assignments.

However, when presenting research results in academic papers we rarely talk this way. Instead, we go back to our alternate hypothesis (in this case, the hypothesis that men are on average taller than women) and state whether the result of our test did or did not support the alternate hypothesis.

If your null hypothesis was rejected, this result is interpreted as “supported the alternate hypothesis.”

These are superficial differences; you can see that they mean the same thing.

You might notice that we don’t say that we reject or fail to reject the alternate hypothesis . This is because hypothesis testing is not designed to prove or disprove anything. It is only designed to test whether a pattern we measure could have arisen spuriously, or by chance.

If we reject the null hypothesis based on our research (i.e., we find that it is unlikely that the pattern arose by chance), then we can say our test lends support to our hypothesis . But if the pattern does not pass our decision rule, meaning that it could have arisen by chance, then we say the test is inconsistent with our hypothesis .

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Normal distribution
  • Descriptive statistics
  • Measures of central tendency
  • Correlation coefficient

Methodology

  • Cluster sampling
  • Stratified sampling
  • Types of interviews
  • Cohort study
  • Thematic analysis

Research bias

  • Implicit bias
  • Cognitive bias
  • Survivorship bias
  • Availability heuristic
  • Nonresponse bias
  • Regression to the mean

Frequently asked questions about hypothesis testing

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is used by scientists to test specific predictions, called hypotheses, by calculating how likely it is that a pattern or relationship between variables could have arisen by chance.

A hypothesis states your predictions about what your research will find. It is a tentative answer to your research question that has not yet been tested. For some research projects, you might have to write several hypotheses that address different aspects of your research question.

A hypothesis is not just a guess — it should be based on existing theories and knowledge. It also has to be testable, which means you can support or refute it through scientific research methods (such as experiments, observations and statistical analysis of data).

Null and alternative hypotheses are used in statistical hypothesis testing . The null hypothesis of a test always predicts no effect or no relationship between variables, while the alternative hypothesis states your research prediction of an effect or relationship.

Cite this Scribbr article


Bevans, R. (2023, June 22). Hypothesis Testing | A Step-by-Step Guide with Easy Examples. Scribbr. Retrieved April 3, 2024, from https://www.scribbr.com/statistics/hypothesis-testing/


Hypothesis Testing Steps & Examples


What is Hypothesis Testing?

As per the definition from Oxford Languages, a hypothesis is a supposition or proposed explanation made on the basis of limited evidence as a starting point for further investigation. As per the Dictionary page on Hypothesis, a hypothesis means a proposition, or set of propositions, set forth as an explanation for the occurrence of some specified group of phenomena, either asserted merely as a provisional conjecture to guide investigation (working hypothesis) or accepted as highly probable in the light of established facts.

The hypothesis can be defined as a claim, either about a truth that already exists in the world or about something that needs to be established afresh. In simple words, another word for hypothesis is “claim”. Until the claim is proven to be true, it is called a hypothesis. Once the claim is proved, it becomes the new truth or new knowledge about the thing.

For example, let’s say a claim is made that students studying for more than 6 hours a day get more than 90% of marks in their examination. Right now, this is just a claim or hypothesis and not an established truth. For the claim to become the truth and see widespread adoption, it needs to be proved using evidence, i.e., data. In order to reject this claim, or fail to reject it, one needs to do some empirical analysis by gathering data samples and evaluating the claim.

The process of gathering data and evaluating the claims or hypotheses with the goal of rejecting them or failing to reject them is called hypothesis testing . Note the wording “failing to reject”: it means that we don’t have enough evidence to reject the claim, so until new evidence comes up, the claim can be considered the truth. There are different techniques to test a hypothesis in order to conclude whether it can be taken to represent the truth of the world.

One must note that hypothesis testing never constitutes proof that the hypothesis is absolutely true based on the observations. It only provides added support for considering the hypothesis as truth until new evidence against it is gathered. We can never be 100% sure about the truth of a hypothesis based on hypothesis testing alone.

Simply speaking, hypothesis testing is a framework that can be used to assess whether the claim or hypothesis made about a real-world/real-life event can be seen as the truth or otherwise, based on the given data (evidence).

Hypothesis Testing Examples

Before we get ahead and start understanding more details about hypotheses and hypothesis testing steps, let’s take a look at some real-world examples of how to think about hypotheses and hypothesis testing when dealing with real-world problems:

  • A business observing customer churn might form several competing hypotheses: customers are churning because they aren’t getting responses to their complaints or issues.
  • Customers are churning because other competitive services in the market provide the same services at a lower cost.
  • Customers are churning because other competitive services provide more services at the same cost.
  • It is claimed that a 500 gm sugar packet for a particular brand, say XYZA, contains sugar of less than 500 gm, say around 480gm.  Can this claim be taken as truth? How do we know that this claim is true? This is a hypothesis until proved.
  • A group of doctors claims that quitting smoking increases lifespan. Can this claim be taken as new truth? The hypothesis is that quitting smoking results in an increase in lifespan.
  • It is claimed that brisk walking for half an hour every day reverses diabetes. In order to accept this in your lifestyle, you may need evidence that supports this claim or hypothesis.
  • It is claimed that doing Pranayama yoga for 30 minutes a day can help in easing stress by 50%. This can be termed as hypothesis and would require testing / validation for it to be established as a truth and recommended for widespread adoption.
  • One common real-life example of hypothesis testing is election polling. In order to predict the outcome of an election, pollsters take a sample of the population and ask them who they plan to vote for. They then use hypothesis testing to assess whether their sample is representative of the population as a whole. If the results of the hypothesis test are significant, it means that the sample is representative and that the poll can be used to predict the outcome of the election. However, if the results are not significant, it means that the sample is not representative and that the poll should not be used to make predictions.
  • Machine learning models make predictions based on the input data. Each machine learning model, representing a function approximation, can be taken as a hypothesis. All the different candidate models together constitute what is called the hypothesis space .
  • As part of a linear regression machine learning model , it is claimed that there is a relationship between the response variable and the predictor variables. Can this hypothesis or claim be taken as truth? Let’s say the hypothesis is that the housing price depends upon the average income of people already staying in the locality. How true is this hypothesis or claim? The relationship between the response variable and each predictor variable can be evaluated using a T-test and T-statistics .
  • For a linear regression model , another hypothesis is that there is no relationship between the response variable and any of the predictor variables. Thus, if b1, b2, and b3 are the three parameters, all of them are equal to zero: b1 = b2 = b3 = 0. This is where one performs an F-test and uses the F-statistic to test this hypothesis (see the sketch after this list).
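A minimal sketch of the last two bullets, using statsmodels on synthetic data (the data-generating process is invented for illustration):

```python
# t-tests on individual coefficients and the overall F-test in linear regression
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # three predictor variables
y = 2.0 * X[:, 0] + rng.normal(size=100)  # only the first predictor matters

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.tvalues)                 # t-statistic per coefficient (H0: b_i = 0)
print(model.pvalues)                 # corresponding p-values
print(model.fvalue, model.f_pvalue)  # F-test of H0: b1 = b2 = b3 = 0
```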

You may note the different hypotheses listed above. The next step would be to validate some of these hypotheses. This is where data scientists come into the picture: one or more data scientists may be asked to work on different hypotheses, which would result in them looking for appropriate data related to the hypothesis they are working on. This section will be detailed out in the near future.

State the Hypothesis to begin Hypothesis Testing

The first step to hypothesis testing is defining or stating a hypothesis. Before the hypothesis can be tested, we need to formulate the hypothesis in terms of mathematical expressions. There are two important aspects to pay attention to, prior to the formulation of the hypothesis. The following represents different types of hypothesis that could be put to hypothesis testing:

  • Claim made against the well-established fact : The case in which a fact is well-established, or accepted as truth or “knowledge”, and a new claim is made about this well-established fact. For example , when you buy a packet of 500 gm of sugar, you assume that the packet contains at minimum 500 gm of sugar and not any less, based on the label of 500 gm on the packet. In this case, the fact is given or assumed to be the truth. A new claim can be made that the 500 gm packet contains less than 500 gm of sugar. This claim needs to be tested before it is accepted as truth. Such cases can be considered for hypothesis testing when it is claimed that the assumption or the default state of being is not true. The claim to be established as the new truth is stated as the “alternate hypothesis”; the opposite state is stated as the “null hypothesis”. Here, the claim that the 500 gm packet contains less than 500 gm of sugar would be the alternate hypothesis, and the opposite state, that the packet contains 500 gm, would be the null hypothesis.
  • Claim to establish the new truth : The case in which a claim is made about the reality that exists in the world (a fact). For example , the claim that the housing price depends upon the average income of people already staying in the locality is a claim and not assumed to be true. Another example could be the claim that running 5 miles a day results in a reduction of 10 kg of weight within a month. There can be many such claims which, when required to be proved as true, have to go through hypothesis testing. The claim to be established as the new truth is stated as the “alternate hypothesis”; the opposite state is stated as the “null hypothesis”. The claim that running 5 miles a day results in a reduction of 10 kg of weight within a month would be stated as the alternate hypothesis.

Based on the above considerations, the following hypotheses can be stated for hypothesis testing.

  • The packet of 500 gm of sugar contains sugar of weight less than 500 gm. (Claim made against the established fact). This is a new knowledge which requires hypothesis testing to get established and acted upon.
  • The housing price depends upon the average income of the people staying in the locality. This is a new knowledge which requires hypothesis testing to get established and acted upon.
  • Running 5 miles a day results in a reduction of 10 kg of weight within a month. This is a new knowledge which requires hypothesis testing to get established for widespread adoption.

Formulate Null & Alternate Hypothesis as Next Step

Once the hypothesis is defined or stated, the next step is to formulate the null and alternate hypothesis in order to begin hypothesis testing as described above.

What is a null hypothesis?

In the case where the given statement is a well-established fact or the default state of being in the real world, one can call it a null hypothesis (in simpler words, nothing new). Well-established facts don’t need any hypothesis testing and hence can be called the null hypothesis. In cases when a new claim is made which is not well established in the real world, the null hypothesis can be thought of as the default or opposite state of that claim. For example , in the previous section, the claim or hypothesis is made that students studying for more than 6 hours a day get more than 90% of marks in their examination. The null hypothesis, in this case, will be that the claim is not true or real: there is no relationship or association between students studying more than 6 hours a day and getting 90% of the marks, and any such occurrence is only a chance occurrence. Another example of a hypothesis is the allegation that somebody has committed a crime.

The null hypothesis is denoted by the letter H with subscript 0, e.g., [latex]H_0[/latex]

What is an alternate hypothesis?

When the given statement is a claim (unexpected event in the real world) and not yet proven, one can call/formulate it as an alternate hypothesis and accordingly define a null hypothesis which is the opposite state of the hypothesis. The alternate hypothesis is a new knowledge or truth that needs to be established. In simple words, the hypothesis or claim that needs to be tested against reality in the real world can be termed the alternate hypothesis. In order to reach a conclusion that the claim (alternate hypothesis) can be considered the new knowledge or truth (based on the available evidence), it would be important to reject the null hypothesis. It should be noted that null and alternate hypotheses are mutually exclusive and at the same time asymmetric. In the example given in the previous section, the claim that the students studying for more than 6 hours get more than 90% of marks can be termed as the alternate hypothesis.

The alternate hypothesis is denoted by the letter H with subscript a, e.g., [latex]H_a[/latex]

Once the hypothesis is formulated as a null ([latex]H_0[/latex]) and an alternate hypothesis ([latex]H_a[/latex]), there are two possible outcomes of hypothesis testing. These outcomes are the following:

  • Reject the null hypothesis : There is enough evidence based on which one can reject the null hypothesis. Let’s understand this with the help of an example provided earlier in this section. The null hypothesis is that there is no relationship between students studying more than 6 hours a day and getting more than 90% marks. In a sample of 30 students studying more than 6 hours a day, it was found that they scored 91% marks. If the null hypothesis were true, this kind of result would be highly unlikely to happen by chance. That would mean the claim can be taken as the new truth or new knowledge in the real world. One can take further samples of 30 students to perform more tests and validate the hypothesis. If similar results show up in the other tests, it can be said with very high confidence that there is enough evidence to reject the null hypothesis, and one can accept the claim as the new truth: students studying more than 6 hours a day get more than 90% marks. The hypothesis can be considered the new truth until new tests provide evidence against it.
  • Fail to reject the null hypothesis : There is not enough evidence based on which one can reject the null hypothesis (the well-established fact or reality). Thus, one would fail to reject the null hypothesis. In a sample of 30 students studying more than 6 hours a day, the students were found to score 75%. Given that the null hypothesis is true, this kind of result is fairly likely or expected. With the given sample, one can’t reject the null hypothesis that there is no relationship between students studying more than 6 hours a day and getting more than 90% marks.

Examples of formulating the null and alternate hypothesis

The following are some examples of null and alternate hypotheses, drawn from the claims discussed above.

  • Sugar packet: Null hypothesis – the 500 gm packet contains 500 gm of sugar. Alternate hypothesis – the packet contains less than 500 gm of sugar.
  • Study hours: Null hypothesis – there is no relationship between studying more than 6 hours a day and scoring more than 90% marks. Alternate hypothesis – students studying more than 6 hours a day get more than 90% marks.

Hypothesis Testing Steps

Here is a diagram which represents the workflow of hypothesis testing.

Figure 1. Hypothesis Testing Steps

Based on the above, the following are some of the  steps to be taken when doing hypothesis testing:

  • State the hypothesis : First and foremost, the hypothesis needs to be stated. The hypothesis can either be the statement that is assumed to be true or the claim which is made to be true.
  • Formulate the hypothesis : This step requires one to identify the null and alternate hypotheses, or in simple words, formulate the hypothesis. Take as an example the null hypothesis that the sugar packet weighs 500 gm.
  • Set the criteria for a decision : Identify the test statistic that can be used to assess the null hypothesis. With the above example, the test statistic is based on the average weight of the sugar packets, and t-statistics would be used to determine the P-value. For different kinds of problems, different kinds of statistics can be used, including Z-statistics, T-statistics, F-statistics, etc.
  • Identify the level of significance (alpha) : Before starting the hypothesis testing, one is required to set the significance level (also called alpha ), which represents the value for which a P-value less than or equal to alpha is considered statistically significant. Typical values of alpha are 0.1, 0.05, and 0.01. If the P-value is evaluated as statistically significant, the null hypothesis is rejected; if the P-value is more than the alpha value, we fail to reject the null hypothesis.
  • Compute the test statistics : The next step is to calculate the test statistic (z-test, t-test, f-test, etc.) needed to determine the P-value. If the sample size is more than 30 and the population standard deviation is known, it is recommended to use z-statistics; otherwise, t-statistics can be used. In the current example, where 20 sugar packets are selected for hypothesis testing, the t-statistic is calculated for the sample mean of 505 gm: the difference between 505 gm (sample mean) and the population mean (500 gm), divided by the sample standard deviation over the square root of the sample size (20).
  • Calculate the P-value of the test statistics : Once the test statistic has been calculated, find the P-value using either a t-table or a z-table. The P-value is the probability of obtaining a test statistic (t-score or z-score) equal to or more extreme than the result obtained from the sample data, given that the null hypothesis H0 is true.
  • Compare the P-value with the level of significance : The significance level defines the allowable range within which, if the value appears, one fails to reject the null hypothesis. This region is also called the non-rejection region . The value of alpha is compared with the P-value. If the P-value is less than the significance level, the test is statistically significant and hence the null hypothesis is rejected (see the sketch after this list).
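Following these steps for the sugar-packet example, here is a minimal sketch with scipy; the 20 weights are invented so that the sample mean comes to roughly 505 gm:

```python
# One-sample t-test: is the mean packet weight 500 gm?
import numpy as np
from scipy import stats

weights = np.array([504, 506, 507, 503, 505, 504, 506, 505, 508, 504,
                    503, 506, 505, 507, 504, 505, 506, 503, 505, 504])

alpha = 0.05
t_stat, p_value = stats.ttest_1samp(weights, popmean=500)

print(f"sample mean = {weights.mean():.1f} gm")
print(f"t = {t_stat:.2f}, p-value = {p_value:.2e}")
print("Reject H0" if p_value <= alpha else "Fail to reject H0")
```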

P-Value: Key to Statistical Hypothesis Testing

Once you formulate the hypotheses, they need to be tested. Say the null hypothesis is stated as the statement that housing price does not depend upon the average income of people staying in the locality; it would be tested by taking samples of housing prices, and based on the test results, this null hypothesis would either be rejected or fail to be rejected . In hypothesis testing, the following two are the outcomes:

  • Reject the Null hypothesis
  • Fail to Reject the Null hypothesis

Take the above example of the sugar packet weighing 500 gm. The null hypothesis is set as the statement that the sugar packet weighs 500 gm. After taking a sample of 20 sugar packets and measuring their weight, it was found that the average weight of the packets came to 495 gm. The test statistic (t-statistic) was calculated for this sample and the P-value was determined. Let’s say the P-value was found to be 15%. Assuming that the level of significance is selected to be 5%, the test statistic is not statistically significant (P-value > 5%) and thus the null hypothesis fails to get rejected. One could safely conclude that the sugar packet does weigh 500 gm. However, if the average weight had been found to be 465 gm, which is far away from the mean value of 500 gm, one could have ended up rejecting the null hypothesis based on the P-value .

Hypothesis Testing for Problem Analysis & Solution Implementation

Hypothesis testing can be applied in both problem analysis and solution implementation. The following describes how you can apply the hypothesis testing technique in both the problem and solution spaces:

  • Problem Analysis : Hypothesis testing is a systematic way to validate assumptions or educated guesses during problem analysis. It allows for a structured investigation into the nature of a problem and its potential root causes. In this process, a null hypothesis and an alternative hypothesis are usually defined. The null hypothesis generally asserts that no significant change or effect exists, while the alternative hypothesis posits the opposite. Through controlled experiments, data collection, or statistical analysis, these hypotheses are then tested to determine their validity. For example, if a software company notices a sudden increase in user churn rate, they might hypothesize that the recent update to their application is the root cause. The null hypothesis could be that the update has no effect on churn rate, while the alternative hypothesis would assert that the update significantly impacts the churn rate. By analyzing user behavior and feedback before and after the update, and perhaps running A/B tests where one user group has the update and another doesn’t, the company can test these hypotheses. If the alternative hypothesis is confirmed, the company can then focus on identifying specific issues in the update that may be causing the increased churn, thereby moving closer to a solution.
  • Solution Implementation : Hypothesis testing can also be a valuable tool during the solution implementation phase, serving as a method to evaluate the effectiveness of proposed remedies. By setting up a specific hypothesis about the expected outcome of a solution, organizations can create targeted metrics and KPIs to measure success. For example, if a retail business is facing low customer retention rates, they might implement a loyalty program as a solution. The hypothesis could be that introducing a loyalty program will increase customer retention by at least 15% within six months. The null hypothesis would state that the loyalty program has no significant effect on retention rates. To test this, the company can compare retention metrics from before and after the program’s implementation, possibly even setting up control groups for more robust analysis. By applying statistical tests to this data, the company can determine whether their hypothesis is confirmed or refuted, thereby gauging the effectiveness of their solution and making data-driven decisions for future actions.
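For the retention example above, such a test could be sketched in R as follows; all counts are hypothetical:

    # Did retention improve after the loyalty program?
    # H0: retention rate unchanged; H1: retention rate increased
    retained <- c(420, 510)    # customers retained (before, after)
    tracked  <- c(1000, 1000)  # customers tracked in each period
    prop.test(retained, tracked, alternative = "less")  # is p_before < p_after?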


In this post, you learned about hypothesis testing and related nuances such as formulating null and alternate hypotheses and the steps for carrying out a test. In data science, one reason to understand hypothesis testing is the need to verify the relationship between the dependent (response) and independent (predictor) variables; this requires understanding the related concepts of hypothesis formulation into null and alternate hypotheses, the level of significance, test statistic calculation, and the P-value. Given that the relationship between dependent and independent variables is itself a sort of hypothesis or claim, the null hypothesis can be set as the scenario in which there is no relationship between the dependent and independent variables.


Hypothesis Testing


A hypothesis test is a statistical inference method used to test the significance of a proposed (hypothesized) relation between population statistics (parameters) and their corresponding sample estimators . In other words, hypothesis tests are used to determine if there is enough evidence in a sample to support a hypothesis about the entire population.

The test considers two hypotheses: the null hypothesis , which is a statement meant to be tested, usually something like "there is no effect" with the intention of proving this false, and the alternate hypothesis , which is the statement meant to stand after the test is performed. The two hypotheses must be mutually exclusive ; moreover, in most applications, the two are complementary (one being the negation of the other). The test works by comparing the \(p\)-value to the level of significance (a chosen target). If the \(p\)-value is less than or equal to the level of significance, then the null hypothesis is rejected.

When analyzing data, it is rarely feasible to examine an entire population; only samples of a certain size are manageable for efficient computation. Moreover, the underlying quantities may follow continuous or infinite distributions, so samples are used to assess the accuracy of a chosen test statistic. Hypothesis testing gives a principled alternative to guessing which distribution, or which parameters, the data follows.

Definitions and Methodology

Hypothesis Tests and Confidence Intervals

In statistical inference, properties (parameters) of a population are analyzed by sampling data sets. Given assumptions on the distribution, i.e. a statistical model of the data, certain hypotheses can be deduced from the known behavior of the model. These hypotheses must be tested against sampled data from the population.

The null hypothesis \((\)denoted \(H_0)\) is a statement that is assumed to be true. If the null hypothesis is rejected, then there is enough evidence (statistical significance) to accept the alternate hypothesis \((\)denoted \(H_1).\) Before doing any test for significance, both hypotheses must be clearly stated as non-conflicting, i.e. mutually exclusive, statements. Rejecting the null hypothesis, given that it is true, is called a type I error , and its probability of occurrence is denoted \(\alpha\). Failing to reject the null hypothesis, given that it is false, is called a type II error , and its probability of occurrence is denoted \(\beta\). Also, \(\alpha\) is known as the significance level , and \(1-\beta\) is known as the power of the test. The four possible outcomes are:

  • Reject \(H_0\) when \(H_0\) is true: type I error.
  • Reject \(H_0\) when \(H_0\) is false: correct decision.
  • Fail to reject \(H_0\) when \(H_0\) is true: correct decision.
  • Fail to reject \(H_0\) when \(H_0\) is false: type II error.

The test statistic is the standardized value computed from the sampled data, under the assumption that the null hypothesis is true, for a chosen particular test. These tests depend on the statistic to be studied and the distribution it is assumed to follow, e.g. the population mean following a normal distribution. The \(p\)-value is the probability of observing a test statistic at least as extreme, in the direction of the alternate hypothesis, given that the null hypothesis is true. The critical value is the value of the assumed distribution of the test statistic such that the probability of making a type I error is small.
Methodology: Given an estimator \(\hat \theta\) of a population statistic \(\theta\), following a probability distribution \(P(T)\), computed from a sample \(\mathcal{S},\) and given a significance level \(\alpha:\)

  • Define \(H_0\) and \(H_1\), and compute the test statistic \(t^*.\)
  • \(p\)-value approach (most prevalent): find the \(p\)-value using \(t^*\) (right-tailed). If the \(p\)-value is at most \(\alpha,\) reject \(H_0\); otherwise, fail to reject \(H_0\).
  • Critical value approach: find the critical value by solving \(P(T\geq t_\alpha)=\alpha\) (right-tailed). If \(t^*>t_\alpha\), reject \(H_0\); otherwise, fail to reject \(H_0\).

Note: Failing to reject \(H_0\) only means an inability to accept \(H_1\); it does not mean that \(H_0\) is accepted.
Assume a normally distributed population has recorded cholesterol levels with various statistics computed. From a sample of 100 subjects in the population, the sample mean was 214.12 mg/dL (milligrams per deciliter), with a sample standard deviation of 45.71 mg/dL. Perform a hypothesis test, with significance level 0.05, to test if there is enough evidence to conclude that the population mean is larger than 200 mg/dL.

We will perform the test using the \(p\)-value approach with significance level \(\alpha=0.05:\)

  • Define \(H_0\): \(\mu=200\).
  • Define \(H_1\): \(\mu>200\).
  • Since the values are normally distributed, the test statistic is \(z^*=\frac{\bar X - \mu_0}{s/\sqrt{n}}=\frac{214.12 - 200}{45.71/\sqrt{100}}\approx 3.09\).
  • Using a standard normal distribution, we find that the \(p\)-value is approximately \(0.001\).
  • Since the \(p\)-value is at most \(\alpha=0.05,\) we reject \(H_0\).

Therefore, the test shows sufficient evidence to support the claim that \(\mu\) is larger than \(200\) mg/dL.
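The arithmetic is easy to verify in R:

    z <- (214.12 - 200) / (45.71 / sqrt(100))  # test statistic, ~3.09
    pnorm(z, lower.tail = FALSE)               # p-value, ~0.001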

If the sample size were smaller, the normal and \(t\)-distributions would behave differently, and the \(t\)-distribution should be used instead. The next question also asks about a difference in either direction, so it must be handled with a two-tailed test.

Assume a population’s cholesterol levels are recorded and various statistics are computed. From a sample of 25 subjects, the sample mean was 214.12 mg/dL (milligrams per deciliter), with a sample standard deviation of 45.71 mg/dL. Perform a hypothesis test, with significance level 0.05, to test if there is enough evidence to conclude that the population mean is not equal to 200 mg/dL.

We will perform the test using the \(p\)-value approach with significance level \(\alpha=0.05\) and the \(t\)-distribution with 24 degrees of freedom:

  • Define \(H_0\): \(\mu=200\).
  • Define \(H_1\): \(\mu\neq 200\).
  • Using the \(t\)-distribution, the test statistic is \(t^*=\frac{\bar X - \mu_0}{s/\sqrt{n}}=\frac{214.12 - 200}{45.71/\sqrt{25}}\approx 1.54\).
  • Using a \(t\)-distribution with 24 degrees of freedom, we find that the \(p\)-value is approximately \(2(0.068)=0.136\). We multiply by two since this is a two-tailed test: the mean can be either smaller or larger than the hypothesized value.
  • Since the \(p\)-value is larger than \(\alpha=0.05,\) we fail to reject \(H_0\).

Therefore, the test does not show sufficient evidence to support the claim that \(\mu\) is not equal to \(200\) mg/dL.
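Again, verifying in R:

    t_stat <- (214.12 - 200) / (45.71 / sqrt(25))  # test statistic, ~1.54
    2 * pt(t_stat, df = 24, lower.tail = FALSE)    # two-tailed p-value, ~0.136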

The complement of the rejection on a two-tailed hypothesis test (with significance level \(\alpha\)) for a population parameter \(\theta\) is equivalent to finding a confidence interval \((\)with confidence level \(1-\alpha)\) for the population parameter \(\theta\). If the assumption on the parameter \(\theta\) falls inside the confidence interval, then the test has failed to reject the null hypothesis \((\)with \(p\)-value greater than \(\alpha).\) Otherwise, if \(\theta\) does not fall in the confidence interval, then the null hypothesis is rejected in favor of the alternate \((\)with \(p\)-value at most \(\alpha).\)



Exploring Data Science with R and the Tidyverse: A Concise Introduction

6 Hypothesis Testing

In the previous chapters, we learned about randomness and sampling. Quite often, a data scientist receives some data and must make some assertion about it. There are typically two kinds of situations:

  • She has one dataset and a model that the data should have followed. She needs to decide whether it is likely that the dataset indeed follows the model.
  • She has two datasets. She needs to decide if it is possible to explain the two datasets using a single model.

Here are examples of the two situations.

  • A company making coins is testing the fairness of a coin. Using some machine, the company tosses the coin 10,000 times. They record the face that turns up. By examining the record of the 10,000 tosses, can you tell how likely it is that the coin is fair?
  • Is the proportion of enrolled Asian American students at Harvard University disproportionately less than the pool of Harvard-admissible applicants? ( SFFA v. Harvard )

In both situations, the important considerations are how to measure the differences and how to figure out the way the sample at hand was generated.

6.1 Testing a Model

Suppose 10,000 tosses of a coin generate the following counts for “Heads” and “Tails”:

By dividing each number by 10,000, we get the proportion of the occurrence of each face.

We notice that the values are not exactly 0.5 (= \(1/2\) ). How far away is that from what we know about the distribution of a fair coin?

We know that the probability of each face of a fair coin is \(1/2\). Subtracting \(1/2\) from each proportion and taking absolute values (removing any negative sign), we have:

We then compute the sum and take one half of this value.
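In R, the full computation can be sketched as follows, assuming the counts were 4,953 heads and 5,047 tails (consistent with the head count of 4953 used below):

    counts <- c(heads = 4953, tails = 5047)
    props  <- counts / 10000     # proportion of each face
    diffs  <- abs(props - 1/2)   # absolute differences from a fair coin
    sum(diffs) / 2               # the test statistic: 0.0047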

The number found by following the above steps is called the test statistic . The test statistic is a statistic used to evaluate a hypothesis, i.e., whether a coin at hand is fair or not. When we compute the test statistic from the data given to us, we call this the observed value of the test statistic .

There are many possible test statistics we could have tried for this problem. This one goes by a special name: the total variation distance (or, for short, TVD). The total variation distance serves as the measure for the difference between two distributions, namely, the difference between some given distribution (e.g., following the coin handed to us) and a sampling distribution (e.g., following a fair coin).

Another possible, and perhaps straightforward, test statistic is to simply count the number of heads that appear in the sample.

Is 4953 heads too small to be due to chance? It is hard to tell without knowing the number of heads we get by chance.

We can conduct some simulation to obtain a sampling distribution of the number of heads produced by a fair coin. As mentioned earlier, the proportion that each face turns up is not constant, even for a fair coin. So, by simulating tosses of a fair coin, we must expect to see a range of the number of heads seen.

6.1.1 Prerequisites

Before starting, let us load the tidyverse as usual.

6.1.2 The rmultinom function

We saw before that we can generate a sampling distribution by putting in place some sampling strategy. Perhaps the most straightforward is simple random sampling with replacement. This will be the approach we continue to make use of here, as well as throughout the rest of the text.

We also learned about two different ways to sample with replacement. sample samples items from a vector , which we used when simulating the expected amount of grains a minister receives after some number of days. slice_sample functions identically, but instead samples from rows of a data frame or tibble. To generate a sampling distribution for this experiment, we could just use sample again. But there is a quicker way, using a function called rmultinom , which is tailored for sampling at random from categorical distributions . We introduce it here and will use it several times this chapter.

Here is how we can use it to generate a sampling distribution of 100 tosses of a fair coin.
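One such call might look like this:

    rmultinom(n = 1, size = 100, prob = c(1/2, 1/2))  # counts of heads and tails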

For generating a sampling distribution, we are more interested in the proportion of resulting heads and tails. Thus, we should divide by the number of tosses. Note how the probability of heads and tails is about equal.

So we can just as easily simulate proportions instead of counts.
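For instance:

    rmultinom(n = 1, size = 100, prob = c(1/2, 1/2)) / 100  # proportions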

The classic interpretation of rmultinom is that you have size many marbles to put into boxes, where each box receives a marble with probability given by prob ; the length of prob determines the number of boxes. The result shows the number of marbles that end up in each box. Thus, the function takes the following arguments:

  • size , the total number of marbles that are put into the boxes.
  • prob , the distribution of the categories in the population, as a vector of proportions that add up to 1.
  • n , the number of samples to draw from this distribution. We will typically leave this at 1 to make things easier to work with later on.

It returns a vector containing the number of marbles in each category in a random sample of the given size taken from the population. Because this distribution is so special, statisticians have given it a name: the multinomial distribution .

Let us see how we can use it to assess the model for 10,000 tosses of a coin.

6.1.3 A model for 10,000 coin tosses

We can extend our coin toss example code to incorporate the rmultinom function:

How many heads are in this sample?

We can use this to generate a sampling distribution for the number of heads produced by a fair coin. Let’s wrap this up into a function we can call. This will produce one simulated test statistic under the assumption of a fair coin.

Next, we create a vector sample_stats containing 1,000 simulated test statistics. As before, we will use replicate to do the work.
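A minimal sketch of these two steps; the function name one_simulated_heads is chosen here for illustration:

    one_simulated_heads <- function() {
      tosses <- rmultinom(n = 1, size = 10000, prob = c(1/2, 1/2))
      tosses[1]  # the simulated number of heads
    }
    sample_stats <- replicate(1000, one_simulated_heads())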

6.1.4 Chance of the observed value of the test statistic occurring

To interpret the results of our simulation, we start by visualizing the results using a histogram of the samples.


Where does the observed value fall in this histogram? We can plot it easily.


Let us look at the proportion of the elements in the vector sample_stats that are at least as small as the observed number of heads or more extreme, whose value we have stored in observed_heads . We simply count the elements matching the requirement and then divide the count by the length of the vector.
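In R this is a one-liner:

    observed_heads <- 4953
    sum(sample_stats <= observed_heads) / length(sample_stats)  # ~0.19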

The value we get is 0.19, or about 19%. We interpret this value as the chance, under the assumption of the model, that the test statistic comes out at 4953 heads or fewer. We conclude that a fair coin would yield the observed test statistic value we found (or less) about 19% of the time. This value, computed through simulation, is what we call a p-value .

To compute a p-value , we take the following steps:

  • Simulate the test statistic many times under the model given by the null hypothesis.
  • Locate the observed value of the test statistic on the resulting sampling histogram.
  • Compute the proportion of simulated values that are at least as extreme as the observed value.

More specifically, we estimate how far the observation is from the central portion of the histogram by splitting the histogram at the point of observation: the portion less than or equal to the observation (shown in dark cyan) and the portion greater than or equal to the observation (shown in orange). Note that a sampled value equal to the observed value can be included on either side.


The more orange bars that are visible in the histogram, the higher the likelihood of the observation. Conversely, the fewer orange bars there are, the lower the likelihood of seeing such an observation. The area covered by the orange bars is formally called the “area in the tail.” The area in the tail is designated a special name: the p-value .

When we compute a p-value, we have in mind two possible interpretations. We call them the Null Hypothesis and the Alternative Hypothesis .

  • Null Hypothesis (NH): The hypothesis we use to create the model for simulation. For example, we assume that the coin we have is a fair coin, about 50% equal chance to see either face. Any variation from what we expect is because of chance variation.
  • Alternative Hypothesis (AH): The opposite, or counterpart, of the Null Hypothesis. For example, the AH states that the coin is biased towards tails. The difference we observed is caused by something other than randomness.

It is important that your null hypothesis acknowledges chance variation in the data. For example, if the null hypothesis states that a die is fair, it must still allow for variation from exact uniformity: why did you not get any 3 ’s when rolling the die 6 times? Chance variation alone can account for that.

We can provide one more definition of the p-value in the language of these hypotheses:

The chance, under the null hypothesis , of getting a test statistic equal to the observed test statistic or more extreme in the direction of the alternative hypothesis .

6.2 Case Study: Entering Harvard

Harvard University is one of the most prestigious universities in the United States. A recent lawsuit by Students for Fair Admissions (SFFA) led by Edward Blum against Harvard University alleges that Harvard has used soft racial quotas to keep the numbers of Asian-Americans low.

Put differently, the allegation claims that, from the pool of Harvard-admissible American applicants, Harvard uses a racial discriminatory score that allows the college to choose disproportionately less Asian-Americans. As of this writing, the lawsuit has been appealed and is to appear before the Supreme Court.

This section examines the “Harvard model” to assess, at a basic level, the claim put forward by the lawsuit.

6.2.1 Prerequisites

Before starting, let’s load the tidyverse as usual.

6.2.2 Students for Fair Admissions

Harvard University publishes some statistics on the class of 2024. According to the data, the proportions of the class for Whites, Blacks, Hispanics, Asians, and Others (International and Native Americans) are respectively 46.1%, 14.7%, 12.7%, 24.4%, and 2.1%.

We do not have the data of the students admissible to enter Harvard so, in lieu of this, we refer to student demographics enrolled in a four-year college. According to Chronicle of Higher Education 2020-2021 Almanac, the racial demographics of full-time students in 4-year private non-profit institutions – Harvard is one of them – in Fall 2018 are: 63.6% White, 11.5% Black, 12.3% Hispanic, 8.1% Asian, and 4.5% Other.

Let’s compile this information into a tibble.

The distributions may look quite different from each other. Of course, the demographics from the Almanac include students who did not apply to Harvard, those who applied and did not get in, and international students. Moreover, the Almanac data covers all full-time students in 2018 but not students who entered college in 2020.

Notwithstanding these differences, let us conduct an experiment to see how the demographics from Harvard look different from those given by the Almanac in terms of a sampling distribution.

As we will be handling proportions, let us scale the numbers down (by dividing each element by 100) so that they are expressed as proportions rather than percentages.

We will also write a function that computes the total variation distance (TVD) between two vectors.
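A minimal version of such a function:

    # Total variation distance between two distributions
    tvd <- function(dist1, dist2) {
      sum(abs(dist1 - dist2)) / 2
    }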

In this study, our observed value is the TVD between the distribution of students in the Harvard class and the Almanac.

The Harvard class of 2024 has 2015 students. We can think of the process of sampling 2015 people to fill the “Harvard class” from those who “were attending” a four-year non-profit college in Fall 2018 and then examining their racial distribution. Of course, we cannot reach out to those individuals specifically, but we know the distribution of the entire population from which we want to sample. Therefore, we can simulate a large number of times what this “Harvard class” looks like and compare it with what we know about the actual Harvard class distribution, available in harvard_diff .

Our sampling plan can be framed as a “boxes and marbles” problem, as we saw in the previous section. There are five “boxes” to choose from, where each corresponds to a race: White, Black, Hispanic, Asian, and Other. The goal is to place marbles (which correspond to students) in each of the boxes, where the probability of ending up in any of the boxes is given by the Almanac.

This is an excellent fit for the rmultinom function. For example, here is one simulation of the proportion of races found in a “Harvard class.”

How far is our simulated class from the Almanac? We compute the TVD to find out.

We wrap our work into a function.

Let us simulate what 10,000 classes could look like. This will be contained in a vector called sample_class_tvds . Also, as before, we will use replicate to do the work.
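A minimal sketch of the whole pipeline, using plain vectors of the proportions quoted earlier (the chapter itself works with tibbles):

    harvard_props <- c(0.461, 0.147, 0.127, 0.244, 0.021)  # W, B, H, A, Other
    almanac_props <- c(0.636, 0.115, 0.123, 0.081, 0.045)
    observed_tvd  <- tvd(harvard_props, almanac_props)     # ~0.199

    one_simulated_class <- function(props) {
      simulated <- rmultinom(n = 1, size = 2015, prob = props) / 2015
      tvd(as.vector(simulated), props)
    }
    sample_class_tvds <- replicate(10000, one_simulated_class(almanac_props))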

We can now visualize the result.


Where does the true Harvard class lie on this histogram? We can plot a point geom to find out.


The orange dot shows the TVD of the Harvard distribution from the Almanac distribution. What we see is that the proportion of the races at Harvard is nothing like the national proportion.

6.2.3 Proportion of Asian American students

The prior experiment looked at the proportion of all races. However, the claim given by the lawsuit is specifically about Harvard-admissible students who are Asian American. We can now address this by refining our model to include only Asian American students and non-Asian American students.

As before, we transform the data to be in terms of proportions.

The proportion of Asian American students in private non-profit 4-year colleges is just 8.1%, while they make up 24.4% of the Harvard freshman class. Let’s recompute our observed TVD value.
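Using the proportions just quoted:

    observed_tvd_asian <- abs(0.244 - 0.081)  # 0.163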

Note that we took a shortcut for computing the TVD here. When dealing with two categories, the TVD is equal to the distance between the two proportions in one of the categories.

Re-running the simulation is easy. Note that instead of passing class_props as an argument to the function one_simulated_class , we pass the new tibble class_props_asian .

We again visualize the result.


Where does the observed value fall in this histogram?


We find that the result is the same; the Harvard proportion of Asian American students is not at all like the national value. We can state, with great confidence, that Harvard enrolls many more Asian American students than the national average.

Important note: The reader should be cautioned not to accept these results as direct evidence against the suit’s case. As noted at the outset of this section, we do not know the proportion of Harvard-admissible students and must instead rely on a national Almanac for reference. The base population of students may be very different, which is in fact something we anticipated.

6.3 Significance Levels

So far we have evaluated models by comparing some observation to a prediction made by a model. For instance, we compared:

  • Racial demographics at Harvard University with the national Almanac at four-year non-profit private colleges.
  • 10,000 tosses of an unknown coin at hand with a fair coin.

Sometimes the observed value of the test statistic ends up in the “bulk” of the predictions; sometimes it ends up very far away. But how do we define what “close” or “far” is? And at what point does the observed value transition from being “close” to “far”?

This section examines the significance of an observed value. To set up the discussion, we introduce another example still on the topic of academics: midterm scores in a hypothetical Computer Science course.

6.3.1 Prerequisites

Before starting, let’s load the tidyverse as usual. We will also use a dataset from the edsdata package, so let us load this in as well.

6.3.2 A midterm grumble?

A hypothetical Computer Science course had 40 enrolled students and was divided into 3 lab sections. A Teaching Assistant (TA) leads each section. After a midterm exam was given, the students in one section noticed that their midterm scores were lower compared to students in the other two lab sections. They complained that their performance was due to the TA’s teaching. The professor faced a dilemma: is the TA at fault for poor teaching, or are the students from that section simply more vocal about their grumbles following an exam?

If we were to fill that lab section with randomly selected students from the class, it is possible that their average midterm grade will look a lot like the score the grumbling students are unhappy about. It turns out that what we have stated here is a chance model that we can simulate.

Let’s have a look at the data from each student in the course. The following tibble csc_labs contains midterm and final scores, and the lab section the student is enrolled in.

We can use the dplyr verb group_by to examine the mean midterm grade as well as the number of students in each section.
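A minimal sketch, assuming the tibble’s columns are named lab and midterm :

    csc_labs %>%
      group_by(lab) %>%
      summarize(mean_midterm = mean(midterm), students = n())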

Indeed, it seems that the section H students fared the worst, albeit by a small margin, among the three sections. Our statistic then is the mean grade of students in the lab section. Thus, our observed statistic is the mean grade from section H, which is about 68.56.

We formally state our null and alternative hypothesis.

Null Hypothesis: The mean midterm grade of students in lab section H looks like the mean grade of a “section H” that is generated by randomly sampling the same number of students from the class.

Alternative Hypothesis: The section H midterm grades are too low.

To form a random sample, we will need to sample without replacement 9 students from the course to fill up the theoretical “lab section H”.

We can look at the mean midterm score for this randomly sampled section.

Now that we know how to simulate one value, we can wrap this into a function.

We will simulate the “section H lab” 10,000 times. Let us run the simulation!
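A sketch of both steps, assuming the midterm column is named midterm and with illustrative function and vector names:

    one_section_mean <- function() {
      # sample 9 students' midterm scores without replacement
      mean(sample(csc_labs$midterm, size = 9, replace = FALSE))
    }
    sample_means <- replicate(10000, one_section_mean())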

As before, we visualize the resulting distribution of grades.


It seems that the grades cluster around 72. Where does the actual section H section lie? Recall that this value is available in the variable observed_statistic . We overlay a point geom to our above plot.


It seems that the observed statistic is “close” to the center of randomly sampled scores.

6.3.3 Cut-off points

Let’s sort the sample means generated. We will examine in particular the value at 10% and 5% from the bottom. We will first turn to the value at 10%.
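Both cut-off values can be read off with quantile :

    quantile(sample_means, c(0.10, 0.05))  # about 63 and 61 in this chapter's run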

We plot our sampling histogram and overlay it with a vertical line at this value.


This means that 10% of our simulated mean statistics are equal to or less than 63. Put differently, the chance of an average midterm score of 63 or lower occurring, under the assumption of a TA teaching in good faith, is around 10%.

The situation is not much different if we look at the value 5% from the bottom, which we find to be about 61. We redraw the situation on our sampling histogram.


This time, the chance of obtaining a simulated mean midterm score at least as low as 61 is about 5%.

We say that the threshold point 63 is at the 90% significance level and the threshold point 61 is at the 95% significance level.

6.3.4 The significance level is an error probability

The last figure shows that, although rare, a lab section with a “TA teaching in good faith” can still produce mean midterm scores that are at least as low as 61. How often does that occur? The figure gives the answer for that as well: it does so with about 5% chance.

Therefore, if the TA is teaching in good faith and our test uses a 95% significance level to decide whether or not the TA is guilty, then there is about a 5% chance that the test will wrongly conclude that the mean midterm scores are too low and, consequently, the TA is at fault. This example points to a general fact about significance levels:

If you use a \(p\) % significance level (in this chapter’s convention) and the null hypothesis happens to be true, there is about a \((100-p)\) % chance that the test will incorrectly conclude the alternative hypothesis.

Statistical inferences, unlike logical inferences, can be wrong! But the power of statistics is its ability to quantify how often this error will occur. In fact, we can control the chance of wrongly convicting the TA by choosing a higher significance level. We could look at the 99% significance level, or even the 99.9% and 99.99% levels; such stringent levels are common in areas of physics that demand enormous evidence before a claim is accepted.

Here, too, there are trade-offs. By minimizing the error of wrongly convicting a TA teaching in good faith, we increase the chance of another kind of error: our test concluding nothing when in fact there is something unusual about the lab section’s midterm grades.

It is important to note that these cut-off points are convention only and do not have any strong theoretical backing. They were first established by the statistician Ronald Fisher in his seminal work Statistical Methods for Research Workers .

Therefore, declaring an observed statistic as being “too low” or “too high” is a judgment call. In any statistics-based research, you should always provide the observed test statistic and the p-value used in addition to giving your decision; this way your readers can decide for themselves whether or not the results are indeed significant.

6.3.5 The verdict: is the TA guilty?

We can set a modest significance level at 95% for the course case study. Of course, judgment is needed if the decision resulting from this study will cause the TA to be reprimanded – we may tend towards a much more conservative significance level to be fully convinced, even if this means increasing the chance of a “guilty TA” being let free.

We overlay the observed statistic on the sampling histogram. As before, the orange bars show the 95% significance region.


We see that the point does not cross the vertical purple line. We can check numerically how much area “is in the tail”.


We conclude the TA’s defense holds up pretty well: the average lab section H scores are not any different from those generated by chance.

A note, also, on drawing conclusions from a hypothesis test. Even if there was significant evidence to reject the null hypothesis at some conventional cut-off, caution must be exercised in interpreting the poor performance as being directly caused by the TA’s instruction. There could be other variables at play that we did not account for that can affect the significance, e.g., the background of the students enrolled in this particular lab section (did they have less prior programming experience compared to the other sections?). We call these confounding variables , which we will examine in more depth in a later chapter.

In any case, it would be prudent to check in with the TA to get their take on the story.

6.3.6 Choosing a test statistic

By this point, we have been introduced to a few different test statistics. A common challenge when developing a hypothesis test is to first define what a “good” test statistic is for the problem.

Consider your alternative hypothesis and what evidence favors it over the null. If only “large values” or “small values” of the test statistic favor the alternative, then we recommend using the test statistic. For instance, in the midterm example, we considered only “small values” of the sample mean statistic to determine if the lab section H scores are “too low.” In the Harvard admissions example, we considered “large values” of the TVD test statistic to determine if the TVD of the Harvard proportions is “too big” to have been generated by a model under the null hypothesis.

Avoid choosing test statistics where “both big values and small values” favor the alternative. In this case, the area that supports the alternative includes both the left and right “tails”. Consider the following sampling histogram of the test statistic and note the tails as indicated by the orange bars.


We suggest modifying your test statistic so that the evidence favoring the alternative involves only one tail.

Finally, here is a summary of the test statistics used in this chapter and when each is appropriate:

  • Sample mean (or count): when only small values, or only large values, favor the alternative, as in the midterm example.
  • Difference in group means: when the alternative says one group is, on average, higher than the other.
  • Absolute difference in group means: when the alternative only says the groups differ, in either direction.
  • Total variation distance (TVD): when comparing two categorical distributions.

6.4 Permutation Testing

In the previous section, we studied the use of hypothesis testing. In this section we learn a simple method to compare two distributions, called permutation testing . This allows us to decide whether two samples come from the same underlying distribution.

6.4.1 Prerequisites

Let us begin by loading the tidyverse . We will also use a dataset from the edsdata package.

6.4.2 The effect of a tutoring program

The tibble finals from the edsdata package contains final exam grades in a hypothetical Computer Science course for 105 students. They are divided into two groups, based on two different offerings of the course labeled A and B . The more recent offering B featured a tutoring program for students to receive help on assignments and exams. The course instructor is interested in finding out if the tutoring program boosted overall performance in the class, measured by a final exam. This could help the instructor and department decide if the program should continue or even be expanded. Suppose that the dataset is collected over two semesters from the same Computer Science course.

Let’s first load the dataset.

We can examine the number of enrolled students in each of the two offerings.

It appears they are about equal. Let’s now turn to a distribution of the students in the offering that featured the tutoring program (class B ) compared to those in the offering without the program (class A ). To generate an overlaid histogram, we use the positional adjustment argument identity and set an alpha so that the bars are drawn with slight transparency.


By observation alone, it seems that the distribution of final scores in the offering where the tutoring program was available ( B ) sits slightly to the right of the distribution of scores when the program did not exist ( A ). Could this be chalked up to chance?

As we have done throughout this chapter, we can address this question by means of a hypothesis test. We will state a null and alternative hypothesis that arise from the problem.

Null hypothesis: In the population, the distribution of final exam scores where the tutoring program was available is the same as those when the service did not exist. The difference seen in the sample is because of chance.

Alternative hypothesis: In the population, final exam scores when the tutoring program was available are, on average, higher than the scores when the program was not.

According to the alternative hypothesis, the average final score in offering B should be higher than the average final score in offering A . Therefore, a good test statistic we can use is the difference in the mean between the two groups. That is,

\[ \text{test statistic} = \mu_B - \mu_A \]

where \(\mu\) denotes the mean of the group.

First, we form two vectors finalsA and finalsB that contain final scores with respect to the course offering.

The observed value of the statistic can be computed as the following.
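A minimal sketch, assuming the tibble’s columns are named class and final :

    finalsA <- finals %>% filter(class == "A") %>% pull(final)
    finalsB <- finals %>% filter(class == "B") %>% pull(final)
    observed_diff <- mean(finalsB) - mean(finalsA)  # ~6.10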

We can write a function that computes the statistic for us. We call it mean_diff .
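A minimal version:

    mean_diff <- function(a, b) {
      mean(b) - mean(a)
    }
    mean_diff(finalsA, finalsB)  # same value as observed_diff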

Observe how it returns the same value for the observed statistic.

To predict the statistic under the null hypothesis, we turn to an idea called the permutation test .

6.4.3 A permutation test

Suppose that we are given the following vector of integers.

We can interpret these numbers as indices that refer to an element inside a vector. We imagine that the first half of indices belong to a group A , and the second half group B .

Under the assumption of the null hypothesis, there should be no difference between the two distributions A and B with respect to the underlying population. For example, whether a final exam score belongs to the course offering A or B should have no effect on the mean final score. If so, there should be no consequences if we place both groups into a pot, shuffle them around, and compute the mean difference from the result. The resulting value we get from this process is one simulated value of the test statistic under the null hypothesis.

The first bit of machinery we need is a function that shuffles a sequence of integers. We actually already know one: sample .
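For example:

    sample(1:10)  # returns the ten integers in a shuffled order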

In this example, sample receives a vector of numbers 1 through 10 and returns the result after shuffling them. We might also call the result a permutation of the original sequence – hence, its namesake.

If we again interpret the resulting vector as indices, we take the first half to be the indices of the shuffled group A and the second half the shuffled group B.

The remaining work then is to compute the difference in means between the shuffled groups.

The function one_mean_difference puts everything together. It receives two vectors, a and b , puts them together in a pot, and deals out two shuffled vectors with the same size as a and b , respectively. The function returns the value of the simulated statistic by calling the functional compute_statistic . For this example, we use mean_diff .

We are now ready to perform a permutation test for the tutoring program example. We would like to simulate the test statistic under the null hypothesis multiple times and collect the values into a vector. As before, we can use replicate . We will simulate 10,000 values.
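A sketch of the function and the simulation; only the names one_mean_difference and compute_statistic come from the text, the rest are illustrative:

    one_mean_difference <- function(a, b, compute_statistic) {
      pot      <- c(a, b)      # pool both groups together
      shuffled <- sample(pot)  # permute the pooled values
      new_a    <- shuffled[1:length(a)]
      new_b    <- shuffled[(length(a) + 1):length(pot)]
      compute_statistic(new_a, new_b)
    }
    differences <- replicate(10000, one_mean_difference(finalsA, finalsB, mean_diff))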

6.4.4 Conclusion

Let’s visualize the results.


First, observe how the distribution is centered around 0. Under the assumption of the null hypothesis, there is no difference between the final exam averages in the two course offerings and, therefore, the difference clusters around 0.

Also observe that the observed test statistic is quite far from the center. To get a better sense of how far, we compute the p-value.
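As in the coin example, the p-value is the proportion of simulated statistics at least as large as the observed one:

    sum(differences >= observed_diff) / length(differences)  # ~0.08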

This means that the chance of obtaining a mean difference at least as large as \(6.10\) is around 8%. By the standards of the conventional cut-off points we have discussed, we would have enough evidence to refute the null hypothesis at a 90% significance level. Would this be enough to convince us that the tutoring program is indeed effective? Let us consider for a moment what it would mean if it were not.

If we were to demand a higher significance level, say 95%, our observed statistic is no longer significant. The logical next step would be to conclude that the null hypothesis is true, bearing the implication that the tutoring program is ineffective. This would be a statistical fallacy! Even if our results are not significant at the desired level, we do NOT take the null hypothesis to be true. Put another way, we fail to reject the null hypothesis . That is a mouthful!

The problem here is a lack of evidence. A lack of evidence does not prove that something does not exist, e.g., that the tutoring program is not effective; it very well could be, but our study missed it. Indeed, our permutation test only evaluated one criterion, the difference in final exam scores, as a measure for improvement. There are other test statistics or criteria we could have considered, like class participation, which may have benefited from the program. It would be up to the judgment of the department how to use these results in deciding the merit of the tutoring program.

6.4.5 Comparing Summer and Winter Olympic athletes

We end this section with one more example of a permutation test: comparing the weight information of Summer and Winter Olympic athletes. The dataset is available in the name athletes from the edsdata package.

For the sake of this analysis, we focus on Olympic games after the 2000 Summer Olympics.

We can glance at how many athletes we have in each season.

We observe that Summer athletes make up the bulk of this dataset. Before proceeding any further, we should visualize the weight information with an overlaid histogram.


We give our hypothesis statements.

Null hypothesis: In the population, the distribution of weight information in the Summer Olympics is, on average, the same as the Winter Olympics.

Alternative hypothesis: In the population, the distribution of weight information in the Summer Olympics is, on average, different from the Winter Olympics.

Note that the alternative hypothesis, unlike the tutoring program example, does not care whether the weight information for athletes competing in the Winter Olympics is higher or less than that of athletes in the Summer Olympics. It only states that some difference exists. Therefore, the absolute difference in the means would be a good test statistic to use for this problem.

\[ \text{test statistic} = | \mu_B - \mu_A | \]

Note how it does not matter which group ends up as A and likewise for B. Let’s write a function to compute this statistic; it is a slight variation of the mean_diff we saw before.
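A minimal version:

    abs_mean_diff <- function(a, b) {
      abs(mean(b) - mean(a))
    }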

6.4.6 The test

We form two vectors winter_weights and summer_weights that contain the weight information with respect to the season.

We are now ready to perform the permutation test. As before, let us simulate the test statistic under the null hypothesis 10,000 times.

6.4.7 Conclusion

We are ready to visualize the results.


The observed statistic is quite far away from the distribution of simulated test statistics. Let’s do a numerical check.

The chance of obtaining a mean absolute difference at least as large as \(1.29\) is roughly 0.1%. We can safely reject the null hypothesis at a significance level over 99%. This suggests that, assuming our dataset is representative of the population of Olympic athletes, the mean weights of Summer and Winter Olympic athletes are likely different.

6.5 Exercises

Be sure to install and load the following packages into your R environment before beginning this exercise set.

Question 1 The College of Galaxy makes available applicant acceptance information for different ethnicities: White (“White”), American Indian or Alaska Native (“AI/AN”), Asian (“Asian”), Black (“Black”), Hispanic (“Hispanic”), and Native Hawaiian or Other Pacific Islander (“NH/OPI”). The tibble galaxy_acceptance gives the acceptance result from one year.

Question 1.1 What proportion of the total accepted applicants does each ethnicity make up? Add a new variable named prop_accepted that gives the proportion of each ethnicity with respect to the total number of accepted candidates. Assign the resulting tibble to the name galaxy_distribution .

Based on these observations, you may be convinced that the college is biased in favor of enrolling White applicants. Is it justifiable?

To explore the question, you conduct a hypothesis test by comparing the ethnicity distribution at the college to that of degree-granting institutions in the United States. You decide to test the hypothesis that the ethnicity distribution at the College of Galaxy looks like a random sample from the population of accepted applicants at universities across the United States. Using simulation, you can generate what the data would look like if the hypothesis were true . If the observed data does not look like that, you reject the hypothesis.

Thus, you offer the null hypothesis:

  • Null hypothesis: “The distribution of ethnicities of accepted applicants at the College of Galaxy was a random sample from the population of accepted applicants at degree-granting institutions in the United States.”

Question 1.2 With every null hypothesis we write down a corresponding alternative hypothesis . What is the alternative hypothesis in this case?

We have that there are 1533 accepted applicants at the College of Galaxy. Imagine drawing a random sample of 1533 students from among the admitted students at universities across the United States. This is one student admissible pool we could see if the null hypothesis were true.

The Integrated Postsecondary Education Data System (IPEDS) at the National Center for Education Statistics gives data on U.S. colleges, universities, and technical and vocational institutions. As of Fall 2020, they reported the following ethnicity information about admitted applicants at Title IV degree-granting institutions in the U.S:

Question 1.3 Repeat Question 1.1 for ipeds2020 . Assign the resulting tibble to the name ipeds2020_dist .

Under the null hypothesis, we can simulate one “admissible pool” from the population of students in the U.S as follows:

The first element in this vector contains the number of White students in this sample pool, the second element the number of Native Hawaiian or Other Pacific Islander students, and so on.

Question 1.4 For the ethnicity distribution in our sample, we are interested in the proportion of ethnicities that appear in the admissible pool. Write a function prop_from_sample() that takes as an argument some distribution (e.g., ipeds2020_distribution ) and returns a vector containing the proportion of ethnicities that appear in the sample of 1533 people.

Question 1.5 Call prop_from_sample() to create one vector called one_sample that represents one sample of 1533 people from among the admissible students in the United States.

The total variation distance (TVD) is a useful test statistic when comparing two distributions. This distance should be small if the null hypothesis is true because samples will have similar proportions of ethnicities as the population from which the sample is drawn.

Question 1.6 Write a function called compute_tvd() . It takes as an argument a vector of proportions of ethnicities. The first element in the vector is the proportion of White students, the second element the proportion of Native Hawaiian or Other Pacific Islander students, and so on. The function returns the TVD between the given ethnicity distribution and that of the national population.

Question 1.7 Write a function called one_simulated_tvd() . This function takes no arguments. It generates a “sample pool” under the null hypothesis, computes the test statistic, and then return it.

Question 1.8 Using replicate() , run the simulation 10,000 times to produce 10,000 test statistics. Assign the results to a vector called sample_tvds .

The following chunk shows your simulation augmented with an orange dot that shows the TVD between the ethnicity distribution at College of Galaxy and that of the national population.

Question 1.9 Determine whether the following conclusions can be drawn from these data. Explain your answer.

  • The ethnicity distribution of the admitted applicant pool at the College of Galaxy does not look like that of U.S. universities.
  • The ethnicity distribution of the admitted applicant pool at the College of Galaxy is biased toward white applicants.

Question 2: A strange dice. Your friend Jerry invites you to a game of dice. He asks you to roll a dice 10 times and says that he wins $1 each time a 3 turns up and loses $1 on any other face. Jerry’s dice is six-sided, however, the "2" and "4" faces have been replaced with "3" ’s. The following code chunk simulates the results after one game:

While the game seems like an obvious scam, Jerry claims that his dice is no different than a fair dice in the long run. Can you disprove his claim using a hypothesis test?

Question 2.1 Write a function sample_prop that receives two arguments distribution (e.g., weird_dice_probs ) and size (e.g., 10 rolls). The function simulates the game using a dice where the probability of each face is given by distribution and the dice is rolled size many times. The proportion of each face that appeared after the simulation is returned.

The following code chunk simulates the result after playing one round of Jerry’s game. You record the sample proportions of the faces that appeared in a tibble named jerry_die_dist .

Let us define the distribution for what we know is a fair six-sided die.

Here is what the jerry_die_dist distribution looks like when visualized:

Question 2.2 Define a null hypothesis and an alternative hypothesis for this question.

We saw in Section 7.4 that the mean is equivalent to weighting each face by the proportion of times it appears. The mean of jerry_die_dist can be computed as follows:

For reference, here is the mean of a fair six-sided dice. Observe how close this value is to the mean of Jerry’s dice:

The following function mystery_test_stat1() takes a single tibble dist (e.g., jerry_die_dist ) as its argument and computes a test statistic by comparing it to fair_die_dist .

Question 2.3 What test statistic is being used in mystery_test_stat1 ?

Question 2.4 Write a function called one_simulated_stat . The function receives a single argument stat_func . The function generates sample proportions after one round of Jerry’s game under the assumption of the null hypothesis , computes the test statistic from this sample using the argument stat_func , and returns it.

Question 2.5 Complete the following function called simulate_dice_experiment . The function receives two arguments, an observed_dist (e.g., jerry_die_dist ) and a stat_func . The function computes the observed value of the test statistic using observed_dist . It then simulates the game 10,000 times to produce 10,000 different test statistics. The function then prints the p-value and plots a histogram of your simulated test statistics. Also shown is where the observed value falls on this histogram (orange dot) and the cut-off for the 95% significance level.

Question 2.6 Run the experiment using your function simulate_dice_experiment using the observed distribution from jerry_die_dist and the mystery test statistic.

The evidence so far has been unsuccessful in refuting Jerry’s claim. Maybe you should stop playing games with Jerry…

As a desperate final attempt before giving up and agreeing to play Jerry’s game, you try using a different test statistic to simulate called mystery_test_stat2 .

Question 2.7 Repeat Question 2.6 , this time using mystery_test_stat2 instead.

Question 2.8 At a significance level of 95%, what do we conclude from the first experiment? How about the second experiment?

Question 2.9 Examine the difference between the test statistics in mystery_test_stat1 and mystery_test_stat2 . Why is it that the conclusion of the test is different depending on the test statistic selected?

Question 2.10 Which of the following statements are FALSE ? Indicate them by including its number in the following vector pvalue_answers .

  • The p-value printed is the probability that the die is fair.
  • The p-value printed is the probability that the die is NOT fair.
  • The p-value cutoff (5%) is the probability that the die is NOT fair.
  • The p-value cutoff (5%) is the probability of seeing a test statistic as extreme or more extreme than this one if the null hypothesis were true.

Question 2.11 For the statements you selected to be FALSE, explain why they are wrong.

Question 3 This question is a continuation of Question 2 . The following incomplete function experiment_rejects_null receives four arguments: a tibble describing the probability distribution of a dice, a function to compute a test statistic, a p-value cutoff, and a number of repetitions to use. The function simulates 10 rolls of the given dice, and tests the null hypothesis about that dice using the test statistic given by stat_func . The function returns a Boolean: TRUE if the experiment rejects the null hypothesis at p_value_cutoff , and FALSE otherwise.

Question 3.1 Read and understand the above function. Then complete the missing portion that computes the observed value of the test statistic and simulates num_repetitions many test statistics under the null hypothesis.

The following code chunk simulates the result after testing Jerry’s dice with mystery_test_stat1 at the P-value cut-off of 5%. Run it a few times to get a rough sense of the results.

Question 3.2 Repeat the experiment experiment_rejects_null(weird_dice_probs, mystery_test_stat1, 0.05, 250) 300 times using replicate . Assign experiment_results to a vector that stores the result of each experiment.

Note : This code chunk will need some time to finish (approximately a few minutes). This will be a little slow. 300 repetitions of the simulation should require a minute or so of computation, and should be enough to get an answer that is roughly correct.

Question 3.3 Compute the proportion of times the function returned TRUE in experiment_results . Assign your answer to prop_reject .

Question 3.4 Does your answer to Question 3.3 make sense? What value did you expect to get? Put another way, what is the probability that the null hypothesis is rejected when the dice is actually fair?

Question 3.5 What does it mean for the function to return TRUE when weird_dice_probs is passed as an argument? From the perspective of finding the truth about Jerry’s (phony) claim, is the experiment successful? What if the function returned TRUE when fair_die_dist is passed as an argument instead?

Question 4. The United States House of Representatives in the 116th Congress (2019-2021) had 435 members. According to the Center for American Women and Politics (CAWP), 101 were women and 334 were men. The following tibble house gives the head counts:

In this question, we will examine whether women are underrepresented in the chamber.

Question 4.1 If men and women are equally represented in the chamber, then the chance of either gender occupying any seat should be like that of a fair coin flip. For instance, if the chamber consisted of just 10 seats, then one “House of Representatives” might look like:

Using this, write a null and alternative hypothesis for this problem.

Question 4.2 Using Question 4.1 , write a function called one_sample_house that simulates one “House” under the null hypothesis. The function receives two arguments, gender_prop and house_size . The function samples "Female" or "Male" house_size many times where the chance of either gender appearing is given by gender_prop . The function then returns a tibble with the gender head counts in the simulated sample. Following is one possible returned tibble:

Question 4.3 A good test statistic for this problem is the difference between the head count of males and the head count of females. Write a function called one_diff_stat that takes a tibble head_count_tib as an argument (with the format as in Question 4.2). The function computes and returns the test statistic from this tibble.

Question 4.4 Compute the observed value of the test statistic using your one_diff_stat() . Assign the resulting value to the name observed_value_house .

Question 4.5 Write a function called simulate_one_stat that simulates one test statistic. The function receives two arguments, the gender proportions prop and the total seats ( total_seats ) to fill in the simulated “House”. The function simulates a sample under the null hypothesis, and computes and returns the test statistic from the sample.

Question 4.6 Simulate 10,000 different test statistics under the null hypothesis. Store the results to a vector named test_stats .

The following ggplot2 code visualizes your results:

Question 4.7 Based on the experiment, what can you say about the representation of women in the House?

Let us now approach the analysis another way. Instead of assuming equal representation, let us base the comparison on the representation of women candidates in the preceding 2018 U.S. House Primary elections. The tibble house_primary from the edsdata package compiles primary election results for Democratic and Republican U.S. House candidates running in elections from 2012 to 2018. The data is prepared by the Michael G. Miller Dataverse, part of the Harvard Dataverse.

Question 4.8 Form a two-element vector named primary_prop that gives the proportion of female and male candidates, respectively, in the 2018 U.S. House Primary elections. This can be accomplished as follows:

  • Filter the data to the year 2018 . The data should not contain the results for any elections that resulted in a runoff (where runoff = 1 ).
  • Summarize and count each gender that appears in gender in the resulting tibble.
  • Add a variable that computes the proportions from these counts.
  • Pull the proportions as a vector and assign it to primary_prop .

Question 4.9 Repeat Question 4.6 this time using the proportions given by primary_prop .

The following code visualizes the revised result:

Question 4.10 Compute the p-value using the test_stats you generated by comparing it with observed_value_house . Assign your answer to p_value .

Question 4.11 Why is it that in the first histogram the simulated test statistics cluster around 0, while in the second histogram the simulated values cluster around a much greater value? Is the statement of the null hypothesis the same in both cases?

Question 4.12 Now that we have analyzed the data in two ways, are women equally represented in the House? Why or why not?

Question 5. Cathy recently received from a friend a replica dollar coin which appears to be slightly biased towards “Heads”. Cathy tosses the coin 20 times in a row and counts how many times “Heads” turns up. She repeats this for 10 trials. Her results are summarized in the following tibble:

Question 5.1 What is the total heads observed? Assign your result to a double named total_heads_observed .

Let us write an experiment and check how plausible it is for this coin to be fair.

Question 5.2 Given the outcome of 20 trials, which of the following test statistics would be reasonable for this hypothesis test?

  • The total number of heads.
  • The total number of heads minus the total number of tails.
  • Whether there is at least one head.
  • Whether there is at least one tail.
  • The total variation distance between the probability distribution of a fair coin and the observed distribution of heads and tails.
  • The trial with the minimum number of heads.

Assign the name good_test_stats to a vector of integers corresponding to these test statistics.

Question 5.3 Let us write code that simulates tossing a fair coin. Write a function called one_test_stat that receives a parameter num_trials. The function simulates 20 tosses of a fair coin, records the number of heads, and repeats this procedure num_trials many times. The function returns the total number of heads over the given number of trials.

Question 5.4 Repeat Cathy’s experiment 10,000 times. Store the results in total_head_stats .

Question 5.5 Compute a p-value using total_head_stats . Assign the result to p_value .

Question 5.6 Based on the experiment, how plausible is it that Cathy's coin is fair?

Question 6. A popular course in the College of Groundhog is an undergraduate programming course, CSC1234. In the spring semester of 2022, the course had three sections, A, B, and C. The sections were taught by different instructors, but the course had the same textbook, the same assignments, and the same exams, and the same formula was applied to determine the final grade. At the end of the semester, some students in Sections A and C came to their instructors and asked whether the instructors had been harsher than the instructor for Section B, because several buddies of theirs in Section B did better in the course. Time for a hypothesis test!

The section and score information for the semester is available in the tibble csc1234 from the edsdata package.

We will use a permutation test to see whether the scores for Sections A and C are indeed significantly lower than the scores for Section B. That is, we will make three pairwise comparisons: Section A with B, Section C with B, and Section C with A.

Question 6.1 Compute the group-wise mean for each section of the course. The tibble should contain two variables: the section name and the mean of that section. Assign the resulting tibble to the name section_means .

Question 6.2 Visualize a histogram of the scores in csc1234. Use a facet wrap on Section so that you can view the three distributions side by side. We suggest using 10 bins.

We can develop a chance model by hypothesizing that any section's scores look like a random sample drawn from all of the student scores across all three sections. We can then see the difference in mean scores for each of the three pairs of randomly drawn “sections”. This is a specified chance model that we can simulate under and is, therefore, the null hypothesis. A sketch of the idea appears below.
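The exercises themselves are in R; purely as a hedged illustration of this chance model, here is a minimal Python sketch with made-up scores standing in for csc1234 (the numbers, section sizes, and one-sided p-value are assumptions for demonstration only):

```python
import random

# Made-up stand-ins for the csc1234 scores (NOT the real data).
scores = {
    "A": [72, 68, 75, 70, 66, 74],
    "B": [80, 85, 78, 88, 82, 79],
}

def mean(xs):
    return sum(xs) / len(xs)

# Observed test statistic: difference in section means (A - B).
observed_diff = mean(scores["A"]) - mean(scores["B"])

# Null hypothesis: section labels don't matter, so shuffle the pooled
# scores and deal them back out into same-sized fake "sections".
pooled = scores["A"] + scores["B"]
sim_diffs = []
for _ in range(10_000):
    random.shuffle(pooled)
    sim_diffs.append(mean(pooled[:6]) - mean(pooled[6:]))

# One-sided p-value: how often a shuffled difference is as low as observed.
p_value = sum(d <= observed_diff for d in sim_diffs) / len(sim_diffs)
print(f"observed diff = {observed_diff:.2f}, p = {p_value:.4f}")
```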

Question 6.3 Define a good alternative hypothesis for this problem.

Question 6.4 Write a function called mean_differences that takes a tibble as its single argument. It summarizes this tibble by computing the mean score (in Score) for each section (in Section). The function returns a three-element vector of mean differences for each pair: the difference in mean scores between A and B (“A-B”), C and B (“C-B”), and C and A (“C-A”).

Question 6.5 Compute the observed differences in the means of the three sections using mean_differences . Store the results in observed_differences .

The following code chunk puts your observed values into a tibble named observed_diff_tibble .

Question 6.6 Write a function scores_permute_test that does the following:

  • From csc1234 form a new variable that shuffles the values in Score using sample . Overwrite the variable Score with the shuffled values.
  • Call mean_differences on this shuffled tibble.
  • Return the vector of differences.

Question 6.7 Use replicate on the scores_permute_test function you wrote to generate 1,000 sample differences.

The following code chunk creates a tibble named differences_tibble from the simulated test statistics you generated above.

Question 6.8 Generate three histograms using the results in differences_tibble . As with Question 6.2 , use a facet wrap on each pairing (i.e., A-B , C-B , and C-A ). Then attach a red point to each histogram indicating the observed value of the test statistic (use observed_diff_tibble ). We suggest using 20 bins for the histograms.

Question 6.9 The bulk of the distribution in each of the three histograms is centered around 0. Using what you know about the stated null hypothesis, why do the distributions turn out this way?

Question 6.10 By examining the above three histograms and where the observed value of the test statistic falls, which difference among the three do you think is the most egregious?

Question 6.11 Based on your answer to Question 6.10 , can we say that the hypothesis test brings enough evidence to show that the drop in student scores was deliberate and that the instructor was unfair in grading?

What Is Hypothesis Testing in Statistics? Types and Examples

In today's data-driven world, decisions are based on data all the time. Hypotheses play a crucial role in that process, whether in business decisions, the health sector, academia, or quality improvement. Without hypotheses and hypothesis tests, you risk drawing the wrong conclusions and making bad decisions. In this tutorial, you will look at hypothesis testing in statistics.

What Is Hypothesis Testing in Statistics?

Hypothesis testing is a type of statistical analysis in which you put your assumptions about a population parameter to the test. It is used to estimate the relationship between two statistical variables.

Let's discuss a few examples of statistical hypotheses from real life:

  • A teacher assumes that 60% of his college's students come from lower-middle-class families.
  • A doctor believes that 3D (Diet, Dose, and Discipline) is 90% effective for diabetic patients.

Now that you know what hypothesis testing is, let's look at the formula it uses and at the main types of hypothesis testing in statistics.

Hypothesis Testing Formula

Z = (x̅ – μ0) / (σ / √n)

  • Here, x̅ is the sample mean,
  • μ0 is the population mean,
  • σ is the population standard deviation, and
  • n is the sample size.

How Does Hypothesis Testing Work?

An analyst performs hypothesis testing on a statistical sample to present evidence of the plausibility of the null hypothesis. Measurements and analyses are conducted on a random sample of the population to test a theory. Analysts use a random population sample to test two hypotheses: the null and alternative hypotheses.

The null hypothesis is typically an equality hypothesis between population parameters; for example, a null hypothesis may claim that the population mean return equals zero. The alternative hypothesis is essentially the inverse of the null hypothesis (e.g., the population mean return is not equal to zero). As a result, they are mutually exclusive, and exactly one of them will always be correct.


Null Hypothesis and Alternate Hypothesis

The Null Hypothesis is the assumption that the event will not occur. A null hypothesis has no bearing on the study's outcome unless it is rejected.

H0 is the symbol for it, and it is pronounced H-naught.

The Alternate Hypothesis is the logical opposite of the null hypothesis. The acceptance of the alternative hypothesis follows the rejection of the null hypothesis. H1 is the symbol for it.

Let's understand this with an example.

A sanitizer manufacturer claims that its product kills 95 percent of germs on average. 

To put this company's claim to the test, create a null and alternate hypothesis.

H0 (Null Hypothesis): Average = 95%.

Alternative Hypothesis (H1): The average is less than 95%.

Another straightforward example to understand this concept is determining whether or not a coin is fair and balanced. The null hypothesis states that the probability of heads is equal to the probability of tails. In contrast, the alternative hypothesis states that the probabilities of heads and tails would be very different.


Hypothesis Testing Calculation With Examples

Let's consider a hypothesis test for the average height of women in the United States. Suppose our null hypothesis is that the average height is 5'4" (64 inches). We gather a sample of 100 women and determine that their average height is 5'5" (65 inches). The population standard deviation is known to be 2 inches.

To calculate the z-score, we would use the following formula:

z = (x̅ – μ0) / (σ / √n)

z = (65" – 64") / (2" / √100)

z = 1 / 0.2 = 5

We will reject the null hypothesis, as a z-score of 5 is far beyond the usual two-tailed critical value of 1.96, and conclude that there is evidence to suggest that the average height of women in the US is greater than 5'4".
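To make the arithmetic above reproducible, here is a minimal Python sketch of the same calculation (the heights are converted to inches, and the scipy call for a two-tailed p-value is an illustrative assumption, not part of the original example):

```python
from math import sqrt
from scipy import stats

x_bar = 65    # sample mean height: 5'5" in inches
mu_0 = 64     # hypothesized population mean: 5'4" in inches
sigma = 2     # known population standard deviation, in inches
n = 100       # sample size

z = (x_bar - mu_0) / (sigma / sqrt(n))
p_value = 2 * stats.norm.sf(abs(z))  # two-tailed p-value

print(f"z = {z:.2f}, p = {p_value:.2g}")  # z = 5.00, p = 5.7e-07
```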

Steps of Hypothesis Testing

Step 1: Specify Your Null and Alternate Hypotheses

It is critical to rephrase your original research hypothesis (the prediction that you wish to study) as a null (H0) and alternative (Ha) hypothesis so that you can test it quantitatively. Your first hypothesis, which predicts a link between variables, is generally your alternate hypothesis. The null hypothesis predicts no link between the variables of interest.

Step 2: Gather Data

For a statistical test to be legitimate, sampling and data collection must be done in a way that is meant to test your hypothesis. You cannot draw statistical conclusions about the population you are interested in if your data is not representative.

Step 3: Conduct a Statistical Test

A variety of statistical tests are available, but they all compare within-group variance (how spread out the data is within a category) against between-group variance (how different the categories are from one another). If the between-group variance is large enough that there is little or no overlap between groups, your statistical test will display a low p-value: the disparities between these groups are unlikely to have occurred by chance. Alternatively, if there is high within-group variance and low between-group variance, your statistical test will show a high p-value: any difference you find across groups is most likely attributable to chance. The number of variables and the level of measurement of your collected data will influence your choice of statistical test.

Step 4: Determine Rejection Of Your Null Hypothesis

Your statistical test results must determine whether your null hypothesis should be rejected or not. In most circumstances, you will base your judgment on the p-value provided by the statistical test. In most circumstances, your preset level of significance for rejecting the null hypothesis will be 0.05 - that is, when there is less than a 5% likelihood that these data would be seen if the null hypothesis were true. In other circumstances, researchers use a lower level of significance, such as 0.01 (1%). This reduces the possibility of wrongly rejecting the null hypothesis.

Step 5: Present Your Results 

The findings of hypothesis testing will be discussed in the results and discussion sections of your research paper, dissertation, or thesis. You should include a concise overview of the data and a summary of the results of your statistical test in the results section. In the discussion, you can talk about whether your results confirmed your initial hypothesis or not. "Rejecting" or "failing to reject" the null hypothesis is the formal terminology used in hypothesis testing, and is likely a must for your statistics assignments.

Types of Hypothesis Testing

Z Test

To determine whether a discovery or relationship is statistically significant, hypothesis testing can use a z-test. It usually checks whether two means are the same (the null hypothesis). A z-test can be applied only when the population standard deviation is known and the sample size is 30 data points or more.

T Test

A t-test is a statistical test employed to compare the means of two groups. It is frequently used in hypothesis testing to determine whether two groups differ, or whether a procedure or treatment affects the population of interest. A hedged sketch follows.
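Here is a sketch of a two-sample t-test with scipy; the two groups of measurements are invented for demonstration only:

```python
from scipy import stats

# Invented measurements from two independent groups.
group_a = [82, 79, 88, 91, 76, 85, 84, 90]
group_b = [75, 72, 80, 78, 69, 74, 77, 73]

# H0: the two group means are equal; Ha: they differ.
t_stat, p_val = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_val:.4f}")
if p_val < 0.05:
    print("Reject H0: the group means differ significantly.")
else:
    print("Fail to reject H0.")
```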

Chi-Square 

A Chi-square test is used for hypothesis testing about whether observed data match what is expected. To determine whether the expected and observed results are well-fitted, the Chi-square test analyzes the differences between categorical variables from a random sample. The test's fundamental premise is that the observed values in your data are compared against the predicted values that would be present if the null hypothesis were true. A small example follows.
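Here is a hedged sketch of a chi-square goodness-of-fit test, echoing the fair-die questions earlier on this page; the observed counts are invented:

```python
from scipy import stats

# Invented counts of each face over 60 rolls of a die.
observed = [5, 8, 9, 8, 10, 20]
expected = [10, 10, 10, 10, 10, 10]  # a fair die: 60 rolls / 6 faces

chi2, p_val = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, p = {p_val:.4f}")
# A small p-value is evidence against the null hypothesis of a fair die.
```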

Hypothesis Testing and Confidence Intervals

Both confidence intervals and hypothesis tests are inferential techniques that depend on approximating the sampling distribution. Confidence intervals use data from a sample to estimate a population parameter. Hypothesis tests use data from a sample to examine a given hypothesis; for this, we must have a hypothesized parameter value.

Bootstrap distributions and randomization distributions are created using comparable simulation techniques. The observed sample statistic is the focal point of a bootstrap distribution, whereas the null hypothesis value is the focal point of a randomization distribution.

A confidence interval gives a range of plausible estimates for the population parameter. In this lesson, we created just two-tailed confidence intervals, and there is a direct connection between these and two-tailed hypothesis tests: the two typically give the same result. In other words, a hypothesis test at the 0.05 level will virtually always fail to reject the null hypothesis if the 95% confidence interval contains the hypothesized value, and it will nearly certainly reject the null hypothesis if the 95% confidence interval does not contain it. A sketch of this duality follows.
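A minimal sketch of this duality, reusing the height example from earlier (the normal-approximation shortcut and the numbers are assumptions for illustration):

```python
from math import sqrt

x_bar, mu_0, sigma, n = 65, 64, 2, 100
se = sigma / sqrt(n)

# 95% confidence interval for the population mean.
ci_low, ci_high = x_bar - 1.96 * se, x_bar + 1.96 * se

# Two-tailed z-test at the 0.05 level.
z = (x_bar - mu_0) / se
rejects = abs(z) > 1.96

print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")        # (64.61, 65.39)
print(f"mu_0 inside CI: {ci_low <= mu_0 <= ci_high}")  # False
print(f"test rejects H0: {rejects}")                   # True -- same verdict
```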

Simple and Composite Hypothesis Testing

Depending on the population distribution, you can classify the statistical hypothesis into two types.

Simple Hypothesis: A simple hypothesis specifies an exact value for the parameter.

Composite Hypothesis: A composite hypothesis specifies a range of values.

A company is claiming that their average sales for this quarter are 1000 units. This is an example of a simple hypothesis.

Suppose the company claims that the sales are in the range of 900 to 1000 units. Then this is a case of a composite hypothesis.

One-Tailed and Two-Tailed Hypothesis Testing

The one-tailed test, also called a directional test, considers a critical region of data: if the test sample falls into it, the null hypothesis is rejected, which in turn means accepting the alternate hypothesis.

In a one-tailed test, the critical region is one-sided, meaning the test checks whether the sample statistic is greater than, or less than, a specific value, but not both.

In a two-tailed test, the critical region is two-sided: the test checks whether the sample statistic is significantly greater than or significantly less than the hypothesized value.

If the sample falls within the critical region, the null hypothesis is rejected and the alternate hypothesis is accepted.


Right Tailed Hypothesis Testing

If the greater-than (>) sign appears in your hypothesis statement, you are using a right-tailed test, also known as an upper test; in other words, the disparity is to the right. For instance, you can compare battery life before and after a change in production. Your hypothesis statements could be the following if you want to know whether the battery life is longer than the original (let's say 90 hours):

  • The null hypothesis: battery life has not increased (H0: μ ≤ 90).
  • The alternative hypothesis: battery life has risen (H1: μ > 90).

The crucial point in this situation is that the alternate hypothesis (H1), not the null hypothesis, decides whether you get a right-tailed test.

Left Tailed Hypothesis Testing

Alternative hypotheses that assert the true value of a parameter is lower than the null hypothesis are tested with a left-tailed test; they are indicated by the less-than sign "<".

For contrast, suppose H0: mean = 50 and H1: mean ≠ 50. According to H1, the mean can be either greater than or less than 50, so this is an example of a two-tailed test.

In a similar manner, if H0: mean ≥ 50, then H1: mean < 50. Here the alternative allows only means less than 50, so this is a one-tailed (left-tailed) test. The choice of tail changes how the p-value is computed, as sketched below.
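A hedged sketch using scipy's standard normal distribution shows how the same observed statistic yields different p-values depending on the tail (the z-score of 1.8 is an invented example):

```python
from scipy import stats

z = 1.8  # an invented observed z-score

p_right = stats.norm.sf(z)         # right-tailed: P(Z >= z)
p_left = stats.norm.cdf(z)         # left-tailed:  P(Z <= z)
p_two = 2 * stats.norm.sf(abs(z))  # two-tailed:   P(|Z| >= |z|)

print(f"right: {p_right:.3f}, left: {p_left:.3f}, two: {p_two:.3f}")
# right: 0.036, left: 0.964, two: 0.072 -- at alpha = 0.05 the
# right-tailed test rejects H0 while the two-tailed test does not.
```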

Type 1 and Type 2 Error

A hypothesis test can result in two types of errors.

Type 1 Error: A Type I error occurs when the sample results lead to rejecting the null hypothesis even though it is actually true.

Type 2 Error: A Type II error occurs, conversely, when the null hypothesis is not rejected even though it is actually false.

Suppose a teacher evaluates the examination paper to decide whether a student passes or fails.

H0: Student has passed

H1: Student has failed

A Type I error would be the teacher failing the student [rejecting H0] although the student scored the passing marks [H0 was true].

A Type II error would be the teacher passing the student [failing to reject H0] although the student did not score the passing marks [H1 is true].

Level of Significance

The alpha value is a criterion for determining whether a test statistic is statistically significant. In a statistical test, Alpha represents an acceptable probability of a Type I error. Because alpha is a probability, it can be anywhere between 0 and 1. In practice, the most commonly used alpha values are 0.01, 0.05, and 0.1, which represent a 1%, 5%, and 10% chance of a Type I error, respectively (i.e. rejecting the null hypothesis when it is in fact correct).
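Because alpha is, by definition, the long-run rate of Type I errors, a simulation can recover it; here is a hedged sketch (the normal population, sample size, and one-sample t-test are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 30, 10_000

rejections = 0
for _ in range(trials):
    # Sample from a population where H0 (mean = 0) is actually true.
    sample = rng.normal(loc=0.0, scale=1.0, size=n)
    t_stat, p_val = stats.ttest_1samp(sample, popmean=0.0)
    if p_val < alpha:
        rejections += 1  # each rejection here is a Type I error

print(f"estimated Type I error rate: {rejections / trials:.3f}")  # ~0.05
```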


A p-value is a metric that expresses the likelihood that an observed difference could have occurred by chance. As the p-value decreases, the statistical significance of the observed difference increases. If the p-value is below your chosen significance level, you reject the null hypothesis.

Consider an example in which you are testing whether a new advertising campaign has increased the product's sales. The null hypothesis states that there is no change in sales due to the new campaign. The p-value is the probability of observing an increase at least as large as the one seen if that null hypothesis were true. A p-value of 0.30 means there is a 30% chance of seeing such a difference purely by chance; a p-value of 0.03 means there is only a 3% chance. The lower the p-value, the stronger the evidence against the null hypothesis, and the more plausible it becomes that the new advertising campaign really did change sales.
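As a concrete, hedged example of computing a p-value, consider a coin-fairness check with invented data, using scipy's binomial distribution:

```python
from scipy import stats

n, observed_heads = 20, 15  # invented data: 15 heads in 20 tosses

# H0: the coin is fair (p = 0.5). One-tailed p-value: the probability
# of seeing 15 or more heads if H0 were true.
p_value = stats.binom.sf(observed_heads - 1, n, 0.5)
print(f"p-value = {p_value:.4f}")  # about 0.0207
if p_value < 0.05:
    print("Reject H0: the coin appears biased towards heads.")
```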

Why is Hypothesis Testing Important in Research Methodology?

Hypothesis testing is crucial in research methodology for several reasons:

  • Provides evidence-based conclusions: It allows researchers to make objective conclusions based on empirical data, providing evidence to support or refute their research hypotheses.
  • Supports decision-making: It helps make informed decisions, such as accepting or rejecting a new treatment, implementing policy changes, or adopting new practices.
  • Adds rigor and validity: It adds scientific rigor to research using statistical methods to analyze data, ensuring that conclusions are based on sound statistical evidence.
  • Contributes to the advancement of knowledge: By testing hypotheses, researchers contribute to the growth of knowledge in their respective fields by confirming existing theories or discovering new patterns and relationships.

Limitations of Hypothesis Testing

Hypothesis testing has some limitations that researchers should be aware of:

  • It cannot prove or establish the truth: Hypothesis testing provides evidence to support or reject a hypothesis, but it cannot confirm the absolute truth of the research question.
  • Results are sample-specific: Hypothesis testing is based on analyzing a sample from a population, and the conclusions drawn are specific to that particular sample.
  • Possible errors: During hypothesis testing, there is a chance of committing type I error (rejecting a true null hypothesis) or type II error (failing to reject a false null hypothesis).
  • Assumptions and requirements: Different tests have specific assumptions and requirements that must be met to accurately interpret results.

After reading this tutorial, you should have a much better understanding of hypothesis testing, one of the most important concepts in the field of Data Science. The majority of hypotheses are based on speculation about observed behavior, natural phenomena, or established theories.


FAQs

1. What is hypothesis testing in statistics with example?

Hypothesis testing is a statistical method used to determine if there is enough evidence in a sample data to draw conclusions about a population. It involves formulating two competing hypotheses, the null hypothesis (H0) and the alternative hypothesis (Ha), and then collecting data to assess the evidence. An example: testing if a new drug improves patient recovery (Ha) compared to the standard treatment (H0) based on collected patient data.

2. What is hypothesis testing and its types?

Hypothesis testing is a statistical method used to make inferences about a population based on sample data. It involves formulating two hypotheses: the null hypothesis (H0), which represents the default assumption, and the alternative hypothesis (Ha), which contradicts H0. The goal is to assess the evidence and determine whether there is enough statistical significance to reject the null hypothesis in favor of the alternative hypothesis.

Types of hypothesis testing:

  • One-sample test: Used to compare a sample to a known value or a hypothesized value.
  • Two-sample test: Compares two independent samples to assess if there is a significant difference between their means or distributions.
  • Paired-sample test: Compares two related samples, such as pre-test and post-test data, to evaluate changes within the same subjects over time or under different conditions.
  • Chi-square test: Used to analyze categorical data and determine if there is a significant association between variables.
  • ANOVA (Analysis of Variance): Compares means across multiple groups to check if there is a significant difference between them.

3. What are the steps of hypothesis testing?

The steps of hypothesis testing are as follows:

  • Formulate the hypotheses: State the null hypothesis (H0) and the alternative hypothesis (Ha) based on the research question.
  • Set the significance level: Determine the acceptable level of error (alpha) for making a decision.
  • Collect and analyze data: Gather and process the sample data.
  • Compute test statistic: Calculate the appropriate statistical test to assess the evidence.
  • Make a decision: Compare the test statistic with critical values or p-values and determine whether to reject H0 in favor of Ha or not.
  • Draw conclusions: Interpret the results and communicate the findings in the context of the research question.

4. What are the 2 types of hypothesis testing?

  • One-tailed (or one-sided) test: Tests for the significance of an effect in only one direction, either positive or negative.
  • Two-tailed (or two-sided) test: Tests for the significance of an effect in both directions, allowing for the possibility of a positive or negative effect.

The choice between one-tailed and two-tailed tests depends on the specific research question and the directionality of the expected effect.

5. What are the 3 major types of hypothesis?

The three major types of hypotheses are:

  • Null Hypothesis (H0): Represents the default assumption, stating that there is no significant effect or relationship in the data.
  • Alternative Hypothesis (Ha): Contradicts the null hypothesis and proposes a specific effect or relationship that researchers want to investigate.
  • Nondirectional Hypothesis: An alternative hypothesis that doesn't specify the direction of the effect, leaving it open for both positive and negative possibilities.


Hypothesis for Computer Science Researchers

I’m in the process of trying to turn my work on Hypothesis into a PhD and I realised that I don’t have a good self-contained summary as to why researchers should care about it.

So this is that piece. I’ll try to give a from scratch introduction to the why and what of Hypothesis. It’s primarily intended for potential PhD supervisors, but should be of general interest as well (especially if you work in this field).

Why should I care about Hypothesis from a research point of view?

The short version:

Hypothesis takes an existing style of testing (property-based testing) which has proven highly effective in practice and makes it accessible to a much larger audience. It does so by taking several previously unconnected ideas from the existing research literature on testing and verification, and combining them to produce a novel implementation that has proven very effective in practice.

The long version is the rest of this article.

The remainder is divided into several sections:

  • What is Hypothesis? is a from-scratch introduction to Hypothesis. If you are already familiar with property-based testing (e.g. from QuickCheck) you can probably skip this.
  • How is Hypothesis innovative? is about the current state of the art of Hypothesis and why it’s interesting. If you’ve already read How Hypothesis Works this is unlikely to teach you anything new and you can skip it.
  • What prior art is it based on? is a short set of references for some of the inspirations for Hypothesis. You probably shouldn’t skip this, because it’s short and the linked material is all interesting.
  • What are some interesting research directions? explores possible directions I’m looking into for the future of Hypothesis, some of which I would hope to include in any PhD related to it that I worked on. You probably shouldn’t skip this if you care about this document at all.
  • What should you do with this information? simply closes off the article and winds things down.

So, without further ado, the actual content.

What is Hypothesis?

Hypothesis is an implementation of property-based testing, an idea that originated with a Haskell library called QuickCheck.

Property-based testing is a way to augment your unit tests with a source of structured random data that allows a tool to explore the edge cases of your tests and attempt to find errors automatically. I’ve made a longer and more formal discussion of this definition in the past.

An example of a property-based test using Hypothesis:
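A reconstruction of the example described in the surrounding text (the exact original snippet is not shown here, so treat this as a plausible stand-in rather than the post's verbatim code):

```python
from hypothesis import given
import hypothesis.strategies as st

@given(st.lists(st.integers()))
def test_sorting_is_idempotent(ls):
    # Sorting a list twice should give the same result as sorting once.
    assert sorted(sorted(ls)) == sorted(ls)
```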

This exposes a normal function which can be picked up by a standard runner such as py.test. You can also just call it directly, like any other Python function.

When the test is run, Hypothesis will generate random lists of integers and pass them to the test. The test sorts the integers, then sorts them again, and asserts that the two results are the same.

As long as the test passes for every input Hypothesis feeds it this will appear to be a normal test. If it fails however, Hypothesis will then repeatedly rerun it with progressively simpler examples to try and find a minimal input that causes the failure.

To see this, suppose we implemented the following rather broken implementation of sorted:
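A stand-in consistent with the behaviour described below (the post's original snippet is not shown, so this shadowed sorted that merely reverses the list is an assumption):

```python
def sorted(ls):  # shadows the builtin, for demonstration only
    # A rather broken "sort": it just reverses the list. Applying it
    # twice undoes the reversal, so sorted(sorted(ls)) != sorted(ls)
    # for any list that is not equal to its own reverse.
    return list(reversed(ls))
```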

Then, on running it, we would see Hypothesis report a minimal falsifying example.

Hypothesis probably started with a much more complicated example (the test fails for essentially any list with more than one element) and then successfully reduced it to the simplest possible example: A list with two distinct elements.

Importantly, when the test is rerun, Hypothesis will start from the falsifying example it found last time rather than trying to generate and shrink a new one from scratch. In this particular case that doesn't matter very much - the example is found very quickly and it always finds the same one - but for more complex and slower tests this is a vital part of the development workflow: it means that tests run much faster and don't stop failing until the bug is actually fixed.

Tests can also draw more data as they execute:
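A reconstruction matching the failure described next (again a plausible stand-in rather than the post's exact code):

```python
from hypothesis import given
import hypothesis.strategies as st

@given(st.data())
def test_preceding_element_is_smaller(data):
    ls = sorted(data.draw(st.lists(st.integers(), min_size=1)))
    i = data.draw(st.integers(min_value=0, max_value=len(ls) - 1))
    # Intended: in a sorted list, the element before position i is no
    # larger than the element at position i. But when i == 0, ls[i - 1]
    # is ls[-1]: Python's negative indexing wraps to the LAST element.
    assert ls[i - 1] <= ls[i]
```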

This fails because we've forgotten that i may be zero, and also about Python's negative indexing of lists; Hypothesis shrinks the failure down to a minimal demonstration.

Simplification and example saving work as normal for data drawn in this way.

Hypothesis also has a form of model-based testing, in which you specify a set of valid operations on your API and it attempts to generate whole programs using those operations and find a simple one that breaks.

How is Hypothesis innovative?

From an end user point of view, Hypothesis adds several important things:

  • It exists at all and people use it. Historically this sort of testing has been found mostly within the functional programming community, and attempts to make it work in other languages have not seen much success or widespread adoption. Some of this is due to novel implementation details in Hypothesis, and some is due to design decisions making it “feel” like normal testing instead of formal methods.
  • Specifying data generators is much easier than in traditional QuickCheck methods, and you get a great deal more functionality “for free” when you do. This is similar to test.check for Clojure, or indeed to the Erlang version of QuickCheck , but some of the design decisions of Hypothesis make it significantly more flexible here.
  • The fact that arbitrary examples can be saved and replayed significantly improves the development work-flow. Other implementations of property-based testing either don’t do this at all, only save the seed, or rely on being able to serialize the generated objects (which can break invariants when reading them back in).
  • The fact that you can generate additional data within the test is often extremely useful, and seems to be unique to Hypothesis in this category of testing tool.

These have worked together well to fairly effectively bring property based testing “to the masses”, and Hypothesis has started to see increasingly widespread use within the Python community, and is being actively used in the development of tools and libraries, as well as in the development of both CPython and pypy, the two major implementations of Python.

Much of this was made possible by Hypothesis’s novel implementation.

From an implementation point of view, the novel feature of Hypothesis is this: Unlike other implementations of property-based testing, it does not need to understand the structure of the data it is generating at all (it sometimes has to make guesses about it, but its correctness is not dependent on the accuracy of those guesses).

Hypothesis is divided into three logically distinct parts:

  • A core engine called Conjecture , which can be thought of as an interactive fuzzer for lightly structured byte streams.
  • A strategy library, which is designed to take Conjecture’s output and turn it into arbitrary values representable in the programming language.
  • An interface to external test runners that takes tests built on top of the strategy library and runs them using Conjecture (in Python this mostly just consists of exposing a function that the test runners can pick up, but in the Java prototype this is more involved and ends up having to interact with some interesting JUnit-specific features).

Conjecture is essentially the interesting part of Hypothesis’s implementation and is what supports most of its functionality: Generation, shrinking, and serialization are all built into the core engine, so implementations of strategies do not require any awareness of these features to be correct. They simply repeatedly ask the Conjecture engine for blocks of bytes, which it duly provides, and they return the desired result.

If you want to know more about this, I have previously written How Hypothesis Works , which provides a bit more detail about Conjecture and how Hypothesis is built on top of it.

What prior art is it based on?

I’ve done a fair bit of general reading of the literature in the course of working on Hypothesis.

The two main papers on which Hypothesis is based are:

  • QuickCheck: a lightweight tool for random testing of Haskell programs essentially started the entire field of property-based testing. Hypothesis began life as a QuickCheck implementation, and its user facing API continues to be heavily based on QuickCheck, even though the implementation has diverged very heavily from it.
  • EXPLODE: a lightweight, general system for finding serious storage system errors provided the key idea on which the Conjecture engine is based - instead of doing static data generation separate from the tests, provide tests with an interactive primitive from which they can draw data.

Additionally, the following are major design inspirations in the Conjecture engine, although their designs are not currently used directly:

  • American Fuzzy Lop is an excellent security-oriented fuzzer, although one without many academic connections. I've learned a fair bit about the design of fuzzers from it. For a variety of pragmatic reasons I don't currently use its most important innovation (branch coverage metrics as a tool for corpus discovery), but I've successfully prototyped implementations of that on top of Hypothesis which work pretty well.
  • Swarm Testing drove a lot of the early designs of Hypothesis’s data generation. It is currently not explicitly present in the Conjecture implementation, but some of what Conjecture does to induce deliberate correlations in data is inspired by it.

What are some interesting research directions?

I have a large number of possible directions that my work on Hypothesis could be taken.

None of these are necessarily a thing that would be the focus of a PhD - in doing a PhD I would almost certainly focus on a more specific research question that might include some or all of them. These are just areas that I am interested in exploring which I think might form an interesting starting point, and whatever focus I actually end up with will likely be more carefully tailored in discussion with my potential supervisors.

One thing that’s also worth considering: Most of these research directions are ones that would result in improvements to Hypothesis without changing its public interface. This results in a great practical advantage to performing the research because of the relatively large (and ever-growing) corpus of open source projects which are already using Hypothesis - many of these changes could at least partly be validated by just running peoples’ existing tests and seeing if any new and interesting bugs are found!

Without further ado, here are some of what I think are the most interesting directions to go next.

More structured byte streams

My current immediate research focus on Hypothesis is to replace the core Conjecture primitive with a more structured one that bears a stronger resemblance to its origins in EXPLODE. This is designed to address a number of practical problems that Hypothesis users currently experience (mostly performance related), but it also opens up a number of other novel abstractions that can be built on top of the core engine.

The idea is to pare down the interface so that when calling into Conjecture you simply draw a single byte, specifying a range of possible valid bytes. This gives Conjecture much more fine-grained information to work with, which opens up a number of additional features and abstractions that can be built on top of it.

From this primitive you can then rebuild arbitrary weighted samplers that shrink correctly (using a variation of the Alias Method), and arbitrary grammars (probably using Boltzmann Samplers or similar).
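A toy sketch of the idea (this is an illustration of the primitive, not Conjecture's real API; names like ToyEngine are invented):

```python
import random

class ToyEngine:
    """Toy byte source: records draws so runs can be replayed exactly."""

    def __init__(self, replay=None):
        self.replay = list(replay) if replay is not None else None
        self.record = []

    def draw_byte(self, lo, hi):
        # Draw one byte in [lo, hi], recording it for replay/shrinking.
        if self.replay:
            b = min(max(self.replay.pop(0), lo), hi)
        else:
            b = random.randint(lo, hi)
        self.record.append(b)
        return b

def draw_die_roll(engine):
    # A "strategy" built on the primitive: a uniform value in 1..6.
    return engine.draw_byte(0, 5) + 1

fresh = ToyEngine()
roll = draw_die_roll(fresh)
# Replaying the recorded bytes reproduces the same value, which is what
# lets the engine save, replay, and shrink examples generically.
assert draw_die_roll(ToyEngine(fresh.record)) == roll
```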

This will provide a much more thorough basis for high quality data generation than the current rather ad hoc method of specifying byte streams.

This is perhaps more engineering than research, but I think it would at the bare minimum make any paper I wrote about the core approach of Hypothesis significantly more compelling, and it contains a number of interesting applications of the theory.

Glass box testing

Currently Conjecture treats the tests it calls as a black box and does not get much information about what the tests it executes are actually doing.

One obvious thing to do which brings in some more ideas from e.g. American Fuzzy Lop is to use more coverage information, but so far I haven’t had much success with making my prototypes of this idea suitable for real world use. The primary reason for this so far has been that all of the techniques I’ve found have worked well when tests are allowed to run for minutes or hours, but the current design focus of Hypothesis assumes tests have seconds to run at most, which limits the utility of these methods and means they haven’t been a priority so far.

But in principle this should be an extremely profitable line of attack, even with that limitation, and I would like to explore it further.

The main idea would be to add a notion of “tags” to the core Conjecture engine which could be used to guide the search. Coverage would be one source of tags, but others are possible. For example, my previous work on Schroedinteger implements what is essentially a form of lightweight Concolic testing that would be another possibly interesting source of information to use.

Exactly how much of this is original research and how much is just applications of existing research is yet to be determined, but I think it very likely that at the very least figuring out how to make use of this sort of information in sharply bounded time is likely to bear interesting fruit. The opportunity to see how Concolic testing behaves in the wild is also likely to result in a number of additional questions.

Making the Conjecture engine smarter

A thing I’ve looked into in the past is the possible use of grammar inference to improve shrinking and data generation.

At the time the obstacle I ran into was that the algorithm I was using - an optimized variation of L* search - did not get good performance in practice on the problems I tried it on.

Synthesizing Program Input Grammars promises to lift this restriction by providing much better grammar inference in practical scenarios that are quite closely related to this problem domain, so I would like to revisit this and see if it can prove useful.

There are likely a number of other ways that the Conjecture engine can probe the state of the system under test to determine interesting potential behaviours, especially in combination with glass box testing features.

I think there are a lot of potentially interesting research directions in here - especially if this is combined with the glass box testing. Given that I haven’t even been able to make this perform acceptably in the past, the first one would be to see if I can!

This will also require a fair bit of practical experimentation to see what works well at actually finding bugs and what doesn’t. This is one area in particular where a corpus of open source projects tested with Hypothesis will be extremely helpful.

Other testing abstractions

Despite Hypothesis primarily being a library for property-based testing, the core Conjecture engine actually has very little to do with property-based testing and is a more powerful low-level testing abstraction. It would be interesting to see how far that could be taken - the existing stateful/model-based testing is one partial step in that direction, but it could also potentially be used more directly for other things, e.g. in tandem with some of the above features it could be used for low-level fuzzing of binaries, or to drive thread scheduling.

The nice thing about the Conjecture separation is that because it is so self-contained, it can be used as the core building block on which other tools can be rebuilt and gain a lot of its major features for free.

I don’t currently have any concrete plans in this direction, but it seems likely there are some interesting possibilities here that will emerge after more review of the testing literature.

This is probably just engineering unless some particularly interesting application emerges, but I think the basic potential of the technology would probably give pretty good odds of such an application.

What should you do with this information?

It depends who you are.

  • If I’m already talking to you because you’re a potential PhD supervisor, tell me what about this interests you and ask me lots of questions.
  • If you’re a potential PhD supervisor who I’m not already talking to but you’d like me to, please let me know!
  • If you’re somebody else, it’s rather up to you. Feel free to send me papers, questions, etc.

Whoever you are, if you found this document interesting I’d love to hear from you. Drop me an email at [email protected] .


Title: Multiple Object Tracking as ID Prediction

Abstract: In Multiple Object Tracking (MOT), tracking-by-detection methods have stood the test for a long time, which split the process into two parts according to the definition: object detection and association. They leverage robust single-frame detectors and treat object association as a post-processing step through hand-crafted heuristic algorithms and surrogate tasks. However, the nature of heuristic techniques prevents end-to-end exploitation of training data, leading to increasingly cumbersome and challenging manual modification while facing complicated or novel scenarios. In this paper, we regard this object association task as an End-to-End in-context ID prediction problem and propose a streamlined baseline called MOTIP. Specifically, we form the target embeddings into historical trajectory information while considering the corresponding IDs as in-context prompts, then directly predict the ID labels for the objects in the current frame. Thanks to this end-to-end process, MOTIP can learn tracking capabilities straight from training data, freeing itself from burdensome hand-crafted algorithms. Without bells and whistles, our method achieves impressive state-of-the-art performance in complex scenarios like DanceTrack and SportsMOT, and it performs competitively with other transformer-based methods on MOT17. We believe that MOTIP demonstrates remarkable potential and can serve as a starting point for future research. The code is available at this https URL .


IMAGES

  1. How to do Hypothesis Testing : A Beginner Guide For Data Scientist

    hypothesis testing in computer science

  2. Statistical Hypothesis Testing: Step by Step

    hypothesis testing in computer science

  3. Hypothesis Testing- Meaning, Types & Steps

    hypothesis testing in computer science

  4. Hypothesis Testing Solved Problems

    hypothesis testing in computer science

  5. Hypothesis Testing Solved Examples(Questions and Solutions)

    hypothesis testing in computer science

  6. What is Hypothesis Testing? Types and Methods

    hypothesis testing in computer science

VIDEO

  1. Hypothesis testing #study bs 7 semester statics

  2. Hypothesis Testing

  3. TWO SAMPLE HYPOTHESIS TESTING IN SPSS

  4. Session 8- Hypothesis testing by Non Parametric Tests (7/12/23)

  5. Hypothesis Testing in Statistical Theory

  6. Basics of Hypothesis Testing

COMMENTS

  1. Understanding Hypothesis Testing

    A Computer Science portal for geeks. It contains well written, well thought and well explained computer science and programming articles, quizzes and practice/competitive programming/company interview Questions. ... Hypothesis testing involves formulating assumptions about population parameters based on sample statistics and rigorously ...

  2. Hypothesis Testing

    Hypothesis testing is an indispensable tool in data science, allowing us to make data-driven decisions with confidence. By understanding its principles, conducting tests properly, and considering real-world applications, you can harness the power of hypothesis testing to unlock valuable insights from your data.

  3. Understanding Hypothesis Testing

    Hypothesis testing is a statistical method to determine whether a hypothesis that you have holds true or not. The hypothesis can be with respect to two variables within a dataset, an association between two groups or a situation. ... with a background in Computer Science and Psychology. Interested in healthcare AI, specifically mental health ...

  4. Hypothesis Testing Guide for Data Science Beginners

    Steps of Hypothesis Testing. The steps of hypothesis testing typically involve the following process: Formulate Hypotheses: State the null hypothesis and the alternative hypothesis.; Choose Significance Level (α): Select a significance level (α), which determines the threshold for rejecting the null hypothesis.Commonly used significance levels include 0.05 and 0.01.

  5. Introduction to Hypothesis Testing

    About this course. Get started with hypothesis testing by examining a one-sample t-test and binomial tests — both used for drawing inference about a population based on a smaller sample from that population.

  6. Hypothesis testing for data scientists

    4. Photo by Anna Nekrashevich from Pexels. Hypothesis testing is a common statistical tool used in research and data science to support the certainty of findings. The aim of testing is to answer how probable an apparent effect is detected by chance given a random data sample. This article provides a detailed explanation of the key concepts in ...

  7. Statistical Inference and Hypothesis Testing in Data Science

    Statistical Inference and Hypothesis Testing in Data Science Applications. This course is part of Data Science Foundations: Statistical Inference Specialization. Taught in English. 22 languages available. Some content may not be translated. Instructor: Jem Corcoran. Enroll for Free. Starts Apr 5.

  8. Hypothesis Testing: a Practical Intro

    Feb 7, 2021. 1. A short primer on why we can reject hypotheses, but cannot accept them, with examples and visuals. Image by the author. Hypothesis testing is the basis of classical statistical inference. It's a framework for making decisions under uncertainty with the goal to prevent you from making stupid decisions — provided there is data ...

  9. Statistical Inference and Hypothesis Testing in Data Science Applications

    Learn hypothesis testing and its application in data science with this 5-week course from the University of Colorado Boulder, also part of their MS in Data Science program. Udemy, Coursera, 2U/edX Face Lawsuits Over Meta Pixel Use ... Computer Science, Information Science, and others. With performance-based admissions and no application process ...

  10. Hypothesis Testing

    There are 5 main steps in hypothesis testing: State your research hypothesis as a null hypothesis and alternate hypothesis (H o) and (H a or H 1 ). Collect data in a way designed to test the hypothesis. Perform an appropriate statistical test. Decide whether to reject or fail to reject your null hypothesis. Present the findings in your results ...

  11. Hypothesis Testing Steps & Examples

    Hypothesis testing is a technique that helps scientists, researchers, or for that matter, anyone test the validity of their claims or hypotheses about real-world or real-life events in order to establish new knowledge. Hypothesis testing techniques are often used in statistics and data science to analyze whether the claims about the occurrence of the events are true, whether the results ...

  12. Hypothesis Testing

    A hypothesis test is a statistical inference method used to test the significance of a proposed (hypothesized) relation between population statistics (parameters) and their corresponding sample estimators. In other words, hypothesis tests are used to determine if there is enough evidence in a sample to prove a hypothesis true for the entire population. The test considers two hypotheses: the ...

  13. PDF Hypothesis Testing

    Hypothesis Testing • Purpose: make inferences about a population parameter by analyzing differences between observed sample statistics and the results one expects to obtain if some underlying assumption is true. • Null hypothesis: • Alternative hypothesis: • If the null hypothesis is rejected then the alternative hypothesis is accepted ...

  14. Hypothesis Testing in Data Science

    In the world of data science, there are two parts to consider when putting together a hypothesis. First, the team builds a strong hypothesis based on the available dataset; this helps direct the team and plan accordingly throughout the data science project. The hypothesis is then tested with a complete dataset, and ...

  15. A Complete Guide to Hypothesis Testing

    Hypothesis testing is a method of statistical inference that considers the null hypothesis H₀ vs. the alternative hypothesis Hₐ, where we are typically looking to assess evidence against H₀. Such a test is used to compare data sets against one another, or to compare a data set against some external standard. The former being a two-sample ...
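
    In SciPy terms, these two cases map onto two different calls; the arrays and the external standard of 100 below are invented for illustration.

    ```python
    # Comparing two data sets vs. comparing one data set to an external standard.
    from scipy import stats

    a = [98.2, 101.5, 99.7, 102.3, 100.8, 97.9]
    b = [103.1, 104.8, 102.2, 105.0, 101.9, 103.7]

    # Two-sample test: do a and b come from populations with the same mean?
    t2, p2 = stats.ttest_ind(a, b)

    # One-sample test: does a's population mean equal an external standard of 100?
    t1, p1 = stats.ttest_1samp(a, popmean=100)

    print(f"two-sample: p = {p2:.3f}; one-sample vs 100: p = {p1:.3f}")
    ```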

  16. Chapter 6 Hypothesis Testing

    6 Hypothesis Testing. In the previous chapters, we learned about randomness and sampling. Quite often, a data scientist receives some data and must make some assertion about it. ... A hypothetical Computer Science course had 40 enrolled students and was divided into 3 lab sections. A Teaching Assistant (TA) leads each section. After a midterm ...
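
    The chapter's midterm example isn't reproduced here, but one standard way to test whether the three lab sections differ in mean midterm score is a one-way ANOVA; the scores below are invented for illustration (13 + 13 + 14 = 40 students).

    ```python
    # One-way ANOVA sketch: do mean midterm scores differ across 3 lab sections?
    from scipy.stats import f_oneway

    # Invented midterm scores for the three sections.
    section_a = [78, 85, 90, 72, 88, 81, 79, 84, 86, 77, 83, 80, 75]
    section_b = [82, 79, 88, 91, 76, 85, 89, 80, 84, 87, 78, 83, 81]
    section_c = [70, 75, 68, 80, 73, 77, 71, 79, 74, 76, 72, 78, 69, 81]

    # H0: all three sections have the same mean score.
    f_stat, p_value = f_oneway(section_a, section_b, section_c)
    print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
    ```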

  17. Hypothesis testing

    hypothesis testing, In statistics, a method for testing how accurately a mathematical model based on one set of data predicts the nature of other data sets generated by the same process. Hypothesis testing grew out of quality control, in which whole batches of manufactured items are accepted or rejected based on testing relatively small samples. An initial hypothesis (null hypothesis) might ...

  18. What is Hypothesis Testing in Statistics? Types and Examples

    Hypothesis testing is a statistical method used to determine if there is enough evidence in sample data to draw conclusions about a population. It involves formulating two competing hypotheses, the null hypothesis (H0) and the alternative hypothesis (Ha), and then collecting data to assess the evidence.

  19. Everything You Need To Know about Hypothesis Testing

    Test Statistic: The test statistic measures how close the sample has come to the null hypothesis. Its observed value changes randomly from one random sample to another. A test statistic contains information about the data that is relevant for deciding whether to reject the null hypothesis or not.
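
    For a test about a mean, that statistic can be computed by hand as t = (x̄ − μ₀) / (s / √n); the sketch below does exactly that on invented data.

    ```python
    # Computing a t test statistic by hand: t = (xbar - mu0) / (s / sqrt(n)).
    import math
    import statistics

    sample = [5.1, 4.8, 5.3, 5.0, 4.9, 5.4, 5.2, 4.7]  # invented data
    mu_0 = 5.0                                          # hypothesized mean

    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)                        # sample standard deviation
    t = (xbar - mu_0) / (s / math.sqrt(n))
    print(f"t = {t:.3f}  (a new random sample would give a different value)")
    ```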

  20. Hypothesis Testing: Data Science

    We reject the null hypothesis (H₀) if the sample mean (x̅) lies inside the critical region. We fail to reject the null hypothesis (H₀) if the sample mean (x̅) lies outside the critical region. The formulation of the null and alternate hypotheses determines the type of the test and the position of the critical regions in the normal distribution.
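
    A minimal sketch of this decision rule for a two-sided test, assuming a known population standard deviation so that a z-test applies; all numbers are invented for illustration.

    ```python
    # Critical-region decision rule for a two-sided z-test (invented numbers).
    from math import sqrt
    from scipy.stats import norm

    mu_0, sigma, n = 100, 15, 36    # hypothesized mean, known sd, sample size
    xbar = 106.5                    # observed sample mean
    alpha = 0.05

    # Critical region: sample means further from mu_0 than these bounds.
    z_crit = norm.ppf(1 - alpha / 2)            # ~1.96 for alpha = 0.05
    margin = z_crit * sigma / sqrt(n)
    lower, upper = mu_0 - margin, mu_0 + margin

    if xbar < lower or xbar > upper:
        print(f"x̄ = {xbar} lies inside the critical region: reject H0")
    else:
        print(f"x̄ = {xbar} lies outside the critical region: fail to reject H0")
    ```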

  21. How reliable are time series forecasts?

    From the test, we can extract the p-value, which is the probability of observing an outcome at least as extreme as the one actually observed, assuming the null hypothesis is true. We compare it against a predetermined threshold: if the p-value of the test is less than 0.05, then we reject the hypothesis that the given residual mean is 1.
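
    In code, that decision rule might look like the sketch below; the residuals are invented, the hypothesized mean of 1 is taken from the text, and a one-sample t-test stands in for whatever test the original article used.

    ```python
    # Test whether the residual mean equals 1, rejecting when p < 0.05.
    from scipy.stats import ttest_1samp

    residuals = [1.4, 0.7, 1.2, 0.9, 1.6, 1.1, 0.8, 1.3]  # invented residuals

    t_stat, p_value = ttest_1samp(residuals, popmean=1.0)
    if p_value < 0.05:
        print(f"p = {p_value:.3f} < 0.05: reject H0 (residual mean is not 1)")
    else:
        print(f"p = {p_value:.3f} >= 0.05: fail to reject H0")
    ```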