the null hypothesis stands

Exploring the Null Hypothesis: Definition and Purpose

Updated: July 5, 2023 by Ken Feldman

the null hypothesis stands

Hypothesis testing is a branch of statistics in which, using data from a sample, an inference is made about a population parameter or a population probability distribution .

First, a hypothesis statement and assumption is made about the population parameter or probability distribution. This initial statement is called the Null Hypothesis and is denoted by H o. An alternative or alternate hypothesis (denoted Ha ), is then stated which will be the opposite of the Null Hypothesis.

The hypothesis testing process and analysis involves using sample data to determine whether or not you can be statistically confident that you can reject or fail to reject the H o. If the H o is rejected, the statistical conclusion is that the alternative or alternate hypothesis Ha is true.

Overview: What is the Null Hypothesis (Ho)? 

Hypothesis testing applies to all forms of statistical inquiry. For example, it can be used to determine whether there are differences between population parameters or an understanding about slopes of regression lines or equality of probability distributions.

In all cases, the first thing you do is state the Null and Alternate Hypotheses. The word Null in the context of hypothesis testing means “nothing” or “zero.”  

As an example, if we wanted to test whether there was a difference in two population means based on the calculations from two samples, we would state the Null Hypothesis in the form of: 

Ho: mu1 = mu2 or mu1- mu2 = 0  

In other words, there is no difference, or the difference is zero. Note that the notation is in the form of a population parameter, not a sample statistic. 

Since you are using sample data to make your inferences about the population, it’s possible you’ll make an error. In the case of the Null Hypothesis, we can make one of two errors.

  •   Type 1 , or alpha error: An alpha error is when you mistakenly reject the Null and believe that something significant happened. In other words, you believe that the means of the two populations are different when they aren’t.
  • Type 2, or beta error: A beta error is when you fail to reject the null when you should have.  In this case, you missed something significant and failed to take action. 

A classic example is when you get the results back from your doctor after taking a blood test. If the doctor says you have an infection when you really don’t, that is an alpha error. That is thinking that there is something significant going on when there isn’t. We also call that a false positive. The doctor rejected the null that “there was zero infection” and missed the call.

On the other hand, if the doctor told you that everything was OK when you really did have an infection, then he made a beta, or type 2, error. He failed to reject the Null Hypothesis when he should have. That is called a false negative.

The decision to reject or not to reject the Null Hypothesis is based on three numbers. 

  • Alpha, which you get to choose. Alpha is the risk you are willing to assume of falsely rejecting the Null. The typical values for alpha are 1%, 5%, or 10%. Depending on the importance of the conclusion, you only want to falsely claim a difference when there is none, 1%, 5%, or 10% of the time.
  • Beta, which is typically 20%. This means you’re willing to be wrong 20% of the time in failing to reject the null when you should have. 
  • P-value, which is calculated from the data. The p-value is the actual risk you have in being wrong if you reject the null. You would like that to be low.  

Your decision as to what to do about the null is made by comparing the alpha value (your assumed risk) with the p-value (actual risk). If the actual risk is lower than your assumed risk, you can feel comfortable in rejecting the null and claiming something has happened. But, if the actual risk is higher than your assumed risk you will be taking a bigger risk than you want by rejecting the null.

RELATED: NULL VS. ALTERNATIVE HYPOTHESIS

3 benefits of the null hypothesis .

The stating and testing of the null hypothesis is the foundation of hypothesis testing. By doing so, you set the parameters for your statistical inference.

1. Statistical assurance of determining differences between population parameters

Just looking at the mathematical difference between the means of two samples and making a decision is woefully inadequate. By statistically testing the null hypothesis, you will have more confidence in any inferences you want to make about populations based on your samples.

2. Statistically based estimation of the probability of a population distribution

Many statistical tests require assumptions of specific distributions. Many of these tests assume that the population follows the normal distribution . If it doesn’t, the test may be invalid.  

3. Assess the strength of your conclusions as to what to do with the null hypothesis

Hypothesis testing calculations will provide some relative strength to your decisions as to whether you reject or fail to reject the null hypothesis.

Why is the Null Hypothesis important to understand?

The interpretation of the statistics relative to the null hypothesis is what’s important.

1. Properly write the null hypothesis to properly capture what you are seeking to prove

The null is always written in the same format. That is, the lack of difference or some other condition. The alternative hypothesis can be written in three formats depending on what you want to prove. 

2. Frame your statement and select an appropriate alpha risk

You don’t want to place too big of a hurdle or burden on your decision-making relative to action on the null hypothesis by selecting an alpha value that is too high or too low.

3. There are decision errors when deciding on how to respond to the Null Hypothesis

Since your decision relative to rejecting or not rejecting the null is based on statistical calculations, it is important to understand how that decision works. 

An industry example of using the Null Hypothesis 

The new director of marketing just completed the rollout of a new marketing campaign targeting the Hispanic market. Early indications showed that the campaign was successful in increasing sales in the Hispanic market. 

He came to that conclusion by comparing a sample of sales prior to the campaign and current sales after implementation of the campaign. He was anxious to proudly tell his boss how successful the campaign was. But, he decided to first check with his Lean Six Sigma Black Belt to see whether she agreed with his conclusion.

The Black Belt first asked the director his tolerance for risk of being wrong by telling the boss the campaign was successful when in fact, it wasn’t. That was the alpha value. The Director picked 5% since he was new and didn’t want to make a false claim so early in his career. He also picked 20% as his beta value.  

When the Black Belt was done analyzing the data, she found out that the p-value was 15%.  That meant if the director told the VP the campaign worked, there was a 15% chance he would be wrong and that the campaign probably needed some revising. Since he was only willing to be wrong 5% of the time, the decision was to not reject the null since his 5% assumed risk was less than the 15% actual risk.

3 best practices when thinking about the Null Hypothesis 

Using hypothesis testing to help make better data-driven decisions requires that you properly address the Null Hypothesis. 

1. Always use the proper nomenclature when stating the Null Hypothesis 

The null will always be in the form of decisions regarding the population, not the sample. 

2. The Null Hypothesis will always be written as the absence of some parameter or process characteristic

The writing of the Alternate Hypothesis can vary, so be sure you understand exactly what condition you are testing against. 

3. Pick a reasonable alpha risk so you’re not always failing to reject the Null Hypothesis

Being too cautious will lead you to make beta errors, and you’ll never learn anything about your population data. 

Frequently Asked Questions (FAQ) about the Null Hypothesis

What form should the null hypothesis be written in.

The Null Hypothesis should always be in the form of no difference or zero and always refer to the state of the population, not the sample. 

What is an alpha error? 

An alpha error, or Type 1 error, is rejecting the Null Hypothesis and claiming a significant event has occurred when, in fact, that is not true and the Null should not have been rejected.

How do I use the alpha error and p-value to decide on what decision I should make about the Null Hypothesis? 

The most common way of answering this is, “If the p-value is low (less than the alpha), the Null should be rejected. If the p-value is high (greater than the alpha) then the Null should not be rejected.”

Becoming familiar with the Null Hypothesis (Ho)

The proper writing of the Null Hypothesis is the basis for applying hypothesis testing to help you make better data-driven decisions. The format of the Null will always be in the form of zero, or the non-existence of some condition. It will always refer to a population parameter and not the sample you use to do your hypothesis testing calculations.

Be aware of the two types of errors you can make when deciding on what to do with the Null. Select reasonable risks values for your alpha and beta risks. By comparing your alpha risk with the calculated risk computed from the data, you will have sufficient information to make a wise decision as to whether you should reject the Null Hypothesis or not.

About the Author

' src=

Ken Feldman

  • Null hypothesis

by Marco Taboga , PhD

In a test of hypothesis , a sample of data is used to decide whether to reject or not to reject a hypothesis about the probability distribution from which the sample was extracted.

The hypothesis is called the null hypothesis, or simply "the null".

Things a data scientist should know: 1) the criminal trial analogy; 2) the role of the test statistic; 3) failure to reject may be due to lack of power; 4) Rejection may be due to misspecification.

Table of contents

The null is like the defendant in a criminal trial

How is the null hypothesis tested, example 1 - proportion of defective items, measurement, test statistic, critical region, interpretation, example 2 - reliability of a production plant, rejection and failure to reject, not rejecting and accepting are not the same thing, failure to reject can be due to lack of power, rejections are easier to interpret, but be careful, takeaways - how to (and not to) formulate a null hypothesis, more examples, more details, best practices in science, keep reading the glossary.

Formulating null hypotheses and subjecting them to statistical testing is one of the workhorses of the scientific method.

Scientists in all fields make conjectures about the phenomena they study, translate them into null hypotheses and gather data to test them.

This process resembles a trial:

the defendant (the null hypothesis) is accused of being guilty (wrong);

evidence (data) is gathered in order to prove the defendant guilty (reject the null);

if there is evidence beyond any reasonable doubt, the defendant is found guilty (the null is rejected);

otherwise, the defendant is found not guilty (the null is not rejected).

Keep this analogy in mind because it helps to better understand statistical tests, their limitations, use and misuse, and frequent misinterpretation.

The null hypothesis is like the defendant in a criminal trial.

Before collecting the data:

we decide how to summarize the relevant characteristics of the sample data in a single number, the so-called test statistic ;

we derive the probability distribution of the test statistic under the hypothesis that the null is true (the data is regarded as random; therefore, the test statistic is a random variable);

we decide what probability of incorrectly rejecting the null we are willing to tolerate (the level of significance , or size of the test ); the level of significance is typically a small number, such as 5% or 1%.

we choose one or more intervals of values (collectively called rejection region) such that the probability that the test statistic falls within these intervals is equal to the desired level of significance; the rejection region is often a tail of the distribution of the test statistic (one-tailed test) or the union of the left and right tails (two-tailed test).

The rejection region is a set of values that the test statistic is unlikely to take if the null hypothesis is true.

Then, the data is collected and used to compute the value of the test statistic.

A decision is taken as follows:

if the test statistic falls within the rejection region, then the null hypothesis is rejected;

otherwise, it is not rejected.

The probability distribution of the test statistic and the rejection region depend on the null hypothesis.

We now make two examples of practical problems that lead to formulate and test a null hypothesis.

A new method is proposed to produce light bulbs.

The proponents claim that it produces less defective bulbs than the method currently in use.

To check the claim, we can set up a statistical test as follows.

We keep the light bulbs on for 10 consecutive days, and then we record whether they are still working at the end of the test period.

The probability that a light bulb produced with the new method is still working at the end of the test period is the same as that of a light bulb produced with the old method.

100 light bulbs are tested:

50 of them are produced with the new method (group A)

the remaining 50 are produced with the old method (group B).

The final data comprises 100 observations of:

an indicator variable which is equal to 1 if the light bulb is still working at the end of the test period and 0 otherwise;

a categorical variable that records the group (A or B) to which each light bulb belongs.

We use the data to compute the proportions of working light bulbs in groups A and B.

The proportions are estimates of the probabilities of not being defective, which are equal for the two groups under the null hypothesis.

We then compute a z-statistic (see here for details) by:

taking the difference between the proportion in group A and the proportion in group B;

standardizing the difference:

we subtract the expected value (which is zero under the null hypothesis);

we divide by the standard deviation (it can be derived analytically).

The distribution of the z-statistic can be approximated by a standard normal distribution .

The z-statistic has a normal distribution with zero mean and variance equal to one.

We decide that the level of confidence must be 5%. In other words, we are going to tolerate a 5% probability of incorrectly rejecting the null hypothesis.

The critical region is the right 5%-tail of the normal distribution, that is, the set of all values greater than 1.645 (see the glossary entry on critical values if you are wondering how this value was obtained).

If the test statistic is greater than 1.645, then the null hypothesis is rejected; otherwise, it is not rejected.

A rejection is interpreted as significant evidence that the new production method produces less defective items; failure to reject is interpreted as insufficient evidence that the new method is better.

The null hypothesis is rejected when the test statistic falls in the tails of the distribution.

A production plant incurs high costs when production needs to be halted because some machinery fails.

The plant manager has decided that he is not willing to tolerate more than one halt per year on average.

If the expected number of halts per year is greater than 1, he will make new investments in order to improve the reliability of the plant.

A statistical test is set up as follows.

The reliability of the plant is measured by the number of halts.

The number of halts in a year is assumed to have a Poisson distribution with expected value equal to 1 (using the Poisson distribution is common in reliability testing).

The manager cannot wait more than one year before taking a decision.

There will be a single datum at his disposal: the number of halts observed during one year.

The number of halts is used as a test statistic. By assumption, it has a Poisson distribution under the null hypothesis.

The manager decides that the probability of incorrectly rejecting the null can be at most 10%.

A Poisson random variable with expected value equal to 1 takes values:

larger than 1 with probability 26.42%;

larger than 2 with probability 8.03%.

Therefore, it is decided that the critical region will be the set of all values greater than or equal to 3.

If the test statistic is strictly greater than or equal to 3, then the null is rejected; otherwise, it is not rejected.

A rejection is interpreted as significant evidence that the production plant is not reliable enough (the average number of halts per year is significantly larger than tolerated).

Failure to reject is interpreted as insufficient evidence that the plant is unreliable.

Failure to reject the null hypothesis is interpreted as insufficient evidence.

This section discusses the main problems that arise in the interpretation of the outcome of a statistical test (reject / not reject).

When the test statistic does not fall within the critical region, then we do not reject the null hypothesis.

Does this mean that we accept the null? Not really.

In general, failure to reject does not constitute, per se, strong evidence that the null hypothesis is true .

Remember the analogy between hypothesis testing and a criminal trial. In a trial, when the defendant is declared not guilty, this does not mean that the defendant is innocent. It only means that there was not enough evidence (not beyond any reasonable doubt) against the defendant.

In turn, lack of evidence can be due:

either to the fact that the defendant is innocent ;

or to the fact that the prosecution has not been able to provide enough evidence against the defendant, even if the latter is guilty .

This is the very reason why courts do not declare defendants innocent, but they use the locution "not guilty".

In a similar fashion, statisticians do not say that the null hypothesis has been accepted, but they say that it has not been rejected.

Failure to reject does not imply acceptance.

To better understand why failure to reject does not in general constitute strong evidence that the null hypothesis is true, we need to use the concept of statistical power .

The power of a test is the probability (calculated ex-ante, i.e., before observing the data) that the null will be rejected when another hypothesis (called the alternative hypothesis ) is true.

Let's consider the first of the two examples above (the production of light bulbs).

In that example, the null hypothesis is: the probability that a light bulb is defective does not decrease after introducing a new production method.

Let's make the alternative hypothesis that the probability of being defective is 1% smaller after changing the production process (assume that a 1% decrease is considered a meaningful improvement by engineers).

How much is the ex-ante probability of rejecting the null if the alternative hypothesis is true?

If this probability (the power of the test) is small, then it is very likely that we will not reject the null even if it is wrong.

If we use the analogy with criminal trials, low power means that most likely the prosecution will not be able to provide sufficient evidence, even if the defendant is guilty.

Thus, in the case of lack of power, failure to reject is almost meaningless (it was anyway highly likely).

This is why, before performing a test, it is good statistical practice to compute its power against a relevant alternative .

If the power is found to be too small, there are usually remedies. In particular, statistical power can usually be increased by increasing the sample size (see, e.g., the lecture on hypothesis tests about the mean ).

The best practice is to compute the power of the test, that is, the probability of rejecting the null hypothesis when the alternative is true.

As we have explained above, interpreting a failure to reject the null hypothesis is not always straightforward. Instead, interpreting a rejection is somewhat easier.

When we reject the null, we know that the data has provided a lot of evidence against the null. In other words, it is unlikely (how unlikely depends on the size of the test) that the null is true given the data we have observed.

There is an important caveat though. The null hypothesis is often made up of several assumptions, including:

the main assumption (the one we are testing);

other assumptions (e.g., technical assumptions) that we need to make in order to set up the hypothesis test.

For instance, in Example 2 above (reliability of a production plant), the main assumption is that the expected number of production halts per year is equal to 1. But there is also a technical assumption: the number of production halts has a Poisson distribution.

It must be kept in mind that a rejection is always a joint rejection of the main assumption and all the other assumptions .

Therefore, we should always ask ourselves whether the null has been rejected because the main assumption is wrong or because the other assumptions are violated.

In the case of Example 2 above, is a rejection of the null due to the fact that the expected number of halts is greater than 1 or is it due to the fact that the distribution of the number of halts is very different from a Poisson distribution?

When we suspect that a rejection is due to the inappropriateness of some technical assumption (e.g., assuming a Poisson distribution in the example), we say that the rejection could be due to misspecification of the model .

The right thing to do when these kind of suspicions arise is to conduct so-called robustness checks , that is, to change the technical assumptions and carry out the test again.

In our example, we could re-run the test by assuming a different probability distribution for the number of halts (e.g., a negative binomial or a compound Poisson - do not worry if you have never heard about these distributions).

If we keep obtaining a rejection of the null even after changing the technical assumptions several times, the we say that our rejection is robust to several different specifications of the model .

Even if the null hypothesis is true, a wrong technical assumption can lead to reject the null too often.

What are the main practical implications of everything we have said thus far? How does the theory above help us to set up and test a null hypothesis?

What we said can be summarized in the following guiding principles:

A test of hypothesis is like a criminal trial and you are the prosecutor . You want to find evidence that the defendant (the null hypothesis) is guilty. Your job is not to prove that the defendant is innocent. If you find yourself hoping that the defendant is found not guilty (i.e., the null is not rejected) then something is wrong with the way you set up the test. Remember: you are the prosecutor.

Compute the power of your test against one or more relevant alternative hypotheses. Do not run a test if you know ex-ante that it is unlikely to reject the null when the alternative hypothesis is true.

Beware of technical assumptions that you add to the main assumption you want to test. Make robustness checks in order to verify that the outcome of the test is not biased by model misspecification.

$H_{0}$

More examples of null hypotheses and how to test them can be found in the following lectures.

The lecture on Hypothesis testing provides a more detailed mathematical treatment of null hypotheses and how they are tested.

This lecture on the null hypothesis was featured in Stanford University's Best practices in science .

Stanford University Best Practices in Science.

Previous entry: Normal equations

Next entry: Parameter

How to cite

Please cite as:

Taboga, Marco (2021). "Null hypothesis", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/glossary/null-hypothesis.

Most of the learning materials found on this website are now available in a traditional textbook format.

  • Permutations
  • Characteristic function
  • Almost sure convergence
  • Likelihood ratio test
  • Uniform distribution
  • Bernoulli distribution
  • Multivariate normal distribution
  • Chi-square distribution
  • Maximum likelihood
  • Mathematical tools
  • Fundamentals of probability
  • Probability distributions
  • Asymptotic theory
  • Fundamentals of statistics
  • About Statlect
  • Cookies, privacy and terms of use
  • Precision matrix
  • Distribution function
  • Mean squared error
  • IID sequence
  • To enhance your privacy,
  • we removed the social buttons,
  • but don't forget to share .

Library homepage

  • school Campus Bookshelves
  • menu_book Bookshelves
  • perm_media Learning Objects
  • login Login
  • how_to_reg Request Instructor Account
  • hub Instructor Commons

Margin Size

  • Download Page (PDF)
  • Download Full Book (PDF)
  • Periodic Table
  • Physics Constants
  • Scientific Calculator
  • Reference & Cite
  • Tools expand_more
  • Readability

selected template will load here

This action is not available.

Statistics LibreTexts

9.1 Null and Alternative Hypothesis

  • Last updated
  • Save as PDF
  • Page ID 36507

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)

\( \newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\)

( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\)

\( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)

\( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\)

\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)

\( \newcommand{\Span}{\mathrm{span}}\)

\( \newcommand{\id}{\mathrm{id}}\)

\( \newcommand{\kernel}{\mathrm{null}\,}\)

\( \newcommand{\range}{\mathrm{range}\,}\)

\( \newcommand{\RealPart}{\mathrm{Re}}\)

\( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)

\( \newcommand{\Argument}{\mathrm{Arg}}\)

\( \newcommand{\norm}[1]{\| #1 \|}\)

\( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\AA}{\unicode[.8,0]{x212B}}\)

\( \newcommand{\vectorA}[1]{\vec{#1}}      % arrow\)

\( \newcommand{\vectorAt}[1]{\vec{\text{#1}}}      % arrow\)

\( \newcommand{\vectorB}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vectorC}[1]{\textbf{#1}} \)

\( \newcommand{\vectorD}[1]{\overrightarrow{#1}} \)

\( \newcommand{\vectorDt}[1]{\overrightarrow{\text{#1}}} \)

\( \newcommand{\vectE}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash{\mathbf {#1}}}} \)

Section 9.1 Null and Alternative Hypothesis

Learning Objective:

In this section, you will:

• Understand the general concept and use the terminology of hypothesis testing

I claim that my coin is a fair coin. This means that the probability of heads and the probability of tails are both 50% or 0.50.

  • Out of 200 flips of the coin, tails is tossed 102 times. What can we conclude about my claim?
  • Out of 200 flips of the coin, tails is tossed 21 times. What can we conclude about my claim?

Hypothesis is a claim about the value of a population parameter.

Hypothesis Testing is a procedure for determining whether the hypothesis stated is a reasonable statement and should not be rejected, or is unreasonable and should be rejected.

Hypothesis testing begins by considering two hypotheses. They are called the null hypothesis and the alternative hypothesis . These hypotheses contain opposing viewpoints.

  • The null hypothesis , typically denoted with H 0 . The null is not rejected unless the hypothesis test shows otherwise. The null statement must always contain some form of equality (=, ≤ or ≥)
  • The alternative hypothesis , typically denoted with H a or H 1 , using less than, greater than, or not equals symbols, (≠, >, or <).
  • If we reject the null hypothesis, then we can assume there is enough evidence to support the alternative hypothesis.
  • Never state that a claim is proven true or false. Keep in mind the underlying fact that hypothesis testing is based on probability laws; therefore, we can talk only in terms of non-absolute certainties.

Example 1: We want to test whether the mean GPA of students in American colleges is different from 2.0 (out of 4.0). The null and alternative hypotheses are:

Example 2: We want to test if college students take less than five years to graduate from college, on the average. The null and alternative hypotheses are:

Example 3: In an issue of U.S. News and World Report, an article on school standards stated that about half of all students in France, Germany, and Israel take advanced placement exams and a third pass. The same article stated that 6.6% of U.S. students take advanced placement exams and 4.4% pass. Test if the percentage of U.S. students who take advanced placement exams is more than 6.6%. State the null and alternative hypotheses.

For more information and examples see online textbook OpenStax Introductory Statistics pages 505-508.

“ Introduction to Statistics ” by OpenStax , used is licensed under a Creative Commons Attribution License 4.0 license

Have a thesis expert improve your writing

Check your thesis for plagiarism in 10 minutes, generate your apa citations for free.

  • Knowledge Base
  • Null and Alternative Hypotheses | Definitions & Examples

Null and Alternative Hypotheses | Definitions & Examples

Published on 5 October 2022 by Shaun Turney . Revised on 6 December 2022.

The null and alternative hypotheses are two competing claims that researchers weigh evidence for and against using a statistical test :

  • Null hypothesis (H 0 ): There’s no effect in the population .
  • Alternative hypothesis (H A ): There’s an effect in the population.

The effect is usually the effect of the independent variable on the dependent variable .

Table of contents

Answering your research question with hypotheses, what is a null hypothesis, what is an alternative hypothesis, differences between null and alternative hypotheses, how to write null and alternative hypotheses, frequently asked questions about null and alternative hypotheses.

The null and alternative hypotheses offer competing answers to your research question . When the research question asks “Does the independent variable affect the dependent variable?”, the null hypothesis (H 0 ) answers “No, there’s no effect in the population.” On the other hand, the alternative hypothesis (H A ) answers “Yes, there is an effect in the population.”

The null and alternative are always claims about the population. That’s because the goal of hypothesis testing is to make inferences about a population based on a sample . Often, we infer whether there’s an effect in the population by looking at differences between groups or relationships between variables in the sample.

You can use a statistical test to decide whether the evidence favors the null or alternative hypothesis. Each type of statistical test comes with a specific way of phrasing the null and alternative hypothesis. However, the hypotheses can also be phrased in a general way that applies to any test.

The null hypothesis is the claim that there’s no effect in the population.

If the sample provides enough evidence against the claim that there’s no effect in the population ( p ≤ α), then we can reject the null hypothesis . Otherwise, we fail to reject the null hypothesis.

Although “fail to reject” may sound awkward, it’s the only wording that statisticians accept. Be careful not to say you “prove” or “accept” the null hypothesis.

Null hypotheses often include phrases such as “no effect”, “no difference”, or “no relationship”. When written in mathematical terms, they always include an equality (usually =, but sometimes ≥ or ≤).

Examples of null hypotheses

The table below gives examples of research questions and null hypotheses. There’s always more than one way to answer a research question, but these null hypotheses can help you get started.

*Note that some researchers prefer to always write the null hypothesis in terms of “no effect” and “=”. It would be fine to say that daily meditation has no effect on the incidence of depression and p 1 = p 2 .

The alternative hypothesis (H A ) is the other answer to your research question . It claims that there’s an effect in the population.

Often, your alternative hypothesis is the same as your research hypothesis. In other words, it’s the claim that you expect or hope will be true.

The alternative hypothesis is the complement to the null hypothesis. Null and alternative hypotheses are exhaustive, meaning that together they cover every possible outcome. They are also mutually exclusive, meaning that only one can be true at a time.

Alternative hypotheses often include phrases such as “an effect”, “a difference”, or “a relationship”. When alternative hypotheses are written in mathematical terms, they always include an inequality (usually ≠, but sometimes > or <). As with null hypotheses, there are many acceptable ways to phrase an alternative hypothesis.

Examples of alternative hypotheses

The table below gives examples of research questions and alternative hypotheses to help you get started with formulating your own.

Null and alternative hypotheses are similar in some ways:

  • They’re both answers to the research question
  • They both make claims about the population
  • They’re both evaluated by statistical tests.

However, there are important differences between the two types of hypotheses, summarized in the following table.

To help you write your hypotheses, you can use the template sentences below. If you know which statistical test you’re going to use, you can use the test-specific template sentences. Otherwise, you can use the general template sentences.

The only thing you need to know to use these general template sentences are your dependent and independent variables. To write your research question, null hypothesis, and alternative hypothesis, fill in the following sentences with your variables:

Does independent variable affect dependent variable ?

  • Null hypothesis (H 0 ): Independent variable does not affect dependent variable .
  • Alternative hypothesis (H A ): Independent variable affects dependent variable .

Test-specific

Once you know the statistical test you’ll be using, you can write your hypotheses in a more precise and mathematical way specific to the test you chose. The table below provides template sentences for common statistical tests.

Note: The template sentences above assume that you’re performing one-tailed tests . One-tailed tests are appropriate for most studies.

The null hypothesis is often abbreviated as H 0 . When the null hypothesis is written using mathematical symbols, it always includes an equality symbol (usually =, but sometimes ≥ or ≤).

The alternative hypothesis is often abbreviated as H a or H 1 . When the alternative hypothesis is written using mathematical symbols, it always includes an inequality symbol (usually ≠, but sometimes < or >).

A research hypothesis is your proposed answer to your research question. The research hypothesis usually includes an explanation (‘ x affects y because …’).

A statistical hypothesis, on the other hand, is a mathematical statement about a population parameter. Statistical hypotheses always come in pairs: the null and alternative hypotheses. In a well-designed study , the statistical hypotheses correspond logically to the research hypothesis.

Cite this Scribbr article

If you want to cite this source, you can copy and paste the citation or click the ‘Cite this Scribbr article’ button to automatically add the citation to our free Reference Generator.

Turney, S. (2022, December 06). Null and Alternative Hypotheses | Definitions & Examples. Scribbr. Retrieved 14 May 2024, from https://www.scribbr.co.uk/stats/null-and-alternative-hypothesis/

Is this article helpful?

Shaun Turney

Shaun Turney

Other students also liked, levels of measurement: nominal, ordinal, interval, ratio, the standard normal distribution | calculator, examples & uses, types of variables in research | definitions & examples.

Statology

Statistics Made Easy

How to Write a Null Hypothesis (5 Examples)

A hypothesis test uses sample data to determine whether or not some claim about a population parameter is true.

Whenever we perform a hypothesis test, we always write a null hypothesis and an alternative hypothesis, which take the following forms:

H 0 (Null Hypothesis): Population parameter =,  ≤, ≥ some value

H A  (Alternative Hypothesis): Population parameter <, >, ≠ some value

Note that the null hypothesis always contains the equal sign .

We interpret the hypotheses as follows:

Null hypothesis: The sample data provides no evidence to support some claim being made by an individual.

Alternative hypothesis: The sample data  does provide sufficient evidence to support the claim being made by an individual.

For example, suppose it’s assumed that the average height of a certain species of plant is 20 inches tall. However, one botanist claims the true average height is greater than 20 inches.

To test this claim, she may go out and collect a random sample of plants. She can then use this sample data to perform a hypothesis test using the following two hypotheses:

H 0 : μ ≤ 20 (the true mean height of plants is equal to or even less than 20 inches)

H A : μ > 20 (the true mean height of plants is greater than 20 inches)

If the sample data gathered by the botanist shows that the mean height of this species of plants is significantly greater than 20 inches, she can reject the null hypothesis and conclude that the mean height is greater than 20 inches.

Read through the following examples to gain a better understanding of how to write a null hypothesis in different situations.

Example 1: Weight of Turtles

A biologist wants to test whether or not the true mean weight of a certain species of turtles is 300 pounds. To test this, he goes out and measures the weight of a random sample of 40 turtles.

Here is how to write the null and alternative hypotheses for this scenario:

H 0 : μ = 300 (the true mean weight is equal to 300 pounds)

H A : μ ≠ 300 (the true mean weight is not equal to 300 pounds)

Example 2: Height of Males

It’s assumed that the mean height of males in a certain city is 68 inches. However, an independent researcher believes the true mean height is greater than 68 inches. To test this, he goes out and collects the height of 50 males in the city.

H 0 : μ ≤ 68 (the true mean height is equal to or even less than 68 inches)

H A : μ > 68 (the true mean height is greater than 68 inches)

Example 3: Graduation Rates

A university states that 80% of all students graduate on time. However, an independent researcher believes that less than 80% of all students graduate on time. To test this, she collects data on the proportion of students who graduated on time last year at the university.

H 0 : p ≥ 0.80 (the true proportion of students who graduate on time is 80% or higher)

H A : μ < 0.80 (the true proportion of students who graduate on time is less than 80%)

Example 4: Burger Weights

A food researcher wants to test whether or not the true mean weight of a burger at a certain restaurant is 7 ounces. To test this, he goes out and measures the weight of a random sample of 20 burgers from this restaurant.

H 0 : μ = 7 (the true mean weight is equal to 7 ounces)

H A : μ ≠ 7 (the true mean weight is not equal to 7 ounces)

Example 5: Citizen Support

A politician claims that less than 30% of citizens in a certain town support a certain law. To test this, he goes out and surveys 200 citizens on whether or not they support the law.

H 0 : p ≥ .30 (the true proportion of citizens who support the law is greater than or equal to 30%)

H A : μ < 0.30 (the true proportion of citizens who support the law is less than 30%)

Additional Resources

Introduction to Hypothesis Testing Introduction to Confidence Intervals An Explanation of P-Values and Statistical Significance

Featured Posts

5 Regularization Techniques You Should Know

Hey there. My name is Zach Bobbitt. I have a Masters of Science degree in Applied Statistics and I’ve worked on machine learning algorithms for professional businesses in both healthcare and retail. I’m passionate about statistics, machine learning, and data visualization and I created Statology to be a resource for both students and teachers alike.  My goal with this site is to help you learn statistics through using simple terms, plenty of real-world examples, and helpful illustrations.

2 Replies to “How to Write a Null Hypothesis (5 Examples)”

you are amazing, thank you so much

Say I am a botanist hypothesizing the average height of daisies is 20 inches, or not? Does T = (ave – 20 inches) / √ variance / (80 / 4)? … This assumes 40 real measures + 40 fake = 80 n, but that seems questionable. Please advise.

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Join the Statology Community

Sign up to receive Statology's exclusive study resource: 100 practice problems with step-by-step solutions. Plus, get our latest insights, tutorials, and data analysis tips straight to your inbox!

By subscribing you accept Statology's Privacy Policy.

If you're seeing this message, it means we're having trouble loading external resources on our website.

If you're behind a web filter, please make sure that the domains *.kastatic.org and *.kasandbox.org are unblocked.

To log in and use all the features of Khan Academy, please enable JavaScript in your browser.

AP®︎/College Statistics

Course: ap®︎/college statistics   >   unit 10.

  • Idea behind hypothesis testing

Examples of null and alternative hypotheses

  • Writing null and alternative hypotheses
  • P-values and significance tests
  • Comparing P-values to different significance levels
  • Estimating a P-value from a simulation
  • Estimating P-values from simulations
  • Using P-values to make conclusions

Want to join the conversation?

  • Upvote Button navigates to signup page
  • Downvote Button navigates to signup page
  • Flag Button navigates to signup page

Good Answer

Video transcript

Logo for UH Pressbooks

Want to create or adapt books like this? Learn more about how Pressbooks supports open publishing practices.

Hypothesis Testing with One Sample

Null and Alternative Hypotheses

OpenStaxCollege

[latexpage]

The actual test begins by considering two hypotheses . They are called the null hypothesis and the alternative hypothesis . These hypotheses contain opposing viewpoints.

H 0 : The null hypothesis: It is a statement about the population that either is believed to be true or is used to put forth an argument unless it can be shown to be incorrect beyond a reasonable doubt.

H a : The alternative hypothesis: It is a claim about the population that is contradictory to H 0 and what we conclude when we reject H 0 .

Since the null and alternative hypotheses are contradictory, you must examine evidence to decide if you have enough evidence to reject the null hypothesis or not. The evidence is in the form of sample data.

After you have determined which hypothesis the sample supports, you make a decision. There are two options for a decision. They are “reject H 0 ” if the sample information favors the alternative hypothesis or “do not reject H 0 ” or “decline to reject H 0 ” if the sample information is insufficient to reject the null hypothesis.

Mathematical Symbols Used in H 0 and H a :

H 0 always has a symbol with an equal in it. H a never has a symbol with an equal in it. The choice of symbol depends on the wording of the hypothesis test. However, be aware that many researchers (including one of the co-authors in research work) use = in the null hypothesis, even with > or < as the symbol in the alternative hypothesis. This practice is acceptable because we only make the decision to reject or not reject the null hypothesis.

H 0 : No more than 30% of the registered voters in Santa Clara County voted in the primary election. p ≤ 30

A medical trial is conducted to test whether or not a new medicine reduces cholesterol by 25%. State the null and alternative hypotheses.

H 0 : The drug reduces cholesterol by 25%. p = 0.25

H a : The drug does not reduce cholesterol by 25%. p ≠ 0.25

We want to test whether the mean GPA of students in American colleges is different from 2.0 (out of 4.0). The null and alternative hypotheses are:

H 0 : μ = 2.0

We want to test whether the mean height of eighth graders is 66 inches. State the null and alternative hypotheses. Fill in the correct symbol (=, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

  • H 0 : μ = 66
  • H a : μ ≠ 66

We want to test if college students take less than five years to graduate from college, on the average. The null and alternative hypotheses are:

H 0 : μ ≥ 5

We want to test if it takes fewer than 45 minutes to teach a lesson plan. State the null and alternative hypotheses. Fill in the correct symbol ( =, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

  • H 0 : μ ≥ 45
  • H a : μ < 45

In an issue of U. S. News and World Report , an article on school standards stated that about half of all students in France, Germany, and Israel take advanced placement exams and a third pass. The same article stated that 6.6% of U.S. students take advanced placement exams and 4.4% pass. Test if the percentage of U.S. students who take advanced placement exams is more than 6.6%. State the null and alternative hypotheses.

H 0 : p ≤ 0.066

On a state driver’s test, about 40% pass the test on the first try. We want to test if more than 40% pass on the first try. Fill in the correct symbol (=, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

  • H 0 : p = 0.40
  • H a : p > 0.40

<!– ??? –>

Bring to class a newspaper, some news magazines, and some Internet articles . In groups, find articles from which your group can write null and alternative hypotheses. Discuss your hypotheses with the rest of the class.

Chapter Review

In a hypothesis test , sample data is evaluated in order to arrive at a decision about some type of claim. If certain conditions about the sample are satisfied, then the claim can be evaluated for a population. In a hypothesis test, we:

Formula Review

H 0 and H a are contradictory.

If α ≤ p -value, then do not reject H 0 .

If α > p -value, then reject H 0 .

α is preconceived. Its value is set before the hypothesis test starts. The p -value is calculated from the data.

You are testing that the mean speed of your cable Internet connection is more than three Megabits per second. What is the random variable? Describe in words.

The random variable is the mean Internet speed in Megabits per second.

You are testing that the mean speed of your cable Internet connection is more than three Megabits per second. State the null and alternative hypotheses.

The American family has an average of two children. What is the random variable? Describe in words.

The random variable is the mean number of children an American family has.

The mean entry level salary of an employee at a company is 💲58,000. You believe it is higher for IT professionals in the company. State the null and alternative hypotheses.

A sociologist claims the probability that a person picked at random in Times Square in New York City is visiting the area is 0.83. You want to test to see if the proportion is actually less. What is the random variable? Describe in words.

The random variable is the proportion of people picked at random in Times Square visiting the city.

A sociologist claims the probability that a person picked at random in Times Square in New York City is visiting the area is 0.83. You want to test to see if the claim is correct. State the null and alternative hypotheses.

In a population of fish, approximately 42% are female. A test is conducted to see if, in fact, the proportion is less. State the null and alternative hypotheses.

Suppose that a recent article stated that the mean time spent in jail by a first–time convicted burglar is 2.5 years. A study was then done to see if the mean time has increased in the new century. A random sample of 26 first-time convicted burglars in a recent year was picked. The mean length of time in jail from the survey was 3 years with a standard deviation of 1.8 years. Suppose that it is somehow known that the population standard deviation is 1.5. If you were conducting a hypothesis test to determine if the mean length of jail time has increased, what would the null and alternative hypotheses be? The distribution of the population is normal.

A random survey of 75 death row inmates revealed that the mean length of time on death row is 17.4 years with a standard deviation of 6.3 years. If you were conducting a hypothesis test to determine if the population mean time on death row could likely be 15 years, what would the null and alternative hypotheses be?

  • H 0 : __________
  • H a : __________
  • H 0 : μ = 15
  • H a : μ ≠ 15

The National Institute of Mental Health published an article stating that in any one-year period, approximately 9.5 percent of American adults suffer from depression or a depressive illness. Suppose that in a survey of 100 people in a certain town, seven of them suffered from depression or a depressive illness. If you were conducting a hypothesis test to determine if the true proportion of people in that town suffering from depression or a depressive illness is lower than the percent in the general adult American population, what would the null and alternative hypotheses be?

Some of the following statements refer to the null hypothesis, some to the alternate hypothesis.

State the null hypothesis, H 0 , and the alternative hypothesis. H a , in terms of the appropriate parameter ( μ or p ).

  • The mean number of years Americans work before retiring is 34.
  • At most 60% of Americans vote in presidential elections.
  • The mean starting salary for San Jose State University graduates is at least 💲100,000 per year.
  • Twenty-nine percent of high school seniors get drunk each month.
  • Fewer than 5% of adults ride the bus to work in Los Angeles.
  • The mean number of cars a person owns in her lifetime is not more than ten.
  • About half of Americans prefer to live away from cities, given the choice.
  • Europeans have a mean paid vacation each year of six weeks.
  • The chance of developing breast cancer is under 11% for women.
  • Private universities’ mean tuition cost is more than 💲20,000 per year.
  • H 0 : μ = 34; H a : μ ≠ 34
  • H 0 : p ≤ 0.60; H a : p > 0.60
  • H 0 : μ ≥ 100,000; H a : μ < 100,000
  • H 0 : p = 0.29; H a : p ≠ 0.29
  • H 0 : p = 0.05; H a : p < 0.05
  • H 0 : μ ≤ 10; H a : μ > 10
  • H 0 : p = 0.50; H a : p ≠ 0.50
  • H 0 : μ = 6; H a : μ ≠ 6
  • H 0 : p ≥ 0.11; H a : p < 0.11
  • H 0 : μ ≤ 20,000; H a : μ > 20,000

Over the past few decades, public health officials have examined the link between weight concerns and teen girls’ smoking. Researchers surveyed a group of 273 randomly selected teen girls living in Massachusetts (between 12 and 15 years old). After four years the girls were surveyed again. Sixty-three said they smoked to stay thin. Is there good evidence that more than thirty percent of the teen girls smoke to stay thin? The alternative hypothesis is:

  • p < 0.30
  • p > 0.30

A statistics instructor believes that fewer than 20% of Evergreen Valley College (EVC) students attended the opening night midnight showing of the latest Harry Potter movie. She surveys 84 of her students and finds that 11 attended the midnight showing. An appropriate alternative hypothesis is:

  • p > 0.20
  • p < 0.20

Previously, an organization reported that teenagers spent 4.5 hours per week, on average, on the phone. The organization thinks that, currently, the mean is higher. Fifteen randomly chosen teenagers were asked how many hours per week they spend on the phone. The sample mean was 4.75 hours with a sample standard deviation of 2.0. Conduct a hypothesis test. The null and alternative hypotheses are:

  • H o : \(\overline{x}\) = 4.5, H a : \(\overline{x}\) > 4.5
  • H o : μ ≥ 4.5, H a : μ < 4.5
  • H o : μ = 4.75, H a : μ > 4.75
  • H o : μ = 4.5, H a : μ > 4.5

Data from the National Institute of Mental Health. Available online at http://www.nimh.nih.gov/publicat/depression.cfm.

Null and Alternative Hypotheses Copyright © 2013 by OpenStaxCollege is licensed under a Creative Commons Attribution 4.0 International License , except where otherwise noted.

Null Hypothesis Examples

ThoughtCo / Hilary Allison

  • Scientific Method
  • Chemical Laws
  • Periodic Table
  • Projects & Experiments
  • Biochemistry
  • Physical Chemistry
  • Medical Chemistry
  • Chemistry In Everyday Life
  • Famous Chemists
  • Activities for Kids
  • Abbreviations & Acronyms
  • Weather & Climate
  • Ph.D., Biomedical Sciences, University of Tennessee at Knoxville
  • B.A., Physics and Mathematics, Hastings College

In statistical analysis, the null hypothesis assumes there is no meaningful relationship between two variables. Testing the null hypothesis can tell you whether your results are due to the effect of manipulating ​a dependent variable or due to chance. It's often used in conjunction with an alternative hypothesis, which assumes there is, in fact, a relationship between two variables.

The null hypothesis is among the easiest hypothesis to test using statistical analysis, making it perhaps the most valuable hypothesis for the scientific method. By evaluating a null hypothesis in addition to another hypothesis, researchers can support their conclusions with a higher level of confidence. Below are examples of how you might formulate a null hypothesis to fit certain questions.

What Is the Null Hypothesis?

The null hypothesis states there is no relationship between the measured phenomenon (the dependent variable ) and the independent variable , which is the variable an experimenter typically controls or changes. You do not​ need to believe that the null hypothesis is true to test it. On the contrary, you will likely suspect there is a relationship between a set of variables. One way to prove that this is the case is to reject the null hypothesis. Rejecting a hypothesis does not mean an experiment was "bad" or that it didn't produce results. In fact, it is often one of the first steps toward further inquiry.

To distinguish it from other hypotheses , the null hypothesis is written as ​ H 0  (which is read as “H-nought,” "H-null," or "H-zero"). A significance test is used to determine the likelihood that the results supporting the null hypothesis are not due to chance. A confidence level of 95% or 99% is common. Keep in mind, even if the confidence level is high, there is still a small chance the null hypothesis is not true, perhaps because the experimenter did not account for a critical factor or because of chance. This is one reason why it's important to repeat experiments.

Examples of the Null Hypothesis

To write a null hypothesis, first start by asking a question. Rephrase that question in a form that assumes no relationship between the variables. In other words, assume a treatment has no effect. Write your hypothesis in a way that reflects this.

Other Types of Hypotheses

In addition to the null hypothesis, the alternative hypothesis is also a staple in traditional significance tests . It's essentially the opposite of the null hypothesis because it assumes the claim in question is true. For the first item in the table above, for example, an alternative hypothesis might be "Age does have an effect on mathematical ability."

Key Takeaways

  • In hypothesis testing, the null hypothesis assumes no relationship between two variables, providing a baseline for statistical analysis.
  • Rejecting the null hypothesis suggests there is evidence of a relationship between variables.
  • By formulating a null hypothesis, researchers can systematically test assumptions and draw more reliable conclusions from their experiments.
  • Difference Between Independent and Dependent Variables
  • Examples of Independent and Dependent Variables
  • What Is a Hypothesis? (Science)
  • Definition of a Hypothesis
  • What 'Fail to Reject' Means in a Hypothesis Test
  • Null Hypothesis Definition and Examples
  • Scientific Method Vocabulary Terms
  • Null Hypothesis and Alternative Hypothesis
  • Hypothesis Test for the Difference of Two Population Proportions
  • How to Conduct a Hypothesis Test
  • What Is a P-Value?
  • What Are the Elements of a Good Hypothesis?
  • What Is the Difference Between Alpha and P-Values?
  • Hypothesis Test Example
  • Understanding Path Analysis
  • An Example of a Hypothesis Test

What Is a Null Hypothesis?

A null hypothesis may be a sort of hypothesis utilized in statistics that proposes that there’s no difference between certain characteristics of a population (or data-generating process). For example, a gambler could also be curious about whether a game of chance is fair. If it’s fair, then the expected earnings per play is 0 for […]

A null hypothesis may be a sort of hypothesis utilized in statistics that proposes that there’s no difference between certain characteristics of a population (or data-generating process).

For example, a gambler could also be curious about whether a game of chance is fair. If it’s fair, then the expected earnings per play is 0 for both players. If the sport isn’t fair, then the expected earnings are positive for one player and negative for the opposite. to check whether the sport is fair, the gambler collects earnings data from many repetitions of the sport , calculates the typical earnings from these data, then tests the null hypothesis that the expected earnings isn’t different from zero.

If the typical earnings from the sample data is sufficiently faraway from zero, then the gambler will reject the null hypothesis and conclude the choice hypothesis; namely, that the expected earnings per play is different from zero. If the typical earnings from the sample data are on the brink of zero, then the gambler won’t reject the null hypothesis, concluding instead that the difference between the typical from the info and 0 is explainable accidentally alone.

KEY TAKEAWAYS

A null hypothesis may be a sort of conjecture utilized in statistics that proposes that there’s no difference between certain characteristics of a population or data-generating process.

The alternative hypothesis proposes that there’s a difference.

Hypothesis testing provides a way to reject a null hypothesis within a particular confidence level. (Null hypotheses can’t be proven, though.)

How a Null Hypothesis Works

The null hypothesis, also referred to as the conjecture, assumes that any quite difference between the chosen characteristics that you simply see during a set of knowledge is thanks to chance. For instance, if the expected earnings for the game of chance are actually adequate to 0, then any difference between the typical earnings within the data and 0 is thanks to chance.

Statistical hypotheses are tested employing a four-step process. The primary step is for the analyst to state the 2 hypotheses in order that just one is often right. Subsequent step is to formulate an analysis plan, which outlines how the info is going to be evaluated. The third step is to hold out the plan and physically analyze the sample data. The fourth and final step is to research the results and either reject the null hypothesis, or claim that the observed differences are explainable accidentally alone.

Analysts look to reject the null hypothesis because it’s a robust conclusion. The choice conclusion, that the results are “explainable accidentally alone,” could also be a weak conclusion because it allows that factors aside from chance may be at work.

Analysts look to reject the null hypothesis to rule out some variable(s) as explaining the phenomena of interest.

Null Hypothesis Example

Here may be a simple example: a faculty principal reports that students in her school score a mean of seven out of 10 in exams. The null hypothesis is that the population mean is 7.0. To check this null hypothesis, we record marks of say 30 students (sample) from the whole student population of the varsity (say 300) and calculate the mean of that sample. We will then compare the (calculated) sample mean to the (claimed) population mean of seven .0 and plan to reject the null hypothesis. (The null hypothesis that the population mean is 7.0 can’t be proven using the sample data; it can only be rejected.)

Take another example: The annual return of a specific open-end fund is claimed to be 8%. Assume that open-end fund has been alive for 20 years. The null hypothesis is that the mean return is 8% for the open-end fund. We take a random sample of annual returns of the open-end fund for, say, five years (sample) and calculate the sample mean. We then compare the (calculated) sample mean to the (claimed) population mean (8%) to test the null hypothesis.

For the above examples, null hypotheses are:

Example A: Students within the school score a mean of seven out of 10 in exams.

Example B: Mean annual return of the open-end fund is 8% once a year.

For the needs of determining whether to reject the null hypothesis, the null hypothesis (abbreviated H0) is assumed, for the sake of argument, to be true. Then the likely range of possible values of the calculated statistic (e.g., average score on 30 students’ tests) is decided under this presumption (e.g., the range of plausible averages may range from 6.2 to 7.8 if the population mean is 7.0). Then, if the sample average is outside of this range, the null hypothesis is rejected. Otherwise, the difference is claimed to be “explainable accidentally alone,” being within the range that’s determined accidentally alone.

An important point to notice is that we are testing the null hypothesis because there’s a component of doubt about its validity. Whatever information that’s against the stated null hypothesis is captured within the Alternative Hypothesis (H1). For the above examples, the choice hypothesis would be:

Students score a mean that’s not adequate to 7.

The mean annual return of the open-end fund isn’t adequate to 8% once a year.

In other words, the choice hypothesis may be a direct contradiction of the null hypothesis.

Hypothesis Testing for Investments

As an example associated with financial markets, assume Alice sees that her investment strategy produces higher average returns than simply buying and holding a stock. The null hypothesis states that there’s no difference between the 2 average returns, and Alice is inclined to believe this until she proves otherwise. Refuting the null hypothesis would require showing statistical significance, which may be found employing a sort of tests. The choice hypothesis would state that the investment strategy features a higher average return than a standard buy-and-hold strategy.

The p-value is employed to work out the statistical significance of the results. A p-value that’s but or adequate to 0.05 is usually wont to indicate whether there’s evidence against the null hypothesis. If Alice conducts one among these tests, like a test using the traditional model, and proves that the difference between her returns and therefore the buy-and-hold returns is critical (p-value is a smaller amount than or adequate to 0.05), she will then refute the null hypothesis and conclude the choice hypothesis.

Weekly newsletter

No spam. Just the latest releases and tips, interesting articles, and exclusive interviews in your inbox every week.

You may also like

False negative.

While understanding the hypothesis, two errors can be quite confusing. These two errors are false negative and false positive. You can also refer to the false-negative error as type II error and false-positive as type I error. While you are learning, you might think these errors have no use and will only waste your time […]

Box Plot Review

A box plot or box and whisker plot help you display the database distribution on a five-number summary. The first quartile Q1 will be the minimum, the third quartile Q3 will be the median, and the fifth quartile Q5 will be the maximum. You can find the outliers and their values by using a box […]

the null hypothesis stands

Bayesian Networks

Creating a probabilistic model can be challenging but proves helpful in machine learning. To create such a graphical model, you need to find the probabilistic relationships between variables. Suppose you are creating a graphical representation of the variables. You need to represent the variables as nodes and conditional independence as the absence of edges. Graphical […]

the null hypothesis stands

Privacy Overview

  • Math Article

Null Hypothesis

In mathematics, Statistics deals with the study of research and surveys on the numerical data. For taking surveys, we have to define the hypothesis. Generally, there are two types of hypothesis. One is a null hypothesis, and another is an alternative hypothesis .

In probability and statistics, the null hypothesis is a comprehensive statement or default status that there is zero happening or nothing happening. For example, there is no connection among groups or no association between two measured events. It is generally assumed here that the hypothesis is true until any other proof has been brought into the light to deny the hypothesis. Let us learn more here with definition, symbol, principle, types and example, in this article.

Table of contents:

  • Comparison with Alternative Hypothesis

Null Hypothesis Definition

The null hypothesis is a kind of hypothesis which explains the population parameter whose purpose is to test the validity of the given experimental data. This hypothesis is either rejected or not rejected based on the viability of the given population or sample . In other words, the null hypothesis is a hypothesis in which the sample observations results from the chance. It is said to be a statement in which the surveyors wants to examine the data. It is denoted by H 0 .

Null Hypothesis Symbol

In statistics, the null hypothesis is usually denoted by letter H with subscript ‘0’ (zero), such that H 0 . It is pronounced as H-null or H-zero or H-nought. At the same time, the alternative hypothesis expresses the observations determined by the non-random cause. It is represented by H 1 or H a .

Null Hypothesis Principle

The principle followed for null hypothesis testing is, collecting the data and determining the chances of a given set of data during the study on some random sample, assuming that the null hypothesis is true. In case if the given data does not face the expected null hypothesis, then the outcome will be quite weaker, and they conclude by saying that the given set of data does not provide strong evidence against the null hypothesis because of insufficient evidence. Finally, the researchers tend to reject that.

Null Hypothesis Formula

Here, the hypothesis test formulas are given below for reference.

The formula for the null hypothesis is:

H 0 :  p = p 0

The formula for the alternative hypothesis is:

H a = p >p 0 , < p 0 ≠ p 0

The formula for the test static is:

Remember that,  p 0  is the null hypothesis and p – hat is the sample proportion.

Also, read:

Types of Null Hypothesis

There are different types of hypothesis. They are:

Simple Hypothesis

It completely specifies the population distribution. In this method, the sampling distribution is the function of the sample size.

Composite Hypothesis

The composite hypothesis is one that does not completely specify the population distribution.

Exact Hypothesis

Exact hypothesis defines the exact value of the parameter. For example μ= 50

Inexact Hypothesis

This type of hypothesis does not define the exact value of the parameter. But it denotes a specific range or interval. For example 45< μ <60

Null Hypothesis Rejection

Sometimes the null hypothesis is rejected too. If this hypothesis is rejected means, that research could be invalid. Many researchers will neglect this hypothesis as it is merely opposite to the alternate hypothesis. It is a better practice to create a hypothesis and test it. The goal of researchers is not to reject the hypothesis. But it is evident that a perfect statistical model is always associated with the failure to reject the null hypothesis.

How do you Find the Null Hypothesis?

The null hypothesis says there is no correlation between the measured event (the dependent variable) and the independent variable. We don’t have to believe that the null hypothesis is true to test it. On the contrast, you will possibly assume that there is a connection between a set of variables ( dependent and independent).

When is Null Hypothesis Rejected?

The null hypothesis is rejected using the P-value approach. If the P-value is less than or equal to the α, there should be a rejection of the null hypothesis in favour of the alternate hypothesis. In case, if P-value is greater than α, the null hypothesis is not rejected.

Null Hypothesis and Alternative Hypothesis

Now, let us discuss the difference between the null hypothesis and the alternative hypothesis.

Null Hypothesis Examples

Here, some of the examples of the null hypothesis are given below. Go through the below ones to understand the concept of the null hypothesis in a better way.

If a medicine reduces the risk of cardiac stroke, then the null hypothesis should be “the medicine does not reduce the chance of cardiac stroke”. This testing can be performed by the administration of a drug to a certain group of people in a controlled way. If the survey shows that there is a significant change in the people, then the hypothesis is rejected.

Few more examples are:

1). Are there is 100% chance of getting affected by dengue?

Ans: There could be chances of getting affected by dengue but not 100%.

2). Do teenagers are using mobile phones more than grown-ups to access the internet?

Ans: Age has no limit on using mobile phones to access the internet.

3). Does having apple daily will not cause fever?

Ans: Having apple daily does not assure of not having fever, but increases the immunity to fight against such diseases.

4). Do the children more good in doing mathematical calculations than grown-ups?

Ans: Age has no effect on Mathematical skills.

In many common applications, the choice of the null hypothesis is not automated, but the testing and calculations may be automated. Also, the choice of the null hypothesis is completely based on previous experiences and inconsistent advice. The choice can be more complicated and based on the variety of applications and the diversity of the objectives. 

The main limitation for the choice of the null hypothesis is that the hypothesis suggested by the data is based on the reasoning which proves nothing. It means that if some hypothesis provides a summary of the data set, then there would be no value in the testing of the hypothesis on the particular set of data. 

Frequently Asked Questions on Null Hypothesis

What is meant by the null hypothesis.

In Statistics, a null hypothesis is a type of hypothesis which explains the population parameter whose purpose is to test the validity of the given experimental data.

What are the benefits of hypothesis testing?

Hypothesis testing is defined as a form of inferential statistics, which allows making conclusions from the entire population based on the sample representative.

When a null hypothesis is accepted and rejected?

The null hypothesis is either accepted or rejected in terms of the given data. If P-value is less than α, then the null hypothesis is rejected in favor of the alternative hypothesis, and if the P-value is greater than α, then the null hypothesis is accepted in favor of the alternative hypothesis.

Why is the null hypothesis important?

The importance of the null hypothesis is that it provides an approximate description of the phenomena of the given data. It allows the investigators to directly test the relational statement in a research study.

How to accept or reject the null hypothesis in the chi-square test?

If the result of the chi-square test is bigger than the critical value in the table, then the data does not fit the model, which represents the rejection of the null hypothesis.

Quiz Image

Put your understanding of this concept to test by answering a few MCQs. Click ‘Start Quiz’ to begin!

Select the correct answer and click on the “Finish” button Check your score and answers at the end of the quiz

Visit BYJU’S for all Maths related queries and study materials

Your result is as below

Request OTP on Voice Call

the null hypothesis stands

  • Share Share

Register with BYJU'S & Download Free PDFs

Register with byju's & watch live videos.

close

  • Search Search Please fill out this field.
  • Corporate Finance
  • Financial Ratios

Null Hypothesis: What Is It and How Is It Used in Investing?

Adam Hayes, Ph.D., CFA, is a financial writer with 15+ years Wall Street experience as a derivatives trader. Besides his extensive derivative trading expertise, Adam is an expert in economics and behavioral finance. Adam received his master's in economics from The New School for Social Research and his Ph.D. from the University of Wisconsin-Madison in sociology. He is a CFA charterholder as well as holding FINRA Series 7, 55 & 63 licenses. He currently researches and teaches economic sociology and the social studies of finance at the Hebrew University in Jerusalem.

the null hypothesis stands

What Is a Null Hypothesis?

A null hypothesis is a type of statistical hypothesis that proposes that no statistical significance exists in a set of given observations. Hypothesis testing is used to assess the credibility of a hypothesis by using sample data. Sometimes referred to simply as the "null," it is represented as H 0 .

The null hypothesis, also known as the conjecture, is used in quantitative analysis to test theories about markets, investing strategies, or economies to decide if an idea is true or false.

Key Takeaways

  • A null hypothesis is a type of conjecture in statistics that proposes that there is no difference between certain characteristics of a population or data-generating process.
  • The alternative hypothesis proposes that there is a difference.
  • Hypothesis testing provides a method to reject a null hypothesis within a certain confidence level.
  • If you can reject the null hypothesis, it provides support for the alternative hypothesis.
  • Null hypothesis testing is the basis of the principle of falsification in science.

Investopedia / Alex Dos Diaz

How a Null Hypothesis Works

A null hypothesis is a type of conjecture in statistics that proposes that there is no difference between certain characteristics of a population or data-generating process. For example, a gambler may be interested in whether a game of chance is fair. If it is fair, then the expected earnings per play come to zero for both players. If the game is not fair, then the expected earnings are positive for one player and negative for the other. To test whether the game is fair, the gambler collects earnings data from many repetitions of the game, calculates the average earnings from these data, then tests the null hypothesis that the expected earnings are not different from zero.

If the average earnings from the sample data are sufficiently far from zero, then the gambler will reject the null hypothesis and conclude the alternative hypothesis—namely, that the expected earnings per play are different from zero. If the average earnings from the sample data are near zero, then the gambler will not reject the null hypothesis, concluding instead that the difference between the average from the data and zero is explainable by chance alone.

The null hypothesis assumes that any kind of difference between the chosen characteristics that you see in a set of data is due to chance. For example, if the expected earnings for the gambling game are truly equal to zero, then any difference between the average earnings in the data and zero is due to chance.

Analysts look to reject   the null hypothesis because doing so is a strong conclusion. This requires strong evidence in the form of an observed difference that is too large to be explained solely by chance. Failing to reject the null hypothesis—that the results are explainable by chance alone—is a weak conclusion because it allows that factors other than chance may be at work but may not be strong enough for the statistical test to detect them.

A null hypothesis can only be rejected, not proven.

The Alternative Hypothesis

An important point to note is that we are testing the null hypothesis because there is an element of doubt about its validity. Whatever information that is against the stated null hypothesis is captured in the alternative (alternate) hypothesis (H1).

For the above examples, the alternative hypothesis would be:

  • Students score an average that is  not  equal to seven.
  • The mean annual return of the mutual fund is  not  equal to 8% per year.

In other words, the alternative hypothesis is a direct contradiction of the null hypothesis.

Examples of a Null Hypothesis

Here is a simple example: A school principal claims that students in her school score an average of seven out of 10 in exams. The null hypothesis is that the population mean is 7.0. To test this null hypothesis, we record marks of, say, 30 students (sample) from the entire student population of the school (say 300) and calculate the mean of that sample.

We can then compare the (calculated) sample mean to the (hypothesized) population mean of 7.0 and attempt to reject the null hypothesis. (The null hypothesis here—that the population mean is 7.0—cannot be proved using the sample data. It can only be rejected.)

Take another example: The annual return of a particular  mutual fund  is claimed to be 8%. Assume that a mutual fund has been in existence for 20 years. The null hypothesis is that the mean return is 8% for the mutual fund. We take a random sample of annual returns of the mutual fund for, say, five years (sample) and calculate the sample mean. We then compare the (calculated) sample mean to the (claimed) population mean (8%) to test the null hypothesis.

For the above examples, null hypotheses are:

  • Example A : Students in the school score an average of seven out of 10 in exams.
  • Example B: Mean annual return of the mutual fund is 8% per year.

For the purposes of determining whether to reject the null hypothesis, the null hypothesis (abbreviated H 0 ) is assumed, for the sake of argument, to be true. Then the likely range of possible values of the calculated statistic (e.g., the average score on 30 students’ tests) is determined under this presumption (e.g., the range of plausible averages might range from 6.2 to 7.8 if the population mean is 7.0). Then, if the sample average is outside of this range, the null hypothesis is rejected. Otherwise, the difference is said to be “explainable by chance alone,” being within the range that is determined by chance alone.

How Null Hypothesis Testing Is Used in Investments

As an example related to financial markets, assume Alice sees that her investment strategy produces higher average returns than simply buying and holding a stock . The null hypothesis states that there is no difference between the two average returns, and Alice is inclined to believe this until she can conclude contradictory results.

Refuting the null hypothesis would require showing statistical significance, which can be found by a variety of tests. The alternative hypothesis would state that the investment strategy has a higher average return than a traditional buy-and-hold strategy.

One tool that can determine the statistical significance of the results is the p-value. A p-value represents the probability that a difference as large or larger than the observed difference between the two average returns could occur solely by chance.

A p-value that is less than or equal to 0.05 often indicates whether there is evidence against the null hypothesis. If Alice conducts one of these tests, such as a test using the normal model, resulting in a significant difference between her returns and the buy-and-hold returns (the p-value is less than or equal to 0.05), she can then reject the null hypothesis and conclude the alternative hypothesis.

How Is the Null Hypothesis Identified?

The analyst or researcher establishes a null hypothesis based on the research question or problem that they are trying to answer. Depending on the question, the null may be identified differently. For example, if the question is simply whether an effect exists (e.g., does X influence Y?) the null hypothesis could be H 0 : X = 0. If the question is instead, is X the same as Y, the H0 would be X = Y. If it is that the effect of X on Y is positive, H0 would be X > 0. If the resulting analysis shows an effect that is statistically significantly different from zero, the null can be rejected.

How Is Null Hypothesis Used in Finance?

In finance, a null hypothesis is used in quantitative analysis. A null hypothesis tests the premise of an investing strategy, the markets, or an economy to determine if it is true or false. For instance, an analyst may want to see if two stocks, ABC and XYZ, are closely correlated. The null hypothesis would be ABC ≠ XYZ.

How Are Statistical Hypotheses Tested?

Statistical hypotheses are tested by a four-step process . The first step is for the analyst to state the two hypotheses so that only one can be right. The next step is to formulate an analysis plan, which outlines how the data will be evaluated. The third step is to carry out the plan and physically analyze the sample data. The fourth and final step is to analyze the results and either reject the null hypothesis or claim that the observed differences are explainable by chance alone.

What Is an Alternative Hypothesis?

An alternative hypothesis is a direct contradiction of a null hypothesis. This means that if one of the two hypotheses is true, the other is false.

Sage Publishing. " Chapter 8: Introduction to Hypothesis Testing ," Pages 4–7.

Sage Publishing. " Chapter 8: Introduction to Hypothesis Testing ," Page 4.

Sage Publishing. " Chapter 8: Introduction to Hypothesis Testing ," Page 7.

the null hypothesis stands

  • Terms of Service
  • Editorial Policy
  • Privacy Policy
  • Your Privacy Choices

9.1 Null and Alternative Hypotheses

The actual test begins by considering two hypotheses . They are called the null hypothesis and the alternative hypothesis . These hypotheses contain opposing viewpoints.

H 0 : The null hypothesis: It is a statement of no difference between the variables—they are not related. This can often be considered the status quo and as a result if you cannot accept the null it requires some action.

H a : The alternative hypothesis: It is a claim about the population that is contradictory to H 0 and what we conclude when we reject H 0 . This is usually what the researcher is trying to prove.

Since the null and alternative hypotheses are contradictory, you must examine evidence to decide if you have enough evidence to reject the null hypothesis or not. The evidence is in the form of sample data.

After you have determined which hypothesis the sample supports, you make a decision. There are two options for a decision. They are "reject H 0 " if the sample information favors the alternative hypothesis or "do not reject H 0 " or "decline to reject H 0 " if the sample information is insufficient to reject the null hypothesis.

Mathematical Symbols Used in H 0 and H a :

H 0 always has a symbol with an equal in it. H a never has a symbol with an equal in it. The choice of symbol depends on the wording of the hypothesis test. However, be aware that many researchers (including one of the co-authors in research work) use = in the null hypothesis, even with > or < as the symbol in the alternative hypothesis. This practice is acceptable because we only make the decision to reject or not reject the null hypothesis.

Example 9.1

H 0 : No more than 30% of the registered voters in Santa Clara County voted in the primary election. p ≤ .30 H a : More than 30% of the registered voters in Santa Clara County voted in the primary election. p > 30

A medical trial is conducted to test whether or not a new medicine reduces cholesterol by 25%. State the null and alternative hypotheses.

Example 9.2

We want to test whether the mean GPA of students in American colleges is different from 2.0 (out of 4.0). The null and alternative hypotheses are: H 0 : μ = 2.0 H a : μ ≠ 2.0

We want to test whether the mean height of eighth graders is 66 inches. State the null and alternative hypotheses. Fill in the correct symbol (=, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

  • H 0 : μ __ 66
  • H a : μ __ 66

Example 9.3

We want to test if college students take less than five years to graduate from college, on the average. The null and alternative hypotheses are: H 0 : μ ≥ 5 H a : μ < 5

We want to test if it takes fewer than 45 minutes to teach a lesson plan. State the null and alternative hypotheses. Fill in the correct symbol ( =, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

  • H 0 : μ __ 45
  • H a : μ __ 45

Example 9.4

In an issue of U. S. News and World Report , an article on school standards stated that about half of all students in France, Germany, and Israel take advanced placement exams and a third pass. The same article stated that 6.6% of U.S. students take advanced placement exams and 4.4% pass. Test if the percentage of U.S. students who take advanced placement exams is more than 6.6%. State the null and alternative hypotheses. H 0 : p ≤ 0.066 H a : p > 0.066

On a state driver’s test, about 40% pass the test on the first try. We want to test if more than 40% pass on the first try. Fill in the correct symbol (=, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

  • H 0 : p __ 0.40
  • H a : p __ 0.40

Collaborative Exercise

Bring to class a newspaper, some news magazines, and some Internet articles . In groups, find articles from which your group can write null and alternative hypotheses. Discuss your hypotheses with the rest of the class.

This book may not be used in the training of large language models or otherwise be ingested into large language models or generative AI offerings without OpenStax's permission.

Want to cite, share, or modify this book? This book uses the Creative Commons Attribution License and you must attribute OpenStax.

Access for free at https://openstax.org/books/introductory-statistics-2e/pages/1-introduction
  • Authors: Barbara Illowsky, Susan Dean
  • Publisher/website: OpenStax
  • Book title: Introductory Statistics 2e
  • Publication date: Dec 13, 2023
  • Location: Houston, Texas
  • Book URL: https://openstax.org/books/introductory-statistics-2e/pages/1-introduction
  • Section URL: https://openstax.org/books/introductory-statistics-2e/pages/9-1-null-and-alternative-hypotheses

© Dec 6, 2023 OpenStax. Textbook content produced by OpenStax is licensed under a Creative Commons Attribution License . The OpenStax name, OpenStax logo, OpenStax book covers, OpenStax CNX name, and OpenStax CNX logo are not subject to the Creative Commons license and may not be reproduced without the prior and express written consent of Rice University.

  • School Guide
  • Mathematics
  • Number System and Arithmetic
  • Trigonometry
  • Probability
  • Mensuration
  • Maths Formulas
  • Integration Formulas
  • Differentiation Formulas
  • Trigonometry Formulas
  • Algebra Formulas
  • Mensuration Formula
  • Statistics Formulas
  • Trigonometric Table

Null Hypothesis

  • Hypothesis Testing Formula
  • SQLite IS NULL
  • MySQL NOT NULL Constraint
  • Real-life Applications of Hypothesis Testing
  • Null in JavaScript
  • SQL NOT NULL Constraint
  • Null Cipher
  • Hypothesis in Machine Learning
  • Understanding Hypothesis Testing
  • Difference between Null and Alternate Hypothesis
  • Hypothesis Testing in R Programming
  • SQL | NULL functions
  • Kotlin Null Safety
  • PHP | is_null() Function
  • MySQL | NULLIF( ) Function
  • Null-Coalescing Operator in C#
  • Nullity of a Matrix

Null Hypothesis , often denoted as H 0, is a foundational concept in statistical hypothesis testing. It represents an assumption that no significant difference, effect, or relationship exists between variables within a population. It serves as a baseline assumption, positing no observed change or effect occurring. The null is t he truth or falsity of an idea in analysis.

In this article, we will discuss the null hypothesis in detail, along with some solved examples and questions on the null hypothesis.

Table of Content

What is Null Hypothesis?

Null hypothesis symbol, formula of null hypothesis, types of null hypothesis, null hypothesis examples, principle of null hypothesis, how do you find null hypothesis, null hypothesis in statistics, null hypothesis and alternative hypothesis, null hypothesis and alternative hypothesis examples, null hypothesis – practice problems.

Null Hypothesis in statistical analysis suggests the absence of statistical significance within a specific set of observed data. Hypothesis testing, using sample data, evaluates the validity of this hypothesis. Commonly denoted as H 0 or simply “null,” it plays an important role in quantitative analysis, examining theories related to markets, investment strategies, or economies to determine their validity.

Null Hypothesis Meaning

Null Hypothesis represents a default position, often suggesting no effect or difference, against which researchers compare their experimental results. The Null Hypothesis, often denoted as H 0 asserts a default assumption in statistical analysis. It posits no significant difference or effect, serving as a baseline for comparison in hypothesis testing.

The null Hypothesis is represented as H 0 , the Null Hypothesis symbolizes the absence of a measurable effect or difference in the variables under examination.

Certainly, a simple example would be asserting that the mean score of a group is equal to a specified value like stating that the average IQ of a population is 100.

The Null Hypothesis is typically formulated as a statement of equality or absence of a specific parameter in the population being studied. It provides a clear and testable prediction for comparison with the alternative hypothesis. The formulation of the Null Hypothesis typically follows a concise structure, stating the equality or absence of a specific parameter in the population.

Mean Comparison (Two-sample t-test)

H 0 : μ 1 = μ 2

This asserts that there is no significant difference between the means of two populations or groups.

Proportion Comparison

H 0 : p 1 − p 2 = 0

This suggests no significant difference in proportions between two populations or conditions.

Equality in Variance (F-test in ANOVA)

H 0 : σ 1 = σ 2

This states that there’s no significant difference in variances between groups or populations.

Independence (Chi-square Test of Independence):

H 0 : Variables are independent

This asserts that there’s no association or relationship between categorical variables.

Null Hypotheses vary including simple and composite forms, each tailored to the complexity of the research question. Understanding these types is pivotal for effective hypothesis testing.

Equality Null Hypothesis (Simple Null Hypothesis)

The Equality Null Hypothesis, also known as the Simple Null Hypothesis, is a fundamental concept in statistical hypothesis testing that assumes no difference, effect or relationship between groups, conditions or populations being compared.

Non-Inferiority Null Hypothesis

In some studies, the focus might be on demonstrating that a new treatment or method is not significantly worse than the standard or existing one.

Superiority Null Hypothesis

The concept of a superiority null hypothesis comes into play when a study aims to demonstrate that a new treatment, method, or intervention is significantly better than an existing or standard one.

Independence Null Hypothesis

In certain statistical tests, such as chi-square tests for independence, the null hypothesis assumes no association or independence between categorical variables.

Homogeneity Null Hypothesis

In tests like ANOVA (Analysis of Variance), the null hypothesis suggests that there’s no difference in population means across different groups.

  • Medicine: Null Hypothesis: “No significant difference exists in blood pressure levels between patients given the experimental drug versus those given a placebo.”
  • Education: Null Hypothesis: “There’s no significant variation in test scores between students using a new teaching method and those using traditional teaching.”
  • Economics: Null Hypothesis: “There’s no significant change in consumer spending pre- and post-implementation of a new taxation policy.”
  • Environmental Science: Null Hypothesis: “There’s no substantial difference in pollution levels before and after a water treatment plant’s establishment.”

The principle of the null hypothesis is a fundamental concept in statistical hypothesis testing. It involves making an assumption about the population parameter or the absence of an effect or relationship between variables.

In essence, the null hypothesis (H 0 ) proposes that there is no significant difference, effect, or relationship between variables. It serves as a starting point or a default assumption that there is no real change, no effect or no difference between groups or conditions.

The null hypothesis is usually formulated to be tested against an alternative hypothesis (H 1 or H [Tex]\alpha [/Tex] ) which suggests that there is an effect, difference or relationship present in the population.

Null Hypothesis Rejection

Rejecting the Null Hypothesis occurs when statistical evidence suggests a significant departure from the assumed baseline. It implies that there is enough evidence to support the alternative hypothesis, indicating a meaningful effect or difference. Null Hypothesis rejection occurs when statistical evidence suggests a deviation from the assumed baseline, prompting a reconsideration of the initial hypothesis.

Identifying the Null Hypothesis involves defining the status quotient, asserting no effect and formulating a statement suitable for statistical analysis.

When is Null Hypothesis Rejected?

The Null Hypothesis is rejected when statistical tests indicate a significant departure from the expected outcome, leading to the consideration of alternative hypotheses. It occurs when statistical evidence suggests a deviation from the assumed baseline, prompting a reconsideration of the initial hypothesis.

In statistical hypothesis testing, researchers begin by stating the null hypothesis, often based on theoretical considerations or previous research. The null hypothesis is then tested against an alternative hypothesis (Ha), which represents the researcher’s claim or the hypothesis they seek to support.

The process of hypothesis testing involves collecting sample data and using statistical methods to assess the likelihood of observing the data if the null hypothesis were true. This assessment is typically done by calculating a test statistic, which measures the difference between the observed data and what would be expected under the null hypothesis.

In the realm of hypothesis testing, the null hypothesis (H 0 ) and alternative hypothesis (H₁ or Ha) play critical roles. The null hypothesis generally assumes no difference, effect, or relationship between variables, suggesting that any observed change or effect is due to random chance. Its counterpart, the alternative hypothesis, asserts the presence of a significant difference, effect, or relationship between variables, challenging the null hypothesis. These hypotheses are formulated based on the research question and guide statistical analyses.

Difference Between Null Hypothesis and Alternative Hypothesis

The null hypothesis (H 0 ) serves as the baseline assumption in statistical testing, suggesting no significant effect, relationship, or difference within the data. It often proposes that any observed change or correlation is merely due to chance or random variation. Conversely, the alternative hypothesis (H 1 or Ha) contradicts the null hypothesis, positing the existence of a genuine effect, relationship or difference in the data. It represents the researcher’s intended focus, seeking to provide evidence against the null hypothesis and support for a specific outcome or theory. These hypotheses form the crux of hypothesis testing, guiding the assessment of data to draw conclusions about the population being studied.

Let’s envision a scenario where a researcher aims to examine the impact of a new medication on reducing blood pressure among patients. In this context:

Null Hypothesis (H 0 ): “The new medication does not produce a significant effect in reducing blood pressure levels among patients.”

Alternative Hypothesis (H 1 or Ha): “The new medication yields a significant effect in reducing blood pressure levels among patients.”

The null hypothesis implies that any observed alterations in blood pressure subsequent to the medication’s administration are a result of random fluctuations rather than a consequence of the medication itself. Conversely, the alternative hypothesis contends that the medication does indeed generate a meaningful alteration in blood pressure levels, distinct from what might naturally occur or by random chance.

People Also Read:

Mathematics Maths Formulas Probability and Statistics

Example 1: A researcher claims that the average time students spend on homework is 2 hours per night.

Null Hypothesis (H 0 ): The average time students spend on homework is equal to 2 hours per night. Data: A random sample of 30 students has an average homework time of 1.8 hours with a standard deviation of 0.5 hours. Test Statistic and Decision: Using a t-test, if the calculated t-statistic falls within the acceptance region, we fail to reject the null hypothesis. If it falls in the rejection region, we reject the null hypothesis. Conclusion: Based on the statistical analysis, we fail to reject the null hypothesis, suggesting that there is not enough evidence to dispute the claim of the average homework time being 2 hours per night.

Example 2: A company asserts that the error rate in its production process is less than 1%.

Null Hypothesis (H 0 ): The error rate in the production process is 1% or higher. Data: A sample of 500 products shows an error rate of 0.8%. Test Statistic and Decision: Using a z-test, if the calculated z-statistic falls within the acceptance region, we fail to reject the null hypothesis. If it falls in the rejection region, we reject the null hypothesis. Conclusion: The statistical analysis supports rejecting the null hypothesis, indicating that there is enough evidence to dispute the company’s claim of an error rate of 1% or higher.

Q1. A researcher claims that the average time spent by students on homework is less than 2 hours per day. Formulate the null hypothesis for this claim?

Q2. A manufacturing company states that their new machine produces widgets with a defect rate of less than 5%. Write the null hypothesis to test this claim?

Q3. An educational institute believes that their online course completion rate is at least 60%. Develop the null hypothesis to validate this assertion?

Q4. A restaurant claims that the waiting time for customers during peak hours is not more than 15 minutes. Formulate the null hypothesis for this claim?

Q5. A study suggests that the mean weight loss after following a specific diet plan for a month is more than 8 pounds. Construct the null hypothesis to evaluate this statement?

Summary – Null Hypothesis and Alternative Hypothesis

The null hypothesis (H 0 ) and alternative hypothesis (H a ) are fundamental concepts in statistical hypothesis testing. The null hypothesis represents the default assumption, stating that there is no significant effect, difference, or relationship between variables. It serves as the baseline against which the alternative hypothesis is tested. In contrast, the alternative hypothesis represents the researcher’s hypothesis or the claim to be tested, suggesting that there is a significant effect, difference, or relationship between variables. The relationship between the null and alternative hypotheses is such that they are complementary, and statistical tests are conducted to determine whether the evidence from the data is strong enough to reject the null hypothesis in favor of the alternative hypothesis. This decision is based on the strength of the evidence and the chosen level of significance. Ultimately, the choice between the null and alternative hypotheses depends on the specific research question and the direction of the effect being investigated.

FAQs on Null Hypothesis

What does null hypothesis stands for.

The null hypothesis, denoted as H 0 ​, is a fundamental concept in statistics used for hypothesis testing. It represents the statement that there is no effect or no difference, and it is the hypothesis that the researcher typically aims to provide evidence against.

How to Form a Null Hypothesis?

A null hypothesis is formed based on the assumption that there is no significant difference or effect between the groups being compared or no association between variables being tested. It often involves stating that there is no relationship, no change, or no effect in the population being studied.

When Do we reject the Null Hypothesis?

In statistical hypothesis testing, if the p-value (the probability of obtaining the observed results) is lower than the chosen significance level (commonly 0.05), we reject the null hypothesis. This suggests that the data provides enough evidence to refute the assumption made in the null hypothesis.

What is a Null Hypothesis in Research?

In research, the null hypothesis represents the default assumption or position that there is no significant difference or effect. Researchers often try to test this hypothesis by collecting data and performing statistical analyses to see if the observed results contradict the assumption.

What Are Alternative and Null Hypotheses?

The null hypothesis (H0) is the default assumption that there is no significant difference or effect. The alternative hypothesis (H1 or Ha) is the opposite, suggesting there is a significant difference, effect or relationship.

What Does it Mean to Reject the Null Hypothesis?

Rejecting the null hypothesis implies that there is enough evidence in the data to support the alternative hypothesis. In simpler terms, it suggests that there might be a significant difference, effect or relationship between the groups or variables being studied.

How to Find Null Hypothesis?

Formulating a null hypothesis often involves considering the research question and assuming that no difference or effect exists. It should be a statement that can be tested through data collection and statistical analysis, typically stating no relationship or no change between variables or groups.

How is Null Hypothesis denoted?

The null hypothesis is commonly symbolized as H 0 in statistical notation.

What is the Purpose of the Null hypothesis in Statistical Analysis?

The null hypothesis serves as a starting point for hypothesis testing, enabling researchers to assess if there’s enough evidence to reject it in favor of an alternative hypothesis.

What happens if we Reject the Null hypothesis?

Rejecting the null hypothesis implies that there is sufficient evidence to support an alternative hypothesis, suggesting a significant effect or relationship between variables.

What are Test for Null Hypothesis?

Various statistical tests, such as t-tests or chi-square tests, are employed to evaluate the validity of the Null Hypothesis in different scenarios.

Please Login to comment...

Similar reads.

  • Geeks Premier League 2023
  • Math-Concepts
  • Geeks Premier League
  • School Learning

Improve your Coding Skills with Practice

 alt=

What kind of Experience do you want to share?

Library homepage

  • school Campus Bookshelves
  • menu_book Bookshelves
  • perm_media Learning Objects
  • login Login
  • how_to_reg Request Instructor Account
  • hub Instructor Commons

Margin Size

  • Download Page (PDF)
  • Download Full Book (PDF)
  • Periodic Table
  • Physics Constants
  • Scientific Calculator
  • Reference & Cite
  • Tools expand_more
  • Readability

selected template will load here

This action is not available.

Mathematics LibreTexts

7.1: Null and Alternative Hypotheses

  • Last updated
  • Save as PDF
  • Page ID 21577

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)

\( \newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\)

( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\)

\( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)

\( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\)

\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)

\( \newcommand{\Span}{\mathrm{span}}\)

\( \newcommand{\id}{\mathrm{id}}\)

\( \newcommand{\kernel}{\mathrm{null}\,}\)

\( \newcommand{\range}{\mathrm{range}\,}\)

\( \newcommand{\RealPart}{\mathrm{Re}}\)

\( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)

\( \newcommand{\Argument}{\mathrm{Arg}}\)

\( \newcommand{\norm}[1]{\| #1 \|}\)

\( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\AA}{\unicode[.8,0]{x212B}}\)

\( \newcommand{\vectorA}[1]{\vec{#1}}      % arrow\)

\( \newcommand{\vectorAt}[1]{\vec{\text{#1}}}      % arrow\)

\( \newcommand{\vectorB}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vectorC}[1]{\textbf{#1}} \)

\( \newcommand{\vectorD}[1]{\overrightarrow{#1}} \)

\( \newcommand{\vectorDt}[1]{\overrightarrow{\text{#1}}} \)

\( \newcommand{\vectE}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash{\mathbf {#1}}}} \)

The actual test begins by considering two hypotheses. They are called the null hypothesis and the alternative hypothesis. These hypotheses contain opposing viewpoints.

  • The null hypothesis (\(H_{0}\)) is a statement about the population that either is believed to be true or is used to put forth an argument unless it can be shown to be incorrect beyond a reasonable doubt.
  • The alternative hypothesis (\(H_{a}\)) is a claim about the population that is contradictory to \(H_{0}\) and what we conclude when we reject \(H_{0}\).

Since the null and alternative hypotheses are contradictory, you must examine evidence to decide if you have enough evidence to reject the null hypothesis or not. The evidence is in the form of sample data. After you have determined which hypothesis the sample supports, you make a decision. There are two options for a decision. They are "reject \(H_{0}\)" if the sample information favors the alternative hypothesis or "do not reject \(H_{0}\)" or "decline to reject \(H_{0}\)" if the sample information is insufficient to reject the null hypothesis.

\(H_{0}\) always has a symbol with an equal in it. \(H_{a}\) never has a symbol with an equal in it. The choice of symbol depends on the wording of the hypothesis test. However, be aware that many researchers (including one of the co-authors in research work) use = in the null hypothesis, even with > or < as the symbol in the alternative hypothesis. This practice is acceptable because we only make the decision to reject or not reject the null hypothesis.

Example \(\PageIndex{1}\)

  • \(H_{0}\): No more than 30% of the registered voters in Santa Clara County voted in the primary election. \(p \leq 30\)
  • \(H_{a}\): More than 30% of the registered voters in Santa Clara County voted in the primary election. \(p > 30\)

Exercise \(\PageIndex{1}\)

A medical trial is conducted to test whether or not a new medicine reduces cholesterol by 25%. State the null and alternative hypotheses.

  • \(H_{0}\): The drug reduces cholesterol by 25%. \(p = 0.25\)
  • \(H_{a}\): The drug does not reduce cholesterol by 25%. \(p \neq 0.25\)

Example \(\PageIndex{2}\)

We want to test whether the mean GPA of students in American colleges is different from 2.0 (out of 4.0). The null and alternative hypotheses are:

  • \(H_{0}: \mu = 2.0\)
  • \(H_{a}: \mu \neq 2.0\)

Exercise \(\PageIndex{2}\)

We want to test whether the mean height of eighth graders is 66 inches. State the null and alternative hypotheses. Fill in the correct symbol \((=, \neq, \geq, <, \leq, >)\) for the null and alternative hypotheses.

  • \(H_{0}: \mu_ 66\)
  • \(H_{a}: \mu_ 66\)
  • \(H_{0}: \mu = 66\)
  • \(H_{a}: \mu \neq 66\)

Example \(\PageIndex{3}\)

We want to test if college students take less than five years to graduate from college, on the average. The null and alternative hypotheses are:

  • \(H_{0}: \mu \geq 66\)
  • \(H_{a}: \mu < 66\)

Exercise \(\PageIndex{3}\)

We want to test if it takes fewer than 45 minutes to teach a lesson plan. State the null and alternative hypotheses. Fill in the correct symbol ( =, ≠, ≥, <, ≤, >) for the null and alternative hypotheses.

  • \(H_{0}: \mu_ 45\)
  • \(H_{a}: \mu_ 45\)
  • \(H_{0}: \mu \geq 45\)
  • \(H_{a}: \mu < 45\)

Example \(\PageIndex{4}\)

In an issue of U. S. News and World Report , an article on school standards stated that about half of all students in France, Germany, and Israel take advanced placement exams and a third pass. The same article stated that 6.6% of U.S. students take advanced placement exams and 4.4% pass. Test if the percentage of U.S. students who take advanced placement exams is more than 6.6%. State the null and alternative hypotheses.

  • \(H_{0}: p \leq 0.066\)
  • \(H_{a}: p > 0.066\)

Exercise \(\PageIndex{4}\)

On a state driver’s test, about 40% pass the test on the first try. We want to test if more than 40% pass on the first try. Fill in the correct symbol (\(=, \neq, \geq, <, \leq, >\)) for the null and alternative hypotheses.

  • \(H_{0}: p_ 0.40\)
  • \(H_{a}: p_ 0.40\)
  • \(H_{0}: p = 0.40\)
  • \(H_{a}: p > 0.40\)

COLLABORATIVE EXERCISE

Bring to class a newspaper, some news magazines, and some Internet articles . In groups, find articles from which your group can write null and alternative hypotheses. Discuss your hypotheses with the rest of the class.

Chapter Review

In a hypothesis test , sample data is evaluated in order to arrive at a decision about some type of claim. If certain conditions about the sample are satisfied, then the claim can be evaluated for a population. In a hypothesis test, we:

  • Evaluate the null hypothesis , typically denoted with \(H_{0}\). The null is not rejected unless the hypothesis test shows otherwise. The null statement must always contain some form of equality \((=, \leq \text{or} \geq)\)
  • Always write the alternative hypothesis , typically denoted with \(H_{a}\) or \(H_{1}\), using less than, greater than, or not equals symbols, i.e., \((\neq, >, \text{or} <)\).
  • If we reject the null hypothesis, then we can assume there is enough evidence to support the alternative hypothesis.
  • Never state that a claim is proven true or false. Keep in mind the underlying fact that hypothesis testing is based on probability laws; therefore, we can talk only in terms of non-absolute certainties.

Formula Review

\(H_{0}\) and \(H_{a}\) are contradictory.

  • If \(\alpha \leq p\)-value, then do not reject \(H_{0}\).
  • If\(\alpha > p\)-value, then reject \(H_{0}\).

\(\alpha\) is preconceived. Its value is set before the hypothesis test starts. The \(p\)-value is calculated from the data.References

Data from the National Institute of Mental Health. Available online at http://www.nimh.nih.gov/publicat/depression.cfm .

Contributors

  • Template:ContribOpenStax

Miles D. Williams

Logo

Visiting Assistant Professor | Denison University

  • Download My CV
  • Send Me an Email
  • View My LinkedIn
  • Follow Me on Twitter

When the Research Hypothesis Is the Null

Posted on May 13, 2024 by Miles Williams in Methods   Statistics  

Back to Blog

What should you do if your research hypothesis is the null hypothesis? In other words, how should you approach hypothesis testing if your theory predicts no effect between two variables? I and a coauthor are working on a paper where a couple of our proposed hypotheses look like this, and we got some push-back from a reviewer about it. This prompted me to go down a rabbit hole of journal articles and message boards to see how others handle this situation. I quickly found that I waded into a contentious issue that’s connected to a bigger philosophical debate about the merits of hypothesis testing in general and whether the null hypothesis in particular as a bench-mark for hypothesis testing is even logically sound.

There’s too much to unpack with this debate for me to cover in a single blog post (and I’m sure I’d get some of the key points wrong anyway if I tried). The main issue I want to explore in this post is the practical problem of how to approach testing a null research hypothesis. From an applied perspective, this is a tricky problem that raises issues with how we calculate and interpret p-values. Thankfully, there is a sound solution for the null research hypothesis which I explore in greater detail below. It’s called a two one-sided test, and it’s easy to implement once you know what it is.

The usual approach

Most of the time when doing research, a scientist usually has a research hypothesis that goes something like X has a positive effect on Y . For example, a political scientist might propose that a get-out-the-vote (GOTV) campaign ( X ) will increase voter turnout ( Y ).

The typical approach for testing this claim might be to estimate a regression model with voter turnout as the outcome and the GOTV campaign as the explanatory variable of interest:

Y = α + β X + ε

If the parameter β > 0, this would support the hypothesis that GOTV campaigns improve voter turnout. To test this hypothesis, in practice the researcher would actually test a different hypothesis that we call the null hypothesis. This is the hypothesis that says there is no true effect of GOTV campaigns on voter turnout.

By proposing and testing the null, we now have a point of reference for calculating a measure of uncertainty—that is, the probability of observing an empirical effect of a certain magnitude or greater if the null hypothesis is true. This probability is called a p-value, and by convention if it is less than 0.05 we say that we can reject the null hypothesis.

For the hypothetical regression model proposed above, to get this p-value we’d estimate β, then calculate its standard error, and then we’d take the ratio of the former to the latter giving us what’s called a t-statistic or t-value. Under the null hypothesis, the t-value has a known distribution which makes it really easy to map any t-value to a p-value. The below figure illustrates using a hypothetical data sample of size N = 200. You can see that the t-statistic’s distribution has a distinct bell shape centered around 0. You can also see the range of t-values in blue where if we observed them in our empirical data we’d fail to reject the null hypothesis at the p < 0.05 level. Values in gray are t-values that would lead us to reject the null hypothesis at this same level.

the null hypothesis stands

When the null is the research hypothesis we want to test

There’s nothing new or special here. If you have even a basic stats background (particularly with Frequentist statistics), the conventional approach to hypothesis testing is pretty ubiquitous. Things get more tricky when our research hypothesis is that there is no effect. Say for a certain set of theoretical reasons we think that GOTV campaigns are basically useless at increasing voter turnout. If this argument is true, then if we estimate the following regression model, we’d expect β = 0.

The problem here is that our substantive research hypothesis is also the one that we want to try to find evidence against. We could just proceed like usual and just say that if we fail to reject the null this is evidence in support of our theory, but the problem with doing this is that failure to reject the null is not the same thing as finding support for the null hypothesis.

There are a few ideas in the literature for how we should approach this instead. Many of these approaches are Bayesian, but most of my research relies on Frequentist statistics, so these approaches were a no-go for me. However, there is one really simple approach that is consistent with the Frequentist paradigm: equivalence testing . The idea is simple. Propose some absolute effect size that is of minimal interest and then test whether an observed effect is different from it. This minimum effect is called the “smallest effect size of interest” (SESOI). I read about the approach in an article by Harms and Lakens (2018) in the Journal of Clinical and Translational Research .

Say, for example, that we deemed a t-value of +/-1.96 (the usual threshold for rejecting the null hypothesis) as extreme enough to constitute good evidence of a non-zero effect. We could make the appropriate adjustments to our t-distribution to identify a new range of t-values that would allow us to reject the hypothesis that an effect is non-zero. This is illustrated in the below figure. We can now see a range of t-values in the middle where we’d have t-values such that we could reject the non-zero hypothesis at the p < 0.05 level. This distribution looks like it’s been inverted relative to the usual null distribution. The reason is that with this approach what we’re doing is conducting a pair of alternative one-tailed tests. We’re testing both the hypothesis that β / se(β) - 1.96 > 0 and β / se(β) + 1.96 < 0. In the Harms and Lakens paper cited above, they call this approach two one-sided tests or TOST (I’m guessing this is pronounced “toast”).

the null hypothesis stands

Something to pay attention to with this approach is that the observed t-statistic needs to be very small in absolute magnitude for us to reject the hypothesis of a non-zero effect. This means that the bar for testing a null research hypothesis is actually quite high. This is demonstrated using the following simulation in R. Using the {seerrr} package, I had R generate 1,000 random draws (each of size 200) for a pair of variables x and y where the former is a binary “treatment” and the latter is a random normal “outcome.” By design, there is no true causal relationship between these variables. Once I simulated the data, I then generated a set of estimates of the effect of x on y for each simulated dataset and collected the results in an object called sim_ests . I then visualized two metrics that that I calculated with the simulated results: (1) the rejection rate for the null hypothesis test and (2) the rejection rate for the two one-sided equivalence tests. As you can see, if we were to try to test a research null hypothesis the usual way, we’d expect to be able to fail to reject the null about 95% of the time. Conversely, if we were to use the two one-sided equivalence tests, we’d expect to reject the non-zero alternative hypothesis only about 25% of the time. I tested out a few additional simulations to see if a larger sample size would lead to improvements in power (not shown), but no dice.

the null hypothesis stands

The two one-sided tests approach strikes me as a nice method when dealing with a null research hypothesis. It’s actually pretty easy to implement, too. The one downside is that this test is under-powered. If the null is true, it will only reject the alternative 25% of the time (though you could select a different non-zero alternative which would possibly give you more power). However, this isn’t all bad. The flip side of the coin is that this is a really conservative test, so if you can reject the alternative that puts you on solid rhetorical footing to show the data really do seem consistent with the null.

  • Open access
  • Published: 13 May 2024

SCIPAC: quantitative estimation of cell-phenotype associations

  • Dailin Gan 1 ,
  • Yini Zhu 2 ,
  • Xin Lu 2 , 3 &
  • Jun Li   ORCID: orcid.org/0000-0003-4353-5761 1  

Genome Biology volume  25 , Article number:  119 ( 2024 ) Cite this article

175 Accesses

2 Altmetric

Metrics details

Numerous algorithms have been proposed to identify cell types in single-cell RNA sequencing data, yet a fundamental problem remains: determining associations between cells and phenotypes such as cancer. We develop SCIPAC, the first algorithm that quantitatively estimates the association between each cell in single-cell data and a phenotype. SCIPAC also provides a p -value for each association and applies to data with virtually any type of phenotype. We demonstrate SCIPAC’s accuracy in simulated data. On four real cancerous or noncancerous datasets, insights from SCIPAC help interpret the data and generate new hypotheses. SCIPAC requires minimum tuning and is computationally very fast.

Single-cell RNA sequencing (scRNA-seq) technologies are revolutionizing biomedical research by providing comprehensive characterizations of diverse cell populations in heterogeneous tissues [ 1 , 2 ]. Unlike bulk RNA sequencing (RNA-seq), which measures the average expression profile of the whole tissue, scRNA-seq gives the expression profiles of thousands of individual cells in the tissue [ 3 , 4 , 5 , 6 , 7 ]. Based on this rich data, cell types may be discovered/determined in an unsupervised (e.g., [ 8 , 9 ]), semi-supervised (e.g., [ 10 , 11 , 12 , 13 ]), or supervised manner (e.g., [ 14 , 15 , 16 ]). Despite the fast development, there are still limitations with scRNA-seq technologies. Notably, the cost for each scRNA-seq experiment is still high; as a result, most scRNA-seq data are from a single or a few biological samples/tissues. Very few datasets consist of large numbers of samples with different phenotypes, e.g., cancer vs. normal. This places great difficulties in determining how a cell type contributes to a phenotype based on single-cell studies (especially if the cell type is discovered in a completely unsupervised manner or if people have limited knowledge of this cell type). For example, without having single-cell data from multiple cancer patients and multiple normal controls, it could be hard to computationally infer whether a cell type may promote or inhibit cancer development. However, such association can be critical for cancer research [ 17 ], disease diagnosis [ 18 ], cell-type targeted therapy development [ 19 ], etc.

Fortunately, this difficulty may be overcome by borrowing information from bulk RNA-seq data. Over the past decade, a considerable amount of bulk RNA-seq data from a large number of samples with different phenotypes have been accumulated and made available through databases like The Cancer Genome Atlas (TCGA) [ 20 ] and cBioPortal [ 21 , 22 ]. Data in these databases often contain comprehensive patient phenotype information, such as cancer status, cancer stages, survival status and time, and tumor metastasis. Combining single-cell data from a single or a few individuals and bulk data from a relatively large number of individuals regarding a particular phenotype can be a cost-effective way to determine how a cell type contributes to the phenotype. A recent method Scissor [ 23 ] took an essential step in this direction. It uses single-cell and bulk RNA-seq data with phenotype information to classify the cells into three discrete categories: Scissor+, Scissor−, and null cells, corresponding to cells that are positively associated, negatively associated, and not associated with the phenotype.

Here, we present a method that takes another big step in this direction, which is called Single-Cell and bulk data-based Identifier for Phenotype Associated Cells or SCIPAC for short. SCIPAC enables quantitative estimation of the strength of association between each cell in a scRNA-seq data and a phenotype, with the help of bulk RNA-seq data with phenotype information. Moreover, SCIPAC also enables the estimation of the statistical significance of the association. That is, it gives a p -value for the association between each cell and the phenotype. Furthermore, SCIPAC enables the estimation of association between cells and an ordinal phenotype (e.g., different stages of cancer), which could be informative as people may not only be interested in the emergence/existence of cancer (cancer vs. healthy, a binary problem) but also in the progression of cancer (different stages of cancer, an ordinal problem).

To study the performance of SCIPAC, we first apply SCIPAC to simulated data under three schemes. SCIPAC shows high accuracy with low false positive rates. We further show the broad applicability of SCIPAC on real datasets across various diseases, including prostate cancer, breast cancer, lung cancer, and muscular dystrophy. The association inferred by SCIPAC is highly informative. In real datasets, some cell types have definite and well-studied functions, while others are less well-understood: their functions may be disease-dependent or tissue-dependent, and they may contain different sub-types with distinct functions. In the former case, SCIPAC’s results agree with current biological knowledge. In the latter case, SCIPAC’s discoveries inspire the generation of new hypotheses regarding the roles and functions of cells under different conditions.

An overview of the SCIPAC algorithm

SCIPAC is a computational method that identifies cells in single-cell data that are associated with a given phenotype. This phenotype can be binary (e.g., cancer vs. normal), ordinal (e.g., cancer stage), continuous (e.g., quantitative traits), or survival (i.e., survival time and status). SCIPAC uses input data consisting of three parts: single-cell RNA-seq data that measures the expression of p genes in m cells, bulk RNA-seq data that measures the expression of the same set of p genes in n samples/tissues, and the statuses/values of the phenotype of the n bulk samples/tissues. The output of SCIPAC is the strength and the p -value of the association between each cell and the phenotype.

SCIPAC proposes the following definition of “association” between a cell and a phenotype: A group of cells that are likely to play a similar role in the phenotype (such as cells of a specific cell type or sub-type, cells in a particular state, cells in a cluster, cells with similar expression profiles, or cells with similar functions) is considered to be positively/negatively associated with a phenotype if an increase in their proportion within the tissue likely indicates an increased/decreased probability of the phenotype’s presence. SCIPAC assigns the same association to all cells within such a group. Taking cancer as the phenotype as an example, if increasing the proportion of a cell type indicates a higher chance of having cancer (binary), having a higher cancer stage (ordinal), or a higher hazard rate (survival), all cells in this cell type is positively associated with cancer.

The algorithm of SCIPAC follows the following four steps. First, the cells in the single-cell data are grouped into clusters according to their expression profiles. The Louvain algorithm from the Seurat package [ 24 , 25 ] is used as the default clustering algorithm, but the user may choose any clustering algorithm they prefer. Or if information of the cell types or other groupings of cells is available a priori, it may be supplied to SCIPAC as the cell clusters, and this clustering step can be skipped. In the second step, a regression model is learned from bulk gene expression data with the phenotype. Depending on the type of the phenotype, this model can be logistic regression, ordinary linear regression, proportional odds model, or Cox proportional hazards model. To achieve a higher prediction power with less variance, by default, the elastic net (a blender of Lasso and ridge regression [ 26 ]) is used to fit the model. In the third step, SCIPAC computes the association strength \(\Lambda\) between each cell cluster and the phenotype based on a mathematical formula that we derive. Finally, the p -values are computed. The association strength and its p -value between a cell cluster and the phenotype are given to all cells in the cluster.

SCIPAC requires minimum tuning. When the cell-type information is given in step 1, SCIPAC does not have any (hyper)parameter. Otherwise, the Louvain algorithm used in step 1 has a “resolution” parameter that controls the number of cell clusters: a larger resolution results in more clusters. SCIPAC inherits this parameter as its only parameter. Since SCIPAC gives the same association strength and p -value to cells from the same cluster, this parameter also determines the resolution of results provided by SCIPAC. Thus, we still call it “resolution” in SCIPAC. Because of its meaning, we recommend setting it so that the number of cell clusters given by the clustering algorithm is comparable to, or reasonably larger than, the number of cell types (or sub-types) in the data. We will see that the performance of SCIPAC is insensitive to this resolution parameter, and the default value 2.0 typically works well.

The details of the SCIPAC algorithm are given in the “ Methods ” section.

Performance in simulated data

We assess the performance of SCIPAC in simulated data under three different schemes. The first scheme is simple and consists of only three cell types. The second scheme is more complicated and consists of seven cell types, which better imitates actual scRNA-seq data. In the third scheme, we simulate cells under different cell development stages to test the performance of SCIPAC under an ordinal phenotype. Details of the simulation are given in Additional file 1.

Simulation scheme I

Under this scheme, the single-cell data consists of three cell types: one is positively associated with the phenotype, one is negatively associated, and the third is not associated (we call it “null”). Figure 1 a gives the UMAP [ 27 ] plot of the three cell types, and Fig. 1 b gives the true associations of these three cell types with the phenotype, with red, blue, and light gray denoting positive, negative, and null associations.

figure 1

UMAP visualization and numeric measures of the simulated data under scheme I. All the plots in a–e  are scatterplots of the two dimensional single-cell data given by UMAP. The x and y axes represent the two dimensions, and their scales are not shown as their specific values are not directly relevant. Points in the plots represents single cells, and they are colored differently in each subplot to reflect different information/results. a  Cell types. b  True associations. The association between cell types 1, 2, and 3 and the phenotype are positive, negative, and null, respectively. c  Association strengths \(\Lambda\) given by SCIPAC under different resolutions. Red/blue represents the sign of \(\Lambda\) , and the shade gives the absolute value of \(\Lambda\) . Every cell is colored red or blue since no \(\Lambda\) is exactly zero. Below each subplot, Res stands for resolution, and K stands for the number of cell clusters given by this resolution. d   p -values given by SCIPAC. Only cells with p -value \(< 0.05\) are colored red (positive association) or blue (negative association); others are colored white. e  Results given by Scissor under different \(\alpha\) values. Red, blue, and light gray stands for Scissor+, Scissor−, and background (i.e., null) cells. f  F1 scores and g  FSC for SCIPAC and Scissor under different parameter values. For SCIPAC, each bar is the value under a resolution/number of clusters. For Scissor, each bar is the value under an \(\alpha\)

We apply SCIPAC to the simulated data. For the resolution parameter (see the “ Methods ” section), values 0.5, 1.0, and 1.5 give 3, 4, and 4 clusters, respectively, close to the actual number of cell types. They are good choices based on the guidance for choosing this parameter. To show how SCIPAC behaves under parameter misspecification, we also set the resolution up to 4.0, which gives a whopping 61 clusters. Figure 1 c and d give the association strengths \(\Lambda\) and the p -values given by four different resolutions (results under other resolutions are provided in Additional file 1: Fig. S1 and S2). In Fig. 1 c, red and blue denote positive and negative associations, respectively, and the shade of the color represents the strength of the association, i.e., the absolute value of \(\Lambda\) . Every cell is colored blue or red, as none of \(\Lambda\) is exactly zero. In Fig. 1 d, red and blue denote positive and negative associations that are statistically significant ( p -value \(< 0.05\) ). Cells whose associations are not statistically significant ( p -value \(\ge 0.05\) ) are shown in white. To avoid confusion, it is worth repeating that cells that are colored in red/blue in Fig. 1 c are shown in red/blue in Fig. 1 d only if they are statistically significant; otherwise, they are colored white in Fig. 1 d.

From Fig. 1 c, d (as well as Additional file 1: Fig. S1 and S2), it is clear that the results of SCIPAC are highly consistent under different resolution values, including both the estimated association strengths and the p -values. It is also clear that SCIPAC is highly accurate: most truly associated cells are identified as significant, and most, if not all, truly null cells are identified as null.

As the first algorithm that quantitatively estimates the association strength and the first algorithm that gives the p -value of the association, SCIPAC does not have a real competitor. A previous algorithm, Scissor, is able to classify cells into three discrete categories according to their associations with the phenotype. So, we compare SCIPAC with Scissor in respect of the ability to differentiate positively associated, negatively associated, and null cells.

Running Scissor requires tuning a parameter called \(\alpha\) , which is a number between 0 and 1 that balances the amount of regularization for the smoothness and for the sparsity of the associations. The Scissor R package does not provide a default value for this \(\alpha\) or a function to help select this value. The Scissor paper suggests the following criterion: “the number of Scissor-selected cells should not exceed a certain percentage of total cells (default 20%) in the single-cell data. In each experiment, a search on the above searching list is performed from the smallest to the largest until a value of \(\alpha\) meets the above criteria.” In practice, we have found that this criterion does not often work properly, as the truly associated cells may not compose 20% of all cells in actual data. Therefore, instead of setting \(\alpha\) to any particular value, we set \(\alpha\) values that span the whole range of \(\alpha\) to see the best possible performance of Scissor.

The performance of Scissor in our simulation data under four different \(\alpha\) values are shown in Fig. 1 e, and results under more \(\alpha\) values are shown in Additional file 1: Fig. S3. In the figures, red, blue, and light gray denote Scissor+, Scissor−, and null (called “background” in Scissor) cells, respectively. The results of Scissor have several characteristics different from SCIPAC. First, Scissor does not give the strength or statistical significance of the association, and thus the colors of the cells in the figures do not have different shades. Second, different \(\alpha\) values give very different results. Greater \(\alpha\) values generally give fewer Scissor+ and Scissor− cells, but there are additional complexities. One complexity is that the Scissor+ (or Scissor−) cells under a greater \(\alpha\) value are not a strict subset of Scissor+ (or Scissor−) cells under a smaller \(\alpha\) value. For example, the number of truly negatively associated cells detected as Scissor− increases when \(\alpha\) increases from 0.01 to 0.30. Another complexity is that the direction of the association may flip as \(\alpha\) increases. For example, most cells of cell type 2 are identified as Scissor+ under \(\alpha =0.01\) , but many of them are identified as Scissor− under larger \(\alpha\) values. Third, Scissor does not achieve high power and low false-positive rate at the same time under any \(\alpha\) . No matter what the \(\alpha\) value is, there is only a small proportion of cells from cell type 2 that are correctly identified as negatively associated, and there is always a non-negligible proportion of null cells (i.e., cells from cell type 3) that are incorrectly identified as positively or negatively associated. Fourth, Scissor+ and Scissor− cells can be close to each other in the figure, even under a large \(\alpha\) value. This means that cells with nearly identical expression profiles are detected to be associated with the phenotype in opposite directions, which can place difficulties in interpreting the results.

SCIPAC overcomes the difficulties of Scissor and gives results that are more informative (quantitative strengths with p -values), more accurate (both high power and low false-positive rate), less sensitive to the tuning parameter, and easier to interpret (cells with similar expression typically have similar associations to the phenotype).

SCIPAC’s higher accuracy in differentiating positively associated, negatively associated, and null cells than Scissors can also be measured numerically using the F1 score and the fraction of sign correctness (FSC). F1, which is the harmonic mean of precision and recall, is a commonly used measure of calling accuracy. Note that precision and recall are only defined for two-class problems, which try to classify desired signals/discoveries (so-called “positives”) against noises/trivial results (so-called “negatives”). Our case, on the other hand, is a three-class problem: positive association, negative association, and null. To compute F1, we combine the positive and negative associations and treat them as “positives,” and treat null as “negatives.” This F1 score ignores the direction of the association; thus, it alone is not enough to describe the performance of an association-detection algorithm. For example, an algorithm may have a perfect F1 score even if it incorrectly calls all negative associations positive. To measure an algorithm’s ability to determine the direction of the association, we propose a statistic called FSC, defined as the fraction of true discoveries that also have the correct direction of the association. The F1 score and FSC are numbers between 0 and 1, and higher values are preferred. A mathematical definition of these two measures is given in Additional file 1.

Figure 1 f, g show the F1 score and FSC of SCIPAC and Scissor under different values of tuning parameters. The F1 score of Scissor is between 0.2 and 0.7 under different \(\alpha\) ’s. The FSC of Scissor increases from around 0.5 to nearly 1 as \(\alpha\) increases, but Scissor does not achieve high F1 and FSC scores at the same time under any \(\alpha\) . On the other hand, the F1 score of SCIPAC is close to perfection when the resolution parameter is properly set, and it is still above 0.90 even if the resolution parameter is set too large. The FSC of SCIPAC is always above 0.96 under different resolutions. That is, SCIPAC achieves high F1 and FSC scores simultaneously under a wide range of resolutions, representing a much higher accuracy than Scissor.

Simulation scheme II

This more complicated simulation scheme has seven cell types, which are shown in Fig. 2 a. As shown in Fig. 2 b, cell types 1 and 3 are negatively associated (colored blue), 2 and 4 are positively associated (colored red), and 5, 6, and 7 are not associated (colored light gray).

figure 2

UMAP visualization of the simulated data under a–g  scheme II and h–k  scheme III. a  Cell types. b  True associations. c , d  Association strengths \(\Lambda\) and p -values given by SCIPAC under the default resolution. e  Results given by Scissor under different \(\alpha\) values. f  F1 scores and g  FSC for SCIPAC and Scissor under different parameter values. h  Cell differentiation paths. The four paths have the same starting location, which is in the center, but different ending locations. This can be considered as a progenitor cell type differentiating into four specialized cell types. i  Cell differentiation steps. These steps are used to create four stages, each containing 500 steps. Thus, this plot of differentiation steps can also be viewed as the plot of true association strengths. j , k  Association strengths \(\Lambda\) and p -values given by SCIPAC under the default resolution

The association strengths and p -values given by SCIPAC under the default resolution are illustrated in Fig. 2 c, d, respectively. Results under several other resolutions are given in Additional file 1: Fig. S4 and S5. Again, we find that SCIPAC gives highly consistent results under different resolutions. SCIPAC successfully identifies three out of the four truly associated cell types. For the other truly associated cell type, cell type 1, SCIPAC correctly recognizes its association with the phenotype as negative, although the p -values are not significant enough. The F1 score is 0.85, and the FSC is greater than 0.99, as shown in Fig. 2 f, g.

The results of Scissor under four different \(\alpha\) values are given in Fig. 2 e. (More shown in Additional file 1: Fig. S6.) Under this highly challenging simulation scheme, Scissor can only identify one out of four truly associated cell types. Its F1 score is below 0.4.

Simulation scheme III

This simulation scheme is to assess the performance of SCIPAC for ordinal phenotypes. We simulate cells along four cell-differentiation paths with the same starting location but different ending locations, as shown in Fig. 2 h. These cells can be considered as a progenitor cell population differentiating into four specialized cell types. In Fig. 2 i, the “step” reflects their position in the differentiation path, with step 0 meaning the start and step 2000 meaning the end of the differentiation. Then, the “stage” is generated according to the step: cells in steps 0 \(\sim\) 500, 501 \(\sim\) 1000, 1001 \(\sim\) 1500, and 1501 \(\sim\) 2000 are assigned to stages I, II, III, and IV, respectively. This stage is treated as the ordinal phenotype. Under this simulation scheme, Fig. 2 i also gives the actual associations, and all cells are associated with the phenotype.

The results of SCIPAC under the default resolution are shown in Fig. 2 j, k. Clearly, the associations SCIPAC identifies are highly consistent with the truth. Particularly, it successfully identifies the cells in the center as early-stage cells and most cells at the end of branches as last-stage cells. The results of SCIPAC under other resolutions are given in Additional file 1: Fig. S7 and S8, which are highly consistent. Scissor does not work with ordinal phenotypes; thus, no results are reported here.

Performance in real data

We consider four real datasets: a prostate cancer dataset, a breast cancer dataset, a lung cancer dataset, and a muscular dystrophy dataset. The bulk RNA-seq data of the three cancer datasets are obtained from the TCGA database, and that of the muscular dystrophy dataset is obtained from a published paper [ 28 ]. A detailed description of these datasets is given in Additional file 1. We will use these datasets to assess the performance of SCIPAC on different types of phenotypes. The cell type information (i.e., which cell belongs to which cell type) is available for the first three datasets, but we ignore this information so that we can make a fair comparison with Scissor, which cannot utilize this information.

Prostate cancer data with a binary phenotype

We use the single-cell expression of 8,700 cells from prostate-cancer tumors sequenced by [ 29 ]. The cell types of these cells are known and given in Fig. 3 a. The bulk data comprises 550 TCGA-PRAD (prostate adenocarcinoma) samples with phenotype (cancer vs. normal) information. Here the phenotype is cancer, and it is binary: present or absent.

figure 3

UMAP visualization of the prostate cancer data, with a zoom-in view for the red-circled region (cell type MNP). a  True cell types. BE, HE, and CE stand for basal, hillock, club epithelial cells, LE-KLK3 and LE-KLK4 stand for luminal epithelial cells with high levels of kallikrein related peptidase 3 and 4, and MNP stands for mononuclear phagocytes. In the zoom-in view, the sub-types of MNP cells are given. b  Association strengths \(\Lambda\) given by SCIPAC under the default resolution. The cyan-circled cells are B cells, which are estimated by SCIPAC as negatively associated with cancer but estimated by Scissor as Scissor+ or null. c   p -values given by SCIPAC. The MNP cell type, which is red-circled in the plot, is estimated by SCIPAC to be strongly negatively associated with cancer but estimated by Scissor to be positively associated with cancer. d  Results given by Scissor under different \(\alpha\) values

Results from SCIPAC with the default resolution are shown in Fig. 3 b, c (results with other resolutions, given in Additional file 1: Fig. S9 and S10, are highly consistent with results here.) Compared with results from Scissor, shown in Fig. 3 d, results from SCIPAC again show three advantages. First, results from SCIPAC are richer and more comprehensive. SCIPAC gives estimated associations and the corresponding p -values, and the estimated associations are quantitative (shown in Fig. 3 b as different shades to the red or blue color) instead of discrete (shown in Fig. 3 d as a uniform shade to the red, blue, or light gray color). Second, SCIPAC’s results can be easier to interpret as the red and blue colors are more block-wise instead of scattered. Third, unlike Scissor, which produces multiple sets of results varying based on the parameter \(\alpha\) —a parameter without a default value or tuning guidance—typically, a single set of results from SCIPAC under its default settings suffices.

Comparing the results from our SCIPAC method with those from Scissor is non-trivial, as the latter’s outcomes are scattered and include multiple sets. We propose the following solutions to summarize the inferred association of a known cell type with the phenotype using a specific method (Scissor under a specific \(\alpha\) value, or SCIPAC with the default setting). We first calculate the proportion of cells in this cell type identified as Scissor+ (by Scissor at a specific \(\alpha\) value) or as significantly positively associated (by SCIPAC), denoted by \(p_{+}\) . We also calculate the proportion of all cells, encompassing any cell type, which are identified as Scissor+ or significantly positively associated, serving as the average background strength, denoted by \(p_{a}\) . Then, we compute the log odds ratio for this cell type to be positively associated with the phenotype compared to the background, represented as:

Similarly, the log odds ratio for the cell type to be negatively associated with the phenotype, \(\rho _-\) , is computed in a parallel manner.

For SCIPAC, a cell type is summarized as positively associated with the phenotype if \(\rho _+ \ge 1\) and \(\rho _- < 1\)  and negatively associated if \(\rho _- \ge 1\) and \(\rho _+ < 1\) . If neither condition is met, the association is inconclusive. For Scissor, we apply it under six different \(\alpha\) values: 0.01, 0.05, 0.10, 0.15, 0.20, and 0.25. A cell type is summarized as positively associated with the phenotype if \(\rho _+ \ge 1\) and \(\rho _- < 1\) in at least four of these \(\alpha\) values and negatively associated if \(\rho _- \ge 1\) and \(\rho _+ < 1\) in at least four \(\alpha\) values. If these criteria are not met, the association is deemed inconclusive. The above computation of log odds ratios and the determination of associations are performed only on cell types that each compose at least 1% of the cell population, to ensure adequate power.

For the prostate cancer data, the log odds ratios for each cell type using each method are presented in Tables S1 and S2. The final associations determined for each cell type are summarized in Table S3. In the last column of this table, we also indicate whether the conclusions drawn from SCIPAC and Scissor are consistent or not.

We find that SCIPAC’s results agree with Scissor on most cell types. However, there are three exceptions: mononuclear phagocytes (MNPs), B cells, and LE-KLK4.

MNPs are red-circled and zoomed in in each sub-figure of Fig. 3 . Most cells in this cell type are colored red in Fig. 3 d but colored dark blue in Fig. 3 b. In other words, while Scissor determines that this cell type is Scissor+, SCIPAC makes the opposite inference. Moreover, SCIPAC is confident about its judgment by giving small p -values, as shown in Fig. 3 c. To see which inference is closer to the biological fact is not easy, as biologically MNPs contain a number of sub-types that each have different functions [ 30 , 31 ]. Fortunately, this cell population has been studied in detail in the original paper that generated this dataset [ 29 ], and the sub-type information of each cell is provided there: this MNP population contains six sub-types, which are dendritic cells (DC), M1 macrophages (Mac1), metallothionein-expressing macrophages (Mac-MT), M2 macrophages (Mac2), proliferating macrophages (Mac-cycling), and monocytes (Mono), as shown in the zoom-in view of Fig. 3 a. Among these six sub-types, DC, Mac1, and Mac-MT are believed to inhibit cancer development and can serve as targets in cancer immunotherapy [ 29 ]; they compose more than 60% of all MNP cells in this dataset. SCIPAC makes the correct inference on this majority of MNP cells. Another cell type, Mac2, is reported to promote tumor development [ 32 ], but it only composes less than \(15\%\) of the MNPs. How the other two cell types, Mac-cycling and Mono, are associated with cancer is less studied. Overall, the results given by SCIPAC are more consistent with the current biological knowledge.

B cells are cyan-circled in Fig. 3 b. B cells are generally believed to have anti-tumor activity by producing tumor-reactive antibodies and forming tertiary lymphoid structures [ 29 , 33 ]. This means that B cells are likely to be negatively associated with cancer. SCIPAC successfully identifies this negative association, while Scissor fails.

LE-KLK4, a subtype of cancer cells, is thought to be positively associated with the tumor phenotype [ 29 ]. SCIPAC successfully identified this positive association, in contrast to Scissor, which failed to do so (in the figure, a proportion of LE-KLK4 cells are identified as Scissor+, especially under the smallest \(\alpha\) value; however, this proportion is not significantly higher than the background Scissor+ level under the majority of \(\alpha\) values).

In summary, across all three cell types, the results from SCIPAC appear to be more consistent with current biological knowledge. For more discussions regarding this dataset, refer to Additional file 1.

Breast cancer data with an ordinal phenotype

The scRNA-seq data for breast cancer are from [ 34 ], and we use the 19,311 cells from the five HER2+ tumor tissues. The true cell types are shown in Fig. 4 a. The bulk data include 1215 TCGA-BRCA samples with information on the cancer stage (I, II, III, or IV), which is treated as an ordinal phenotype.

figure 4

UMAP visualization of the breast cancer data. a  True cell types. CAFs stand for cancer-associated fibroblasts, PB stands for plasmablasts and PVL stands for perivascular-like cells. b , c  Association strengths \(\Lambda\) and p -values given by SCIPAC under the default resolution. Cyan-circled are a group of T cells that are estimated by SCIPAC to be most significantly associated with the cancer stage in the negative direction, and orange-circled are a group of T cells that are estimated by SCIPAC to be significantly positively associated with the cancer stage. d  DE analysis of the cyan-circled T cells vs. all the other T cells. e  DE analysis of the cyan-circled T cells vs. all the other cells. f  Expression of CD8+ T cell marker genes in the cyan-circled cells and all the other cells. g  DE analysis of the orange-circled T cells vs. all the other cells. h  Expression of regulatory T cell marker genes in the orange-circled cells and all the other cells

Association strengths and p -values given by SCIPAC under the default resolution are shown in Fig. 4 b, c. Results under other resolutions are given in Additional file 1: Fig. S11 and S12, and again they are highly consistent with results under the default resolution. We do not present the results from Scissor, as Scissor does not take ordinal phenotypes.

In the SCIPAC results, cells that are most strongly and statistically significantly associated with the phenotype in the positive direction are the cancer-associated fibroblasts (CAFs). This finding agrees with the literature: CAFs contribute to therapy resistance and metastasis of cancer cells via the production of secreted factors and direct interaction with cancer cells [ 35 ], and they are also active players in breast cancer initiation and progression [ 36 , 37 , 38 , 39 ]. Another large group of cells identified as positively associated with the phenotype is the cancer epithelial cells. They are malignant cells in breast cancer tissues and are thus expected to be associated with severe cancer stages.

Of the cells identified as negatively associated with severe cancer stages, a large portion of T cells is the most noticeable. Biologically, T cells contain many sub-types, including CD4+, CD8+, regulatory T cells, and more, and their functions are diverse in the tumor microenvironment [ 40 ]. To explore SCIPAC’s discoveries, we compare T cells that are identified as most statistically significant, with p -values \(< 10^{-6}\) and circled in Fig. 4 d, with the other T cells. Differential expression (DE) analysis (details about DE analysis and other analyses are given in Additional file 1) identifies seven genes upregulated in these most significant T cells. Of these seven genes, at least five are supported by the literature: CCL4, XCL1, IFNG, and GZMB are associated with CD8+ T cell infiltration; they have been shown to have anti-tumor functions and are involved in cancer immunotherapy [ 41 , 42 , 43 ]. Also, IL2 has been shown to serve an important role in combination therapies for autoimmunity and cancer [ 44 ]. We also perform an enrichment analysis [ 45 ], in which a pathway called Myc stands out with a \(\textit{p}\text{-value}<10^{-7}\) , much smaller than all other pathways. Myc is downregulated in the T cells that are identified as most negatively associated with cancer stage progress. This agrees with current biological knowledge about this pathway: Myc is known to contribute to malignant cell transformation and tumor metastasis [ 46 , 47 , 48 ].

On the above, we have compared T cells that are most significantly associated with cancer stages in the negative direction with the other T cells using DE and pathway analysis, and the results could suggest that these cells are tumor-infiltrated CD8+ T cells with tumor-inhibition functions. To check this hypothesis, we perform DE analysis of these cells against all other cells (i.e., the other T cells and all the other cell types). The DE genes are shown in Fig. 4 e. It can be noted that CD8+ T cell marker genes such as CD8A, CD8B, and GZMK are upregulated. We further obtain CD8+ T cell marker genes from CellMarker [ 49 ] and check their expression, as illustrated in Fig. 4 f. Marker genes CD8A, CD8B, CD3D, GZMK, and CD7 show significantly higher expression in these T cells. This again supports our hypothesis that these cells are tumor-infiltrated CD8+ T cells that have anti-tumor functions.

Interestingly, not all T cells are identified as negatively associated with severe cancer stages; a group of T cells is identified as positively associated, as circled in Fig. 4 c. To explore the function of this group of T cells, we perform DE analysis of these T cells against the other T cells. The DE genes are shown in Fig. 4 g. Based on the literature, six out of eight over-expressed genes are associated with cancer development. The high expression of NUSAP1 gene is associated with poor patient overall survival, and this gene also serves as a prognostic factor in breast cancer [ 50 , 51 , 52 ]. Gene MKI67 has been treated as a candidate prognostic prediction for cancer proliferation [ 53 , 54 ]. The over-expression of RRM2 has been linked to higher proliferation and invasiveness of malignant cells [ 55 , 56 ], and the upregulation of RRM2 in breast cancer suggests it to be a possible prognostic indicator [ 57 , 58 , 59 , 60 , 61 , 62 ]. The high expression of UBE2C gene always occurs in cancers with a high degree of malignancy, low differentiation, and high metastatic tendency [ 63 ]. For gene TOP2A, it has been proposed that the HER2 amplification in HER2 breast cancers may be a direct result of the frequent co-amplification of TOP2A [ 64 , 65 , 66 ], and there is a high correlation between the high expressions of TOP2A and the oncogene HER2 [ 67 , 68 ]. Gene CENPF is a cell cycle-associated gene, and it has been identified as a marker of cell proliferation in breast cancers [ 69 ]. The over-expression of these genes strongly supports the correctness of the association identified by SCIPAC. To further validate this positive association, we perform DE analysis of these cells against all the other cells. We find that the top marker genes obtained from CellMarker [ 49 ] for the regulatory T cells, which are known to be immunosuppressive and promote cancer progression [ 70 ], are over-expressed with statistical significance, as shown in Fig. 4 h. This finding again provides strong evidence that the positive association identified by SCIPAC for this group of T cells is correct.

Lung cancer data with survival information

The scRNA-seq data for lung cancer are from [ 71 ], and we use two lung adenocarcinoma (LUAD) patients’ data with 29,888 cells. The true cell types are shown in Fig. 5 a. The bulk data consist of 576 TCGA-LUAD samples with survival status and time.

figure 5

UMAP visualization of a–d  the lung cancer data and e–g  the muscular dystrophy data. a  True cell types. b , c  Association strengths \(\Lambda\) and p -values given by SCIPAC under the default resolution. d  Results given by Scissor under different \(\alpha\) values. e , f  Association strengths \(\Lambda\) and p -values given by SCIPAC under the default resolution. Circled are a group of cells that are identified by SCIPAC as significantly positively associated with the disease but identified by Scissor as null. g  Results given by Scissor under different \(\alpha\) values

Association strengths and p -values given by SCIPAC are given in Fig. 5 b, c (results under other resolutions are given in Additional file 1: Fig. S13 and S14). In Fig. 5 c, most cells with statistically significant associations are CD4+ T cells or B cells. These associations are negative, meaning that the abundance of these cells is associated with a reduced death rate, i.e., longer survival time. This agrees with the literature: CD4+ T cells primarily mediate anti-tumor immunity and are associated with favorable prognosis in lung cancer patients [ 72 , 73 , 74 ]; B cells also show anti-tumor functions in all stages of human lung cancer development and play an essential role in anti-tumor responses [ 75 , 76 ].

The results by Scissor under different \(\alpha\) values are shown in Fig. 5 d. The highly scattered Scissor+ and Scissor− cells make identifying and interpreting meaningful phenotype-associated cell groups difficult.

Muscular dystrophy data with a binary phenotype

This dataset contains cells from four facioscapulohumeral muscular dystrophy (FSHD) samples and two control samples [ 77 ]. We pool all the 7047 cells from these six samples together. The true cell types of these cells are unknown. The bulk data consists of 27 FSHD patients and eight controls from [ 28 ]. Here the phenotype is FSHD, and it is binary: present or absent.

The results of SCIPAC with the default resolution are given in Fig. 5 e, f. Results under other resolutions are highly similar (shown in Additional file 1: Fig. S15 and S16). For comparison, results given by Scissor under different \(\alpha\) values are presented in Fig. 5 g. The agreements between the results of SCIPAC and Scissor are clear. For example, both methods identify cells located at the top and lower left part of UMAP plots to be negatively associated with FSHD, and cells located at the center and right parts of UMAP plots to be positively associated. However, the discrepancies in their results are also evident. The most pronounced one is a large group of cells (circled in Fig. 5 f) that are identified by SCIPAC as significantly positively associated but are completely ignored by Scissor. Checking into this group of cells, we find that over 90% (424 out of 469) come from the FSHD patients, and less than 10% come from the control samples. However, cells from FSHD patients only compose 73% (5133) of all the 7047 cells. This statistically significant ( p -value \(<10^{-15}\) , Fisher’s exact test) over-representation (odds ratio = 3.51) suggests that the positive association identified SCIPAC is likely to be correct.

SCIPAC is computationally highly efficient. On an 8-core machine with 2.50 GHz CPU and 16 GB RAM, SCIPAC takes 7, 24, and 2 s to finish all the computation and give the estimated association strengths and p -values on the prostate cancer, lung cancer, and muscular dystrophy datasets, respectively. As a reference, Scissor takes 314, 539, and 171 seconds, respectively.

SCIPAC works with various phenotype types, including binary, continuous, survival, and ordinal. It can easily accommodate other types by using a proper regression model with a systematic component in the form of Eq. 3 (see the “ Methods ” section). For example, a Poisson or negative binomial log-linear model can be used if the phenotype is a count (i.e., non-negative integer).

In SCIPAC’s definition of association, a cell type is associated with the phenotype if increasing the proportion of this cell type leads to a change of probability of the phenotype occurring. The strength of association represents the extent of the increase or decrease in this probability. In the case of binary-response, this change is measured by the log odds ratio. For example, if the association strength of cell type A is twice that of cell type B, increasing cell type A by a certain proportion leads to twice the amount of change in the log odds ratio of having the phenotype compared to increasing cell type B by the same proportion. The association strength under other types of phenotypes can be interpreted similarly, with the major difference lying in the measure of change in probability. For quantitative, ordinal, and survival outcomes, the difference in the quantitative outcome, log odds ratio of the right-tail probability, and log hazard ratio respectively are used. Despite the differences in the exact form of the association strength under different types of phenotypes, the underlying concept remains the same: a larger (absolute value of) association strength indicates that the same increase/decrease in a cell type leads to a larger change in the occurrence of the phenotype.

As SCIPAC utilizes both bulk RNA-seq data with phenotype and single-cell RNA-seq data, the estimated associations for the cells are influenced by the choice of the bulk data. Although different bulk data can yield varying estimations of the association for the same single cells, the estimated associations appear to be reasonably robust even when minor changes are made to the bulk data. See Additional file 1 for further discussions.

When using the Louvain algorithm in the Seurat package to cluster cells, SCIPAC’s default resolution is 2.0, larger than the default setting of Seurat. This allows for the identification of potential subtypes within the major cell type and enables the estimation of individual association strengths. Consequently, a more detailed and comprehensive description of the association between single cells and the phenotype can be obtained by SCIPAC.

When applying SCIPAC to real datasets, we made a deliberate choice to disregard the cell annotation provided by the original publications and instead relied on the inferred cell clusters produced by the Louvain algorithm. We made this decision for several reasons. Firstly, we aimed to ensure a fair comparison with Scissor, as it does not utilize cell-type annotations. Secondly, the original annotation might not be sufficiently comprehensive or detailed. Presumed cell types could potentially encompass multiple subtypes, each of which may exhibit distinct associations with the phenotype under investigation. In such cases, employing the Louvain algorithm with a relatively high resolution, which is the default setting in SCIPAC, enables us to differentiate between these subtypes and allows SCIPAC to assign varying association strengths to each subtype.

SCIPAC fits the regression model using the elastic net, a machine-learning algorithm that maximizes a penalized version of the likelihood. The elastic net can be replaced by other penalized estimates of regression models, such as SCAD [ 78 ], without altering the rest of the SCIPAC algorithm. The combination of a regression model and a penalized estimation algorithm such as the elastic net has shown comparable or higher prediction power than other sophisticated methods such as random forests, boosting, or neural networks in numerous applications, especially for gene expression data [ 79 ]. However, there can still be datasets where other models have higher prediction power. It will be future work to incorporate these models into SCIPAC.

The use of metacells is becoming an efficient way to handle large single-cell datasets [ 80 , 81 , 82 , 83 ]. Conceptually, SCIPAC can incorporate metacells and their representatives as an alternative to its default setting of using cell clusters/types and their centroids. We have explored this aspect using metacells provided by SEACells [ 81 ]. Details are given in Additional file 1. Our comparative analysis reveals that combining SCIPAC with SEACells results in significantly reduced performance compared to using SCIPAC directly on original single-cell data. The primary reason for this appears to be the subpar performance of SEACells in cell grouping, especially when contrasted with the Louvain algorithm. Given these findings, we do not suggest using metacells provided by SEACells for SCIPAC applications in the current stage.

Conclusions

SCIPAC is a novel algorithm for studying the associations between cells and phenotypes. Compared to the previous algorithm, SCIPAC gives a much more detailed and comprehensive description of the associations by enabling a quantitative estimation of the association strength and by providing a quality control—the p -value. Underlying SCIPAC are a general statistical model that accommodates virtually all types of phenotypes, including ordinal (and potentially count) phenotypes that have never been considered before, and a concise and closed-form mathematical formula that quantifies the association, which minimizes the computational load. The mathematical conciseness also largely frees SCIPAC from parameter tuning. The only parameter (i.e., the resolution) barely changes the results given by SCIPAC. Overall, compared with its predecessor, SCIPAC represents a substantially more capable software by being much more informative, versatile, robust, and user-friendly.

The improvement in accuracy is also remarkable. In simulated data, SCIPAC achieves high power and low false positives, which is evident from the UMAP plot, F1 score, and FSC score. In real data, SCIPAC gives results that are consistent with current biological knowledge for cell types whose functions are well understood. For cell types whose functions are less studied or more multifaceted, SCIPAC gives support to certain biological hypotheses or helps identify/discover cell sub-types.

SCIPAC’s identification of cell-phenotype associations closely follows its definition of association: when increasing the fraction of a cell type increases (or decreases) the probability for a phenotype to be present, this cell type is positively (or negatively) associated with the phenotype.

The increase of the fraction of a cell type

For a bulk sample, let vector \(\varvec{G} \in \mathbb {R}^p\) be its expression profile, that is, its expression on the p genes. Suppose there are K cell types in the tissue, and let \(\varvec{g}_{k}\) be the representative expression of the k ’th cell type. Usually, people assume that \(\varvec{G}\) can be decomposed by

where \(\gamma _{k}\) is the proportion of cell type k in the bulk tissue, with \(\sum _{k = 1}^{K}\gamma _{k} = 1\) . This equation links the bulk and single-cell expression data.

Now consider increasing cells from cell type k by \(\Delta \gamma\) proportion of the original number of cells. Then, the new proportion of cell type k becomes \(\frac{\gamma _{k} + \Delta \gamma }{1 + \Delta \gamma }\) , and the new proportion of cell type \(j \ne k\) becomes \(\frac{\gamma _{j}}{1 + \Delta \gamma }\)  (note that the new proportions of all cell types should still add up to 1). Thus, the bulk expression profile with the increase of cell type k becomes

Plugging Eq. 1 , we get

Interestingly, this expression of \(\varvec{G}^*\) does not include \(\gamma _{1}, \ldots , \gamma _{K}\) . This means that there is no need actually to compute \(\gamma _{1}, \ldots , \gamma _{K}\) in Eq. 1 , which could otherwise be done using a cell-type-decomposition software, but an accurate and robust decomposition is non-trivial [ 84 , 85 , 86 ]. See Additional file 1 for a more in-depth discussion on the connections of SCIPAC with decomposition/deconvolution.

The change in chance of a phenotype

In this section, we consider how the increase in the fraction of a cell type will change the chance for a binary phenotype such as cancer to occur. Other types of phenotypes will be considered in the next section.

Let \(\pi (\varvec{G})\) be the chance of an individual with gene expression profile \(\varvec{G}\) for this phenotype to occur. We assume a logistic regression model to describe the relationship between \(\pi (\varvec{G})\) and \(\varvec{G}\) :

here the left-hand side is the log odds of \(\pi (\varvec{G})\) , \(\beta _{0}\) is the intercept, and \(\varvec{\beta }\) is a length- p vector of coefficients. In the section after the next, we will describe how we obtain \(\beta _{0}\) and \(\varvec{\beta }\) from the data.

When increasing cells from cell type k by \(\Delta \gamma\) , \(\varvec{G}\) becomes \(\varvec{G}^*\) in Eq. 3 . Plugging Eq. 2 , we get

We further take the difference between Eqs. 4 and 3 and get

The left-hand side of this equation is the log odds ratio (i.e., the change of log odds). On the right-hand side, \(\frac{\Delta \gamma }{1 + \Delta \gamma }\) is an increasing function with respect to \(\Delta \gamma\) , and \(\varvec{\beta }^T(\varvec{g}_{k} - \varvec{G})\) is independent of \(\Delta \gamma\) . This indicates that given any specific \(\Delta \gamma\) , the log odds ratio under over-representation of cell type k is proportional to

\(\lambda _k\) describes the strength of the effect of increasing cell type k to a bulk sample with expression profile \(\varvec{G}\) . Given the presence of numerous bulk samples, employing multiple \(\lambda _k\) ’s could be cumbersome and obscure the overall effect of a particular cell type. To concisely summarize the association of cell type k , we propose averaging their effects. The average effect on all bulk samples can be obtained by

where \(\bar{\varvec{G}}\) is the average expression profile of all bulk samples.

\(\Lambda _k\) gives an overall impression of how strong the effect is when cell type k over-represents to the probability for the phenotype to be present. Its sign represents the direction of the change: a positive value means an increase in probability, and a negative value means a decrease in probability. Its absolute value represents the strength of the effect. In SCIPAC, we call \(\Lambda _k\) the association strength of cell type k and the phenotype.

Note that this derivation does not involve likelihood, although the computation of \(\varvec{\beta }\) does. Here, it serves more as a definitional approach.

Definition of the association strength for other types of phenotype

Our definition of \(\Lambda _k\) relies on vector \(\varvec{\beta }\) . In the case of a binary phenotype, \(\varvec{\beta }\) are the coefficients of a logistic regression that describes a linear relationship between the expression profile and the log odds of having the phenotype, as shown in Eq. 3 . For other types of phenotype, \(\varvec{\beta }\) can be defined/computed similarly.

For a quantitative (i.e., continuous) phenotype, an ordinary linear regression can be used, and the left-hand side of Eq. 3 is changed to the quantitative value of the phenotype.

For a survival phenotype, a Cox proportional hazards model can be used, and the left-hand side of Eq. 3 is changed to the log hazard ratio.

For an ordinal phenotype, we use a proportional odds model

where \(j \in \{1, 2, ..., (J - 1)\}\) and J is the number of ordinal levels. It should be noted that here we use the right-tail probability \(\Pr (Y_{i} \ge j + 1 | X)\) instead of the commonly used cumulative probability (left-tail probability) \(\Pr (Y_{i} \le j | X)\) . Such a change makes the interpretation consistent with other types of phenotypes: in our model, a larger value on the right-hand side indicates a larger chance for \(Y_{i}\) to have a higher level, which in turn guarantees that the sign of the association strength defined according to this \(\varvec{\beta }\) has the usual meaning: a positive \(\Lambda _k\) value means a positive association with the phenotype-using the cancer stage as an example. A positive \(\Lambda _k\) means the over-representation of cell type k increases the chance of a higher cancer stage. In contrast, using the commonly used cumulative probability leads to a counter-intuitive, reversed interpretation.

Computation of the association strength in practice

In practice, \(\varvec{\beta }\) in Eq. 3 needs to be learned from the bulk data. By default, SCIPAC uses the elastic net, a popular and powerful penalized regression method:

In this model, \(l(\beta _{0}, \varvec{\beta })\) is a log-likelihood of the linear model (i.e., logistic regression for a binary phenotype, ordinary linear regression for a quantitative phenotype, Cox proportional odds model for a survival phenotype, and proportional odds model for an ordinal phenotype). \(\alpha\) is a number between 0 and 1, denoting a combination of \(\ell _1\) and \(\ell _2\) penalties, and \(\lambda\) is the penalty strength. SCIPAC fixes \(\alpha\) to be 0.4 (see Additional file 1 for discussions on this choice) and uses 10-fold cross-validation to decide \(\lambda\) automatically. This way, they do not become hyperparameters.

In SCIPAC, the fitting and cross-validation of the elastic net are done by calling the ordinalNet [ 87 ] R package for the ordinal phenotype and by calling the glmnet R package [ 88 , 89 , 90 , 91 ] for other types of phenotypes.

The computation of the association strength, as defined by Eq. 7 , does not only require \(\varvec{\beta }\) , but also \(\varvec{g}_k\) and \(\bar{\varvec{G}}\) . \(\bar{\varvec{G}}\) is simply the average expression profile of all bulk samples. On the other hand, \(\varvec{g}_k\) requires knowing the cell type of each cell. By default, SCIPAC does not assume this information to be given, and it uses the Louvain clustering implemented in the Seurat [ 24 , 25 ] R package to infer it. This clustering algorithm has one tuning parameter called “resolution.” SCIPAC sets its default value as 2.0, and the user can use other values. With the inferred or given cell types, \(\varvec{g}_k\) is computed as the centroid (i.e., the mean expression profile) of cells in cluster k .

Given \(\varvec{\beta }\) , \(\bar{\varvec{G}}\) , and \(\varvec{g}_k\) , the association strength can be computed using Eq. 7 . Knowing the association strength for each cell type and the cell-type label for each cell, we also know the association strength for every single cell. In practice, we standardize the association strengths for all cells. That is, we compute the mean and standard deviation of the association strengths of all cells and use them to centralize and scale the association strength, respectively. We have found such standardization makes SCIPAC more robust to the possible unbalance in sample size of bulk data in different phenotype groups.

Computation of the p -value

SCIPAC uses non-parametric bootstrap [ 92 ] to compute the standard deviation and hence the p -value of the association. Fifty bootstrap samples, which are believed to be enough to compute the standard error of most statistics [ 93 ], are generated for the bulk expression data, and each is used to compute (standardized) \(\Lambda\) values for all the cells. For cell i , let its original \(\Lambda\) values be \(\Lambda _i\) , and the bootstrapped values be \(\Lambda _i^{(1)}, \ldots , \Lambda _i^{(50)}\) . A z -score is then computed using

and then the p -value is computed according to the cumulative distribution function of the standard Gaussian distribution. See Additional file 1 for more discussions on the calculation of p -value.

Availability of data and materials

The simulated datasets [ 94 ] under three schemes are available at Zenodo with DOI 10.5281/zenodo.11013320 [ 95 ]. The SCIPAC package is available at GitHub website https://github.com/RavenGan/SCIPAC under the MIT license [ 96 ]. The source code of SCIPAC is also deposited at Zenodo with DOI 10.5281/zenodo.11013696 [ 97 ]. A vignette of the R package is available on the GitHub page and in the Additional file 2. The prostate cancer scRNA-seq data is obtained from the Prostate Cell Atlas https://www.prostatecellatlas.org [ 29 ]; the scRNA-seq data for the breast cancer are from the Gene Expression Omnibus (GEO) under accession number GSE176078 [ 34 , 98 ]; the scRNA-seq data for the lung cancer are from E-MTAB-6149 [ 99 ] and E-MTAB-6653 [ 71 , 100 ]; the scRNA-seq data for facioscapulohumeral muscular dystrophy data are from the GEO under accession number GSE122873 [ 101 ]. The bulk RNA-seq data are obtained from the TCGA database via TCGAbiolinks (ver. 2.25.2) R package [ 102 ]. More details about the simulated and real scRNA-seq and bulk RNA-seq data can be found in the Additional file 1.

Yofe I, Dahan R, Amit I. Single-cell genomic approaches for developing the next generation of immunotherapies. Nat Med. 2020;26(2):171–7.

Article   CAS   PubMed   Google Scholar  

Zhang Q, He Y, Luo N, Patel SJ, Han Y, Gao R, et al. Landscape and dynamics of single immune cells in hepatocellular carcinoma. Cell. 2019;179(4):829–45.

Fan J, Slowikowski K, Zhang F. Single-cell transcriptomics in cancer: computational challenges and opportunities. Exp Mol Med. 2020;52(9):1452–65.

Article   CAS   PubMed   PubMed Central   Google Scholar  

Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015;161(5):1187–201.

Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161(5):1202–14.

Rosenberg AB, Roco CM, Muscat RA, Kuchina A, Sample P, Yao Z, et al. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science. 2018;360(6385):176–82.

Zheng GX, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8(1):1–12.

Article   Google Scholar  

Abdelaal T, Michielsen L, Cats D, Hoogduin D, Mei H, Reinders MJ, et al. A comparison of automatic cell identification methods for single-cell RNA sequencing data. Genome Biol. 2019;20(1):1–19.

Article   CAS   Google Scholar  

Luecken MD, Theis FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol. 2019;15(6):e8746.

Article   PubMed   PubMed Central   Google Scholar  

Guo H, Li J. scSorter: assigning cells to known cell types according to marker genes. Genome Biol. 2021;22(1):1–18.

Pliner HA, Shendure J, Trapnell C. Supervised classification enables rapid annotation of cell atlases. Nat Methods. 2019;16(10):983–6.

Zhang AW, O’Flanagan C, Chavez EA, Lim JL, Ceglia N, McPherson A, et al. Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling. Nat Methods. 2019;16(10):1007–15.

Zhang Z, Luo D, Zhong X, Choi JH, Ma Y, Wang S, et al. SCINA: a semi-supervised subtyping algorithm of single cells and bulk samples. Genes. 2019;10(7):531.

Johnson TS, Wang T, Huang Z, Yu CY, Wu Y, Han Y, et al. LAmbDA: label ambiguous domain adaptation dataset integration reduces batch effects and improves subtype detection. Bioinformatics. 2019;35(22):4696–706.

Ma F, Pellegrini M. ACTINN: automated identification of cell types in single cell RNA sequencing. Bioinformatics. 2020;36(2):533–8.

Tan Y, Cahan P. SingleCellNet: a computational tool to classify single cell RNA-Seq data across platforms and across species. Cell Syst. 2019;9(2):207–13.

Salcher S, Sturm G, Horvath L, Untergasser G, Kuempers C, Fotakis G, et al. High-resolution single-cell atlas reveals diversity and plasticity of tissue-resident neutrophils in non-small cell lung cancer. Cancer Cell. 2022;40(12):1503–20.

Good Z, Sarno J, Jager A, Samusik N, Aghaeepour N, Simonds EF, et al. Single-cell developmental classification of B cell precursor acute lymphoblastic leukemia at diagnosis reveals predictors of relapse. Nat Med. 2018;24(4):474–83.

Wagner J, Rapsomaniki MA, Chevrier S, Anzeneder T, Langwieder C, Dykgers A, et al. A single-cell atlas of the tumor and immune ecosystem of human breast cancer. Cell. 2019;177(5):1330–45.

Weinstein JN, Collisson EA, Mills GB, Shaw KR, Ozenberger BA, Ellrott K, et al. The cancer genome atlas pan-cancer analysis project. Nat Genet. 2013;45(10):1113–20.

Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, Aksoy BA, et al. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Disc. 2012;2(5):401–4.

Gao J, Aksoy BA, Dogrusoz U, Dresdner G, Gross B, Sumer SO, et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal. 2013;6(269):1.

Sun D, Guan X, Moran AE, Wu LY, Qian DZ, Schedin P, et al. Identifying phenotype-associated subpopulations by integrating bulk and single-cell sequencing data. Nat Biotechnol. 2022;40(4):527–38.

Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech Theory Exp. 2008;2008(10):P10008.

Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM III, et al. Comprehensive integration of single-cell data. Cell. 2019;177(7):1888–902.

Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol. 2005;67(2):301–20.

McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. 2018. arXiv preprint arXiv:1802.03426 .

Wong CJ, Wang LH, Friedman SD, Shaw D, Campbell AE, Budech CB, et al. Longitudinal measures of RNA expression and disease activity in FSHD muscle biopsies. Hum Mol Genet. 2020;29(6):1030–43.

Tuong ZK, Loudon KW, Berry B, Richoz N, Jones J, Tan X, et al. Resolving the immune landscape of human prostate at a single-cell level in health and cancer. Cell Rep. 2021;37(12):110132.

Hume DA. The mononuclear phagocyte system. Curr Opin Immunol. 2006;18(1):49–53.

Hume DA, Ross IL, Himes SR, Sasmono RT, Wells CA, Ravasi T. The mononuclear phagocyte system revisited. J Leukoc Biol. 2002;72(4):621–7.

Raggi F, Bosco MC. Targeting mononuclear phagocyte receptors in cancer immunotherapy: new perspectives of the triggering receptor expressed on myeloid cells (TREM-1). Cancers. 2020;12(5):1337.

Largeot A, Pagano G, Gonder S, Moussay E, Paggetti J. The B-side of cancer immunity: the underrated tune. Cells. 2019;8(5):449.

Wu SZ, Al-Eryani G, Roden DL, Junankar S, Harvey K, Andersson A, et al. A single-cell and spatially resolved atlas of human breast cancers. Nat Genet. 2021;53(9):1334–47.

Fernández-Nogueira P, Fuster G, Gutierrez-Uzquiza Á, Gascón P, Carbó N, Bragado P. Cancer-associated fibroblasts in breast cancer treatment response and metastasis. Cancers. 2021;13(13):3146.

Ao Z, Shah SH, Machlin LM, Parajuli R, Miller PC, Rawal S, et al. Identification of cancer-associated fibroblasts in circulating blood from patients with metastatic breast cancer. Identification of cCAFs from metastatic cancer patients. Cancer Res. 2015;75(22):4681–7.

Arcucci A, Ruocco MR, Granato G, Sacco AM, Montagnani S. Cancer: an oxidative crosstalk between solid tumor cells and cancer associated fibroblasts. BioMed Res Int. 2016;2016.  https://pubmed.ncbi.nlm.nih.gov/27595103/ .

Buchsbaum RJ, Oh SY. Breast cancer-associated fibroblasts: where we are and where we need to go. Cancers. 2016;8(2):19.

Ruocco MR, Avagliano A, Granato G, Imparato V, Masone S, Masullo M, et al. Involvement of breast cancer-associated fibroblasts in tumor development, therapy resistance and evaluation of potential therapeutic strategies. Curr Med Chem. 2018;25(29):3414–34.

Savas P, Virassamy B, Ye C, Salim A, Mintoff CP, Caramia F, et al. Single-cell profiling of breast cancer T cells reveals a tissue-resident memory subset associated with improved prognosis. Nat Med. 2018;24(7):986–93.

Bassez A, Vos H, Van Dyck L, Floris G, Arijs I, Desmedt C, et al. A single-cell map of intratumoral changes during anti-PD1 treatment of patients with breast cancer. Nat Med. 2021;27(5):820–32.

Romero JM, Grünwald B, Jang GH, Bavi PP, Jhaveri A, Masoomian M, et al. A four-chemokine signature is associated with a T-cell-inflamed phenotype in primary and metastatic pancreatic cancer. Chemokines in Pancreatic Cancer. Clin Cancer Res. 2020;26(8):1997–2010.

Tamura R, Yoshihara K, Nakaoka H, Yachida N, Yamaguchi M, Suda K, et al. XCL1 expression correlates with CD8-positive T cells infiltration and PD-L1 expression in squamous cell carcinoma arising from mature cystic teratoma of the ovary. Oncogene. 2020;39(17):3541–54.

Hernandez R, Põder J, LaPorte KM, Malek TR. Engineering IL-2 for immunotherapy of autoimmunity and cancer. Nat Rev Immunol. 2022:22:1–15.  https://pubmed.ncbi.nlm.nih.gov/35217787/ .

Korotkevich G, Sukhov V, Budin N, Shpak B, Artyomov MN, Sergushichev A. Fast gene set enrichment analysis. BioRxiv. 2016:060012.  https://www.biorxiv.org/content/10.1101/060012v3.abstract .

Dang CV. MYC on the path to cancer. Cell. 2012;149(1):22–35.

Gnanaprakasam JR, Wang R. MYC in regulating immunity: metabolism and beyond. Genes. 2017;8(3):88.

Oshi M, Takahashi H, Tokumaru Y, Yan L, Rashid OM, Matsuyama R, et al. G2M cell cycle pathway score as a prognostic biomarker of metastasis in estrogen receptor (ER)-positive breast cancer. Int J Mol Sci. 2020;21(8):2921.

Zhang X, Lan Y, Xu J, Quan F, Zhao E, Deng C, et al. Cell Marker: a manually curated resource of cell markers in human and mouse. Nucleic Acids Res. 2019;47(D1):D721–8.

Chen L, Yang L, Qiao F, Hu X, Li S, Yao L, et al. High levels of nucleolar spindle-associated protein and reduced levels of BRCA1 expression predict poor prognosis in triple-negative breast cancer. PLoS ONE. 2015;10(10):e0140572.

Li M, Yang B. Prognostic value of NUSAP1 and its correlation with immune infiltrates in human breast cancer. Crit Rev TM Eukaryot Gene Expr. 2022;32(3).  https://pubmed.ncbi.nlm.nih.gov/35695609/ .

Zhang X, Pan Y, Fu H, Zhang J. Nucleolar and spindle associated protein 1 (NUSAP1) inhibits cell proliferation and enhances susceptibility to epirubicin in invasive breast cancer cells by regulating cyclin D kinase (CDK1) and DLGAP5 expression. Med Sci Monit: Int Med J Exp Clin Res. 2018;24:8553.

Geyer FC, Rodrigues DN, Weigelt B, Reis-Filho JS. Molecular classification of estrogen receptor-positive/luminal breast cancers. Adv Anat Pathol. 2012;19(1):39–53.

Karamitopoulou E, Perentes E, Tolnay M, Probst A. Prognostic significance of MIB-1, p53, and bcl-2 immunoreactivity in meningiomas. Hum Pathol. 1998;29(2):140–5.

Duxbury MS, Whang EE. RRM2 induces NF- \(\kappa\) B-dependent MMP-9 activation and enhances cellular invasiveness. Biochem Biophys Res Commun. 2007;354(1):190–6.

Zhou BS, Tsai P, Ker R, Tsai J, Ho R, Yu J, et al. Overexpression of transfected human ribonucleotide reductase M2 subunit in human cancer cells enhances their invasive potential. Clin Exp Metastasis. 1998;16(1):43–9.

Zhang H, Liu X, Warden CD, Huang Y, Loera S, Xue L, et al. Prognostic and therapeutic significance of ribonucleotide reductase small subunit M2 in estrogen-negative breast cancers. BMC Cancer. 2014;14(1):1–16.

Putluri N, Maity S, Kommagani R, Creighton CJ, Putluri V, Chen F, et al. Pathway-centric integrative analysis identifies RRM2 as a prognostic marker in breast cancer associated with poor survival and tamoxifen resistance. Neoplasia. 2014;16(5):390–402.

Koleck TA, Conley YP. Identification and prioritization of candidate genes for symptom variability in breast cancer survivors based on disease characteristics at the cellular level. Breast Cancer Targets Ther. 2016;8:29.

Li Jp, Zhang Xm, Zhang Z, Zheng Lh, Jindal S, Liu Yj. Association of p53 expression with poor prognosis in patients with triple-negative breast invasive ductal carcinoma. Medicine. 2019;98(18).  https://pubmed.ncbi.nlm.nih.gov/31045815/ .

Gong MT, Ye SD, Lv WW, He K, Li WX. Comprehensive integrated analysis of gene expression datasets identifies key anti-cancer targets in different stages of breast cancer. Exp Ther Med. 2018;16(2):802–10.

PubMed   PubMed Central   Google Scholar  

Chen Wx, Yang Lg, Xu Ly, Cheng L, Qian Q, Sun L, et al. Bioinformatics analysis revealing prognostic significance of RRM2 gene in breast cancer. Biosci Rep. 2019;39(4).  https://pubmed.ncbi.nlm.nih.gov/30898978/ .

Hao Z, Zhang H, Cowell J. Ubiquitin-conjugating enzyme UBE2C: molecular biology, role in tumorigenesis, and potential as a biomarker. Tumor Biol. 2012;33(3):723–30.

Arriola E, Rodriguez-Pinilla SM, Lambros MB, Jones RL, James M, Savage K, et al. Topoisomerase II alpha amplification may predict benefit from adjuvant anthracyclines in HER2 positive early breast cancer. Breast Cancer Res Treat. 2007;106(2):181–9.

Knoop AS, Knudsen H, Balslev E, Rasmussen BB, Overgaard J, Nielsen KV, et al. Retrospective analysis of topoisomerase IIa amplifications and deletions as predictive markers in primary breast cancer patients randomly assigned to cyclophosphamide, methotrexate, and fluorouracil or cyclophosphamide, epirubicin, and fluorouracil: Danish Breast Cancer Cooperative Group. J Clin Oncol. 2005;23(30):7483–90.

Tanner M, Isola J, Wiklund T, Erikstein B, Kellokumpu-Lehtinen P, Malmstrom P, et al. Topoisomerase II \(\alpha\) gene amplification predicts favorable treatment response to tailored and dose-escalated anthracycline-based adjuvant chemotherapy in HER-2/neu-amplified breast cancer: Scandinavian Breast Group Trial 9401. J Clin Oncol. 2006;24(16):2428–36.

Arriola E, Moreno A, Varela M, Serra JM, Falo C, Benito E, et al. Predictive value of HER-2 and topoisomerase II \(\alpha\) in response to primary doxorubicin in breast cancer. Eur J Cancer. 2006;42(17):2954–60.

Järvinen TA, Tanner M, Bärlund M, Borg Å, Isola J. Characterization of topoisomerase II \(\alpha\) gene amplification and deletion in breast cancer. Gene Chromosome Cancer. 1999;26(2):142–50.

Landberg G, Erlanson M, Roos G, Tan EM, Casiano CA. Nuclear autoantigen p330d/CENP-F: a marker for cell proliferation in human malignancies. Cytom J Int Soc Anal Cytol. 1996;25(1):90–8.

CAS   Google Scholar  

Bettelli E, Carrier Y, Gao W, Korn T, Strom TB, Oukka M, et al. Reciprocal developmental pathways for the generation of pathogenic effector TH17 and regulatory T cells. Nature. 2006;441(7090):235–8.

Lambrechts D, Wauters E, Boeckx B, Aibar S, Nittner D, Burton O, et al. Phenotype molding of stromal cells in the lung tumor microenvironment. Nat Med. 2018;24(8):1277–89.

Bremnes RM, Busund LT, Kilvær TL, Andersen S, Richardsen E, Paulsen EE, et al. The role of tumor-infiltrating lymphocytes in development, progression, and prognosis of non-small cell lung cancer. J Thorac Oncol. 2016;11(6):789–800.

Article   PubMed   Google Scholar  

Schalper KA, Brown J, Carvajal-Hausdorf D, McLaughlin J, Velcheti V, Syrigos KN, et al. Objective measurement and clinical significance of TILs in non–small cell lung cancer. J Natl Cancer Inst. 2015;107(3):dju435.

Tay RE, Richardson EK, Toh HC. Revisiting the role of CD4+ T cells in cancer immunotherapy—new insights into old paradigms. Cancer Gene Ther. 2021;28(1):5–17.

Dieu-Nosjean MC, Goc J, Giraldo NA, Sautès-Fridman C, Fridman WH. Tertiary lymphoid structures in cancer and beyond. Trends Immunol. 2014;35(11):571–80.

Wang Ss, Liu W, Ly D, Xu H, Qu L, Zhang L. Tumor-infiltrating B cells: their role and application in anti-tumor immunity in lung cancer. Cell Mol Immunol. 2019;16(1):6–18.

van den Heuvel A, Mahfouz A, Kloet SL, Balog J, van Engelen BG, Tawil R, et al. Single-cell RNA sequencing in facioscapulohumeral muscular dystrophy disease etiology and development. Hum Mol Genet. 2019;28(7):1064–75.

Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001;96(456):1348–60.

Hastie T, Tibshirani R, Friedman JH, Friedman JH. The elements of statistical learning: data mining, inference, and prediction, vol. 2. New York: Springer; 2009.

Book   Google Scholar  

Baran Y, Bercovich A, Sebe-Pedros A, Lubling Y, Giladi A, Chomsky E, et al. MetaCell: analysis of single-cell RNA-seq data using K-nn graph partitions. Genome Biol. 2019;20(1):1–19.

Persad S, Choo ZN, Dien C, Sohail N, Masilionis I, Chaligné R, et al. SEACells infers transcriptional and epigenomic cellular states from single-cell genomics data. Nat Biotechnol. 2023;41:1–12.  https://pubmed.ncbi.nlm.nih.gov/36973557/ .

Ben-Kiki O, Bercovich A, Lifshitz A, Tanay A. Metacell-2: a divide-and-conquer metacell algorithm for scalable scRNA-seq analysis. Genome Biol. 2022;23(1):100.

Bilous M, Tran L, Cianciaruso C, Gabriel A, Michel H, Carmona SJ, et al. Metacells untangle large and complex single-cell transcriptome networks. BMC Bioinformatics. 2022;23(1):336.

Avila Cobos F, Alquicira-Hernandez J, Powell JE, Mestdagh P, De Preter K. Benchmarking of cell type deconvolution pipelines for transcriptomics data. Nat Commun. 2020;11(1):1–14.

Jin H, Liu Z. A benchmark for RNA-seq deconvolution analysis under dynamic testing environments. Genome Biol. 2021;22(1):1–23.

Wang X, Park J, Susztak K, Zhang NR, Li M. Bulk tissue cell type deconvolution with multi-subject single-cell expression reference. Nat Commun. 2019;10(1):380.

Wurm MJ, Rathouz PJ, Hanlon BM. Regularized ordinal regression and the ordinalNet R package. 2017. arXiv preprint arXiv:1706.05003 .

Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1.

Simon N, Friedman J, Hastie T. A blockwise descent algorithm for group-penalized multiresponse and multinomial regression. 2013. arXiv preprint arXiv:1311.6529 .

Simon N, Friedman J, Hastie T, Tibshirani R. Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw. 2011;39(5):1.

Tibshirani R, Bien J, Friedman J, Hastie T, Simon N, Taylor J, et al. Strong rules for discarding predictors in lasso-type problems. J R Stat Soc Ser B Stat Methodol. 2012;74(2):245–66.

Efron B. Bootstrap methods: another look at the jackknife. In: Breakthroughs in statistics. New York: Springer; 1992. pp. 569–593.

Efron B, Tibshirani RJ. An introduction to the bootstrap. London: CRC Press; 1994.

Zappia L, Phipson B, Oshlack A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 2017;18(1):174.

Gan D, Zhu Y, Lu X, Li J. Simulated datasets used in SCIPAC analysis. Zenodo. 2024. https://doi.org/10.5281/zenodo.11013320 .

Gan D, Zhu Y, Lu X, Li J. SCIPAC R package. GitHub. 2024. https://github.com/RavenGan/SCIPAC . Accessed 24 Apr 2024.

Gan D, Zhu Y, Lu X, Li J. SCIPAC source code. Zenodo. 2024. https://doi.org/10.5281/zenodo.11013696 .

Wu SZ, Al-Eryani G, Roden DL, Junankar S, Harvey K, Andersson A, et al. A single-cell and spatially resolved atlas of human breast cancers. Datasets. 2021. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE176078 . Gene Expression Omnibus. Accessed 1 Oct 2022.

Lambrechts D, Wauters E, Boeckx B, Aibar S, Nittner D, Burton O, et al. Phenotype molding of stromal cells in the lung tumor microenvironment. Datasets. 2018. https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-6149 . ArrayExpress. Accessed 24 July 2022.

Lambrechts D, Wauters E, Boeckx B, Aibar S, Nittner D, Burton O, et al. Phenotype molding of stromal cells in the lung tumor microenvironment. Datasets. 2018. https://www.ebi.ac.uk/biostudies/arrayexpress/studies/E-MTAB-6653 . ArrayExpress. Accessed 24 July 2022.

van den Heuvel A, Mahfouz A, Kloet SL, Balog J, van Engelen BG, Tawil R, et al. Single-cell RNA sequencing in facioscapulohumeral muscular dystrophy disease etiology and development. Datasets. 2019. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122873 . Gene Expression Omnibus. Accessed 13 Aug 2022.

Colaprico A, Silva TC, Olsen C, Garofano L, Cava C, Garolini D, et al. TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res. 2016;44(8):e71.

Download references

Review history

The review history is available as Additional file 3.

Peer review information

Veronique van den Berghe was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

This work is supported by the National Institutes of Health (R01CA280097 to X.L. and J.L, R01CA252878 to J.L.) and the DOD BCRP Breakthrough Award, Level 2 (W81XWH2110432 to J.L.).

Author information

Authors and affiliations.

Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, 46556, IN, USA

Dailin Gan & Jun Li

Department of Biological Sciences, Boler-Parseghian Center for Rare and Neglected Diseases, Harper Cancer Research Institute, Integrated Biomedical Sciences Graduate Program, University of Notre Dame, Notre Dame, 46556, IN, USA

Yini Zhu & Xin Lu

Tumor Microenvironment and Metastasis Program, Indiana University Melvin and Bren Simon Comprehensive Cancer Center, Indianapolis, 46202, IN, USA

You can also search for this author in PubMed   Google Scholar

Contributions

J.L. conceived and supervised the study. J.L. and D.G. proposed the methods. D.G. implemented the methods and analyzed the data. D.G. and J.L. drafted the paper. D.G., Y.Z., X.L., and J.L. interpreted the results and revised the paper.

Corresponding author

Correspondence to Jun Li .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1. supplementary materials that include additional results and plots., additional file 2. a vignette of the scipac package., additional file 3. review history., rights and permissions.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article.

Gan, D., Zhu, Y., Lu, X. et al. SCIPAC: quantitative estimation of cell-phenotype associations. Genome Biol 25 , 119 (2024). https://doi.org/10.1186/s13059-024-03263-1

Download citation

Received : 30 January 2023

Accepted : 30 April 2024

Published : 13 May 2024

DOI : https://doi.org/10.1186/s13059-024-03263-1

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Phenotype association
  • Single cell
  • RNA sequencing
  • Cancer research

Genome Biology

ISSN: 1474-760X

the null hypothesis stands

IMAGES

  1. 15 Null Hypothesis Examples (2024)

    the null hypothesis stands

  2. Null hypothesis

    the null hypothesis stands

  3. How to Write a Null Hypothesis (with Examples and Templates)

    the null hypothesis stands

  4. How to Write a Null Hypothesis (with Examples and Templates)

    the null hypothesis stands

  5. Null Hypothesis: What Is It and How Is It Used in Investing?

    the null hypothesis stands

  6. Null Hypothesis Examples

    the null hypothesis stands

VIDEO

  1. Null Hypothesis (Ho)

  2. Null vs Alternative Hypothesis

  3. a Null hypothesis

  4. Misunderstanding The Null Hypothesis

  5. Null & Alternative Hypothesis |Statistical Hypothesis #hypothesis #samplingdistribution #statistics

  6. Null Hypothesis

COMMENTS

  1. Null hypothesis

    Basic definitions. The null hypothesis and the alternative hypothesis are types of conjectures used in statistical tests to make statistical inferences, which are formal methods of reaching conclusions and separating scientific claims from statistical noise.. The statement being tested in a test of statistical significance is called the null hypothesis. . The test of significance is designed ...

  2. Null Hypothesis: Definition, Rejecting & Examples

    When your sample contains sufficient evidence, you can reject the null and conclude that the effect is statistically significant. Statisticians often denote the null hypothesis as H 0 or H A.. Null Hypothesis H 0: No effect exists in the population.; Alternative Hypothesis H A: The effect exists in the population.; In every study or experiment, researchers assess an effect or relationship.

  3. Exploring the Null Hypothesis: Definition and Purpose

    The word Null in the context of hypothesis testing means "nothing" or "zero.". As an example, if we wanted to test whether there was a difference in two population means based on the calculations from two samples, we would state the Null Hypothesis in the form of: Ho: mu1 = mu2 or mu1- mu2 = 0. In other words, there is no difference, or ...

  4. Null Hypothesis Definition and Examples, How to State

    Step 1: Figure out the hypothesis from the problem. The hypothesis is usually hidden in a word problem, and is sometimes a statement of what you expect to happen in the experiment. The hypothesis in the above question is "I expect the average recovery period to be greater than 8.2 weeks.". Step 2: Convert the hypothesis to math.

  5. 9.1 Null and Alternative Hypotheses

    The actual test begins by considering two hypotheses.They are called the null hypothesis and the alternative hypothesis.These hypotheses contain opposing viewpoints. H 0, the —null hypothesis: a statement of no difference between sample means or proportions or no difference between a sample mean or proportion and a population mean or proportion. In other words, the difference equals 0.

  6. 9.1: Null and Alternative Hypotheses

    The actual test begins by considering two hypotheses.They are called the null hypothesis and the alternative hypothesis.These hypotheses contain opposing viewpoints. \(H_0\): The null hypothesis: It is a statement of no difference between the variables—they are not related. This can often be considered the status quo and as a result if you cannot accept the null it requires some action.

  7. 7.3: The Null Hypothesis

    The null hypothesis in a correlational study of the relationship between high school grades and college grades would typically be that the population correlation is 0. This can be written as. H0: ρ = 0 (7.3.2) (7.3.2) H 0: ρ = 0. where ρ ρ is the population correlation, which we will cover in chapter 12. Although the null hypothesis is ...

  8. Null hypothesis

    The null hypothesis is usually denoted by the symbol (read "H-zero", "H-nought" or "H-null"). The letter in the symbol stands for "Hypothesis". More examples. More examples of null hypotheses and how to test them can be found in the following lectures.

  9. 9.1 Null and Alternative Hypothesis

    Section 9.1 Null and Alternative Hypothesis. Learning Objective: In this section, you will: • Understand the general concept and use the terminology of hypothesis testing. I claim that my coin is a fair coin. This means that the probability of heads and the probability of tails are both 50% or 0.50. Out of 200 flips of the coin, tails is ...

  10. Null and Alternative Hypotheses

    The null and alternative hypotheses are two competing claims that researchers weigh evidence for and against using a statistical test: Null hypothesis (H0): There's no effect in the population. Alternative hypothesis (HA): There's an effect in the population. The effect is usually the effect of the independent variable on the dependent ...

  11. How to Write a Null Hypothesis (5 Examples)

    Whenever we perform a hypothesis test, we always write a null hypothesis and an alternative hypothesis, which take the following forms: H0 (Null Hypothesis): Population parameter =, ≤, ≥ some value. HA (Alternative Hypothesis): Population parameter <, >, ≠ some value. Note that the null hypothesis always contains the equal sign.

  12. Null & Alternative Hypotheses

    Null hypothesis (H 0): Independent variable does not affect dependent variable. Alternative hypothesis (H a): Independent variable affects dependent variable. Test-specific template sentences. Once you know the statistical test you'll be using, you can write your hypotheses in a more precise and mathematical way specific to the test you chose ...

  13. 10.2: Null and Alternative Hypotheses

    The alternative hypothesis ( Ha H a) is a claim about the population that is contradictory to H0 H 0 and what we conclude when we reject H0 H 0. Since the null and alternative hypotheses are contradictory, you must examine evidence to decide if you have enough evidence to reject the null hypothesis or not. The evidence is in the form of sample ...

  14. Null Hypothesis Definition and Examples

    Null Hypothesis Examples. "Hyperactivity is unrelated to eating sugar " is an example of a null hypothesis. If the hypothesis is tested and found to be false, using statistics, then a connection between hyperactivity and sugar ingestion may be indicated. A significance test is the most common statistical test used to establish confidence in a ...

  15. Examples of null and alternative hypotheses

    It is the opposite of your research hypothesis. The alternative hypothesis--that is, the research hypothesis--is the idea, phenomenon, observation that you want to prove. If you suspect that girls take longer to get ready for school than boys, then: Alternative: girls time > boys time. Null: girls time <= boys time.

  16. Null and Alternative Hypotheses

    Some of the following statements refer to the null hypothesis, some to the alternate hypothesis. State the null hypothesis, H 0, and the alternative hypothesis. H a, in terms of the appropriate parameter (μ or p). The mean number of years Americans work before retiring is 34. At most 60% of Americans vote in presidential elections.

  17. How to Formulate a Null Hypothesis (With Examples)

    To distinguish it from other hypotheses, the null hypothesis is written as H 0 (which is read as "H-nought," "H-null," or "H-zero"). A significance test is used to determine the likelihood that the results supporting the null hypothesis are not due to chance. A confidence level of 95% or 99% is common. Keep in mind, even if the confidence level is high, there is still a small chance the ...

  18. What Is a Null Hypothesis?

    (The null hypothesis that the population mean is 7.0 can't be proven using the sample data; it can only be rejected.) Take another example: The annual return of a specific open-end fund is claimed to be 8%. Assume that open-end fund has been alive for 20 years. The null hypothesis is that the mean return is 8% for the open-end fund.

  19. Null Hypothesis

    Here, the hypothesis test formulas are given below for reference. The formula for the null hypothesis is: H 0 : p = p 0. The formula for the alternative hypothesis is: H a = p >p 0, < p 0 ≠ p 0. The formula for the test static is: Remember that, p 0 is the null hypothesis and p - hat is the sample proportion.

  20. Null Hypothesis: What Is It and How Is It Used in Investing?

    Null Hypothesis: A null hypothesis is a type of hypothesis used in statistics that proposes that no statistical significance exists in a set of given observations. The null hypothesis attempts to ...

  21. 9.1 Null and Alternative Hypotheses

    The actual test begins by considering two hypotheses.They are called the null hypothesis and the alternative hypothesis.These hypotheses contain opposing viewpoints. H 0: The null hypothesis: It is a statement of no difference between the variables—they are not related. This can often be considered the status quo and as a result if you cannot accept the null it requires some action.

  22. Null Hypothesis

    What does null hypothesis stands for? The null hypothesis, denoted as H 0 , is a fundamental concept in statistics used for hypothesis testing. It represents the statement that there is no effect or no difference, and it is the hypothesis that the researcher typically aims to provide evidence against.

  23. 7.1: Null and Alternative Hypotheses

    These hypotheses contain opposing viewpoints. The null hypothesis ( H0 H 0) is a statement about the population that either is believed to be true or is used to put forth an argument unless it can be shown to be incorrect beyond a reasonable doubt. The alternative hypothesis ( Ha H a) is a claim about the population that is contradictory to H0 ...

  24. When the Research Hypothesis Is the Null

    Say, for example, that we deemed a t-value of +/-1.96 (the usual threshold for rejecting the null hypothesis) as extreme enough to constitute good evidence of a non-zero effect. We could make the appropriate adjustments to our t-distribution to identify a new range of t-values that would allow us to reject the hypothesis that an effect is non-zero.

  25. SCIPAC: quantitative estimation of cell-phenotype associations

    Red, blue, and light gray stands for Scissor+, Scissor−, and background (i.e., null) cells. f F1 scores and g FSC for SCIPAC and Scissor under different parameter values. For SCIPAC, each bar is the value under a resolution/number of clusters. For Scissor, each bar is the value under an \(\alpha\)