Hypothesis Tests and Confidence Intervals in Multiple Regression


After completing this reading, you should be able to:

  • Construct, apply, and interpret hypothesis tests and confidence intervals for a single coefficient in a multiple regression.
  • Construct, apply, and interpret joint hypothesis tests and confidence intervals for multiple coefficients in a multiple regression.
  • Interpret the \(F\)-statistic.
  • Interpret tests of a single restriction involving multiple coefficients.
  • Interpret confidence sets for multiple coefficients.
  • Identify examples of omitted variable bias in multiple regressions.
  • Interpret the \({ R }^{ 2 }\) and adjusted \({ R }^{ 2 }\) in a multiple regression.

Hypothesis Tests and Confidence Intervals for a Single Coefficient

This section covers the calculation of the standard error, hypothesis testing, and confidence interval construction for a single coefficient in a multiple regression equation.

Introduction

In a previous chapter, we looked at simple linear regression, where we deal with just one regressor (independent variable): the response (dependent variable) is assumed to be affected by just one independent variable. Multiple regression, on the other hand, simultaneously considers the influence of multiple explanatory variables on a response variable \(Y\). We may want to establish the confidence interval for one of the independent variables, evaluate whether a particular independent variable has a significant effect on the dependent variable, or establish whether the independent variables as a group have a significant effect on the dependent variable. In this chapter, we delve into ways all of this can be achieved.

Hypothesis Tests for a single coefficient

Suppose that we are testing the hypothesis that the true coefficient \({ \beta }_{ j }\) on the \(j\)th regressor takes on some specific value \({ \beta }_{ j,0 }\). Let the alternative hypothesis be two-sided. Therefore, the following is the mathematical expression of the two hypotheses:

$$ { H }_{ 0 }:{ \beta }_{ j }={ \beta }_{ j,0 }\quad vs.\quad { H }_{ 1 }:{ \beta }_{ j }\neq { \beta }_{ j,0 } $$

This expression represents the two-sided alternative. The following are the steps to follow when testing the null hypothesis:

  • Compute the coefficient's standard error, \(SE\left( { \hat { \beta } }_{ j } \right)\).
  • Compute the \(t\)-statistic:

$$ { t }^{ act }=\frac { { \hat { \beta } }_{ j }-{ \beta }_{ j,0 } }{ SE\left( { \hat { \beta } }_{ j } \right) } $$

  • Compute the \(p\)-value:

$$ p\text{-value}=2\Phi \left( -|{ t }^{ act }| \right) $$

  • Also, the \(t\)-statistic can be compared to the critical value corresponding to the significance level that is desired for the test.
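To make these steps concrete, here is a minimal R sketch using hypothetical values for the estimate and its standard error (the numbers are purely illustrative):

```r
beta_hat <- 0.20   # hypothetical estimated coefficient
se_beta  <- 0.05   # hypothetical standard error of the estimate
beta_0   <- 0      # hypothesized value under H0

t_act   <- (beta_hat - beta_0) / se_beta   # the t-statistic
p_value <- 2 * pnorm(-abs(t_act))          # large-sample two-sided p-value

t_act    # 4
p_value  # about 6.3e-05, so H0 would be rejected at the 5% level
```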

Confidence Intervals for a Single Coefficient

The confidence interval for a regression coefficient in multiple regression is calculated and interpreted the same way as it is in simple linear regression. 

$$ { \hat { \beta } }_{ j }\pm { t }_{ c }\times SE\left( { \hat { \beta } }_{ j } \right) $$

The critical value \({ t }_{ c }\) comes from the \(t\)-distribution with \(n-k-1\) degrees of freedom, where \(k\) is the number of independent variables.

Suppose an interval contains the true value of \({ \beta }_{ j }\) with a probability of 95%. This is simply the 95% two-sided confidence interval for \({ \beta }_{ j }\). The implication is that the true value of \({ \beta }_{ j }\) is contained in 95% of all intervals constructed from repeated random samples.

Alternatively, the 95% two-sided confidence interval for \({ \beta }_{ j }\) is the set of values that cannot be rejected when a two-sided hypothesis test at the 5% significance level is applied. Therefore, with a large sample size:

$$ 95\%\quad confidence\quad interval\quad for\quad { \beta }_{ j }=\left[ { \hat { \beta } }_{ j }-1.96SE\left( { \hat { \beta } }_{ j } \right) ,{ \hat { \beta } }_{ j }+1.96SE\left( { \hat { \beta } }_{ j } \right) \right] $$
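The same quantities give the confidence interval directly. Here is a minimal R sketch, again with hypothetical numbers; for an actual fitted model object, confint() produces the interval using exact t quantiles rather than 1.96:

```r
beta_hat <- 0.20
se_beta  <- 0.05

ci <- c(lower = beta_hat - 1.96 * se_beta,
        upper = beta_hat + 1.96 * se_beta)
ci   # 0.102 to 0.298

# For a fitted model `fit` (name assumed): confint(fit, level = 0.95)
```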

Tests of Joint Hypotheses

In this section, we consider the formulation of the joint hypotheses on multiple regression coefficients. We will further study the application of an \(F\)-statistic in their testing.

Hypothesis Testing on Two or More Coefficients

Joint null hypothesis

In multiple regression, we cannot test the null hypothesis that all slope coefficients are equal to 0 based on \(t\)-tests that each individual slope coefficient equals 0. Why? Individual \(t\)-tests do not account for the effects of interactions among the independent variables.

For this reason, we conduct the \(F\)-test, which uses the \(F\)-statistic. The \(F\)-test tests the null hypothesis that all of the slope coefficients in the multiple regression model are jointly equal to 0, i.e.,

$$ { H }_{ 0 }:{ \beta }_{ 1 }={ \beta }_{ 2 }=\dots ={ \beta }_{ k }=0\quad vs.\quad { H }_{ 1 }:\text{at least one } { \beta }_{ j }\neq 0 $$

\(F\)-Statistic

The \(F\)-test is always a one-tailed test. The \(F\)-statistic is calculated as:

$$ F=\frac { ESS/k }{ SSR/\left( n-k-1 \right) } $$

where \(ESS\) is the explained sum of squares, \(SSR\) is the sum of squared residuals (errors), \(k\) is the number of independent variables, and \(n\) is the number of observations.

To determine whether at least one of the coefficients is statistically significant, the calculated F-statistic is compared with the one-tailed critical F-value, at the appropriate level of significance.

Decision rule:

Reject \({ H }_{ 0 }\) if the calculated \(F\)-statistic exceeds the one-tailed critical value \({ F }_{ c }\) at the chosen significance level; otherwise, do not reject \({ H }_{ 0 }\).

Rejection of the null hypothesis at a stated level of significance indicates that at least one of the coefficients is significantly different from zero, i.e., at least one of the independent variables in the regression model makes a significant contribution to the dependent variable.

Example: An analyst runs a regression of monthly value-stock returns on four independent variables over 48 months.

The total sum of squares for the regression is 360, and the sum of squared errors is 120.

Test the null hypothesis, at the 5% significance level (95% confidence), that the coefficients on all four independent variables are jointly equal to zero.

\({ H }_{ 0 }:{ \beta }_{ 1 }={ \beta }_{ 2 }={ \beta }_{ 3 }={ \beta }_{ 4 }=0 \)

\({ H }_{ 1 }:{ \beta }_{ j }\neq 0\) for at least one \(j\) (\(j=1,2,\dots ,4\))

ESS = TSS – SSR = 360 – 120 = 240

The calculated test statistic:

$$ F=\frac { ESS/k }{ SSR/\left( n-k-1 \right) } =\frac { 240/4 }{ 120/43 } =21.5 $$

The critical value \({ F }_{ 43 }^{ 4 }\) is approximately 2.59 at the 5% significance level.

Decision: Reject \({ H }_{ 0 }\).

Conclusion: At least one of the four slope coefficients is significantly different from zero, i.e., at least one of the independent variables makes a significant contribution to the dependent variable.
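A quick R sketch of this worked example, using only the sums of squares given above (no underlying data are needed):

```r
TSS <- 360; SSR <- 120   # total sum of squares and sum of squared errors
k <- 4; n <- 48          # number of independent variables and observations

ESS    <- TSS - SSR
F_stat <- (ESS / k) / (SSR / (n - k - 1))
F_crit <- qf(0.95, df1 = k, df2 = n - k - 1)
p_val  <- pf(F_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

c(F_stat = F_stat, F_crit = F_crit)   # 21.5 vs. roughly 2.6 -> reject H0
p_val                                 # far below 0.05
```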

Omitted Variable Bias in Multiple Regression

This is the bias in the OLS estimator that arises when at least one included regressor is correlated with an omitted variable. The following conditions must both be satisfied for omitted variable bias to occur:

  • There must be a correlation between at least one of the included regressors and the omitted variable.
  • The dependent variable \(Y\) must be determined by the omitted variable.

Practical Interpretation of the \({ R }^{ 2 }\) and the adjusted \({ R }^{ 2 }\), \({ \bar { R } }^{ 2 }\)

To determine how well the OLS regression line fits the data, we use the coefficient of determination and the regression's standard error.

The coefficient of determination, represented by \({ R }^{ 2 }\), is a measure of the “goodness of fit” of the regression. It is interpreted as the percentage of variation in the dependent variable explained by the independent variables:

$$ { R }^{ 2 }=\frac { ESS }{ TSS } =1-\frac { SSR }{ TSS } $$

\({ R }^{ 2 }\) is not a reliable indicator of the explanatory power of a multiple regression model. Why? \({ R }^{ 2 }\) almost always increases as new independent variables are added to the model, even if the marginal contribution of the new variable is not statistically significant. Thus, a high \({ R }^{ 2 }\) may reflect the impact of a large set of independent variables rather than how well the set explains the dependent variable. This problem is solved by the use of the adjusted \({ R }^{ 2 }\) (extensively covered in chapter 8).
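The point is easy to see in a small simulation sketch (the data here are made up purely for illustration): adding a regressor that is unrelated to the response still nudges \({ R }^{ 2 }\) up, while the adjusted \({ R }^{ 2 }\) can fall.

```r
set.seed(1)
n  <- 50
x1 <- rnorm(n)
y  <- 2 + 0.5 * x1 + rnorm(n)
noise <- rnorm(n)                     # irrelevant regressor

fit1 <- lm(y ~ x1)
fit2 <- lm(y ~ x1 + noise)

c(R2_1 = summary(fit1)$r.squared,  R2_2 = summary(fit2)$r.squared)           # R^2 increases
c(aR2_1 = summary(fit1)$adj.r.squared, aR2_2 = summary(fit2)$adj.r.squared)  # adjusted R^2 need not
```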

The following are pitfalls to guard against when applying the \({ R }^{ 2 }\) or the \({ \bar { R } }^{ 2 }\):

  • An added variable doesn’t have to be statistically significant just because the \({ R }^{ 2 }\) or the \({ \bar { R } }^{ 2 }\) has increased.
  • It is not always true that the regressors are a true cause of the dependent variable, just because there is a high \({ R }^{ 2 }\) or \({ \bar { R } }^{ 2 }\).
  • It is not necessary that there is no omitted variable bias just because we have a high \({ R }^{ 2 }\) or \({ \bar { R } }^{ 2 }\).
  • It is not necessarily true that we have the most appropriate set of regressors just because we have a high \({ R }^{ 2 }\) or \({ \bar { R } }^{ 2 }\).
  • It is not necessarily true that we have an inappropriate set of regressors just because we have a low \({ R }^{ 2 }\) or \({ \bar { R } }^{ 2 }\).

An economist tests the hypothesis that GDP growth in a certain country can be explained by interest rates and inflation.

Using 30 observations, the analyst formulates the following regression equation:

$$ \text{GDP growth} = { \hat { \beta } }_{ 0 } + { \hat { \beta } }_{ 1 }\,\text{Interest} + { \hat { \beta } }_{ 2 }\,\text{Inflation} $$

Regression estimates are as follows: the estimated intercept is 0.10, the coefficient on interest rates is 0.20 with a standard error of 0.05, and the coefficient on inflation is 0.15.

Is the coefficient for interest rates significant at 5%?

  • A. Since the test statistic < t-critical, we accept \({ H }_{ 0 }\); the interest rate coefficient is not significant at the 5% level.
  • B. Since the test statistic > t-critical, we reject \({ H }_{ 0 }\); the interest rate coefficient is not significant at the 5% level.
  • C. Since the test statistic > t-critical, we reject \({ H }_{ 0 }\); the interest rate coefficient is significant at the 5% level.
  • D. Since the test statistic < t-critical, we accept \({ H }_{ 1 }\); the interest rate coefficient is significant at the 5% level.

The correct answer is  C .

We have GDP growth = 0.10 + 0.20(Int) + 0.15(Inf)

Hypothesis:

$$ { H }_{ 0 }:{ \beta }_{ 1 } = 0 \quad vs.\quad { H }_{ 1 }:{ \beta }_{ 1 }\neq 0 $$

The test statistic is:

$$ t = \left( \frac { 0.20 – 0 }{ 0.05 } \right)  = 4 $$

The critical value is \({ t }_{ \alpha/2,\, n-k-1 }={ t }_{ 0.025,\, 27 }=2.052\) (which can be found in the \(t\)-table).


Conclusion: The interest rate coefficient is significant at the 5% level.
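A minimal R sketch of this test, using the estimate and standard error quoted in the solution:

```r
b1  <- 0.20         # estimated interest rate coefficient
se1 <- 0.05         # its standard error
n   <- 30; k <- 2   # observations and independent variables

t_stat <- (b1 - 0) / se1
t_crit <- qt(0.975, df = n - k - 1)   # two-sided 5% critical value with 27 df

c(t_stat = t_stat, t_crit = t_crit)   # 4 > 2.052 -> reject H0
```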



Multiple Linear Regression | A Quick Guide (Examples)

Published on February 20, 2020 by Rebecca Bevans. Revised on June 22, 2023.

Regression models are used to describe relationships between variables by fitting a line to the observed data. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.

Multiple linear regression is used to estimate the relationship between  two or more independent variables and one dependent variable . You can use multiple linear regression when you want to know:

  • How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
  • The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).


Multiple linear regression makes all of the same assumptions as simple linear regression :

Homogeneity of variance (homoscedasticity) : the size of the error in our prediction doesn’t change significantly across the values of the independent variable.

Independence of observations : the observations in the dataset were collected using statistically valid sampling methods , and there are no hidden relationships among variables.

In multiple linear regression, it is possible that some of the independent variables are actually correlated with one another, so it is important to check these before developing the regression model. If two independent variables are too highly correlated (r2 > ~0.6), then only one of them should be used in the regression model.

Normality : The data follows a normal distribution .

Linearity : the line of best fit through the data points is a straight line, rather than a curve or some sort of grouping factor.


Multiple linear regression formula

The formula for a multiple linear regression is:

$$ y = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n + \epsilon $$

where \(y\) is the predicted value of the dependent variable, \(\beta_0\) is the y-intercept, \(\beta_1 X_1\) through \(\beta_n X_n\) are the regression coefficients of the independent variables multiplied by their values (do the same for however many independent variables you are testing), and \(\epsilon\) is the model error term.

To find the best-fit line for each independent variable, multiple linear regression calculates three things:

  • The regression coefficients that lead to the smallest overall model error.
  • The t statistic of the overall model.
  • The associated p value (how likely it is that the t statistic would have occurred by chance if the null hypothesis of no relationship between the independent and dependent variables was true).

It then calculates the t statistic and p value for each regression coefficient in the model.

Multiple linear regression in R

While it is possible to do multiple linear regression by hand, it is much more commonly done via statistical software. We are going to use R for our examples because it is free, powerful, and widely available. Download the sample dataset to try it yourself.

Dataset for multiple linear regression (.csv)

Load the heart.data dataset into your R environment and run the following code:
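The code block from the original article did not survive extraction. A minimal sketch of the intended call, assuming the CSV is available locally and contains columns named biking, smoking, and heart.disease (the file path and column names are assumptions):

```r
heart.data <- read.csv("heart.data.csv")   # load the sample dataset

heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = heart.data)
```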

This code takes the data set heart.data and calculates the effect that the independent variables biking and smoking have on the dependent variable heart disease using the equation for the linear model: lm() .

Learn more by following the full step-by-step guide to linear regression in R .

To view the results of the model, you can use the summary() function:
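Again, the original snippet is missing; assuming the model object from the sketch above, the call would simply be:

```r
summary(heart.disease.lm)
```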

This function takes the most important parameters from the linear model and puts them into a table that looks like this:

Figure: R multiple linear regression summary output

The summary first prints out the formula (‘Call’), then the model residuals (‘Residuals’). If the residuals are roughly centered around zero and with similar spread on either side, as these do (median 0.03, and min and max around -2 and 2), then the model probably fits the assumption of homoscedasticity.

Next are the regression coefficients of the model (‘Coefficients’). Row 1 of the coefficients table is labeled (Intercept) – this is the y-intercept of the regression equation. It’s helpful to know the estimated intercept in order to plug it into the regression equation and predict values of the dependent variable.

The most important things to note in this output table are the next two rows – the estimates for the independent variables.

The Estimate column is the estimated effect, also called the regression coefficient. The estimates in the table tell us that for every one percent increase in biking to work there is an associated 0.2 percent decrease in heart disease, and that for every one percent increase in smoking there is an associated 0.17 percent increase in heart disease.

The Std.error column displays the standard error of the estimate. This number shows how much variation there is around the estimates of the regression coefficient.

The t value column displays the test statistic . Unless otherwise specified, the test statistic used in linear regression is the t value from a two-sided t test . The larger the test statistic, the less likely it is that the results occurred by chance.

The Pr(>|t|) column shows the p value. This shows how likely the calculated t value would have occurred by chance if the null hypothesis of no effect of the parameter were true.

Because these values are so low (p < 0.001 in both cases), we can reject the null hypothesis and conclude that both biking to work and smoking likely influence rates of heart disease.

When reporting your results, include the estimated effect (i.e. the regression coefficient), the standard error of the estimate, and the p value. You should also interpret your numbers to make it clear to your readers what the regression coefficient means.

Visualizing the results in a graph

It can also be helpful to include a graph with your results. Multiple linear regression is somewhat more complicated than simple linear regression, because there are more parameters than will fit on a two-dimensional plot.

However, there are ways to display your results that include the effects of multiple independent variables on the dependent variable, even though only one independent variable can actually be plotted on the x-axis.

Figure: Multiple regression in R graph

Here, we have calculated the predicted values of the dependent variable (heart disease) across the full range of observed values for the percentage of people biking to work.

To include the effect of smoking on the dependent variable, we calculated these predicted values while holding smoking constant at the minimum, mean, and maximum observed rates of smoking.
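A sketch of how such a plot could be produced, reusing the heart.disease.lm model and the assumed column names from the earlier sketch (ggplot2 is an external package):

```r
library(ggplot2)

# Grid of biking values, with smoking held at its minimum, mean, and maximum
plot.data <- expand.grid(
  biking  = seq(min(heart.data$biking), max(heart.data$biking), length.out = 50),
  smoking = c(min(heart.data$smoking), mean(heart.data$smoking), max(heart.data$smoking))
)
plot.data$predicted <- predict(heart.disease.lm, newdata = plot.data)

ggplot(plot.data, aes(x = biking, y = predicted,
                      colour = factor(round(smoking, 1)))) +
  geom_line() +
  labs(colour = "smoking (%)", y = "predicted heart disease (%)")
```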


A regression model is a statistical model that estimates the relationship between one dependent variable and one or more independent variables using a line (or a plane in the case of two or more independent variables).

A regression model can be used when the dependent variable is quantitative, except in the case of logistic regression, where the dependent variable is binary.

Multiple linear regression is a regression model that estimates the relationship between a quantitative dependent variable and two or more independent variables using a straight line.

Linear regression most often uses mean-square error (MSE) to calculate the error of the model. MSE is calculated by:

  • measuring the distance of the observed y-values from the predicted y-values at each value of x;
  • squaring each of these distances;
  • calculating the mean of each of the squared distances.

Linear regression fits a line to the data by finding the regression coefficient that results in the smallest MSE.


Multiple linear regression

Fig. 11 Multiple linear regression #

Errors: \(\varepsilon_i \sim N(0,\sigma^2)\quad \text{i.i.d.}\)

Fit: the estimates \(\hat\beta_0,\hat\beta_1,\dots,\hat\beta_p\) are chosen to minimize the residual sum of squares (RSS), defined below.

Matrix notation: with \(\beta=(\beta_0,\dots,\beta_p)\) and \({X}\) our usual data matrix with an extra column of ones on the left to account for the intercept, we can write \(Y = X\beta + \varepsilon\).

Multiple linear regression answers several questions #

Is at least one of the variables \(X_i\) useful for predicting the outcome \(Y\) ?

Which subset of the predictors is most important?

How good is a linear model for these data?

Given a set of predictor values, what is a likely value for \(Y\) , and how accurate is this prediction?

The estimates \(\hat\beta\) #

Our goal again is to minimize the RSS:

$$ \begin{aligned} \text{RSS}(\beta) &= \sum_{i=1}^n (y_i -\hat y_i(\beta))^2 \\ & = \sum_{i=1}^n (y_i - \beta_0- \beta_1 x_{i,1}-\dots-\beta_p x_{i,p})^2 \\ &= \|Y-X\beta\|^2_2 \end{aligned} $$

One can show that this is minimized by the vector \(\hat\beta\):

$$ \hat\beta = ({X}^T{X})^{-1}{X}^T{y}. $$

We usually write \(RSS=RSS(\hat{\beta})\) for the minimized RSS.
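A small R sketch with simulated data confirms the closed-form solution against lm() (the data-generating values are arbitrary):

```r
set.seed(42)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                         # data matrix with a leading column of ones
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y  # (X'X)^{-1} X'y

cbind(closed_form = beta_hat, lm_fit = coef(lm(y ~ x1 + x2)))  # identical estimates
```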

Which variables are important? #

Consider the hypothesis: \(H_0:\) the last \(q\) predictors have no relation with \(Y\) .

Based on our model: \(H_0:\beta_{p-q+1}=\beta_{p-q+2}=\dots=\beta_p=0.\)

Let \(\text{RSS}_0\) be the minimized residual sum of squares for the model which excludes these variables.

The \(F\)-statistic is defined by:

$$ F = \frac{(\text{RSS}_0-\text{RSS})/q}{\text{RSS}/(n-p-1)}. $$

Under the null hypothesis (of our model), this has an \(F\)-distribution with \(q\) and \(n-p-1\) degrees of freedom.

Example: If \(q=p\), we test whether any of the variables is important. In that case,

$$ \text{RSS}_0 = \sum_{i=1}^n(y_i-\overline y)^2 . $$
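In R, this comparison of a restricted and an unrestricted model is exactly what anova() reports for two nested lm fits; a sketch with simulated data:

```r
set.seed(7)
n <- 200
x <- rnorm(n); z <- rnorm(n); w <- rnorm(n)
y <- 1 + 0.8 * x + rnorm(n)            # z and w are irrelevant by construction

full    <- lm(y ~ x + z + w)
reduced <- lm(y ~ x)                   # H0: the coefficients on z and w are both zero

anova(reduced, full)                   # F-statistic and p-value for the joint test

# The same F-statistic by hand:
RSS0 <- sum(resid(reduced)^2); RSS <- sum(resid(full)^2)
q <- 2; p <- 3
((RSS0 - RSS) / q) / (RSS / (n - p - 1))
```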

The \(t\) -statistic associated to the \(i\) th predictor is the square root of the \(F\) -statistic for the null hypothesis which sets only \(\beta_i=0\) .

A low \(p\) -value indicates that the predictor is important.

Warning: If there are many predictors, even under the null hypothesis, some of the \(t\) -tests will have low p-values even when the model has no explanatory power.

How many variables are important? #

When we select a subset of the predictors, we have \(2^p\) choices.

A way to simplify the choice is to define a range of models with an increasing number of variables, then select the best.

Forward selection: Starting from a null model, include variables one at a time, minimizing the RSS at each step.

Backward selection: Starting from the full model, eliminate variables one at a time, choosing the one with the largest p-value at each step.

Mixed selection: Starting from some model, include variables one at a time, minimizing the RSS at each step. If the p-value for some variable goes beyond a threshold, eliminate that variable.

Choosing one model in the range produced is a form of tuning . This tuning can invalidate some of our methods like hypothesis tests and confidence intervals…

How good are the predictions? #

The function predict in R outputs predictions and confidence intervals from a linear model:
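The original code chunk is not reproduced here; a minimal sketch, assuming a fitted model fit and a data frame newdat of predictor values (both names are assumptions):

```r
# Confidence interval for the mean response at the new predictor values
predict(fit, newdata = newdat, interval = "confidence")

# Prediction interval for an individual new observation (wider)
predict(fit, newdata = newdat, interval = "prediction")
```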

Prediction intervals reflect uncertainty on \(\hat\beta\) and the irreducible error \(\varepsilon\) as well.

These functions rely on our linear regression model \(Y = X\beta + \epsilon\).

Dealing with categorical or qualitative predictors #

For each qualitative predictor, e.g. Region :

Choose a baseline category, e.g. East

For every other category, define a new predictor:

\(X_\text{South}\) is 1 if the person is from the South region and 0 otherwise

\(X_\text{West}\) is 1 if the person is from the West region and 0 otherwise.

The model will be:

$$ Y = \beta_0 + \beta_1 X_1 +\dots +\beta_7 X_7 + \color{Red}{\beta_\text{South}} X_\text{South} + \beta_\text{West} X_\text{West} +\varepsilon. $$

The parameter \(\color{Red}{\beta_\text{South}}\) is the relative effect on Balance (our \(Y\) ) for being from the South compared to the baseline category (East).

The model fit and predictions are independent of the choice of the baseline category.

However, hypothesis tests derived from these variables are affected by the choice.

Solution: To check whether region is important, use an \(F\) -test for the hypothesis \(\beta_\text{South}=\beta_\text{West}=0\) by dropping Region from the model. This does not depend on the coding.

Note that there are other ways to encode qualitative predictors that produce the same fit \(\hat f\), but the coefficients have different interpretations.
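A sketch of this workflow in R, assuming a data frame credit with a numeric response Balance, a numeric predictor Income, and a factor Region with levels East, South, and West (all names are assumptions):

```r
credit$Region <- relevel(factor(credit$Region), ref = "East")  # East as the baseline

fit_full <- lm(Balance ~ Income + Region, data = credit)
fit_noR  <- lm(Balance ~ Income, data = credit)

summary(fit_full)          # dummy coefficients RegionSouth and RegionWest
anova(fit_noR, fit_full)   # F-test of beta_South = beta_West = 0, independent of the coding
```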

So far, we have:

Defined Multiple Linear Regression

Discussed how to test the importance of variables.

Described one approach to choose a subset of variables.

Explained how to code qualitative variables.

Now, how do we evaluate model fit? Is the linear model any good? What can go wrong?

How good is the fit? #

To assess the fit, we focus on the residuals \(e = Y - \hat{Y}\).

The RSS always decreases as we add more variables.

The residual standard error (RSE) corrects this:

$$ \text{RSE} = \sqrt{\frac{1}{n-p-1}\text{RSS}}. $$

Fig. 12 Residuals #

Visualizing the residuals can reveal phenomena that are not accounted for by the model, e.g., synergies or interactions.

Potential issues in linear regression #

Interactions between predictors

Non-linear relationships

Correlation of error terms

Non-constant variance of error (heteroskedasticity)

High leverage points

Collinearity

Interactions between predictors #

Linear regression has an additive assumption:

$$ \mathtt{sales} = \beta_0 + \beta_1\times\mathtt{tv}+ \beta_2\times\mathtt{radio}+\varepsilon $$

i.e., an increase of 100 USD in TV ads causes a fixed increase of \(100 \beta_1\) USD in sales on average, regardless of how much you spend on radio ads.

We saw that in Fig 3.5 above. If we visualize the fit and the observed points, we see they are not evenly scattered around the plane. This could be caused by an interaction.

One way to deal with this is to include multiplicative variables in the model:

The interaction variable tv \(\cdot\) radio is high when both tv and radio are high.

R makes it easy to include interaction variables in the model:
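The original chunk is not shown; a minimal sketch, assuming a data frame advertising with columns sales, tv, and radio (names are assumptions):

```r
# tv * radio expands to the two main effects plus their interaction
fit <- lm(sales ~ tv * radio, data = advertising)
# equivalent spelled out: lm(sales ~ tv + radio + tv:radio, data = advertising)
summary(fit)
```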

Non-linearities #

Fig. 13 A nonlinear fit might be better here. #

Example: Auto dataset.

A scatterplot between a predictor and the response may reveal a non-linear relationship.

Solution: include polynomial terms in the model.

Could use other functions besides polynomials…
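For instance, a quadratic fit on the Auto data might look like the following sketch (the column names mpg and horsepower are assumptions):

```r
fit_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
# or with an orthogonal polynomial basis:
fit_poly <- lm(mpg ~ poly(horsepower, 2), data = Auto)
summary(fit_quad)
```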

Fig. 14 Residuals for Auto data #

In 2 or 3 dimensions, this is easy to visualize. What do we do when we have too many predictors?

Correlation of error terms #

We assumed that the errors for each sample are independent:

What if this breaks down?

The main effect is that this invalidates any assertions about Standard Errors, confidence intervals, and hypothesis tests…

Example : Suppose that by accident, we duplicate the data (we use each sample twice). Then, the standard errors would be artificially smaller by a factor of \(\sqrt{2}\) .

When could this happen in real life:

Time series: Each sample corresponds to a different point in time. The errors for samples that are close in time are correlated.

Spatial data: Each sample corresponds to a different location in space.

Grouped data: Imagine a study on predicting height from weight at birth. If some of the subjects in the study are in the same family, their shared environment could make them deviate from \(f(x)\) in similar ways.

Correlated errors #

Simulations of time series with increasing correlations between \(\varepsilon_i\)

Non-constant variance of error (heteroskedasticity) #

The variance of the error depends on some characteristics of the input features.

To diagnose this, we can plot residuals vs. fitted values:

If the trend in variance is relatively simple, we can transform the response using a logarithm, for example.

Outliers from a model are points with very high errors.

While they may not affect the fit, they might affect our assessment of model quality.

Possible solutions: #

If we believe an outlier is due to an error in data collection, we can remove it.

An outlier might be evidence of a missing predictor, or the need to specify a more complex model.

High leverage points #

Some samples with extreme inputs have an outsized effect on \(\hat \beta\) .

This can be measured with the leverage statistic (or self-influence) \(h_{ii}\), the \(i\)th diagonal entry of the hat matrix \(X(X^TX)^{-1}X^T\).

Studentized residuals #

The residual \(e_i = y_i - \hat y_i\) is an estimate for the noise \(\epsilon_i\) .

The standard error of \(\hat \epsilon_i\) is \(\sigma \sqrt{1-h_{ii}}\) .

A studentized residual is \(\hat \epsilon_i\) divided by its standard error (with appropriate estimate of \(\sigma\) )

When the model is correct, it follows a Student-t distribution with \(n-p-2\) degrees of freedom.
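Both diagnostics are available directly from a fitted lm object; a minimal sketch, assuming a fitted model fit:

```r
h <- hatvalues(fit)   # leverage statistics h_ii
r <- rstudent(fit)    # studentized residuals

which(h > 2 * mean(h))   # common rule of thumb for high leverage, since mean(h) = (p + 1)/n
which(abs(r) > 3)        # unusually large studentized residuals
```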

Collinearity #

Two predictors are collinear if one explains the other well:

Problem: The coefficients become unidentifiable .

Consider the extreme case of using two identical predictors limit:

$$ \begin{aligned} \mathtt{balance} &= \beta_0 + \beta_1\times\mathtt{limit} + \beta_2\times\mathtt{limit} + \epsilon \\ & = \beta_0 + (\beta_1+100)\times\mathtt{limit} + (\beta_2-100)\times\mathtt{limit} + \epsilon \end{aligned} $$

For every \((\beta_0,\beta_1,\beta_2)\) the fit at \((\beta_0,\beta_1,\beta_2)\) is just as good as at \((\beta_0,\beta_1+100,\beta_2-100)\) .

If 2 variables are collinear, we can easily diagnose this using their correlation.

A group of \(q\) variables is multicollinear if these variables “contain less information” than \(q\) independent variables.

Pairwise correlations may not reveal multicollinear variables.

The Variance Inflation Factor (VIF) measures how predictable a variable is from the other variables, a proxy for how necessary it is:

$$ \text{VIF}(\hat\beta_j) = \frac{1}{1-R^2_{X_j|X_{-j}}} $$

Above, \(R^2_{X_j|X_{-j}}\) is the \(R^2\) statistic for Multiple Linear regression of the predictor \(X_j\) onto the remaining predictors.
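A sketch of the VIF computed by hand in R (the car package offers vif() as a shortcut), assuming predictors x1, x2, and x3 in a data frame df (names are assumptions):

```r
# VIF for x1: regress it on the remaining predictors
r2_x1  <- summary(lm(x1 ~ x2 + x3, data = df))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1   # values well above ~5-10 are often taken to signal problematic collinearity
```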

Advanced Statistics using R


Multiple Linear Regression

The general purpose of multiple regression (the term was first used by Pearson, 1908), as a generalization of simple linear regression, is to learn about how several independent variables or predictors (IVs) together predict a dependent variable (DV). Multiple regression analysis often focuses on understanding (1) how much variance in a DV a set of IVs explain and (2) the relative predictive importance of IVs in predicting a DV.

In the social and natural sciences, multiple regression analysis is very widely used in research. Multiple regression allows a researcher to ask (and hopefully answer) the general question "what is the best predictor of ...". For example, educational researchers might want to learn what the best predictors of success in college are. Psychologists may want to determine which personality dimensions best predict social adjustment.

Multiple regression model

A general multiple linear regression model at the population level can be written as

\[y_{i}=\beta_{0}+\beta_{1}x_{1i}+\beta_{2}x_{2i}+\ldots+\beta_{k}x_{ki}+\varepsilon_{i} \]

  • $y_{i}$: the observed score of individual $i$ on the DV.
  • $x_{1},x_{2},\ldots,x_{k}$ : a set of predictors.
  • $x_{1i}$: the observed score of individual $i$ on IV 1; $x_{ki}$: observed score of individual $i$ on IV $k$.
  • $\beta_{0}$: the intercept at the population level, representing the predicted $y$ score when all the independent variables have their values at 0.
  • $\beta_{1},\ldots,\beta_{k}$: regression coefficients at the population level; $\beta_{1}$: representing the amount predicted $y$ changes when $x_{1}$ changes in 1 unit while holding the other IVs constant; $\beta_{k}$: representing the amount predicted $y$ changes when $x_{k}$ changes in 1 unit while holding the other IVs constant.
  • $\varepsilon$: unobserved errors with mean 0 and variance $\sigma^{2}$.

Parameter estimation

The least squares method used for the simple linear regression analysis can also be used to estimate the parameters in a multiple regression model. The basic idea is to minimize the sum of squared residuals or errors. Let $b_{0},b_{1},\ldots,b_{k}$ represent the estimated regression coefficients. The individual $i$'s residual $e_{i}$ is the difference between the observed $y_{i}$ and the predicted $\hat{y}_{i}$:

\[ e_{i}=y_{i}-\hat{y}_{i}=y_{i}-b_{0}-b_{1}x_{1i}-\ldots-b_{k}x_{ki}.\]

The sum of squared residuals is

\[ SSE=\sum_{i=1}^{n}e_{i}^{2}=\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}. \]

By minimizing $SSE$, the regression coefficient estimates can be obtained as

\[ \boldsymbol{b}=(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{y}=(\sum\boldsymbol{x}_{i}\boldsymbol{x}_{i}')^{-1}(\sum\boldsymbol{x}_{i}\boldsymbol{y}_{i}). \]

How well the multiple regression model fits the data can be assessed using the $R^{2}$. Its calculation is the same as for the simple regression

\[\begin{align*} R^{2} & = & 1-\frac{\sum e_{i}^{2}}{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}\\& = & \frac{\text{Variation explained by IVs}}{\text{Total variation}} \end{align*}. \]

In multiple regression, $R^{2}$ is the total proportion of variation in $y$ explained by the multiple predictors.

The $R^{2}$ increases, or at least stays the same, with the inclusion of more predictors. However, with more predictors, the model becomes more complex and potentially more difficult to interpret. In order to take the model complexity into consideration, the adjusted $R^{2}$ has been defined, which is calculated as

\[aR^{2}=1-(1-R^{2})\frac{n-1}{n-k-1}.\]

Hypothesis testing of regression coefficient(s)

With the estimates of regression coefficients and their standard errors estimates, we can conduct hypothesis testing for one, a subset, or all regression coefficients.

Testing a single regression coefficient

At first, we can test the significance of the coefficient for a single predictor. In this situation, the null and alternative hypotheses are

\[ H_{0}:\beta_{j}=0\text{ vs }H_{1}:\beta_{j}\neq0 \]

with $\beta_{j}$ denoting the regression coefficient of $x_{j}$ at the population level.

As in the simple regression, we use a test statistic

\[ t_{j}=\frac{b_{j} - \beta_{j} }{s.e.(b_{j})}\]

where $b_{j}$ is the estimated regression coefficient of $x_{j}$ using data from a sample. If the null hypothesis is true and $\beta_j = 0$, the test statistic follows a t-distribution with degrees of freedom \(n-k-1\) where \(k\) is the number of predictors.

One can also test the significance of \(\beta_j\) by constructing a confidence interval for it. Based on a t distribution, the \(100(1-\alpha)\%\) confidence interval is

\[ [b_{j}+t_{n-k-1}(\alpha/2)*s.e.(b_{j}),\;b_{j}+t_{n-k-1}(1-\alpha/2)*s.e.(b_{j})]\]

where $t_{n-k-1}(\alpha/2)$ is the $\alpha/2$ percentile of the t distribution. As previously discussed, if the confidence interval includes 0, the regression coefficient is not statistically significant at the significance level $\alpha$.

Testing all the regression coefficients together (overall model fit)

Given the multiple predictors, we can also test whether all of the regression coefficients are 0 at the same time. This is equivalent to testing whether all predictors combined explain a significant portion of the variance of the outcome variable. Since $R^2$ is a measure of the variance explained, this test is naturally related to it.

For this hypothesis testing, the null and alternative hypothesis are

\[H_{0}:\beta_{1}=\beta_{2}=\ldots=\beta_{k}=0\]

\[H_{1}:\text{ at least one of the regression coefficients is different from 0}.\]

In this kind of test, an F test is used. The F-statistic is defined as

\[F=\frac{n-k-1}{k}\frac{R^{2}}{1-R^{2}}.\]

It follows an F-distribution with degrees of freedom $k$ and $n-k-1$ when the null hypothesis is true. Given an F statistic, its corresponding p-value can be calculated from the F distribution as shown below. Note that we only look at one side of the distribution because the extreme values should be on the large value side.
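In R, this upper-tail p-value comes from pf(); a minimal sketch using, for illustration, the F statistic from the college GPA example analyzed later in this section (F = 21.31 with k = 3 and n = 100):

```r
F_stat <- 21.31; k <- 3; n <- 100

pf(F_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)   # about 1.16e-10
```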

Testing a subset of the regression coefficients

We can also test whether a subset of $p$ regression coefficients, with $p$ ranging from 1 to the total number of coefficients $k$, are equal to zero. For convenience, we can rearrange all the $p$ regression coefficients to be the first $p$ coefficients. Therefore, the null hypothesis should be

\[H_{0}:\beta_{1}=\beta_{2}=\ldots=\beta_{p}=0\]

and the alternative hypothesis is that at least one of them is not equal to 0.

As for testing the overall model fit, an F test can be used here. In this situation, the F statistic can be calculated as

\[F=\frac{n-k-1}{p}\frac{R^{2}-R_{0}^{2}}{1-R^{2}},\]

which follows an F-distribution with degrees of freedom $p$ and $n-k-1$. $R^2$ is for the regression model with all the predictors and $R_0^2$ is from the regression model without the first $p$ predictors $x_{1},x_{2},\ldots,x_{p}$ but with the rest predictors $x_{p+1},x_{p+2},\ldots,x_{k}$.

Intuitively, this test determines whether the variance explained by the first \(p\) predictors, above and beyond that explained by the remaining $k-p$ predictors, is significant or not. That is also the increase in R-squared.

As an example, suppose that we wanted to predict student success in college. Why might we want to do this? There's an ongoing debate in college and university admission offices (and in the courts) regarding what factors should be considered important in deciding which applicants to admit. Should admissions officers pay most attention to more easily quantifiable measures such as high school GPA and SAT scores? Or should they give more weight to more subjective measures such as the quality of letters of recommendation? What are the pros and cons of the approaches? Of course, how we define college success is also an open question. For the sake of this example, let's measure college success using college GPA.

In this example, we use a set of simulated data (generated by us). The data are saved in the file gpa.csv. As shown below, the sample size is 100 and there are 4 variables: college GPA (c.gpa), high school GPA (h.gpa), SAT, and quality of recommendation letters (recommd).

Graph the data

Before fitting a regression model, we should check the relationship between college GPA and each predictor through a scatterplot. A scatterplot can tell us the form of relationship, e.g., linear, nonlinear, or no relationship, the direction of relationship, e.g., positive or negative, and the strength of relationship, e.g., strong, moderate, or weak. It can also identify potential outliers.

The scatterplots between college GPA and the three potential predictors are given below. From the plots, we can roughly see all three predictors are positively related to the college GPA. The relationship is close to linear and the relationship seems to be stronger for high school GPA and SAT than for the quality of recommendation letters.

Descriptive statistics

Next, we can calculate some summary statistics to explore our data further. For each variable, we calculate 6 numbers: minimum, 1st quartile, median, mean, 3rd quartile, and maximum. Those numbers can be obtained using the summary() function. To look at the relationship among the variables, we can calculate the correlation matrix using the correlation function cor() .

Based on the correlation matrix, the correlation between college GPA and high school GPA is about 0.545, which is larger than that (0.523) between college GPA and SAT, which in turn is larger than that (0.35) between college GPA and quality of recommendation letters.

Fit a multiple regression model

As for the simple linear regression, the multiple regression analysis can be carried out using the lm() function in R. From the output, we can write out the regression model as

\[ c.gpa = -0.153+ 0.376 \times h.gpa + 0.00122 \times SAT + 0.023 \times recommd \]
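The lm() call that produces this output is not shown above; a minimal sketch, assuming the data have been read into a data frame named gpa:

```r
gpa <- read.csv("gpa.csv")

gpa.lm <- lm(c.gpa ~ h.gpa + SAT + recommd, data = gpa)
summary(gpa.lm)   # coefficients, R-squared, and the overall F-test
```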

Interpret the results / output

From the output, we see the intercept is -0.153. Its immediate meaning is that when all predictors' values are 0, the predicted college GPA is -0.15. This clearly does not make much sense because one would never get a negative GPA, which results from the unrealistic presumption that the predictors can take the value of 0.

The regression coefficient for the predictor high school GPA (h.gpa) is 0.376. This can be interpreted as follows: keeping SAT and recommd scores constant, the predicted college GPA would increase by 0.376 with a unit increase in high school GPA. This again might be problematic because it might be impossible to increase high school GPA while keeping the other two predictors unchanged. The other two regression coefficients can be interpreted in the same way.

From the output, we can also see that the multiple R-squared ($R^2$) is 0.3997. Therefore, about 40% of the variation in college GPA can be explained by the multiple linear regression with h.GPA, SAT, and recommd as the predictors. The adjusted $R^2$ is slightly smaller because of the consideration of the number of predictors. In fact,

\[ \begin{eqnarray*} aR^{2} & = & 1-(1-R^{2})\frac{n-1}{n-k-1}\\& = & 1-(1-.3997)\frac{100-1}{100-3-1}\\& = & .3809 \end{eqnarray*} \]

Testing Individual Regression Coefficient

For any regression coefficients for the three predictors (also the intercept), a t test can be conducted. For example, for high school GPA, the estimated coefficient is 0.376 with the standard error 0.114. Therefore, the corresponding t statistic is \(t = 0.376/0.114 = 3.294\). Since the statistic follows a t distribution with the degrees of freedom \(df = n - k - 1 = 100 - 3 -1 =96\), we can obtain the p-value as \(p = 2*(1-pt(3.294, 96))= 0.0013\). Since the p-value is less than 0.05, we conclude the coefficient is statistically significant. Note the t value and p-value are directly provided in the output.

Overall model fit (testing all coefficients together)

To test all coefficients together or the overall model fit, we use the F test. Given the $R^2$, the F statistic is

\[ \begin{eqnarray*} F & = & \frac{n-k-1}{k}\frac{R^{2}}{1-R^{2}}\\ & = & \left(\frac{100-3-1}{3}\right)\times \left(\frac{0.3997}{1-.3997}\right )=21.307\end{eqnarray*} \]

which follows the F distribution with degrees of freedom $df1=k=3$ and $df2=n-k-1=96$. The corresponding p-value is 1.160e-10. Note that this information is directly shown in the output as " F-statistic: 21.31 on 3 and 96 DF, p-value: 1.160e-10 ".

Therefore, at least one of the regression coefficients is statistically significantly different from 0. Overall, the three predictors explained a significant portion of the variance in college GPA. The regression model with the 3 predictors is significantly better than the regression model with intercept only (i.e., predict c.gpa by the mean of c.gpa).

Testing a subset of regression coefficients

Suppose we are interested in testing whether the regression coefficients of high school GPA and SAT together are significant or not. Alternatively, we want to see whether, above and beyond the quality of recommendation letters, the two predictors can explain a significant portion of the variance in college GPA. To conduct the test, we need to fit two models:

  • A full model: which consists of all the predictors to predict c.gpa by intercept, h.gpa, SAT, and recommd.
  • A reduced model: obtained by removing the predictors to be tested in the full model.

From the full model, we can get the $R^2 = 0.3997$ with all three predictors and from the reduced model, we can get the $R_0^2 = 0.1226$ with only quality of recommendation letters. Then the F statistic is constructed as

\[F=\frac{n-k-1}{p}\frac{R^{2}-R_{0}^{2}}{1-R^{2}}=\left(\frac{100-3-1}{2}\right )\times\frac{.3997-.1226}{1-.3997}=22.157.\]

Using the F distribution with the degrees of freedom $p=2$ (the number of coefficients to be tested) and $n-k-1 = 96$, we can get the p-value close to 0 ($p=1.22e-08$).

Note that the test conducted here is based on the comparison of two models. In R, if there are two models, they can be compared conveniently using the R function anova() . As shown below, we obtain the same F statistic and p-value.
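A sketch of that comparison, reusing the gpa data frame name assumed in the earlier sketch:

```r
full    <- lm(c.gpa ~ h.gpa + SAT + recommd, data = gpa)
reduced <- lm(c.gpa ~ recommd, data = gpa)    # drops h.gpa and SAT

anova(reduced, full)   # same F statistic (about 22.16) and p-value as computed by hand
```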

To cite the book, use: Zhang, Z. & Wang, L. (2017-2022). Advanced statistics using R . Granger, IN: ISDSA Press. https://doi.org/10.35566/advstats. ISBN: 978-1-946728-01-2. To take the full advantage of the book such as running analysis within your web browser, please subscribe .


Statistics LibreTexts

12.2.1: Hypothesis Test for Linear Regression


  • Rachel Webb
  • Portland State University

To test whether the slope is significant, we will be doing a two-tailed test with hypotheses. The population least squares regression line would be \(y = \beta_{0} + \beta_{1} x + \varepsilon\) where \(\beta_{0}\) (pronounced “beta-naught”) is the population \(y\)-intercept, \(\beta_{1}\) (pronounced “beta-one”) is the population slope and \(\varepsilon\) is called the error term.

If the slope were horizontal (equal to zero), the regression line would give the same \(y\)-value for every input of \(x\) and would be of no use. If there is a statistically significant linear relationship then the slope needs to be different from zero. We will only do the two-tailed test, but the same rules for hypothesis testing apply for a one-tailed test.

We will only be using the two-tailed test for a population slope.

The hypotheses are:

\(H_{0}: \beta_{1} = 0\) \(H_{1}: \beta_{1} \neq 0\)

The null hypothesis of a two-tailed test states that there is not a linear relationship between \(x\) and \(y\). The alternative hypothesis of a two-tailed test states that there is a significant linear relationship between \(x\) and \(y\).

Either a t-test or an F-test may be used to see if the slope is significantly different from zero. The population of the variable \(y\) must be normally distributed.

F-Test for Regression

An F-test can be used instead of a t-test. Both tests will yield the same results, so it is a matter of preference and what technology is available. Figure 12-12 is a template for a regression ANOVA table, containing equations for the sum of squares, degrees of freedom, and mean square for regression and for error, as well as the F-value of the data.

where \(n\) is the number of pairs in the sample and \(p\) is the number of predictor (independent) variables; for now this is just \(p = 1\). Use the F-distribution with degrees of freedom for regression = \(df_{R} = p\), and degrees of freedom for error = \(df_{E} = n - p - 1\). This F-test is always a right-tailed test, since ANOVA tests whether the variation explained by the regression model is larger than the variation in the error.

Use an F-test to see if there is a significant relationship between hours studied and grade on the exam. Use \(\alpha\) = 0.05.

T-Test for Regression

If the regression equation has a slope of zero, then every \(x\) value will give the same \(y\) value and the regression equation would be useless for prediction. We should perform a t-test to see if the slope is significantly different from zero before using the regression equation for prediction. The numeric value of t will be the same as in the t-test for a correlation. The two test statistic formulas are algebraically equivalent; however, they are written differently and we use a different parameter in the hypotheses.

The formula for the t-test statistic is \(t = \frac{b_{1}}{\sqrt{ \left(\frac{MSE}{SS_{xx}}\right) }}\)

Use the t-distribution with degrees of freedom equal to \(n - p - 1\).

The t-test for slope has the same hypotheses as the F-test: \(H_{0}: \beta_{1} = 0\) versus \(H_{1}: \beta_{1} \neq 0\).

Use a t-test to see if there is a significant relationship between hours studied and grade on the exam, use \(\alpha\) = 0.05.

6   Multiple Linear Regression

Multiple linear regression extends the simple linear regression framework to multiple regressors: \[ Y = \beta_{0} + \beta_{1} X_{1}+ \beta_{2} X_{2} + ... + \beta_{K-1} X_{K-1} + \epsilon\,. \tag{6.1}\] Here \(X_{1}, X_{2},...,X_{K-1}\) represent different variables, although one may be some transformation of another. That is, it may be that \(X_{1}\) and \(X_{2}\) represent completely different random variables, such as age and work experience wexp, or it may be that \(X_{2}\) is a function of \(X_{1}\), e.g., \(X_{2} = X_{1}^{2}\). In either case, when speaking of the multiple linear regression model generically, we will denote the regressors as \(X_{1}, X_{2},...,X_{K-1}\). The variable \(\epsilon\) is again a catch-all noise term. The parameter \(\beta_{0}\) is the intercept, and \(\beta_{1},\dots,\beta_{K-1}\) are the “slope coefficients”. The term “coefficients” will refer to both the intercept and the slope coefficients.

The following examples illustrate the usefulness of extending the simple linear regression to multiple linear regression.

Example 6.1 Suppose \(X\) and \(Z\) are correlated variables, but only \(Z\) is a true causal variable of \(Y\) . To fix ideas, suppose \(X\) is height, \(Z\) is gender and \(Y\) is wage. Estimating the regression \[ Y=\beta_{0} + \beta_{1} X + \epsilon \] on data from a population where there is a gender wage gap will generally result in a significant estimate of \(\beta_{1}\) . This reflects only the common correlation between both \(Y\) and \(X\) with \(Z\) . As we will see, the multiple linear regression model \[ Y = \beta_{0} + \beta_{1} X + \beta_{2} Z + \epsilon \] can help separate the effects of \(X\) and \(Z\) on \(Y\) . If both \(X\) and \(Z\) are bona fide causal variables, the multiple linear regression model will be helpful in measuring the extent of causality between the two variables on the dependent variable.

Example 6.2 The simple linear regression model assumes a linear conditional expectation, but it may be that the conditional expectation is non-linear in the variables. In some cases, transformations to the regressor or the regressand (or both) suffices. For instance, the “log-linear” model \[ \ln Y = \beta_{0} + \beta_{1} \ln X + \epsilon \] may fit the data well. In other cases, however, we might need greater flexibility in specifying the form of the conditional expectation. The multiple linear regression framework gives us a good deal of flexibility. For example, we can have specifications such as \[ Y = \beta_0 + \beta_1X + \beta_2 X^2 + \epsilon\,, \tag{6.2}\] or \[ Y = \beta_0 + \beta_1X + \beta_2 D.X + \epsilon\,, \tag{6.3}\] where \(D\) is a binary variable that is equal to one if \(X\) is greater or equal to some threshold \(\xi\) , and zero otherwise.

Example 6.3 It may be that there are several variables that are good predictors of the dependent variable. Multiple linear regression models allow us a flexible way to use multiple predictors using specifications such as \[ Y = \beta_{0} + \beta_{1} X + \beta_{2} Z + \epsilon \] or even \[ Y = \beta_{0} + \beta_{1} X + \beta_{2} X^{2} + \beta_{3} Z + \beta_{4} Z^{2} + \beta_5 Z.X + \epsilon \tag{6.4}\]

The regressors \(D.X\) in Eq.  6.3 and \(Z.X\) in Eq.  6.4 are called interaction terms.

Example 6.4 The ability to specify flexible non-linear relationships in the multiple linear regression framework is also helpful in causal applications. It may well be that the relationship between dependent variable and a causal variable is non-linear. For instance, e.g., the rate of increase in earnings as a worker gets older may decline with age, in which case the specification \[ \ln earnings = \beta_{0} + \beta_{1} age + \beta_{2} age^{2} + \beta_{3} wexp + \epsilon \tag{6.5}\] may be appropriate. In Eq.  6.5 , we have \[ \frac{\delta \, \ln earnings}{\delta\, age} = \beta_{1} + \beta_{2}\,age\,. \] which allows the percentage annual increase in earnings to depend on age. We can also allow the rate of increase to depend also on other variables which specifications such as \[ \ln earnings = \beta_{0} + \beta_{1} age + \beta_{2} age^{2} + \beta_{3} wexp + \beta_{4} wexp.age + \epsilon \tag{6.6}\] where \[ \frac{\delta \, \ln earnings}{\delta\, age} = \beta_{1} + \beta_{2}\,age + \beta_{4} wexp\,. \]

Multiple regression models are usually estimated using ordinary least squares. In this chapter, we focus on the multiple linear regression with two regressors \[ Y = \beta_0 + \beta_1 X + \beta_2 Z + \epsilon. \] The general multi-regressor case is best dealt with using matrix algebra, which we leave for a later chapter. We use the two regressor case to build intuition regarding issues such as bias-variance tradeoffs, how the inclusion of an additional variable helps to “control” for the confounding effect of that variable, and basic ideas about joint hypotheses testing. We continue to assume that you are working with cross-sectional data. Be reminded that the variables \(Y\) , \(X\) and \(Z\) may be transformations of the variables of interest. Furthermore, \(X\) and \(Z\) may be transformations of the same variable, e.g., we may have \(Z = X^2\) .

We use the following packages in this chapter.

6.1 OLS Estimation of the Multiple Linear Regression Model

Let \(\{Y_{i}, X_{i}, Z_{i}\}_{i=1}^N\) be your sample. For any estimators \(\hat{\beta}_0\) , \(\hat{\beta}_1\) and \(\hat{\beta}_2\) (whether or not obtained by OLS), define the fitted values to be \[ \hat{Y}_{i} = \hat{\beta}_{0} + \hat{\beta}_{1} X_{i} + \hat{\beta}_{2} Z_{i} \tag{6.7}\] and the residuals to be \[ \hat{\epsilon}_{i} = Y_{i} - \hat{Y}_{i} = Y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1} X_{i} - \hat{\beta}_{2} Z_{i} \tag{6.8}\] for \(i=1,2,...,N\) . The OLS method chooses \((\hat{\beta}_{0}, \hat{\beta}_{1}, \hat{\beta}_{2})\) to be those values of \((\beta_{0}, \beta_{1}, \beta_{2})\) that minimize the sum of squared residuals \(SSR = \sum_{i=1}^{N} (Y_{i} - \hat{\beta}_{0} - \hat{\beta}_{1} X_{i} - \hat{\beta}_{2} Z_{i})^2\) , i.e., \[ {\hat{\beta}_{0}^{ols}, \hat{\beta}_{1}^{ols}, \hat{\beta}_{2}^{ols}} = \text{argmin}_{\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2} \sum_{i=1}^N (Y_i - \hat{\beta}_0 - \hat{\beta}_1X_i - \hat{\beta}_2Z_i)^2. \tag{6.9}\] The phrase “ \(\text{argmin}_{\hat{\beta}_{0}, \hat{\beta}_{1}, \hat{\beta}_{2}}\) ” means “the values of \(\hat{\beta}_{0}, \hat{\beta}_{1}, \hat{\beta}_{2}\) that minimize …”). The OLS estimators can be found by solving the first order conditions: \[ \begin{aligned} \left.\frac{\partial SSR}{\partial \hat{\beta}_0}\right|_{\hat{\beta}_0^{ols},\hat{\beta}_1^{ols},\hat{\beta}_2^{ols}} &= -2\sum_{i=1}^N(Y_i-\hat{\beta}_0^{ols}-\hat{\beta}_1^{ols}X_i-\hat{\beta}_2^{ols}Z_i)=0\,, \\ \left.\frac{\partial SSR}{\partial \hat{\beta}_1}\right|_{\hat{\beta}_0^{ols},\hat{\beta}_1^{ols},\hat{\beta}_2^{ols}} &= -2\sum_{i=1}^N(Y_i-\hat{\beta}_0^{ols}-\hat{\beta}_1^{ols}X_i-\hat{\beta}_2^{ols}Z_i)X_i=0\,, \\ \left.\frac{\partial SSR}{\partial \hat{\beta}_2}\right|_{\hat{\beta}_0^{ols},\hat{\beta}_1^{ols},\hat{\beta}_2^{ols}} &= -2\sum_{i=1}^N(Y_i-\hat{\beta}_0^{ols}-\hat{\beta}_1^{ols}X_i-\hat{\beta}_2^{ols}Z_i)Z_i=0\,. \end{aligned} \tag{6.10}\] We can also write the first order conditions as \[ \sum_{i=1}^N \hat{\epsilon}_i^{ols} = 0\,, \quad \sum_{i=1}^N \hat{\epsilon}_i^{ols}X_i = 0\,, \quad \text{and} \quad \sum_{i=1}^N \hat{\epsilon}_i^{ols}Z_i = 0\,. \tag{6.11}\] Instead of solving the three-equation three-unknown system Eq.  6.10 directly, we are going to take an alternative but entirely equivalent approach. This alternative approach is indirect, but more illustrative. We focus on the estimation of \(\hat{\beta}_{1}^{ols}\) . You can get the solution for \(\hat{\beta}_{2}^{ols}\) by switching \(X_{i}\) with \(Z_{i}\) in the steps shown. After obtaining \(\hat{\beta}_{1}^{ols}\) and \(\hat{\beta}_{2}^{ols}\) , you can use the first equation in Eq.  6.10 to compute \[ \hat{\beta}_{0}^{ols} = \overline{Y} - \hat{\beta}_{1}^{ols}\overline{X} - \hat{\beta}_{2}^{ols}\overline{Z}\,. \] We begin with the following “auxiliary” regressions:

  • Regress \(X_{i}\) on \(Z_{i}\) , and collect the residuals \(r_{i,x|z}\) from this regression, i.e., compute \[ r_{i,x|z} = X_{i} - \hat{\delta}_{0} - \hat{\delta}_{1} Z_{i} \; , \; i=1,2,...,N \] where \(\hat{\delta}_0\) and \(\hat{\delta}_1\) are the OLS estimators for the intercept and slope coefficients from a regression of \(X_i\) on a constant and \(Z_i\) .
  • Regress \(Y_i\) on \(Z_i\) , and collect the residuals \(r_{i,y|z}\) from this regression, i.e., compute \[ r_{i,y|z} = Y_i - \hat{\alpha}_0 - \hat{\alpha}_1Z_i \; , \; i=1,2,...,N \] where \(\hat{\alpha}_0\) and \(\hat{\alpha}_1\) are the OLS estimators for the intercept and slope coefficients from a regression of \(Y_i\) on a constant and \(Z_i\) .

The OLS estimator \(\hat{\beta}_1^{ols}\) obtained from solving the first order conditions Eq.  6.10 turns out to be equal to the OLS estimator of the coefficient on \(r_{i,x|z}\) in a regression of \(r_{i,y|z}\) on \(r_{i,x|z}\) (you can exclude the intercept term here; the sample means of both residuals are zero by construction, so the estimator for the intercept term if included will also be zero). In other words, \[ \hat{\beta}_1^{ols} = \frac{\sum_{i=1}^N r_{i,x|z}r_{i,y|z}}{\sum_{i=1}^N r_{i,x|z}^2}. \tag{6.12}\] To see this, note that since \(\{r_{i,x|z}\}_{i=1}^N\) are OLS residuals from a regression of \(X_i\) on an intercept term and \(Z_i\) , we have \(\sum_{i=1}^N r_{i,x|z} = 0\) and \(\sum_{i=1}^N r_{i,x|z}Z_i = 0\) . This implies \[ \sum_{i=1}^N r_{i,x|z}\hat{X}_i = \sum_{i=1}^N r_{i,x|z}(\hat{\delta}_0 + \hat{\delta}_1Z_i) = 0 \] and furthermore, \[ \sum_{i=1}^N r_{i,x|z}X_i = \sum_{i=1}^N r_{i,x|z}(\hat{X}_i + r_{i,x|z}) = \sum_{i=1}^N r_{i,x|z}^2. \] Now consider the sum \(\sum_{i=1}^N r_{i,x|z}r_{i,y|z}\) . We have \[ \begin{aligned} \sum_{i=1}^N r_{i,x|z}r_{i,y|z} &= \sum_{i=1}^N r_{i,x|z}(Y_i - \hat{\alpha}_0 - \hat{\alpha}_1 Z_i) \\ &= \sum_{i=1}^N r_{i,x|z}Y_i \\ &= \sum_{i=1}^N r_{i,x|z}(\hat{\beta}_0^{ols} + \hat{\beta}_1^{ols}X_i + \hat{\beta}_2^{ols}Z_i + \hat{\epsilon}_i) \\ &= \hat{\beta}_1^{ols}\sum_{i=1}^N r_{i,x|z}^2 + \sum_{i=1}^N r_{i,x|z}\hat{\epsilon}_i. \end{aligned} \tag{6.13}\] Finally, we note that the first order conditions Eq.  6.10 imply that \[ \begin{aligned} \sum_{i=1}^N r_{i,x|z}\hat{\epsilon}_i &= \sum_{i=1}^N \hat{\epsilon}_i(X_i - \hat{\delta}_0 - \hat{\delta}_1Z_i) \\ &= \sum_{i=1}^N \hat{\epsilon}_iX_i - \hat{\delta}_0\sum_{i=1}^N \hat{\epsilon}_i - \hat{\delta}_1\sum_{i=1}^N \hat{\epsilon}_iZ_i = 0. \end{aligned} \] Therefore \[ \sum_{i=1}^N r_{i,x|z}r_{i,y|z} = \hat{\beta}_1^{ols}\sum_{i=1}^N r_{i,x|z}^2 \] which gives Eq.  6.12 .

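As a quick numerical check of Eq. 6.12 (this sketch is not from the original text and uses simulated data), the following code verifies that the auxiliary-regression formula reproduces the coefficient on \(X\) from the full two-regressor OLS fit:

```r
set.seed(42)
N <- 200
Z <- rnorm(N)
X <- 0.5 * Z + rnorm(N)               # X is correlated with Z, but not perfectly
Y <- 1 + 2 * X - 1.5 * Z + rnorm(N)   # coefficients chosen purely for illustration

# Full two-regressor OLS fit
fit_full <- lm(Y ~ X + Z)

# Auxiliary regressions: strip out the part of X and of Y that is explained by Z
r_xz <- resid(lm(X ~ Z))
r_yz <- resid(lm(Y ~ Z))

# Eq. 6.12: regress the Y-residuals on the X-residuals
beta1_fwl <- sum(r_xz * r_yz) / sum(r_xz^2)

c(full_fit = unname(coef(fit_full)["X"]), auxiliary = beta1_fwl)  # identical values
```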
Note that in order for Eq.  6.12 to be feasible, we require \(\sum_{i=1}^N r_{i,x|z}^2 \neq 0\) . This means that \(X_i\) and \(Z_i\) cannot be perfectly correlated (positively or negatively). They can be correlated, just not perfectly so. Furthermore, in the auxiliary regression of \(X_i\) on \(Z_i\) , we require some variation in \(Z_i\) , i.e., it cannot be that all the \(Z_i\) , \(i=1,2,...,N\) have the same value \(c\) . Similarly, to derive \(\hat{\beta}_2^{ols}\) , we require variation in \(X_i\) . All this is perfectly intuitive. If there is no variation in \(X_i\) in the sample, we cannot measure how \(Y_i\) changes with \(X_i\) . Similarly for \(Z_i\) . If \(X_i\) and \(Z_i\) are perfectly correlated, we will not be able to tell whether a change in \(Y_i\) is due to a change in \(X_i\) or in \(Z_i\) , since they move in perfect lockstep. We can summarize all of these requirements by saying that there is no \((c_1,c_2,c_3) \neq (0,0,0)\) such that \(c_1 + c_2X_i + c_3Z_i = 0\) for all \(i=1,2,...,N\) .

The argument presented shows the essence of how confounding factors are ‘controlled’ in multiple regression analysis. Suppose we want to measure how \(Y_i\) is affected by \(X_i\). If \(Z_i\) is an important determinant of \(Y_i\) that is correlated with \(X_i\), but omitted from the regression, then the measurement of the influence of \(X_i\) on \(Y_i\) will be distorted. In an experiment, we would control for \(Z_i\) by literally holding it fixed. In applications in economics, this is impossible. What multiple regression analysis does instead is to strip out all variation in \(Y_i\) and \(X_i\) that is correlated with \(Z_i\), and then measure the correlation in the remaining variation in \(Y_i\) and \(X_i\).

6.2 Algebraic Properties of OLS Estimators

We drop the ‘OLS’ superscript in our notation of the OLS estimators, residuals, and fitted values from this point, and write \(\hat{\beta}_0\) , \(\hat{\beta}_1\) , \(\hat{\beta}_2\) , \(\hat{\epsilon}_i\) and \(\hat{Y}_i\) for \(\hat{\beta}_0^{ols}\) , \(\hat{\beta}_1^{ols}\) , \(\hat{\beta}_2^{ols}\) , \(\hat{\epsilon}_i^{ols}\) and \(\hat{Y}_i^{ols}\) respectively. We will reinstate the ‘ols’ superscript whenever we need to emphasize that OLS was used, or when comparing OLS estimators to estimators derived in another way.

Many of the algebraic properties carry over from the simple linear regression model.

We have already noted that the first order conditions can be written as \[ \sum_{i=1}^N \hat{\epsilon}_i=0, \quad \sum_{i=1}^N X_i \hat{\epsilon}_i =0 \quad \text{and} \quad \sum_{i=1}^N Z_i \hat{\epsilon}_i =0. \]

This implies that the fitted values \(\hat{Y}_i\) and the residuals are also uncorrelated.

The first equation in the first order conditions Eq.  6.10 implies that the point \((\overline{X},\overline{Y},\overline{Z})\) lies on the sample regression function.

\(\overline{Y} = \overline{\hat{Y}}\) continues to hold.

The above properties imply that the \(SST = SSE + SSR\) equality continues to hold in the multiple regression case \[ \sum_{i=1}^N (Y_i - \overline{Y})^2 = \sum_{i=1}^N (\hat{Y}_i - \overline{\hat{Y}})^2 + \sum_{i=1}^N \hat{\epsilon}_i^2. \tag{6.14}\] As in the simple linear regression case, we can use Eq. 6.14 to define the goodness-of-fit measure: \[ R^2 = 1 - \frac{SSR}{SST}. \tag{6.15}\] It should be noted that the \(R^2\) will never decrease as we add more variables to the regression. This is because OLS minimizes \(SSR\), and therefore maximizes \(R^2\). For example, the \(R^2\) from the regression \(Y = \beta_0 + \beta_1X + \beta_2Z + u\) will never be less than the \(R^2\) from the regression \(Y = \alpha_0 + \alpha_1X + \epsilon\), and will generally be greater, unless it so happens that \(\hat{\beta}_0 = \hat{\alpha}_0\), \(\hat{\beta}_1 = \hat{\alpha}_1\) and \(\hat{\beta}_2 = 0\). For this reason, the “Adjusted \(R^2\)” \[ \text{Adj.-}R^2 = 1 - \frac{\frac{1}{N-K}SSR}{\frac{1}{N-1}SST} \] is sometimes used, where \(K\) is the number of regressors (including the intercept term; for the 2-regressor case that we are focussing on, \(K=3\)). The idea is to use unbiased estimates of the variances of \(\epsilon\) and \(Y\). Since both \(SSR\) and \(N-K\) decrease when additional variables (and parameters) are added into the model, the adjusted \(R^2\) will increase only if \(SSR\) falls enough to lower the value of \(SSR/(N-K)\). The adjusted \(R^2\) may be used as an alternative measure of goodness-of-fit, but it should not be used as a model selection tool, for reasons we shall come to later in the chapter.

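A short sketch (simulated data, not from the original text) showing how the \(R^2\) of Eq. 6.15 and the adjusted \(R^2\) line up with the values reported by summary():

```r
set.seed(1)
N <- 100
X <- rnorm(N); Z <- rnorm(N)
Y <- 1 + 0.5 * X + 0.8 * Z + rnorm(N)

fit <- lm(Y ~ X + Z)
K   <- 3                                  # number of coefficients, incl. intercept
SSR <- sum(resid(fit)^2)
SST <- sum((Y - mean(Y))^2)

R2     <- 1 - SSR / SST
adj_R2 <- 1 - (SSR / (N - K)) / (SST / (N - 1))

c(R2, summary(fit)$r.squared)             # should match
c(adj_R2, summary(fit)$adj.r.squared)     # should match
```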
In the derivation Eq.  6.13 of the OLS estimator \(\hat{\beta}_1\) , we noted that \[ \begin{aligned} \sum_{i=1}^N r_{i,x|z}r_{i,y|z} &= \sum_{i=1}^N r_{i,x|z}(Y_i - \hat{\alpha}_0 - \hat{\alpha}_1 Z_i) \\ &= \sum_{i=1}^N r_{i,x|z}Y_i. \end{aligned} \] This implies that the estimator can also be written as \[ \hat{\beta}_1 = \frac{\sum_{i=1}^N r_{i,x|z}r_{i,y|z}}{\sum_{i=1}^N r_{i,x|z}^2} = \frac{\sum_{i=1}^N r_{i,x|z}Y_i}{\sum_{i=1}^N r_{i,x|z}^2} \tag{6.16}\] which is the formula for the simple linear regression of \(Y_i\) on \(r_{i,x|z}\) . In other words, you can also get the OLS estimator \(\hat{\beta}_1\) by regressing \(Y_i\) on \(r_{i,x|z}\) without first stripping out the covariance between \(Y_i\) and \(Z_i\) .

The expression Eq.  6.16 shows that \(\hat{\beta}_1\) is a linear estimator, i.e., \[ \hat{\beta}_1 = \sum_{i=1}^N w_i Y_i \] where here the weights are \(w_i = r_{i,x|z}/\sum_{i=1}^N r_{i,x|z}^2\) . Note that the weights \(w_i\) are made up solely of observations \(\{X_i\}_{i=1}^N\) and \(\{Z_i\}_{i=1}^N\) , since they are the residuals from a regression of \(X_i\) on \(Z_i\) . Furthermore, the weights have the following properties: \[ \begin{aligned} & \sum_{i=1}^N w_i = 0\,, \\ & \sum_{i=1}^N w_iZ_i = \frac{\sum_{i=1}^Nr_{i,x|z}Z_i}{\sum_{i=1}^Nr_{i,x|z}^2} = 0 \,, \\ & \sum_{i=1}^N w_iX_i = \frac{\sum_{i=1}^Nr_{i,x|z}X_i}{\sum_{i=1}^Nr_{i,x|z}^2} = 1 \,, \\ & \sum_{i=1}^N w_i^2 = \frac{\sum_{i=1}^Nr_{i,x|z}^2}{(\sum_{i=1}^Nr_{i,x|z}^2)^2} = \frac{1}{\sum_{i=1}^Nr_{i,x|z}^2} \,. \end{aligned} \]

If the sample correlation between \(X_i\) and \(Z_i\) is zero, then the coefficient estimate \(\hat{\delta}_1\) in the auxiliary regression where we regressed \(X\) on \(Z\) would be zero, and \(\hat{\delta}_0\) would be equal to the sample mean \(\overline{X}\). In other words, we would have \(r_{i,x|z} = X_i - \overline{X}\), so \[ \hat{\beta}_1 = \frac{\sum_{i=1}^N r_{i,x|z}Y_i}{\sum_{i=1}^N r_{i,x|z}^2} = \frac{\sum_{i=1}^N (X_i - \overline{X})Y_i}{\sum_{i=1}^N (X_i - \overline{X})^2}\,. \] This is, of course, just the OLS estimator for the coefficient on \(X_i\) in the simple linear regression of \(Y\) on \(X\). That is, if the sample correlation between \(X_i\) and \(Z_i\) is zero, then including \(Z_i\) in the regression would not change the value of the simple linear regression estimator for the coefficient on \(X_i\). We will see shortly that including the additional variable may nonetheless reduce the estimator variance.

6.3 Statistical Properties of OLS Estimators

We list Assumption Set B below, which is an adaptation of Assumption Set A to the two-regressor case. With these assumptions, the OLS estimators will again be unbiased and efficient. We will leave the proof of many of these results to a later chapter, when we deal with the general case. In this section, we focus on the OLS estimator variance, and in particular on the trade-off between the benefits of including more variables and the cost of doing so in terms of higher estimator variance.

Assumption Set B:   Suppose that (B1) there are values \(\beta_0\) , \(\beta_1\) and \(\beta_2\) such that the random variable \(\epsilon\) , defined as \[ \epsilon = Y - \beta_0 - \beta_1 X - \beta_2Z \] satisfies

(B2)   \(E[\epsilon|X, Z] = 0\) ,

(B3)   \(var[\epsilon|X, Z] = \sigma^2\) .

Suppose also that your data

(B4)   \(\{X_i,Y_i, Z_i\}_{i=1}^N\) is a random sample from the population, and

(B5)   \(c_1 + c_2X_i + c_3Z_i = 0\) for all \(i=1,2,...,N\) only if \((c_1,c_2,c_3) = (0,0,0)\) .

Assumption B2 implies that \[ E[Y|X,Z] = \beta_0 + \beta_1 X + \beta_2 Z \,. \tag{6.17}\] As in the simple linear regression case, the assumptions imply

  • \(E[\epsilon_i | \text{x}, \text{z}] = 0\) for all \(i=1,...,N\) ,
  • \(E[\epsilon_i^2 | \text{x}, \text{z}] = \sigma^2\) for all \(i=1,...,N\) ,
  • \(E[\epsilon_i\epsilon_j | \text{x}, \text{z}] = 0\) for all \(i\neq j, \; i,j=1,...,N\) where we use the notation \(\text{x}\) to denote \(X_1,X_2,...,X_N\) , and \(\text{z}\) to denote \(Z_1,Z_2,...,Z_N\) .

OLS is unbiased, since \[ \begin{aligned} \hat{\beta}_1 &= \sum_{i=1}^N w_i Y_i \\ &= \sum_{i=1}^N w_i(\beta_0 + \beta_1X_i + \beta_2 Z_i + \epsilon_i)\\ &= \beta_1 + \sum_{i=1}^N w_i \epsilon_i\,. \end{aligned} \] Taking conditional expectations gives \[ E[\hat{\beta}_1|\text{x}, \text{z}] = \beta_1 + \sum_{i=1}^N w_i E[\epsilon_i|\text{x}, \text{z}] = \beta_1\,. \] It follows that the unconditional mean is \(E[\hat{\beta}_1] = \beta_1\) .

The conditional variance of \(\hat{\beta}_1\) under Assumption Set B is \[ \begin{aligned} var[\hat{\beta_1} | \text{x}, \text{z}] &= var\left[\left.\beta_1+\sum_{i=1}^N w_i\epsilon_i \right| \text{x}, \text{z}\right] \\ &= \sum_{i=1}^N w_i^2var[\epsilon_i | \text{x}, \text{z}] \\ &= \frac{\sigma^2}{\sum_{i=1}^Nr_{i,x|z}^2}\,. \end{aligned} \tag{6.18}\] Since the \(R^2\) from the regression of \(X_i\) on \(Z_i\) is \[ R_{x|z}^2 = 1 - \frac{\sum_{i=1}^Nr_{i,x|z}^2}{\sum_{i=1}^N(X_i - \overline{X})^2}, \] we can also write \(var[\hat{\beta_1} | \text{x}, \text{z}]\) as \[ var[\hat{\beta_1} | \text{x}, \text{z}] = \frac{\sigma^2}{(1-R_{x|z}^2)\sum_{i=1}^N(X_i-\overline{X})^2}\,. \tag{6.19}\]

Expression Eq. 6.19 clearly shows the trade-offs involved in adding a second regressor. Suppose the true data generating process is \[ Y = \beta_0 + \beta_1 X + \beta_2 Z + \epsilon \,,\; E[\epsilon | X, Z] = 0 \,,\; var[\epsilon | X, Z] = \sigma^2 \] but you ran the regression \[ Y = \beta_0 + \beta_1 X + u. \] If \(X\) and \(Z\) are correlated, then \(X\) and \(u\) are correlated, and you will get biased estimates of \(\beta_1\). By estimating the multiple linear regression, you are able to get an unbiased estimate of \(\beta_1\) by controlling for \(Z\). However, the variance of the OLS estimator for \(\beta_1\) changes from \[ var[\hat{\beta_1}|\text{x}] = \frac{\sigma_u^2}{\sum_{i=1}^N(X_i-\overline{X})^2} \] in the simple linear regression to the expression in Eq. 6.19 for the multiple linear regression. Since \(\sigma^2\) is the variance of \(\epsilon\), and \(\sigma_u^2\) is the variance of a combination of the uncorrelated variables \(Z\) and \(\epsilon\), we have \(\sigma^2 < \sigma_u^2\). This has the effect of reducing the estimator variance (which is good!). However, since \(0 < 1-R_{x|z}^2 < 1\), the denominator in the variance expression is smaller in the multiple linear regression case than in the simple linear regression case. This is because in the multiple regression, we have stripped out all variation in \(X\) that is correlated with \(Z\), resulting in reduced effective variation in \(X\), which in turn increases the estimator variance. In general (and especially in causal applications), one would usually consider the trade-off to be in favor of the multiple regression. However, if \(X\) and \(Z\) are highly correlated (\(R_{x|z}^2\) close to 1), then the reduction in effective variation in \(X\) may be so severe that the estimator variance becomes very large. This tends to reduce the size of the t-statistic, leading to a finding of statistical insignificance even in cases where the size of the estimate itself may suggest strong economic significance.

To compute a numerical estimate for the conditional variance of \(\hat{\beta}_1\) , we have to estimate \(\sigma^2\) . An unbiased estimator for \(\sigma^2\) in the two-regressor case is \[ \widehat{\sigma^2} = \frac{1}{N-3}\sum_{i=1}^N \hat{\epsilon}_i^2. \tag{6.20}\] The \(SSR\) is divided by \(N-3\) because three ‘degrees-of-freedom’ were used in computing \(\hat{\beta}_0\) , \(\hat{\beta}_1\) and \(\hat{\beta}_2\) and these were used in the computation of \(\hat{\epsilon}_i\) . We shall again leave the proof of unbiasedness of \(\widehat{\sigma^2}\) for when we deal with the general case. We estimate the conditional variance of \(\hat{\beta}_1\) using \[ \widehat{var}[\hat{\beta}_1|\mathrm{x},\mathrm{z}] = \frac{\widehat{\sigma^2}}{(1-R_{x|z}^2)\sum_{i=1}^N(X_i - \overline{X})^2} \tag{6.21}\] The standard error of \(\hat{\beta}_1\) is the square root of Eq.  6.21 .

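The following sketch (simulated data, not part of the original chapter) computes \(\widehat{\sigma^2}\) and the standard error of \(\hat{\beta}_1\) directly from Eq. 6.20 and Eq. 6.21 and compares them with the lm() output:

```r
set.seed(7)
N <- 150
Z <- rnorm(N)
X <- 0.6 * Z + rnorm(N)
Y <- 2 + 1.5 * X - Z + rnorm(N, sd = 2)

fit    <- lm(Y ~ X + Z)
sigma2 <- sum(resid(fit)^2) / (N - 3)                        # Eq. 6.20
R2_xz  <- summary(lm(X ~ Z))$r.squared                       # R^2 of the auxiliary regression
var_b1 <- sigma2 / ((1 - R2_xz) * sum((X - mean(X))^2))      # Eq. 6.21
se_b1  <- sqrt(var_b1)

c(manual = se_b1, lm_output = summary(fit)$coefficients["X", "Std. Error"])
c(manual = sqrt(sigma2), lm_output = summary(fit)$sigma)     # residual standard error
```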
Example 6.5 The dataset multireg_eg.csv contains three variables \(X\), \(Y\) and \(Z\). The variable \(Z\) takes integer values from 1 to 5. Fig. 6.1 shows two versions of a scatterplot of \(Y\) on \(X\); the one in panel (b) uses shapes to reflect observations associated with different values of \(Z\).

There is a clear negative relationship between \(Y\) and \(X\). However, in panel (b) we see that \(Y\) and \(X\) are in fact positively correlated when \(Z\) is fixed at some specific value. At the same time, there is a positive relationship between \(Y\) and \(Z\), and a negative one between \(Z\) and \(X\), the net effect of which is to sweep the scatter in the northwest direction as \(Z\) increases, turning a positive correlation between \(Y\) and \(X\) for fixed values of \(Z\) into a negative one overall.

We run two regressions below. The first is a simple linear regression of \(Y\) on \(X\) . The second is a multiple linear regression of \(Y\) on \(X\) and \(Z\) .

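The code for the two regressions is not reproduced in this extract. A minimal sketch, assuming multireg_eg.csv sits in the working directory and contains columns named X, Y and Z (the column names are an assumption):

```r
df <- read.csv("multireg_eg.csv")

fit_simple   <- lm(Y ~ X, data = df)      # simple linear regression of Y on X
fit_multiple <- lm(Y ~ X + Z, data = df)  # multiple regression of Y on X and Z

summary(fit_simple)
summary(fit_multiple)
```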
The simple linear regression shows the negative relationship between \(Y\) and \(X\) when viewed over all outcomes of \(Z\). The multiple regression disentangles the effects of \(X\) and \(Z\) on \(Y\). In this case, inclusion of \(Z\) has also reduced the standard error on the estimate of the coefficient on \(X\) despite the reduced variation in \(X\). This is because \(Z\) accounts for a very large proportion of the variation in \(Y\), as can be seen from the substantial increase in \(R^2\) when it is included (i.e., including \(Z\) reduces the variance of the noise term by a lot).

We replicate below the multi-step approach to obtaining the coefficient estimate on \(X\) :

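The original listing is not shown in this extract; a sketch of the multi-step (auxiliary regression) approach on the same data, with the residual variable named r_xz as in the discussion below (object names are assumptions):

```r
df <- read.csv("multireg_eg.csv")          # as in the previous sketch

r_xz <- resid(lm(X ~ Z, data = df))        # residuals from regressing X on Z
r_yz <- resid(lm(Y ~ Z, data = df))        # residuals from regressing Y on Z

# The coefficient on r_xz matches the coefficient on X in the multiple regression
summary(lm(r_yz ~ r_xz))
```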
The numerical estimate of the coefficient on r_xz is identical to that on X in the previous regression. The standard errors are similar, but not the same. We emphasize that the auxiliary regression approach is for illustrative purposes only. The standard errors, t-statistic, etc. should all be taken from the previous (multiple) regression.

The plots in Fig.  6.2 illustrate the effect of ‘controlling’ for Z .

The y-axis spans the same width in all three plots (30 to 80 in the first two, -25 to 25 in the third). Likewise, the x-axis spans the same width across all three plots (2.5 to 12.5 in the first, -5 to 5 in the second and third). This allows you to see the reduced variation in the variables. When we regress \(X\) on \(Z\) and take the residuals, we remove the effect of \(Z\) on \(X\) and also center the residuals around zero (OLS residuals always have sample mean zero). You can see from the second diagram that the variation in \(X\) is reduced, which tends to increase the estimator variance. You can also see that the negative slope has been turned into a positive one, albeit with a lot of noise. By removing the effect of \(Z\) on \(Y\) (which we do when we include \(Z\) in the regression), we reduce the variation in \(Y\). This reduces the estimator variance. The slope coefficient in the simple regression of the data in the last panel gives the effect of \(X\) on \(Y\), controlling for \(Z\).

6.4 Hypothesis Testing

To test if \(\beta_k\) is equal to some value \(r_k\) in the population, we can again use the t-statistic as in the simple linear regression case: \[ t = \frac{\hat{\beta}_k - r_k}{\sqrt{\widehat{var}[\hat{\beta}_k]}}. \] If the noise terms are conditionally normally distributed, then the \(t\)-statistic has the \(t\)-distribution, with degrees-of-freedom \(N-K\) where \(K=3\) in the two-regressor case with intercept. If we do not assume normality of the noise terms, then (as long as the necessary CLTs apply) we use instead the approximate test, using the \(t\)-statistic as defined above, but with the rejection region derived from the standard normal distribution. You can also test whether a linear combination of the parameters is equal to some value. For example, in the regression \[ Y = \beta_0 + \beta_1 X + \beta_2 Z + \epsilon \] you can test, say, \(H_0: \beta_1 + \beta_2 = 1 \text{ vs } H_A: \beta_1 + \beta_2 \neq 1\). The \(t\)-statistic in this case is \[ t = \frac{\hat{\beta}_1 + \hat{\beta}_2 - 1}{\sqrt{\widehat{var}[\hat{\beta}_1+\hat{\beta}_2]}}. \] To compute this you will need the covariance of \(\hat{\beta}_1\) and \(\hat{\beta}_2\), the derivation of which we leave for a later chapter.

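As a sketch of how that covariance enters in practice (not from the original text; simulated data with \(\beta_1 + \beta_2 = 1\) so the null is true by construction), vcov() on a fitted lm object returns the estimated variance-covariance matrix of the coefficients, from which the t-statistic for \(H_0: \beta_1 + \beta_2 = 1\) can be assembled:

```r
set.seed(3)
N <- 120
X <- rnorm(N); Z <- rnorm(N)
Y <- 0.5 + 0.6 * X + 0.4 * Z + rnorm(N)   # beta1 + beta2 = 1, so H0 holds

fit <- lm(Y ~ X + Z)
V   <- vcov(fit)                           # estimated covariance matrix of the coefficients

est   <- coef(fit)["X"] + coef(fit)["Z"]   # estimate of beta1 + beta2
se    <- sqrt(V["X", "X"] + V["Z", "Z"] + 2 * V["X", "Z"])
tstat <- (est - 1) / se
pval  <- 2 * pt(-abs(tstat), df = N - 3)   # t distribution with N - K degrees of freedom

c(t = unname(tstat), p = unname(pval))
```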
Example 6.6 Suppose the production technology of a firm can be characterized by the “Cobb-Douglas Production Function”: \[ Q(L,K) = AL^\alpha K^\beta \] where \(Q(L,K)\) is the quantity produced using \(L\) units of labor and \(K\) units of capital. The constants \(A\) , \(\alpha\) and \(\beta\) are the parameters of the model. If we multiply the amount of labor and capital by \(c\) , we get \[ Q(cL,cK) = A(cL)^\alpha (cK)^\beta = c^{\alpha+\beta}AL^\alpha K^\beta. \] The sum \(\alpha + \beta\) therefore represents the ‘returns to scale’. If \(\alpha + \beta = 1\) , then there is constant returns to scale, e.g., doubling the amount of labor and capital ( \(c=2\) ) results in the doubling of total production. If \(\alpha + \beta > 1\) then there is increasing returns to scale, and if \(\alpha + \beta < 1\) , we have decreasing returns to scale. A logarithmic transformation of the production function gives \[ \ln Q = \ln A + \alpha \ln L + \beta \ln K. \] If we have observations \(\{Q_i, L_i, K_i\}_{i=1}^N\) of the quantities produced and amount of labor and capital employed by a set of similar firms in an industry, we could estimate the production function for that industry using the regression \[ \ln Q_i = \ln A + \alpha \ln L_i + \beta \ln K_i + \epsilon_i. \] A test for constant returns to scale would be the test \[ H_0: \alpha + \beta = 1 \text{ vs } H_A: \alpha + \beta \neq 1. \]

Example 6.7 In the previous chapter, we regressed \(\ln(earnings)\) on \(height\) using data in earnings.xlsx and obtained a statistically significant effect of height on earnings. We conjectured that the regression may be measuring a ‘gender gap’ in wages rather than a ‘height gap’, with \(height\) acting as a proxy for the sex of the subjects. We now attempt to control for the sex of the subjects by including a dummy variable \(male\), which is one when an observation is of a male subject, zero otherwise.

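The regression output is not reproduced in this extract. A minimal sketch of the fit described, assuming the data from earnings.xlsx have been read into a data frame named earn with columns earnings, height and male (the file-reading step and the column names are assumptions; the readxl package is one way to read .xlsx files):

```r
# library(readxl)
# earn <- read_excel("earnings.xlsx")   # column names below are assumed

fit_height <- lm(log(earnings) ~ height, data = earn)         # previous chapter's regression
fit_ctrl   <- lm(log(earnings) ~ height + male, data = earn)  # adds the male dummy

summary(fit_ctrl)
```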
Inclusion of the \(male\) dummy variable has reduced the size of the estimate of the \(height\) coefficient to 2.5 percent per inch of height (previously it was estimated at four percent). The estimate is still quite economically significant, and also still statistically significant. Perhaps \(height\) does have a direct effect on \(\ln(earnings)\), but it is more likely that there are yet more factors that need to be controlled for.

In some cases, we may wish to test multiple hypotheses, e.g., in the two-variable regression, we may wish to test \[ H_0: \beta_1 = 0 \; \text{ and } \; \beta_2 = 0 \; \text{ vs } \; H_A: \beta_1 \neq 0 \;\text{or}\; \beta_2 \neq 0. \] One possibility would be to do individual \(t\)-tests for each of the two hypotheses, but we should be aware that two individual 5% tests are not equivalent to a joint 5% test. The following example illustrates this problem.

Example 6.8 We generate 100 observations of three uncorrelated variables \(X\) , \(Y\) and \(Z\) . We regress \(Y\) on \(X\) and \(Z\) , and collect the t-statistics on \(X\) and \(Z\) . We repeat the experiment 1000 times (with different draws each time, of course, but the same parameters).

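The simulation code is not reproduced in this extract; the following is a sketch of one way to run the experiment described (the seed and distributional choices are assumptions):

```r
set.seed(100)
reps <- 1000
N    <- 100
tx <- tz <- numeric(reps)

for (r in 1:reps) {
  X <- rnorm(N); Z <- rnorm(N); Y <- rnorm(N)   # three uncorrelated variables
  fit <- lm(Y ~ X + Z)
  tx[r] <- summary(fit)$coefficients["X", "t value"]
  tz[r] <- summary(fit)$coefficients["Z", "t value"]
}

crit <- qt(0.975, df = N - 3)            # 5% two-sided critical value
mean(abs(tx) > crit)                     # rejection rate for beta_x = 0
mean(abs(tz) > crit)                     # rejection rate for beta_z = 0
mean(abs(tx) > crit | abs(tz) > crit)    # "reject if either test rejects"
```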
When using a 5% t-test, we reject the (true) hypothesis that \(\beta_x = 0\) in about 6% of the experiments, close to 5%. These rejections occur regardless of whether the t-test for \(\beta_z=0\) rejects or does not reject. Likewise, the 5% t-test for \(\beta_z = 0\) rejects the hypothesis about 5.5% of the time, again close to 5%. However, if we say we reject \(\beta_x = 0\) and \(\beta_z = 0\) whenever either t-test rejects the corresponding hypothesis, then the frequency of rejection is much larger, roughly double.

We plot the t-stats below, indicating the critical values for the individual tests. The proportion of points above the upper horizontal line or below the lower one is about 0.05. Similarly, the proportion of points to the left of the left vertical line or to the right of the right one is roughly 0.05. The number of points that meet either of the two sets of criteria is much larger, roughly the sum of the two proportions.

To jointly test multiple hypotheses, we can use the \(F\)-test. Suppose in the regression \[ Y = \beta_0 + \beta_1X + \beta_2 Z + \epsilon \] we wish to jointly test the hypotheses \[ H_0: \beta_1 = 1 \; \text{ and } \; \beta_2 = 0 \; \text{ vs } \; H_A: \beta_1 \neq 1 \;\text{or}\; \beta_2 \neq 0 \;\text{ (or both)}\,. \] Suppose we run the regression twice, once unrestricted, and another time with the restrictions in \(H_0\) imposed. The regression with the restrictions imposed is \[ Y = \beta_0 + X + \epsilon \] so the restricted OLS estimator for \(\beta_0\) is the sample mean of \(Y_i - X_i\), i.e., \[ \hat{\beta}_{0,r} = (1/N)\sum_{i=1}^N(Y_i - X_i). \] Calculate the \(SSR\) from both equations. The “unrestricted \(SSR\)” is \[ SSR_{ur} = \sum_{i=1}^N \hat{\epsilon}_i^2 \] where \(\hat{\epsilon}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1X_i - \hat{\beta}_2 Z_i\). The restricted \(SSR\) is \[ SSR_{r} = \sum_{i=1}^N \hat{\epsilon}_{i,r}^2 \] where \(\hat{\epsilon}_{i,r} = Y_i - \hat{\beta}_{0,r} - X_i\). Since OLS minimizes \(SSR\), imposing restrictions will generally increase the \(SSR\), and never decrease it, i.e., \[ SSR_r \geq SSR_{ur} \,. \] It can be shown that if the hypotheses in \(H_0\) are true (and the noise terms are normally distributed), then \[ F = \frac{(SSR_r - SSR_{ur})/J}{SSR_{ur}/(N-K)} \sim F_{(J,N-K)} \tag{6.22}\] where \(J\) is the number of restrictions being tested (in our example, \(J=2\)) and \(K\) is the number of coefficients to be estimated (including intercept; in our example, \(K=3\)). The F-statistic is always non-negative. The idea is that if the hypotheses in \(H_0\) are true, then imposing the restrictions on the regression would not increase the \(SSR\) by much, and \(F\) will be close to zero. On the other hand, if one or more of the hypotheses in \(H_0\) are false, then imposing them on the regression will cause the \(SSR\) to increase substantially, and the \(F\) statistic will be large. We take a very large \(F\)-statistic, meaning \[ F > F_{\alpha,J,N-K}\,, \] as statistical evidence that one or more of the hypotheses is false, where \(F_{\alpha,J,N-K}\) is the \((1-\alpha)\)-percentile of the \(F_{J,N-K}\) distribution and where \(\alpha\) is typically 0.10, 0.05 or 0.01.

Since \(R^2 = 1 - SSR/SST\) , we can write the \(F\) -statistic in terms of \(R^2\) instead of \(SSR\) . You are asked in an exercise to show that the \(F\) -statistic can be written as \[ F = \frac{(R_{ur}^2 -R_r^2)/J}{(1-R_{ur}^2)/(N-K)}. \] Imposing restrictions cannot increase \(R^2\) , and in general will decrease it. The \(F\) -test essentially tests if the \(R^2\) drops significantly when the restrictions are imposed. If the hypotheses being tested are true, then the drop should be slight. If one or more are false, the drop should be substantial, resulting in a large \(F\) -statistic.

If you cannot assume that the noise terms are conditionally normally distributed, then you will have to use an asymptotic approximation. It can be shown that \[ JF \rightarrow_d \chi_{(J)}^2 \] as \(N \rightarrow \infty\) , where \(J\) is the number of hypotheses being jointly tested, and \(F\) is the \(F\) -statistic Eq.  6.22 . We refer to this as the “Chi-square Test”.

Example 6.9 We continue with Example 6.5. We estimate the model using lm() and store the results in mdl. We use the summary() function to display the results.

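The estimation code is not shown in this extract; a minimal sketch, continuing with the data frame df from the Example 6.5 sketch above (the object name mdl follows the text):

```r
mdl <- lm(Y ~ X + Z, data = df)
summary(mdl)
```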
The t-statistics are for testing (separately) \(\beta_x=0\) and \(\beta_z=0\). The F-statistic that is reported is for testing both of these hypotheses jointly, i.e., \(H_0:\beta_x=0 \; \text{ and } \; \beta_z=0\) versus the alternative that one or both do not hold, and the p-value listed next to the F-statistic is the probability that an \(F_{(2,117)}\) random variable exceeds the computed F-statistic. In this example, we resoundingly reject the null that both coefficients are zero. The residual standard error is the square root of \(\widehat{\sigma^2}\), the multiple R-squared is the \(R^2\) discussed earlier, and the “Adjusted R-Squared” is the modified \(R^2\) discussed previously.

As an illustration of the general F test, suppose instead we wish to test that \(\beta_0=1\) and \(\beta_z=3\beta_x\) . The restricted regression is \[ Y = 1 + \beta_x X + 3\beta_x Z + \epsilon = 1 + \beta_x (X+3Z) + \epsilon\,. \] The OLS estimator for the only parameter in the restricted regression, \(\beta_x\) , can be obtained from a regression of \(Y_i - 1\) on \((X_i+3Z_i)\) with no intercept term. We have

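The listing is not reproduced here; a sketch of the restricted fit just described, continuing with df and the assumed column names (the object names fit_r and beta_x_r are also assumptions):

```r
# Regression of Y - 1 on (X + 3Z) with no intercept term
fit_r    <- lm(I(Y - 1) ~ 0 + I(X + 3 * Z), data = df)
beta_x_r <- coef(fit_r)[1]
beta_x_r
```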
The restricted residuals can be computed as \[ \hat{\epsilon}_{i,r} = Y_i - 1 - \hat{\beta}_{x,r}(X_i + 3Z_i)\,. \] The F-statistic and associated p-value are then

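A sketch of the computation (not the original listing), using Eq. 6.22, the unrestricted fit mdl and the restricted estimate beta_x_r from the sketches above:

```r
eps_ur <- resid(mdl)                                   # unrestricted residuals
eps_r  <- df$Y - 1 - beta_x_r * (df$X + 3 * df$Z)      # restricted residuals

SSR_ur <- sum(eps_ur^2)
SSR_r  <- sum(eps_r^2)

J <- 2                       # two restrictions: beta_0 = 1 and beta_z = 3 * beta_x
K <- 3                       # coefficients in the unrestricted model
N <- nrow(df)

Fstat <- ((SSR_r - SSR_ur) / J) / (SSR_ur / (N - K))
pval  <- 1 - pf(Fstat, J, N - K)
c(F = Fstat, p = pval)
```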
The function linearHypothesis() in the car package can also be used to carry out the F- and Chi-sq tests.

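A usage sketch (not from the original notes): linearHypothesis() takes the fitted model and the restrictions written as equations in the coefficient names, and the test argument switches between the F and chi-square versions. For illustration it is applied here to the joint hypothesis \(\beta_x = 0\) and \(\beta_z = 0\), the one reported by summary():

```r
library(car)

linearHypothesis(mdl, c("X = 0", "Z = 0"), test = "F")      # F test
linearHypothesis(mdl, c("X = 0", "Z = 0"), test = "Chisq")  # chi-square version
```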
6.5 Exercises

Exercise 6.1 Each of the following regressions produces a sample regression function whose slope is \(\hat{\beta}_1\) when \(X_i < \xi\) and \(\hat{\beta}_1 + \hat{\alpha}_1\) when \(X_i \geq \xi\) . Which of them produces a sample regression function that is continuous at \(\xi\) ?

\(Y_i = \beta_0 + \beta_1 X_i + \alpha_1 D_i X_i + \epsilon\) where \(D_i\) is a dummy variable with \(D_i = 1\) if \(X_i > \xi\) , \(D_i = 0\) otherwise;

\(Y_i = \beta_0 + \alpha_0 D_i + \beta_1 X_i + \alpha_1 D_i X_i + \epsilon_i\);

\(Y_i = \beta_0 + \beta_1 X_i + \alpha_1 (X_i-\xi)_+ + \epsilon_i\) where \[ (X_i-\xi)_+ = \begin{cases} X_i - \xi & \text{if} \; X_i > \xi \;, \\ 0 & \text{if} \; X_i \leq \xi\;. \end{cases} \]

Exercise 6.2 The following is a “piecewise quadratic regression” model \[ Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3(X-\xi)_+^2 + \epsilon\,,\; E[\epsilon | X] = 0\,, \] where \[ (X-\xi)_+^2 = \begin{cases} (X - \xi)^2 & \text{if} \; X > \xi \;, \\ 0 & \text{if} \; X \leq \xi\;. \end{cases} \] Show that the PRF \(E[Y|X]\) is “piecewise quadratic”, following one quadratic equation when \(X\leq\xi\), and another when \(X > \xi\). Show that the PRF is continuous, with continuous first derivative.

Exercise 6.3 Suppose your estimated sample regression function is \[ \widehat{wage} = \hat{\alpha}_0 + \hat{\alpha}_1\,age + \hat{\alpha}_2\,age^2 = -68.28 + 4.163\,age - 0.052\,age^2 \] so that \(\hat{\alpha}_1\) is positive and \(\hat{\alpha}_2\) is negative. At what age are wages predicted to start declining with age? Does the intercept have any reasonable economic interpretation?

Exercise 6.4 Prove Eq.  6.14 .

Exercise 6.5 Show that the \(F\) -statistic in Eq.  6.22 can be written as \[ F = \frac{(R_{ur}^2 - R_r^2)/J}{(1-R_{ur}^2)/(N-K)} \] where \(R_{ur}\) and \(R_r\) are the \(R^2\) from the unrestricted and restricted regressions respectively, \(J\) is the number of restrictions being tested, \(N\) is the number of observations used in the regression, and \(K\) is the number of coefficient parameters in the unrestricted regression model (including intercept). What does this expression simplify to when testing that all the slope coefficients (excluding the intercept) are equal to zero?

Exercise 6.6 Modify the code in Example  6.8 to collect the F-statistic for jointly testing \(\beta_x = 0\) and \(\beta_z=0\) . Show that the 5% F-test is empirically correctly sized, meaning that the frequency of rejection in the simulation is in fact around 5%.

Exercise 6.7 Suppose \[ \begin{aligned} Y &= \alpha_0 + \alpha_1 X + \alpha_2 Z + u \\ Z &= \delta_1 X + v \end{aligned} \] where \(u\) and \(v\) are independent zero-mean noise terms. Suppose you have a random sample \(\{Y_i,X_i,Z_i\}_{i=1}^N\) and you ran the regression \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i. \] Show that the OLS estimator \(\hat{\beta}_1\) will be biased for \(\alpha_1\). What is its expectation? Show that the prediction rule \[ \hat{Y}=\hat{\beta}_0 + \hat{\beta}_1X \] still provides unbiased predictions, but that the prediction error variance is greater than the prediction error variance from using the prediction rule \[ \hat{Y} = \hat{\alpha}_0 + \hat{\alpha}_1 X + \hat{\alpha}_2 Z \] where \(\hat{\alpha}_0\), \(\hat{\alpha}_1\), and \(\hat{\alpha}_2\) are the OLS estimators for \(\alpha_0\), \(\alpha_1\) and \(\alpha_2\) in the regression \[ Y_i = \alpha_0 + \alpha_1 X_i + \alpha_2 Z_i + u_i\,. \]

Exercise 6.8 Verify all of the results reported in Example  6.9 by calculating them directly using the formulas developed in the notes (in particular, verify the coefficient estimates, standard errors, t-statistics and associated p-values, the residual standard error, the multiple R-squared and Adjusted R-squared, the F-statistic and the corresponding p-value).

If we use \(X_{1}\) , \(X_{2}\) , etc. to denote different variables, then the \(i\) th observation of regressor \(X_{j}\) will be denoted \(X_{j,i}\) . If we use \(X\) , \(Y\) , \(Z\) , to denote different variables, then the \(i\) th observation of these variables will be denoted \(X_{i}\) , \(Y_{i}\) , \(Z_{i}\) . ↩︎


Multiple Linear Regression in SPSS

Discover Multiple Linear Regression in SPSS ! Learn how to perform, understand SPSS output , and report results in APA style. Check out this simple, easy-to-follow guide below for a quick read!



Introduction

Welcome to our comprehensive guide on Multiple Linear Regression in SPSS . In the dynamic world of statistics, understanding the nuances of Multiple Linear Regression is key for researchers and analysts seeking a deeper understanding of relationships within their data. This blog post is your roadmap to mastering Multiple Linear Regression using the Statistical Package for the Social Sciences (SPSS).

From unraveling the fundamentals to providing practical insights through examples, this guide aims to demystify the complexities, making Multiple Linear Regression accessible to both beginners and seasoned data enthusiasts.

Definition: Multiple Linear Regression

Multiple Linear Regression expands upon the principles of Simple Linear Regression by accommodating multiple independent variables. In essence, it assesses the linear relationship between the dependent variable and two or more predictors. The model’s flexibility allows for a more realistic representation of real-world scenarios where outcomes are influenced by multiple factors. By incorporating multiple predictors, this technique offers a nuanced understanding of how each variable contributes to the variation in the dependent variable. This section serves as a gateway to the intricacies of Multiple Linear Regression, setting the stage for a detailed exploration of its components and applications in subsequent sections.

Linear Regression Methods

Multiple Linear Regression encompasses various methods for building and refining models to predict a dependent variable based on multiple independent variables. These methods help researchers and analysts tailor regression models to the specific characteristics of their data and research questions. Here are some key methods in Multiple Linear Regression:

Ordinary Least Squares (OLS)

OLS is the most common method used in Multiple Linear Regression . It minimizes the sum of squared differences between observed and predicted values, aiming to find the coefficients that best fit the data. OLS provides unbiased estimates if the assumptions of the regression model are met.

Stepwise Regression

In stepwise regression , the model-building process involves adding or removing predictor variables at each step based on statistical criteria. The algorithm evaluates variables and decides whether to include or exclude them in a stepwise manner. It can be forward (adding variables) or backward (removing variables) stepwise regression.

Backward Regression

Backward regression begins with a model that includes all predictor variables and then systematically removes the least significant variables based on statistical tests. This process continues until the model only contains statistically significant predictors. It’s a simplification approach aimed at retaining only the most influential variables.

Forward Regression

Forward regression starts with an empty model and incrementally adds the most significant predictor variables based on statistical tests. This iterative process continues until the addition of more variables does not significantly improve the model. Forward regression helps identify the most relevant predictors contributing to the model’s explanatory power.

Hierarchical Regression

In hierarchical regression, predictor variables are entered into the model in a pre-defined sequence or hierarchy. This method allows researchers to examine the impact of different sets of variables on the dependent variable, taking into account their hierarchical or logical order. The most common approach involves entering blocks of variables at different steps, and assessing how each set contributes to the overall predictive power of the model.

Understanding these model-building methods is crucial for selecting the most appropriate strategy based on the specific goals of your analysis and the characteristics of your dataset. Each approach has its advantages and considerations, influencing the interpretability and complexity of the final regression model.

Regression Equation

The Multiple Regression Equation in Multiple Linear Regression takes the form of

Y = b0 + b1X1 + b2X2 + … + bnXn , where

  • Y is the predicted value of the dependent variable,
  • b0 is the intercept,
  • b1, b2, …, bn are the regression coefficients for each independent variable (X1, X2, …, Xn).

The regression coefficients represent the change in the dependent variable for a one-unit change in the corresponding independent variable, holding the other predictors constant, while the intercept is the predicted value when all independent variables are zero. Understanding the interplay between these components is essential for deciphering the impact of each predictor on the overall model. In the upcoming sections, we’ll delve deeper into specific aspects of Multiple Linear Regression, such as the role of dummy variables and the critical assumptions that underpin this statistical method.

What are Dummy Variables?

In the realm of Multiple Linear Regression , dummy variables are pivotal when dealing with categorical predictors. These variables allow us to include categorical data, like gender or region, in our regression model. Consider a binary categorical variable, such as gender (Male/Female). We represent this in our equation using a dummy variable, where one category is assigned 0 and the other 1. For instance, if Male is our reference category, the dummy variable would be 1 for Female and 0 for Male. This inclusion of categorical information enhances the model’s flexibility, capturing the nuanced impact of different categories on the dependent variable. As we explore Multiple Linear Regression further, understanding the role of dummy variables becomes paramount for robust and accurate analyses.

Assumption of Multiple Linear Regression

Before diving into Multiple Linear Regression analysis, it’s crucial to be aware of the underlying assumptions that bolster the reliability of the results.

  • Linearity : Assumes a linear relationship between the dependent variable and all independent variables. The model assumes that changes in the dependent variable are proportional to changes in the independent variables.
  • Independence of Residuals : Assumes that the residuals (the differences between observed and predicted values) are independent of each other. The independence assumption is crucial to avoid issues of autocorrelation and ensure the reliability of the model.
  • Homoscedasticity : Assumes that the variability of the residuals remains constant across all levels of the independent variables. Homoscedasticity ensures that the spread of residuals is consistent, indicating that the model’s predictions are equally accurate across the range of predictor values.
  • Normality of Residuals : Assumes that the residuals follow a normal distribution. Normality is essential for making valid statistical inferences and hypothesis testing. Deviations from normality may impact the accuracy of confidence intervals and p-values.
  • No Perfect Multicollinearity : Assumes that there is no perfect linear relationship among the independent variables. Perfect multicollinearity can lead to unstable estimates of regression coefficients, making it challenging to discern the individual impact of each predictor.

These assumptions collectively form the foundation of Multiple Linear Regression analysis . Ensuring that these conditions are met enhances the validity and reliability of the statistical inferences drawn from the model. In the subsequent sections, we will delve into hypothesis testing in Multiple Linear Regression, provide practical examples, and guide you through the step-by-step process of performing and interpreting Multiple Linear Regression analyses using SPSS.

Hypothesis of Multiple Linear Regression

Hypothesis testing in Multiple Linear Regression operates at two levels: individual t-tests assess whether each predictor’s coefficient differs significantly from zero, while the overall F-test assesses whether the predictors, taken together, explain a significant share of the variation in the dependent variable. The hypotheses below refer to the overall test.

  • Null Hypothesis (H0): The regression coefficients for all independent variables are simultaneously equal to zero.
  • Alternative Hypothesis (H1): At least one regression coefficient for an independent variable is not equal to zero.

The hypothesis testing in Multiple Linear Regression revolves around assessing whether the collective set of independent variables has a statistically significant impact on the dependent variable. The null hypothesis suggests no overall effect, while the alternative hypothesis asserts the presence of at least one significant relationship. This testing framework guides the evaluation of the model’s overall significance, providing valuable insights into the joint contribution of the predictor variables.

Example of Simple Multiple Regression

To illustrate the concepts of Multiple Linear Regression, let’s consider an example. Imagine you are studying the factors influencing house prices, with predictors such as square footage, number of bedrooms, and distance to the city centre. By applying Multiple Linear Regression, you can model how these factors collectively influence house prices.

The regression equation would look like:

Price = b0 + b1(square footage) + b2(number of bedrooms) + b3(distance to city centre).

Through this example, you’ll gain practical insights into how Multiple Linear Regression can untangle complex relationships and offer a comprehensive understanding of the factors affecting the dependent variable.

How to Perform Multiple Linear Regression using SPSS Statistics


Step by Step: Running Regression Analysis in SPSS Statistics

Now, let’s delve into the step-by-step process of conducting a Multiple Linear Regression in SPSS Statistics:

  • STEP: Load Data into SPSS

Commence by launching SPSS and loading your dataset, which should encompass the variables of interest – the dependent (outcome) variable and the independent (predictor) variables. If your data is not already in SPSS format, you can import it by navigating to File > Open > Data and selecting your data file.

  • STEP: Access the Analyze Menu

In the top menu, locate and click on “Analyze.” Within the “Analyze” menu, navigate to “Regression” and choose “Linear”: Analyze > Regression > Linear

  • STEP: Choose Variables

A dialogue box will appear. Move the dependent variable (the one you want to predict) to the “Dependent” box and the independent variables to the “Independent(s)” box.

  • STEP: Generate SPSS Output

Once you have specified your variables and chosen any options, click the “OK” button to perform the analysis. SPSS will generate a comprehensive output, including the Model Summary, ANOVA, and Coefficients tables for your regression.

Executing these steps initiates the Multiple Linear Regression in SPSS, allowing researchers to assess how the independent variables individually and jointly predict the dependent variable. In the next section, we will delve into the interpretation of the SPSS output for Multiple Linear Regression.

Conducting a Multiple Linear Regression in SPSS provides a robust foundation for understanding the key features of your data. Always ensure that you consult the documentation corresponding to your SPSS version, as steps might slightly differ based on the software version in use. This guide is tailored for SPSS version 25 , and for any variations, it’s recommended to refer to the software’s documentation for accurate and updated instructions.

SPSS Output for Multiple Regression Analysis


How to Interpret SPSS Output of Multiple Regression

Deciphering the SPSS output of Multiple Linear Regression is a crucial skill for extracting meaningful insights. Let’s focus on three tables in the SPSS output:

Model Summary Table

  • R (Multiple Correlation Coefficient): In multiple regression this value ranges from 0 to 1 and indicates the strength of the linear relationship between the observed values of the dependent variable and the values predicted by the model.
  • R-Square (Coefficient of Determination) : Represents the proportion of variance in the dependent variable explained by the independent variables. Higher values indicate a better fit of the model.
  • Adjusted R Square : Adjusts the R-squared value for the number of predictors in the model, providing a more accurate measure of goodness of fit.

ANOVA Table

  • F (ANOVA Statistic): Indicates whether the overall regression model is statistically significant. A significant F-value suggests that the model is better than a model with no predictors.
  • df (Degrees of Freedom): Represents the degrees of freedom associated with the F-test.
  • P values : The probability of obtaining the observed F-statistic by random chance. A low p-value (typically < 0.05) indicates the model’s significance.

Coefficient Table

  • Unstandardized Coefficients (B): Provides the individual regression coefficients for each predictor variable.
  • Standardized Coefficients (Beta): Standardizes the coefficients, allowing for a comparison of the relative importance of each predictor.
  • t-values : Indicate how many standard errors the coefficients are from zero. Higher absolute t-values suggest greater significance.
  • P values : Test the null hypothesis that the corresponding coefficient is equal to zero. A low p-value suggests that the predictors are significantly related to the dependent variable.

Understanding these tables in the SPSS output is crucial for drawing meaningful conclusions about the strength, significance, and direction of the relationship between variables in a Multiple Linear Regression analysis.

  How to Report Results of Multiple Linear Regression in APA

Effectively communicating the results of Multiple Linear Regression in compliance with the American Psychological Association (APA) guidelines is crucial for scholarly and professional writing.

  • Introduction : Begin the report with a concise introduction summarizing the purpose of the analysis and the relationship being investigated between the variables.
  • Assumption Checks: If relevant, briefly mention the checks for assumptions such as linearity, independence, homoscedasticity, and normality of residuals to ensure the robustness of the analysis.
  • Significance of the Model : Comment on the overall significance of the model based on the ANOVA table. For example, “The overall regression model was statistically significant (F = [value], p = [value]), suggesting that the predictors collectively contributed to the prediction of the dependent variable.”
  • Regression Equation : Present the Multiple Regression equation, highlighting the intercept and regression coefficients for each predictor variable.
  • Interpretation of Coefficients : Interpret the coefficients, focusing on the slope (b1..bn) to explain the strength and direction of the relationship. Discuss how a one-unit change in the independent variable corresponds to a change in the dependent variable.
  • R-squared Value: Include the R-squared value to highlight the proportion of variance in the dependent variable explained by the independent variables. For instance, “The R-squared value of [value] indicates that [percentage]% of the variability in [dependent variable] can be explained by the linear relationship with [independent variables].”
  • Conclusion : Conclude the report by summarizing the key findings and their implications. Discuss any practical significance of the results in the context of your study.



Multiple Linear Regression


About this lesson

Many times there are multiple factors that are influencing the response variable in a problem. Multiple regression determines the relationship between the response factor and multiple control factors. Like with simple linear regression, a formula is created that allows both analysis and prediction of the process and problem.


Quick reference

Multiple linear regression analysis is the creation of an equation with multiple independent X variables that all influence a Y response variable.  This equation is based upon an existing data set and models the conditions represented in the data.

When to use

When there are multiple independent variables that correlate with the system response, a multiple linear regression should be done.  This can be used to predict process performance and identify which factors have the primary impact on process performance.

Instructions

Multiple linear regression is the appropriate technique to use when the data set has multiple continuous independent input variables and a continuous response variable.  The technique determines which variables are statistically significant and creates an equation that shows the relationship of the variables to the response.  To improve the accuracy of the analysis, there should be at least ten data points for each independent variable.  The equation takes on the form:

Y = a + b₁X₁ + b₂X₂ + b₃X₃ + …

Where the absolute value of the “b” coefficients shows the relative importance of each variable.

Multiple linear regression can be used to predict process performance based on the values of the inputs.  Input levels for ideal performance can be defined and tolerance levels that ensure acceptable performance can be determined using the regression equation.  The equation will also be helpful for setting process controls.

Excel does not have a multiple linear regression function. The analysis can be done in Minitab using the “Fit Regression Model” option in the Regression menu. This will display an input panel where the response variable and input variables can be selected. If the analysis shows a variable is not statistically significant, check the residual plots to see whether the residuals look normally distributed. If not, remove the variable that is not statistically significant and rerun the analysis; the normality of the residuals should improve.

Hints & tips

  • Too many variables increase uncertainty in the analysis.  There should be at least ten data points for each variable (e.g. if using three variables have at least 30 data points). 
  • Drop variables that are not statistically significant to improve the accuracy of the equation.
  • The analysis assumes a linear (straight line) effect.  If the residuals indicate a bad fit, you will need to add higher-order terms and create a non-linear analysis.  This is discussed in another lesson.
  • Always check the residual analysis to ensure it is normally distributed with equal variance and indicates independence.
  • 00:04 Hi, I'm Ray Sheen.
  • 00:05 Sometimes the dependent or
  • 00:07 response variable in the analysis depends upon more than one factor.
  • 00:12 When that happens, you may need to do a multi-linear regression analysis.
  • 00:18 >> Once again, I'll start with our decision tree for hypothesis testing.
  • 00:22 When we have continuous variables for the process response and
  • 00:26 the process independent variables, we turn to the regression analysis, and when we
  • 00:30 have multiple variables at the same time, we use the multiple regression technique.
  • 00:36 Let's take a few minutes to explain what we mean by multiple regression analysis.
  • 00:42 Recall that regression analysis determines the relationship between process
  • 00:45 variables.
  • 00:47 And it's no surprise that multiple regression considers multiple independent
  • 00:51 variables instead of just one.
  • 00:53 The analysis determines the impact of each of these independent
  • 00:57 variables upon the dependent variable.
  • 01:00 If you tried a simple linear analysis and it wasn't a good fit,
  • 01:03 you can consider adding some additional terms.
  • 01:07 Now, regardless of the number of terms, the format for
  • 01:09 the hypothesis tests are still the same as a simple linear regression.
  • 01:13 The null hypothesis is that there is no relationship, and
  • 01:17 the alternative hypothesis is that there is a relationship.
  • 01:21 The analysis will determine the relative significance of each of the factors to
  • 01:25 each other in addition to the dependent factor.
  • 01:28 It will show up with coefficients of these factors.
  • 01:31 The form of the multiple linear equation is a dependent variable y is equal to
  • 01:36 a constant indicated by a in this equation plus a term with each of the variables,
  • 01:41 which has a coefficient associated with it.
  • 01:44 So it's beta 1 times x1 + beta 2 times x 2 + beta 3 times x 3 and so on.
  • 01:53 Multiple regression analysis is particularly useful for
  • 01:56 predicting process performance.
  • 01:59 The multiple regression analysis will result in an equation that relates all
  • 02:03 the independent variables to the dependent variable.
  • 02:07 This analysis provides the terms and the coefficients.
  • 02:12 This equation is incredibly helpful when you're designing the solution for
  • 02:15 a problem in a Lean Six Sigma Project.
  • 02:18 The equation predicts the dependent variable performance based upon whatever
  • 02:22 values you've selected for the independent variables.
  • 02:25 So, as you're designing the solution, you may want to design one of your independent
  • 02:30 variables to be in a particular zone that's well controlled.
  • 02:34 This will then help you determine how with the other variables you can achieve
  • 02:38 the desired performance.
  • 02:40 Based upon the scaling constants for each of the factors,
  • 02:44 you can also decide which factors will have the primary control for the process.
  • 02:49 I prefer to use one easily controlled independent factor to control
  • 02:53 an overall process, and if possible to set the other factors and
  • 02:58 zones that are very easy to lock into a standard setting.
  • 03:02 You can't always do that, but if you can, it makes process control much easier.
  • 03:07 So let's look at how we conduct a multiple regression analysis.
  • 03:11 Excel does not have a function for
  • 03:13 conducting multiple linear regression analysis.
  • 03:16 So we will have to rely on Minitab.
  • 03:19 In Minitab, go to the stat pulldown menu, select Regression, select
  • 03:24 Regression again, and then select Fit Regression Model just like is shown here.
  • 03:30 That will bring up this panel.
  • 03:32 Place your cursor in the response window to activate the list of
  • 03:36 data values in the window on the left ,then select the dependent variable
  • 03:41 often referred to as the y factor and click on the selection button.
  • 03:46 The column should move to the response window.
  • 03:48 Now, place your cursor in the continuous predictors and
  • 03:52 then select the appropriate column.
  • 03:54 You can also use categorical or discrete factors if you have them.
  • 03:58 However, if you're using these factors, I always recommend you use factors that
  • 04:02 are binary (two-level), such as true/false.
  • 04:04 Set one of those to a value of 1 and the other to a value of 0.
  • 04:08 And one more point, you get the residual plots by selecting the graph button and
  • 04:13 then choosing the Four in One residual plots option.
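(For readers working outside Minitab, here is a minimal, roughly equivalent sketch in R; the data frame and column names below are placeholders, not values from this lesson.)

# Fit a multiple regression model, roughly equivalent to Minitab's
# Stat > Regression > Regression > Fit Regression Model
# (process_data, response, x1, x2, x3 are hypothetical names)
fit <- lm(response ~ x1 + x2 + x3, data = process_data)

summary(fit)                      # coefficients, standard errors, t-tests, overall F-test
par(mfrow = c(2, 2)); plot(fit)   # four residual diagnostic plots, similar to "Four in One"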
  • 04:17 Let's finish off this topic with a few warnings and
  • 04:20 some pitfalls when doing multiple regression analysis.
  • 04:24 This analysis still assumes linear effects, which means straight-line effects for
  • 04:28 each of the independent variables.
  • 04:31 We'll look at interactive effects when we look at nonlinear regression in
  • 04:34 another lesson.
  • 04:36 Adding lots of independent variables can increase uncertainty.
  • 04:40 If you find that some factor has virtually no effect,
  • 04:43 then I recommend removing it from your analysis to simplify things.
  • 04:48 Check your residual plots to make sure the residuals are normally distributed.
  • 04:53 This is another indication that you have a good solution.
  • 04:56 And finally, too many factors create too many potential interactions,
  • 05:00 and it becomes difficult to statistically validate the effect of each independent
  • 05:04 variable.
  • 05:05 A good rule of thumb is that your data set should have at least 10 times
  • 05:10 the number of independent factors being analyzed.
  • 05:14 So if you want to analyze four factors at once,
  • 05:17 the data set needs to have a minimum of 40 points.
  • 05:20 Also, when there are many independent factors in the analysis,
  • 05:24 the regression formula becomes much more sensitive to outliers.
  • 05:29 >> In many cases,
  • 05:30 the multiple linear regression analysis is just what is needed to understand
  • 05:34 how the handful of independent variables affects the overall process output.
  • 05:40 The formula created is extremely helpful when
  • 05:44 determining the optimal solution for your problem.


Inference for Linear Regression

Kelly McConville

Stat 100 Week 12 | Spring 2024


Goals for Today

Recap multiple linear regression

Check assumptions for linear regression inference

Hypothesis testing for linear regression

Estimation and prediction inference for linear regression


What does statistical inference (estimation and hypothesis testing) look like when I have more than 0 or 1 explanatory variables?

One route: Multiple Linear Regression!

Multiple Linear Regression

Linear regression is a flexible class of models that allow for:

Both quantitative and categorical explanatory variables.

Multiple explanatory variables.

Curved relationships between the response variable and the explanatory variable.

BUT the response variable is quantitative.

In this week’s p-set you will explore the importance of controlling for key explanatory variables when making inferences about relationships.

Form of the Model:

\[ \begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \end{align} \]

Fitted Model: Using the Method of Least Squares,

\[ \begin{align} \hat{y} &= \hat{\beta}_o + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p \end{align} \]
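As a concrete illustration, here is a minimal sketch in R using the built-in mtcars data (the variables are placeholders, not the course data):

# Fit a multiple linear regression by the Method of Least Squares
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

coef(fit)      # the fitted coefficients (beta-hats)
summary(fit)   # coefficient table with standard errors, t-statistics, and p-values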

Typical Inferential Questions – Hypothesis Testing

Should \(x_2\) be in the model that already contains \(x_1\) and \(x_3\) ? Also often asked as “Controlling for \(x_1\) and \(x_3\) , is there evidence that \(x_2\) has a relationship with \(y\) ?”

\[ \begin{align} y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \end{align} \]

In other words, should \(\beta_2 = 0\) ?

Typical Inferential Questions – Estimation

After controlling for the other explanatory variables, what is the range of plausible values for \(\beta_3\) (which summarizes the relationship between \(y\) and \(x_3\) )?

Typical Inferential Questions – Prediction

While \(\hat{y}\) is a point estimate for \(y\) , can we also get an interval estimate for \(y\) ? In other words, can we get a range of plausible predictions for \(y\) ?

To answer these questions, we need to add some assumptions to our linear regression model.

Additional Assumptions:

\[ \epsilon \overset{\mbox{ind}}{\sim} N (\mu = 0, \sigma = \sigma_{\epsilon}) \]

\(\sigma_{\epsilon}\) = typical deviations from the model

Let’s unpack these assumptions!

Assumptions – Independence

For ease of visualization, let’s assume a simple linear regression model:

Assumption : The cases are independent of each other.

Question : How do we check this assumption?

Consider how the data were collected.

Assumptions – Normality

\[ \begin{align*} y = \beta_o + \beta_1 x_1 + \epsilon \quad \mbox{ where } \quad \epsilon \overset{\mbox{ind}}{\sim}\color{black}{N} \left(0, \sigma_{\epsilon} \right) \end{align*} \]

Assumption : The errors are normally distributed.

Recall the residual: \(e = y - \hat{y}\)

QQ-plot: Plot the residuals against the quantiles of a normal distribution!
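A minimal sketch in base R (using a placeholder model fit on the built-in mtcars data):

fit <- lm(mpg ~ wt + hp, data = mtcars)   # placeholder model; use your own fit

qqnorm(resid(fit))   # residuals plotted against normal quantiles
qqline(resid(fit))   # reference line; roughly linear points support normality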


Assumptions – Mean of Errors

Assumption : The points will, on average, fall on the line.

If you use the Method of Least Squares, then you don’t have to check.

It will be true by construction:

\[ \sum e = 0 \]

Assumptions – Constant Variance

Assumption : The variability in the errors is constant.

One option : Scatterplot


Better option (especially when you have more than 1 explanatory variable): Residual Plot

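A minimal residual-plot sketch in base R (again with a placeholder model):

fit <- lm(mpg ~ wt + hp, data = mtcars)   # placeholder model; use your own fit

plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)   # look for roughly constant spread around the zero line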

Assumptions – Model Form

Assumption : The model form is appropriate.

One option : Scatterplot(s)


Assumption Checking

Question : What if the assumptions aren’t all satisfied?

Try transforming the data and building the model again.

Use a modeling technique beyond linear regression.

Question : What if the assumptions are all (roughly) satisfied?

  • Can now start answering your inference questions!

Let’s now look at an example and learn how to create qq-plots and residual plots in R .

Example: covid and candle ratings.

Kate Petrova created a dataset that made the rounds on Twitter:


COVID and Candle Ratings

She posted all her data and code to GitHub and I did some light wrangling so that we could answer the question:

Do we have evidence that early in the pandemic the association between time and Amazon rating varies by whether or not a candle is scented and in particular, that scented candles have a steeper decline in ratings over time?

In other words, do we have evidence that we should allow the slopes to vary?


Checking assumptions:

Question : What needs to be true about the candles sampled?

Assumption Checking in R

The R package we will use to check model assumptions is called gglm and was written by one of my former Reed students, Grayson White.

First need to fit the model:

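The fitting code appears as a screenshot in the original slides; here is a hedged sketch of what it might look like, assuming the data frame is called candles with columns Rating, Date, and Type (scented vs. unscented); these names are assumptions.

library(gglm)   # diagnostic-plot package mentioned above

# Equal-slopes model: Date and Type enter additively (column names assumed)
candle_mod_eq  <- lm(Rating ~ Date + Type, data = candles)

# Different-slopes model: the interaction lets the Date slope differ by Type
candle_mod_int <- lm(Rating ~ Date * Type, data = candles)

gglm(candle_mod_int)   # residual plot, QQ-plot, and other diagnostics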

Residual Plot


Hypothesis Testing

Question : What tests is get_regression_table() conducting?

For the moment, let’s focus on the equal slopes model.

In General :

\[ H_o: \beta_j = 0 \quad \mbox{assuming all other predictors are in the model} \] \[ H_a: \beta_j \neq 0 \quad \mbox{assuming all other predictors are in the model} \]

For our Example :

\[ H_o: \beta_1 = 0 \quad \mbox{given Type is already in the model} \] \[ H_a: \beta_1 \neq 0 \quad \mbox{given Type is already in the model} \]

\[ H_o: \beta_2 = 0 \quad \mbox{given Date is already in the model} \] \[ H_a: \beta_2 \neq 0 \quad \mbox{given Date is already in the model} \]

Test Statistic: Let \(p\) = number of explanatory variables.

\[ t = \frac{\hat{\beta}_j - 0}{SE(\hat{\beta}_j)} \sim t(df = n - p - 1) \]

when \(H_o\) is true and the model assumptions are met.

Our Example

Test Statistic:

\[ t = \frac{\hat{\beta}_2 - 0}{SE(\hat{\beta}_2)} = \frac{0.831 - 0}{0.063} = 13.2 \]

with p-value \(= P(t \leq -13.2) + P(t \geq 13.2) \approx 0.\)
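In R, the two-sided p-value can be computed directly from the t statistic; a minimal sketch (the residual degrees of freedom below are a placeholder, so substitute \(n - p - 1\) from your own fit):

t_stat   <- 13.2
df_resid <- 100                       # placeholder: replace with the model's residual df
2 * pt(-abs(t_stat), df = df_resid)   # essentially 0, matching the result above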

There is evidence that including whether or not the candle is scented adds useful information to the linear regression model for Amazon ratings that already controls for date.


One More Example – Prices of Houses in Saratoga Springs, NY

Does whether or not a house has central air conditioning relate to its price for houses in Saratoga Springs?

Potential confounding variables?

  • Notice that you generally don’t include interaction terms for the control variables.

Now let’s shift our focus to estimation and prediction!

Typical Inferential Question:

After controlling for the other explanatory variables, what is the range of plausible values for \(\beta_j\) (which summarizes the relationship between \(y\) and \(x_j\) )?

Confidence Interval Formula: \(\hat{\beta}_j \pm t^* \, SE(\hat{\beta}_j)\), where \(t^*\) is the appropriate quantile of the \(t\) distribution with the model's residual degrees of freedom.
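In R, these intervals are available via confint() (a minimal sketch with a placeholder model):

fit <- lm(mpg ~ wt + hp, data = mtcars)   # placeholder model; use your own fit
confint(fit, level = 0.95)                # 95% confidence interval for each coefficient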

Two Types of Predictions

Confidence interval for the mean response.

→ Defined at given values of the explanatory variables

→ Estimates the average response

→ Centered at \(\hat{y}\)

→ Smaller SE

Prediction Interval for an Individual Response

→ Predicts the response of a single, new observation

→ Larger SE

CI for mean response at a given level of X:

We want to construct a 95% CI for the average price of Saratoga Houses (in 2006!) where the houses meet the following conditions: 1500 square feet, 20 years old, 2 bathrooms, and have central air.

  • Interpretation : We are 95% confident that the average price of 20 year old, 1500 square feet Saratoga houses with central air and 2 bathrooms is between $199,919 and $211,834.

PI for a new Y at a given level of X:

Say we want to construct a 95% PI for the price of an individual house that meets the following conditions: 1500 square feet, 20 years old, 2 bathrooms, and has central air.

Notice : Predicting for a new observation not the mean!

  • Interpretation : For a 20 year old, 1500 square feet Saratoga house with central air and 2 bathrooms, we predict, with 95% confidence, that the price will be between $73,885 and $337,869.
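A hedged sketch of how both intervals might be computed in R, assuming a fitted model house_mod on a Saratoga houses data frame with columns livingArea, age, bathrooms, and centralAir (the object and column names are assumptions):

new_house <- data.frame(livingArea = 1500, age = 20,
                        bathrooms = 2, centralAir = "Yes")

predict(house_mod, newdata = new_house, interval = "confidence")   # CI for the mean price
predict(house_mod, newdata = new_house, interval = "prediction")   # PI for one new house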

Next Time: Comparing Models and Chi-Squared Tests!


6.4 - The Hypothesis Tests for the Slopes

At the beginning of this lesson, we translated three different research questions pertaining to heart attacks in rabbits ( Cool Hearts dataset ) into three sets of hypotheses we can test using the general linear F -statistic. The research questions and their corresponding hypotheses are:

Hypotheses 1

Is the regression model containing at least one predictor useful in predicting the size of the infarct?

  • \(H_{0} \colon \beta_{1} = \beta_{2} = \beta_{3} = 0\)
  • \(H_{A} \colon\) At least one \(\beta_{j} ≠ 0\) (for j = 1, 2, 3)

Hypotheses 2

Is the size of the infarct significantly (linearly) related to the area of the region at risk?

  • \(H_{0} \colon \beta_{1} = 0 \)
  • \(H_{A} \colon \beta_{1} \ne 0 \)

Hypotheses 3

(Primary research question) Is the size of the infarct area significantly (linearly) related to the type of treatment upon controlling for the size of the region at risk for infarction?

  • \(H_{0} \colon \beta_{2} = \beta_{3} = 0\)
  • \(H_{A} \colon \) At least one \(\beta_{j} ≠ 0\) (for j = 2, 3)

Let's test each of the hypotheses now using the general linear F -statistic:

\(F^*=\left(\dfrac{SSE(R)-SSE(F)}{df_R-df_F}\right) \div \left(\dfrac{SSE(F)}{df_F}\right)\)

To calculate the F -statistic for each test, we first determine the error sum of squares for the reduced and full models — SSE ( R ) and SSE ( F ), respectively. The number of error degrees of freedom associated with the reduced and full models — \(df_{R}\) and \(df_{F}\), respectively — is the number of observations, n , minus the number of parameters, p , in the model. That is, in general, the number of error degrees of freedom is n - p . We use statistical software, such as Minitab's F -distribution probability calculator, to determine the P -value for each test.
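For readers who prefer R over Minitab, the same general linear F-test can be carried out by fitting the reduced and full models and comparing them with anova(); a hedged sketch, assuming the rabbit data sit in a data frame called infarct with columns InfSize, Area, X2, and X3 (the data frame name is an assumption):

reduced <- lm(InfSize ~ 1, data = infarct)                # reduced model from the null hypothesis
full    <- lm(InfSize ~ Area + X2 + X3, data = infarct)   # full model with all predictors

anova(reduced, full)   # F* = ((SSE(R) - SSE(F)) / (df_R - df_F)) / (SSE(F) / df_F)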

Testing all slope parameters equal 0

Let's answer the first research question: "Is the regression model containing at least one predictor useful in predicting the size of the infarct?" To do so, we test the hypotheses:

  • \(H_{0} \colon \beta_{1} = \beta_{2} = \beta_{3} = 0 \)
  • \(H_{A} \colon\) At least one \(\beta_{j} \ne 0 \) (for j = 1, 2, 3)

The full model

The full model is the largest possible model — that is, the model containing all of the possible predictors. In this case, the full model is:

\(y_i=(\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3})+\epsilon_i\)

The error sum of squares for the full model, SSE ( F ), is just the usual error sum of squares, SSE , that appears in the analysis of variance table. Because there are 4 parameters in the full model, the number of error degrees of freedom associated with the full model is \(df_{F} = n - 4\).

The reduced model

The reduced model is the model that the null hypothesis describes. Because the null hypothesis sets each of the slope parameters in the full model equal to 0, the reduced model is:

\(y_i=\beta_0+\epsilon_i\)

The reduced model suggests that none of the variations in the response y is explained by any of the predictors. Therefore, the error sum of squares for the reduced model, SSE ( R ), is just the total sum of squares, SSTO , that appears in the analysis of variance table. Because there is only one parameter in the reduced model, the number of error degrees of freedom associated with the reduced model is \(df_{R} = n - 1 \).

Upon plugging in the above quantities, the general linear F -statistic:

\(F^*=\dfrac{SSE(R)-SSE(F)}{df_R-df_F} \div \dfrac{SSE(F)}{df_F}\)

becomes the usual " overall F -test ":

\(F^*=\dfrac{SSR}{3} \div \dfrac{SSE}{n-4}=\dfrac{MSR}{MSE}\)

That is, to test \(H_{0}\) : \(\beta_{1} = \beta_{2} = \beta_{3} = 0 \), we just use the overall F -test and P -value reported in the analysis of variance table:

Minitab reports the following Analysis of Variance table (not reproduced here) and regression equation:

Inf = - 0.135 + 0.613 Area - 0.2435 X2 - 0.0657 X3

There is sufficient evidence ( F = 16.43, P < 0.001) to conclude that at least one of the slope parameters is not equal to 0.

In general, to test that all of the slope parameters in a multiple linear regression model are 0, we use the overall F -test reported in the analysis of variance table.

Testing one slope parameter is 0

Now let's answer the second research question: "Is the size of the infarct significantly (linearly) related to the area of the region at risk?" To do so, we test the hypotheses:

  • \(H_{0} \colon \beta_{1} = 0 \)
  • \(H_{A} \colon \beta_{1} \ne 0 \)

Again, the full model is the model containing all of the possible predictors:

\(y_i=(\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3})+\epsilon_i\)
The error sum of squares for the full model, SSE ( F ), is just the usual error sum of squares, SSE . Alternatively, because the three predictors in the model are \(x_{1}\), \(x_{2}\), and \(x_{3}\), we can denote the error sum of squares as SSE (\(x_{1}\), \(x_{2}\), \(x_{3}\)). Again, because there are 4 parameters in the model, the number of error degrees of freedom associated with the full model is \(df_{F} = n - 4 \).

Because the null hypothesis sets the first slope parameter, \(\beta_{1}\), equal to 0, the reduced model is:

\(y_i=(\beta_0+\beta_2x_{i2}+\beta_3x_{i3})+\epsilon_i\)

Because the two predictors in the model are \(x_{2}\) and \(x_{3}\), we denote the error sum of squares as SSE (\(x_{2}\), \(x_{3}\)). Because there are 3 parameters in the model, the number of error degrees of freedom associated with the reduced model is \(df_{R} = n - 3\).

The general linear F-statistic:

\(F^*=\left(\dfrac{SSE(R)-SSE(F)}{df_R-df_F}\right) \div \left(\dfrac{SSE(F)}{df_F}\right)\)

simplifies to:

\(F^*=\dfrac{SSR(x_1|x_2, x_3)}{1}\div \dfrac{SSE(x_1,x_2, x_3)}{n-4}=\dfrac{MSR(x_1|x_2, x_3)}{MSE(x_1,x_2, x_3)}\)

Getting the numbers from the Minitab output (not reproduced here), we determine that the value of the F-statistic is:

\(F^* = \dfrac{SSR(x_1 \vert x_2, x_3)}{1} \div \dfrac{SSE(x_1, x_2, x_3)}{28} = \dfrac{0.63742}{0.01946}=32.7554\)

The P -value is the probability — if the null hypothesis were true — that we would get an F -statistic larger than 32.7554. Comparing our F -statistic to an F -distribution with 1 numerator degree of freedom and 28 denominator degrees of freedom, Minitab tells us that the probability is close to 1 that we would observe an F -statistic smaller than 32.7554:

F distribution with 1 DF in Numerator and 28 DF in denominator

Therefore, the probability that we would get an F -statistic larger than 32.7554 is close to 0. That is, the P -value is < 0.001. There is sufficient evidence ( F = 32.8, P < 0.001) to conclude that the size of the infarct is significantly related to the size of the area at risk after the other predictors x2 and x3 have been taken into account.

But wait a second! Have you been wondering why we couldn't just use the slope's t-statistic to test that the slope parameter, \(\beta_{1}\), is 0? We can! Notice that the P-value (P < 0.001) for the t-test (t* = 5.72) in the Coefficients table is the same as the P-value we obtained for the F-test. This will always be the case when we test that only one slope parameter is 0. That's because of the well-known relationship between a t-statistic and an F-statistic that has one numerator degree of freedom:

\(t_{(n-p)}^{2}=F_{(1, n-p)}\)

For our example, the square of the t -statistic, 5.72, equals our F -statistic (within rounding error). That is:

\(t^{*2}=5.72^2=32.72=F^*\)

So what have we learned in all of this discussion about the equivalence of the F-test and the t-test?

Compare the output obtained when \(x_{1}\) = Area is entered into the model last :

Inf = - 0.135 - 0.2435 X2 - 0.0657 X3 + 0.613 Area

to the output obtained when \(x_{1}\) = Area is entered into the model first :

The t-statistic and P-value are the same regardless of the order in which \(x_{1}\) = Area is entered into the model. That's because — by its equivalence to the F-test — the t-test for one slope parameter adjusts for all of the other predictors included in the model. In short:

  • We can use either the F -test or the t -test to test that only one slope parameter is 0. Because the t -test results can be read right off of the Minitab output, it makes sense that it would be the test that we'll use most often.
  • But, we have to be careful with our interpretations! The equivalence of the t -test to the F -test has taught us something new about the t -test. The t -test is a test for the marginal significance of the \(x_{1}\) predictor after the other predictors \(x_{2}\) and \(x_{3}\) have been taken into account. It does not test for the significance of the relationship between the response y and the predictor \(x_{1}\) alone.

Testing a subset of slope parameters is 0

Finally, let's answer the third — and primary — research question: "Is the size of the infarct area significantly (linearly) related to the type of treatment upon controlling for the size of the region at risk for infarction?" To do so, we test the hypotheses:

  • \(H_{0} \colon \beta_{2} = \beta_{3} = 0 \)
  • \(H_{A} \colon\) At least one \(\beta_{j} \ne 0 \) (for j = 2, 3)

Because the null hypothesis sets the second and third slope parameters, \(\beta_{2}\) and \(\beta_{3}\), equal to 0, the reduced model is:

\(y_i=(\beta_0+\beta_1x_{i1})+\epsilon_i\)

The ANOVA table for the reduced model is not reproduced here, but it provides the quantities we need.

Because the only predictor in the model is \(x_{1}\), we denote the error sum of squares as SSE (\(x_{1}\)) = 0.8793. Because there are 2 parameters in the model, the number of error degrees of freedom associated with the reduced model is \(df_{R} = n - 2 = 32 – 2 = 30\).

\begin{align} F^*&=\dfrac{SSE(R)-SSE(F)}{df_R-df_F} \div\dfrac{SSE(F)}{df_F}\\&=\dfrac{0.8793-0.54491}{30-28} \div\dfrac{0.54491}{28}\\&= \dfrac{0.33439}{2} \div 0.01946\\&=8.59.\end{align}

Alternatively, we can calculate the F-statistic using a partial F-test :

\begin{align}F^*&=\dfrac{SSR(x_2, x_3|x_1)}{2}\div \dfrac{SSE(x_1,x_2, x_3)}{n-4}\\&=\dfrac{MSR(x_2, x_3|x_1)}{MSE(x_1,x_2, x_3)}.\end{align}

To conduct the test, we regress y = InfSize on \(x_{1}\) = Area and \(x_{2}\) and \(x_{3 }\)— in order (and with "Sequential sums of squares" selected under "Options"):

Inf = - 0.135 + 0.613 Area - 0.2435 X2 - 0.0657 X3

yielding SSR (\(x_{2}\) | \(x_{1}\)) = 0.31453, SSR (\(x_{3}\) | \(x_{1}\), \(x_{2}\)) = 0.01981, and MSE = 0.54491/28 = 0.01946. Therefore, the value of the partial F -statistic is:

\begin{align} F^*&=\dfrac{SSR(x_2, x_3|x_1)}{2}\div \dfrac{SSE(x_1,x_2, x_3)}{n-4}\\&=\dfrac{0.31453+0.01981}{2}\div\dfrac{0.54491}{28}\\&= \dfrac{0.33434}{2} \div 0.01946\\&=8.59,\end{align}

which is identical (within round-off error) to the general F-statistic above. The P -value is the probability — if the null hypothesis were true — that we would observe a partial F -statistic more extreme than 8.59. The following Minitab output:

F distribution with 2 DF in Numerator and 28 DF in denominator

tells us that the probability of observing such an F -statistic that is smaller than 8.59 is 0.9988. Therefore, the probability of observing such an F -statistic that is larger than 8.59 is 1 - 0.9988 = 0.0012. The P -value is very small. There is sufficient evidence ( F = 8.59, P = 0.0012) to conclude that the type of cooling is significantly related to the extent of damage that occurs — after taking into account the size of the region at risk.

Summary of MLR Testing

For the simple linear regression model, there is only one slope parameter about which one can perform hypothesis tests. For the multiple linear regression model, there are three different hypothesis tests for slopes that one could conduct. They are:

  • Hypothesis test for testing that all of the slope parameters are 0.
  • Hypothesis test for testing that a subset — more than one, but not all — of the slope parameters are 0.
  • Hypothesis test for testing that one slope parameter is 0.

We have learned how to perform each of the above three hypothesis tests. Along the way, we also took two detours — one to learn about the " general linear F-test " and one to learn about " sequential sums of squares. " As you now know, knowledge about both is necessary for performing the three hypothesis tests.

The F -statistic and associated p -value in the ANOVA table is used for testing whether all of the slope parameters are 0. In most applications, this p -value will be small enough to reject the null hypothesis and conclude that at least one predictor is useful in the model. For example, for the rabbit heart attacks study, the F -statistic is (0.95927/(4–1)) / (0.54491/(32–4)) = 16.43 with p -value 0.000.

To test whether a subset — more than one, but not all — of the slope parameters are 0, there are two equivalent ways to calculate the F-statistic:

  • Use the general linear F-test formula by fitting the full model to find SSE(F) and fitting the reduced model to find SSE(R) . Then the numerator of the F-statistic is (SSE(R) – SSE(F)) / ( \(df_{R}\) – \(df_{F}\)) .
  • Alternatively, use the partial F-test formula by fitting only the full model but making sure the relevant predictors are fitted last and "sequential sums of squares" have been selected. Then the numerator of the F-statistic is the sum of the relevant sequential sums of squares divided by the sum of the degrees of freedom for these sequential sums of squares. The denominator of the F -statistic is the mean squared error in the ANOVA table.

For example, for the rabbit heart attacks study, the general linear F-statistic is ((0.8793 – 0.54491) / (30 – 28)) / (0.54491 / 28) = 8.59 with p -value 0.0012. Alternatively, the partial F -statistic for testing the slope parameters for predictors \(x_{2}\) and \(x_{3}\) using sequential sums of squares is ((0.31453 + 0.01981) / 2) / (0.54491 / 28) = 8.59.

To test whether one slope parameter is 0, we can use an F -test as just described. Alternatively, we can use a t -test, which will have an identical p -value since in this case, the square of the t -statistic is equal to the F -statistic. For example, for the rabbit heart attacks study, the F -statistic for testing the slope parameter for the Area predictor is (0.63742/1) / (0.54491/(32–4)) = 32.75 with p -value 0.000. Alternatively, the t -statistic for testing the slope parameter for the Area predictor is 0.613 / 0.107 = 5.72 with p -value 0.000, and \(5.72^{2} = 32.72\).

Incidentally, you may be wondering why we can't just do a series of individual t-tests to test whether a subset of the slope parameters is 0. For example, for the rabbit heart attacks study, we could have done the following:

  • Fit the model of y = InfSize on \(x_{1}\) = Area and \(x_{2}\) and \(x_{3}\) and use an individual t-test for \(x_{3}\).
  • If the test results indicate that we can drop \(x_{3}\) then fit the model of y = InfSize on \(x_{1}\) = Area and \(x_{2}\) and use an individual t-test for \(x_{2}\).

The problem with this approach is we're using two individual t-tests instead of one F-test, which means our chance of drawing an incorrect conclusion in our testing procedure is higher. Every time we do a hypothesis test, we can draw an incorrect conclusion by:

  • rejecting a true null hypothesis, i.e., making a type I error by concluding the tested predictor(s) should be retained in the model when in truth it/they should be dropped; or
  • failing to reject a false null hypothesis, i.e., making a type II error by concluding the tested predictor(s) should be dropped from the model when in truth it/they should be retained.

Thus, in general, the fewer tests we perform the better. In this case, this means that wherever possible using one F-test in place of multiple individual t-tests is preferable.

Hypothesis tests for the slope parameters

The problems in this section are designed to review the hypothesis tests for the slope parameters, as well as to give you some practice on models with a three-group qualitative variable (which we'll cover in more detail in Lesson 8). We consider tests for:

  • whether one slope parameter is 0 (for example, \(H_{0} \colon \beta_{1} = 0 \))
  • whether a subset (more than one but less than all) of the slope parameters are 0 (for example, \(H_{0} \colon \beta_{2} = \beta_{3} = 0 \) against the alternative \(H_{A} \colon \beta_{2} \ne 0 \) or \(\beta_{3} \ne 0 \) or both ≠ 0)
  • whether all of the slope parameters are 0 (for example, \(H_{0} \colon \beta_{1} = \beta_{2} = \beta_{3}\) = 0 against the alternative \(H_{A} \colon \) at least one of the \(\beta_{i}\) is not 0)

(Note the correct specification of the alternative hypotheses for the last two situations.)

Sugar beets study

A group of researchers was interested in studying the effects of three different growth regulators ( treat , denoted 1, 2, and 3) on the yield of sugar beets (y = yield , in pounds). They planned to plant the beets in 30 different plots and then randomly treat 10 plots with the first growth regulator, 10 plots with the second growth regulator, and 10 plots with the third growth regulator. One problem, though, is that the amount of available nitrogen in the 30 different plots varies naturally, thereby giving a potentially unfair advantage to plots with higher levels of available nitrogen. Therefore, the researchers also measured and recorded the available nitrogen (\(x_{1}\) = nit , in pounds/acre) in each plot. They are interested in comparing the mean yields of sugar beets subjected to the different growth regulators after taking into account the available nitrogen. The Sugar Beets dataset contains the data from the researcher's experiment.

Preliminary Work

The plot shows a similar positive linear trend within each treatment category, which suggests that it is reasonable to formulate a multiple regression model that would place three parallel lines through the data.

Because the qualitative variable treat distinguishes between the three treatment groups (1, 2, and 3), we need to create two indicator variables, \(x_{2}\) and \(x_{3}\), say, to fit a linear regression model to these data. The new indicator variables should be defined as follows: \(x_{2} = 1\) if a plot received treatment 1 and 0 otherwise, and \(x_{3} = 1\) if a plot received treatment 2 and 0 otherwise, so that treatment 3 serves as the reference level.

Use Minitab's Calc >> Make Indicator Variables command to create the new indicator variables in your worksheet

Minitab creates an indicator variable for each treatment group but we can only use two, for treatment groups 1 and 2 in this case (treatment group 3 is the reference level in this case).
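In R, the indicators can be built by hand or left to the factor machinery; a minimal sketch, assuming the data frame is called beets with columns yield, nit, and treat (the names are assumptions):

beets$x2 <- as.numeric(beets$treat == 1)   # indicator for treatment 1
beets$x3 <- as.numeric(beets$treat == 2)   # indicator for treatment 2 (treatment 3 is the reference)

fit <- lm(yield ~ nit + x2 + x3, data = beets)

# Equivalent shortcut: let R create the indicators from a factor with treatment 3 as reference
fit2 <- lm(yield ~ nit + relevel(factor(treat), ref = "3"), data = beets)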

Then, if we assume the trend in the data can be summarized by this regression model:

\(y_{i} = \beta_{0} + \beta_{1}x_{i1} + \beta_{2}x_{i2} + \beta_{3}x_{i3} + \epsilon_{i}\)

where \(x_{1}\) = nit and \(x_{2}\) and \(x_{3}\) are defined as above, what is the mean response function for plots receiving treatment 3? for plots receiving treatment 1? for plots receiving treatment 2? Are the three regression lines that arise from our formulated model parallel? What does the parameter \(\beta_{2}\) quantify? And, what does the parameter \(\beta_{3}\) quantify?

The fitted equation from Minitab is Yield = 84.99 + 1.3088 Nit - 2.43 \(x_{2}\) - 2.35 \(x_{3}\), which means that the equations for each treatment group are:

  • Group 1: Yield = 84.99 + 1.3088 Nit - 2.43(1) = 82.56 + 1.3088 Nit
  • Group 2: Yield = 84.99 + 1.3088 Nit - 2.35(1) = 82.64 + 1.3088 Nit
  • Group 3: Yield = 84.99 + 1.3088 Nit

The three estimated regression lines are parallel since they have the same slope, 1.3088.

The regression parameter for \(x_{2}\) represents the difference between the estimated intercept for treatment 1 and the estimated intercept for reference treatment 3.

The regression parameter for \(x_{3}\) represents the difference between the estimated intercept for treatment 2 and the estimated intercept for reference treatment 3.

Testing whether all of the slope parameters are 0

\(H_0 \colon \beta_1 = \beta_2 = \beta_3 = 0\) against the alternative \(H_A \colon \) at least one of the \(\beta_i\) is not 0.

\(F=\dfrac{SSR(X_1,X_2,X_3)\div3}{SSE(X_1,X_2,X_3)\div(n-4)}=\dfrac{MSR(X_1,X_2,X_3)}{MSE(X_1,X_2,X_3)}\)

\(F = \dfrac{\frac{16039.5}{3}}{\frac{1078.0}{30-4}} = \dfrac{5346.5}{41.46} = 128.95\)

Since the p -value for this F -statistic is reported as 0.000, we reject \(H_{0}\) in favor of \(H_{A}\) and conclude that at least one of the slope parameters is not zero, i.e., the regression model containing at least one predictor is useful in predicting sugar beet yield.

Tests for whether one slope parameter is 0

\(H_0 \colon \beta_1= 0\) against the alternative \(H_A \colon \beta_1 \ne 0\)

t -statistic = 19.60, p -value = 0.000, so we reject \(H_{0}\) in favor of \(H_{A}\) and conclude that the slope parameter for \(x_{1}\) = nit is not zero, i.e., sugar beet yield is significantly linearly related to the available nitrogen (controlling for treatment).

\(F=\dfrac{SSR(X_1|X_2,X_3)\div1}{SSE(X_1,X_2,X_3)\div(n-4)}=\dfrac{MSR(X_1|X_2,X_3)}{MSE(X_1,X_2,X_3)}\)

Use the Minitab output to calculate the value of this F statistic. Does the value you obtain equal \(t^{2}\), the square of the t -statistic as we might expect?

\(F = \dfrac{\frac{15934.5}{1}}{\frac{1078.0}{30-4}} = \dfrac{15934.5}{41.46} = 384.32\), which equals \(19.60^{2}\) up to rounding error.

Because \(t^{2}\) will equal the partial F -statistic whenever you test for whether one slope parameter is 0, it makes sense to just use the t -statistic and P -value that Minitab displays as a default. But, note that we've just learned something new about the meaning of the t -test in the multiple regression setting. It tests for the ("marginal") significance of the \(x_{1}\) predictor after \(x_{2}\) and \(x_{3}\) have already been taken into account.

Tests for whether a subset of the slope parameters is 0

\(H_0 \colon \beta_2=\beta_3= 0\) against the alternative \(H_A \colon \beta_2 \ne 0\) or \(\beta_3 \ne 0\) or both \(\ne 0\).

\(F=\dfrac{SSR(X_2,X_3|X_1)\div2}{SSE(X_1,X_2,X_3)\div(n-4)}=\dfrac{MSR(X_2,X_3|X_1)}{MSE(X_1,X_2,X_3)}\)

\(F = \dfrac{\frac{10.4+27.5}{2}}{\frac{1078.0}{30-4}} = \dfrac{18.95}{41.46} = 0.46\).

F distribution with 2 DF in Numerator and 26 DF in denominator

p-value \(= 1-0.363677 = 0.636\), so we fail to reject \(H_{0}\) in favor of \(H_{A}\) and conclude that we cannot rule out \(\beta_2 = \beta_3 = 0\), i.e., there is no significant difference in the mean yields of sugar beets subjected to the different growth regulators after taking into account the available nitrogen.
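The same tail probability can be obtained directly in R:

pf(0.46, df1 = 2, df2 = 26, lower.tail = FALSE)   # approximately 0.636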

Note that the sequential mean square due to regression, MSR(\(X_{2}\),\(X_{3}\)|\(X_{1}\)), is obtained by dividing the sequential sum of squares by its degrees of freedom (2, in this case, since two additional predictors \(X_{2}\) and \(X_{3}\) are considered).


Linear regression hypothesis testing: Concepts, Examples


In machine learning , linear regression is a predictive modeling technique for building models that predict a continuous response variable as a linear combination of explanatory or predictor variables. While training linear regression models, we rely on hypothesis testing to determine the relationship between the response and predictor variables. In the case of the linear regression model, two types of hypothesis tests are done: T-tests and F-tests . In other words, there are two types of statistics used to assess whether a linear regression relationship exists between the response and predictor variables: t-statistics and f-statistics. As data scientists , it is of utmost importance to determine whether linear regression is the correct choice of model for a particular problem, and this can be done by performing hypothesis tests on the regression's response and predictor variables. These concepts are often unclear to many data scientists. In this blog post, we discuss linear regression and hypothesis testing related to t-statistics and f-statistics , and we provide an example to help illustrate how these concepts work.


What are linear regression models?

A linear regression model can be defined as the function approximation that represents a continuous response variable as a function of one or more predictor variables. While building a linear regression model, the goal is to identify a linear equation that best predicts or models the relationship between the response or dependent variable and one or more predictor or independent variables.

There are two different kinds of linear regression models. They are as follows:

  • Simple or Univariate linear regression models : These are linear regression models that are used to build a linear relationship between one response or dependent variable and one predictor or independent variable. The form of the equation that represents a simple linear regression model is Y=mX+b, where m is the coefficient of the predictor variable and b is the bias. When considering the linear regression line, m represents the slope and b represents the intercept.
  • Multiple or Multi-variate linear regression models : These are linear regression models that are used to build a linear relationship between one response or dependent variable and more than one predictor or independent variable. The form of the equation that represents a multiple linear regression model is Y=b0+b1X1+ b2X2 + … + bnXn, where bi represents the coefficient of the ith predictor variable. In this type of linear regression model, each predictor variable has its own coefficient that is used to calculate the predicted value of the response variable.

While training linear regression models, the requirement is to determine the coefficients which can result in the best-fitted linear regression line. The learning algorithm used to find the most appropriate coefficients is known as least squares regression . In the least-squares regression method, the coefficients are calculated using the least-squares error function. The main objective of this method is to minimize or reduce the sum of squared residuals between actual and predicted response values. The sum of squared residuals is also called the residual sum of squares (RSS). The outcome of executing the least-squares regression method is coefficients that minimize the linear regression cost function .

The residual \(e_i\) of the \(i\)th observation is represented as follows, where \(Y_i\) is the \(i\)th observation and \(\hat{Y_i}\) is the prediction (the fitted value of the response variable) for the \(i\)th observation:

\(e_i = Y_i - \hat{Y_i}\)

The residual sum of squares can be represented as the following:

\(RSS = e_1^2 + e_2^2 + e_3^2 + \cdots + e_n^2\)

The least-squares method represents the algorithm that minimizes the above term, RSS.
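A minimal numerical sketch in R, using the built-in mtcars data rather than the example discussed later in this post:

fit <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(residuals(fit)^2)   # residual sum of squares that least squares minimizes
rss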

Once the coefficients are determined, can it be claimed that these coefficients are the most appropriate ones for linear regression? The answer is no. After all, the coefficients are only the estimates and thus, there will be standard errors associated with each of the coefficients.  Recall that the standard error is used to calculate the confidence interval in which the mean value of the population parameter would exist. In other words, it represents the error of estimating a population parameter based on the sample data. The value of the standard error is calculated as the standard deviation of the sample divided by the square root of the sample size. The formula below represents the standard error of a mean.

\(SE(\mu) = \frac{\sigma}{\sqrt{N}}\)

Thus, without analyzing aspects such as the standard errors associated with the coefficients, it cannot be claimed that the linear regression coefficients are the most suitable ones. This is where hypothesis testing is needed. Before we get into why we need hypothesis testing with the linear regression model, let's briefly review what hypothesis testing is.

Train a Multiple Linear Regression Model using R

Before getting into understanding the hypothesis testing concepts in relation to the linear regression model, let’s train a multi-variate or multiple linear regression model and print the summary output of the model which will be referred to, in the next section. 

The data used for creating the multiple linear regression model is BostonHousing, which can be loaded in RStudio by installing the mlbench package. The code is shown below:

install.packages("mlbench")   # install the package that ships the BostonHousing data
library(mlbench)
data("BostonHousing")

Once the data is loaded, the code shown below can be used to create the linear regression model.

attach(BostonHousing)
# Regress log median home value on crime rate, Charles River indicator,
# highway accessibility, and lower-status percentage
BostonHousing.lm <- lm(log(medv) ~ crim + chas + rad + lstat)
summary(BostonHousing.lm)

Executing the above command will result in the creation of a linear regression model with the response variable as medv and predictor variables as crim, chas, rad, and lstat. The following represents the details related to the response and predictor variables:

  • log(medv) : Log of the median value of owner-occupied homes in USD 1000’s
  • crim : Per capita crime rate by town
  • chas : Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • rad : Index of accessibility to radial highways
  • lstat : Percentage of the lower status of the population

The following is the output of the summary command, which prints the details relating to the model, including hypothesis testing details for the coefficients (t-statistics) and for the model as a whole (F-statistic).

(R model summary output not reproduced here.)

Hypothesis tests & Linear Regression Models

Hypothesis tests are the statistical procedure that is used to test a claim or assumption about the underlying distribution of a population based on the sample data. Here are key steps of doing hypothesis tests with linear regression models:

  • Hypothesis formulation for T-tests: In the case of linear regression, the claim is that there exists a relationship between the response and predictor variables, and the claim is represented by non-zero values of the coefficients of the predictor variables in the linear equation or regression model. This is formulated as the alternate hypothesis. Thus, the null hypothesis states that there is no relationship between the response and the predictor variables, i.e., that the coefficient of each predictor variable is equal to zero (0). So, if the linear regression model is Y = a0 + a1x1 + a2x2 + a3x3, then the null hypothesis for each test states that a1 = 0, a2 = 0, a3 = 0, etc. For each predictor variable, an individual hypothesis test is done to determine whether the relationship between the response and that particular predictor variable is statistically significant based on the sample data used for training the model. Thus, if there are, say, 5 features, there will be five hypothesis tests and each will have an associated null and alternate hypothesis.
  • Hypothesis formulation for F-test: In addition, there is a hypothesis test done around the claim that there is a linear regression model representing the response variable and all the predictor variables. The null hypothesis is that the linear regression model does not exist, which essentially means that the values of all the slope coefficients are equal to zero. So, if the linear regression model is Y = a0 + a1x1 + a2x2 + a3x3, then the null hypothesis states that a1 = a2 = a3 = 0.
  • F-statistic for testing the hypothesis for the linear regression model: The F-test is used to test the null hypothesis that a linear regression model relating the response variable y to the predictor variables x1, x2, x3, x4 and x5 does not exist. With coefficients a1, ..., a5, this null hypothesis can also be written as a1 = a2 = a3 = a4 = a5 = 0. The F-statistic is calculated as a function of the residual sum of squares for the restricted regression (a model with only the intercept or bias, i.e., all slope coefficients set to zero) and the residual sum of squares for the unrestricted regression (the full linear regression model). In the model summary above, note the F-statistic of 15.66 against 5 and 194 degrees of freedom.
  • Evaluate the t-statistic against the critical value/region: After calculating the value of the t-statistic for each coefficient, it is time to decide whether to reject the null hypothesis. For this decision to be made, one needs to set a significance level, also known as the alpha level; a significance level of 0.05 is usually used. If the value of the t-statistic falls in the critical region, the null hypothesis is rejected. Equivalently, if the p-value comes out to be less than 0.05, the null hypothesis is rejected.
  • Evaluate the F-statistic against the critical value/region: The value of the F-statistic and its p-value are evaluated for testing the null hypothesis that the linear regression model representing the response and predictor variables does not exist. If the value of the F-statistic is more than the critical value at the 0.05 level of significance, the null hypothesis is rejected. This means that the linear model exists, with at least one nonzero coefficient.
  • Draw conclusions: The final step of hypothesis testing is to draw a conclusion by interpreting the results in terms of the original claim or hypothesis. If the null hypothesis for one or more predictor variables is rejected, it means that the relationship between the response and that predictor variable is statistically significant based on the evidence, i.e., the sample data used for training the model. Similarly, if the F-statistic lies in the critical region and the p-value is less than the alpha value, usually set at 0.05, one can say that there exists a linear regression model. A short R sketch of reading these quantities off the fitted model follows this list.
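A hedged sketch of pulling these quantities out of the BostonHousing model fitted earlier (it assumes the code above has been run, so BostonHousing.lm exists):

s <- summary(BostonHousing.lm)

s$coefficients   # per-coefficient estimates, standard errors, t-statistics, and two-sided p-values
s$fstatistic     # overall F-statistic with its numerator and denominator degrees of freedom

# p-value for the overall F-test
with(as.list(s$fstatistic), pf(value, numdf, dendf, lower.tail = FALSE))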

Why hypothesis tests for linear regression models?

The reasons why we need to do hypothesis tests in the case of a linear regression model are the following:

  • By creating the model, we are establishing a new claim about the relationship between the response or dependent variable and one or more predictor or independent variables. To justify that claim, one or more tests are needed. These tests can be termed acts of testing the claim (or new truth), in other words, hypothesis tests.
  • One kind of test is required to test the relationship between the response and each of the predictor variables (hence, T-tests).
  • Another kind of test is required to test the linear regression model representation as a whole. This is called the F-test.

While training linear regression models, hypothesis testing is done to determine whether the relationship between the response and each of the predictor variables is statistically significant or otherwise. The coefficients related to each of the predictor variables are determined. Then, individual hypothesis tests are done to determine whether the relationship between the response and a particular predictor variable is statistically significant based on the sample data used for training the model. If the null hypothesis for a particular predictor variable is rejected, this indicates that there is a statistically significant relationship between the response and that predictor variable. The t-statistic is used for performing the hypothesis testing because the standard deviation of the sampling distribution is unknown. The value of the t-statistic is compared with the critical value from the t-distribution table in order to decide whether to reject the null hypothesis regarding the relationship between the response and the predictor variable. If the value falls in the critical region, then the null hypothesis is rejected, which means that there is evidence of a relationship between the response and that predictor variable. In addition to T-tests, the F-test is performed to test the null hypothesis that the linear regression model does not exist and that the value of all the slope coefficients is zero (0). Learn more about the linear regression and t-test in this blog – Linear regression t-test: formula, example .


Do you get more food when you order in-person at Chipotle?

April 7, 2024

Inspired by this Reddit post , we will conduct a hypothesis test to determine if there is a difference in the weight of Chipotle orders between in-person and online orders. The data was originally collected by Zackary Smigel , and a cleaned copy can be found in data/chipotle.csv .

Throughout the application exercise we will use the infer package, which is part of tidymodels, to conduct our permutation tests.

(Variable summaries for the character, Date, and numeric columns are not reproduced here.)

The variable we will use in this analysis is weight which records the total weight of the meal in grams.

We wish to test the claim that the difference in weight between in-person and online orders must be due to something other than chance.


  • Your turn: Write out the correct null and alternative hypothesis in terms of the difference in means between in-person and online orders. Do this in both words and in proper notation.

Null hypothesis: The difference in means between in-person and online Chipotle orders is \(0\).

\[H_0: \mu_{\text{online}} - \mu_{\text{in-person}} = 0\]

Alternative hypothesis: The difference in means between in-person and online Chipotle orders is not \(0\).

\[H_A: \mu_{\text{online}} - \mu_{\text{in-person}} \neq 0\]

Observed data

Our goal is to use the collected data and calculate the probability of a sample statistic at least as extreme as the one observed in our data if in fact the null hypothesis is true.

  • Demo: Calculate and report the sample statistic below using proper notation.

The null distribution

Let’s use permutation-based methods to conduct the hypothesis test specified above.

We’ll start by generating the null distribution.

  • Demo: Generate the null distribution.
  • Your turn: Take a look at null_dist . What does each element in this distribution represent?

Add response here.
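A hedged sketch of what the observed statistic and the permutation-based null distribution might look like with infer, assuming the data frame is chipotle with a numeric weight column and an order column whose levels are "Online" and "In-person" (the column name and level labels are assumptions):

library(infer)

# Observed difference in sample means
obs_diff <- chipotle |>
  specify(weight ~ order) |>
  calculate(stat = "diff in means", order = c("Online", "In-person"))

# Permutation null distribution: shuffle the order labels, recompute the statistic
null_dist <- chipotle |>
  specify(weight ~ order) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in means", order = c("Online", "In-person"))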

Question: Before you visualize the distribution of null_dist – at what value would you expect this distribution to be centered? Why?

Demo: Create an appropriate visualization for your null distribution. Does the center of the distribution match what you guessed in the previous question?


  • Demo: Now, add a vertical red line on your null distribution that represents your sample statistic.


Question: Based on the position of this line, does your observed sample difference in means appear to be an unusual observation under the assumption of the null hypothesis?

Above, we eyeballed how likely/unlikely our observed mean is. Now, let’s actually quantify it using a p-value.

Question: What is a p-value?

Guesstimate the p-value

  • Demo: Visualize the p-value.


Your turn: What is your guesstimate of the p-value?

Calculate the p-value

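Continuing the sketch above, the p-value can be visualized and computed with infer (again assuming obs_diff and null_dist from the earlier sketch):

library(infer)

visualize(null_dist) +
  shade_p_value(obs_stat = obs_diff, direction = "two-sided")   # shade both tails

null_dist |>
  get_p_value(obs_stat = obs_diff, direction = "two-sided")     # permutation p-value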

Your turn: What is the conclusion of the hypothesis test based on the p-value you calculated? Make sure to frame it in context of the data and the research question. Use a significance level of 5% to make your conclusion.

Demo: Interpret the p-value in context of the data and the research question.

Reframe as a linear regression model

While we originally evaluated the null/alternative hypotheses as a difference in means, we could also frame this as a regression problem where the outcome of interest (weight of the order) is a continuous variable. Framing it this way allows us to include additional explanatory variables in our model which may account for some of the variation in weight.

Single explanatory variable

Demo: Let’s reevaluate the original hypotheses using a linear regression model. Notice the similarities and differences in the code compared to a difference in means, and that the obtained p-value should be nearly identical to the results from the difference in means test.

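A hedged sketch of the regression framing with a single explanatory variable (same assumed column names as above):

weight_fit <- lm(weight ~ order, data = chipotle)
summary(weight_fit)   # the order coefficient is the difference in means;
                      # its t-test corresponds to the difference-in-means comparison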

Multiple explanatory variables

Demo: Now let’s also account for additional variables that likely influence the weight of the order.

  • Protein type ( meat )
  • Type of meal ( meal_type ) - burrito or bowl
  • Store ( store ) - at which Chipotle location the order was placed
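A hedged sketch with the additional explanatory variables listed above (column names are assumptions):

weight_full_fit <- lm(weight ~ order + meat + meal_type + store, data = chipotle)
summary(weight_full_fit)   # the p-value for order now controls for protein, meal type, and store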


Your turn: Interpret the p-value for the order in context of the data and the research question.

Compare to CLT-based method

Demo: Let’s compare the p-value obtained from the permutation test to the p-value obtained from that derived using the Central Limit Theorem (CLT).

Your turn: What is the p-value obtained from the CLT-based method? How does it compare to the p-value obtained from the permutation test?


    Multiple regression models are usually estimated using ordinary least squares. In this chapter, we focus on the multiple linear regression with two regressors \[ Y = \beta_0 + \beta_1 X + \beta_2 Z + \epsilon. \] The general multi-regressor case is best dealt with using matrix algebra, which we leave for a later chapter. We use the two regressor case to build intuition regarding issues such as ...

  19. Multiple Linear Regression in SPSS

    The hypothesis testing in Multiple Linear Regression revolves around assessing whether the collective set of independent variables has a statistically significant impact on the dependent variable. The null hypothesis suggests no overall effect, while the alternative hypothesis asserts the presence of at least one significant relationship.

  20. Multiple Linear Regression

    Multiple linear regression analysis is the creation of an equation with multiple independent X variables that all influence a Y response variable. This equation is based upon an existing data set and models the conditions represented in the data. ... 01:09 the hypothesis tests are still the same as a simple linear regression. 01:13 The null ...

  21. Hypothesis testing in the multiple regression model

    4.1 Hypothesis testing: an overview 1 4.1.1 Formulation of the null hypothesis and the alternative hypothesis 2 4.1.2 Test statistic 2 4.1.3 Decision rule 3 4.2 Testing hypotheses using the t test 5 4.2.1 Test of a single parameter 5 4.2.2 Confidence intervals 16 4.2.3 Testing hypothesis about a single linear combination of the parameters 17 4.2.4 Economic importance versus statistical ...

  22. Inference for Regression

    BUT the response variable is quantitative. In this week's p-setyou will explore the importance of controlling for key explanatory variableswhen making inferences about relationships. Multiple Linear Regression. Form of the Model: \[\begin{align}y &= \beta_o + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon\end{align}\] Fitted ...

  23. 6.4

    For the simple linear regression model, there is only one slope parameter about which one can perform hypothesis tests. For the multiple linear regression model, there are three different hypothesis tests for slopes that one could conduct. They are: Hypothesis test for testing that all of the slope parameters are 0.

  24. Linear regression hypothesis testing: Concepts, Examples

    This essentially means that the value of all the coefficients is equal to zero. So, if the linear regression model is Y = a0 + a1x1 + a2x2 + a3x3, then the null hypothesis states that a1 = a2 = a3 = 0. Determine the test statistics: The next step is to determine the test statistics and calculate the value.

  25. Do you get more food when you order in-person at Chipotle?

    Demo: Let's reevaluate the original hypotheses using a linear regression model. Notice the similarities and differences in the code compared to a difference in means, and that the obtained p-value should be nearly identical to the results from the difference in means test.