
Understanding the Null Hypothesis for Linear Regression

Linear regression is a technique we can use to understand the relationship between one or more predictor variables and a response variable.

If we only have one predictor variable and one response variable, we can use simple linear regression, which uses the following formula to estimate the relationship between the variables:

ŷ = β₀ + β₁x

  • ŷ: The estimated response value.
  • β₀: The average value of y when x is zero.
  • β₁: The average change in y associated with a one-unit increase in x.
  • x: The value of the predictor variable.

Simple linear regression uses the following null and alternative hypotheses:

  • H₀: β₁ = 0
  • Hₐ: β₁ ≠ 0

The null hypothesis states that the coefficient β₁ is equal to zero. In other words, there is no statistically significant relationship between the predictor variable, x, and the response variable, y.

The alternative hypothesis states that β₁ is not equal to zero. In other words, there is a statistically significant relationship between x and y.

If we have multiple predictor variables and one response variable, we can use multiple linear regression, which uses the following formula to estimate the relationship between the variables:

ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ

  • β₀: The average value of y when all predictor variables are equal to zero.
  • βᵢ: The average change in y associated with a one-unit increase in xᵢ.
  • xᵢ: The value of the predictor variable xᵢ.

Multiple linear regression uses the following null and alternative hypotheses:

  • H₀: β₁ = β₂ = … = βₖ = 0
  • Hₐ: at least one βᵢ ≠ 0

The null hypothesis states that all coefficients in the model are equal to zero. In other words, none of the predictor variables have a statistically significant relationship with the response variable, y.

The alternative hypothesis states that the coefficients are not all equal to zero. In other words, at least one predictor variable has a statistically significant relationship with the response variable, y.

The following examples show how to decide to reject or fail to reject the null hypothesis in both simple linear regression and multiple linear regression models.

Example 1: Simple Linear Regression

Suppose a professor would like to use the number of hours studied to predict the exam score that students will receive in his class. He collects data for 20 students and fits a simple linear regression model.

The following screenshot shows the output of the regression model:

Output of simple linear regression in Excel

The fitted simple linear regression model is:

Exam Score = 67.1617 + 5.2503*(hours studied)

To determine if there is a statistically significant relationship between hours studied and exam score, we need to analyze the overall F value of the model and the corresponding p-value:

  • Overall F-Value:  47.9952
  • P-value:  0.000

Since this p-value is less than .05, we can reject the null hypothesis. In other words, there is a statistically significant relationship between hours studied and exam score received.
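
If you wanted to reproduce this kind of test outside of Excel, a minimal R sketch looks like the following (the hours and score values are simulated, not the professor's actual data; only the structure of the call matters):

# Simple linear regression and its overall F-test in R (hypothetical data)
set.seed(1)
hours <- runif(20, 0, 10)                      # hours studied for 20 students
score <- 67 + 5.3 * hours + rnorm(20, sd = 4)  # exam scores with random noise

model <- lm(score ~ hours)
summary(model)   # the "F-statistic" line reports the overall F value and its p-value
anova(model)     # with one predictor, this F-test is equivalent to the t-test on the slope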

Example 2: Multiple Linear Regression

Suppose a professor would like to use the number of hours studied and the number of prep exams taken to predict the exam score that students will receive in his class. He collects data for 20 students and fits a multiple linear regression model.

Multiple linear regression output in Excel

The fitted multiple linear regression model is:

Exam Score = 67.67 + 5.56*(hours studied) – 0.60*(prep exams taken)

To determine if there is a jointly statistically significant relationship between the two predictor variables and the response variable, we need to analyze the overall F value of the model and the corresponding p-value:

  • Overall F-Value:  23.46
  • P-value:  0.00

Since this p-value is less than .05, we can reject the null hypothesis. In other words, hours studied and prep exams taken have a jointly statistically significant relationship with exam score.

Note: Although the p-value for prep exams taken (p = 0.52) is not significant on its own, prep exams taken and hours studied together have a jointly significant relationship with exam score.
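
As with the simple model, the joint test can be reproduced in R. The sketch below uses simulated data (not the professor's) and also shows that the overall F-test is a comparison against an intercept-only model:

# Multiple linear regression and the joint (overall) F-test in R (hypothetical data)
set.seed(2)
hours <- runif(20, 0, 10)
prep_exams <- rpois(20, 2)
score <- 68 + 5.5 * hours - 0.5 * prep_exams + rnorm(20, sd = 4)

fit  <- lm(score ~ hours + prep_exams)
summary(fit)              # "F-statistic" line tests H0: all slope coefficients = 0
fit0 <- lm(score ~ 1)     # intercept-only (null) model
anova(fit0, fit)          # same overall F-test, framed as a model comparison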

Additional Resources

  • Understanding the F-Test of Overall Significance in Regression
  • How to Read and Interpret a Regression Table
  • How to Report Regression Results
  • How to Perform Simple Linear Regression in Excel
  • How to Perform Multiple Linear Regression in Excel


The Ultimate Guide to Linear Regression

Get all your linear regression questions answered here

Welcome! When most people think of statistical models, their first thought is linear regression models. What most people don’t realize is that linear regression is a specific type of regression.

With that in mind, we’ll start with an overview of regression models as a whole. Then after we understand the purpose, we’ll focus on the linear part, including why it’s so popular and how to calculate regression lines of best fit! (Or, if you already understand regression, you can skip straight down to the linear part.)

This guide will help you run and understand the intuition behind linear regression models. It’s intended to be a refresher resource for scientists and researchers, as well as to help new students gain better intuition about this useful modeling tool.

What is regression?

In its simplest form, regression is a type of model that uses one or more variables to estimate the actual values of another. There are plenty of different kinds of regression models, including the most commonly used linear regression, but they all have the basics in common. 

Usually the researcher has a response variable they are interested in predicting, and an idea of one or more predictor variables that could help in making an educated guess. Some simple examples include:

  • Predicting the progression of a disease such as diabetes using predictors such as age, cholesterol, etc. (linear regression)
  • Predicting survival rates or time-to-failure based on explanatory variables (survival analysis) 
  • Predicting political affiliation based on a person’s income level and years of education (logistic regression or some other classifier)
  • Predicting drug inhibition concentration at various dosages (nonlinear regression)

There are all sorts of applications, but the point is this: If we have a dataset of observations that links those variables together for each item in the dataset, we can regress the response on the predictors. Furthermore:

Fitting a model to your data can tell you how one variable increases or decreases as the value of another variable changes.

For example, if we have a dataset of houses that includes both their size and selling price, a regression model can help quantify the relationship between the two. ( Not that any model will be perfect for this !)

The most noticeable aspect of a regression model is the equation it produces. This model equation gives a line of best fit, which can be used to produce estimates of a response variable based on any value of the predictors (within reason). We call the output of the model a point estimate because it is a point on the continuum of possibilities. Of course, how good that prediction actually is depends on everything from the accuracy of the data you’re putting in the model to how hard the question is in the first place.

Compare this to other methods like correlation, which can tell you the strength of the relationship between the variables but does not produce point estimates of the actual values of the response.
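
As a small illustration of that difference, the sketch below (in R, with made-up house size and price values) computes a correlation and then a regression point estimate:

# Correlation measures strength; regression also produces point estimates (made-up data)
size  <- c(1200, 1500, 1700, 2100, 2500)   # square feet
price <- c(200, 240, 260, 310, 370)        # selling price in $1000s

cor(size, price)                                  # strength of the linear relationship only
fit <- lm(price ~ size)
predict(fit, newdata = data.frame(size = 1800))   # point estimate for an 1800 sq ft home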

What is the difference between the variables in regression?

There are two different kinds of variables in regression: The one which helps predict (predictors), and the one you’re trying to predict (response).

Predictors were historically called independent variables in science textbooks. You may also see them referred to as x-variables, regressors, inputs, or covariates. Depending on the type of regression model you can have multiple predictor variables, which is called multiple regression . Predictors can be either continuous (numerical values such as height and weight) or categorical (levels of categories such as truck/SUV/motorcycle).

The response variable is often explained in layman’s terms as “the thing you actually want to predict or know more about”. It is usually the focus of the study and can be referred to as the dependent variable, y-variable, outcome, or target. In general, the response variable has a single value for each observation (e.g., predicting the temperature based on some other variables), but there can be multiple values (e.g., predicting the location of an object in latitude and longitude). The latter case is called multivariate regression (not to be confused with multiple regression). 

What are the purposes of regression analysis?

Regression Analysis has two main purposes:

  • Explanatory - A regression analysis explains the relationship between the response and predictor variables. For example, it can answer questions such as, does kidney function increase the severity of symptoms in some particular disease process? 
  • Predictive - A regression model can give a point estimate of the response variable based on the value of the predictors. 

How do I know which model best fits the data?

The most common way of determining the best model is by choosing the one that minimizes the squared difference between the actual values and the model’s estimated values. This is called least squares. Note that “least squares regression” is often used as a moniker for linear regression even though least squares is used for linear as well as nonlinear and other types of regression.
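
For a sense of what "minimizing the squared difference" means in practice, here is a small R sketch that computes the least-squares slope and intercept directly and compares them with lm() (the x and y values are arbitrary):

# Least squares "by hand" vs. R's built-in fit (arbitrary illustrative data)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope minimizing the squared residuals
b0 <- mean(y) - b1 * mean(x)                                     # corresponding intercept
c(intercept = b0, slope = b1)

coef(lm(y ~ x))   # lm() uses least squares, so it returns the same estimates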

What is linear regression?

The most popular form of regression is linear regression, which is used to predict the value of one numeric (continuous) response variable based on one or more predictor variables (continuous or categorical).

Most people think the name “linear regression” comes from a straight line relationship between the variables. For most cases, that’s a fine way to think of it intuitively: As a predictor variable increases, the response either increases or decreases at the same rate (all other things equal). If this relationship holds the same for any values of the variables, a straight line pattern will form in the data when graphed, as in the example below:

1 - Old Faithful Eruption Times -Linear regression

However, the actual reason that it’s called linear regression is technical and has enough subtlety that it often causes confusion. For example, the graph below is linear regression, too, even though the resulting line is curved. The definition is mathematical and has to do with how the predictor variables relate to the response variable. Suffice it to say that linear regression handles most simple relationships, but can’t do complicated mathematical operations such as raising one predictor variable to the power of another predictor variable.

2 - Linear Regression Example

The most common linear regression models use the ordinary least squares algorithm to pick the parameters in the model and form the best line possible to show the relationship (the line-of-best-fit). Though it’s an algorithm shared by many models, linear regression is by far the most common application. If someone is discussing least-squares regression, it is more likely than not that they are talking about linear regression.

What are the major advantages of linear regression analysis?

Linear regression models are known for being easy to interpret thanks to the applications of the model equation, both for understanding the underlying relationship and in applying the model to predictions. The fact that regression analysis is great for explanatory analysis and often good enough for prediction is rare among modeling techniques.

In contrast, most techniques do one or the other. For example, a well-tuned AI-based artificial neural network model may be great at prediction but is a “black box” that offers little to no interpretability. 

There are some other benefits too:

  • Linear regression is computationally fast, particularly if you’re using statistical software. Though it’s not always a simple task to do by hand, it’s still much faster than the days it would take to calculate many other models.
  • The popularity of regression models is itself an advantage. The fact that it is a tried and tested approach used by so many scientists makes for easy collaboration.

Assumptions of linear regression

Just because scientists' initial reaction is usually to try a linear regression model, that doesn't mean it is always the right choice. In fact, there are some underlying assumptions that, if ignored, could invalidate the model.

  • Random sample - The observations in your data need to be independent from one another. There are many ways that dependence occurs, for example, one common way is with multiple response data, where a single subject is measured multiple times. The measurements on the same individual are presumably correlated, and you couldn’t use linear regression in this case.
  • Independence between predictors - If you have multiple predictors in your model, in theory, they shouldn’t be correlated with one another. If they are, this can cause instability in your model fit, although this affects the interpretation of your model rather than the predictions. See more about multicollinearity here .
  • Homoscedasticity - Meaning ‘equal scatter,’ this says that your residuals (the difference between the model prediction and the observed values) should be just as variable anywhere along the continuum. This is assessed with residual plots.
  • Residuals are normally distributed - In addition to having equal scatter, in the standard linear regression model, the residuals are assumed to come from a normal distribution. This is commonly assessed using a QQ-plot.
  • Linear relationship between predictors and response - The relationships must be linear as described above , ruling out some more complicated mathematical relationships. You can model some “curves” in your data using, say, variable X and variable X^2 ("X squared") as predictors.
  • No uncertainty in predictor measurements - The model assumes that all of the uncertainty is in the response variable. This is the most nuanced assumption: Even if you’re attempting to make inferences about a model with predictors that are themselves estimates, this would not affect you unless you need to attribute the uncertainty to the predictors. This field of study is called “measurement error.”

Other things to keep in mind for valid inference:

  • Representative sample - The dataset you are going to use should be a representative (and random!) sample of the population you’re trying to make inferences about. To use an intuitive example, you should not expect all people to act the same as those in your household. Since we often underestimate our own bias, the best bet is to have a random sample when you start.
  • Sample size - If your dataset only has 5 observations in it, the model will be less effective at finding a real pattern (if one exists) than if it has 100. There is no one-size-fits-all number for every study, but generally 30 or more observations is considered the low end of what regression needs.
  • Stay in range - Don’t try to make predictions outside the range of the dataset you used to build the model. For example, let’s say you are predicting home values based on square footage. If your dataset only has homes between 1,000 and 3,000 square feet, the model may not be a good judge of the value of an 800 or 4,000 square-foot house. This is called extrapolating, and is not recommended.

Types of linear regression

The two most common types of regression are simple linear regression and multiple linear regression, which only differ by the number of predictors in the model. Simple linear regression has a single predictor. 

Simple linear regression

It’s called simple for a reason: If you are testing a linear relationship between exactly two continuous variables (one predictor and one response variable), you’re looking for a simple linear regression model, also called a least squares regression line. Are you looking to use more predictors than that? Try a multiple linear regression model. That is the main difference between the two, but there are other considerations and differences involved too.

You can use statistical software such as Prism to calculate simple linear regression coefficients and graph the regression line it produces. For a quick simple linear regression analysis, try our free online linear regression calculator .

Interpreting a simple linear regression model

Remember the y = mx+b formula for a line from grade school? The slope was m , and the y-intercept was b , and both were necessary to draw a line. That’s what you’re basically building here too, but most textbooks and programs will write out the predictive equation for regression this way:

Y = β₀ + β₁X

Y is your response variable, and X is your predictor. The two 𝛽 symbols are called “parameters”, the things the model will estimate to create your line of best fit. The first (not connected to X) is the intercept, the other (the coefficient in front of X) is called the slope term.

As an example, we will use a sample Prism dataset with diabetes data to model the relationship between a person’s glucose level (predictor) and their glycosylated hemoglobin level (response). Once we run the analysis we get this output:

3 - SLR Results Page - Linear regression

Best-fit parameters and the regression equation

The first section in the Prism output for simple linear regression is all about the workings of the model itself. They can be called parameters, estimates, or (as they are above) best-fit values. Keep in mind, parameter estimates could be positive or negative in regression depending on the relationship.

There you see the slope (for glucose) and the y-intercept. The values for those help us build the equation the model uses to estimate and make predictions:

Glycosylated Hemoglobin = 2.24 + (0.0312*Glucose)

Notice: That same equation is given later in the output, near the bottom of the page.

Using this equation, we can plug in any number in the range of our dataset for glucose and estimate that person’s glycosylated hemoglobin level. For instance, a glucose level of 90 corresponds to an estimate of 5.048 for that person’s glycosylated hemoglobin level. But that’s just the start of how these parameters are used.
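
That arithmetic is easy to check; a one-line sketch using the coefficients printed above:

# Plugging glucose = 90 into the fitted equation (coefficients as printed above)
2.24 + 0.0312 * 90   # = 5.048, the estimated glycosylated hemoglobin level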

Interpreting parameter estimates

You can also interpret the parameters of simple linear regression on their own, and because there are only two it is pretty straightforward.

The slope parameter is often the most helpful: It means that for every 1 unit increase in glucose, the estimated glycosylated hemoglobin level will increase by 0.0312 units. As an aside, if it were negative (perhaps -0.04), we would say a 1 unit increase in glucose would actually decrease the estimated response by 0.04.

The intercept parameter is useful for fitting the model, because it shifts the best-fit-line up or down. In this example, the value it shows (2.24) is the predicted glycosylated hemoglobin level for a person with a glucose level of 0. In cases like this, the interpretation of the intercept isn’t very interesting or helpful.

Simply put, if there’s no predictor with a value of 0 in the dataset, you should ignore this part of the interpretation and consider the model as a whole and the slope. However, notice that if you plug in 0 for a person’s glucose, 2.24 is exactly what the full model estimates. 

Confidence intervals and standard error

The next couple sections seem technical, but really get back to the core of how no model is perfect. We can give “point estimates” for the best-fit parameters today, but there’s still some uncertainty involved in trying to find the true and exact relationship between the variables. 

Standard error and confidence intervals work together to give an estimate of that uncertainty. Adding and subtracting a couple of standard errors (roughly two, for 95% confidence) from the estimate gives a fair range of possible values for that true relationship. With this 95% confidence interval, you can say you believe the true value of that parameter is somewhere between the two endpoints (for the slope of glucose, somewhere between 0.0285 and 0.0340).

This method may seem too cautious at first, but is simply giving a range of real possibilities around the point estimate. After all, wouldn’t you like to know if the point estimate you gave was wildly variable? This gives you that missing piece. 
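
In R, the same interval can be read off a fitted model. The sketch below assumes a fitted object and term name (fit, "glucose") that merely stand in for the Prism analysis above:

# Confidence interval for the glucose slope (fit and "glucose" are placeholder names)
est <- coef(summary(fit))["glucose", "Estimate"]
se  <- coef(summary(fit))["glucose", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = df.residual(fit)) * se   # estimate +/- ~2 standard errors

confint(fit, "glucose", level = 0.95)                    # R's built-in 95% interval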

Goodness of fit

Determining how well your model fits can be done graphically and numerically. If you know what to look for, there’s nothing better than plotting your data to assess the fit and how well your data meet the assumptions of the model. These diagnostic graphics plot the residuals, which are the differences between the estimated model and the observed data points.

A good plot to use is a residual plot versus the predictor (X) variable. Here you want to look for equal scatter, meaning the points all vary roughly the same above and below the dotted line across all x values. The plot on the left looks great, whereas the plot on the right shows a clear parabolic shaped trend, which would need to be addressed.

10 - Log Transform Comparison - Linear regression
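
If you are working in R rather than Prism, the equivalent diagnostic plots can be sketched like this (df, x, and y are placeholders for your own data frame and variables):

# Residual diagnostics for a simple linear regression (placeholder names)
fit <- lm(y ~ x, data = df)

plot(df$x, resid(fit))                   # residuals vs. predictor: look for equal scatter
abline(h = 0, lty = 2)
qqnorm(resid(fit)); qqline(resid(fit))   # QQ-plot: points near the line suggest normal residuals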

Another way to assess the goodness of fit is with the R-squared statistic, which is the proportion of the variance in the response that is explained by the model. In this case, the value of 0.561 says that 56% of the variance in glycosylated hemoglobin can be explained by this very simple model equation (effectively, that person’s glucose level).

The name R-squared may remind you of a similar statistic: Pearson’s R, which measures the correlation between any two variables. Fun fact: As long as you’re doing simple linear regression, the square root of R-squared matches the magnitude of the Pearson’s R correlation between the predictor and response variable (its sign follows the slope).

The reason is that simple linear regression draws on the same least-squares mechanics that Pearson’s R does for correlation. Keep in mind, while regression and correlation are similar, they are not the same thing. The differences usually come down to the purpose of the analysis, as correlation does not fit a line through the data points.
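
You can verify that fun fact with a couple of lines of R (the x and y vectors here are arbitrary illustrative values):

# In simple linear regression, R-squared equals the squared Pearson correlation
x <- c(80, 90, 100, 110, 120, 130)
y <- c(4.9, 5.1, 5.6, 5.8, 6.3, 6.4)

summary(lm(y ~ x))$r.squared   # model R-squared
cor(x, y)^2                    # squared Pearson correlation -- the same value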

Significance and F-tests

So we have a model, and we know how to use it for predictions. We know R-squared gives an idea of how well the model fits the data… but how do we know if there is actually a significant relationship between the variables? 

A section at the bottom asks that same question: Is the slope significantly non-zero? This is especially important for this model, where the best-fit value (roughly 0.03) seems very close to 0 to the naked eye. How can we feel confident one way or another?

In this case, the slope  is  significantly non-zero: An F-test gives a p-value of less than 0.0001. F-tests answer this for the model as a whole rather than its individual slopes, but in this case there is only one slope anyway. P-values are always interpreted in comparison to a “significance threshold”: If it’s less than the threshold level, the model is said to show a trend that is significantly different from “no relationship” (or, the null hypothesis). And based on how we set up the regression analysis to use 0.05 as the threshold for significance, it tells us that the model points to a significant relationship. There is evidence that this relationship is real.

If it wasn’t, then we are effectively saying there is no evidence that the model gives any new information beyond random guessing. In other words: The model may output a number for a prediction, but if the slope is not significant, it may not be worth actually considering that prediction.

Graphing linear regression

Since a linear regression model produces an equation for a line, graphing linear regression’s line-of-best-fit in relation to the points themselves is a popular way to see how closely the model fits the eye test. Software like Prism makes the graphing part of regression incredibly easy, because a graph is created automatically alongside the details of the model. Here are some more graphing tips , along with an example from our analysis:

5 - SLR Line of Best Fit - Linear regression

Multiple linear regression

If you understand the basics of simple linear regression, you understand about 80% of multiple linear regression, too. The inner-workings are the same, it is still based on the least-squares regression algorithm, and it is still a model designed to predict a response. But instead of just one predictor variable, multiple linear regression uses multiple predictors.

The model equation is similar to the previous one; the main thing you notice is that it’s longer because of the additional predictors. If you are using 3 predictor variables, the predictive equation will produce 3 slope estimates (one for each) along with an intercept term:

Y = β₀ + β₁X₁ + β₂X₂ + β₃X₃

Prism makes it easy to create a multiple linear regression model, especially calculating regression slope coefficients and generating graphics to diagnose how well the model fits.

What do I need to know about multicollinearity?

The assumptions for multiple linear regression are discussed here. With multiple predictors, in addition to the interpretation getting more challenging, another complication is multicollinearity.

Multicollinearity occurs when two or more predictor variables “overlap” in what they measure. In other places you will see this referred to as the variables being dependent on one another. Ideally, the predictors are independent and no one predictor influences the values of another.

There are various ways of measuring multicollinearity , but the main thing to know is that multicollinearity won’t affect how well your model predicts point values. However, it garbles inference about how each individual variable affects the response. 

For example, say that you want to estimate the height of a tree, and you have measured the circumference of the tree at two heights from the ground, one meter and two meters. The two circumferences will be highly correlated. If you include both in the model, it’s very possible that you could end up with a negative slope parameter for one of them. Clearly, a tree doesn't get shorter when its circumference gets larger; instead, that negative slope coefficient is acting as an adjustment to the other variable.

What is the difference between simple linear regression and multiple linear regression?

Once you’ve decided that your study is a good fit for a linear model, the choice between the two simply comes down to how many predictor variables you include. Just one? Simple linear. More than that? Multiple linear.

Based on that, you may be wondering, “Why would I ever do a simple linear regression when multiple linear regression can account for more variables?” Great question!

The answer is that sometimes less is more. A common misconception is that the goal of a model is to be 100% accurate. Scientists know that no model is perfect; a model is a simplified version of reality. So the goal isn’t perfection: Rather, the goal is to find as simple a model as possible that describes the relationships, so you understand the system, reach valid scientific conclusions, and design new experiments.

Still not convinced? Let’s say you were able to create a model that was 100% accurate for each point in your dataset. Most of the time if you’ve done this, you’ve done one of two things:

  • Come to an obvious conclusion that isn’t practically useful (100% of winning basketball teams score more points than their opponent) OR
  • You’ve modeled not only the trend in your data, but also the random “noise” that is too variable to count on. This is called “overfitting”: You tried so hard to account for every aspect of the past that the model ignores the differences that will arise in the future.

Other differences pop up on the technical side. To give some quick examples of that, using multiple linear regression means that:

  • In addition to the overall interpretation and significance of the model, each slope now has its own interpretation and question of significance.
  • R-squared is not as intuitive as it was for simple linear regression.
  • Graphing the equation is not a single line anymore. You could say that multiple linear regression just does not lend itself to graphing as easily.

All in all: simple regression is always more intuitive than multiple linear regression!

Interpreting multiple linear regression

We’ve said that multiple linear regression is harder to interpret than simple linear regression, and that is true. Taking the math and more technical aspects out of the question, overall interpretation is always harder the more factors are involved. But while there are more things to keep track of, the basic components of the thought process remain the same: parameters, confidence intervals and significance. We even use the model equation the same way.

Let’s use the same diabetes dataset to illustrate, but with a new wrinkle: In addition to glucose level, we will also include HDL and the person’s age as predictors of their glycosylated hemoglobin level (response). Here’s the output from Prism:

6 - MLR Results Page - Linear regression

Analysis of variance and F-tests

While most scientists’ eyes go straight to the section with parameter estimates, the first section of output is valuable and is the best place to start. Analysis of variance tests the model as a whole (and some individual pieces) to tell you how good your model is before you make sense of the rest.

It includes the Sum of Squares table, and the F-test on the far right of that section is of highest interest. The “Regression” as a whole (on the top line of the section) has a p-value of less than 0.0001 and is significant at the 0.05 level we chose to use. Each parameter slope has its own individual F-test too, but it is easier to understand as a t-test.

Parameter estimates and T-tests

Now for the fun part: The model itself has the same structure and information we used for simple linear regression, and we interpret it very similarly. The key is to remember that you are interpreting each parameter in its own right (not something you have to keep in mind with only one parameter!). Prism puts all of the statistics for each parameter in one table, including (for each parameter):

  • The parameter’s estimate itself
  • Its standard error and confidence interval
  • A P-value from a t-test

The estimates themselves are straightforward and are used to make the model equation, just like before. In this case the model’s predictive equation is (when rounding to the nearest thousandth):

Glycosylated Hemoglobin = 1.870 + 0.029*Glucose - 0.005*HDL +0.018*Age

If you remember back to our simple linear regression model, the slope for glucose has changed slightly. That is because we are now accounting for other factors too. This distinction can sometimes change the interpretation of an individual predictor’s effect dramatically.

When interpreting the individual slope estimates for predictor variables, the difference goes back to how Multiple Regression assumes each predictor is independent of the others. For simple regression you can say “a 1 point increase in X usually corresponds to a 5 point increase in Y”. For multiple regression it’s more like “a 1 point increase in X usually corresponds to a 5 point increase in Y, assuming every other factor is equal.” That may not seem like a big jump, but it acknowledges 1) that there are more factors at play and 2) the need for those predictors to not have influence on one another for the model to be helpful.
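
One way to see the "every other factor is equal" interpretation is to evaluate the fitted equation above at two glucose values while holding HDL and age fixed; a small sketch using the rounded coefficients printed above:

# "All else equal" interpretation using the rounded coefficients from the output above
predict_hb <- function(glucose, hdl, age) {
  1.870 + 0.029 * glucose - 0.005 * hdl + 0.018 * age
}

predict_hb(glucose = 91, hdl = 50, age = 40) -
  predict_hb(glucose = 90, hdl = 50, age = 40)   # 0.029: the glucose slope, HDL and age held fixed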

The standard errors and confidence intervals are also shown for each parameter, giving an idea of the variability for each slope/intercept on its own. Interpreting each one of these is done exactly the same way as we mentioned in the simple linear regression example, but remember that if multicollinearity exists, the standard errors and confidence intervals get inflated (often drastically).

On the end are p-values which, as you might guess, are interpreted just like we did for the first example. The underlying method behind each p-value here is a t-test. These only tell how significant each of the factors is; to evaluate the model as a whole we would need to use the F-test at the top.

Evaluating each on its own though is still helpful: In this case it shows that while the other predictors are all significant, HDL shows no significance since we have already considered the other factors. That is not to say that it has no significance on its own, only that it adds no value to a model of just glucose and age. In fact, now that we know this, we could choose to re-run our model with only glucose and age and dial in better parameter estimates for that simpler model.

Another difference in interpretation occurs when you have categorical predictor variables such as sex in our example data. When you add categorical variables to a model, you pick a “reference level.” In this case (image below), we selected female as our reference level. The model below says that males have slightly lower predicted response than females (about 0.15 less).

7 - MLR with Sex - Linear regression

Assessing how well your model fits with multiple linear regression is more difficult than with simple linear regression, although the ideas remain the same, i.e., there are graphical and numerical diagnoses.

At the very least, it’s good to check a residual vs predicted plot to look for trends. In our diabetes model, this plot (included below) looks okay at first, but has some issues. Notice that values tend to miss high on the left and low on the right.

8 - Residual multiple linear - Linear regression

However, on further inspection, notice that there are only a few outlying points causing this unequal scatter. If you see outliers like above in your analysis that disrupt equal scatter, you have a few options .

As for numerical evaluations of goodness of fit, you have a lot more options with multiple linear regression. R-squared is still a go-to if you just want a measure to describe the proportion of variance in the response variable that is explained by your model. However, a common use of the goodness of fit statistics is to perform model selection , which means deciding on what variables to include in the model. If that’s what you’re using the goodness of fit for, then you’re better off using adjusted R-squared or an information criterion such as AICc.
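
A minimal sketch of that kind of comparison in R (the data frame and variable names diabetes, hemoglobin, glucose, age, and hdl are placeholders for your own data):

# Comparing candidate models with adjusted R-squared and AIC (placeholder names)
m1 <- lm(hemoglobin ~ glucose,             data = diabetes)
m2 <- lm(hemoglobin ~ glucose + age,       data = diabetes)
m3 <- lm(hemoglobin ~ glucose + age + hdl, data = diabetes)

sapply(list(m1, m2, m3), function(m) summary(m)$adj.r.squared)  # higher is better
AIC(m1, m2, m3)                                                 # lower is better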

Graphing multiple linear regression

Graphs are extremely useful to test how well a multiple linear regression model fits overall. With multiple predictors, it’s not feasible to plot the predictors against the response variable like it is in simple linear regression. A simple solution is to use the predicted response value on the x-axis and the residuals on the y-axis (as shown above). As a reminder, the residuals are the differences between the predicted and the observed response values. There are also several other plots using residuals that can be used to assess other model assumptions such as normally distributed error terms and serial correlation.

Model selection - choosing which predictor variables to include

How do you know which predictor variables to include in your model? It’s a great question and an active area of research.

For most researchers in the sciences, you’re dealing with a few predictor variables, and you have a pretty good hypothesis about the general structure of your model. If this is the case, then you might just try fitting a few different models, and picking the one that looks best based on how the residuals look and using a goodness of fit metric such as adjusted R-square or AICc .

Why doesn't my model fit well?

There are a lot of reasons that would cause your model to not fit well. One reason is having too much unexplained variance in the response. This could be because there were important predictor variables that you didn’t measure, or the relationship between the predictors and the response is more complicated than a simple linear regression model. In this last case, you can consider using interaction terms or transformations of the predictor variables.

If prediction accuracy is all that matters to you, meaning that you only want a good estimate of  the response and don’t need to understand how the predictors affect it, then there are a lot of clever, computational tools for building and selecting models. We won’t cover them in this guide, but if you want to know more about this topic, look into cross-validation and LASSO regression to get started.

Interactions

Interactions and transformations are useful tools to address situations where your model doesn't fit well by just using the unmodified predictor variables.

Interaction terms are found by multiplying two predictor variables together to create a new “interaction” variable. They greatly increase the complexity of describing how each variable affects the response. The primary use is to allow for more flexibility so that the effect of one predictor variable depends on the value of another predictor variable.

For a specific example using the diabetes data above, perhaps we have reason to believe that the effect of glucose on the response (hemoglobin %) changes depending on the age of the patient. Stats software makes this simple to do, but in effect, we multiply glucose by age, and include that new term in our model. Our new model when rounded is:

Glycosylated Hemoglobin = 0.42 + 0.044*Glucose - 0.004*HDL + 0.044*Age - 0.0003*Glucose*Age

For reference, our model without the interaction term was:

Glycosylated Hemoglobin = 1.865 + 0.029*Glucose - 0.005*HDL +0.018*Age

Adding the interaction term changed the other estimates by a lot! Interpreting what this means is challenging. At the very least, we can say that the effect of glucose depends on age for this model, since the interaction coefficient is statistically significant. We might also want to say that high glucose appears to matter less for older patients due to the negative coefficient estimate of the interaction term. However, there is very high multicollinearity in this model (and in nearly every model with interaction terms), so interpreting the coefficients should be done with caution. Even with this example, if we remove a few outliers, this interaction term is no longer statistically significant, so it is unstable and could simply be a byproduct of noisy data.

9 - Interaction Results - Linear regression
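
In R, the same idea is a one-term change to the model formula. The sketch below uses placeholder names (diabetes, hemoglobin, glucose, hdl, age) standing in for the dataset described above:

# Adding a glucose-by-age interaction term (placeholder names)
fit_main <- lm(hemoglobin ~ glucose + hdl + age,               data = diabetes)
fit_int  <- lm(hemoglobin ~ glucose + hdl + age + glucose:age, data = diabetes)

summary(fit_int)          # the glucose:age row holds the interaction coefficient and its t-test
anova(fit_main, fit_int)  # F-test: does the interaction term improve the fit?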

Transformations

In addition to interactions, another strategy to use when your model doesn't fit your data well is to transform variables. You can transform your response or any of your predictor variables.

Transformations on the response variable change the interpretation quite a bit. Instead of the model fitting your response variable, y , it fits the transformed y . A common example where this is appropriate is with predicting height for various ages of an animal species. Log transformations on the response, height in this case, are used because the variability in height at birth is very small, but the variability of height with adult animals is much higher. This violates the assumption of equal scatter. 

In the plots below, notice the funnel type shape on the left, where the scatter widens as age increases. On the right hand side, the funnel shape disappears and the variability of the residuals looks consistent.

The linear model using the log transformed y fits much better, however now the interpretation of the model changes. Using the example data above, the predicted model is:

ln(y) = -0.4 + 0.2 * x

This means that a single unit change in x results in a 0.2 increase in the log of y . That doesn't mean much to most people. Instead, you probably want your interpretation to be on the original y scale. To do that, we need to exponentiate both sides of the equation, which (avoiding the mathematical details) means that a 1 unit increase in x results in a 22% increase in y .
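
That 22% figure comes straight from exponentiating the slope; a one-line check:

# Back-transforming the log-scale slope to a percent change on the original y scale
exp(0.2) - 1   # 0.2214..., i.e. roughly a 22% increase in y per 1-unit increase in x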

All of that is to say that transformations can assist with fitting your model, but they can complicate interpretation. 

When linear regression doesn't work

The ubiquitous nature of linear regression is a positive for collaboration, but sometimes it causes researchers to assume (before doing their due diligence) that a linear regression model is the right model for every situation. Sometimes software even seems to reinforce this attitude and the model that is subsequently chosen, rather than the person remaining in control of their research.

Sure, linear regression is great for its simplicity and familiarity, but there are many situations where there are better alternatives.

Other types of regression

Logistic regression

Linear vs logistic regression: linear regression is appropriate when your response variable is continuous, but if your response has only two levels (e.g., presence/absence, yes/no, etc.), then look into simple logistic regression or multiple logistic regression .

Poisson regression

If instead, your response variable is a count (e.g., number of earthquakes in an area, number of males a female horseshoe crab has nesting nearby, etc.), then consider Poisson regression .

Nonlinear regression

For more complicated mathematical relationships between the predictors and response variables, such as dose-response curves in pharmacokinetics, check out nonlinear regression .

ANOVA

If you’ve designed and run an experiment with a continuous response variable and your research factors are categorical (e.g., Diet 1/Diet 2, Treatment 1/Treatment 2, etc.), then you need ANOVA models. These are differentiated by the number of treatments (one-way ANOVA, two-way ANOVA, three-way ANOVA) or other characteristics such as repeated measures ANOVA.

Principal component regression

Principal component regression is useful when you have as many or more predictor variables than observations in your study. It offers a technique for reducing the “dimension” of your predictors, so that you can still fit a linear regression model.

Cox proportional hazards regression

Cox proportional hazards regression is the go-to technique for survival analysis, when you have data measuring time until an event.

Deming regression

Deming regression is useful when there are two variables ( x and y ), and there is measurement error in both variables. One common situation that this occurs is comparing results from two different methods (e.g., comparing two different machines that measure blood oxygen level or that check for a particular pathogen).

Perform your own Linear Regression

Are you ready to calculate your own Linear Regression?  With a consistently clear, practical, and well-documented interface, learn how Prism can give you the controls you need to fit your data and simplify nonlinear regression .

Start your 30 day free trial of Prism   and get access to:

  • A step by step guide on how to perform Linear Regression
  • Sample data to save you time
  • More tips on how Prism can help your research

With Prism, in a matter of minutes you learn how to go from entering data to performing statistical analyses and generating high-quality graphs.


5.2 - Writing Hypotheses

The first step in conducting a hypothesis test is to write the hypothesis statements that are going to be tested. For each test you will have a null hypothesis (\(H_0\)) and an alternative hypothesis (\(H_a\)).

When writing hypotheses there are three things that we need to know: (1) the parameter that we are testing, (2) the direction of the test (non-directional, right-tailed, or left-tailed), and (3) the value of the hypothesized parameter.

  • At this point we can write hypotheses for a single mean (\(\mu\)), paired means(\(\mu_d\)), a single proportion (\(p\)), the difference between two independent means (\(\mu_1-\mu_2\)), the difference between two proportions (\(p_1-p_2\)), a simple linear regression slope (\(\beta\)), and a correlation (\(\rho\)). 
  • The research question will give us the information necessary to determine if the test is two-tailed (e.g., "different from," "not equal to"), right-tailed (e.g., "greater than," "more than"), or left-tailed (e.g., "less than," "fewer than").
  • The research question will also give us the hypothesized parameter value. This is the number that goes in the hypothesis statements (i.e., \(\mu_0\) and \(p_0\)). For the difference between two groups, regression, and correlation, this value is typically 0.

Hypotheses are always written in terms of population parameters (e.g., \(p\) and \(\mu\)). Note that the null hypothesis always includes the equality (i.e., =).
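
For instance, for a two-tailed (non-directional) test about a simple linear regression slope with a hypothesized value of 0, the hypotheses would be written as:

\(H_0\colon \beta = 0\)
\(H_a\colon \beta \neq 0\)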

Linear regression hypothesis testing: Concepts, Examples

Simple linear regression model

In relation to machine learning, linear regression is a predictive modeling technique that allows us to build a model that predicts a continuous response variable as a function of a linear combination of explanatory or predictor variables. While training linear regression models, we rely on hypothesis testing to determine the relationship between the response and the predictor variables. Two types of hypothesis tests are done for a linear regression model: t-tests and F-tests. In other words, two types of statistics, t-statistics and F-statistics, are used to assess whether a linear regression relationship between the response and predictor variables exists. As data scientists, it is of utmost importance to determine whether linear regression is the correct choice of model for a particular problem, and this can be done by performing hypothesis testing on the regression's response and predictor variables. These concepts are often not very clear even to practicing data scientists. In this blog post, we will discuss linear regression and the hypothesis tests based on t-statistics and F-statistics, and provide an example to help illustrate how these concepts work.

What are linear regression models?

A linear regression model can be defined as the function approximation that represents a continuous response variable as a function of one or more predictor variables. While building a linear regression model, the goal is to identify a linear equation that best predicts or models the relationship between the response or dependent variable and one or more predictor or independent variables.

There are two different kinds of linear regression models. They are as follows:

  • Simple or univariate linear regression models : These are linear regression models used to build a linear relationship between one response or dependent variable and one predictor or independent variable. The form of the equation that represents a simple linear regression model is Y = mX + b, where m is the coefficient of the predictor variable and b is the bias. When considering the linear regression line, m represents the slope and b represents the intercept.
  • Multiple or multivariate linear regression models : These are linear regression models used to build a linear relationship between one response or dependent variable and more than one predictor or independent variable. The form of the equation that represents a multiple linear regression model is Y = b0 + b1X1 + b2X2 + … + bnXn, where bi represents the coefficient of the ith predictor variable and b0 is the intercept. In this type of linear regression model, each predictor variable has its own coefficient that is used to calculate the predicted value of the response variable.

While training linear regression models, the requirement is to determine the coefficients which can result in the best-fitted linear regression line. The learning algorithm used to find the most appropriate coefficients is known as least squares regression . In the least-squares regression method, the coefficients are calculated using the least-squares error function. The main objective of this method is to minimize or reduce the sum of squared residuals between actual and predicted response values. The sum of squared residuals is also called the residual sum of squares (RSS). The outcome of executing the least-squares regression method is coefficients that minimize the linear regression cost function .

The residual e of the ith observation is represented as the following where [latex]Y_i[/latex] is the ith observation and [latex]\hat{Y_i}[/latex] is the prediction for ith observation or the value of response variable for ith observation.

[latex]e_i = Y_i – \hat{Y_i}[/latex]

The residual sum of squares can be represented as the following:

[latex]RSS = e_1^2 + e_2^2 + e_3^2 + … + e_n^2[/latex]

The least-squares method represents the algorithm that minimizes the above term, RSS.
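
As a quick illustration of the quantity being minimized, here is a small sketch in R (the x and y vectors are arbitrary made-up values):

# Residuals and RSS for a fitted model (x and y are arbitrary illustrative values)
x <- c(1, 2, 3, 4, 5)
y <- c(1.8, 4.1, 6.3, 7.9, 10.2)
fit <- lm(y ~ x)

e <- y - fitted(fit)   # residuals e_i = Y_i - Yhat_i
sum(e^2)               # residual sum of squares (RSS) minimized by least squares
deviance(fit)          # the same RSS as reported by R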

Once the coefficients are determined, can it be claimed that these coefficients are the most appropriate ones for linear regression? The answer is no. After all, the coefficients are only the estimates and thus, there will be standard errors associated with each of the coefficients.  Recall that the standard error is used to calculate the confidence interval in which the mean value of the population parameter would exist. In other words, it represents the error of estimating a population parameter based on the sample data. The value of the standard error is calculated as the standard deviation of the sample divided by the square root of the sample size. The formula below represents the standard error of a mean.

[latex]SE(\mu) = \frac{\sigma}{\sqrt(N)}[/latex]

Thus, without analyzing aspects such as the standard error associated with each coefficient, it cannot be claimed that the linear regression coefficients are the most suitable ones. This is where hypothesis testing is needed. Before we get into why we need hypothesis testing with the linear regression model, let’s briefly review what hypothesis testing is.

Train a Multiple Linear Regression Model using R

Before getting into understanding the hypothesis testing concepts in relation to the linear regression model, let’s train a multi-variate or multiple linear regression model and print the summary output of the model which will be referred to, in the next section. 

The data used for creating the multiple linear regression model is BostonHousing, which can be loaded in RStudio by installing the mlbench package. The code is shown below:

install.packages("mlbench")
library(mlbench)
data("BostonHousing")

Once the data is loaded, the code shown below can be used to create the linear regression model.

attach(BostonHousing)
BostonHousing.lm <- lm(log(medv) ~ crim + chas + rad + lstat)
summary(BostonHousing.lm)

Executing the above commands will result in the creation of a linear regression model with log(medv) as the response variable and crim, chas, rad, and lstat as the predictor variables. The following describes the response and predictor variables:

  • log(medv) : Log of the median value of owner-occupied homes in USD 1000’s
  • crim : Per capita crime rate by town
  • chas : Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • rad : Index of accessibility to radial highways
  • lstat : Percentage of the lower status of the population

The following is the output of the summary command, which prints the details of the model, including the hypothesis testing details for the coefficients (t-statistics) and for the model as a whole (F-statistics).

Figure: summary output of the linear regression model fitted in R.

Hypothesis tests & Linear Regression Models

Hypothesis tests are statistical procedures used to test a claim or assumption about the underlying distribution of a population based on sample data. Here are the key steps in performing hypothesis tests with linear regression models:

  • Hypothesis formulation for T-tests: In the case of linear regression, the claim is made that there exists a relationship between response and predictor variables, and the claim is represented using the non-zero value of coefficients of predictor variables in the linear equation or regression model. This is formulated as an alternate hypothesis. Thus, the null hypothesis is set that there is no relationship between response and the predictor variables . Hence, the coefficients related to each of the predictor variables is equal to zero (0). So, if the linear regression model is Y = a0 + a1x1 + a2x2 + a3x3, then the null hypothesis for each test states that a1 = 0, a2 = 0, a3 = 0 etc. For all the predictor variables, individual hypothesis testing is done to determine whether the relationship between response and that particular predictor variable is statistically significant based on the sample data used for training the model. Thus, if there are, say, 5 features, there will be five hypothesis tests and each will have an associated null and alternate hypothesis.
  • Hypothesis formulation for F-test : In addition, there is a hypothesis test done around the claim that there is a linear regression model representing the response variable and all the predictor variables. The null hypothesis is that the linear regression model does not exist . This essentially means that the value of all the coefficients is equal to zero. So, if the linear regression model is Y = a0 + a1x1 + a2x2 + a3x3, then the null hypothesis states that a1 = a2 = a3 = 0.
  • F-statistic for testing the hypothesis for the linear regression model: The F-test is used to test the null hypothesis that a linear regression model does not exist, i.e., that there is no linear relationship between the response variable y and the predictor variables x1, x2, x3, x4 and x5. This null hypothesis can also be written as a1 = a2 = a3 = a4 = a5 = 0 (all slope coefficients equal to zero). The F-statistic is calculated as a function of the residual sum of squares of the restricted regression (the model with only an intercept, i.e., all slope coefficients set to zero) and the residual sum of squares of the unrestricted regression (the full linear regression model). In the summary output above, note the value of the F-statistic (15.66) against the degrees of freedom (5 and 194). (A small numerical sketch of the t- and F-statistic calculations appears after this list.)
  • Evaluate t-statistics against the critical value/region: After calculating the t-statistic for each coefficient, a decision must be made about whether to reject or fail to reject the null hypothesis. To make this decision, one needs to set a significance level, also known as the alpha level; a significance level of 0.05 is commonly used. If the value of the t-statistic falls in the critical region, the null hypothesis is rejected. Equivalently, if the p-value comes out to be less than 0.05, the null hypothesis is rejected.
  • Evaluate the F-statistic against the critical value/region: The value of the F-statistic and its p-value are evaluated to test the null hypothesis that the linear regression model representing the response and predictor variables does not exist. If the value of the F-statistic is greater than the critical value at the 0.05 level of significance, the null hypothesis is rejected. This means that a linear model exists with at least one non-zero coefficient.
  • Draw conclusions: The final step of hypothesis testing is to draw a conclusion by interpreting the results in terms of the original claim or hypothesis. If the null hypothesis for one or more predictor variables is rejected, it means that the relationship between the response and that predictor variable is statistically significant based on the evidence, i.e., the sample data used for training the model. Similarly, if the F-statistic lies in the critical region and its p-value is less than the alpha level (usually set at 0.05), one can say that a statistically significant linear regression model exists.
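To make the t-test and F-test calculations described above concrete, here is a minimal Python sketch (separate from the R workflow used in this article). The estimate, standard error, sample size, and residual sums of squares are invented placeholder values, not numbers taken from the BostonHousing model.

from scipy import stats

# t-test for a single coefficient, H0: coefficient = 0
estimate, std_error = 0.35, 0.12     # placeholder estimate and standard error
n, k = 100, 4                        # assumed sample size and number of predictors
t_stat = estimate / std_error        # t-statistic = estimate / standard error
df_resid = n - k - 1                 # residual degrees of freedom
p_value_t = 2 * stats.t.sf(abs(t_stat), df=df_resid)   # two-sided p-value

# F-test for the model as a whole, H0: all slope coefficients = 0
rss_restricted = 250.0               # residual sum of squares, intercept-only model (made up)
rss_unrestricted = 140.0             # residual sum of squares, full model (made up)
f_stat = ((rss_restricted - rss_unrestricted) / k) / (rss_unrestricted / df_resid)
p_value_f = stats.f.sf(f_stat, dfn=k, dfd=df_resid)    # right-tail p-value

print(f"t = {t_stat:.2f}, p = {p_value_t:.4f}")
print(f"F = {f_stat:.2f}, p = {p_value_f:.4f}")

If either p-value falls below the chosen significance level (commonly 0.05), the corresponding null hypothesis is rejected.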

Why hypothesis tests for linear regression models?

The reasons why we need to do hypothesis tests in the case of a linear regression model are the following:

  • By creating the model, we are making claims about the relationship between the response (dependent) variable and one or more predictor (independent) variables. One or more tests are needed to justify these claims, and these tests are what we call hypothesis tests.
  • One kind of test is required to test the relationship between the response and each of the predictor variables (hence, t-tests).
  • Another kind of test is required to test the linear regression model representation as a whole. This is called the F-test.

While training linear regression models, hypothesis testing is done to determine whether the relationship between the response and each of the predictor variables is statistically significant. The coefficients related to each of the predictor variables are estimated, and then individual hypothesis tests are done to determine whether the relationship between the response and that particular predictor variable is statistically significant based on the sample data used for training the model. If the null hypothesis for a predictor is rejected, it means that there is a statistically significant relationship between the response and that particular predictor variable. The t-statistic is used for performing this hypothesis test because the standard deviation of the sampling distribution is unknown. The value of the t-statistic is compared with the critical value from the t-distribution table in order to decide whether to reject or fail to reject the null hypothesis regarding the relationship between the response and the predictor variable. If the value falls in the critical region, the null hypothesis is rejected, which means that the relationship is statistically significant. In addition to t-tests, an F-test is performed to test the null hypothesis that the linear regression model does not exist and that the value of all the coefficients is zero (0). Learn more about linear regression and the t-test in this blog – Linear regression t-test: formula, example.

Linear regression - Hypothesis testing

by Marco Taboga , PhD

This lecture discusses how to perform tests of hypotheses about the coefficients of a linear regression model estimated by ordinary least squares (OLS).

Table of contents

  • Normal vs non-normal model
  • The linear regression model
  • Matrix notation
  • Tests of hypothesis in the normal linear regression model
  • Test of a restriction on a single coefficient (t test)
  • Test of a set of linear restrictions (F test)
  • Tests based on maximum likelihood procedures (Wald, Lagrange multiplier, likelihood ratio)
  • Tests of hypothesis when the OLS estimator is asymptotically normal
  • Test of a restriction on a single coefficient (z test)
  • Test of a set of linear restrictions (Chi-square test)
  • Learn more about regression analysis

The lecture is divided in two parts:

in the first part, we discuss hypothesis testing in the normal linear regression model , in which the OLS estimator of the coefficients has a normal distribution conditional on the matrix of regressors;

in the second part, we show how to carry out hypothesis tests in linear regression analyses where the hypothesis of normality holds only in large samples (i.e., the OLS estimator can be proved to be asymptotically normal).

How to choose which test to carry out after estimating a linear regression model.

We also denote:

We now explain how to derive tests about the coefficients of the normal linear regression model.

It can be proved (see the lecture about the normal linear regression model ) that the assumption of conditional normality implies that the OLS estimator of the coefficients, conditional on the matrix of regressors, has a multivariate normal distribution whose mean is the true coefficient vector and whose covariance matrix is proportional to the inverse of the cross-product matrix of the regressors.

How the acceptance region is determined depends not only on the desired size of the test, but also on whether the test is:

two-tailed (both abnormally small and abnormally large values of the test statistic lead to rejection of the null hypothesis), or

one-tailed (only one of the two things, i.e., either smaller or larger, is possible).

For more details on how to determine the acceptance region, see the glossary entry on critical values .


The F test is one-tailed .

A critical value in the right tail of the F distribution is chosen so as to achieve the desired size of the test.

Then, the null hypothesis is rejected if the F statistic is larger than the critical value.

In this section we explain how to perform hypothesis tests about the coefficients of a linear regression model when the OLS estimator is asymptotically normal.

As we have shown in the lecture on the properties of the OLS estimator , in several cases (i.e., under different sets of assumptions) it can be proved that the OLS estimator is consistent and asymptotically normal.

These two properties are used to derive the asymptotic distribution of the test statistics used in hypothesis testing.

The test can be either one-tailed or two-tailed . The same comments made for the t-test apply here.


Like the F test, the Chi-square test is usually one-tailed.

The desired size of the test is achieved by appropriately choosing a critical value in the right tail of the Chi-square distribution.

The null is rejected if the Chi-square statistic is larger than the critical value.
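As a rough numerical illustration of the z test and the Chi-square (Wald-type) test described above, the following Python sketch assumes that coefficient estimates and an estimated asymptotic covariance matrix are already available; both are invented for illustration and are not taken from the lecture.

import numpy as np
from scipy import stats

# Invented estimates and asymptotic covariance matrix for three coefficients
beta_hat = np.array([1.8, -0.4, 0.9])
cov_hat = np.array([[0.20, 0.02, 0.01],
                    [0.02, 0.15, 0.03],
                    [0.01, 0.03, 0.25]])

# z test of a single restriction, H0: second coefficient = 0 (two-tailed)
z_stat = beta_hat[1] / np.sqrt(cov_hat[1, 1])
p_z = 2 * stats.norm.sf(abs(z_stat))

# Chi-square (Wald) test of the joint restriction that the second and third coefficients are zero
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])       # restriction matrix picking out the tested coefficients
r = R @ beta_hat
wald_stat = r @ np.linalg.inv(R @ cov_hat @ R.T) @ r
p_chi2 = stats.chi2.sf(wald_stat, df=R.shape[0])   # right tail of the Chi-square distribution

print(f"z = {z_stat:.2f}, p = {p_z:.4f}")
print(f"Chi-square = {wald_stat:.2f}, p = {p_chi2:.4f}")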

Want to learn more about regression analysis? Here are some suggestions:

R squared of a linear regression ;

Gauss-Markov theorem ;

Generalized Least Squares ;

Multicollinearity ;

Dummy variables ;

Selection of linear regression models

Partitioned regression ;

Ridge regression .




Multiple Linear Regression | A Quick Guide (Examples)

Published on February 20, 2020 by Rebecca Bevans . Revised on June 22, 2023.

Regression models are used to describe relationships between variables by fitting a line to the observed data. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.

Multiple linear regression is used to estimate the relationship between  two or more independent variables and one dependent variable . You can use multiple linear regression when you want to know:

  • How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
  • The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).

Table of contents

  • Assumptions of multiple linear regression
  • How to perform a multiple linear regression
  • Interpreting the results
  • Presenting the results
  • Frequently asked questions about multiple linear regression

Multiple linear regression makes all of the same assumptions as simple linear regression :

Homogeneity of variance (homoscedasticity) : the size of the error in our prediction doesn’t change significantly across the values of the independent variable.

Independence of observations : the observations in the dataset were collected using statistically valid sampling methods , and there are no hidden relationships among variables.

In multiple linear regression, it is possible that some of the independent variables are actually correlated with one another, so it is important to check these before developing the regression model. If two independent variables are too highly correlated (r2 > ~0.6), then only one of them should be used in the regression model.

Normality : The data follows a normal distribution .

Linearity : the line of best fit through the data points is a straight line, rather than a curve or some sort of grouping factor.


Multiple linear regression formula

The formula for a multiple linear regression is:

y = \beta_0 + \beta_1 X_1 + … + \beta_n X_n + \epsilon

  • y = the predicted value of the dependent variable
  • \beta_0 = the y-intercept (the value of y when all other parameters are set to 0)
  • \beta_1 X_1 = the regression coefficient (\beta_1) of the first independent variable (X_1)
  • … = do the same for however many independent variables you are testing
  • \beta_n X_n = the regression coefficient of the last independent variable
  • \epsilon = model error (how much variation there is in our estimate of y)

To find the best-fit line for each independent variable, multiple linear regression calculates three things:

  • The regression coefficients that lead to the smallest overall model error.
  • The t statistic of the overall model.
  • The associated p value (how likely it is that the t statistic would have occurred by chance if the null hypothesis of no relationship between the independent and dependent variables was true).

It then calculates the t statistic and p value for each regression coefficient in the model.

Multiple linear regression in R

While it is possible to do multiple linear regression by hand, it is much more commonly done via statistical software. We are going to use R for our examples because it is free, powerful, and widely available. Download the sample dataset to try it yourself.

Dataset for multiple linear regression (.csv)

Load the heart.data dataset into your R environment and fit the model with R’s lm() function, using the independent variables biking and smoking and the dependent variable heart disease.

Learn more by following the full step-by-step guide to linear regression in R .

To view the results of the model, you can use the summary() function:

This function takes the most important parameters from the linear model and puts them into a table that looks like this:

R multiple linear regression summary output

The summary first prints out the formula (‘Call’), then the model residuals (‘Residuals’). If the residuals are roughly centered around zero and with similar spread on either side, as these do (median 0.03, and min and max around -2 and 2), then the model probably fits the assumption of homoscedasticity.

Next are the regression coefficients of the model (‘Coefficients’). Row 1 of the coefficients table is labeled (Intercept) – this is the y-intercept of the regression equation. It’s helpful to know the estimated intercept in order to plug it into the regression equation and predict values of the dependent variable:

The most important things to note in this output table are the next two tables – the estimates for the independent variables.

The Estimate column shows the estimated effect , also called the regression coefficient. The estimates in the table tell us that for every one percent increase in biking to work there is an associated 0.2 percent decrease in heart disease, and that for every one percent increase in smoking there is an associated 0.17 percent increase in heart disease.

The Std.error column displays the standard error of the estimate. This number shows how much variation there is around the estimates of the regression coefficient.

The t value column displays the test statistic . Unless otherwise specified, the test statistic used in linear regression is the t value from a two-sided t test . The larger the test statistic, the less likely it is that the results occurred by chance.

The Pr( > | t | ) column shows the p value . This shows how likely the calculated t value would have occurred by chance if the null hypothesis of no effect of the parameter were true.

Because these values are so low ( p < 0.001 in both cases), we can reject the null hypothesis and conclude that both biking to work and smoking likely influence rates of heart disease.

When reporting your results, include the estimated effect (i.e. the regression coefficient), the standard error of the estimate, and the p value. You should also interpret your numbers to make it clear to your readers what the regression coefficient means.

Visualizing the results in a graph

It can also be helpful to include a graph with your results. Multiple linear regression is somewhat more complicated than simple linear regression, because there are more parameters than will fit on a two-dimensional plot.

However, there are ways to display your results that include the effects of multiple independent variables on the dependent variable, even though only one independent variable can actually be plotted on the x-axis.

Multiple regression in R graph

Here, we have calculated the predicted values of the dependent variable (heart disease) across the full range of observed values for the percentage of people biking to work.

To include the effect of smoking on the dependent variable, we calculated these predicted values while holding smoking constant at the minimum, mean , and maximum observed rates of smoking.



A regression model is a statistical model that estimates the relationship between one dependent variable and one or more independent variables using a line (or a plane in the case of two or more independent variables).

A regression model can be used when the dependent variable is quantitative, except in the case of logistic regression, where the dependent variable is binary.

Multiple linear regression is a regression model that estimates the relationship between a quantitative dependent variable and two or more independent variables using a straight line.

Linear regression most often uses mean-square error (MSE) to calculate the error of the model. MSE is calculated by:

  • measuring the distance of the observed y-values from the predicted y-values at each value of x;
  • squaring each of these distances;
  • calculating the mean of the squared distances.

Linear regression fits a line to the data by finding the regression coefficient that results in the smallest MSE.
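The three steps above translate directly into code. Here is a minimal NumPy sketch with made-up observed and predicted values:

import numpy as np

y_observed = np.array([3.0, 5.0, 7.5, 9.0])     # made-up observed y-values
y_predicted = np.array([2.8, 5.4, 7.1, 9.3])    # made-up predicted y-values

distances = y_observed - y_predicted    # distance of observed from predicted values
squared = distances ** 2                # square each of these distances
mse = squared.mean()                    # mean of the squared distances
print(mse)                              # 0.1125 for these made-up values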


Linear Regression Analysis using SPSS Statistics

Introduction.

Linear regression is the next step up after correlation. It is used when we want to predict the value of a variable based on the value of another variable. The variable we want to predict is called the dependent variable (or sometimes, the outcome variable). The variable we are using to predict the other variable's value is called the independent variable (or sometimes, the predictor variable). For example, you could use linear regression to understand whether exam performance can be predicted based on revision time; whether cigarette consumption can be predicted based on smoking duration; and so forth. If you have two or more independent variables, rather than just one, you need to use multiple regression .

This "quick start" guide shows you how to carry out linear regression using SPSS Statistics, as well as interpret and report the results from this test. However, before we introduce you to this procedure, you need to understand the different assumptions that your data must meet in order for linear regression to give you a valid result. We discuss these assumptions next.

SPSS Statistics

Assumptions.

When you choose to analyse your data using linear regression, part of the process involves checking to make sure that the data you want to analyse can actually be analysed using linear regression. You need to do this because it is only appropriate to use linear regression if your data "passes" seven assumptions that are required for linear regression to give you a valid result. In practice, checking for these seven assumptions just adds a little bit more time to your analysis, requiring you to click a few more buttons in SPSS Statistics when performing your analysis, as well as think a little bit more about your data, but it is not a difficult task.

Before we introduce you to these seven assumptions, do not be surprised if, when analysing your own data using SPSS Statistics, one or more of these assumptions is violated (i.e., not met). This is not uncommon when working with real-world data rather than textbook examples, which often only show you how to carry out linear regression when everything goes well! However, don’t worry. Even when your data fails certain assumptions, there is often a solution to overcome this. First, let’s take a look at these seven assumptions:

  • Assumption #1: Your dependent variable should be measured at the continuous level (i.e., it is either an interval or ratio variable). Examples of continuous variables include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth. You can learn more about interval and ratio variables in our article: Types of Variable .
  • Assumption #2: Your independent variable should also be measured at the continuous level (i.e., it is either an interval or ratio variable). See the bullet above for examples of continuous variables.

  • Assumption #3: There needs to be a linear relationship between the two variables. A simple way to check this is to plot the dependent variable against the independent variable and look for an approximately straight-line pattern.

Types of relationship

  • Assumption #4: There should be no significant outliers, since outliers can have a large negative effect on the regression results.

  • Assumption #5: You should have independence of observations , which you can easily check using the Durbin-Watson statistic, which is a simple test to run using SPSS Statistics. We explain how to interpret the result of the Durbin-Watson statistic in our enhanced linear regression guide.

  • Assumption #6: Your data needs to show homoscedasticity, which is where the variances along the line of best fit remain similar as you move along the line.

Homoscedasticity in linear regression

  • Assumption #7: Finally, you need to check that the residuals (errors) of the regression line are approximately normally distributed (we explain these terms in our enhanced linear regression guide). Two common methods to check this assumption include using either a histogram (with a superimposed normal curve) or a Normal P-P Plot. Again, in our enhanced linear regression guide, we: (a) show you how to check this assumption using SPSS Statistics, whether you use a histogram (with superimposed normal curve) or Normal P-P Plot; (b) explain how to interpret these diagrams; and (c) provide a possible solution if your data fails to meet this assumption.

You can check assumptions #3, #4, #5, #6 and #7 using SPSS Statistics. Assumption #3 should be checked first, before moving onto assumptions #4, #5, #6 and #7. We suggest testing the assumptions in this order because assumptions #3, #4, #5, #6 and #7 require you to run the linear regression procedure in SPSS Statistics first, so it is easier to deal with these after checking assumptions #1 and #2. Just remember that if you do not run the statistical tests on these assumptions correctly, the results you get when running a linear regression might not be valid. This is why we dedicate a number of sections of our enhanced linear regression guide to help you get this right. You can find out more about our enhanced content as a whole on our Features: Overview page, or more specifically, learn how we help with testing assumptions on our Features: Assumptions page.

In the section, Procedure , we illustrate the SPSS Statistics procedure to perform a linear regression assuming that no assumptions have been violated. First, we introduce the example that is used in this guide.

A salesperson for a large car brand wants to determine whether there is a relationship between an individual's income and the price they pay for a car. As such, the individual's "income" is the independent variable and the "price" they pay for a car is the dependent variable. The salesperson wants to use this information to determine which cars to offer potential customers in new areas where average income is known.

Setup in SPSS Statistics

In SPSS Statistics, we created two variables so that we could enter our data: Income (the independent variable), and Price (the dependent variable). It can also be useful to create a third variable, caseno , to act as a chronological case number. This third variable is used to make it easy for you to eliminate cases (e.g., significant outliers) that you have identified when checking for assumptions. However, we do not include it in the SPSS Statistics procedure that follows because we assume that you have already checked these assumptions. In our enhanced linear regression guide, we show you how to correctly enter data in SPSS Statistics to run a linear regression when you are also checking for assumptions. You can learn about our enhanced data setup content on our Features: Data Setup page. Alternately, see our generic, "quick start" guide: Entering Data in SPSS Statistics .

Test Procedure in SPSS Statistics

The steps below show you how to analyse your data using linear regression in SPSS Statistics when none of the seven assumptions in the previous section, Assumptions , have been violated. At the end of these steps, we show you how to interpret the results from your linear regression. If you are looking for help to make sure your data meets assumptions #3, #4, #5, #6 and #7, which are required when using linear regression and can be tested using SPSS Statistics, you can learn more about our enhanced guides on our Features: Overview page.

Note: The procedure that follows is identical for SPSS Statistics versions 18 to 28 , as well as the subscription version of SPSS Statistics, with version 28 and the subscription version being the latest versions of SPSS Statistics. However, in version 27 and the subscription version , SPSS Statistics introduced a new look to their interface called " SPSS Light ", replacing the previous look for versions 26 and earlier versions , which was called " SPSS Standard ". Therefore, if you have SPSS Statistics versions 27 or 28 (or the subscription version of SPSS Statistics), the images that follow will be light grey rather than blue. However, the procedure is identical .

Menu for a linear regression in SPSS Statistics

Published with written permission from SPSS Statistics, IBM Corporation.

You will be presented with the Linear Regression dialogue box:

'Linear Regression' dialogue box in SPSS Statistics. Variables 'Income' & 'Price' on the left


Output of Linear Regression Analysis

SPSS Statistics will generate quite a few tables of output for a linear regression. In this section, we show you only the three main tables required to understand your results from the linear regression procedure, assuming that no assumptions have been violated. A complete explanation of the output you have to interpret when checking your data for the six assumptions required to carry out linear regression is provided in our enhanced guide. This includes relevant scatterplots, histogram (with superimposed normal curve), Normal P-P Plot, casewise diagnostics and the Durbin-Watson statistic. Below, we focus on the results for the linear regression analysis only.

The first table of interest is the Model Summary table, as shown below:

'Model Summary' table for a linear regression in SPSS Statistics. Shows 'R', 'R Square', 'Adjusted R Square' & 'Std. Error of the Estimate'

This table provides the R and R 2 values. The R value represents the simple correlation and is 0.873 (the " R " Column), which indicates a high degree of correlation. The R 2 value (the " R Square " column) indicates how much of the total variation in the dependent variable, Price , can be explained by the independent variable, Income . In this case, 76.2% can be explained, which is very large.

The next table is the ANOVA table, which reports how well the regression equation fits the data (i.e., predicts the dependent variable) and is shown below:

'ANOVA' table for a linear regression in SPSS Statistics. Shows 'Sum of Squares', 'df', 'Mean Square', 'F' & 'Sig.'

This table indicates that the regression model predicts the dependent variable significantly well. How do we know this? Look at the " Regression " row and go to the " Sig. " column. This indicates the statistical significance of the regression model that was run. Here, p < 0.0005, which is less than 0.05, and indicates that, overall, the regression model statistically significantly predicts the outcome variable (i.e., it is a good fit for the data).

The Coefficients table provides us with the necessary information to predict price from income, as well as determine whether income contributes statistically significantly to the model (by looking at the " Sig. " column). Furthermore, we can use the values in the " B " column under the " Unstandardized Coefficients " column, as shown below:

'Coefficients' table for a linear regression in SPSS Statistics. Shows 'Unstandardized Coefficients', 'Standardized Coefficients', 't' & 'Sig.'

to present the regression equation as:

Price = 8287 + 0.564(Income)
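As a quick illustration of plugging a value into this equation (the income figure below is hypothetical, not taken from the guide):

income = 50000                     # hypothetical income value
price = 8287 + 0.564 * income      # predicted price = 36487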

If you are unsure how to interpret regression equations or how to use them to make predictions, we discuss this in our enhanced linear regression guide. We also show you how to write up the results from your assumptions tests and linear regression output if you need to report this in a dissertation/thesis, assignment or research report. We do this using the Harvard and APA styles. You can learn more about our enhanced content on our Features: Overview page.

We also have a "quick start" guide on how to perform a linear regression analysis in Stata .

Linear Regression in Machine learning


Regression in Machine Learning

Machine Learning is a branch of Artificial Intelligence that focuses on the development of algorithms and statistical models that can learn from and make predictions on data. Linear regression is also a type of machine-learning algorithm, more specifically a supervised machine-learning algorithm, that learns from labelled datasets and maps the data points to the most optimized linear functions, which can then be used for prediction on new datasets.

First of all, we should know what a supervised machine learning algorithm is. It is a type of machine learning where the algorithm learns from labelled data. Labelled data means a dataset whose respective target values are already known. Supervised learning has two types:

  • Classification : It predicts the class of the dataset based on the independent input variables. Classes are categorical or discrete values, e.g., whether the image of an animal shows a cat or a dog.
  • Regression : It predicts the continuous output variable based on the independent input variables, e.g., the prediction of house prices based on different parameters such as house age, distance from the main road, location, area, etc.

Here, we will discuss one of the simplest types of regression i.e. Linear Regression.

Table of Content

  • What is Linear Regression?
  • Types of Linear Regression
  • What is the Best Fit Line?
  • Cost Function for Linear Regression
  • Assumptions of Simple Linear Regression
  • Assumptions of Multiple Linear Regression
  • Evaluation Metrics for Linear Regression
  • Python Implementation of Linear Regression
  • Regularization Techniques for Linear Models
  • Applications of Linear Regression
  • Advantages & Disadvantages of Linear Regression
  • Linear Regression – Frequently Asked Questions (FAQs)

Linear regression is a type of supervised machine learning algorithm that computes the linear relationship between the dependent variable and one or more independent features by fitting a linear equation to observed data.

When there is only one independent feature, it is known as Simple Linear Regression , and when there are more than one feature, it is known as Multiple Linear Regression .

Similarly, when there is only one dependent variable, it is considered Univariate Linear Regression , while when there is more than one dependent variable, it is known as Multivariate Regression .

Why Linear Regression is Important?

The interpretability of linear regression is a notable strength. The model’s equation provides clear coefficients that elucidate the impact of each independent variable on the dependent variable, facilitating a deeper understanding of the underlying dynamics. Its simplicity is a virtue, as linear regression is transparent, easy to implement, and serves as a foundational concept for more complex algorithms.

Linear regression is not merely a predictive tool; it forms the basis for various advanced models. Techniques like regularization and support vector machines draw inspiration from linear regression, expanding its utility. Additionally, linear regression is a cornerstone in assumption testing, enabling researchers to validate key assumptions about the data.

There are two main types of linear regression:

Simple Linear Regression

This is the simplest form of linear regression, and it involves only one independent variable and one dependent variable. The equation for simple linear regression is: [Tex]y=\beta_{0}+\beta_{1}X [/Tex] where:

  • Y is the dependent variable
  • X is the independent variable
  • β0 is the intercept
  • β1 is the slope

Multiple Linear Regression

This involves more than one independent variable and one dependent variable. The equation for multiple linear regression is: [Tex]y=\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\ldots+\beta_{n}X_{n} [/Tex] where:

  • Y is the dependent variable
  • X1, X2, …, Xn are the independent variables
  • β0 is the intercept
  • β1, β2, …, βn are the slopes

A minimal example of fitting both forms is sketched below.
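The sketch below uses scikit-learn (not used elsewhere in this article) to fit both a simple and a multiple linear regression on a small made-up dataset; all feature and target values are invented for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: two features (columns) and one target
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1])

# Simple linear regression: only the first feature is used
simple_model = LinearRegression().fit(X[:, [0]], y)
print(simple_model.intercept_, simple_model.coef_)      # beta_0 and beta_1

# Multiple linear regression: both features are used
multiple_model = LinearRegression().fit(X, y)
print(multiple_model.intercept_, multiple_model.coef_)  # beta_0 and (beta_1, beta_2)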

The goal of the algorithm is to find the best Fit Line equation that can predict the values based on the independent variables.

In regression, a set of records is present with X and Y values, and these values are used to learn a function that maps X to Y. This learned function can then be used to predict Y for an unseen X. Since the target in regression is continuous, we need a function that predicts a continuous Y given X as independent features.

Our primary objective while using linear regression is to locate the best-fit line, which implies that the error between the predicted and actual values should be kept to a minimum. There will be the least error in the best-fit line.

The best Fit Line equation provides a straight line that represents the relationship between the dependent and independent variables. The slope of the line indicates how much the dependent variable changes for a unit change in the independent variable(s).

Figure: Linear regression best-fit line through the training data (work experience on the x-axis, salary on the y-axis)

Here Y is called a dependent or target variable and X is called an independent variable also known as the predictor of Y. There are many types of functions or modules that can be used for regression. A linear function is the simplest type of function. Here, X may be a single feature or multiple features representing the problem.

Linear regression performs the task of predicting a dependent variable value (y) based on a given independent variable (x); hence the name linear regression. In the figure above, X (input) is the work experience and Y (output) is the salary of a person. The regression line is the best-fit line for our model.

We utilize the cost function to compute the best parameter values in order to get the best-fit line, since different values for the weights (the coefficients of the line) result in different regression lines.

Hypothesis function in Linear Regression

As we have assumed earlier that our independent feature is the experience i.e X and the respective salary Y is the dependent variable. Let’s assume there is a linear relationship between X and Y then the salary can be predicted using:

[Tex]\hat{Y} = \theta_1 + \theta_2X [/Tex]

[Tex]\hat{y}_i = \theta_1 + \theta_2x_i [/Tex]

  • [Tex]y_i \epsilon Y \;\; (i= 1,2, \cdots , n)      [/Tex]  are labels to data (Supervised learning)
  • [Tex]x_i \epsilon X \;\; (i= 1,2, \cdots , n)      [/Tex]  are the input independent training data (univariate – one input variable(parameter)) 
  • [Tex]\hat{y_i} \epsilon \hat{Y} \;\; (i= 1,2, \cdots , n)      [/Tex]  are the predicted values.

The model gets the best regression fit line by finding the best θ 1 and θ 2 values. 

  • θ 1 : intercept 
  • θ 2 : coefficient of x 

Once we find the best θ 1 and θ 2 values, we get the best-fit line. So when we are finally using our model for prediction, it will predict the value of y for the input value of x. 

How to update θ 1 and θ 2 values to get the best-fit line? 

To achieve the best-fit regression line, the model aims to predict the target value  [Tex]\hat{Y}      [/Tex]  such that the error difference between the predicted value  [Tex]\hat{Y}      [/Tex]  and the true value Y is minimum. So, it is very important to update the θ 1 and θ 2 values, to reach the best value that minimizes the error between the predicted y value (pred) and the true y value (y). 

[Tex]minimize\frac{1}{n}\sum_{i=1}^{n}(\hat{y_i}-y_i)^2 [/Tex]

The cost function or the loss function is nothing but the error or difference between the predicted value  [Tex]\hat{Y}      [/Tex]  and the true value Y.

In Linear Regression, the Mean Squared Error (MSE) cost function is employed, which calculates the average of the squared errors between the predicted values [Tex]\hat{y}_i [/Tex] and the actual values [Tex]{y}_i [/Tex] . The purpose is to determine the optimal values for the intercept [Tex]\theta_1 [/Tex] and the coefficient of the input feature [Tex]\theta_2 [/Tex] providing the best-fit line for the given data points. The linear equation expressing this relationship is [Tex]\hat{y}_i = \theta_1 + \theta_2x_i [/Tex] .

MSE function can be calculated as:

[Tex]\text{Cost function}(J) = \frac{1}{n}\sum_{i=1}^{n}(\hat{y_i}-y_i)^2 [/Tex]

Utilizing the MSE function, the iterative process of gradient descent is applied to update the values of [Tex]\theta_1 \& \theta_2 [/Tex]. This ensures that the MSE value converges to the global minimum, signifying the most accurate fit of the linear regression line to the dataset.

This process involves continuously adjusting the parameters [Tex]\theta_1 [/Tex] and [Tex]\theta_2 [/Tex] based on the gradients calculated from the MSE. The final result is a linear regression line that minimizes the overall squared differences between the predicted and actual values, providing an optimal representation of the underlying relationship in the data.

Gradient Descent for Linear Regression

A linear regression model can be trained using the optimization algorithm gradient descent, which iteratively modifies the model’s parameters to reduce the mean squared error (MSE) of the model on a training dataset. To update θ 1 and θ 2 in order to reduce the cost function (minimizing the MSE value) and achieve the best-fit line, the model uses gradient descent. The idea is to start with random θ 1 and θ 2 values and then iteratively update them, reaching the minimum cost.

A gradient is nothing but a derivative that defines the effects on outputs of the function with a little bit of variation in inputs.

Let’s differentiate the cost function(J) with respect to  [Tex]\theta_1      [/Tex]   

[Tex]\begin {aligned} {J}’_{\theta_1} &=\frac{\partial J(\theta_1,\theta_2)}{\partial \theta_1} \\ &= \frac{\partial}{\partial \theta_1} \left[\frac{1}{n} \left(\sum_{i=1}^{n}(\hat{y}_i-y_i)^2 \right )\right] \\ &= \frac{1}{n}\left[\sum_{i=1}^{n}2(\hat{y}_i-y_i) \left(\frac{\partial}{\partial \theta_1}(\hat{y}_i-y_i) \right ) \right] \\ &= \frac{1}{n}\left[\sum_{i=1}^{n}2(\hat{y}_i-y_i) \left(\frac{\partial}{\partial \theta_1}( \theta_1 + \theta_2x_i-y_i) \right ) \right] \\ &= \frac{1}{n}\left[\sum_{i=1}^{n}2(\hat{y}_i-y_i) \left(1+0-0 \right ) \right] \\ &= \frac{1}{n}\left[\sum_{i=1}^{n}(\hat{y}_i-y_i) \left(2 \right ) \right] \\ &= \frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i-y_i) \end {aligned} [/Tex]

Let’s differentiate the cost function(J) with respect to  [Tex]\theta_2 [/Tex]

[Tex]\begin {aligned} {J}’_{\theta_2} &=\frac{\partial J(\theta_1,\theta_2)}{\partial \theta_2} \\ &= \frac{\partial}{\partial \theta_2} \left[\frac{1}{n} \left(\sum_{i=1}^{n}(\hat{y}_i-y_i)^2 \right )\right] \\ &= \frac{1}{n}\left[\sum_{i=1}^{n}2(\hat{y}_i-y_i) \left(\frac{\partial}{\partial \theta_2}(\hat{y}_i-y_i) \right ) \right] \\ &= \frac{1}{n}\left[\sum_{i=1}^{n}2(\hat{y}_i-y_i) \left(\frac{\partial}{\partial \theta_2}( \theta_1 + \theta_2x_i-y_i) \right ) \right] \\ &= \frac{1}{n}\left[\sum_{i=1}^{n}2(\hat{y}_i-y_i) \left(0+x_i-0 \right ) \right] \\ &= \frac{1}{n}\left[\sum_{i=1}^{n}(\hat{y}_i-y_i) \left(2x_i \right ) \right] \\ &= \frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i-y_i)\cdot x_i \end {aligned} [/Tex]

Finding the coefficients of a linear equation that best fits the training data is the objective of linear regression. The coefficients can be updated by moving in the direction of the negative gradient of the Mean Squared Error with respect to the coefficients. If [Tex]\alpha [/Tex] is the learning rate, the intercept and the coefficient of X are updated as follows:

Figure: Gradient Descent

[Tex]\begin{aligned} \theta_1 &= \theta_1 – \alpha \left( {J}’_{\theta_1}\right) \\&=\theta_1 -\alpha \left( \frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i-y_i)\right) \end{aligned} \\ \begin{aligned} \theta_2 &= \theta_2 – \alpha \left({J}’_{\theta_2}\right) \\&=\theta_2 – \alpha \left(\frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i-y_i)\cdot x_i\right) \end{aligned} [/Tex]
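The update rules above can be implemented in a few lines of NumPy. The sketch below uses a tiny made-up dataset and an arbitrary learning rate; a fuller, animated implementation appears later in this article.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # made-up inputs
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])     # made-up targets

theta1, theta2 = 0.0, 0.0                   # intercept and slope, arbitrary starting values
alpha = 0.01                                # learning rate (assumed)
n = len(x)

for _ in range(1000):
    y_hat = theta1 + theta2 * x                    # current predictions
    grad1 = (2.0 / n) * np.sum(y_hat - y)          # dJ/d(theta1)
    grad2 = (2.0 / n) * np.sum((y_hat - y) * x)    # dJ/d(theta2)
    theta1 -= alpha * grad1                        # theta1 = theta1 - alpha * J'_theta1
    theta2 -= alpha * grad2                        # theta2 = theta2 - alpha * J'_theta2

print(theta1, theta2)   # approaches the least-squares intercept (~0.2) and slope (~1.9)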

Linear regression is a powerful tool for understanding and predicting the behavior of a variable, however, it needs to meet a few conditions in order to be accurate and dependable solutions. 

  • Linearity : The relationship between the dependent and independent variables is linear. This means that changes in the independent variable(s) lead to proportional, straight-line changes in the dependent variable.
  • Independence : The observations in the dataset are independent of each other. This means that the value of the dependent variable for one observation does not depend on the value of the dependent variable for another observation. If the observations are not independent, then linear regression will not be an accurate model.
  • Homoscedasticity : The variance of the residuals is constant across all levels of the independent variable(s). In other words, the spread of the errors does not systematically increase or decrease with the predicted values.
  • Normality : The residuals should be normally distributed. This means that the residuals should follow a bell-shaped curve. If the residuals are not normally distributed, then linear regression will not be an accurate model.

For Multiple Linear Regression, all four of the assumptions from Simple Linear Regression apply. In addition to this, below are few more:

  • No multicollinearity : There is no high correlation between the independent variables. This indicates that there is little or no correlation between the independent variables. Multicollinearity occurs when two or more independent variables are highly correlated with each other, which can make it difficult to determine the individual effect of each variable on the dependent variable. If there is multicollinearity, then multiple linear regression will not be an accurate model.
  • Additivity: The model assumes that the effect of changes in a predictor variable on the response variable is consistent regardless of the values of the other variables. This assumption implies that there is no interaction between variables in their effects on the dependent variable.
  • Feature Selection: In multiple linear regression, it is essential to carefully select the independent variables that will be included in the model. Including irrelevant or redundant variables may lead to overfitting and complicate the interpretation of the model.
  • Overfitting: Overfitting occurs when the model fits the training data too closely, capturing noise or random fluctuations that do not represent the true underlying relationship between variables. This can lead to poor generalization performance on new, unseen data.

Multicollinearity

Multicollinearity is a statistical phenomenon that occurs when two or more independent variables in a multiple regression model are highly correlated, making it difficult to assess the individual effects of each variable on the dependent variable.

Detecting Multicollinearity includes two techniques:

  • Correlation Matrix: Examining the correlation matrix among the independent variables is a common way to detect multicollinearity. High correlations (close to 1 or -1) indicate potential multicollinearity.
  • VIF (Variance Inflation Factor): VIF is a measure that quantifies how much the variance of an estimated regression coefficient increases if your predictors are correlated. A high VIF (typically above 10) suggests multicollinearity (see the sketch below).
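As a small illustration of the VIF approach, the sketch below uses statsmodels’ variance_inflation_factor on a made-up feature matrix; the data and column names are invented for illustration.

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Made-up predictors; x2 is deliberately close to 2 * x1 to create multicollinearity
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2 * x1 + rng.normal(scale=0.1, size=100)
x3 = rng.normal(size=100)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIF is computed column by column on the design matrix (including a constant)
X_const = add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
print(vif)   # x1 and x2 should show very high VIF values; x3 should be close to 1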

A variety of evaluation measures can be used to determine the strength of any linear regression model. These assessment metrics often give an indication of how well the model is producing the observed outputs.

The most common measurements are:

Mean Square Error (MSE)

Mean Squared Error (MSE) is an evaluation metric that calculates the average of the squared differences between the actual and predicted values for all the data points. The difference is squared to ensure that negative and positive differences don’t cancel each other out.

[Tex]MSE = \frac{1}{n}\sum_{i=1}^{n}\left ( y_i – \widehat{y_{i}} \right )^2 [/Tex]

  • n is the number of data points.
  • y i is the actual or observed value for the i th data point.
  • [Tex]\widehat{y_{i}} [/Tex] is the predicted value for the i th data point.

MSE is a way to quantify the accuracy of a model’s predictions. MSE is sensitive to outliers as large errors contribute significantly to the overall score.

Mean Absolute Error (MAE)

Mean Absolute Error is an evaluation metric used to calculate the accuracy of a regression model. MAE measures the average absolute difference between the predicted values and actual values.

Mathematically, MAE is expressed as:

[Tex]MAE =\frac{1}{n} \sum_{i=1}^{n}|Y_i – \widehat{Y_i}| [/Tex]

  • n is the number of observations
  • Y i represents the actual values.
  • [Tex]\widehat{Y_i} [/Tex] represents the predicted values

Lower MAE value indicates better model performance. It is not sensitive to the outliers as we consider absolute differences.

Root Mean Squared Error (RMSE)

The square root of the residuals’ variance is the Root Mean Squared Error . It describes how well the observed data points match the expected values, or the model’s absolute fit to the data.

In mathematical notation, it can be expressed as: [Tex]RMSE=\sqrt{\frac{RSS}{n}}=\sqrt{\frac{\sum_{i=1}^{n}(y^{actual}_{i}- y_{i}^{predicted})^2}{n}} [/Tex]

To obtain an unbiased estimate, the sum of squared residuals is divided by the residual degrees of freedom rather than by the total number of data points. The resulting figure is referred to as the Residual Standard Error (RSE), and in mathematical notation it can be expressed as: [Tex]RSE=\sqrt{\frac{RSS}{n-2}}=\sqrt{\frac{\sum_{i=1}^{n}(y^{actual}_{i}- y_{i}^{predicted})^2}{(n-2)}} [/Tex]

RMSE is not as good a metric as R-squared. Root Mean Squared Error can fluctuate when the units of the variables vary, since its value is dependent on the variables’ units (it is not a normalized measure).

Coefficient of Determination (R-squared)

R-Squared is a statistic that indicates how much variation the developed model can explain or capture. It is always in the range of 0 to 1. In general, the better the model matches the data, the greater the R-squared number. In mathematical notation, it can be expressed as: [Tex]R^{2}=1-\frac{RSS}{TSS} [/Tex]

  • Residual sum of Squares (RSS): The sum of squares of the residuals for each data point in the plot or data is known as the residual sum of squares, or RSS. It is a measurement of the difference between the output that was observed and what was predicted. [Tex]RSS=\sum_{i=1}^{n}(y_{i}-b_{0}-b_{1}x_{i})^{2} [/Tex]
  • Total Sum of Squares (TSS): The sum of the squared differences of the data points from the response variable’s mean is known as the total sum of squares, or TSS. [Tex]TSS= \sum_{i=1}^{n}(y_{i}-\overline{y})^2 [/Tex]

The R-squared metric is a measure of the proportion of variance in the dependent variable that is explained by the independent variables in the model.

Adjusted R-Squared Error

Adjusted R 2 measures the proportion of variance in the dependent variable that is explained by the independent variables in a regression model. Adjusted R-square accounts for the number of predictors in the model and penalizes the model for including irrelevant predictors that don’t contribute significantly to explaining the variance in the dependent variable.

Mathematically, adjusted R 2 is expressed as:

[Tex]Adjusted \, R^2 = 1 - \frac{(1-R^2)(n-1)}{n-k-1} [/Tex]

  • n is the number of observations
  • k is the number of predictors in the model
  • R 2 is the coefficient of determination

Adjusted R-square helps to prevent overfitting. It penalizes the model with additional predictors that do not contribute significantly to explain the variance in the dependent variable.
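The metrics described above can all be computed in a few lines. Here is a minimal sketch using NumPy and scikit-learn on made-up actual and predicted values (k, the number of predictors, is an assumption of the example):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])    # made-up actual values
y_pred = np.array([2.8, 5.4, 7.1, 9.3, 10.6])    # made-up predicted values
n, k = len(y_true), 2                            # sample size and assumed number of predictors

mse = mean_squared_error(y_true, y_pred)         # Mean Squared Error
mae = mean_absolute_error(y_true, y_pred)        # Mean Absolute Error
rmse = np.sqrt(mse)                              # Root Mean Squared Error
r2 = r2_score(y_true, y_pred)                    # Coefficient of determination
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # Adjusted R-squared

print(mse, mae, rmse, r2, adj_r2)

Note that the scikit-learn helpers above are only used for this illustration; the from-scratch implementation that follows relies on pandas, NumPy, and Matplotlib.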

Import the necessary libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.axes as ax
from matplotlib.animation import FuncAnimation

Load the dataset and separate input and Target variables

Here is the link for dataset: Dataset Link

url = 'https://media.geeksforgeeks.org/wp-content/uploads/20240320114716/data_for_lr.csv'
data = pd.read_csv(url)
data

# Drop the missing values
data = data.dropna()

# training dataset and labels
train_input = np.array(data.x[0:500]).reshape(500, 1)
train_output = np.array(data.y[0:500]).reshape(500, 1)

# valid dataset and labels
test_input = np.array(data.x[500:700]).reshape(199, 1)
test_output = np.array(data.y[500:700]).reshape(199, 1)

Build the Linear Regression Model and Plot the regression line

  • In forward propagation, the linear regression function Y = mx + c is applied by initially assigning random values to the parameters (m and c).
  • Then we write the function for finding the cost function, i.e. the mean squared error.

class LinearRegression:
    def __init__(self):
        self.parameters = {}

    def forward_propagation(self, train_input):
        m = self.parameters['m']
        c = self.parameters['c']
        predictions = np.multiply(m, train_input) + c
        return predictions

    def cost_function(self, predictions, train_output):
        cost = np.mean((train_output - predictions) ** 2)
        return cost

    def backward_propagation(self, train_input, train_output, predictions):
        derivatives = {}
        df = (predictions - train_output)
        # dm = 2 * mean of (predictions - actual) * input
        dm = 2 * np.mean(np.multiply(train_input, df))
        # dc = 2 * mean of (predictions - actual)
        dc = 2 * np.mean(df)
        derivatives['dm'] = dm
        derivatives['dc'] = dc
        return derivatives

    def update_parameters(self, derivatives, learning_rate):
        self.parameters['m'] = self.parameters['m'] - learning_rate * derivatives['dm']
        self.parameters['c'] = self.parameters['c'] - learning_rate * derivatives['dc']

    def train(self, train_input, train_output, learning_rate, iters):
        # Initialize random parameters
        self.parameters['m'] = np.random.uniform(0, 1) * -1
        self.parameters['c'] = np.random.uniform(0, 1) * -1

        # Initialize loss
        self.loss = []

        # Initialize figure and axis for animation
        fig, ax = plt.subplots()
        x_vals = np.linspace(min(train_input), max(train_input), 100)
        line, = ax.plot(x_vals, self.parameters['m'] * x_vals + self.parameters['c'],
                        color='red', label='Regression Line')
        ax.scatter(train_input, train_output, marker='o',
                   color='green', label='Training Data')

        # Set y-axis limits to exclude negative values
        ax.set_ylim(0, max(train_output) + 1)

        def update(frame):
            # Forward propagation
            predictions = self.forward_propagation(train_input)

            # Cost function
            cost = self.cost_function(predictions, train_output)

            # Back propagation
            derivatives = self.backward_propagation(train_input, train_output, predictions)

            # Update parameters
            self.update_parameters(derivatives, learning_rate)

            # Update the regression line
            line.set_ydata(self.parameters['m'] * x_vals + self.parameters['c'])

            # Append loss and print
            self.loss.append(cost)
            print("Iteration = {}, Loss = {}".format(frame + 1, cost))
            return line,

        # Create animation
        ani = FuncAnimation(fig, update, frames=iters, interval=200, blit=True)

        # Save the animation as a GIF file
        ani.save('linear_regression_A.gif', writer='ffmpeg')

        plt.xlabel('Input')
        plt.ylabel('Output')
        plt.title('Linear Regression')
        plt.legend()
        plt.show()

        return self.parameters, self.loss

Train the model and make the final prediction

# Example usage
linear_reg = LinearRegression()
parameters, loss = linear_reg.train(train_input, train_output, learning_rate=0.0001, iters=20)

Iteration = 1, Loss = 9130.407560462196
Iteration = 1, Loss = 1107.1996742908998
Iteration = 1, Loss = 140.31580932842422
Iteration = 1, Loss = 23.795780526084116
Iteration = 2, Loss = 9.753848205147605
Iteration = 3, Loss = 8.061641745006835
Iteration = 4, Loss = 7.8577116490914864
Iteration = 5, Loss = 7.8331350515579015
Iteration = 6, Loss = 7.830172502503967
Iteration = 7, Loss = 7.829814681591015
Iteration = 8, Loss = 7.829770758846183
Iteration = 9, Loss = 7.829764664327399
Iteration = 10, Loss = 7.829763128602258
Iteration = 11, Loss = 7.829762142342088
Iteration = 12, Loss = 7.829761222379141
Iteration = 13, Loss = 7.829760310486438
Iteration = 14, Loss = 7.829759399646989
Iteration = 15, Loss = 7.829758489015161
Iteration = 16, Loss = 7.829757578489033
Iteration = 17, Loss = 7.829756668056319
Iteration = 18, Loss = 7.829755757715535
Iteration = 19, Loss = 7.829754847466484
Iteration = 20, Loss = 7.829753937309139
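The repeated "Iteration = 1" entries at the start of the log appear to come from the animation machinery evaluating the first frame more than once before the saved frames begin. Once training finishes, the learned slope and intercept can be applied to the held-out data. The sketch below reuses linear_reg, parameters, test_input and test_output from the code above; test_mse is just an illustrative variable name.

# Predict on the held-out data using the trained parameters
test_predictions = linear_reg.forward_propagation(test_input)

# Mean squared error on the test set
test_mse = np.mean((test_output - test_predictions) ** 2)
print("Test MSE =", test_mse)

# Fitted line: y = m*x + c
print("Slope (m) =", parameters['m'])
print("Intercept (c) =", parameters['c'])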

Linear Regression Line

The linear regression line provides valuable insights into the relationship between the two variables. It represents the best-fitting line that captures the overall trend of how a dependent variable (Y) changes in response to variations in an independent variable (X).

  • Positive Linear Regression Line : A positive linear regression line indicates a direct relationship between the independent variable (X) and the dependent variable (Y). This means that as the value of X increases, the value of Y also increases. The slope of a positive linear regression line is positive, meaning that the line slants upward from left to right.
  • Negative Linear Regression Line : A negative linear regression line indicates an inverse relationship between the independent variable (X) and the dependent variable (Y). This means that as the value of X increases, the value of Y decreases. The slope of a negative linear regression line is negative, meaning that the line slants downward from left to right.
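For the model trained above, a quick check of the sign of the fitted slope tells us which of these two cases applies. This reuses the parameters dictionary returned by train and is purely illustrative.

slope = parameters['m']
if slope > 0:
    print("Positive regression line: y increases as x increases (slope = {:.4f})".format(slope))
else:
    print("Negative regression line: y decreases as x increases (slope = {:.4f})".format(slope))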

Lasso Regression (L1 Regularization)

Lasso regression is a technique for regularizing a linear regression model: it adds a penalty term to the linear regression objective function to prevent overfitting.

The objective function after applying lasso regression is:

[Tex]J(\theta) = \frac{1}{2m} \sum_{i=1}^{m}(\widehat{y_i} - y_i)^2 + \lambda \sum_{j=1}^{n}|\theta_j| [/Tex]

  • the first term is the least squares loss, representing the squared difference between predicted and actual values.
  • the second term is the L1 regularization term; it penalizes the sum of the absolute values of the regression coefficients θj (a brief scikit-learn sketch follows below).
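This penalty is what scikit-learn's Lasso estimator implements, with its alpha argument playing the role of λ (up to the library's own scaling of the loss term). A minimal sketch, assuming the train_input/train_output and test_input/test_output arrays prepared earlier; alpha=0.1 is an arbitrary illustrative value.

from sklearn.linear_model import Lasso

# alpha plays the role of the regularization strength λ
lasso = Lasso(alpha=0.1)
lasso.fit(train_input, train_output.ravel())

print("Lasso coefficient:", lasso.coef_)
print("Lasso intercept:", lasso.intercept_)
print("Test R^2:", lasso.score(test_input, test_output.ravel()))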

Ridge Regression (L2 Regularization)

Ridge regression is a linear regression technique that adds a regularization term to the standard linear objective. Again, the goal is to prevent overfitting by penalizing large coefficients in the linear regression equation. It is useful when the dataset exhibits multicollinearity, i.e. when predictor variables are highly correlated.

The objective function after applying ridge regression is:

[Tex]J(\theta) = \frac{1}{2m} \sum_{i=1}^{m}(\widehat{y_i} - y_i)^2 + \lambda \sum_{j=1}^{n}\theta_{j}^{2} [/Tex]

  • the first term is the least squares loss, as in lasso regression.
  • the second term is the L2 regularization term; it penalizes the sum of the squares of the regression coefficients θj (a brief scikit-learn sketch follows below).
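scikit-learn's Ridge estimator exposes this L2 penalty strength through its alpha argument (again corresponding to λ up to the library's scaling). As before, this is a sketch rather than a tuned model, assuming the arrays prepared earlier and an arbitrary alpha=1.0.

from sklearn.linear_model import Ridge

# alpha is the L2 penalty strength λ
ridge = Ridge(alpha=1.0)
ridge.fit(train_input, train_output)

print("Ridge coefficient:", ridge.coef_)
print("Ridge intercept:", ridge.intercept_)
print("Test R^2:", ridge.score(test_input, test_output))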

Elastic Net Regression

Elastic Net regression is a hybrid regularization technique that combines both the L1 and L2 penalties in the linear regression objective.

[Tex]J(\theta) = \frac{1}{2m} \sum_{i=1}^{m}(\widehat{y_i} - y_i)^2 + \alpha \lambda \sum_{j=1}^{n}|\theta_j| + \frac{1}{2}(1-\alpha) \lambda \sum_{j=1}^{n}\theta_{j}^{2} [/Tex]

  • the first term is the least squares loss.
  • the second term is the L1 (lasso) penalty and the third is the L2 (ridge) penalty.
  • λ is the overall regularization strength.
  • α controls the mix between L1 and L2 regularization (a brief scikit-learn sketch follows below).
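scikit-learn's ElasticNet estimator combines both penalties: its alpha argument roughly corresponds to the overall strength λ and l1_ratio to the mixing parameter α (the library's exact scaling of the objective differs slightly from the formula above). A minimal sketch, assuming the arrays prepared earlier and arbitrary illustrative hyperparameters.

from sklearn.linear_model import ElasticNet

# alpha ~ overall strength λ, l1_ratio ~ the L1/L2 mix α
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(train_input, train_output.ravel())

print("Elastic Net coefficient:", enet.coef_)
print("Elastic Net intercept:", enet.intercept_)
print("Test R^2:", enet.score(test_input, test_output.ravel()))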

Linear regression is used in many different fields, including finance, economics, and psychology, to understand and predict the behavior of a particular variable. For example, in finance, linear regression might be used to understand the relationship between a company’s stock price and its earnings or to predict the future value of a currency based on its past performance.

Advantages of Linear Regression

  • Linear regression is a relatively simple algorithm, making it easy to understand and implement. The coefficients of the linear regression model can be interpreted as the change in the dependent variable for a one-unit change in the independent variable, providing insights into the relationships between variables.
  • Linear regression is computationally efficient and can handle large datasets effectively. It can be trained quickly on large datasets, making it suitable for real-time applications.
  • With appropriate preprocessing (e.g. outlier handling) or regularized variants, linear regression copes reasonably well with noisy data; note, however, that ordinary least squares itself can be strongly influenced by extreme outliers.
  • Linear regression often serves as a good baseline model for comparison with more complex machine learning algorithms.
  • Linear regression is a well-established algorithm with a rich history and is widely available in various machine learning libraries and software packages.

Disadvantages of Linear Regression

  • Linear regression assumes a linear relationship between the dependent and independent variables. If the relationship is not linear, the model may not perform well.
  • Linear regression is sensitive to multicollinearity, which occurs when there is a high correlation between independent variables. Multicollinearity can inflate the variance of the coefficients and lead to unstable model predictions (see the VIF sketch after this list).
  • Linear regression assumes that the features are already in a suitable form for the model. Feature engineering may be required to transform features into a format that can be effectively used by the model.
  • Linear regression is susceptible to both overfitting and underfitting. Overfitting occurs when the model learns the training data too well and fails to generalize to unseen data. Underfitting occurs when the model is too simple to capture the underlying relationships in the data.
  • Linear regression provides limited explanatory power for complex relationships between variables. More advanced machine learning techniques may be necessary for deeper insights.
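One way to check for the multicollinearity issue flagged in the list above is the variance inflation factor (VIF) from statsmodels. The single-feature dataset used earlier cannot be multicollinear, so the sketch below builds a small hypothetical two-feature matrix purely for illustration; a VIF well above 5 or 10 is commonly read as a warning sign.

import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical design matrix with two strongly correlated predictors
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)   # nearly a copy of x1
X = add_constant(np.column_stack([x1, x2]))  # prepend an intercept column

# VIF for each predictor column (index 0 is the constant, so skip it)
for i in range(1, X.shape[1]):
    print("VIF for x{}:".format(i), variance_inflation_factor(X, i))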

Linear regression is a fundamental machine learning algorithm that has been widely used for many years due to its simplicity, interpretability, and efficiency. It is a valuable tool for understanding relationships between variables and making predictions in a variety of applications.

However, it is important to be aware of its limitations, such as its assumption of linearity and sensitivity to multicollinearity. When these limitations are carefully considered, linear regression can be a powerful tool for data analysis and prediction.

What does linear regression mean in simple terms?

Linear regression is a supervised machine learning algorithm that predicts a continuous target variable based on one or more independent variables. It assumes a linear relationship between the dependent and independent variables and uses a linear equation to model this relationship.

Why do we use linear regression?

Linear regression is commonly used for:

  • Predicting numerical values based on input features
  • Forecasting future trends based on historical data
  • Identifying correlations between variables
  • Understanding the impact of different factors on a particular outcome

How to use linear regression?

Use linear regression by fitting a line to the data, interpreting the estimated coefficients, and then making predictions for new input values to support informed decision-making.
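In practice these steps are usually a few library calls. A minimal sketch with scikit-learn's LinearRegression, assuming the train_input/train_output arrays prepared earlier in this article; the query value x = 50 is arbitrary.

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(train_input, train_output)        # fit a line to the training data

print("Coefficient:", model.coef_)          # estimated slope
print("Intercept:", model.intercept_)       # estimated intercept
print("Prediction for x = 50:", model.predict([[50]]))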

Why is it called linear regression?

Linear regression is named for its use of a linear equation to model the relationship between variables, representing a straight line fit to the data points.

What are some examples of linear regression?

Predicting house prices based on square footage, estimating exam scores from study hours, and forecasting sales using advertising spending are examples of linear regression applications.
