
Regression Analysis

Regression analysis is a quantitative research method used when a study involves modelling and analysing the relationship between a dependent variable and one or more independent variables. In simple terms, it tests the nature and strength of the relationships between those variables.

The basic form of regression models includes unknown parameters (β), independent variables (X), and the dependent variable (Y).

A regression model specifies the dependent variable (Y) as a function of the independent variables (X) and the unknown parameters (β):

Y ≈ f(X, β)

A regression equation can be used to predict the value of y for a given value of x, where y and x are two sets of measures from a sample of size n. For simple linear regression, the fitted equation is ŷ = a + bx, with the slope and intercept estimated as

b = (nΣxy − (Σx)(Σy)) / (nΣx² − (Σx)²)

a = (Σy − bΣx) / n

Do not be intimidated by the visual complexity of the correlation and regression formulae above. You do not have to apply the formulae manually; correlation and regression analyses can be run with popular analytical software such as Microsoft Excel, SPSS and others.

Linear regression analysis is based on the following set of assumptions:

1. Assumption of linearity. There is a linear relationship between the dependent and independent variables.

2. Assumption of homoscedasticity. The variance of the residuals is constant across all values of the independent variables.

3. Assumption of absence of collinearity or multicollinearity. No two or more independent variables are highly correlated with each other.

4. Assumption of normal distribution. The residuals are normally distributed.


What is Regression Analysis?

Regression analysis is a set of statistical methods used for the estimation of relationships between a dependent variable and one or more independent variables . It can be utilized to assess the strength of the relationship between variables and for modeling the future relationship between them.

Regression Analysis – Types of Regression Analysis

Regression analysis includes several variations, such as linear, multiple linear, and nonlinear. The most common models are simple linear and multiple linear. Nonlinear regression analysis is commonly used for more complicated data sets in which the dependent and independent variables show a nonlinear relationship.

Regression analysis offers numerous applications in various disciplines, including finance .

Regression Analysis – Linear Model Assumptions

Linear regression analysis is based on six fundamental assumptions:

  • The dependent and independent variables show a linear relationship.
  • The independent variable is not random.
  • The expected value of the residual (error) is zero.
  • The variance of the residual (error) is constant across all observations.
  • The value of the residual (error) is not correlated across observations.
  • The residual (error) values follow the normal distribution.

Regression Analysis – Simple Linear Regression

Simple linear regression is a model that assesses the relationship between a dependent variable and an independent variable. The simple linear model is expressed using the following equation:

Y = a + bX + ϵ

  • Y – Dependent variable
  • X – Independent (explanatory) variable
  • a – Intercept
  • b – Slope
  • ϵ – Residual (error)
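For a concrete picture of these terms, here is a minimal sketch in R (the data are simulated purely for illustration):

```r
# Simulate data with a known intercept (a = 2) and slope (b = 0.5),
# then recover them with lm()
set.seed(42)
X <- runif(100, 0, 10)            # independent variable
Y <- 2 + 0.5 * X + rnorm(100)     # dependent variable with random error
model <- lm(Y ~ X)                # fit Y = a + bX + e by least squares
coef(model)                       # estimated intercept and slope
```

The estimates will not exactly equal 2 and 0.5 because of the random error term, which is precisely what the residual ϵ captures.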


Regression Analysis – Multiple Linear Regression

Multiple linear regression analysis is essentially similar to the simple linear model, with the exception that multiple independent variables are used in the model. The mathematical representation of multiple linear regression is:

Y = a + bX₁ + cX₂ + dX₃ + ϵ

  • X₁, X₂, X₃ – Independent (explanatory) variables
  • b, c, d – Slopes

Multiple linear regression follows the same conditions as the simple linear model. However, since there are several independent variables in multiple linear analysis, there is another mandatory condition for the model:

  • Non-collinearity: Independent variables should show a minimum correlation with each other. If the independent variables are highly correlated with each other, it will be difficult to assess the true relationships between the dependent and independent variables.
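As a quick screen for this problem, one common approach is to inspect the pairwise correlations among candidate predictors before fitting the model. A minimal sketch in R, with simulated predictors:

```r
# X2 is constructed to be nearly collinear with X1
set.seed(1)
X1 <- rnorm(100)
X2 <- X1 + rnorm(100, sd = 0.1)   # almost a copy of X1
X3 <- rnorm(100)                  # unrelated predictor
cor(cbind(X1, X2, X3))            # entries near +/-1 flag collinearity
```

Correlations close to +1 or −1 (here, between X1 and X2) suggest dropping or combining one of the offending variables.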

Regression Analysis in Finance

Regression analysis comes with several applications in finance. For example, the statistical method is fundamental to the Capital Asset Pricing Model (CAPM). Essentially, the CAPM equation is a model that determines the relationship between the expected return of an asset and the market risk premium.

The analysis is also used to forecast the returns of securities, based on different factors, or to forecast the performance of a business.

1. Beta and CAPM

In finance, regression analysis is used to calculate the Beta (volatility of returns relative to the overall market) for a stock. It can be done in Excel using the SLOPE function.
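A sketch of the same calculation in R, using hypothetical monthly returns (the numbers are made up for illustration):

```r
stock  <- c(0.021, -0.013, 0.034, 0.008, -0.022, 0.017)   # stock returns
market <- c(0.015, -0.010, 0.025, 0.005, -0.018, 0.012)   # market returns

# Beta is the slope of stock returns regressed on market returns,
# which equals cov(stock, market) / var(market)
cov(stock, market) / var(market)
coef(lm(stock ~ market))["market"]   # same estimate via regression
```

This mirrors what Excel's SLOPE function computes from the two return series.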


2. Forecasting Revenues and Expenses

When forecasting financial statements for a company, it may be useful to do a multiple regression analysis to determine how changes in certain assumptions or drivers of the business will impact revenue or expenses in the future. For example, there may be a very high correlation between the number of salespeople employed by a company, the number of stores they operate, and the revenue the business generates.

[Figure: Excel screenshot of forecasting a company's revenue from the number of ads it runs, using the FORECAST function.]

The example shows how to use the Forecast function in Excel to calculate a company's revenue, based on the number of ads it runs.


Regression Tools

Excel remains a popular tool for basic regression analysis in finance; however, there are many more advanced statistical tools that can be used.

Python and R are both powerful coding languages that have become popular for all types of financial modeling, including regression. These techniques form a core part of data science and machine learning, where models are trained to detect these relationships in data.



Regression Analysis – Methods, Types and Examples

Regression analysis is a set of statistical processes for estimating the relationships among variables . It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables (or ‘predictors’).

Regression Analysis Methodology

Here is a general methodology for performing regression analysis:

  • Define the research question: Clearly state the research question or hypothesis you want to investigate. Identify the dependent variable (also called the response variable or outcome variable) and the independent variables (also called predictor variables or explanatory variables) that you believe are related to the dependent variable.
  • Collect data: Gather the data for the dependent variable and independent variables. Ensure that the data is relevant, accurate, and representative of the population or phenomenon you are studying.
  • Explore the data: Perform exploratory data analysis to understand the characteristics of the data, identify any missing values or outliers, and assess the relationships between variables through scatter plots, histograms, or summary statistics.
  • Choose the regression model: Select an appropriate regression model based on the nature of the variables and the research question. Common regression models include linear regression, multiple regression, logistic regression, polynomial regression, and time series regression, among others.
  • Assess assumptions: Check the assumptions of the regression model. Some common assumptions include linearity (the relationship between variables is linear), independence of errors, homoscedasticity (constant variance of errors), and normality of errors. Violation of these assumptions may require additional steps or alternative models.
  • Estimate the model: Use a suitable method to estimate the parameters of the regression model. The most common method is ordinary least squares (OLS), which minimizes the sum of squared differences between the observed and predicted values of the dependent variable.
  • Interpret the results: Analyze the estimated coefficients, p-values, confidence intervals, and goodness-of-fit measures (e.g., R-squared) to interpret the results. Determine the significance and direction of the relationships between the independent variables and the dependent variable.
  • Evaluate model performance: Assess the overall performance of the regression model using appropriate measures, such as R-squared, adjusted R-squared, and root mean squared error (RMSE). These measures indicate how well the model fits the data and how much of the variation in the dependent variable is explained by the independent variables.
  • Test assumptions and diagnose problems: Check the residuals (the differences between observed and predicted values) for any patterns or deviations from assumptions. Conduct diagnostic tests, such as examining residual plots, testing for multicollinearity among independent variables, and assessing heteroscedasticity or autocorrelation, if applicable.
  • Make predictions and draw conclusions: Once you have a satisfactory model, use it to make predictions on new or unseen data. Draw conclusions based on the results of the analysis, considering the limitations and potential implications of the findings.
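A minimal end-to-end sketch of the estimation, evaluation, and prediction steps, using R's built-in mtcars data (the choice of variables is just for illustration):

```r
# Model fuel economy (mpg) from weight (wt) and horsepower (hp)
model <- lm(mpg ~ wt + hp, data = mtcars)       # estimate via OLS

summary(model)                                  # coefficients, p-values, R-squared
plot(fitted(model), resid(model))               # residual plot for diagnostics
sqrt(mean(resid(model)^2))                      # root mean squared error (RMSE)

# Predict mpg for a hypothetical car weighing 3,000 lbs with 150 hp
predict(model, newdata = data.frame(wt = 3, hp = 150))
```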

Types of Regression Analysis

Types of Regression Analysis are as follows:

Linear Regression

Linear regression is the most basic and widely used form of regression analysis. It models the linear relationship between a dependent variable and one or more independent variables. The goal is to find the best-fitting line that minimizes the sum of squared differences between observed and predicted values.

Multiple Regression

Multiple regression extends linear regression by incorporating two or more independent variables to predict the dependent variable. It allows for examining the simultaneous effects of multiple predictors on the outcome variable.

Polynomial Regression

Polynomial regression models non-linear relationships between variables by adding polynomial terms (e.g., squared or cubic terms) to the regression equation. It can capture curved or nonlinear patterns in the data.

Logistic Regression

Logistic regression is used when the dependent variable is binary or categorical. It models the probability of the occurrence of a certain event or outcome based on the independent variables. Logistic regression estimates the coefficients using the logistic function, which transforms the linear combination of predictors into a probability.
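A minimal sketch in R, again using the built-in mtcars data (transmission type am is coded 0/1, so it serves as a binary outcome for illustration):

```r
# Model the probability of a manual transmission (am = 1) from car weight
model <- glm(am ~ wt, data = mtcars, family = binomial)

summary(model)    # coefficients are on the log-odds scale
# Predicted probability of a manual transmission for a 2,500 lb car
predict(model, newdata = data.frame(wt = 2.5), type = "response")
```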

Ridge Regression and Lasso Regression

Ridge regression and Lasso regression are techniques used for addressing multicollinearity (high correlation between independent variables) and variable selection. Both methods introduce a penalty term to the regression equation to shrink or eliminate less important variables. Ridge regression uses L2 regularization, while Lasso regression uses L1 regularization.
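A sketch of both penalties in R, assuming the widely used glmnet package is installed (glmnet expects a predictor matrix rather than a formula):

```r
library(glmnet)

x <- as.matrix(mtcars[, c("wt", "hp", "disp")])  # correlated predictors
y <- mtcars$mpg

ridge <- glmnet(x, y, alpha = 0)   # alpha = 0 -> L2 penalty (ridge)
lasso <- glmnet(x, y, alpha = 1)   # alpha = 1 -> L1 penalty (lasso)

# At a given penalty strength s, lasso can shrink coefficients exactly to zero
coef(lasso, s = 0.5)
```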

Time Series Regression

Time series regression analyzes the relationship between a dependent variable and independent variables when the data is collected over time. It accounts for autocorrelation and trends in the data and is used in forecasting and studying temporal relationships.

Nonlinear Regression

Nonlinear regression models are used when the relationship between the dependent variable and independent variables is not linear. These models can take various functional forms and require estimation techniques different from those used in linear regression.

Poisson Regression

Poisson regression is employed when the dependent variable represents count data. It models the relationship between the independent variables and the expected count, assuming a Poisson distribution for the dependent variable.
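A minimal sketch in R with simulated count data (the true coefficients are chosen arbitrarily for illustration):

```r
# Counts generated from a log-linear model with intercept 0.3 and slope 0.8
set.seed(7)
exposure <- runif(200, 0, 2)
counts   <- rpois(200, lambda = exp(0.3 + 0.8 * exposure))

model <- glm(counts ~ exposure, family = poisson)
coef(model)   # estimates should land near 0.3 and 0.8
```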

Generalized Linear Models (GLM)

GLMs are a flexible class of regression models that extend the linear regression framework to handle different types of dependent variables, including binary, count, and continuous variables. GLMs incorporate various probability distributions and link functions.

Regression Analysis Formulas

Regression analysis involves estimating the parameters of a regression model to describe the relationship between the dependent variable (Y) and one or more independent variables (X). Here are the basic formulas for linear regression, multiple regression, and logistic regression:

Linear Regression:

Simple Linear Regression Model: Y = β0 + β1X + ε

Multiple Linear Regression Model: Y = β0 + β1X1 + β2X2 + … + βnXn + ε

In both formulas:

  • Y represents the dependent variable (response variable).
  • X represents the independent variable(s) (predictor variable(s)).
  • β0, β1, β2, …, βn are the regression coefficients or parameters that need to be estimated.
  • ε represents the error term or residual (the difference between the observed and predicted values).

Multiple Regression:

Multiple regression extends the concept of simple linear regression by including multiple independent variables.

Multiple Regression Model: Y = β0 + β1X1 + β2X2 + … + βnXn + ε

The formulas are similar to those in linear regression, with the addition of more independent variables.

Logistic Regression:

Logistic regression is used when the dependent variable is binary or categorical. The logistic regression model applies a logistic or sigmoid function to the linear combination of the independent variables.

Logistic Regression Model: p = 1 / (1 + e^-(β0 + β1X1 + β2X2 + … + βnXn))

In the formula:

  • p represents the probability of the event occurring (e.g., the probability of success or belonging to a certain category).
  • X1, X2, …, Xn represent the independent variables.
  • e is the base of the natural logarithm.

The logistic function ensures that the predicted probabilities lie between 0 and 1, allowing for binary classification.
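A quick numeric check of that bound, using hypothetical coefficients in R:

```r
# Hypothetical logistic model: b0 = -1, b1 = 0.5, one predictor X1
b0 <- -1
b1 <- 0.5
X1 <- c(-10, 0, 10)
1 / (1 + exp(-(b0 + b1 * X1)))   # approx. 0.002, 0.269, 0.982
```

Even for extreme values of X1, the output stays strictly between 0 and 1.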

Regression Analysis Examples

Regression Analysis Examples are as follows:

  • Stock Market Prediction: Regression analysis can be used to predict stock prices based on various factors such as historical prices, trading volume, news sentiment, and economic indicators. Traders and investors can use this analysis to make informed decisions about buying or selling stocks.
  • Demand Forecasting: In retail and e-commerce, regression can help forecast demand for products. By analyzing historical sales data along with real-time data such as website traffic, promotional activities, and market trends, businesses can adjust their inventory levels and production schedules to meet customer demand more effectively.
  • Energy Load Forecasting: Utility companies often use real-time regression analysis to forecast electricity demand. By analyzing historical energy consumption data, weather conditions, and other relevant factors, they can predict future energy loads. This information helps them optimize power generation and distribution, ensuring a stable and efficient energy supply.
  • Online Advertising Performance: Regression analysis can be used to assess the performance of online advertising campaigns. By analyzing real-time data on ad impressions, click-through rates, conversion rates, and other metrics, advertisers can adjust their targeting, messaging, and ad placement strategies to maximize their return on investment.
  • Predictive Maintenance: Regression analysis can be applied to predict equipment failures or maintenance needs. By continuously monitoring sensor data from machines or vehicles, regression models can identify patterns or anomalies that indicate potential failures. This enables proactive maintenance, reducing downtime and optimizing maintenance schedules.
  • Financial Risk Assessment: Real-time regression analysis can help financial institutions assess the risk associated with lending or investment decisions. By analyzing real-time data on factors such as borrower financials, market conditions, and macroeconomic indicators, regression models can estimate the likelihood of default or assess the risk-return tradeoff for investment portfolios.

Importance of Regression Analysis

Importance of Regression Analysis is as follows:

  • Relationship Identification: Regression analysis helps in identifying and quantifying the relationship between a dependent variable and one or more independent variables. It allows us to determine how changes in independent variables impact the dependent variable. This information is crucial for decision-making, planning, and forecasting.
  • Prediction and Forecasting: Regression analysis enables us to make predictions and forecasts based on the relationships identified. By estimating the values of the dependent variable using known values of independent variables, regression models can provide valuable insights into future outcomes. This is particularly useful in business, economics, finance, and other fields where forecasting is vital for planning and strategy development.
  • Causality Assessment: While correlation does not imply causation, regression analysis provides a framework for assessing causality by considering the direction and strength of the relationship between variables. It allows researchers to control for other factors and assess the impact of a specific independent variable on the dependent variable. This helps in determining the causal effect and identifying significant factors that influence outcomes.
  • Model Building and Variable Selection: Regression analysis aids in model building by determining the most appropriate functional form of the relationship between variables. It helps researchers select relevant independent variables and eliminate irrelevant ones, reducing complexity and improving model accuracy. This process is crucial for creating robust and interpretable models.
  • Hypothesis Testing: Regression analysis provides a statistical framework for hypothesis testing. Researchers can test the significance of individual coefficients, assess the overall model fit, and determine if the relationship between variables is statistically significant. This allows for rigorous analysis and validation of research hypotheses.
  • Policy Evaluation and Decision-Making: Regression analysis plays a vital role in policy evaluation and decision-making processes. By analyzing historical data, researchers can evaluate the effectiveness of policy interventions and identify the key factors contributing to certain outcomes. This information helps policymakers make informed decisions, allocate resources effectively, and optimize policy implementation.
  • Risk Assessment and Control: Regression analysis can be used for risk assessment and control purposes. By analyzing historical data, organizations can identify risk factors and develop models that predict the likelihood of certain outcomes, such as defaults, accidents, or failures. This enables proactive risk management, allowing organizations to take preventive measures and mitigate potential risks.

When to Use Regression Analysis

  • Prediction : Regression analysis is often employed to predict the value of the dependent variable based on the values of independent variables. For example, you might use regression to predict sales based on advertising expenditure, or to predict a student’s academic performance based on variables like study time, attendance, and previous grades.
  • Relationship analysis: Regression can help determine the strength and direction of the relationship between variables. It can be used to examine whether there is a linear association between variables, identify which independent variables have a significant impact on the dependent variable, and quantify the magnitude of those effects.
  • Causal inference: Regression analysis can be used to explore cause-and-effect relationships by controlling for other variables. For example, in a medical study, you might use regression to determine the impact of a specific treatment while accounting for other factors like age, gender, and lifestyle.
  • Forecasting : Regression models can be utilized to forecast future trends or outcomes. By fitting a regression model to historical data, you can make predictions about future values of the dependent variable based on changes in the independent variables.
  • Model evaluation: Regression analysis can be used to evaluate the performance of a model or test the significance of variables. You can assess how well the model fits the data, determine if additional variables improve the model’s predictive power, or test the statistical significance of coefficients.
  • Data exploration : Regression analysis can help uncover patterns and insights in the data. By examining the relationships between variables, you can gain a deeper understanding of the data set and identify potential patterns, outliers, or influential observations.

Applications of Regression Analysis

Here are some common applications of regression analysis:

  • Economic Forecasting: Regression analysis is frequently employed in economics to forecast variables such as GDP growth, inflation rates, or stock market performance. By analyzing historical data and identifying the underlying relationships, economists can make predictions about future economic conditions.
  • Financial Analysis: Regression analysis plays a crucial role in financial analysis, such as predicting stock prices or evaluating the impact of financial factors on company performance. It helps analysts understand how variables like interest rates, company earnings, or market indices influence financial outcomes.
  • Marketing Research: Regression analysis helps marketers understand consumer behavior and make data-driven decisions. It can be used to predict sales based on advertising expenditures, pricing strategies, or demographic variables. Regression models provide insights into which marketing efforts are most effective and help optimize marketing campaigns.
  • Health Sciences: Regression analysis is extensively used in medical research and public health studies. It helps examine the relationship between risk factors and health outcomes, such as the impact of smoking on lung cancer or the relationship between diet and heart disease. Regression analysis also helps in predicting health outcomes based on various factors like age, genetic markers, or lifestyle choices.
  • Social Sciences: Regression analysis is widely used in social sciences like sociology, psychology, and education research. Researchers can investigate the impact of variables like income, education level, or social factors on various outcomes such as crime rates, academic performance, or job satisfaction.
  • Operations Research: Regression analysis is applied in operations research to optimize processes and improve efficiency. For example, it can be used to predict demand based on historical sales data, determine the factors influencing production output, or optimize supply chain logistics.
  • Environmental Studies: Regression analysis helps in understanding and predicting environmental phenomena. It can be used to analyze the impact of factors like temperature, pollution levels, or land use patterns on phenomena such as species diversity, water quality, or climate change.
  • Sports Analytics: Regression analysis is increasingly used in sports analytics to gain insights into player performance, team strategies, and game outcomes. It helps analyze the relationship between various factors like player statistics, coaching strategies, or environmental conditions and their impact on game outcomes.

Advantages and Disadvantages of Regression Analysis

Advantages of Regression Analysis:

  • Provides a quantitative measure of the relationship between variables
  • Helps in predicting and forecasting outcomes based on historical data
  • Identifies and measures the significance of independent variables on the dependent variable
  • Provides estimates of the coefficients that represent the strength and direction of the relationship between variables
  • Allows for hypothesis testing to determine the statistical significance of the relationship
  • Can handle both continuous and categorical variables
  • Offers a visual representation of the relationship through the use of scatter plots and regression lines
  • Provides insights into the marginal effects of independent variables on the dependent variable

Disadvantages of Regression Analysis:

  • Assumes a linear relationship between variables, which may not always hold true
  • Requires a large sample size to produce reliable results
  • Assumes no multicollinearity, meaning that independent variables should not be highly correlated with each other
  • Assumes the absence of outliers or influential data points
  • Can be sensitive to the inclusion or exclusion of certain variables, leading to different results
  • Assumes the independence of observations, which may not hold true in some cases
  • May not capture complex non-linear relationships between variables without appropriate transformations
  • Requires the assumption of homoscedasticity, meaning that the variance of errors is constant across all levels of the independent variables


The Complete Guide To Simple Regression Analysis


Learn what simple regression analysis means, why it's useful for analyzing data, and how to interpret the results.

What Is Simple Linear Regression Analysis?

What is the relationship between parental income and educational attainment or hours spent on social media and anxiety levels? Regression is a versatile statistical tool that can help you answer these types of questions. It’s a tool that lets you model the relationship between two or more variables .

The applications of regression are endless. You can use it as a machine learning algorithm to make predictions. You can use it to establish correlations, and in some cases, you can use it to uncover causal links in your data.

In this article, we’ll tell you everything you need to know about the most basic form of regression analysis: the simple linear regression model.

Simple linear regression is a statistical tool you can use to evaluate correlations between a single independent variable (X) and a single dependent variable (Y). The model fits a straight line to data collected for each variable, and using this line, you can estimate the correlation between X and Y and predict values of Y using values of X.

As a quick example, imagine you want to explore the relationship between weight (X) and height (Y). You collect data from ten randomly selected individuals, and you plot your data on a scatterplot like the one below.

[Scatterplot: weight (X) plotted against height (Y) for ten individuals, with the fitted regression line in blue.]

In the scatterplot, each point represents data collected for one of the individuals in your sample. The blue line is your regression line. It models the relationship between weight and height using observed data. Not surprisingly, we see the regression line is upward-sloping, indicating a positive correlation between weight and height. Taller people tend to be heavier than shorter people.

Once you have this line, you can measure how strong the correlation is between height and weight. You can estimate the height of somebody not in your sample by plugging their weight into the regression equation.

The equation for a simple linear regression is:

Y = β₀ + β₁X + ε

X is your independent variable

Y is an estimate of your dependent variable

β₀ is the constant or intercept of the regression line, which is the value of Y when X is equal to zero

β₁ is the regression coefficient, which is the slope of the regression line and your estimate for the change in Y given a 1-unit change in X

ε is the error term of the regression

You may notice the formula for a regression looks very similar to the equation of a line (y = mX + b). That's because linear regression is a line! It's a line fitted to data that you can use to estimate the values of one variable using the value of a correlated variable.

You can build a simple linear regression model in 5 steps.

1. Collect data

Collect data for two variables (X and Y). Y is your dependent variable, which is the variable you want to estimate using the regression. X is your independent variable—the variable you use as an input in your regression.

2. Plot the data on a scatter plot

Plot the values of X and Y on a scatter plot with values of X plotted along the horizontal x-axis and values of Y plotted on the vertical y-axis.

3. Calculate a correlation coefficient

Calculate a correlation coefficient to determine the strength of the linear relationship between your two variables.

4. Fit a regression to the data

Find the regression line using the ordinary least-squares method. (You can do this by hand, but it's much easier to use statistical software like Desmos, Excel, R, or Stata.)

5. Assess the regression line

Once you have the regression line, assess how well your model performs by checking to see how well the model predicts values of Y.

The key assumptions we make when using a simple linear regression model are:

Linearity

The relationship between X and Y (if it exists) is linear.

Independence

The residuals of your model are independent.

Homoscedasticity

The variance of the residuals is constant across values of the independent variable.

Normality

The residuals are normally distributed.

You should not use a simple linear regression unless it's reasonable to make these assumptions.

Simple linear regression involves fitting a straight line to your dataset. We call this line the line of best fit or the regression line. The most common method for finding this line is OLS (or the Ordinary Least Squares Method).

In OLS, we find the regression line by minimizing the sum of squared residuals—also called squared errors. Anytime you draw a straight line through your data, there will be a vertical distance between each point on your scatter plot and the regression line. These vertical distances are called residuals (or errors).

They represent the difference between the actual values of your dependent variable, Yᵢ, and the predicted values of that variable, Ŷᵢ. The regression line you find with OLS is the line that minimizes the sum of squared residuals.

[Figure: residuals shown as vertical distances between each data point and the regression line.]

You can calculate the OLS regression line by hand, but it's much easier to do so using statistical software like Excel, Desmos, R, or Stata.
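To see that the software is doing nothing mysterious, here is a small sketch in R comparing the closed-form OLS estimates with lm() on simulated data:

```r
set.seed(3)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)        # true intercept 1, true slope 2

b1 <- cov(x, y) / var(x)          # slope minimizing the sum of squared residuals
b0 <- mean(y) - b1 * mean(x)      # intercept
c(b0, b1)

coef(lm(y ~ x))                   # identical estimates from lm()
```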

Depending on the software you use, the results of your regression analysis may look different. In general, however, your software will display output tables summarizing the main characteristics of your regression.

The values you should be looking for in these output tables fall under three categories:

Coefficients

Regression statistics

The model's p-value

Intercept

This is the β₀ value in your regression equation. It is the y-intercept of your regression line, and it is the estimate of Y when X is equal to zero.

Next to your intercept, you'll see columns in the table showing additional information about the intercept. These include a standard error, p-value, T-stat, and confidence interval. You can use these values to test whether the estimate of your intercept is statistically significant.

Regression coefficient

This is the β₁ of your regression equation. It's the slope of the regression line, and it tells you how much Y should change in response to a 1-unit change in X.

Similar to the intercept, the regression coefficient will have columns to the right of it. They'll show a standard error, p-value, T-stat, and confidence interval. Use these values to test whether your parameter estimate of β₁ is statistically significant.

Regression Statistics

Correlation coefficient (or Multiple R)

This is the Pearson Correlation coefficient. It measures the strength of the correlation between X and Y.

R-squared (or the coefficient of determination)

We calculate this value by squaring the correlation coefficient. It tells you how much of the variance in your dependent variable can be explained by the independent variable. You can convert R² into a percentage by multiplying it by 100.

Standard error of the residuals

The standard error of the residuals is the average value of the errors in your model. It is the average vertical distance between each point on your scatter plot and the regression line. We measure this value in the same units as your dependent variable.

Degrees of freedom

In simple linear regression, the degrees of freedom equal the number of data points you used minus the two estimated parameters. The parameters are the intercept and regression coefficient.

Some software will also output a 5-number summary of your residuals. It'll show the minimum, first quartile, median, third quartile, and maximum values of your residuals.

P-value (or Significance F) - This is the p-value of your regression model.

It reports the results of a hypothesis test in which the null hypothesis is that no relationship exists between X and Y. The alternative hypothesis is that a linear relationship exists between X and Y.

If you are using a significance level (or alpha level) of 0.05, you would reject the null hypothesis if the p-value is less than or equal to 0.05. You would fail to reject the null hypothesis if your p-value is greater than 0.05.

What are correlations?

A correlation is a measure of the relationship between two variables.

Positive Correlations - If two variables, X and Y, have a positive linear correlation, Y tends to increase as X increases, and Y tends to decrease as X decreases. In other words, the two variables tend to move together in the same direction.

Negative Correlations - Two variables, X and Y, have a negative correlation if Y tends to increase as X decreases and Y tends to decrease as X increases. (i.e., The values of the two variables tend to move in opposite directions).

What’s the difference between the dependent and independent variables in a regression?

A simple linear regression involves two variables: X, the input or independent variable, and Y, the output or dependent variable. The dependent variable is the variable you want to estimate using the regression. Its estimated value "depends" on the parameters and other variables of the model.

The independent variable—also called the predictor variable—is an input in the model. Its value does not depend on the other elements of the model.

Is the correlation coefficient the same as the regression coefficient?

The correlation coefficient and the regression coefficient will both have the same sign (positive or negative), but they are not the same. The only case where these two values will be equal is when the values of X and Y have been standardized to the same scale.

What is a correlation coefficient?

A correlation coefficient—or Pearson's correlation coefficient—measures the strength of the linear relationship between X and Y. It's a number ranging between -1 and 1. The closer a correlation coefficient is to 0, the weaker the correlation is between X and Y.

The closer the correlation coefficient is to 1 or -1, the stronger the correlation. Points on a scatter plot will be more dispersed around the regression line when the correlation between X and Y is weak, and the points will be more tightly clustered around the regression line when the correlation is strong.

What is the regression coefficient?

The regression coefficient, β₁, is the slope of the regression line. It provides you with an estimate of how much the dependent variable, Y, will change in response to a 1-unit increase in the independent variable, X.

The regression coefficient can be any number from −∞ to ∞. A positive regression coefficient implies a positive correlation between X and Y, and a negative regression coefficient implies a negative correlation.

Can I use linear regression in Excel?

Yes. The easiest way to add a simple linear regression line in Excel is to install and use Excel’s “Analysis Toolpak” add-in. To do this, go to Tools > Excel Add-ins and select the “Analysis Toolpak.”

Next, follow these steps:

  • In your spreadsheet, enter your data for X and Y in two columns
  • Navigate to the "Data" tab and click on the "Data Analysis" icon
  • From the list of analysis tools, select "Regression" and click "OK"
  • Select the data for Y and X respectively where it says "Input Y Range" and "Input X Range"
  • If you've labeled your columns with the names of your X and Y variables, click on the "Labels" checkbox
  • Customize where you want your regression output to appear in your workbook and what additional information you would like Excel to display
  • Once you've finished customizing, click "OK"

Your regression results will display next to your data or in a new sheet.

Is linear regression used to establish causal relationships?

Correlations are not equivalent to causation. If two variables are correlated, you cannot immediately conclude that one causes the other to change. A linear regression will immediately indicate whether two variables correlate. But you'll need to include more variables in your model and use regression with causal theories to draw conclusions about causal relationships.

What are some other types of regression analysis?

Simple linear regression is the most basic form of regression analysis. It involves one independent variable and one dependent variable. Once you get a handle on this model, you can move on to more sophisticated forms of regression analysis. These include multiple linear regression and nonlinear regression.

Multiple linear regression is a model that estimates the linear relationship between variables using one dependent variable and multiple predictor variables. Nonlinear regression is a method used to estimate nonlinear relationships between variables.



Lesson 1: Simple Linear Regression – Overview

Simple linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables. This lesson introduces the concept and basic procedures of simple linear regression. Upon completion of this lesson, you should be able to:

  • Distinguish between a deterministic relationship and a statistical relationship.
  • Understand the concept of the least squares criterion.
  • Interpret the intercept \(b_{0}\) and slope \(b_{1}\) of an estimated regression equation.
  • Know how to obtain the estimates \(b_{0}\) and \(b_{1}\) from Minitab's fitted line plot and regression analysis output.
  • Recognize the distinction between a population regression line and the estimated regression line.
  • Summarize the four conditions that comprise the simple linear regression model.
  • Know what the unknown population variance \(\sigma^{2}\) quantifies in the regression setting.
  • Know how to obtain the estimated MSE of the unknown population variance \(\sigma^{2}\) from Minitab's fitted line plot and regression analysis output.
  • Know that the coefficient of determination (\(R^2\)) and the correlation coefficient (r) are measures of linear association. That is, they can be 0 even if there is a perfect nonlinear association.
  • Know how to interpret the \(R^2\) value.
  • Understand the cautions necessary in using the \(R^2\) value as a way of assessing the strength of the linear association.
  • Know how to calculate the correlation coefficient r from the \(R^2\) value.
  • Know what various correlation coefficient values mean. The correlation coefficient has no interpretation as direct as the one available for the \(R^2\) value.

Lesson 1 Code Files

STAT501_Lesson01.zip

  • bldgstories.txt
  • carstopping.txt
  • drugdea.txt
  • fev_dat.txt
  • heightgpa.txt
  • husbandwife.txt
  • oldfaithful.txt
  • poverty.txt
  • practical.txt
  • signdist.txt
  • skincancer.txt
  • student_height_weight.txt


Multiple Linear Regression | A Quick Guide (Examples)

Published on February 20, 2020 by Rebecca Bevans. Revised on June 22, 2023.

Regression models are used to describe relationships between variables by fitting a line to the observed data. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.

Multiple linear regression is used to estimate the relationship between  two or more independent variables and one dependent variable . You can use multiple linear regression when you want to know:

  • How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
  • The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).

Assumptions of multiple linear regression

Multiple linear regression makes all of the same assumptions as simple linear regression:

Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn't change significantly across the values of the independent variable.

Independence of observations: the observations in the dataset were collected using statistically valid sampling methods, and there are no hidden relationships among variables.

In multiple linear regression, it is possible that some of the independent variables are actually correlated with one another, so it is important to check these before developing the regression model. If two independent variables are too highly correlated (r2 > ~0.6), then only one of them should be used in the regression model.

Normality: the data follows a normal distribution.

Linearity: the line of best fit through the data points is a straight line, rather than a curve or some sort of grouping factor.


Multiple linear regression formula

The formula for a multiple linear regression is:

y = β0 + β1X1 + … + βnXn + ε

  • y = the predicted value of the dependent variable
  • β0 = the y-intercept (the value of y when all other parameters are set to 0)
  • β1X1 = the regression coefficient (β1) of the first independent variable (X1) multiplied by its value
  • … = do the same for however many independent variables you are testing
  • βnXn = the regression coefficient of the last independent variable
  • ε = model error (how much variation there is in our estimate of y)

To find the best-fit line for each independent variable, multiple linear regression calculates three things:

  • The regression coefficients that lead to the smallest overall model error.
  • The t statistic of the overall model.
  • The associated p value (how likely it is that the t statistic would have occurred by chance if the null hypothesis of no relationship between the independent and dependent variables was true).

It then calculates the t statistic and p value for each regression coefficient in the model.

Multiple linear regression in R

While it is possible to do multiple linear regression by hand, it is much more commonly done via statistical software. We are going to use R for our examples because it is free, powerful, and widely available. Download the sample dataset to try it yourself.

Dataset for multiple linear regression (.csv)

Load the heart.data dataset into your R environment and run the following code:

This code takes the data set heart.data and calculates the effect that the independent variables biking and smoking have on the dependent variable heart disease, using the linear model function lm().


To view the results of the model, you can use the summary() function:
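Continuing the sketch above:

```r
summary(heart.disease.lm)   # coefficient table and fit statistics
```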

This function takes the most important parameters from the linear model and puts them into a table that looks like this:

[Figure: summary() output table for the multiple linear regression model.]

The summary first prints out the formula ('Call'), then the model residuals ('Residuals'). If the residuals are roughly centered around zero and with similar spread on either side, as these do (median 0.03, and min and max around -2 and 2), then the model probably fits the assumption of homoscedasticity.

Next are the regression coefficients of the model ('Coefficients'). Row 1 of the coefficients table is labeled (Intercept) – this is the y-intercept of the regression equation. It's helpful to know the estimated intercept in order to plug it into the regression equation and predict values of the dependent variable.

The most important things to note in this output table are the next two rows – the estimates for the independent variables.

The Estimate column is the estimated effect, also called the regression coefficient. The estimates in the table tell us that for every one percent increase in biking to work there is an associated 0.2 percent decrease in heart disease, and that for every one percent increase in smoking there is an associated 0.17 percent increase in heart disease.

The Std.error column displays the standard error of the estimate. This number shows how much variation there is around the estimates of the regression coefficient.

The t value column displays the test statistic. Unless otherwise specified, the test statistic used in linear regression is the t value from a two-sided t test. The larger the test statistic, the less likely it is that the results occurred by chance.

The Pr(>|t|) column shows the p value. This shows how likely the calculated t value would have occurred by chance if the null hypothesis of no effect of the parameter were true.

Because these values are so low (p < 0.001 in both cases), we can reject the null hypothesis and conclude that both biking to work and smoking likely influence rates of heart disease.

When reporting your results, include the estimated effect (i.e. the regression coefficient), the standard error of the estimate, and the p value. You should also interpret your numbers to make it clear to your readers what the regression coefficient means.

Visualizing the results in a graph

It can also be helpful to include a graph with your results. Multiple linear regression is somewhat more complicated than simple linear regression, because there are more parameters than will fit on a two-dimensional plot.

However, there are ways to display your results that include the effects of multiple independent variables on the dependent variable, even though only one independent variable can actually be plotted on the x-axis.

[Figure: predicted rates of heart disease across the observed range of biking, at the minimum, mean, and maximum observed rates of smoking.]

Here, we have calculated the predicted values of the dependent variable (heart disease) across the full range of observed values for the percentage of people biking to work.

To include the effect of smoking on the dependent variable, we calculated these predicted values while holding smoking constant at the minimum, mean, and maximum observed rates of smoking.

A regression model is a statistical model that estimates the relationship between one dependent variable and one or more independent variables using a line (or a plane in the case of two or more independent variables).

A regression model can be used when the dependent variable is quantitative, except in the case of logistic regression, where the dependent variable is binary.

Multiple linear regression is a regression model that estimates the relationship between a quantitative dependent variable and two or more independent variables using a straight line.

Linear regression most often uses mean-square error (MSE) to calculate the error of the model. MSE is calculated by:

  • measuring the distance of the observed y-values from the predicted y-values at each value of x;
  • squaring each of these distances;
  • calculating the mean of each of the squared distances.

Linear regression fits a line to the data by finding the regression coefficient that results in the smallest MSE.
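A sketch of those three steps in R, with hypothetical observed and predicted values:

```r
observed  <- c(3.1, 4.8, 6.2, 7.9)   # actual y-values
predicted <- c(3.0, 5.0, 6.0, 8.1)   # model's predicted y-values

distances <- observed - predicted    # step 1: distances at each value of x
squared   <- distances^2             # step 2: square each distance
mean(squared)                        # step 3: mean of the squared distances (MSE)
```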


Statistics - explanations and formulas

Regression Analysis


Regression analysis is used to quantify the relationship between a single independent variable and a single dependent variable based on past observations.

  • Regression Analysis video: a 40-minute video explaining regression analysis (tagged as "best regression video ever!").
  • Simple Regression Analysis Explanation: 3:50 min.


Regression Analysis: Step by Step Articles, Videos, Simple Definitions



Regression analysis is a way to find trends in data. For example, you might guess that there’s a connection between how much you eat and how much you weigh; regression analysis can help you quantify that.

Regression analysis will provide you with an equation for a graph so that you can make predictions about your data. For example, if you've been putting on weight over the last few years, it can predict how much you'll weigh in ten years' time if you continue to put on weight at the same rate. It will also give you a slew of statistics (including a p-value and a correlation coefficient) to tell you how accurate your model is. Most elementary stats courses cover very basic techniques, like making scatter plots and performing linear regression. However, you may come across more advanced techniques like multiple regression.


Regression analysis: an introduction.

[Figure: scatter plot of yearly snowfall with fitted trendline y = -2.2923x + 4624.4, R squared = 0.702]

Best of all, you can use the equation to make predictions. For example, how much snow will fall in 2017? y = -2.2923(2017) + 4624.4 ≈ 0.8 inches. (Note the negative slope: snowfall in this data set is trending downward over time.)

Regression also gives you an R squared value, which for this graph is 0.702. This number tells you how good your model is. The values range from 0 to 1, with 0 being a terrible model and 1 being a perfect model. As you can probably see, 0.7 is a fairly decent model so you can be fairly confident in your weather prediction!
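To make the arithmetic concrete, here is a tiny Python sketch that plugs a year into the fitted trendline above (assuming the negative slope, which the 0.8-inch result requires):

```python
# Coefficients from the snowfall trendline in the example above.
slope, intercept = -2.2923, 4624.4

def predict_snowfall(year: int) -> float:
    """Plug a year into the fitted line y = slope*year + intercept."""
    return slope * year + intercept

print(round(predict_snowfall(2017), 1))  # 0.8 inches, matching the text
```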


Multiple Regression Analysis

Multiple regression analysis is used to see if there is a statistically significant relationship between sets of variables. It's used to find trends in those sets of data.

Multiple regression analysis is almost the same as simple linear regression . The only difference between simple linear regression and multiple regression is in the number of predictors (“x” variables) used in the regression.

  • Simple regression analysis uses a single x variable for each dependent "y" variable. For example: (x1, Y1).
  • Multiple regression uses multiple "x" variables for each dependent "y" variable: ((x1)1, (x2)1, (x3)1, Y1).

In one-variable linear regression, you would input one dependent variable (e.g., "sales") against an independent variable (e.g., "profit"). But you might be interested in how different types of sales affect the regression. You could set your X1 as one type of sales, your X2 as another type of sales and so on.

When to Use Multiple Regression Analysis.

Ordinary linear regression usually isn’t enough to take into account all of the real-life factors that have an effect on an outcome. For example, the following graph plots a single variable (number of doctors) against another variable (life-expectancy of women).

From this graph it might appear there is a relationship between life-expectancy of women and the number of doctors in the population . In fact, that’s probably true and you could say it’s a simple fix: put more doctors into the population to increase life expectancy. But the reality is you would have to look at other factors like the possibility that doctors in rural areas might have less education or experience. Or perhaps they have a lack of access to medical facilities like trauma centers.

The addition of those extra factors would cause you to add additional independent variables to your regression analysis and create a multiple regression analysis model.

Multiple Regression Analysis Output.

Regression analysis is always performed in software, like Excel or SPSS. The output differs according to how many variables you have but it’s essentially the same type of output you would find in a simple linear regression. There’s just more of it:

  • Simple regression: Y = b0 + b1x.
  • Multiple regression: Y = b0 + b1x1 + b2x2 + … + bnxn.

The output would include a summary, similar to a summary for simple linear regression, that includes:

  • R (the multiple correlation coefficient),
  • R squared (the coefficient of determination),
  • adjusted R-squared,
  • the standard error of the estimate.

These statistics help you figure out how well a regression model fits the data. The ANOVA table in the output would give you the p-value and f-statistic .
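If you want to see this output without Excel or SPSS, here is a short, hypothetical Python sketch using the statsmodels library; the data are invented, but the fitted model reports the same quantities listed above (R, R squared, adjusted R-squared, standard errors, and the ANOVA F-statistic with its p-value):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Invented data: two types of sales (x1, x2) predicting an outcome y.
n = 100
x1 = rng.normal(50, 10, n)
x2 = rng.normal(30, 5, n)
y = 5 + 0.8 * x1 + 1.5 * x2 + rng.normal(0, 4, n)

X = sm.add_constant(np.column_stack([x1, x2]))  # prepends the b0 column of ones
model = sm.OLS(y, X).fit()

print(model.rsquared)                # R squared (coefficient of determination)
print(model.rsquared_adj)            # adjusted R-squared
print(model.fvalue, model.f_pvalue)  # ANOVA F-statistic and its p-value
print(model.summary())               # full summary, incl. standard errors
```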

Minimum Sample Size

“The answer to the sample size question appears to depend in part on the objectives of the researcher, the research questions that are being addressed, and the type of model being utilized. Although there are several research articles and textbooks giving recommendations for minimum sample sizes for multiple regression, few agree on how large is large enough and not many address the prediction side of MLR .” ~ Gregory T. Knofczynski

If you're concerned with finding accurate values for the squared multiple correlation coefficient, minimizing the shrinkage of the squared multiple correlation coefficient, or have another specific goal, Gregory Knofczynski's paper is a worthwhile read and comes with lots of references for further study. That said, many people just want to run MLR to get a general idea of trends, and they don't need very specific estimates. If that's the case, you can use a rule of thumb. It's widely stated in the literature that you should have more than 100 items in your sample. While this is sometimes adequate, you'll be on the safer side if you have at least 200 observations, or better yet more than 400.

Overfitting in Regression


Overfitting is where your model is too complex for your data — it happens when your sample size is too small. If you put enough predictor variables in your regression model, you will nearly always get a model that looks significant .

While an overfitted model may fit the idiosyncrasies of your data extremely well, it won’t fit additional test samples or the overall population. The model’s p-values, R-Squared and regression coefficients can all be misleading. Basically, you’re asking too much from a small set of data.

How to Avoid Overfitting

In linear modeling (including multiple regression ), you should have at least 10-15 observations for each term you are trying to estimate. Any less than that, and you run the risk of overfitting your model. “Terms” include:

  • Interaction Effects,
  • Polynomial expressions (for modeling curved lines),
  • Predictor variables.

While this rule of thumb is generally accepted, Green (1991) takes this a step further and suggests that the minimum sample size for any regression should be 50, with an additional 8 observations per predictor. For example, if you have one interaction term and three predictor variables (four terms in all), you'll need around 40-60 items in your sample by the 10-15 rule, or 50 + 3(8) = 74 items according to Green; both rules are sketched in code below.
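A small sketch of both rules of thumb in Python (the function names are ours, not Green's):

```python
def min_n_per_term(n_terms: int, low: int = 10, high: int = 15) -> tuple:
    """The 10-15 observations-per-term rule of thumb."""
    return low * n_terms, high * n_terms

def min_n_green_1991(n_predictors: int) -> int:
    """Green (1991): at least 50 observations plus 8 per predictor."""
    return 50 + 8 * n_predictors

# Example from the text: three predictors plus one interaction (four terms).
print(min_n_per_term(4))    # (40, 60)
print(min_n_green_1991(3))  # 74
```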

There are exceptions to the “10-15” rule of thumb. They include:

  • When there is multicollinearity in your data, or if the effect size is small. If that’s the case, you’ll need to include more terms (although there is, unfortunately, no rule of thumb for how many terms to add!).
  • You may be able to get away with as few as 10 observations per predictor if you are using logistic regression or survival models , as long as you don’t have extreme event probabilities, small effect sizes, or predictor variables with truncated ranges . (Peduzzi et al.)

How to Detect and Avoid Overfitting

The easiest way to avoid overfitting is to increase your sample size by collecting more data. If you can’t do that, the second option is to reduce the number of predictors in your model — either by combining or eliminating them. Factor Analysis is one method you can use to identify related predictors that might be candidates for combining.

1. Cross-Validation

Use cross-validation to detect overfitting: it partitions your data, fits the model to one portion, and checks how well that model predicts the portion that was held out. One form of cross-validation is predicted R-squared. Most good statistical software will include this statistic, which is calculated by:

  • Removing one observation at a time from your data,
  • Estimating the regression equation for each iteration,
  • Using the regression equation to predict the removed observation.

Cross validation isn’t a magic cure for small data sets though, and sometimes a clear model isn’t identified even with an adequate sample size.
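Concretely, here is a minimal Python (NumPy) sketch of the three steps above for an ordinary least-squares model on invented data; the accumulated squared prediction errors are converted into a predicted R-squared:

```python
import numpy as np

def predicted_r_squared(x: np.ndarray, y: np.ndarray) -> float:
    """Leave-one-out 'predicted R-squared': remove each observation,
    refit the regression, predict the removed point, and accumulate
    the squared prediction errors."""
    n = len(y)
    Xc = np.column_stack([np.ones(n), x])   # add an intercept column
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(Xc[mask], y[mask], rcond=None)
        press += (y[i] - Xc[i] @ beta) ** 2
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1 - press / ss_total

rng = np.random.default_rng(1)
x = rng.normal(size=40)
y = 2 + 3 * x + rng.normal(scale=1.0, size=40)
print(predicted_r_squared(x, y))
```

A predicted R-squared far below the ordinary R-squared is a warning sign that the model is fitting noise rather than signal.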

2. Shrinkage & Resampling

Shrinkage and resampling techniques (like this R module) can help you to find out how well your model might fit a new sample.

3. Automated Methods

Automated stepwise regression shouldn’t be used as an overfitting solution for small data sets. According to Babyak (2004),

“The problems with automated selection conducted in this very typical manner are so numerous that it would be hard to catalogue all of them [in a journal article].”

Babyak also recommends avoiding univariate pretesting or screening (a "variation of automated selection in disguise") and dichotomizing continuous variables, which can dramatically increase Type I errors. Multiple testing of confounding variables may be acceptable if used judiciously.

Books:

  • Gonick, L. (1993). The Cartoon Guide to Statistics. HarperPerennial.
  • Lindstrom, D. (2010). Schaum's Easy Outline of Statistics, 2nd Edition. McGraw-Hill Education.

Journal articles:

  • Babyak, M. A. (2004). "What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models." Psychosomatic Medicine, 66(3), 411-421.
  • Green, S. B. (1991). "How many subjects does it take to do a regression analysis?" Multivariate Behavioral Research, 26, 499-510.
  • Peduzzi, P. N., et al. (1995). "The importance of events per independent variable in multivariable analysis, II: accuracy and precision of regression estimates." Journal of Clinical Epidemiology, 48, 1503-1510.
  • Peduzzi, P. N., et al. (1996). "A simulation study of the number of events per variable in logistic regression analysis." Journal of Clinical Epidemiology, 49, 1373-1379.



Fun fact: regression isn't just for creating trendlines. It's also a great hack for finding the nth term of a quadratic sequence.


Regression in Minitab

Regression is fitting data to a line (Minitab can also perform other types of regression, like quadratic regression). When you find regression in Minitab, you'll get a scatter plot of your data along with the line of best fit, plus Minitab will provide you with:

  • Standard Error S (how much the data points deviate from the fitted line, on average).
  • R squared: a value between 0 and 1 which tells you how well your data points fit the model.
  • Adjusted R squared (adjusts R squared to account for the number of terms in the model).

Regression in Minitab takes just a couple of clicks from the toolbar and is accessed through the Stat menu.

Example question : Find regression in Minitab for the following set of data points that compare calories consumed per day to weight: Calories consumed daily (Weight in lb): 2800 (140), 2810 (143), 2805 (144), 2705 (145), 3000 (155), 2500 (130), 2400 (121), 2100 (100), 2000 (99), 2350 (120), 2400 (121), 3000 (155).

Step 1: Type your data into two columns in Minitab .

Step 2: Click “Stat,” then click “Regression” and then click “Fitted Line Plot.”


Step 3: Choose the variables in the left-hand window. For this sample question, we want to know if calories consumed affects weight, so calories is the independent variable (X) and weight is the dependent variable (Y). Click "Calories" and then click "Select."

Step 4: Repeat Step 3 for the dependent variable (Y), weight.


Step 5: Click “OK.” Minitab will create a regression line graph in a separate window.

Step 6: Read the results. As well as creating a regression graph, Minitab will give you values for S, R-sq and R-sq(adj) in the top right corner of the fitted line plot window: S = standard error; R-Sq = coefficient of determination; R-Sq(adj) = adjusted coefficient of determination (adjusted R squared).

That’s it!
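For readers without Minitab, here is a short sketch of the same fit in Python using scipy.stats.linregress and the calorie/weight data from the example question; the slope, intercept, and R-Sq that Minitab reports should match these up to rounding:

```python
from scipy.stats import linregress

# Data from the example: daily calories (x) and weight in lb (y).
calories = [2800, 2810, 2805, 2705, 3000, 2500, 2400, 2100, 2000, 2350, 2400, 3000]
weight   = [ 140,  143,  144,  145,  155,  130,  121,  100,   99,  120,  121,  155]

fit = linregress(calories, weight)
print(f"weight = {fit.intercept:.2f} + {fit.slope:.4f} * calories")
print(f"R-Sq: {fit.rvalue**2:.3f}")            # coefficient of determination
print(f"std. error of the slope: {fit.stderr:.4f}")
```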

Regression Analysis Formula


Regression analysis models the relationship between dependent and independent variables: it depicts how the dependent variable will change when one or more independent variables change. The formula for the calculation is Y = a + bX + E, where Y is the dependent variable, X is the independent variable, a is the intercept, b is the slope, and E is the residual (error term).

Regression is a statistical tool to predict the dependent variable with the help of one or more independent variables. While running a regression analysis, the main purpose of the researcher is to find out the relationship between the dependent and independent variables. One or more independent variables are chosen that can help predict the dependent variable. In addition, regression helps validate whether the predictor variables are good enough to help predict the dependent variable.

A regression analysis formula tries to find the best-fit line for the dependent variable with the help of the independent variables. The regression analysis equation is the same as the equation for a line:

y = Mx + B

  • y = the dependent variable of the regression equation
  • M = slope of the regression equation
  • x = the independent variable of the regression equation
  • B = constant (intercept) of the equation


  • Regression analysis explores the relationship between dependent and independent variables, showing how the dependent variable changes when one or more independent variables are altered due to various factors.
  • The regression analysis formula aims to find the best-fit line that represents the relationship between the dependent and independent variables.
  • Regression analysis is valuable in making informed business decisions and validating hypotheses. By examining the regression between dependent and independent variables, one can assess the potential impact of specific actions on profitability or other outcomes, making it a crucial tool in finance and other fields.


Let us try to understand the concept of regression analysis with the help of an example. Suppose we want to find the relation between the distance covered by a truck driver and the age of the truck driver. A regression equation lets us check whether the relationship we suspect between the two variables is actually supported by the data.

Below is the data for the calculation:

[Data table: each truck driver's age and the distance covered]

For the calculation of regression analysis, go to the "Data" tab in Excel and then select the "Data Analysis" option. For the detailed procedure, see the article Analysis ToolPak in Excel.

[Screenshot: Excel regression output for the age-distance data]

Reading the intercept and slope off the Excel output and putting them in Y = a + bX form, the regression equation for the above example will be:

  • y = 575.754 - 3.121x

In this particular example, we will see which variable is the dependent variable and which variable is the independent variable. The dependent variable in this regression equation is the distance covered by the truck driver, and the independent variable is the age of the truck driver. The regression for this set of dependent and independent variables proves that the independent variable is a good predictor of the dependent variable, with a reasonably high coefficient of determination. The analysis also helps validate that the factors in the form of the independent variable are selected correctly.

Let us try to understand regression analysis with the help of another example: the relation between the heights of the students of a class and the GPA grades of those students. A regression equation lets us check whether the relationship we suspect between these two variables is supported by the data.

In this example, below is the data for the calculation in Excel:

[Data table: each student's height and GPA]

For the regression analysis calculation, go to the "Data" tab in Excel and select the "Data Analysis" option.

[Screenshot: Excel regression output for the height-GPA data]

Reading the output the same way, the regression for the above example will be:

  • y = 2.65 + 0.0034x
  • R squared = 0.009198

In this particular example, we will see which variable is the dependent variable and which variable is the independent variable. The dependent variable in this regression equation is the student's GPA, and the independent variable is the student's height. The regression analysis for this set of variables proves that the independent variable is not a good predictor of the dependent variable, as the value for the coefficient of determination is negligible. In this case, we need to find another predictor variable to predict the dependent variable for the regression analysis.

Regression is a very useful statistical method. One can validate a business decision by testing the hypothesis that a particular action will increase a division's profitability, based on the regression between the dependent and independent variables. Therefore, the regression analysis equation plays a very important role in finance. In addition, a lot of forecasting is performed using regression. For example, one can predict the sales of a particular segment in advance with the help of macroeconomic indicators that have a very good correlation with that segment. Both linear and multiple regressions are useful for practitioners: they make predictions of the dependent variables and validate the independent variables as predictors of the dependent variables.

Frequently Asked Questions (FAQs)

Regression analysis relies on several assumptions. First, it assumes a linear relationship between the independent and dependent variables. It also assumes that the observations in the dataset are independent of each other, meaning that one observation does not influence another. The assumption of homoscedasticity states that the variance of the errors or residuals is constant across all levels of the independent variables. 

The regression analysis has limitations to consider. It is only suitable for analyzing variables that exhibit a linear relationship, potentially missing complex nonlinear relationships. While it can identify associations, it cannot establish causation, requiring additional evidence and consideration of other factors. Violating the assumptions can affect the accuracy and reliability of the results. Outliers or influential observations can also disproportionately impact the outcomes, leading to biased estimates. 

Sir Francis Galton initially developed regression analysis in the late 19th century. It was further refined and formalized by other statisticians, such as Karl Pearson and Ronald Fisher, who made significant contributions to the field.


A Refresher on Regression Analysis


Understanding one of the most important types of data analysis.

You probably know by now that whenever possible you should be making data-driven decisions at work . But do you know how to parse through all the data available to you? The good news is that you probably don’t need to do the number crunching yourself (hallelujah!) but you do need to correctly understand and interpret the analysis created by your colleagues. One of the most important types of data analysis is called regression analysis.



What Is Regression?


Regression: Definition, Analysis, Calculation, and Example


Regression is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between a dependent variable and one or more independent variables.

Linear regression is the most common form of this technique. Also called simple regression or ordinary least squares (OLS), linear regression establishes the linear relationship between two variables.

Linear regression is graphically depicted using a straight line of best fit with the slope defining how the change in one variable impacts a change in the other. The y-intercept of a linear regression relationship represents the value of the dependent variable when the value of the independent variable is zero. Nonlinear regression models also exist, but are far more complex.

Key Takeaways

  • Regression is a statistical technique that relates a dependent variable to one or more independent variables.
  • A regression model is able to show whether changes observed in the dependent variable are associated with changes in one or more of the independent variables.
  • It does this by essentially determining a best-fit line and seeing how the data is dispersed around this line.
  • Regression helps economists and financial analysts in things ranging from asset valuation to making predictions.
  • For regression results to be properly interpreted, several assumptions about the data and the model itself must hold.

In economics, regression is used to help investment managers value assets and understand the relationships between factors such as commodity prices and the stocks of businesses dealing in those commodities.

While a powerful tool for uncovering the associations between variables observed in data, it cannot easily indicate causation. Regression as a statistical technique should not be confused with the concept of regression to the mean, also known as mean reversion .


Regression captures the correlation between variables observed in a data set and quantifies whether those correlations are statistically significant or not.

The two basic types of regression are simple linear regression and  multiple linear regression , although there are nonlinear regression methods for more complicated data and analysis. Simple linear regression uses one independent variable to explain or predict the outcome of the dependent variable Y, while multiple linear regression uses two or more independent variables to predict the outcome. Analysts can use stepwise regression to examine each independent variable contained in the linear regression model.

Regression can help finance and investment professionals. For instance, a company might use it to predict sales based on weather, previous sales, gross domestic product (GDP) growth, or other types of conditions. The capital asset pricing model (CAPM) is an often-used regression model in finance for pricing assets and discovering the costs of capital.

Regression and Econometrics

Econometrics is a set of statistical techniques used to analyze data in finance and economics. An example of the application of econometrics is to study the income effect using observable data. An economist may, for example, hypothesize that as a person increases their income , their spending will also increase.

If the data show that such an association is present, a regression analysis can then be conducted to understand the strength of the relationship between income and consumption and whether or not that relationship is statistically significant.

Note that you can have several independent variables in an analysis—for example, changes to GDP and inflation in addition to unemployment in explaining stock market prices. When more than one independent variable is used, it is referred to as  multiple linear regression . This is the most commonly used tool in econometrics.

Econometrics is sometimes criticized for relying too heavily on the interpretation of regression output without linking it to economic theory or looking for causal mechanisms. It is crucial that the findings revealed in the data are able to be adequately explained by a theory.

Linear regression models often use a least-squares approach to determine the line of best fit. The least-squares technique is determined by minimizing the sum of squares created by a mathematical function. A square is, in turn, determined by squaring the distance between a data point and the regression line or mean value of the data set.
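A minimal sketch of that least-squares computation in Python, solving the normal equations directly on invented data (real analyses would typically use a statistics package rather than doing this by hand):

```python
import numpy as np

# Least squares "by hand": choose coefficients minimizing the sum of
# squared distances between the data points and the fitted line.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 50)
y = 4.0 + 1.2 * x + rng.normal(0, 1, 50)

X = np.column_stack([np.ones_like(x), x])   # [1, x] design matrix
b = np.linalg.solve(X.T @ X, X.T @ y)       # normal equations: (X'X)b = X'y
print(f"intercept a = {b[0]:.3f}, slope b = {b[1]:.3f}")

residuals = y - X @ b                       # the error term u
print(f"sum of squared residuals: {np.sum(residuals**2):.3f}")
```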

Once this process has been completed (usually done today with software), a regression model is constructed. The general form of each type of regression model is:

Simple linear regression:

Y = a + bX + u

Multiple linear regression:

Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u

where:

  • Y = the dependent variable you are trying to predict or explain
  • X = the explanatory (independent) variable(s) you are using to predict or associate with Y
  • a = the y-intercept
  • b = (beta coefficient) the slope of the explanatory variable(s)
  • u = the regression residual or error term

Example of How Regression Analysis Is Used in Finance

Regression is often used to determine how specific factors—such as the price of a commodity, interest rates, particular industries, or sectors—influence the price movement of an asset. The aforementioned CAPM is based on regression, and it's utilized to project the expected returns for stocks and to generate costs of capital. A stock’s returns are regressed against the returns of a broader index, such as the S&P 500, to generate a beta for the particular stock.

Beta is the stock’s risk in relation to the market or index and is reflected as the slope in the CAPM. The return for the stock in question would be the dependent variable Y, while the independent variable X would be the market risk premium.

Additional variables such as the market capitalization of a stock, valuation ratios, and recent returns can be added to the CAPM to get better estimates for returns. These additional factors are known as the Fama-French factors, named after the professors who developed the multiple linear regression model to better explain asset returns.

Why Is It Called Regression?

Although there is some debate about the origins of the name, the statistical technique described above most likely was termed “regression” by Sir Francis Galton in the 19th century to describe the statistical feature of biological data (such as heights of people in a population) to regress to some mean level. In other words, while there are shorter and taller people, only outliers are very tall or short, and most people cluster somewhere around (or “regress” to) the average.

What Is the Purpose of Regression?

In statistical analysis, regression is used to identify the associations between variables occurring in some data. It can show the magnitude of such an association and determine its statistical significance. Regression is a powerful tool for statistical inference and has been used to try to predict future outcomes based on past observations.

How Do You Interpret a Regression Model?

A regression model output may be in the form of Y = 1.0 + 3.2(X1) - 2.0(X2) + 0.21.

Here we have a multiple linear regression that relates some variable Y with two explanatory variables X1 and X2. We would interpret the model as the value of Y changing by 3.2 units for every one-unit change in X1 (if X1 goes up by 2, Y goes up by 6.4, etc.), holding all else constant. That means controlling for X2, X1 has this observed relationship. Likewise, holding X1 constant, every one-unit increase in X2 is associated with a 2.0-unit decrease in Y. We can also note the y-intercept of 1.0, meaning that Y = 1 when X1 and X2 are both zero. The error term (residual) is 0.21.
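A tiny Python sketch makes this interpretation concrete (the 0.21 residual is random error around the prediction, so the systematic part of the model omits it):

```python
# Systematic part of the example model Y = 1.0 + 3.2(X1) - 2.0(X2).
def y_hat(x1: float, x2: float) -> float:
    return 1.0 + 3.2 * x1 - 2.0 * x2

base = y_hat(5, 5)
print(y_hat(6, 5) - base)  # +3.2: one-unit rise in X1, X2 held constant
print(y_hat(5, 6) - base)  # -2.0: one-unit rise in X2, X1 held constant
print(y_hat(0, 0))         # 1.0: the y-intercept
```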

What Are the Assumptions That Must Hold for Regression Models?

To properly interpret the output of a regression model, the following main assumptions about the underlying data process of what you are analyzing must hold:

  • The relationship between variables is linear;
  • There must be homoskedasticity , or the variance of the variables and error term must remain constant;
  • All explanatory variables are independent of one another;
  • All variables are normally distributed .

Regression is a statistical method that tries to determine the strength and character of the relationship between one dependent variable and a series of other variables. It is used in finance, investing, and other disciplines.

Regression analysis uncovers the associations between variables observed in data, but cannot easily indicate causation.

Margo Bergman. “ Quantitative Analysis for Business: 12. Simple Linear Regression and Correlation .” University of Washington Pressbooks, 2022.

Margo Bergman. “ Quantitative Analysis for Business: 13. Multiple Linear Regression .” University of Washington Pressbooks, 2022.

Fama, Eugene F., and Kenneth R. French, via Wiley Online Library. “ The Cross-Section of Expected Stock Returns .” The Journal of Finance , vol. 47, no. 2, June 1992, pp. 427–465.

Stanton, Jeffrey M., via Taylor & Francis Online. “ Galton, Pearson, and the Peas: A Brief History of Linear Regression for Statistics Instructors .” Journal of Statistics Education , vol. 9, no. 3, 2001.

CFA Institute. “ Basics of Multiple Regression and Underlying Assumptions .”


Cardiopulm Phys Ther J. 2009 Sep; 20(3).

Regression Analysis for Prediction: Understanding the Process

Phillip B. Palmer, Hardin-Simmons University, Department of Physical Therapy, Abilene, TX

Dennis G. O'Connell, Hardin-Simmons University, Department of Physical Therapy, Abilene, TX

Research related to cardiorespiratory fitness often uses regression analysis in order to predict cardiorespiratory status or future outcomes. Reading these studies can be tedious and difficult unless the reader has a thorough understanding of the processes used in the analysis. This feature seeks to “simplify” the process of regression analysis for prediction in order to help readers understand this type of study more easily. Examples of the use of this statistical technique are provided in order to facilitate better understanding.

INTRODUCTION

Graded, maximal exercise tests that directly measure maximum oxygen consumption (VO 2 max) are impractical in most physical therapy clinics because they require expensive equipment and personnel trained to administer the tests. Performing these tests in the clinic may also require medical supervision; as a result researchers have sought to develop exercise and non-exercise models that would allow clinicians to predict VO 2 max without having to perform direct measurement of oxygen uptake. In most cases, the investigators utilize regression analysis to develop their prediction models.

Regression analysis is a statistical technique for determining the relationship between a single dependent (criterion) variable and one or more independent (predictor) variables. The analysis yields a predicted value for the criterion resulting from a linear combination of the predictors. According to Pedhazur, 15 regression analysis has 2 uses in scientific literature: prediction, including classification, and explanation. The following provides a brief review of the use of regression analysis for prediction. Specific emphasis is given to the selection of the predictor variables (assessing model efficiency and accuracy) and cross-validation (assessing model stability). The discussion is not intended to be exhaustive. For a more thorough explanation of regression analysis, the reader is encouraged to consult one of many books written about this statistical technique (eg, Fox; 5 Kleinbaum, Kupper, & Muller; 12 Pedhazur; 15 and Weisberg 16 ). Examples of the use of regression analysis for prediction are drawn from a study by Bradshaw et al. 3 In this study, the researchers' stated purpose was to develop an equation for prediction of cardiorespiratory fitness (CRF) based on non-exercise (N-EX) data.

SELECTING THE CRITERION (OUTCOME MEASURE)

The first step in regression analysis is to determine the criterion variable. Pedhazur 15 suggests that the criterion have acceptable measurement qualities (ie, reliability and validity). Bradshaw et al 3 used VO 2 max as the criterion of choice for their model and measured it using a maximum graded exercise test (GXT) developed by George. 6 George 6 indicated that his protocol for testing compared favorably with the Bruce protocol in terms of predictive ability and had good test-retest reliability ( ICC = .98 –.99). The American College of Sports Medicine indicates that measurement of VO 2 max is the “gold standard” for measuring cardiorespiratory fitness. 1 These facts support that the criterion selected by Bradshaw et al 3 was appropriate and meets the requirements for acceptable reliability and validity.

SELECTING THE PREDICTORS: MODEL EFFICIENCY

Once the criterion has been selected, predictor variables should be identified (model selection). The aim of model selection is to minimize the number of predictors which account for the maximum variance in the criterion. 15 In other words, the most efficient model maximizes the value of the coefficient of determination ( R 2 ). This coefficient estimates the amount of variance in the criterion score accounted for by a linear combination of the predictor variables. The higher the value is for R 2 , the less error or unexplained variance and, therefore, the better prediction. R 2 is dependent on the multiple correlation coefficient ( R ), which describes the relationship between the observed and predicted criterion scores. If there is no difference between the predicted and observed scores, R equals 1.00. This represents a perfect prediction with no error and no unexplained variance ( R 2 = 1.00). When R equals 0.00, there is no relationship between the predictor(s) and the criterion and no variance in scores has been explained ( R 2 = 0.00). The chosen variables cannot predict the criterion. The goal of model selection is, as stated previously, to develop a model that results in the highest estimated value for R 2 .

According to Pedhazur, 15 the value of R is often overestimated. The reasons for this are beyond the scope of this discussion; however, the degree of overestimation is affected by sample size. The larger the ratio is between the number of predictors and subjects, the larger the overestimation. To account for this, sample sizes should be large and there should be 15 to 30 subjects per predictor. 11 , 15 Of course, the most effective way to determine optimal sample size is through statistical power analysis. 11 , 15

Another method of determining the best model for prediction is to test the significance of adding one or more variables to the model using the partial F-test . This process, which is further discussed by Kleinbaum, Kupper, and Muller, 12 allows for exclusion of predictors that do not contribute significantly to the prediction, allowing determination of the most efficient model of prediction. In general, the partial F-test is similar to the F-test used in analysis of variance. It assesses the statistical significance of the difference between values for R 2 derived from 2 or more prediction models using a subset of the variables from the original equation. For example, Bradshaw et al 3 indicated that all variables contributed significantly to their prediction. Though the researchers do not detail the procedure used, it is highly likely that different models were tested, excluding one or more variables, and the resulting values for R 2 assessed for statistical difference.

Although the techniques discussed above are useful in determining the most efficient model for prediction, theory must be considered in choosing the appropriate variables. Previous research should be examined and predictors selected for which a relationship between the criterion and predictors has been established. 12 , 15

It is clear that Bradshaw et al 3 relied on theory and previous research to determine the variables to use in their prediction equation. The 5 variables they chose for inclusion–gender, age, body mass index (BMI), perceived functional ability (PFA), and physical activity rating (PAR)–had been shown in previous studies to contribute to the prediction of VO 2 max (eg, Heil et al; 8 George, Stone, & Burkett 7 ). These 5 predictors accounted for 87% ( R = .93, R 2 = .87 ) of the variance in the predicted values for VO 2 max. Based on a ratio of 1:20 (predictor:sample size), this estimate of R , and thus R 2 , is not likely to be overestimated. The researchers used changes in the value of R 2 to determine whether to include or exclude these or other variables. They reported that removal of perceived functional ability (PFA) as a variable resulted in a decrease in R from .93 to .89. Without this variable, the remaining 4 predictors would account for only 79% of the variance in VO 2 max. The investigators did note that each predictor variable contributed significantly ( p < .05 ) to the prediction of VO 2 max (see above discussion related to the partial F-test).

ASSESSING ACCURACY OF THE PREDICTION

Assessing accuracy of the model is best accomplished by analyzing the standard error of estimate ( SEE ) and the percentage that the SEE represents of the predicted mean ( SEE % ). The SEE represents the degree to which the predicted scores vary from the observed scores on the criterion measure, similar to the standard deviation used in other statistical procedures. According to Jackson, 10 lower values of the SEE indicate greater accuracy in prediction. Comparison of the SEE for different models using the same sample allows for determination of the most accurate model to use for prediction. SEE % is calculated by dividing the SEE by the mean of the criterion ( SEE /mean criterion) and can be used to compare different models derived from different samples.

Bradshaw et al 3 report a SEE of 3.44 mL·kg −1 ·min −1 (approximately 1 MET) using all 5 variables in the equation (gender, age, BMI, PFA, PA-R). When the PFA variable is removed from the model, leaving only 4 variables for the prediction (gender, age, BMI, PA-R), the SEE increases to 4.20 mL·kg −1 ·min −1 . The increase in the error term indicates that the model excluding PFA is less accurate in predicting VO 2 max. This is confirmed by the decrease in the value for R (see discussion above). The researchers compare their model of prediction with that of George, Stone, and Burkett, 7 indicating that their model is as accurate. It is not advisable to compare models based on the SEE if the data were collected from different samples as they were in these 2 studies. That type of comparison should be made using SEE %. Bradshaw and colleagues 3 report SEE % for their model (8.62%), but do not report values from other models in making comparisons.

Some advocate the use of statistics derived from the predicted residual sum of squares ( PRESS ) as a means of selecting predictors. 2 , 4 , 16 These statistics are used more often in cross-validation of models and will be discussed in greater detail later.

ASSESSING STABILITY OF THE MODEL FOR PREDICTION

Once the most efficient and accurate model for prediction has been determined, it is prudent that the model be assessed for stability. A model, or equation, is said to be “stable” if it can be applied to different samples from the same population without losing the accuracy of the prediction. This is accomplished through cross-validation of the model. Cross-validation determines how well the prediction model developed using one sample performs in another sample from the same population. Several methods can be employed for cross-validation, including the use of 2 independent samples, split samples, and PRESS -related statistics developed from the same sample.

Using 2 independent samples involves random selection of 2 groups from the same population. One group becomes the “training” or “exploratory” group used for establishing the model of prediction. 5 The second group, the “confirmatory” or “validatory” group is used to assess the model for stability. The researcher compares R 2 values from the 2 groups and assessment of “shrinkage,” the difference between the two values for R 2 , is used as an indicator of model stability. There is no rule of thumb for interpreting the differences, but Kleinbaum, Kupper, and Muller 12 suggest that “shrinkage” values of less than 0.10 indicate a stable model. While preferable, the use of independent samples is rarely used due to cost considerations.

A similar technique of cross-validation uses split samples. Once the sample has been selected from the population, it is randomly divided into 2 subgroups. One subgroup becomes the “exploratory” group and the other is used as the “validatory” group. Again, values for R 2 are compared and model stability is assessed by calculating “shrinkage.”

Holiday, Ballard, and McKeown 9 advocate the use of PRESS-related statistics for cross-validation of regression models as a means of dealing with the problems of data-splitting. The PRESS method is a jackknife analysis that is used to address the issue of estimate bias associated with the use of small sample sizes. 13 In general, a jackknife analysis calculates the desired test statistic multiple times with individual cases omitted from the calculations. In the case of the PRESS method, residuals, or the differences between the actual values of the criterion for each individual and the predicted value using the formula derived with the individual's data removed from the prediction, are calculated. The PRESS statistic is the sum of the squares of the residuals derived from these calculations and is similar to the sum of squares for the error (SS error ) used in analysis of variance (ANOVA). Myers 14 discusses the use of the PRESS statistic and describes in detail how it is calculated. The reader is referred to this text and the article by Holiday, Ballard, and McKeown 9 for additional information.

Once determined, the PRESS statistic can be used to calculate a modified form of R 2 and the SEE. R 2 PRESS is calculated using the following formula: R 2 PRESS = 1 - [PRESS / SS total], where SS total equals the sum of squares for the original regression equation. 14 The standard error of the estimate for PRESS (SEE PRESS) is calculated as SEE PRESS = √(PRESS / n), where n equals the number of individual cases. 14 The smaller the difference between the 2 values for R 2 and SEE, the more stable the model for prediction. Bradshaw et al 3 used this technique in their investigation. They reported a value for R 2 PRESS of .83, a decrease of .04 from R 2 for their prediction model. Using the standard set by Kleinbaum, Kupper, and Muller, 12 the model developed by these researchers would appear to have stability, meaning it could be used for prediction in samples from the same population. This is further supported by the small difference between the SEE and the SEE PRESS, 3.44 and 3.63 mL·kg −1 ·min −1 , respectively.
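A short Python sketch of these PRESS-based statistics on invented data, using the standard hat-matrix shortcut for leave-one-out residuals (equivalent to removing each case and refitting, but without the explicit loop); the SEE PRESS here follows the √(PRESS / n) form given above:

```python
import numpy as np

def press_statistics(x: np.ndarray, y: np.ndarray):
    """PRESS via the hat-matrix shortcut: the leave-one-out residual
    for case i is e_i / (1 - h_ii)."""
    n = len(y)
    Xc = np.column_stack([np.ones(n), x])
    H = Xc @ np.linalg.inv(Xc.T @ Xc) @ Xc.T        # hat matrix
    e = y - H @ y                                   # ordinary residuals
    press = np.sum((e / (1 - np.diag(H))) ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    r2_press = 1 - press / ss_total
    see_press = np.sqrt(press / n)
    return press, r2_press, see_press

rng = np.random.default_rng(3)
x = rng.normal(size=60)
y = 10 + 2 * x + rng.normal(scale=2, size=60)
print(press_statistics(x, y))
```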

COMPARING TWO DIFFERENT PREDICTION MODELS

A comparison of 2 different models for prediction may help to clarify the use of regression analysis in prediction. Table 1 presents data from 2 studies and will be used in the following discussion.

Comparison of Two Non-exercise Models for Predicting CRF

Variables                            Heil et al (n = 374)    Bradshaw et al (n = 100)
Intercept                            36.580                  48.073
Gender (male = 1, female = 0)        3.706                   6.178
Age (years)                          0.558                   -0.246
Age squared                          -7.81E-3
Percent body fat                     -0.541
Body mass index (kg·m −2 )                                   -0.619
Activity code (0-7)                  1.347
Physical activity rating (0-10)                              0.671
Perceived functional ability                                 0.712
R (R 2 )                             .88 (.77)               .93 (.87)
SEE (mL·kg −1 ·min −1 )              4.90                    3.44
SEE%                                 12.7%                   8.6%

As noted above, the first step is to select an appropriate criterion, or outcome measure. Bradshaw et al 3 selected VO 2 max as their criterion for measuring cardiorespiratory fitness. Heil et al 8 used VO 2 peak. These 2 measures are often considered to be the same; however, VO 2 peak assumes that conditions for measuring maximum oxygen consumption were not met. 17 It would be optimal to compare models based on the same criterion, but that is not essential, especially since both criteria measure cardiorespiratory fitness in much the same way.

The second step involves selection of variables for prediction. As can be seen in Table 1, both groups of investigators selected 5 variables to use in their model. The 5 variables selected by Bradshaw et al 3 provide a better prediction based on the values for R 2 (.87 and .77), indicating that their model accounts for more variance (87% versus 77%) in the prediction than the model of Heil et al. 8 It should also be noted that the SEE calculated in the Bradshaw 3 model (3.44 mL·kg −1 ·min −1 ) is less than that reported by Heil et al 8 (4.90 mL·kg −1 ·min −1 ). Remember, however, that comparison of the SEE should only be made when both models are developed using samples from the same population. Comparing predictions developed from different populations can be accomplished using the SEE%. Review of values for the SEE% in Table 1 would seem to indicate that the model developed by Bradshaw et al 3 is more accurate because the percentage of the mean value for VO 2 max represented by error is less than that reported by Heil et al. 8 In summary, the Bradshaw 3 model would appear to be more efficient, accounting for more variance in the prediction using the same number of variables. It would also appear to be more accurate based on comparison of the SEE%.

The 2 models cannot be compared based on stability of the models. Each set of researchers used different methods for cross-validation. Both models, however, appear to be relatively stable based on the data presented. A clinician can assume that either model would perform fairly well when applied to samples from the same populations as those used by the investigators.

The purpose of this brief review has been to demystify regression analysis for prediction by explaining it in simple terms and to demonstrate its use. When reviewing research articles in which regression analysis has been used for prediction, physical therapists should ensure that: (1) the criterion chosen for the study is appropriate and meets the standards for reliability and validity, (2) the processes used by the investigators to assess both model efficiency and accuracy are appropriate, (3) the predictors selected for use in the model are reasonable based on theory or previous research, and (4) the investigators assessed model stability through a process of cross-validation, providing the opportunity for others to utilize the prediction model in different samples drawn from the same population.


What Is Regression Analysis in Business Analytics?


Countless factors impact every facet of business. How can you consider those factors and know their true impact?

Imagine you seek to understand the factors that influence people’s decision to buy your company’s product. They range from customers’ physical locations to satisfaction levels among sales representatives to your competitors' Black Friday sales.

Understanding the relationships between each factor and product sales can enable you to pinpoint areas for improvement, helping you drive more sales.

To learn how each factor influences sales, you need to use a statistical analysis method called regression analysis .

If you aren’t a business or data analyst, you may not run regressions yourself, but knowing how analysis works can provide important insight into which factors impact product sales and, thus, which are worth improving.


Foundational Concepts for Regression Analysis

Before diving into regression analysis, you need to build foundational knowledge of statistical concepts and relationships.

Independent and Dependent Variables

Start with the basics. What relationship are you aiming to explore? Try formatting your answer like this: “I want to understand the impact of [the independent variable] on [the dependent variable].”

The independent variable is the factor that could impact the dependent variable. For example, “I want to understand the impact of employee satisfaction on product sales.”

In this case, employee satisfaction is the independent variable, and product sales is the dependent variable. Identifying the dependent and independent variables is the first step toward regression analysis.

Correlation vs. Causation

One of the cardinal rules of statistically exploring relationships is to never assume correlation implies causation. In other words, just because two variables move in the same direction doesn’t mean one caused the other to occur.

If two or more variables are correlated, their directional movements are related. If two variables are positively correlated, it means that as one goes up or down, so does the other. Alternatively, if two variables are negatively correlated, one goes up while the other goes down.

A correlation’s strength can be quantified by calculating the correlation coefficient, sometimes represented by r. The correlation coefficient falls between negative one and positive one.

r = -1 indicates a perfect negative correlation.

r = 1 indicates a perfect positive correlation.

r = 0 indicates no correlation.
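As a rough illustration, here is a minimal Python sketch of how r can be computed; the numbers are made up for the example and are not data from any study:

```python
import numpy as np

# Hypothetical paired observations (illustrative values only)
ad_spend = np.array([10, 12, 15, 17, 20, 23])
sales = np.array([40, 44, 52, 55, 63, 70])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is Pearson's r for the two variables
r = np.corrcoef(ad_spend, sales)[0, 1]
print(f"r = {r:.3f}")  # a value near +1 indicates a strong positive correlation
```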

Causation means that one variable caused the other to occur. Proving a causal relationship between variables requires a true experiment with a control group (which doesn’t receive the independent variable) and an experimental group (which receives the independent variable).

While regression analysis provides insights into relationships between variables, it doesn’t prove causation. It can be tempting to assume that one variable caused the other—especially if you want it to be true—which is why you need to keep this in mind any time you run regressions or analyze relationships between variables.

With the basics under your belt, here’s a deeper explanation of regression analysis so you can leverage it to drive strategic planning and decision-making.

Related: How to Learn Business Analytics without a Business Background

What Is Regression Analysis?

Regression analysis is the statistical method used to determine the structure of a relationship between two variables (single linear regression) or three or more variables (multiple regression).

According to the Harvard Business School Online course Business Analytics, regression is used for two primary purposes:

  • To study the magnitude and structure of the relationship between variables
  • To forecast a variable based on its relationship with another variable

Both of these insights can inform strategic business decisions.

“Regression allows us to gain insights into the structure of that relationship and provides measures of how well the data fit that relationship,” says HBS Professor Jan Hammond, who teaches Business Analytics, one of three courses that comprise the Credential of Readiness (CORe) program. “Such insights can prove extremely valuable for analyzing historical trends and developing forecasts.”

One way to think of regression is by visualizing a scatter plot of your data with the independent variable on the X-axis and the dependent variable on the Y-axis. The regression line is the line that best fits the scatter plot data. The regression equation represents the line’s slope and the relationship between the two variables, along with an estimation of error.

Physically creating this scatter plot can be a natural starting point for parsing out the relationships between variables.


Types of Regression Analysis

There are two types of regression analysis: single variable linear regression and multiple regression.

Single variable linear regression is used to determine the relationship between two variables: the independent and dependent. The equation for a single variable linear regression looks like this:

ŷ = α + βx + ε

In the equation:

  • ŷ is the expected value of Y (the dependent variable) for a given value of X (the independent variable).
  • x is the independent variable.
  • α is the Y-intercept, the point at which the regression line intersects with the vertical axis.
  • β is the slope of the regression line, or the average change in the dependent variable as the independent variable increases by one.
  • ε is the error term, equal to Y – ŷ, or the difference between the actual value of the dependent variable and its expected value.
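For readers who prefer code to notation, here is a minimal sketch of fitting this equation in Python with statsmodels; the data are hypothetical, and statsmodels is only one of many tools that can do this:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: employee satisfaction (x) and product sales (y)
x = np.array([3.1, 3.5, 4.0, 4.2, 4.8, 5.0])
y = np.array([120.0, 135.0, 160.0, 155.0, 190.0, 198.0])

# add_constant adds the intercept term (alpha); OLS then estimates
# alpha and beta by ordinary least squares
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

print(model.params)  # [alpha, beta]
print(model.resid)   # the error terms: actual y minus fitted y-hat
```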

Multiple regression, on the other hand, is used to determine the relationship between three or more variables: the dependent variable and at least two independent variables. The multiple regression equation looks complex but is similar to the single variable linear regression equation:

ŷ = α + β₁x₁ + β₂x₂ + … + βₖxₖ + ε

Each component of this equation represents the same thing as in the previous equation, with the addition of the subscript k, which is the total number of independent variables being examined. For each independent variable you include in the regression, multiply the slope of the regression line by the value of the independent variable, and add it to the rest of the equation.
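A comparable sketch for multiple regression, again with hypothetical data, simply lists more independent variables; the model estimates one slope per predictor:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data set with two independent variables
df = pd.DataFrame({
    "sales":        [120, 135, 160, 155, 190, 198, 175, 182],
    "satisfaction": [3.1, 3.5, 4.0, 4.2, 4.8, 5.0, 4.4, 4.6],
    "ad_spend":     [10, 12, 15, 14, 20, 22, 18, 19],
})

# The formula names the dependent variable, then each predictor;
# the fit returns an intercept plus one coefficient per predictor
model = smf.ols("sales ~ satisfaction + ad_spend", data=df).fit()
print(model.params)
```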

How to Run Regressions

You can use a host of statistical programs—such as Microsoft Excel, SPSS, and STATA—to run both single variable linear and multiple regressions. If you’re interested in hands-on practice with this skill, Business Analytics teaches learners how to create scatter plots and run regressions in Microsoft Excel, as well as make sense of the output and use it to drive business decisions.

Calculating Confidence and Accounting for Error

It’s important to note: This overview of regression analysis is introductory and doesn’t delve into calculations of confidence level, significance, variance, and error. When working in a statistical program, these calculations may be provided or require that you implement a function. When conducting regression analysis, these metrics are important for gauging how significant your results are and how much importance to place on them.


Why Use Regression Analysis?

Once you’ve generated a regression equation for a set of variables, you effectively have a roadmap for the relationship between your independent and dependent variables. If you input a specific X value into the equation, you can see the expected Y value.

This can be critical for predicting the outcome of potential changes, allowing you to ask, “What would happen if this factor changed by a specific amount?”

Returning to the earlier example, running a regression analysis could allow you to find the equation representing the relationship between employee satisfaction and product sales. You could input a higher level of employee satisfaction and see how sales might change accordingly. This information could lead to improved working conditions for employees, backed by data that shows the tie between high employee satisfaction and sales.
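As a hedged illustration of that "what if" step, the sketch below fits a hypothetical satisfaction-sales model and then asks for the expected sales at a new satisfaction level; all numbers are invented for the example:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical historical data
df = pd.DataFrame({
    "sales":        [120, 135, 160, 155, 190, 198],
    "satisfaction": [3.1, 3.5, 4.0, 4.2, 4.8, 5.0],
})
model = smf.ols("sales ~ satisfaction", data=df).fit()

# What would sales look like if satisfaction rose to 4.9?
print(model.predict(pd.DataFrame({"satisfaction": [4.9]})))
```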

Whether predicting future outcomes, determining areas for improvement, or identifying relationships between seemingly unconnected variables, understanding regression analysis can enable you to craft data-driven strategies and determine the best course of action with all factors in mind.

Do you want to become a data-driven professional? Explore our eight-week Business Analytics course and our three-course Credential of Readiness (CORe) program to deepen your analytical skills and apply them to real-world business problems.



Regression Analysis

What Is a Regression Analysis?

Definition and Basic Concepts

Regression analysis helps you understand relationships between variables. This method predicts the value of one variable based on another. You use regression to explore how changes in one factor affect another.

Understanding Dependent and Independent Variables

Dependent variables represent outcomes you want to predict. Independent variables are factors you believe influence these outcomes. You analyze how independent variables impact dependent ones through regression.

Types of Regression Analysis

Several types exist. Linear regression finds a straight-line relationship. Logistic regression handles binary outcomes. Polynomial regression explores non-linear relationships. Each type serves different purposes in analysis.
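A minimal sketch of two of those variants in Python with scikit-learn; the data are synthetic and the parameter choices are illustrative, not recommendations:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Logistic regression for a binary outcome (e.g., churned: 1 or 0)
y_binary = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y_binary)
print(clf.predict_proba([[3.5]]))  # class probabilities near the boundary

# Polynomial regression: expand the feature, then fit a linear model
y_curved = np.array([1.2, 3.9, 9.3, 15.8, 25.1, 36.2])  # roughly quadratic
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
curve = LinearRegression().fit(X_poly, y_curved)
print(curve.coef_)
```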

Historical Background

Regression analysis has a rich history. Understanding its origins helps appreciate its evolution.

Origin and Evolution

The 19th century saw the birth of regression. Mathematicians like Legendre and Gauss developed the method of least squares. This technique laid the groundwork for modern regression analysis.

Key Contributors to the Field

Francis Galton coined the term "regression." His work on biological data led to significant insights. Many researchers have since expanded on these concepts for regression analysis.

Practical Applications of Regression Analysis

Regression analysis offers valuable insights across various fields. You can use this method to make informed decisions in business and science. Understanding these applications helps you apply business analytics effectively.

Use in Business and Economics

Regression plays a crucial role in business. Companies use it for forecasting and understanding market trends.

Forecasting and Trend Analysis

Businesses rely on regression to predict future sales and revenue. For example, analyzing a salesperson's age, education, and experience helps predict their total yearly sales. This analysis guides strategic planning and resource allocation. Accurate forecasts improve decision-making and profitability.

Market Research and Consumer Behavior

Regression analysis uncovers consumer behavior patterns. Businesses analyze factors like price, promotion, and product features. This insight helps tailor marketing strategies. Companies enhance customer satisfaction by understanding preferences. Business analytics becomes a powerful tool for growth.

Applications in Science and Engineering

Regression is essential in scientific research and engineering projects. Researchers use it to model complex systems and predict outcomes.

Predictive Modeling

Scientists apply regression to create predictive models. These models estimate future events based on historical data. Engineers use them to optimize processes and improve efficiency. Predictive modeling supports innovation and problem-solving.

Experimental Data Analysis

Regression analysis evaluates experimental data. Researchers identify relationships between variables. This process validates hypotheses and enhances scientific understanding. Accurate data interpretation leads to breakthroughs in various fields.

Regression analysis serves as a cornerstone of business analytics and scientific inquiry. By mastering these applications, you gain a solid analytical foundation. Embrace these tools to excel in your field.

Tools for Implementing Regression Analysis

Software and Programming Languages

IBM SPSS Statistics provides powerful tools for regression analysis. You can use this software to perform simple linear regression and multiple linear regression, and it offers advanced statistical analysis capabilities.

R is a popular programming language for regression analysis. R allows you to handle complex data sets with ease. Many statisticians prefer R for its flexibility in modeling.

Python is another versatile language for regression. Python includes libraries such as scikit-learn for implementing linear regression. Python's simplicity makes it accessible for beginners.

The SPSS Statistics Grad Pack offers a user-friendly interface that helps students and educators perform regression analysis efficiently and supports educational needs.

SAS provides robust solutions for regression. SAS excels at handling large data sets. Businesses often rely on SAS for its reliability in analysis.

Importance and Assumptions of Regression Analysis

Regression analysis holds a significant place in statistical modeling. This method helps in predicting values by examining relationships between variables. Understanding the assumptions of regression analysis is crucial for accurate results.

Key Assumptions

Linearity and Independence

Regression analysis assumes a linear relationship between the independent and dependent variables. The model predicts outcomes based on this linear association. Independence of observations is another key assumption. Each data point should not influence another. Violation of these assumptions can lead to inaccurate predictions.

Homoscedasticity and Normality

Homoscedasticity refers to the constant variance of errors across all levels of the independent variable. This assumption ensures that the spread of residuals remains consistent. Normality of error terms is also essential. Errors should follow a normal distribution for valid regression results. Meeting these assumptions strengthens the reliability of the analysis.
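One way to check these two assumptions in practice is to test the residuals of a fitted model. The sketch below uses simulated data; the Shapiro-Wilk and Breusch-Pagan tests are common choices, not the only ones:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan

# Simulated data for a hypothetical fit
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)
model = sm.OLS(y, sm.add_constant(x)).fit()

# Normality of errors: Shapiro-Wilk test on the residuals
print(stats.shapiro(model.resid))

# Homoscedasticity: Breusch-Pagan test (a small p-value
# suggests the error variance is not constant)
print(het_breuschpagan(model.resid, model.model.exog))
```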

Importance in Data Analysis

Decision-Making and Strategy

Regression analysis aids in decision-making by providing insights into data patterns. Businesses use regression to forecast future events and trends. This predictive capability supports strategic planning. Decision-makers rely on regression to allocate resources effectively.

Identifying Relationships and Patterns

Regression helps identify correlations between variables. This analysis uncovers patterns that may not be immediately apparent. Understanding these relationships enables better decisions. Analysts use regression to explore correlation while staying cautious about claims of causation. This exploration leads to informed conclusions and actions.

Regression analysis serves as a valuable tool in various fields. Mastering its assumptions and applications enhances data-driven decision-making. Embracing regression techniques empowers you to make informed choices.

Steps to Verify Data Assumptions

Data Collection and Preparation

Data collection forms the foundation of any regression analysis. You must ensure that the data accurately represents the population. This step is crucial for both single variable linear regression and multiple regression models.

Cleaning and Organizing Data

Cleaning involves removing errors and inconsistencies. You should check for missing values and correct them. Organizing data helps in understanding the relationships between variables. Proper organization aids in constructing a reliable regression equation. Excel can be a useful tool for organizing and visualizing data.

Checking for Outliers and Anomalies

Outliers can skew results and lead to inaccurate predictions. Identifying outliers requires statistical tests. Removing these anomalies ensures a more accurate regression line, and a clean dataset provides reliable insights.
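A minimal sketch of one common screening rule, the 1.5 x IQR rule, on hypothetical values; flagged points deserve inspection, not automatic deletion:

```python
import numpy as np

# Hypothetical sample with one suspicious value
values = np.array([12, 14, 13, 15, 14, 13, 12, 48])

# IQR rule: flag points far outside the middle 50% of the data
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(values[(values < lower) | (values > upper)])  # flags 48 for review
```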

Statistical Tests and Validation

Statistical tests validate the assumptions of regression analysis. These tests confirm the linear relationship between variables. Validation ensures that the regression model accurately predicts outcomes.

Residual Analysis

Residual analysis checks the difference between observed and predicted values. This analysis helps identify patterns not captured by the regression equation. A consistent pattern indicates a problem with the model. Residuals should show no clear pattern if the model fits well.
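A residual-versus-fitted plot is the usual way to look for such patterns. Here is a minimal sketch on simulated data; with real data you would plot your own model's residuals:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Simulated data for a hypothetical fit
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 80)
y = 3.0 + 1.2 * x + rng.normal(0, 2, 80)
model = sm.OLS(y, sm.add_constant(x)).fit()

# A well-fitting model shows an unstructured cloud around zero;
# curvature or a funnel shape signals a problem
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="gray")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```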

Goodness-of-Fit Tests

Goodness-of-fit tests evaluate how well the regression line fits the data. These tests measure the accuracy of predictions. A high goodness-of-fit indicates a strong relationship between variables. This test is essential for both linear regression and multiple regression models.
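R-squared is the most common goodness-of-fit measure for linear models. A minimal sketch on simulated data:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data for a hypothetical fit
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 60)
y = 5.0 + 2.0 * x + rng.normal(0, 1.5, 60)
model = sm.OLS(y, sm.add_constant(x)).fit()

print(model.rsquared)      # share of variance in y explained by the model
print(model.rsquared_adj)  # penalized for the number of predictors
```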

Regression analysis requires careful verification of data assumptions. Ensuring data accuracy leads to more reliable models. By following these steps, you enhance the reliability of your regression results. Harvard Business School Online offers resources to learn more about these techniques.

Examples of Successful Applications

Case Studies in Various Industries

Healthcare and Medicine

Healthcare professionals use regression analysis to improve patient outcomes. Linear models help predict disease progression. Researchers analyze factors like age, lifestyle, and medical history. This approach aids in developing personalized treatment plans. Hospitals utilize regression to allocate resources efficiently.

Finance and Investment

Financial analysts apply regression to assess market trends. Linear regression models forecast stock prices based on historical data. Investors rely on these predictions for decision-making. Risk management in insurance benefits from regression analysis. Companies estimate claims costs and adjust strategies accordingly.

Lessons Learned and Best Practices

Challenges and Solutions

Implementing regression analysis presents challenges. Data quality affects the accuracy of linear models. Businesses must ensure clean and organized datasets. Identifying outliers is crucial for reliable predictions. Analysts use statistical tests to validate assumptions. Continuous learning enhances the effectiveness of regression techniques.

Future Trends and Innovations

Future trends in regression analysis focus on automation. Machine learning integrates with linear models for enhanced predictions. Businesses explore real-time data analysis for immediate insights. Innovations drive more accurate and efficient forecasting. Embracing these advancements strengthens competitive advantage.

Related Solutions and Resources for Further Learning

Advanced Topics in Regression Analysis

Multivariate Regression

Multivariate regression allows you to analyze multiple independent variables simultaneously. This method helps you understand complex relationships in data. Businesses often use multivariate regression to explore how different factors impact sales. Researchers find this approach valuable for examining interactions between variables.

Nonlinear Regression

Nonlinear regression deals with data that do not fit a straight line. This technique models curves and more complex patterns. Scientists use nonlinear regression to study growth rates and biological processes. Engineers apply it to optimize systems with non-linear behaviors. Understanding nonlinear regression expands your ability to handle diverse datasets.

Regression analysis offers valuable insights into relationships between variables. Understanding this technique enhances decision-making in various fields. Linear regression provides a clear mathematical formula for predictions. You can apply these insights to business and academic studies. Mastering linear regression helps you identify patterns and trends. This knowledge empowers you to make informed choices. Explore further learning opportunities to deepen your understanding. Use regression analysis to improve outcomes and strategies.


Statistics By Jim

Making statistics intuitive

When Should I Use Regression Analysis?

By Jim Frost

Use regression analysis to describe the relationships between a set of independent variables and the dependent variable. Regression analysis produces a regression equation where the coefficients represent the relationship between each independent variable and the dependent variable. You can also use the equation to make predictions.

As a statistician, I should probably tell you that I love all statistical analyses equally—like parents with their kids. But, shhh, I have a secret! Regression analysis is my favorite because it provides tremendous flexibility, which makes it useful in so many different circumstances. In fact, I’ve described regression analysis as taking correlation to the next level!

In this blog post, I explain the capabilities of regression analysis, the types of relationships it can assess, how it controls the variables, and generally why I love it! You’ll learn when you should consider using regression analysis.

Related post: What are Independent and Dependent Variables?

Use Regression to Analyze a Wide Variety of Relationships

Regression analysis can model a wide variety of relationships. For example, you can:

  • Model multiple independent variables
  • Include continuous and categorical variables
  • Use polynomial terms to model curvature
  • Assess interaction terms to determine whether the effect of one independent variable depends on the value of another variable

These capabilities are all cool, but they don’t include an almost magical ability. Regression analysis can unscramble very intricate problems where the variables are entangled like spaghetti. For example, imagine you’re a researcher studying any of the following:

  • Do socio-economic status and race affect educational achievement?
  • Do education and IQ affect earnings?
  • Do exercise habits and diet affect weight?
  • Are drinking coffee and smoking cigarettes related to mortality risk?
  • Does a particular exercise intervention have an impact on bone density that is a distinct effect from other physical activities?

More on the last two examples later!

All these research questions have entwined independent variables that can influence the dependent variables. How do you untangle a web of related variables? Which variables are statistically significant and what role does each one play? Regression comes to the rescue because you can use it for all of these scenarios!

Use Regression Analysis to Control the Independent Variables

As I mentioned, regression analysis describes how the changes in each independent variable are related to changes in the dependent variable. Crucially, regression also statistically controls every variable in your model.

What does controlling for a variable mean?

When you perform regression analysis, you need to isolate the role of each variable. For example, I participated in an exercise intervention study where our goal was to determine whether the intervention increased the subjects’ bone mineral density. We needed to isolate the role of the exercise intervention from everything else that can impact bone mineral density, which ranges from diet to other physical activity.

To accomplish this goal, you must minimize the effect of confounding variables. Regression analysis does this by estimating the effect that changing one independent variable has on the dependent variable while holding all the other independent variables constant. This process allows you to learn the role of each independent variable without worrying about the other variables in the model. Again, you want to isolate the effect of each variable.

Regression models help you prevent spurious correlations from confusing your results by controlling for confounders.

How do you control the other variables in regression?

A beautiful aspect of regression analysis is that you hold the other independent variables constant by merely including them in your model! Let’s look at this in action with an example.

A recent study analyzed the effect of coffee consumption on mortality. The first results indicated that higher coffee intake is related to a higher risk of death. However, coffee drinkers frequently smoke, and the researchers did not include smoking in their initial model. After they included smoking in the model, the regression results indicated that coffee intake lowers the risk of mortality while smoking increases it. This model isolates the role of each variable while holding the other variable constant. You can assess the effect of coffee intake while controlling for smoking. Conveniently, you’re also controlling for coffee intake when looking at the effect of smoking.

Note that the study also illustrates how excluding a relevant variable can produce misleading results. Omitting an important variable causes it to be uncontrolled, and it can bias the results for the variables that you do include in the model. This warning is particularly applicable for observational studies where the effects of omitted variables might be unbalanced. On the other hand, the randomization process in a true experiment tends to distribute the effects of these variables equally, which lessens omitted variable bias.

Related post: Confounding Variables and Omitted Variable Bias

How to Interpret Regression Output

To answer questions using regression analysis, you first need to fit and verify that you have a good model. Then, you look through the regression coefficients and p-values. When you have a low p-value (typically < 0.05), the independent variable is statistically significant. The coefficients represent the average change in the dependent variable given a one-unit change in the independent variable (IV) while controlling the other IVs.

For instance, if your dependent variable is income and your IVs include IQ and education (among other relevant variables), the output will report a coefficient and a p-value for each IV.

The low p-values indicate that both education and IQ are statistically significant. The coefficient for IQ indicates that each additional IQ point increases your income by an average of approximately $4.80 while controlling everything else in the model. Furthermore, an additional unit of education increases average earnings by $24.22 while holding the other variables constant.

Regression analysis is a form of inferential statistics. The p-values help determine whether the relationships that you observe in your sample also exist in the larger population. I’ve written an entire blog post about how to interpret regression coefficients and their p-values, which I highly recommend.
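To see output of this kind for yourself, a minimal sketch along these lines can be run on simulated data; the coefficients it produces will not match the $4.80 and $24.22 figures above, which come from the post's own example:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data (not the data behind the example above)
rng = np.random.default_rng(3)
n = 200
df = pd.DataFrame({
    "iq": rng.normal(100, 15, n),
    "education": rng.normal(14, 2, n),
})
df["income"] = 200 + 5 * df["iq"] + 25 * df["education"] + rng.normal(0, 50, n)

model = smf.ols("income ~ iq + education", data=df).fit()
print(model.summary())  # coefficients, p-values, R-squared, and more
```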

Obtaining Trustworthy Regression Results

With the vast power of using regression comes great responsibility. Sorry, but that’s the way it must be. To obtain regression results that you can trust, you need to do the following:

  • Specify the correct model. As we saw, if you fail to include all the important variables in your model, the results can be biased.
  • Check your residual plots. Be sure that your model fits the data adequately.
  • Watch for multicollinearity. Correlation between the independent variables is called multicollinearity; some multicollinearity is OK, but excessive multicollinearity can be a problem. A quick check appears in the sketch after this list.
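A minimal sketch of that multicollinearity check using variance inflation factors (VIFs); the data are simulated, and the common rule of thumb that VIFs above roughly 5-10 deserve attention is a heuristic, not a hard cutoff:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated predictors: x2 is deliberately correlated with x1
rng = np.random.default_rng(4)
df = pd.DataFrame({"x1": rng.normal(size=100)})
df["x2"] = 0.9 * df["x1"] + rng.normal(scale=0.3, size=100)
df["x3"] = rng.normal(size=100)

# One VIF per column of the design matrix (the constant's VIF
# is not meaningful and can be ignored)
X = sm.add_constant(df)
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))
```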

Using regression analysis gives you the ability to separate the effects of complicated research questions. You can disentangle the spaghetti noodles by modeling and controlling all relevant variables, and then assess the role that each one plays.

There are many different regression analysis procedures. Read my post to determine which type of regression is correct for your data.

If you’re learning regression and like the approach I use in my blog, check out my eBook!


Reader Interactions


July 12, 2023 at 1:42 pm

Jim, I am trying to predict a categorical variable (college major category, where there are 5 different categories).

I have 3 different continuous variables (CAREER INTERESTS, which has 6 different subscales), PERSONALITY (which is the Big Five) and MORAL PREFERENCES (which uses the MFQ30 questionnaire, that has 5 subscales).

I am confused about what type of regression (hierarchical, etc.) I could use in this study. What are your thoughts?


July 17, 2023 at 12:18 am

Because your dependent variable is a categorical variable consider using Nominal Logistic Regression, also known as Multinomial Logistic Regression or Polytomous Logistic Regression. These terms are used interchangeably to describe a statistical method used for predicting the outcome of a categorical dependent variable based on one or more predictor variables.
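For what it's worth, a minimal sketch of such a model in Python with scikit-learn, using random stand-in data for the three predictor sets and the five major categories:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random stand-ins: 3 continuous predictors, 5 major categories (0-4)
rng = np.random.default_rng(5)
X = rng.normal(size=(150, 3))     # e.g., interest, personality, moral scores
y = rng.integers(0, 5, size=150)  # college major category

# With a multiclass target, scikit-learn fits a multinomial model
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:2]))  # one probability per major category
```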


January 9, 2023 at 12:03 am

First of all, many thanks for this fantastic website that makes statistics seem a little bit simpler and clearer. It's a fantastic resource. I have a dataset from an experiment. It has a dependent variable, choice reaction time (CRT), and an independent variable, visual task. (This visual task includes two types of tasks: cognitively involved questions and minimized-cognition questions. These questions come in three types, with 2, 4, or 8 options/choices (1, 2, or 3 bits): questions with only two options to choose one answer, 4-option questions, and 8-option questions.) First I used linear regression in SPSS to check the fit of the model (Hick's law). But unfortunately the R-squared value was very, very low. Now my professor is pushing me to build a new model using that dataset. Please suggest some steps and hints so I can start working on it.


December 14, 2022 at 3:59 am

Following are my research objectives: a. To identify youth's competencies in entrepreneurship in the area. b. To identify the factors of youth involvement in agricultural entrepreneurship in the area.

I have used opinion-based questions designed as 5-point Likert scale items, except for the demographic questions at the beginning of my survey. The questionnaire contains simple opinion-based questions; there are no dependent and independent items in the questionnaire. My question is: which analysis is suitable for my research? Regression analysis, descriptive analysis, or both?

December 14, 2022 at 5:57 pm

The question of whether there is a dependent variable and one or more independent variables is separate from the question of whether you need to use inferential or descriptive statistics. And regression analysis can be either a descriptive or an inferential procedure, although it is almost always an inferential procedure. Let’s go through these issues.

If you just want to describe a sample and you’re not generalizing from the sample to a population, you’re performing descriptive statistics. In this case, you don’t need to use hypothesis testing and confidence intervals.

However, if you have a representative sample and you want to infer the properties of an entire population, then you need to perform hypothesis testing and look at confidence intervals. Read my post about the Difference between Descriptive and Inferential Statistics for more information.

Regression analysis can apply to either of these cases. You perform the same analysis but if you’re only describing the sample, you can ignore the p-values and confidence intervals. Instead, you’ll focus on using the coefficients to describe the relationships between the variables within the sample. There’s less to worry about but you only know what is happening within that sample and can’t apply the results to a larger population. Conversely, if you do want to generalize to a population, then you must consider the p-values and confidence intervals and determine whether the coefficients are statistically significant. Most analysts performing regression analysis do want to generalize to the population, making it an inferential procedure.

However, regression analysis does specify independent and dependent variables. If you don’t need to specify those types of variables, then just use a correlation. Likert data is ordinal data. And for that data type, you need to use Spearman’s correlation. And, like regression analysis, correlation can be either a descriptive or inferential procedure. You either pay attention to the p-values (inferential) or not (descriptive). In both cases, you are interested in the correlation coefficients. You’ll see the relationships between the variables without needing to specify independent and dependent variables. You could calculate medians or modes for each item but not the mean, because that’s not appropriate for ordinal data.
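A minimal sketch of Spearman's correlation on hypothetical Likert responses (SciPy also reports the p-value, so the same call serves the inferential case):

```python
from scipy.stats import spearmanr

# Hypothetical responses to two 5-point Likert items
item_a = [1, 2, 2, 3, 4, 4, 5, 5]
item_b = [2, 1, 3, 3, 4, 5, 4, 5]

rho, p = spearmanr(item_a, item_b)
print(rho, p)  # rank-based correlation, suited to ordinal data
```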

I hope that helps!


December 12, 2022 at 9:18 am

Hi Jim, supposing I'm interested in establishing an explanatory relationship between two variables, profits and average age of employees, using regression analysis, and I have access to data from the entire population of interest, e.g. all 30 firms in a particular industry, do I still need to perform statistical inference? What would be the meaning of p-values, F-tests, etc., given that I am not intending to generalize the results to firms outside the industry? Do I still need to perform power analysis given that I have access to the entire population of 30 firms? Is a population of 30 firms too small for reliable statistical deductions? Thanks in advance Jim.

December 13, 2022 at 5:11 pm

Hi Patrick,

If you are truly interested in only those 30 companies and have access to data for all their employees, then you don’t need to perform inferential statistics. You’ve got the entire population. Hence, you know the population parameters. Hypothesis tests account for sampling error. But when you measure the entire population, there is zero sampling error and, hence, zero need to perform a hypothesis test.

However, if your average ages are based on only a sample of the employees in the 30 firms, then you’re still working with samples. To generalize from the sample to the population of all employees at the 30 firms, you’d need to use hypothesis testing in that case.

So, you just need to determine whether you really have access to the data for the entire population.


December 8, 2022 at 1:52 am

Hi, following are my research objectives: a. To investigate the effectiveness of asynchronous and synchronous modes of online education. b. To identify challenges that both teachers and students encounter in synchronous and asynchronous modes of online education. I have used Pearson correlation to find the relationship of the effectiveness of the synchronous mode with the asynchronous mode and the challenges of the online mode, and vice versa. I have used opinion-based questions designed as 5-point Likert scale items. The questionnaire contains simple opinion-based questions; there are no dependent and independent items in the questionnaire. My question is whether correlation is sufficient or whether I have to run other tests to prove my hypothesis.

December 10, 2022 at 8:28 pm

Because you have Likert scale data, you should use Spearman’s correlation because that is more appropriate for ordinal data.

Another possibility would be to use a nonparametric test and evaluate the median difference between the asynchronous and synchronous modes of education for each item.


November 21, 2022 at 3:45 am

A scientist determined the intensity of solar radiation and temperature of plantains every hour throughout the day. He used correlation to describe the association between the two variables. A friend said he would get more information using regression. What are your views?

November 22, 2022 at 4:15 pm

Yes, I’d agree that regression provides more information than correlation. But it’s also important to understand how correlation and regression present effect sizes differently, because in some cases you might want to use correlation even though it provides less information.

Correlation gives you a standardized effect size (i.e., the correlation coefficient). Standardized effect sizes don’t provide information using the natural units of the data. In other words, you can’t relate a correlation coefficient to what’s going on with the natural data units. However, it does allow you to compare correlations between dissimilar variables.

Conversely, regression gives you unstandardized effect sizes in the coefficients. They tell you exactly what’s going on between an independent variable and the dependent variable using the DV’s natural data units. But it’s harder to compare results between regression models with dissimilar DV units. Regression does have its own standardized measure of the overall strength of the model, the R-squared, but not one for the individual variables. Additionally, in regression, you can standardize the regression coefficients, which facilitates comparisons within a regression model but not between them.

In some cases, while correlation gives you less information, you might want to use it to facilitate comparisons between studies.

Regression allows you to predict the mean outcome. It also gives you the tools to understand the amount of error between the predicted and observed values. Additionally, you can model a variety of different types of relationships (curvature and interactions). Correlation doesn’t provide those.

So, yes, in general, regression provides more information, but it also provides a different take on the nature of the relationships.


February 1, 2022 at 6:39 am

First, congrats and many thanks on this wonderful website, which makes statistics look a bit easier and more understandable. It's a great resource, both for students and professionals. Thanks again.

A request for a bit of help, if you'd be kind enough to comment. I'm doing some research on the pharmaceutical industry, regulations, and their effects. I am looking at a) probable effects (if any) of drug price increases on other consumption categories (like food and travel), and b) the effects of pricing regulations on drug shortages. In 'a', I've got inflation data and average consumption expenses by quintiles. In 'b', I've got the last 6 years of data on drug shortages, mainly due to government-administered pricing. However, I'd need to show statistical significance (and, additionally, whether it could predict anything statistically significant about drug shortages in the future).

What kind of stat methodology would be appropriate in terms of ‘a’ and ‘b’? Would appreciate your help.


December 11, 2021 at 7:39 pm

Thank you so much Sir.


August 7, 2021 at 7:01 am

Hello Mr. Jim,

Thank you very much for your opinion. Much helpful.

I’ve another case with 2 DV and multiple IDV and the scope is to determine the validity of data. So for this case, can I run MANOVA as regression analysis and look for significant value and null hypothesis for validity test?

Hoping to hear from you soon.

Kind Regards, A.Kaur

August 6, 2021 at 12:17 pm

Thank you for your reply Mr. Jim. My goal is to predict which approach best predicts CRI measure.

CRI-I: Disaster Management Cycle (DMC) based approach (Variable: PP, RS, RC, MP-contain all indices according to its phases) CRI- II: Sustainability based approach (Physical, Economy, Social-contain all indices according to its phases) CRI-III: Overall indices of data (24 indices from all the listed variable)

I’ve chosen PP and MP as my DV, and RS and RC as my IDV, since my goal focus on DMC.

Hope I’m clear now. And hoping to hear from you soon Mr. Jim. Thank you.

August 7, 2021 at 12:04 am

One approach would be to fit a regression model for each approach and the DV. Then assess the goodness-of-fit measures. You’d be particularly interested in the standard error of the regression. This measure tells you how wrong the model typically is. You’d be looking for the model that produces the lowest value because it indicates it’s less wrong at predicting the outcome.
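A minimal sketch of that comparison on simulated data; in statsmodels the standard error of the regression can be obtained as the square root of the residual mean square:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data where a second predictor genuinely helps
rng = np.random.default_rng(6)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 1 + 2 * x1 + 0.5 * x2 + rng.normal(size=100)

m1 = sm.OLS(y, sm.add_constant(x1)).fit()
m2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

# The model with the smaller value is typically "less wrong"
for m in (m1, m2):
    print(np.sqrt(m.mse_resid))  # standard error of the regression
```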

July 31, 2021 at 1:31 pm

Good day Mr. Jim,

I’ve decided to run regression analysis after a correlation test. My research is about the reliability and validity of a dataset for 3 approaches to a community resilience index (CRI): based on the DMC, sustainability, and overall indices. So now I’m literally confused about how to interpret the data with regression analysis. Can I use OLS and GLM to interpret the data?

3 approaches: 1:PP,RS,RC,MP {DMC} 2: PY,EC,SC {Sustainability} 3: Overall indices {24 indices}

For your information all those approaches are proposed in 1 dataset that contains 24 indices. Add on, I’ve previously conducted Likert questionnaire(5 scale) to collect my data.

I hope my question is clear. Hoping to hear from you soon.

August 4, 2021 at 4:38 pm

I’m sorry but I don’t completely understand what your goal is for your analysis. Are you trying to determine which approach best predicts sustainability? What are your IVs and DV. It wasn’t totally clear from your description. Thanks!


July 16, 2021 at 1:56 am

Going through your blog gave me a good understanding of when to use regression analysis; honestly, it’s an amazing blog.

July 19, 2021 at 10:22 pm

Thanks so much, Robin!


May 18, 2021 at 7:02 pm

Hey Jim, thanks for all the information. I would like to ask: are there any limitations of the multiple regression method? Is there another method in mathematics that can be more accurate than a regression?

Sincerely, Mythili

May 20, 2021 at 1:46 am

Hi Mythili,

There are definitely limitations for regression! That’s a broad question that could be answered with a book. But a good place to start is to consider the assumptions for least squares regression. Click the link to learn more. You can think of those as limitations because if you violate the assumptions, you can’t necessarily trust the results! In fact, when you violate an assumption, you might need to switch to a different analysis or perform it a different way.

Additionally, the Gauss-Markov theorem states that least squares regression is the most efficient regression, but only when you satisfy those assumptions!


May 15, 2021 at 4:19 pm

Hi Sir, In regression analysis specifically multiple linear regression, should all variables (dependent and independent variables) be normally distributed?

Thank you, Helena

May 15, 2021 at 11:08 pm

In least squares regression analysis, you don’t assess the normality of the variables. Instead, you assess the normality of the residuals. However, there is some correlation because if you have a dependent variable that follows a very non-normal distribution, it can be harder to obtain normal residuals. But it’s really the residuals that you need to focus on. I discuss that in my article about the least squares (OLS) regression assumptions.


April 18, 2021 at 11:12 pm

Hi Sir, I’m currently a senior high school student and currently struggling with my quantitative research. As a statistician, what statistical treatment would you recommend for identifying an impact? To answer the question: “What is the impact of the development of an educational brochure in minimizing cyberbullying in terms of 3.1 mental health and 3.2 self-esteem?”

Waiting for your reply, desperate for answers lol Jane


April 16, 2021 at 7:21 am

Hi Jim, thank you

So would you advise an ordinal regression or another type? I have a survey identifying whether they use the new social media, which will place them into 2 groups. Then I compare the 2 groups (1: use the new social media, 2: don't use it) with a control (FB use) to compare their happiness scores (obtained from a survey as well; a higher score = happier). The conclusions I can draw: would they be causal, or more an indication that, for example, the new users have lower happiness?

Also, is there a graph that can be drawn after a regression?

On a side note, when would it be advisable to do correlations? For example, have both groups complete the happiness score and conduct correlations for this, plus a regression to control for covariates? Or is this not statistically advisable?

April 16, 2021 at 3:46 pm

I highly recommend you get my book about regression analysis because I think it would be really helpful with these nuts and bolts types of questions. You can find it in My Web Store.

As for the type of regression, as I mentioned, that depends largely on what you use for your dependent variable. If it’s a single Likert item, then you’d use ordinal logistic regression. If it’s the sum or average of multiple Likert items, you can often use the regular least squares regression. But, I don’t have a good handle on exactly how you’re defining your dependent variable.

There are graphs you can create afterwards to illustrate the results. I cover those in my book. I don’t have a good post to refer you to that shows them. Fitted line plots are good when you have simple regression (just one independent variable), but when you have more there are other types.

You can do correlations but be aware that they don’t control for other variables. If there are confounders, your correlations might exhibit omitted variable bias and differ from the relationships you’ll find in the regression model. Personally, I would just stick to the regression results because they control for confounders that you include in the model.

April 15, 2021 at 4:46 pm

Hi, sorry, as you can tell I'm a little confused about what's best to do. Is it advisable to have 2 groups: users of the new social media and non-users of that new social media? Then do a t-test to compare their happiness scores. Then have participants answer a Facebook use questionnaire to control for this by conducting a hierarchical regression where I enter this in, to identify how much of the variance is explained by Facebook use?

Many thanks

April 15, 2021 at 10:28 pm

Hi Sam, you wouldn’t be able to do all of that with t-tests. I think regression is a better bet. You can still include an indicator variable to identify the two groups you mention AND include the controlling variables in that model. That way you can determine whether the difference between those two groups is statistically significant while controlling for the other IVs. All in one regression model!

April 15, 2021 at 8:26 am

Hi, I wanted to ask if regression is the best test for me. I am looking at happiness scores and time spent on a new social media site. As other social media sites have a relationship with happiness, and people don't use just one social media site, I was going to control for this 'other social media' use. My 1st group would be users of the new social media site plus Facebook, and the 2nd group would be Facebook users. They would do a happiness questionnaire and a questionnaire about their time/use. Any advice is really appreciated.

I have read around and found partial correlations; do you advise that? So instead, participants would complete a questionnaire on their use of this new social media, then also do a questionnaire on their Facebook use and a happiness questionnaire. I would do a partial correlation between the new social media app use and happiness score, while controlling for Facebook use.

April 15, 2021 at 10:22 pm

This case sounds like a good time to use regression analysis. The type of regression depends largely on the nature of the dependent variable. It’s for a survey. Perhaps it’s a Likert scale item? If it’s an item, that’s an ordinal scale and you’d need to use ordinal logistic regression. If you’re summing multiple items for the DV, you might be able to use regular linear regression. Ordinal independent variables are a bit problematic. You’d need to use them as either continuous or categorical variables. You’d include the questions about FB use to control for that.


April 13, 2021 at 5:10 am

Thank you very much for your answer,

I understand your point of view. However, that data set consists of the companies investing the largest sums in R&D, not necessarily the companies with the best results. Some of them even show a loss of operating profit. Is that still a factor biasing my results?

Have a nice day, Natasha


April 12, 2021 at 11:36 am

thank you it was very useful

April 12, 2021 at 11:24 am

I am working on my thesis, which is about evaluating the motivation of firms to invest in R&D for new products. I am specifically interested in the automotive sector. I have data on the R&D ranking of the world's top 2,500 companies (by industry), which consists of data about their R&D expenses (also R&D one-year growth), net sales (also net sales one-year growth), R&D intensity, capex, operational profit (also one-year growth), profitability, employees (also one-year growth), and market cap (also one-year growth).

My question is: which type of analysis would you recommend to fulfill the topic requirements?

April 13, 2021 at 12:29 am

Hi Natasha,

You could certainly use regression analysis to see which variables are related to R&D spending.

However, be aware that by using that list of companies, you are potentially biasing your results. For one thing, it's a list of top R&D companies, and you'd certainly want more of a mix of companies across the full range of R&D. You can learn from those who weren't so good at R&D too. Also, by using a list of the top R&D companies, you'll introduce some survival bias into the results because these are companies that made it and made it big (presumably). Again, you'd want a mix of companies that had varying degrees of success and even some failures! If you limit your data to top companies, and particularly top companies in R&D, you'll limit how much you can learn. You might still be able to learn some, but just be aware that you're potentially biasing your results.


April 8, 2021 at 8:05 pm

Hi Mr. Jim! Thank you so much for your response. Well appreciated!

April 8, 2021 at 11:07 pm

You’re very welcome, Violetta!

April 8, 2021 at 2:08 am

Hi! I’m currently doing my research paper, and I am confused about whether I can use regression analysis, since my title is “New Normal Workplace Setting towards Employee’s Engagement with their Workloads.” For the moment I have used a correlational approach, since it deals with the relationship of two variables. But I’m still confused about what would be best for my research. Hope I can get a response soon. Thank you so much!

April 8, 2021 at 3:56 pm

Hi Violetta,

If you’re working with just two variables, you have a choice. You can use either correlation or regression. You can even use both together! It depends on the goals of your research. Correlation coefficients are standardized measures of effect size, while regression coefficients are unstandardized effect sizes. I write about the difference between standardized and unstandardized effect sizes. Click the link to read about that. I discuss both correlation and regression coefficients in that context. It should help you decide what is best for your research goals.


March 3, 2021 at 9:34 am

Hi Jim, I am undertaking an MSc dissertation and would like to ask some questions on analysis, please. The research is health related, and I am looking at determinants of outcome. I have 5 continuous independent variables, and I would like to know if they have an association with the outcome of a treatment. They involve age, temperature, and blood test values. The dependent variable is binary: the treatment was successful, yes or no. I am looking to do a logistic regression analysis. Questions I have: 1. Do I first need to do tests to find out if there is statistical significance for each variable before I do the regression analysis, or can I go straight in? 2. If so, will I need to carry out tests to find out if I have skewed data in order to know whether I need to do parametric or non-parametric tests? Thank you.

March 3, 2021 at 6:01 pm

You should go in with a bunch of theory and background knowledge about the independent variables you should include. Look to other research studies for guides. When you have a set of IVs identified, it’s usually OK to include them all and see what’s significant. An important caveat is that if you have a small number of observations, you don’t want to overfit your model. However, statistical significance shouldn’t be your only guide for which variables to include and exclude.

To learn more about model specification, read my post about specifying your regression model. I write about it in the context of linear regression rather than binary logistic regression, but the ideas are the same.

In terms of the distribution of your data, you typically assess the residuals rather than the data itself. Usually, you can assess the residual plots.


January 4, 2021 at 12:01 pm

Looks like treating both ordinal variables as continuous seems to solve my problems with the non-mutually exclusive levels of the variables if I enter the variables as categorical. My main concern is to look at the variable as a whole, not by its levels, so it might be what I need; the measurement ranges were based on an established rating system and do not have any weight for my analysis. Though I'll have to look more into it, as well as the residual plots, etc., before deciding. Thank you for highlighting this option!

Is it correct if I assign the numerical value to the levels like this? 1 to 5, from lowest to highest.

Spacing
1: less than 60 mm
2: 60-200 mm
3: 200-600 mm
4: 0.6-2 m
5: more than 2 m

Length
1: less than 1 m
2: 1-3 m
3: 3-10 m
4: 10-20 m
5: more than 20 m

As for the data repetition, what I mean was say data for Site A is:

Set 1 (quantity: 25): SP3, PER5
Set 2 (quantity: 30): SP4, PER6
Set 3 (quantity: 56): SP2, PER3

so in the data input I entered the set 1 data 25 times, the set 2 data 30 times, and the set 3 data 56 times. From what I have gathered from fellow students and my lecturer, it is correct, but I'd like confirmation from a statistician. Thanks again!

December 31, 2020 at 5:44 am

I'm sorry, again the levels disappeared; maybe because I used (>) and (<), which messed up the coding of the comment.

spacing levels:

SP1: less than 60 mm
SP2: 60-200 mm
SP3: 200-600 mm
SP4: 0.6-2 m
SP5: more than 2 m

length levels:

PER1: more than 20 m
PER2: 10-20 m
PER3: 3-10 m
PER4: 1-3 m
PER5: less than 1 m

Spacing and length were recoded as ranges since they were estimated and not measured individually, as it would take too much time to measure each one (1 set of cracks may have at least 10 cracks, some can reach 50 or more, and the measurements are not exactly the same between cracks belonging to the same set).

I've input the dummies as in my previous reply when running the model, though the resulting equation I've provided does not include the length. Can ordinal variables be converted to, or treated as, continuous variables?

Also, since each set has its own quantity, I repeated the data in the input according to that quantity. Is that the right way of doing it?

January 2, 2021 at 7:10 pm

Technically those are ordinal variables. I write about this in more detail in my book about regression analysis, but you can enter these variables as either continuous variables (if you assign a numeric value to the groups) or as categorical variables. If you go the categorical route, you’ll need to use the indicator variable scheme and leave out a reference level, as we discussed. The approach you should use depends on a combination of your analysis goals, the nature of your data, and the ability to adequately fit the model (i.e., the properties of the residual plots).

I don’t exactly know what you mean by “repeated the data in the input.” However, you have levels for each categorical variable. Let’s use the lowest level for each variable as the reference level. Here’s how you’d use indicator variables to include both categorical variables in your model (some statistical software will do that for you behind the scenes).

Spacing variable: Leave out SP1. It's the reference. Include an indicator variable for each of: SP2, SP3, SP4, SP5.

Length variable: Leave out PER5 as the reference. Include indicator variables for: PER1, PER2, PER3, PER4.

And just code each indicator variable appropriately based on the presence or absence of the corresponding characteristic. All zeros in a set of indicator variables for a categorical variable represent the reference level of that categorical variable.

As you can see, you'll need to include many indicator variables (8), which is a drawback of entering them as categorical variables. You can quickly get into overfitting your model.
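Here is a hedged sketch of that indicator-variable scheme in Python with pandas. The level names mirror the commenter's setup; the data rows are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "spacing": ["SP1", "SP3", "SP4", "SP2", "SP5"],
    "length":  ["PER5", "PER1", "PER3", "PER2", "PER4"],
})

# drop_first=True leaves out one level per variable as the reference.
# Order the categories so the intended reference (SP1, PER5) comes first.
df["spacing"] = pd.Categorical(df["spacing"],
                               categories=["SP1", "SP2", "SP3", "SP4", "SP5"])
df["length"] = pd.Categorical(df["length"],
                              categories=["PER5", "PER1", "PER2", "PER3", "PER4"])

dummies = pd.get_dummies(df, columns=["spacing", "length"], drop_first=True)
print(dummies)  # 4 + 4 = 8 indicator columns, as noted above
```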

December 30, 2020 at 12:32 am

I’m sorry I had just noticed that the levels are missing

December 28, 2020 at 11:48 am

In my case, I'm studying the crack sets on a rock face, and I have two independent categorical variables (spacing and length) that have 5 levels of measurement ranges each. The dependent variable is the blasted rock size, i.e., I want to know how the spacing and length of the existing cracks on a rock face affect the size of blasted rocks.

E.g., for Spacing: SP1 = less than 60mm, up to SP5 = more than 2m.

I’ve coded the levels to run the regression model into:

Level  SP1  SP2  SP3  SP4
SP1     1    0    0    0
SP2     0    1    0    0
SP3     0    0    1    0
SP4     0    0    0    1
SP5     0    0    0    0

From the coding (leaving SP5 out as the reference level) above, after running the model, I have obtained the equation:

Blasted rock size (mm) = 1849.146 + 332.224SP1 + 137.624SP2 – 115.268SP3 – 103.604SP4

One rock slope could consist of 2 or more crack sets, hence the situation where more than one level of spacing and length can be observed. As an example, rock face A consists of 3 crack sets, with set #1 having SP1, set #2 SP3, and set #3 SP4. To predict blasted rock size for rock face A using the equation, I'll have to insert "1" for SP1, SP3, and SP4. Is that actually the wrong way of doing it, since they are not mutually exclusive? Or can I calculate each crack set separately using the same equation, then average the blasted rock sizes for these 3 crack sets?

From the method in your explanation, does this mean that I'll have to separate each level into 10 different variables and code them as 1 = yes and 0 = no? If so, for spacing, will the coding be

Level  SP1  SP2  SP3  SP4  SP5
SP1     1    0    0    0    0
SP2     0    1    0    0    0
SP3     0    0    1    0    0
SP4     0    0    0    1    0
SP5     0    0    0    0    1

in the input table, which would be similar to the initial one except with SP5 included? But if I were to include all levels when running the model, SPSS would automatically exclude one level, since I ran several rock faces (belonging to a single location) in one model, so all levels of spacing and length are present in the data set.

The other way I can think of is to create interactions for all possible combinations and dummy code them, but wouldn't that end up with a super long equation?

I’m sorry for imposing like this but I couldn’t grasp this problem on my own. Your help is very much appreciated.

December 31, 2020 at 12:51 am

Ah, OK, it sounds like you have two separate categorical variables. In that case, each observation can have one level for each variable. Additionally, for each categorical variable, you'll leave out one level as its own reference level.

I do have a question. Spacing and length sound like continuous measurements. Why are you including them as categorical variables? There might be a good reason, but it almost seems like you could include them as continuous predictors. Perhaps you don't have the raw measurements but instead have them in groups? In that case, they might actually be ordinal variables. You can include ordinal variables as categorical variables, but sometimes they'll still work as continuous variables.

December 26, 2020 at 12:12 am

I see; sorry I couldn't fully understand your previous reply before this, and thanks for the clarification. However, I am dealing with a situation where 2 or more levels of a variable could be observed simultaneously. Is it theoretically right to use dummy coding, or is there another method around it?

December 27, 2020 at 2:30 am

That sounds like you're dealing with more than one variable rather than one categorical variable. Within an individual categorical variable, the levels are mutually exclusive. In your case, you need to sort out which categorical variables you have and be sure that the levels are mutually exclusive. If you're looking at the presence and absence of certain characteristics, you can use a series of indicator variables. If these characteristics are not the mutually exclusive levels of a single categorical variable, you don't use the rule about leaving one out.

For example, in a medical setting, you might include characteristics of a patient using a series of indicator variables: gender (1 = female, 0 = male), high blood pressure (1 = yes, 0 = no), on medication, etc. These are separate characteristics (not part of one larger categorical variable), and you can just include an indicator variable to indicate the presence or absence of each characteristic.

Perhaps that is what you need? But be aware that what you describe, with multiple levels possible, does not work for a single categorical variable. The method I describe might be what you need if you're talking about separate characteristics.


December 24, 2020 at 2:03 am

Thank you , sir

December 18, 2020 at 12:54 am

Thanks for the answer Jim,

does that mean the predicted value when both L4 and L1 are observed, and when only L1 is observed without L4, is the same? (Y = 133)

thanks again!

December 18, 2020 at 1:03 am

The groups must be mutually exclusive. Hence, an observation could not be in both L1 and L4.

December 16, 2020 at 4:58 am

I have a question regarding categorical variables dummy coding, I can’t seem to find any post about this topic. Hope you don’t mind me asking here.

I ran a regression model with a categorical variable containing 4 levels, using the 4th level as the reference group. That means the equation will only contain levels 1 to 3, since level 4 is the reference. Say the equation is Y = 120 + 13L1 – 6L2 + 15L3; to predict Y with L4, I'll have Y = 120, right?

My question is: what if I want to predict Y when there is L1 but no L4? If I calculate Y = 120 + 13L1, would that mean I am including L4 in the equation, or am I wrong about this?

Thank you in advance.

December 17, 2020 at 11:28 pm

I cover how this works in my book about regression analysis . If you’re using regression for a project, you might consider it.

It sounds like your approach is correct. You always leave one level out for the reference group. And, yes, given your equation, the predicted value for level 4 is 120.

For observations where the subject/item belongs to group 1, your equation stays the same, but you enter a 1 for L1 and 0s for L2 and L3. Hence, the predicted value is 133. In other words, you don't change the equation given the level; you change the X values in the equation. When an observation belongs to group 4, you'll enter 0s for L1, L2, and L3, which is why the predicted Y is 120. For a given categorical variable, you'll enter a single 1 only for observations that belong to a non-reference group, and all 0s for observations belonging to the reference group. But the equation stays the same in all cases. I hope that makes sense!
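To make that arithmetic concrete, here is a tiny worked check of the commenter's equation, Y = 120 + 13*L1 - 6*L2 + 15*L3 (L4 is the reference level):

```python
# Each argument is the 0/1 indicator for the corresponding non-reference level.
def predict(l1, l2, l3):
    return 120 + 13 * l1 - 6 * l2 + 15 * l3

print(predict(1, 0, 0))  # group 1: 120 + 13 = 133
print(predict(0, 0, 0))  # group 4 (reference, all zeros): 120
```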


December 14, 2020 at 5:35 am

May I just ask if there is a difference between a true and a simple linear regression model? I can only think that their difference is the presence of a random error. Thanks a lot!

December 14, 2020 at 8:48 pm

Hi Anthony,

I've never heard the dichotomy stated as true vs. simple linear regression. I take "true models" to refer to the model that is correctly specified for the population. A simple regression model is just one that has a single predictor, whereas multiple regression has more than one predictor. The true model has as many terms as are required, which includes predictors and other terms that fit curvature and interactions as needed.


December 13, 2020 at 3:04 pm

Hi Jim, I find your explanations of questions very good and so important. Thanks for that. Please, I need your help with my thesis work. My question is: if, for example, I want to measure the level of resilience capacity in a company's safety management system, what tool would you advise? Regression or another one? Thanks, Kwame

December 14, 2020 at 9:01 pm

The type of analysis you use depends on the data you collect as well as a variety of other factors. The answer is entirely specific to your research question, field of study, data, etc. After you make those determinations, you can begin to figure out which type of analysis to use. I recommend researching your study area to answer all of those questions, including which type of analysis to use. If you need help after you start developing answers to the preliminary questions, I'd be able to provide more input.

Also, I really recommend reading my post about designing a study that includes statistical analyses . That’ll help you understand what type of information you need to collect and questions you need to answer.


November 12, 2020 at 11:12 pm

Thank you so much for your answer, Jim!

November 12, 2020 at 11:53 am

Hello Jim, I have a question. I have one independent variable and two dependent variables. I will explain the case before asking my question. I obtained the data for the independent variable using a questionnaire, and one of my dependent variables also comes from a questionnaire. But the data for my other dependent variable comes from an official website, which is secondary data, unlike the other variables. So, my question: is it OK to use regression analysis to analyze these three variables, or do I have to use another statistical analysis that suits these variables better? Thanks in advance.

November 12, 2020 at 4:37 pm

Most forms of regression analysis allow you to use one dependent variable and multiple independent variables. Because you have two dependent variables, you’ll need to fit two regression models, one for each dependent variable.

In regression, you need to be able to tie together all corresponding values of an observation for the dependent variable and the independent variables. We'll use an example with people. To fit a regression model, for each person, you'll need to know their values for the dependent variable and all the independent variables in the model. In your case, it sounds like you're mixing data from an official website and a survey. If those data sources contain the same people and you can link their values as described, that can work. However, if those data sources have different people, or you can't link their scores, you won't be able to perform regression analysis.


November 6, 2020 at 9:55 am

Hi Jim, if you’ve got three predictors and one dependent variable, is it ever worth doing linear regression on each individual predictor beforehand or should you just dive into the multiple regression? Thanks a lot!

November 6, 2020 at 8:48 pm

Hi Kristian,

You should probably just dive right into multiple regression. There’s a risk of being misled by starting out with regressions with individual predictors. It’s possible that omitted variable bias can increase or decrease the observed effect. By leaving out the other predictors, the model can’t control for them, which can cause that bias.

However, that said, it’s often a good idea to graph the relationship between pairs of variables using scatterplots to get an idea of the nature of each relationship. That’s a great place to start. Those plots not only reveal the direction of the relationship but also whether you need to model curvature.

I’d start with graphs and then try modeling with all the variables. You can always remove insignificant variables.
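A minimal sketch of that workflow, graphs first and then one multiple regression with all predictors, might look like this in Python. The data are simulated and the variable names are hypothetical:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
df["y"] = 2 * df.x1 - df.x2 + 0.5 * df.x3 + rng.normal(size=200)

# Step 1: eyeball direction and curvature of each pairwise relationship.
pd.plotting.scatter_matrix(df)
plt.show()

# Step 2: fit all predictors together so each coefficient controls for the others.
print(smf.ols("y ~ x1 + x2 + x3", data=df).fit().summary())
```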


October 2, 2020 at 1:00 pm

Hi Jim, do you think it is correct to estimate a regression model based on historical data as Y=aX+b and then use the model for the forecast as Y=aX? Would this be biased?

if the variables involved are growth rates, would it be preferable to directly estimate the model without the intercept?

Thank you in advance Stefania

October 4, 2020 at 12:56 am

Hi Stefania,

The answer to that question depends on a very close understanding of the subject area. However, there are very few cases where fitting a model without a constant is advisable. Bias would be very likely. Read my article about the y-intercept, where I discuss this issue specifically.
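Here is a short sketch, on simulated data, of why dropping the constant tends to bias the slope: the no-intercept fit forces the line through the origin, and the slope absorbs the missing intercept.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 100)
y = 5 + 2 * x + rng.normal(0, 1, 100)    # true intercept is 5, not 0

with_const = sm.OLS(y, sm.add_constant(x)).fit()
no_const = sm.OLS(y, x).fit()            # Y = aX, no intercept

print(with_const.params)  # close to [5, 2]
print(no_const.params)    # single slope, biased upward to compensate
```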


September 30, 2020 at 3:22 am

Nice article. Thank you for sharing.


August 19, 2020 at 12:13 pm

If your outcome variable is pass or fail, then it is binomial logistic regression. My undergrad thesis was on this topic. Maybe I can offer some help, as this topic is of interest to me. Azad ( [email protected] )


August 6, 2020 at 2:36 am

Sir, what is Cox regression analysis?


August 6, 2020 at 12:52 am

A friend recommended your help with a stats question for my dissertation. I am currently looking at data regarding pass rates and student characteristics. I have collected multiple data points. One example is student pass rate (pass or fail) and observation hours (a continuous variable, 0-1000). Would this be a binomial logistic regression? Can that be performed in Excel?

Additionally, I am looking at pass rate in relation to faculty characteristics. Another example is pass rate (a percentage, maybe continuous data 0-100) and categorical data (level of degree: bachelor's, master's, doctorate). Additionally, pass rate (percentage of 100) and the ratio of faculty to students within the classroom (continuous data). Which test would be appropriate for this type of data comparison? Linear regression?

Thanks for your guidance!


July 24, 2020 at 7:14 am

Hi Jim. Concepts were well explained. Thank you so much for making this content available.

I have data on mortgage loan customers who are currently in default. There are various parameters for why default would have happened, but predominantly there are two factors where we could have gone wrong while sanctioning the loan: underwriting the loan (credit risk) and/or property valuation (technical risk). I have data on the sub-parameters under credit and technical risk at the point of sanction.

Now I want to arrive at an output showing where we predominantly went wrong: technical risk, credit risk, or both. Which model of regression analysis can help in solving this?

July 3, 2020 at 3:40 am

Dear sir, I'm currently a final-year undergraduate in a BSc Radiography degree, so I chose risk estimation of cardiovascular diseases using several risk factors via regression analysis as my undergraduate research. I want to predict a percentage value for cardiovascular risk estimation as a dependent variable using regression analysis. How can I do that, sir? I'm very pleased to have your answer. Thank you very much.

July 3, 2020 at 3:41 pm

Hi, it sounds like you might need to use binary logistic regression. If your dependent variable indicates the presence or absence (i.e., a binary outcome measure) of a cardiovascular condition, binary logistic regression will predict the probability of having that condition given the values of your independent variables.
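A minimal binary logistic regression sketch with statsmodels, using simulated data (the two predictors are hypothetical stand-ins, not the commenter's actual risk factors):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 2))       # e.g., standardized age and blood pressure
p = 1 / (1 + np.exp(-(-0.5 + 1.2 * X[:, 0] + 0.8 * X[:, 1])))
y = rng.binomial(1, p)              # 1 = condition present, 0 = absent

fit = sm.Logit(y, sm.add_constant(X)).fit()
# Predicted probabilities of the condition for the first five subjects.
print(fit.predict(sm.add_constant(X))[:5])
```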


June 26, 2020 at 8:35 pm

Thank you for all the information on your page. I am currently beginning to get into statistics and wanted to ask your advice about something.

I am a business analyst with MI skills, building dashboards etc., and using sales data and KPIs.

I am wondering, for regression, would a good independent variable be a salesperson's sales performance relative to the team's total sales performance, or am I on the wrong track with that?


June 11, 2020 at 2:18 pm

Dear Jim, I am a first-year MBA student with very little exposure to research. Please have patience and explain to me whether I can use regression to determine the impact of a variable on a 'construct'.


June 7, 2020 at 6:49 pm

Which criteria does an independent variable need to meet in order to use it in a regression analysis? How do you deal with data that do not meet these requirements?

June 8, 2020 at 3:13 pm

I recommend you read my post about specifying the correct regression model . That deals directly with which variables to include in the model. If you have further questions on the specifics, please post them in the comments section there.


June 5, 2020 at 7:15 am

How should we interpret a factor A that becomes non-significant when fitted together with factor B in a model? Can I conclude that factor B incorporates factor A and just ignore the effect of factor A?


May 28, 2020 at 2:17 am

Hello Mr.Jim and friends,

I have one dependent variable Y and six independent variables X1…X6. I have to find the effect of all the independent variables on Y, specifically X6, to check whether it is effective or not. 1) Can I use OLS regression? 2) Which other tests do I need to do before or after the regression analysis?

May 29, 2020 at 4:16 pm

If your dependent variable is continuous, then OLS is a good place to start. You’ll need to check the OLS assumptions for your model.


April 29, 2020 at 8:06 am

Good, very explicit processes.


April 10, 2020 at 4:53 pm

I hope this comment reaches you in good health as we are living in some pretty tough times right now. Also, thank you for building this website as it is an excellent resource for novice statisticians such as myself. My question has to do with the first paragraph of this post. In it you state,

“Use regression analysis to describe the relationships between a set of independent variables and the dependent variable. Regression analysis produces a regression equation where the coefficients represent the relationship between each independent variable and the dependent variable. You can also use the equation to make predictions.”

Is it possible to use regression analysis to produce a regression equation when you have two independent variables and two dependent variables? Also, while I hopefully have your attention, would I need to do the regression analysis twice (once for each dependent variable versus the independent variables)?

April 10, 2020 at 7:07 pm

Typically, you would fit separate regression models for each dependent variable. There are a few exceptions. For example, if you use multivariate ANOVA (MANOVA), you can include multiple dependent variables. If those DVs are correlated, using MANOVA provides some benefits. You can include covariates in the MANOVA model. For more information, read my post about MANOVA.


April 1, 2020 at 7:00 pm

In my study, I intervened with an instructional practice. My intervention has 4 independent variables (A, B, C, and D). In the literature, each subskill can be graded alone, and we can get one whole score. In the literature, the effect of the intervention is holistic (A, B, C together predict the performance on D).

So, I conducted a multiple regression (enter method) before and after the intervention where individual scores of A, B, C were added as predictors on D.

I added Group (experimental vs. control) to remove any baseline difference between experimental and control. No significant effect was noticed except for the individual scores of A and C on D. The model had a weak fit.

However, after the intervention, I repeated the same regression. The group (experimental vs. control) was the best predictor. No significant effect of A was noticed, but significant effects of B and C were. How do you think I can interpret the change in the significance of A? It is relevant in the literature, but after the intervention it was not significant. Does its significance have to do with the increase in the significance of the Group variable?


January 26, 2020 at 2:51 pm

I’d like to ask a question that builds on your example of income regressed on IQ and education. In the dataset I am sure there would be a range of incomes. Let’s say you want to find ways to bring up the low income earners based on the data from this regression.

Can I use the coefficients from the regression to guide ideas on how to help the lower income earners, as an estimate of how much improvement would be expected? For example, if I take the lowest earner and find that he is also below average in IQ and education, could I suggest that he gets another degree and tries to improve his IQ test results to potentially gain $X (n*IQ + m*Edu) in income?

This example may not be strictly usable because I imagine there are many other factors for income. Assuming that we are confident that we’ve captured most of the variables that affect income, can the numbers be used in this way?

If this is not an appropriate application, how would one go about this? Thanks.


October 22, 2019 at 7:45 am

Hello, I am completing a reflection paper for Math 221. I work in a call center; can I use a regression analysis for this type of work?


October 20, 2019 at 4:48 am

I am a total novice when it comes to statistics. My challenge is that I am working on the relationship between the population growth of a town and the class sizes of secondary schools in that same town (about 10 schools) over a period of years (2008-2018). Having gathered my data, I don't know what to use to analyze my data to show this relationship.


October 16, 2019 at 8:48 pm

Hi Jim! I'm just a student who's trying to finish her science investigation 🙂 but I have a question. What is linear regression, and how do we know if this method is appropriate for our data?

October 18, 2019 at 1:23 pm

Hi Marlene,

I think this blog post describes pretty well when to use regression analysis generally. Linear regression analysis is a specific form of regression. Linear refers to the form of the model, not whether it can fit curvature. I talk about this in my post about the differences between linear and nonlinear regression. I always suggest that you start with linear regression because it's an easier analysis to use. However, sometimes linear regression can't fit your data. It can fit curvature in your data, but it can't fit all types of curves. Nonlinear regression is more flexible in the types of curves it can fit.

As for determining whether linear regression is appropriate for your data, you need to see if it can provide an adequate fit to your data. To make that determination, please read my posts about residual plots because that’s how you can tell.

Best of luck with your research!! 🙂


August 27, 2019 at 4:50 pm

Hello Jim, thank you for this wonderful page. It has enlightened me when to use regression analysis. However, I am a complete beginner to using SPSS (and statistics at that) so I am hoping you can help me with my specific problem.

I intend to use a linear regression analysis. My dependent variable is continuous, though I would think it's ordinal (the data were obtained through a 5-point Likert scale). I have two independent variables (also obtained through 5-point Likert scales). However, I also intend to use 7 control variables, and this is where my problem lies. My control variables are all (I think) nominal (or is that called categorical in statistics?). They are as follows:

Age: 4 categories; Gender: 2 categories; Marital status: 4 categories; Education level: 11 categories; Household income: 4 categories; Nationality: 4 categories; Country of origin: 9 categories

Do I input these control variables as they are? Or do I have to do something beforehand? I have heard about creating dummy variables. However, if I create dummy variables for each control variable, won't I end up with many variables?

Please give me some advice regarding this. I have been stuck at this step for a while now. I look forward to hearing from you, thanks.

August 27, 2019 at 11:43 pm

There are several issues to address in your questions. I'll provide some information. However, my regression ebook goes into the details much further, so I highly recommend you get that.

In terms of the dependent variable, the answer is clear. If your Likert scale data are the actual values of 1, 2, 3, 4, and 5, they are ordinal data and are not considered continuous. You'll need to use ordinal logistic regression. If the DV is an average of multiple Likert items for each individual, so that an individual might have a 3.4, that is continuous data, and you can try using linear least squares regression.

Categorical data and nominal data are the same. There are different naming conventions, but those are synonyms.

For categorical data, it's true that you need to recode them as indicator variables. However, most software should do that automatically behind the scenes. As you noticed, though, the recoding (even if your software does it for you) can involve creating many indicator variables (dummy variables), particularly when you have many categorical variables and/or many levels within a categorical variable. That can use up your degrees of freedom! My ebook covers this in more detail.

For Likert IVs: again, if it's an average of multiple Likert items, you can probably include it as a continuous variable. However, if it's the actual Likert values of 1, 2, 3, 4, and 5, then you'll need to decide whether to include it as a continuous or categorical variable. There are pros and cons to both approaches. The best answer depends on both your data and your goals. My ebook describes this in more detail.

Yes, as a general rule, you want to include your control variables and IVs that you are specifically testing. Control variables are just more IVs, but they’re usually not your main focus of study. You include them so that you can account for them while testing your main variables of interest. Excluding relevant IVs that are significant can bias the estimates for the variables you’re interested in. However, if you include control variables and find they’re not significant, you can consider removing them from the model.

So, those are some pointers to start with!


June 22, 2019 at 1:02 am

Hi Jim and everyone! I'm starting some statistical analysis, and this has been really useful. I have a question regarding variables and samples. I need to see if there is any relationship between days of the week and number of robberies. I already have the data, but I wonder: if my variables (number of robberies on each day of the week (independent) and total number of robberies (dependent)) come from the same data sample, can that be a problem?


June 7, 2019 at 2:56 am

Thank you Jim this was really helpful

I have a question. How do you interpret an independent variable, let's say AGE, with categories that are insignificant? For example, I ran the regression analysis for the variable age with categories; age as a whole was found to be significant, but the categories appear insignificant. It was as follows: Age = 0.002; <30 years = 0.201; 30-44 years = 0.161; 45+ (reference category).

I had another scenario: occupation = 0.000; peasant farmers = 0.061; petty businessmen = 0.003; other occupation (reference category).

My research question was: "What are the effects of socio-demographic characteristics on men's attendance at education classes?"

I failed to interpret them; kindly help.

June 7, 2019 at 10:07 am

For categorical variables, the linear regression procedure uses two tests of significance. It uses an F-test to determine the overall significance of the categorical variable across all its levels jointly. And, it uses separate t-tests to determine whether each individual level is different from the reference level. If you change the reference level, it can change the significance of t-tests because that changes the levels that the procedure directly compares. However, changing the reference level won’t change the F-test for the variable as a whole.

In your case, I'm guessing that the mean for <30 is on one side (high or low) of the reference category of 45+, while the mean of 30-44 is on the other side. These two categories are not far enough from 45+ to be significant. However, given the very low p-value for age, I'd guess that if you change the reference level from 45+ to one of the other two groups, you'll see significant p-values for at least one of the t-tests. The very low p-value for Age indicates that the means for the different levels are not all equal. However, given the reference level, you can't tell which means are different. Using a different reference level might provide more meaningful information.

For occupation, the low p-value for the F-test indicates that not all the means for the different types of occupations are equal. The t-test results indicate that the difference in means between petty businessmen and other (reference level) is statistically significant. The difference between peasant farmers and the reference category is not quite significant.

You don't include the coefficients, but those would indicate how those means differ.
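To see the two kinds of tests side by side, here is a sketch in Python with statsmodels. The data are simulated and the group means are hypothetical; only the level names mirror the question:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(11)
df = pd.DataFrame({"age_group": rng.choice(["<30", "30-44", "45+"], 300)})
means = {"<30": 10, "30-44": 11, "45+": 13}      # made-up group means
df["y"] = df.age_group.map(means) + rng.normal(0, 2, 300)

# Treatment coding with 45+ as the reference level.
fit = smf.ols("y ~ C(age_group, Treatment(reference='45+'))", data=df).fit()
print(fit.summary())   # t-tests: each level vs. the 45+ reference
print(anova_lm(fit))   # F-test: the age_group variable as a whole
```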

Because you're using regression analysis, you should consider getting my regression ebook. I cover this topic, and others, in more detail in the book.

Best of luck with your analysis!


May 11, 2019 at 12:51 pm

Hi Jim, I have followed your discussion, and I want to know if I can apply this analysis in a case study.


April 26, 2019 at 4:01 pm

Hi Jim, I really appreciate your excellence in regression analysis. Please, would you help with the steps to draw a single fitted line for several, say five, IVs against a single DV?

April 26, 2019 at 4:18 pm

It sounds like you're dealing with multiple regression because you have more than one IV. Each IV requires an axis (or dimension) on a graph. So, for a two-dimensional graph, you can use the X-axis (horizontal) for the IV and the Y-axis for the DV. If you have two IVs, you could theoretically show them as a hologram in three dimensions: two dimensions for the IVs and one for the DV. However, when you get to three or more IVs, there's just no way to graph them! You'd need four or more dimensions. So, what can you do?

You can view residual plots to see how the model with all 5 IVs fits the data. And, you can predict specific values by plugging numbers into the equation. But you can’t graph all 5 IVs against the DV at the same time.

You could graph them individually. Each IV by itself against the DV. However, that approach doesn’t control for the other variables in the model and can produce biased results.

The best thing you can do that shows the relationship between an individual IV and a DV while controlling for all the variables in a model is to use main effects plots and interaction plots. You can see interaction plots here . Unfortunately I don’t have a blog post about main effects plots, but I do write about them in my ebook, which I highly recommend you get to understand regression! Learn more about my ebook!

I hope this helps!


March 16, 2019 at 1:31 pm

Many thanks. I appreciate it.

March 15, 2019 at 10:47 am

I stumbled across your website in hopes of finding an answer to a couple of questions regarding the methodology of my political science paper. If you could help, I would be very grateful.

My research question is “Why do North-South regional trade agreements tend to generate economic convergence while South-South agreements sooner cause economic divergence?”. North = OECD developed countries and South = non-OECD developing countries.

This is my lineup of variables and hypotheses: DV: Economic convergence between country members in a regional trade agreement IV1: Complementarity (differentness) of relative factor abundance IV2: Market size of region IV3: Economic policy coordination (Harmonization of Foreign Direct Investment (FDI) policy)

H1: The higher the factor endowment difference between countries, the greater the convergence H2: The larger the market size, the greater the convergence H3: The greater the harmonization of FDI policies, the greater the convergence

I am not sure what the best methodological approach is. I will have to take North-South and South-South groups of countries and assign values for the groupings. I want to show the relationship between the IVs and DV, so I thought to use a regression. But there are at least two issues:

1. I feel the variables are not appropriate for a time series, which is usually used to show relationships. This is because e.g. the market size of a region will not be changing with time. Can I not do a time series and still have meaningful results?

2. The IVs are not completely independent of one another. How can I work with that?

Also, what kind of regression would be most appropriate in your view?

Many sincere thanks in advance. Irina

March 15, 2019 at 5:23 pm

I’m not an expert in that specific field, so I can’t give you concrete advice, but here are somethings to consider.

The question about whether you need to include time related information in the model depends on the nature of your data and whether you expect temporal effects to exist. If your data are essentially collected at the same time and refer to the same time period, you probably don’t need to account for time effects. If theory suggests that the outcome does not change over time, you probably don’t need to include variables for time effects.

However, if your data are collected at, or otherwise describe, different points in time, and you suspect that the relationships between the IVs and DV change over time, or there is an overall shift over time, then yes, you'd need to account for the time effects in your model. In that case, failure to account for the effects of time can bias your other coefficients; basically, there's the potential for omitted variable bias.

I don’t know the subject area well enough to be able to answer those questions, but that’s what I’d think about.

You mention that the IVs are potentially correlated (multicollinearity). That might or might not be a problem. It depends on the degree of the correlation. Some correlation is OK and might not be a problem. I’d perform the analysis and check the VIFs, which measure multicollinearity. Read my post about multicollinearity , which discusses how to detect it, determine whether it’s a problem and some corrective measures.

I’d start with linear regression. Move away from that only if you have specific reason to do so.


March 10, 2019 at 3:59 am

I was wondering if you could help. I'm currently doing a lab report on numerical cognition in human and non-human primates, where we are looking at whether the size, quantity, and visibility of food affect choice. We have tested humans so far and are going to test chimps in the future. My IV is condition (visible vs. opaque containers), and my DV is the number of correct responses. So far, I have compared the mean numbers of correct responses for both conditions using a one-way repeated measures ANOVA, but I don't think this is correct. After having a look at your website, should I run a regression analysis instead? Sorry for the confusion; I'm really a rookie at this. Hope you can help!

March 11, 2019 at 11:26 am

Linear regression analysis and ANOVA are really the same type of analysis: linear models. They both use the same math "underneath the hood." They each have their own historical traditions and terminology, but they're really the same thing. In general, ANOVA tends to focus on categorical (nominal) independent variables, while regression tends to focus on continuous IVs. However, you can add continuous variables to an ANOVA model and categorical variables to a regression model. If you fit the same model in ANOVA as in regression, you'll get the same results.

So, for your study, you can use either ANOVA or regression. However, because you have only one categorical IV, I’d normally suggest using one-way ANOVA. In fact, if you have only those two groups (visible vs opaque), you can use a 2-sample t-test.

Although you mention repeated measures, you'd use that only if you in fact have pre-test and post-test conditions. You could even use a paired t-test if you have only the two groups and pre- and post-tests.

There is one potential complication. You mention that the DV is a count of correct responses. Counts often do not follow the normal distribution but can follow other distributions, such as the Poisson and negative binomial distributions. However, counts can approximate the normal distribution when the mean is high enough (>~20). And if you have two groups and each group has more than 15 observations, the analyses are robust to departures from the normal distribution.
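To illustrate the ANOVA/regression equivalence mentioned above, here is a sketch on simulated data with two hypothetical groups (visible vs. opaque): the regression F-test and one-way ANOVA return identical results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(2)
df = pd.DataFrame({"condition": ["visible"] * 30 + ["opaque"] * 30})
df["correct"] = np.where(df.condition == "visible", 8, 6) + rng.normal(0, 2, 60)

# Regression with one categorical IV...
fit = smf.ols("correct ~ C(condition)", data=df).fit()
# ...and classic one-way ANOVA on the same groups.
f_aov, p_aov = stats.f_oneway(df.correct[df.condition == "visible"],
                              df.correct[df.condition == "opaque"])
print(fit.fvalue, f_aov)    # identical F statistics
print(fit.f_pvalue, p_aov)  # identical p-values
```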

I hope this helps! Best of luck with your analysis!


February 9, 2019 at 8:20 am

Thank you so much for the reply. I appreciate it, and I finally worked it out and got a good mark on the lab report, which was good :). I appreciate your time replying; you explain things very clearly, so thank you.

January 17, 2019 at 9:49 am

Hi there. I am currently doing a lab report and have not done stats in years, so I am hoping someone can help, as it is due tomorrow. When I do a bivariate correlation test, it shows the correlation between a personality trait and a particular cognitive task is not significant. Yet when I conduct a simple t-test, it shows a significant p-value and gives the 95% confidence interval. If I want to test whether higher scores on one trait tend to mean higher scores on a particular cognitive task, should I be doing a regression? We were told basic correlations, so I did the bivariate option and just stated that the Pearson's r is not significant (r = .., n = .., p = .84, for example). Yet if I do a regression analysis for each, it is significant. Why could this be?

January 18, 2019 at 9:45 am

There aren't quite enough details to know for sure what is happening, but here are some ideas.

Be aware that a series of pairwise correlations is not equivalent to performing regression analysis with multiple predictors. Suppose you have your outcome variable and two predictors (Y, X1, X2). When you perform the pairwise correlations (X1 and Y, X2 and Y), each correlation does not account for the other X. However, when you include both X1 and X2 in a regression model, it estimates the relationship between each X and Y while accounting for the other X.

If the correlation and regression model results differ as you describe, you might well have a confounding variable, which biases your correlation results. I write about this in my post about omitted variable bias . You’d favor the regression results in this situation.

As for the difference between the 2-sample t-test and correlation, that’s not surprising because they are doing two entirely different things. The 2-sample t-test requires a continuous outcome variable and a categorical grouping variable and it tests the mean difference between the two groups. Correlations measure the linear association between two continuous variables. It’s not surprising the results can differ.

It sounds like you should probably use regression analysis and include your multiple continuous variables in the model along with your categorical grouping variables as independent variables to model your outcome variable.
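Here is a small simulation sketching how a confounder can make pairwise correlations and multiple regression disagree, which is the omitted variable bias scenario described above (all data simulated):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
x2 = rng.normal(size=500)
x1 = 0.9 * x2 + rng.normal(scale=0.5, size=500)  # x1 is correlated with x2
y = 2 * x2 + rng.normal(size=500)                # y truly depends only on x2
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

print(df.corr().y)  # pairwise: x1 *looks* strongly related to y, via x2
# Multiple regression: x1's coefficient is near zero once x2 is controlled for.
print(smf.ols("y ~ x1 + x2", data=df).fit().params)
```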


January 17, 2019 at 12:39 am

This is Kathlene, and I am a Grade 12 student currently doing quantitative research. I am having a little trouble deciding on my statistical treatment. My research is entitled "Emotional Quotient and Academic Performance Among Senior High School Students in Tarlac National High School: Basis for a Guidance Program." I was debating what to use to determine the relationship between the variables in my study. I'm thinking of using the chi-square method, but a friend said it would be more accurate to use regression analysis. Math is not really my field of study, so I badly need your opinion on this.

I’m hoping you could lend me a helping hand.

January 17, 2019 at 9:27 am

Hi Kathlene,

It sounds like you’re in a great program! I wish more 12th grade students were conducting studies and analyzing their results! 🙂

To determine how to model the relationships between your variables, it depends on the type of variables you have. It sounds like your outcome variable is academic performance. If that’s a continuous variable, like GPA, then I’d agree with your friend that regression analysis would be a good place to start!

Chi-square assesses the relationship between categorical variables.


December 13, 2018 at 1:57 am

Hi Mr. Jim, I am using an orthogonal design having 7 factors with three levels each. I have done regression analysis in Minitab software, but I don't know how to explain or interpret the results. I need your help in this regard.

December 13, 2018 at 9:13 am

I have a lot of content throughout my blog that will help you, including how to interpret the results. For a complete list for regression analysis, check out my regression tutorial .

Also, early next year I’ll be publishing a book about regression analysis as well that contains even more information.

If you have a more specific question after reading my other posts, you can ask them in the comments for the appropriate blog post.

Best of luck!


December 9, 2018 at 12:08 pm

By the way my gun laws vs VCR, is part of a regression model. Any help you can give, I’d greatly appreciate.

December 9, 2018 at 12:07 pm

Mr. Jim, I have a problem. I'm working on a research design on gun laws vs. homicides, with my dependent variable being the violent crime rate. My sig is .308. The constant's (VCR) standard error is 24.712, and my n for violent crime rate is 430.44. I really need help ASAP. I don't know how to interpret this well. Please help!!!

December 11, 2018 at 10:03 am

There’s not enough information for me to know how to interpret the results. How are you measuring gun laws? Also, VCR is your dependent variable, not the constant as you state. You don’t usually interpret the constant . All I can really say is that based on your p-value, it appears your independent variable is not statistically significant. You have insufficient evidence to conclude that there is a relationship between gun laws and homicides (or is it VCR?).


December 4, 2018 at 12:49 am

Your blog has been very useful. I have a query: if I am conducting a multiple regression, is it OK to have an outcome variable which is normally distributed (I winsorized an outlier to achieve this) and two predictor variables which are not normally distributed? (The normality test results were significant.)

I have read in many places that you have to transform your data to achieve normality for the entire data set to conduct a multiple regression, but doing so has not helped me at all. Please advise.

December 4, 2018 at 10:42 am

I'm dubious about the winsorizing process in general. Winsorizing reduces the effect of outliers. However, the process is fairly indiscriminate in identifying outliers: it simply defines outliers as values more extreme than an upper and lower percentile and changes those extreme values to equal the specified percentiles. Identifying outliers should be a point-by-point investigation. Simply changing unusual values is not a good process. It might improve the fit of your data, but it is an artificial improvement that overstates the true precision of the study area. If that point is truly an outlier, it might be better to remove it altogether, but make sure you have a good explanation for why it's an outlier.

For regression analysis, the distributions of your predictors and response don’t necessarily need to be normally distributed. However, it’s helpful, and generally sought, to have residuals that are normally distributed. So, check your residual plots! For more information, read my post about OLS assumptions so you know what you need to check!

If your residuals are nonnormally distributed, sometimes transforming the response can help. There are many transformations you can try; it's a bit of trial and error. I suggest you look into the Box-Cox and Johnson transformations. Both methods assess families of transformations and pick one that works best for your data. However, it sounds like your outcome is already normally distributed, so you might not need to do that.
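A quick sketch of a Box-Cox transformation using scipy (the response values must be positive; the data here are simulated to be right-skewed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
y = rng.lognormal(mean=1.0, sigma=0.6, size=200)  # right-skewed response

# boxcox returns the transformed values and the lambda it chose by
# maximum likelihood; a lambda near 0 implies a log transform.
y_transformed, lam = stats.boxcox(y)
print(f"chosen lambda: {lam:.2f}")
```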

Also, see what other researchers in your field have done with similar data. There’s little general advice I can offer other than to check the residuals and make sure they look good. If there are patterns in the residuals, make sure you’re fitting curvature that might be present. You can graph the various predictors by the residuals to find where the problem lies. You can also try transforming the variables as I describe earlier. While the variables don’t need to follow the normal distribution, if they’re very nonnormally distributed, it can cause problems in the residuals.


December 3, 2018 at 10:05 pm

Hi, I am confused about the assumption of independent observations in multiple linear regression. Here's the case: I have heart rate data at five-minute intervals for a day for 14 people. The dependent variable is the heart rate. During the day, the workers worked for 8 hours (8 am to 5 pm), so basically I have 90 data points per worker for a day. That makes 1260 data points (90 times 14) to be included in the model. Is it valid to use multiple linear regression for this type of data?

December 4, 2018 at 10:47 am

It sounds like your model is more of a time series model. You can fit those using regression analysis as well, but there are special concerns that you need to address. Your data are not independent. If someone has a high heart rate during one measurement, it's very likely it'll also be elevated 5 minutes later. The residuals are likely to be serially correlated, which violates one of the OLS assumptions.

You'll likely need to include other variables in your model that capture this time-dependent information, such as lagged variables. There are various considerations that go beyond the scope of these comments. You'll need to do some additional research into using regression analysis for time series data.
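As a starting point, here is a sketch of adding a lagged term and checking the remaining serial correlation with the Durbin-Watson statistic. The series is simulated, not the commenter's heart-rate data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(8)
n = 200
hr = np.zeros(n)
for t in range(1, n):                     # autocorrelated "heart rate" series
    hr[t] = 0.8 * hr[t - 1] + rng.normal()

df = pd.DataFrame({"hr": hr, "hr_lag1": pd.Series(hr).shift(1)}).dropna()

fit = smf.ols("hr ~ hr_lag1", data=df).fit()
# A Durbin-Watson value near 2 suggests little remaining serial correlation.
print(durbin_watson(fit.resid))
```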


November 8, 2018 at 10:38 am

Ok.Thank you so much.

November 8, 2018 at 10:21 am

Thank you so much for your time! Actually, I don't have authentic data about property values (the dependent variable), nor do the concerned institutions have this data. Can I ask property owners for the property values directly through walk-in interviews?

November 8, 2018 at 10:31 am

You really need to have valid data. Using a self-reported valuation might be better than no data. However, be aware there might be differences between what the property owner says and the true market value. Your model would describe self-valuation rather than market valuation. Typically, I’ve seen studies like yours use actual sales prices.

November 8, 2018 at 12:20 am

Hello Sir! Is it necessary for the dependent variable in a multiple regression model to have values? I have a number of independent variables (age of property, stories in building, location close to park) and a single dependent variable (property value). Some independent variables decrease the value of the dependent variable, while some increase it. Can I put the values of my single dependent variable as (a. <200000, b. <300000, c. d. 500000)?

November 8, 2018 at 9:39 am

Why can't you enter the actual property values? Ideally, that's what you would do. If you are missing a value for a particular observation, you typically need to exclude the entire observation from the analysis. However, there are some ways to estimate missing values. For example, SPSS has advanced methods for imputing missing values. But you should use those only to estimate a few missing values. Your plan should be to obtain the property values. If you can't do that, it will be difficult to perform regression analysis.

There are some cases where you can’t record the exact values and it’s usually related to the observation time. This is known as censored data. A common example is in reliability analysis where you record failure times for a product. You run the experiment for a certain amount of time and you obtain some failures and know their failure times. However, some products don’t fail and you only know that their failure time is greater than the test time. There are censored regression models you can use in situations like that. However, I don’t think that applies to your subject-area, at least as far as I can tell.


November 5, 2018 at 5:52 pm

thank you so much Jim! this is really helpful 🙂

November 5, 2018 at 10:03 pm

You’re very welcome! Best of luck with your analysis!

November 5, 2018 at 5:16 pm

The variances (SDs) for the 3 groups are 0.45, 0.7, and 1. Would you say that they vary by a lot? Another follow-up question: does a narrower CI equal a better estimate?

November 5, 2018 at 5:26 pm

Yes, that’s definitely it!

I would suggest using Welch's one-way ANOVA to analyze it and potentially use that analysis to calculate the CI. You're essentially performing a one-way ANOVA, and in ANOVA there is the assumption of equal variances between groups, which your data do not satisfy. In regression, we'd refer to that as heteroscedasticity. In Welch's ANOVA, you don't need to satisfy that assumption, which makes it a simple solution for your case.

In terms of CIs, yes, narrower CIs indicate that the estimate is more precise than if you had a wider CI. Think of the CI as a margin of error around the estimate and it’s good to have a smaller margin of error. With a narrower CI, you can expect the actual mean to fall closer to the fitted value.
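A sketch of Welch's one-way ANOVA in Python; note this assumes the third-party pingouin package is available, since base scipy/statsmodels don't expose it directly. The y values are simulated with the unequal SDs described above:

```python
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(9)
df = pd.DataFrame({
    "x": np.repeat([0, 40, 80], 5),   # three groups, 5 points each
    "y": np.concatenate([
        rng.normal(2.8, 0.45, 5),     # unequal SDs: 0.45, 0.7, 1.0
        rng.normal(3.0, 0.70, 5),
        rng.normal(3.3, 1.00, 5),
    ]),
})
print(pg.welch_anova(data=df, dv="y", between="x"))
```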

November 5, 2018 at 4:21 pm

Thank you so much for the quick response! I checked the residual plots; they give me a trend line right at y = 0, and my R-squared = 0.87. However, the CI from using all 15 points (regression inference) is a little wider (2.012 – 3.655) than if I just use those 5 points (2.245 – 3.355). In this case, would you still prefer using all 15 points?

November 5, 2018 at 4:38 pm

That’s tricky. I hate to throw out data, but it does seem warranted. At least you have a good rationale for not using the data!

CIs of the mean for a point at the end of a data range in a regression model do tend to be wider than in the middle of the range. Still, I'm not sure why it would be wider here. Are the variances of the groups roughly equal? If not, that might well be the reason.

November 5, 2018 at 2:36 pm

Suppose I have a total of 15 data points at x = 0, x = 40, and x = 80 (5 data points at each x value). I can use regression to estimate y when x = 60. But what if I want to estimate the average when x = 0? Should I just use the 5 data points at x = 0, or use the intercept from the regression line? Which gives the best estimate and 95% CI for the average y value when x = 0?

Thank you 🙂

November 5, 2018 at 3:52 pm

Assuming that model provides a good fit to the data (check the residual plots), I’d use all the data to come up with the CI for the fitted value that corresponds to X = 0. That approach uses more data to calculate the estimate. Your CI might even be more precise (narrower) using all the data.
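Here is a sketch of getting that CI for the fitted mean at x = 0 from the full 15-point regression with statsmodels (the y values are simulated, since the commenter's data aren't available):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(10)
df = pd.DataFrame({"x": np.repeat([0, 40, 80], 5)})   # 5 points at each x
df["y"] = 2.8 + 0.01 * df.x + rng.normal(0, 0.5, 15)

fit = smf.ols("y ~ x", data=df).fit()
pred = fit.get_prediction(pd.DataFrame({"x": [0]}))
print(pred.conf_int(alpha=0.05))  # 95% CI for the mean response at x = 0
```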


October 26, 2018 at 5:27 am

Hi, what makes us use linear regression instead of other types of regression? In other words, what is the motivation for selecting a linear model?

October 26, 2018 at 10:48 am

Typically, try linear regression first. If your data contain curvature, you might still be able to use linear regression. Linear regression is generally easier to use and includes some useful statistics that nonlinear regression can’t provide, such as p-values for the coefficients and R-squared.

However, if you can't adequately fit the curvature in your data, it might be time to try nonlinear regression. While both types allow you to fit curvature, nonlinear regression is more flexible because it allows your model to fit more types of curvature.

I’ve written a post about how to choose between linear and nonlinear regression that you should read. Within that post are various related links that talk about how to fit curves using both types of regression, along with additional information about both types.
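To make the distinction concrete, here is a sketch contrasting a linear model that fits curvature (via a squared term; "linear" refers to the parameters) with true nonlinear regression. The data and the exponential-rise function are illustrative assumptions:

```python
import numpy as np
import statsmodels.api as sm
from scipy.optimize import curve_fit

rng = np.random.default_rng(12)
x = np.linspace(0, 10, 80)
y = 3 * (1 - np.exp(-0.5 * x)) + rng.normal(0, 0.1, 80)

# Linear regression: linear in the *parameters*, yet it fits a curve.
X = sm.add_constant(np.column_stack([x, x**2]))
print(sm.OLS(y, X).fit().rsquared)

# Nonlinear regression: the parameters sit inside a nonlinear function.
popt, _ = curve_fit(lambda x, a, b: a * (1 - np.exp(-b * x)), x, y, p0=[1, 1])
print(popt)  # estimates of a and b
```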


October 26, 2018 at 1:17 am

Thank you so much for your reply. I am really eager to learn much more about this. I shall keep sending mails seeking your reply, which I hope you will not mind.

October 25, 2018 at 5:02 am

I have been unfortunate not to get your reply to my comment of 18/09/2018.

October 25, 2018 at 9:29 am

Sorry about the delay. As you can no doubt imagine, my schedule gets busy and things can fall through the cracks.

I replied under your original comment.


October 23, 2018 at 2:28 pm

Your blog has been really helpful! 🙂 I am currently completing my Masters thesis, and my primary outcome is to assess the relationship between diabetes distress and blood glucose control. I am a newbie to SPSS, and I am at a loss as to how best to analyse my small data set (not normally distributed pre- and post-data-transformation).

I have been advised that regression analysis may be appropriate and better than correlations? However, my data do not appear to be linear. My diabetes distress variables consist of a score of 1-6 based on a Likert scale and are also categorical (low, moderate, high distress), and my blood glucose consists of continuous data plus a categorical variable of poorly controlled vs. well controlled blood glucose.

At the moment I am struggling to complete this analysis. Any help would be greatly appreciated 🙂


October 21, 2018 at 5:06 pm

Dear Jim, thatk you very much for this post! Could you, please, explain the following.

You are writing: “you first need to fit and verify that you have a good model. Then, you look through the regression coefficients and p-values”

What if I have a small R-squared, but the coefficients are statistically significant with small p-values?


October 15, 2018 at 5:37 am

Hi Jim, thanks for your enlightening explanations. However, I want to engage you a bit. Under how to interpret regression results, you indicated that a small p-value indicates that the "independent variable is statistically significant." I tend not to agree. Note that since the null hypothesis is that the coefficient of the independent variable is equal to zero, its rejection, as evidenced by a low p-value, should imply that it is the coefficient which is significantly different from zero, and not the variable. almadi

October 15, 2018 at 9:56 am

Yes, you’re correct that the p-value tests whether the coefficient estimate is significantly different from zero. If it is, you can say that the coefficient is statistically significant. Alternatively, statisticians often say that the independent variable is statistically significant. In this context, these are two different ways of saying the same thing because the coefficient is a property of the variable itself.

September 18, 2018 at 5:44 am

As you must be well aware, the government releases price indices, and these are broadly used to determine the effect of base prices during a given period of time.

The construction industry normally uses these price indices, running over a period of time, to redetermine prices based on the movement between the base date and the current date, which is called price adjustment.

After a few years, the government releases a new series of price indices, and we may not have index data for the old series, which necessitates using the new indices with a conversion factor to arrive at the equivalent value of the base price.

Where do you feel regression analysis could help when we have to determine the current value of the base price using the new indices?

It is a bit amusing that someone was suggesting to me.

V.G.Subramanian

October 25, 2018 at 9:27 am

I agree that switching price indices can be a problem. If the indices overlap, you can perform regression analysis where the old index is the independent variable and the new index is the dependent variable. However, that is problematic if you don’t have both indices. If you had both indices, I suppose it wouldn’t be a problem to begin with!

Ideally, you’d understand the differences behind how the government calculates both indices, and you could use that to estimate the value of the other index.

I’m not particularly familiar with this practice, so I don’t have a whole lot of insight into it. I hope this helps somewhat!


September 2, 2018 at 8:15 am

Thank you for this, Jim. I’ve always felt a common sense explanation minus all the impressive math formulas is what is needed in statistics for data science. This is a big part of the basics I’ve been missing. I’m looking forward to your Logistic Regression Tutorial. How is that coming along for you?

September 2, 2018 at 2:59 pm

Hi Antonio,

Thanks so much for your kind words! They mean a lot to me! Yes, I totally agree, explanations should focus on being intuitive and helping people grasp the concepts.

I have written a post on binary logistic regression. Unfortunately, it'll be a while before I have a chance to write a more in-depth article; there are just too many subjects to write about!


July 19, 2018 at 2:55 am

Dear sir, I have a few questions about when to use ANOVA and when to use regression analysis. In my study, I conducted an experiment with temperature, pH, and weight of a compound as independent variables and extraction as the dependent variable (I mention these very generally, but I have some specific independent and dependent variables along with them). I did the statistical analysis using one-way ANOVA with Tukey's test and used a grouping method (letters a, b, c, …) to show significance based on the p-value. My question is: for this type of data, can I use regression analysis? And what is the main difference between Tukey's test and regression analysis?

July 19, 2018 at 11:14 am

Both regression analysis and ANOVA are linear models. As linear models, both types of analyses have the same math “under the hood.” You can even use them interchangeably and get the same results. Traditionally, you use ANOVA when you have only, or mainly, categorical factors–although you can add in covariates (continuous variables). On the other hand, you tend to use regression when you have only, or mainly, continuous variables–although you can add in categorical variables.

Because ANOVA focuses on categorical factors and comparing multiple group means, statisticians have developed additional post hoc analyses to work with ANOVA, such as Tukey’s test. Typically, you’ll perform the ANOVA first and then the post hoc test. Suppose you perform a one-way ANOVA and obtain significant results. This significance tells you that not all of the group means are equal. However, it does not tell you which differences are statistically significant.

That point is where post hoc tests come in. These tests do two things. They'll tell you which differences are statistically significant. They also control the familywise error rate for the group of comparisons. When you compare multiple differences like that, you increase the risk of a Type I error, which is when you say there is a difference when there really isn't one. When you compare multiple means, the Type I error rate will be higher than your significance level (alpha). These post hoc tests (other than Fisher's) maintain the Type I error rate so it continues to equal alpha, which is what you would expect.

So, use an ANOVA first. If you obtain significant results for a categorical factor, you can use post hoc tests like Tukey’s to explore the differences between the various factor levels.

I really need to write a blog post about this! I will soon!

In the meantime, I hope this helps!
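For readers who want to try that workflow, here is a minimal sketch in Python using scipy and statsmodels (my own illustration; the group names and effect sizes are made up):

```python
# Minimal sketch: one-way ANOVA followed by Tukey's HSD post hoc test.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)
low = rng.normal(50, 5, 30)    # e.g., extraction at low temperature
med = rng.normal(55, 5, 30)
high = rng.normal(62, 5, 30)

f_stat, p_value = stats.f_oneway(low, med, high)   # omnibus ANOVA
print(f_stat, p_value)

values = np.concatenate([low, med, high])
groups = ["low"] * 30 + ["med"] * 30 + ["high"] * 30
print(pairwise_tukeyhsd(values, groups))  # which pairs differ, at alpha = 0.05
```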


May 28, 2018 at 5:28 am

Is it necessary to conduct correlation analysis before regression analysis?

May 30, 2018 at 11:02 am

Hi Kaushal,

No it’s not absolutely required. I actually prefer producing a series of scatterplots (or a matrix plot) so I can see the nature of the different relationships. That helps give me a better feel for the data along with the types of relationships. However, if you have a good theory and a solid background knowledge on which variables should be included in the model, you can go straight to modeling. I think it depends a lot on your existing level of knowledge.

That all said, I personally like knowing the correlation structure between all of the variables. It gives me a better feel for the data.
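A minimal sketch of that workflow in Python with pandas and matplotlib (the data frame here is hypothetical and simulated):

```python
# Minimal sketch: inspect correlations and scatterplots before modeling.
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
df = pd.DataFrame({
    "x1": rng.normal(size=100),
    "x2": rng.normal(size=100),
})
df["y"] = 2 * df["x1"] - df["x2"] + rng.normal(size=100)

print(df.corr())                     # pairwise correlation structure
scatter_matrix(df, figsize=(6, 6))   # matrix plot of all relationships
plt.show()
```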


April 28, 2018 at 11:45 pm

Thank you Jim!

I really appreciate it!

April 28, 2018 at 7:38 am

Hi Jim, I hope you are having a good time!

I would like to ask you a question, please!

I have 24 observations for a regression analysis (let's say zones), and I have many independent variables (IVs). I would like to know the minimum number of observations I should have to fit a reasonable linear regression model. I would also like to hear from you about how to test many regression models with different IVs, since I cannot use many IVs in a model where I have few observations (24).

Thank you in advance!

April 28, 2018 at 2:26 pm

Hi Patrik, great to hear from you again!

Those are great questions. For 24 observations, I'd say that you usually wouldn't want more than 2 IVs. I wrote an entire post about how many variables you can include in a regression model. Including too many IVs (and other terms, such as interactions and polynomials) is known as overfitting the model. Check that post out because it'll provide guidance and show you the dangers of including too many.

There’s another issue a play too because you want to compare a number of different regression models to each other. If you compare many models, it’s a form of data mining. The risk here is that if you compare enough models, you will uncover chance correlations. These chance correlations look like the real thing but only appear in your sample and not the population. I’ve written a post about how using this type of data mining to choose a regression model causes problems . This concern is particularly problematic with a small sample size like yours. It can find “patterns” in randomly generated data.

So, there’s really two issues for you to watch out for–overfitting and chance correlations found through data mining!

Hope this helps!
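A quick simulation makes the overfitting danger visible. This is a minimal sketch in Python with statsmodels: with 24 observations and purely random predictors, R-squared climbs as IVs are added even though no real relationships exist:

```python
# Minimal sketch: overfitting with too many IVs and a small sample.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 24
y = rng.normal(size=n)          # random outcome: no true signal at all

for k in (2, 5, 10, 20):
    X = sm.add_constant(rng.normal(size=(n, k)))   # k random IVs
    fit = sm.OLS(y, X).fit()
    # R-squared inflates with k; adjusted R-squared stays near zero or below
    print(k, round(fit.rsquared, 2), round(fit.rsquared_adj, 2))
```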

April 5, 2018 at 5:51 am

Many thanks, Jim!!! You have no idea how much you helped me.

Very well clarified!!!

God bless you always!!!

April 4, 2018 at 1:33 am

Hi Jim, I am everywhere in your post!

I am starting to love statistics; that's why I am not quiet.

I have some questions for you:

To use OLS regression, one of the assumptions (as I understood it) is that the dependent variable is normally distributed. To achieve this requirement, what should I do with my data? Should I check the normality of my dependent variable, for example using the Shapiro-Wilk test (etc.)? If I conclude that my dependent variable does not follow the normal distribution, should I start looking at data transformations? Another way I have seen people analyze normality is by plotting the dependent variable against the independent variable; if the relationship doesn't follow a linear trend, they go to data transformation (which one do you recommend?). Or should I perform the regression on my original data, and then the residuals will show me the non-normality if it exists?

When should I transform my independent variables, and what is the consequence of transforming them?

Sorry, I tend to ask many questions in a single comment, but I think this is the way to understand the full picture of my doubts.

You are being so useful to me,

Thank you again!

April 4, 2018 at 11:11 am

Hi Patrik, I’m so happy to hear that you’re starting to love statistics! It’s a great field that is exciting. The thrill of discovery combined with getting the most value out of your data. I’m not sure if you’ve read my post about The Importance of Statistics , but if you haven’t, I recommend it. It explains why the field of statistics is more important than ever!

In OLS regression, the dependent variable does not have to be normally distributed. Instead, you need to assess the distribution of the residuals using residual plots. If your residuals are not normally distributed, there are a variety of possible reasons and different ways to resolve the issue. I always recommend that transforming your data be the last resort. For example, the residuals might be nonnormal because the model is specified incorrectly. Maybe there is curvature in the data that you aren't modeling correctly? If so, transforming the data might mask the problem. You really want to specify the best possible model. However, if all else fails, you might need to transform the data. When you do, you'll need to back-transform the results to make sense of them, because everything applies to the transformed data. Most statistical software should do this for you.

Be aware that you can’t trust R-squared and the standard error of the regression when you transform your dependent variable because they apply to the transformed data rather than the raw data (backtransformation won’t help there).

In terms of testing the normality of the residuals, I recommend using normal probability plots. You can usually tell at a glance whether the residuals are normally distributed. If you need a formal test, I generally use the Anderson-Darling test, which you can see in action in my post about identifying the distribution of your data. By the way, as a case in point, the data in that post are not normal, but I use them as the dependent variable in OLS regression in this post about using regression to make predictions. The residuals are normally distributed even though the dependent variable is not.
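A minimal sketch of those residual checks in Python with scipy and statsmodels (the data are simulated; scipy's probplot and anderson functions are used as documented):

```python
# Minimal sketch: check residual normality with a probability plot and the
# Anderson-Darling test, rather than testing the raw dependent variable.
import numpy as np
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 150)
y = 1.5 * x + rng.normal(scale=2, size=150)

fit = sm.OLS(y, sm.add_constant(x)).fit()
resid = fit.resid

stats.probplot(resid, dist="norm", plot=plt)  # normal probability (QQ) plot
plt.show()

result = stats.anderson(resid, dist="norm")   # Anderson-Darling test
print(result.statistic, result.critical_values)
```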


March 29, 2018 at 2:27 am

In the coffee intake and smoking example, the first result showed that higher coffee intake leads to higher mortality, but after including smoking, coffee intake led to lower or no mortality. Smoking was revealed to cause the mortality, but how did coffee intake now show the opposite result? Was a separate test run for this result? Please let me know. S. Chatterjee

March 29, 2018 at 10:36 am

Hi, that’s a great question. It turns out that coffee and smoking are correlated. The negative effects of smoking on mortality are well documented. However, for some reason, the researchers did not originally include smoking in their model. Because drinking coffee and smoking are correlated, the variable for coffee consumption took on some of smoking’s effect on mortality.

Put another way, because smoking was not included in the model, it was not being controlled (held constant). So, as you increased coffee consumption, smoking also tended to increase because it is both positively correlated with coffee consumption and not in the model. Therefore, it appeared as though increased coffee consumption is correlated with higher mortality rates but only because smoking was not included in the model.

Presumably, the researchers had already collected data about smoking. So, all they had to do was include the smoking variable in their regression model. Voila, the model now controls for smoking and the new output displays the new estimate of the effect that coffee has on mortality.

This point illustrates a potential problem. If the researchers had not collected the smoking information, they would have really been stuck. Before conducting any study, researchers need to do a lot of background research to be sure that they are collecting the correct data!
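A small simulation shows the mechanism. This is a minimal sketch in Python with statsmodels, using made-up effect sizes: coffee has no true effect on mortality, but omitting the correlated smoking variable makes it look harmful:

```python
# Minimal sketch: an omitted confounder distorts a coefficient.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 1000
smoking = rng.normal(size=n)
coffee = 0.7 * smoking + rng.normal(size=n)      # correlated with smoking
mortality = 2.0 * smoking + rng.normal(size=n)   # coffee's true effect = 0

# Omitting smoking: coffee wrongly appears to raise mortality.
print(sm.OLS(mortality, sm.add_constant(coffee)).fit().params)

# Including smoking: coffee's coefficient drops to about zero.
X = sm.add_constant(np.column_stack([coffee, smoking]))
print(sm.OLS(mortality, X).fit().params)
```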


March 20, 2018 at 12:25 pm

Hi Jim, I hope all is well.

I have faced a problem with plotting the relationship between the dependent variable (response) and the independent variables. When I do the main effect plots, I get a straight, increasing line, y = x. To change this linear trend, I need to make y the square root of time.

I'm stuck on this and couldn't find a solution for it.


March 5, 2018 at 9:37 am

I was wondering if you can help me? I am doing my dissertation and I have 1 within-subjects IV and 3 between-subjects IVs. Most of my variables are categorical, but one is not: it is a questionnaire which I am using to determine sleep quality, with both Likert scales and free-form answers for amount of sleep (hours), number of times woken in the night, etc. Can I use a regression when making use of both categorical data and other types? I also have multiple DVs (angry/sad Likert ratings), but I *could* combine those into one overall 'emotion' DV. Any help would be much appreciated!

March 5, 2018 at 10:11 am

Hi Cara, because your DVs use a Likert scale, you really should be using ordinal logistic regression. This type of regression is designed for ordinal dependent variables like yours. As for the IVs, using ordinal variables can be tricky; they're not quite either continuous or categorical. My suggestion is to try them as continuous variables and check the residual plots to see how they look. If they look good, then it's probably OK. However, if they don't look good, you can refit the model using them as categorical variables and then recheck the residual plots. If the residuals still don't look good, you can then try the chi-square test of independence for ordinal data.

As for combining the data, that would seem to be a subject-area specific decision, and I don’t know that area well enough to make an informed recommendation.
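For readers working in Python rather than SPSS, here is a minimal sketch of ordinal logistic regression using statsmodels' OrderedModel (available in statsmodels 0.12 and later; the sleep variable and cut points below are hypothetical):

```python
# Minimal sketch: ordinal logistic regression for a Likert-style DV.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)                        # e.g., hours of sleep
latent = 1.2 * x + rng.logistic(size=n)       # unobserved propensity
rating = pd.cut(latent, bins=[-np.inf, -1, 1, np.inf],
                labels=["low", "moderate", "high"])

# Integer codes 0 < 1 < 2 encode the ordered categories.
endog = pd.Series(rating).cat.codes
model = OrderedModel(endog, x[:, None], distr="logit")
fit = model.fit(method="bfgs", disp=False)
print(fit.summary())   # slope plus threshold (cut-point) estimates
```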


February 25, 2018 at 5:36 pm

Yes. But it may be that you missed my point. I argue that a proper and sound experiment will allow you to test for causality, regardless of whether you deploy, e.g., Pearson's r or regression. With no experimental design, neither Pearson's r nor a regression will test for an effect relationship between the variables. Randomisation makes a better case for controlling for variables that you are unaware of than picking a few and then proclaiming that your study found that x will cause an increase in y or that x has an effect on y. You may as well argue that you don't need to control for any variables and claim that any correlational study tests for effect relationships.

February 25, 2018 at 8:14 pm

Hi Martin, yes, that is exactly what I’m saying. Whether you can draw causal conclusion depends on whether you used a randomized experiment to collect your data. If it’s an observational study, you can’t assume it’s anything other than correlation. What you write in your comment agrees with what I’m saying.

The controlling for other variables that I mention in this post is a different matter. Yes, if you include a variable in a regression model, it is held constant while estimating the effects of the other variables. That doesn’t mean you can assume causality though.

February 25, 2018 at 5:04 pm

No statistical tool or method turns a survey or correlation study into an experiment; i.e., regression does not test or imply a cause-and-effect relationship. A positive relationship between smoking and cancer in a regression analysis does not mean that smoking causes cancer. You have not controlled for what you are unaware of.

February 25, 2018 at 5:22 pm

Hi Martin, you are 100% correct about the fact that correlation doesn’t imply causation. This issue is one that I plan to cover in future posts.

There are two issues at play here. The type of study under which the data were collected and the statistical findings.

Being able to determine causation comes down to the difference between an observational study versus a randomized experiment. You actually use the same analyses to assess both types of designs. In an observational study, you can only establish correlation and not causality. However, in a randomized experiment, the same patterns and correlations in the data can suggest causality. So, regression analysis can help establish causality, but only when it’s performed on data that were collected through a randomized experiment.


February 6, 2018 at 7:11 am

Very nicely explained. Thank you!

February 6, 2018 at 10:04 am

Thank you, Hari!


December 1, 2017 at 2:40 am

Thanks for your reply and for the guidance.

I read your posts, which are very helpful. After reading them, I concluded that only independent variables with a well-established association with the dependent variable should be included. Hence, in my case, variable Z should not be included, given that the association of Z with the dependent variable is not well-established.

Furthermore, suppose there is another variable (A) and the literature suggests that it generally has an association with the dependent variable. However, assume that A does not affect any of the independent variables, so there is no omitted variable bias. In this case, if there is no data available for A (due to the study being conducted in a different environment/context), then what statistical techniques can be deployed to address any problems caused by the exclusion of A?

I look forward to your reply and will be grateful for your answer.

Kind regards.

November 30, 2017 at 9:10 am

Thanks for the reply. I apologise if I am taking a considerable time out of your schedule.

Based on the literature, there isn't any conclusive evidence that Z is a determinant of Y, which is why I intend to remove Z. Some studies include it while some do not, and some find a significant association (between Y and Z) while others find the association insignificant. Hence, I think I can safely remove it.

Moreover, I will be grateful if you can answer another query. From a statistical viewpoint, is it fine to use the Generalized Method of Moments (GMM) for a binary dependent variable?

November 30, 2017 at 2:24 pm

While I can’t offer you a concrete statement about whether you should include or exclude the variable (clearly there is disagreement in your own field), I do suggest that you read my article about specifying the correct regression model . I include a number of tips and considerations.

Unfortunately, I don’t know enough about GMM to make a recommendation. All of the examples I have seen personally are for continuous data, but I don’t know about binary data.

November 29, 2017 at 11:12 am

Thanks for your reply. I really appreciate it. Could you please also provide an answer to my query mentioned below for further clarification?

November 29, 2017 at 11:01 am

Further clarification on my above post: from the internet, I found that if a variable (Z) is related to Y but unrelated to X, then including Z will reduce the standard errors of X. So, if Z is excluded but the F-statistic and adjusted R-squared are fine, do the higher standard errors create problems? I look forward to your reply.

November 29, 2017 at 11:50 am

Yes, what you read is correct. Typically, if Z is statistically significant, you should include it in your model. If you exclude it, the precision of your coefficient estimates will be lower (higher standard errors). You also risk a biased model because you are not including important information in the model–check the residual plots. The F-test of overall significance and adjusted R-squared depend on the other IVs in your model. If Z is by far the best variable, it’s possible that removing it will cause the F-test to not be significant and adjusted R-square might drop noticeably. Again, that depends on how the explanatory power of Z compares to the other IVs. Why do you want to remove a significant variable?

November 29, 2017 at 10:29 am

Thanks for the reply, Jim.

I am unable to understand "Your model won't fit the data as well as before, depending on the strength of the relationship between the dropped independent variable and the dependent variable". Are you stating that the other independent variables will be fine but R-squared will become low? I will be grateful if you can explain this.

Kind regards

November 29, 2017 at 11:04 am

Hi, you indicated that the removed independent variable is related to the dependent variable, but it is not correlated with the other independent variables. Consequently, removing that independent variable should reduce R-squared. For one thing, that’s the typical result of removing variables, even when they’re not statistically significant. In this case, because it is not correlated to the other independent variables, you know that the removed variable is supplying unique information. Taking that variable out means that information is no longer included in the model. R-squared will definitely go down, possibly dramatically.

R-squared measures the strength of the relationship between the entire set of IVs and the DV. Read my post about R-squared for more information.
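A minimal sketch in Python with statsmodels illustrates both effects, using simulated data where Z is related to Y but uncorrelated with X:

```python
# Minimal sketch: dropping a relevant IV that is uncorrelated with the
# other IVs lowers R-squared and widens the remaining standard errors.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 300
x = rng.normal(size=n)
z = rng.normal(size=n)                     # unrelated to x, related to y
y = 1.0 * x + 2.0 * z + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()
reduced = sm.OLS(y, sm.add_constant(x)).fit()

print(full.rsquared, reduced.rsquared)     # R-squared drops without z
print(full.bse[1], reduced.bse[1])         # x's standard error grows
```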

November 29, 2017 at 9:15 am

Hello, Jim.

What is the impact* on the independent variables in the model if I omit a variable that is a determinant of dependent variable but is not related to any of the independent variables?

*Here impact relates to the independent variables’ p-values and the coefficients.

November 29, 2017 at 10:13 am

If the independent variable is not correlated with the other independent variables, it’s likely that there would be a minimal effect on the other independent variables. Your model won’t fit the data as well as before depending on the strength of the relationship between the dropped independent variable and the dependent variable. You should also check the residual plots to be sure that by removing the variable you’re not introducing bias.


October 26, 2017 at 12:33 am

Why do we usually use a 5% level of significance for comparisons, instead of 1% or some other level?

October 26, 2017 at 12:49 am

Hi, I actually write about this topic in a post about hypothesis testing. It's basically a tradeoff between several different error rates, plus a dash of tradition. Read that post and see if it answers your questions.

October 24, 2017 at 11:30 pm

Sir, usually we take a 5% level of significance for comparing. Why 0?

October 24, 2017 at 11:35 pm

Hi Ghulam, yes, the significance level is usually 0.05. I’m not sure what you’re asking about in regards to zero? The p-values in the example output are all listed as 0.000, which is less than the significance level of 0.05, so they are statistically significant.


October 23, 2017 at 9:08 am

In my model, I use several independent variables. My question is: before using regression, do I need to check the distribution of the data? If yes, please name the tests. My title is "Education and Productivity Nexus: Evidence from the Pharmaceutical Sector in Bangladesh".

October 23, 2017 at 11:22 am

Hi Shamsun, typically you test the distribution of the residuals after you fit a model. I've written a blog post about checking your residual plots that you should read.

I hope this helps! Jim


October 22, 2017 at 4:24 am

Thank you Mr. Jim

October 22, 2017 at 11:15 am

You’re very welcome!


October 22, 2017 at 2:31 am

In linear regression, can we use categorical variables as independent variables? If yes, what should be the minimum or maximum number of categories in an independent variable?

October 22, 2017 at 10:44 pm

Hi, yes you can use categorical variables as independent variables! The number of groups really depends on what makes sense for your study area. Of course, the minimum is two. There really is no maximum in theory. It depends on what makes sense for your study. However, in practice, having more groups requires a larger total sample size, which can become expensive. If you have 2-9 groups, you should have at least 15 in each group. For 10-12 groups, you should have 20. These numbers are based on simulation studies for ANOVA, but they also apply to categorical variables in regression. In a nutshell, figure out what makes sense for your study and then be sure to collect enough data!

I hope this helps! Jim
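For anyone who wants to try this outside a point-and-click package, here is a minimal sketch in Python using statsmodels' formula interface, which dummy-codes a categorical IV automatically (the group labels and effect sizes are invented):

```python
# Minimal sketch: a categorical IV in linear regression via the formula API.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
df = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 30),   # 3 groups, 30 each
    "x": rng.normal(size=90),
})
effects = {"a": 0.0, "b": 2.0, "c": 5.0}       # made-up group effects
df["y"] = df["group"].map(effects) + 1.5 * df["x"] + rng.normal(size=90)

fit = smf.ols("y ~ C(group) + x", data=df).fit()   # C() marks categorical
print(fit.summary())   # coefficients compare each group to the baseline "a"
```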



