The general purpose of multiple regression (the term was first used by Pearson, 1908) is to analyze the relationship between several independent variables (also called regressors or predictors) and a dependent variable. For example, a real estate agent might record for each listing the size of the house (in square feet), the number of bedrooms, the average income in the neighborhood according to census data, and a subjective rating of the attractiveness of the house. Once this information has been collected for various houses, it would be interesting to see whether these characteristics are related to the price at which a house was sold. For example, it might turn out that the number of bedrooms is a better predictor of the selling price of a house in a particular neighborhood than the subjective rating of its attractiveness. Outliers might also show up,
Salary = .5 * Resp + .8 * No_Super
Once this so-called regression line has been determined, the analyst can plot the expected (predicted) salaries against the actual salaries the company pays. The analyst can then determine which positions are underpaid (they lie below the regression line), which are overpaid (they lie above the regression line), and which are paid adequately.
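The comparison of predicted and actual pay can be sketched in a few lines. This is only an illustration: the coefficients come from the equation above, while the function names, the example position, and the tolerance parameter are hypothetical.

```python
def predicted_salary(resp, no_super):
    """Predicted salary from the equation Salary = .5 * Resp + .8 * No_Super."""
    return 0.5 * resp + 0.8 * no_super


def classify_position(actual, resp, no_super, tolerance=0.0):
    """Compare actual pay with the regression prediction.

    `tolerance` is a hypothetical margin within which pay counts as adequate.
    """
    residual = actual - predicted_salary(resp, no_super)
    if residual < -tolerance:
        return "underpaid"   # the point lies below the regression line
    if residual > tolerance:
        return "overpaid"    # the point lies above the regression line
    return "adequate"


# Made-up position: Resp = 10, No_Super = 5, actual salary = 8 (arbitrary units)
label = classify_position(8.0, resp=10, no_super=5)
```

In practice a nonzero tolerance would probably be chosen from the spread of the residuals, since no real position lies exactly on the line.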
Multiple regression procedures are very widely used in research in the social and natural sciences. In general, multiple regression allows the researcher to ask (and hopefully answer) the general question "what is the best predictor of ...". For example, an education researcher might want to know which factors best predict success in high school. A psychologist might want to determine which personality traits best predict an individual's social adjustment. Sociologists might want to find out which social indicators best predict how well a new immigrant group will adapt and how far it will be absorbed into society. Note that the term "multiple" refers to the presence of several predictors or regressors in the model.
The general computational problem that needs to be solved in multiple regression analysis is to fit a straight line to a set of points.
Least squares. In a scatterplot, we have an independent or X variable and a dependent or Y variable. These variables may, for example, represent IQ (intelligence as measured by a test) and academic achievement (grade point average; GPA), respectively. Each point in the plot represents the data for one student.
Regression equation. A straight line in a plane (in two-dimensional space) is given by the equation Y = a + b * X; that is, the variable Y can be expressed in terms of a constant (a) and a slope (b) multiplied by the variable X. The constant is also referred to as the intercept, and the slope as the regression coefficient or B-coefficient. For example, GPA may best be predicted as 1 + .02 * IQ. Thus, knowing that a student has an IQ of 130, you would predict a GPA close to 3.6 (since 1 + .02 * 130 = 3.6).
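As a minimal sketch of how the constant a and the slope b can be obtained by least squares, the closed-form solution for a single predictor is shown below. The function name `fit_line` and the data points are made up for illustration; the points are placed exactly on the line 1 + .02 * IQ so that the fit recovers the coefficients from the text.

```python
import numpy as np


def fit_line(x, y):
    """Least-squares estimates (a, b) for the line Y = a + b * X."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: sum of cross-products of deviations over sum of squared deviations of X
    b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means
    a = y_mean - b * x_mean
    return a, b


# Hypothetical students whose GPA lies exactly on the line 1 + .02 * IQ
iq = np.array([90.0, 100.0, 110.0, 120.0, 130.0])
gpa = 1 + 0.02 * iq

a, b = fit_line(iq, gpa)
predicted_gpa = a + b * 130  # prediction for a student with an IQ of 130
```

With real data the points would scatter around the line, and a and b would be the values that minimize the sum of squared vertical deviations.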
For example, the animation below shows the confidence intervals (90%, 95%, and 99%) plotted for a two-dimensional regression equation.
In the multivariate case, when there is more than one independent variable, the regression line cannot be drawn in two-dimensional space, but it can be estimated just as easily. For example, if in addition to IQ you have other predictors of achievement (for example, Motivation and Self-discipline), you can construct a linear equation containing all of these variables. In general, then, multiple regression procedures estimate the parameters of a linear equation of the form:
Y = a + b1 * X1 + b2 * X2 + ... + bp * Xp
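A linear equation with several predictors can be estimated with an ordinary least-squares solver. The sketch below uses NumPy's `numpy.linalg.lstsq`; the function name `fit_multiple_regression`, the simulated data, and the chosen coefficients are assumptions for illustration only, and the data are noise-free so that the known coefficients are recovered.

```python
import numpy as np


def fit_multiple_regression(X, y):
    """Return (a, b) for Y = a + b1*X1 + ... + bp*Xp by least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that the first coefficient is the constant a
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]


# Made-up standardized scores for IQ, Motivation, and Self-discipline
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_b = np.array([0.5, 0.3, 0.2])
y = 2.0 + X @ true_b  # noise-free outcome, for illustration

a, b = fit_multiple_regression(X, y)
```

Solving through `lstsq` rather than inverting the cross-product matrix explicitly is a deliberate choice here, since it behaves better when predictors are strongly correlated.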
Unique prediction and partial correlation. Regression coefficients (or B-coefficients) represent the independent contributions of each independent variable to the prediction of the dependent variable. In other words, the variable X1, for example, is correlated with the variable Y after controlling for all other independent variables. This type of correlation is also referred to as a partial correlation (the term was first used by Yule, 1907). An example may clarify the concept. You would probably find a significant negative correlation between hair length and height in the population (shorter people have longer hair). At first this may seem odd; however, if the variable Gender is added to the multiple regression equation, this correlation will probably disappear. This is because women, on average, have longer hair than men, and they are also, on average, shorter than men. Thus, once the gender difference is removed by entering Gender into the equation, the relationship between hair length and height disappears, because hair length makes no unique contribution to the prediction of height beyond what it shares with the variable Gender. Put another way, after controlling for Gender, the partial correlation between hair length and height is zero. More generally, if one variable is correlated with another, this may simply reflect the fact that both are correlated with a third variable or with a set of variables.
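The hair-length example can be imitated with simulated data. The sketch below is an illustration only: the group means, the spreads, and the helper names `residualize` and `partial_correlation` are invented, and the partial correlation is computed by the standard residual method (remove the control variable from both variables, then correlate what remains).

```python
import numpy as np


def residualize(v, control):
    """Part of v that is not linearly explained by control (with intercept)."""
    design = np.column_stack([np.ones(len(control)), control])
    coef, *_ = np.linalg.lstsq(design, v, rcond=None)
    return v - design @ coef


def partial_correlation(x, y, control):
    """Correlation between x and y after removing the control variable from both."""
    return np.corrcoef(residualize(x, control), residualize(y, control))[0, 1]


# Simulated population: gender drives both height and hair length (made-up numbers)
rng = np.random.default_rng(42)
n = 5000
female = rng.integers(0, 2, size=n).astype(float)
height = 178.0 - 13.0 * female + rng.normal(0.0, 7.0, size=n)  # cm
hair = 10.0 + 20.0 * female + rng.normal(0.0, 8.0, size=n)     # cm

raw = np.corrcoef(hair, height)[0, 1]                 # clearly negative
partial = partial_correlation(hair, height, female)   # close to zero
```

In this simulation hair length has no effect of its own on height, so any raw correlation arises entirely through the shared dependence on gender.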
Predicted values and residuals. The regression line expresses the best prediction of the dependent variable (Y) given the independent variables (X). However, nature is rarely (if ever) perfectly predictable, and there is usually substantial scatter of the observed points around the fitted line (as shown earlier in the scatterplot). The deviation of a particular point from the regression line (from its predicted value) is called the residual.
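Predicted values and residuals are simple to compute once a line has been fitted. The following sketch uses made-up data and `numpy.polyfit`; it takes the residual as observed minus predicted, which is one common sign convention (the opposite sign, predicted minus observed, is also in use).

```python
import numpy as np

# Made-up observations scattered around a straight line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# polyfit returns coefficients from the highest degree down: slope, then intercept
b, a = np.polyfit(x, y, deg=1)

predicted = a + b * x
residuals = y - predicted  # observed minus predicted (a sign convention)
```

For a least-squares line with a constant term, the residuals sum to zero and are uncorrelated with X, which gives a quick sanity check on any fit.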
Residual variance and the coefficient of determination R-squared. The smaller the variability of the residuals around the regression line relative to the overall variability, the better the prediction. For example, if there is no relationship between the X and Y variables, then the ratio of the residual variability of Y to its original variance is equal to 1.0.
Interpreting the multiple correlation coefficient R. The degree to which two or more predictors (independent or X variables) are related to the dependent variable (Y) is usually expressed by the multiple correlation coefficient R. By definition, it is the square root of the coefficient of determination. It is a non-negative quantity that takes values between 0 and 1. To interpret the direction of the relationship between variables, one looks at the signs (plus or minus) of the regression coefficients or B-coefficients. If a B-coefficient is positive, the relationship of that variable with the dependent variable is positive (for example, the higher the IQ, the higher the grade point average); if a B-coefficient is negative, the relationship is negative (for example, the smaller the class size, the higher the average test scores). Of course, if a B-coefficient is equal to 0, there is no relationship between the variables.
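Both quantities can be computed directly from observed and predicted values. The sketch below assumes the usual definition of R-squared as 1 minus the ratio of residual variability to total variability; the function names are made up, and the clipping at zero is an added safeguard for predictions that do not come from a least-squares fit with a constant.

```python
import numpy as np


def r_squared(y, predicted):
    """Coefficient of determination: 1 - residual variability / total variability."""
    y = np.asarray(y, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((y - predicted) ** 2)   # variability left around the line
    ss_tot = np.sum((y - y.mean()) ** 2)    # original variability of Y
    return 1.0 - ss_res / ss_tot


def multiple_r(y, predicted):
    """Multiple correlation coefficient R, the square root of R-squared."""
    # Clip at 0 so the root is defined even for poor, non-least-squares predictions
    return float(np.sqrt(max(r_squared(y, predicted), 0.0)))


y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r_squared(y, y)                             # no residual variability
useless = r_squared(y, np.full_like(y, y.mean()))     # predicting the mean only
```

Predicting the mean for every case leaves all of the original variability unexplained, which is the situation of no relationship between X and Y.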
Assumptions, limitations and discussion of practical issues
Assumption of linearity. First of all, as the name multiple linear regression implies, it is assumed that the relationship between the variables is linear. In practice this assumption can virtually never be confirmed; fortunately, multiple regression procedures are not greatly affected by minor deviations from it. However, it always makes sense to look at bivariate scatterplots of the variables of interest. If curvature in the relationship is evident, one may consider either transforming the variables or explicitly including nonlinear terms.
Normality assumption. In multiple regression, it is assumed that the residuals (predicted minus observed values) are normally distributed (
Limitations. The major conceptual limitation of all regression techniques is that they can only reveal relationships between variables, never the causal mechanisms underlying them. For example, you may find a strong positive relationship (correlation) between the damage a fire causes and the number of firefighters involved in fighting it. Should one conclude that firefighters cause the damage? Of course, the most likely explanation of this correlation is that the size of the fire (an external variable that was not included in the study) affects both the amount of damage and the number of firefighters called in (
Choice of the number of variables. Multiple regression is a seductive technique: it tempts the user to include every available variable as a predictor in the hope that some of them will turn out to be significant. The reason is that simply including as many variables as possible as predictors of some other variable of interest capitalizes on chance. The problem is compounded when, in addition, the number of observations is relatively small. Intuitively, it is clear that one can hardly draw conclusions from an analysis of a 100-item questionnaire based on the answers of 10 respondents. Most authors recommend at least 10 to 20 observations (respondents) per variable; otherwise the estimates of the regression line will probably be very unstable and unlikely to replicate if the study is repeated.
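How strongly chance can be exploited is easy to demonstrate with pure noise. In the sketch below, which uses invented random data and hypothetical variable names, 10 observations are "explained" by 9 predictors that have nothing to do with the outcome; with a constant term there are as many parameters as observations, so the fit is perfect.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_pred = 10, 9

X = rng.normal(size=(n_obs, n_pred))  # predictors: pure noise
y = rng.normal(size=n_obs)            # outcome: unrelated noise

# Constant plus 9 predictors gives 10 parameters for 10 observations
design = np.column_stack([np.ones(n_obs), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

predicted = design @ coef
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot  # essentially 1 although nothing is really related
```

A fit of this kind would not be expected to hold up in a new sample, which is the instability the rule of thumb above guards against.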
Multicollinearity and matrix ill-conditioning. Multicollinearity is a common problem in many correlation analyses. Imagine that you have two predictors (X variables) of a person's height: (1) weight in pounds and (2) weight in ounces. Obviously, having both predictors is completely redundant; weight is one and the same variable whether it is measured in pounds or ounces. Trying to decide which of the two measures is the better predictor would be rather silly; however, this is exactly what you would be doing if you performed a multiple regression analysis with height as the dependent variable (Y) and the two measures of weight as the independent variables (X). When many variables are involved, it is often not immediately apparent that this problem exists, and it may only show up after several variables have already been entered into the regression equation. Nevertheless, when the problem occurs, it means that at least one of the independent variables (predictors) is completely redundant given the other predictors. There are many statistical indicators of this type of redundancy (tolerance, semipartial R, etc.), as well as a number of remedies (for example, ridge regression).
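The pounds-and-ounces situation can be detected numerically. The sketch below uses made-up weights; it checks the rank of the design matrix and computes the tolerance of one predictor, taken here in its usual sense of 1 minus the R-squared of that predictor regressed on the others (with a single other predictor, that R-squared is just the squared correlation).

```python
import numpy as np

# Made-up body weights, recorded twice in different units
weight_lb = np.array([120.0, 150.0, 180.0, 200.0, 165.0, 140.0])
weight_oz = 16.0 * weight_lb  # the same information in ounces

design = np.column_stack([np.ones(len(weight_lb)), weight_lb, weight_oz])

# Three columns but only two independent directions: the matrix is rank deficient
rank = np.linalg.matrix_rank(design)

# Tolerance of weight_oz given weight_lb: 1 - squared correlation
r = np.corrcoef(weight_lb, weight_oz)[0, 1]
tolerance = 1.0 - r ** 2  # essentially 0, i.e. completely redundant
```

A tolerance near zero or a rank below the number of columns signals that the coefficients of the redundant predictors cannot be determined separately.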
Fitting centered polynomial models. Fitting higher-order polynomials of an independent variable with a non-zero mean can create serious multicollinearity problems. Specifically, the resulting polynomial terms will be highly correlated with one another because of the mean of the original independent variable. With large numbers (for example, Julian dates), the problem becomes very serious, and without appropriate precautions it can lead to wrong results. The solution in this case is to center the independent variable, that is, to subtract its mean before computing the polynomial terms.
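The effect of centering, understood as subtracting the mean before forming the polynomial terms, can be checked numerically. The sketch below uses 100 consecutive made-up day numbers of about the size of Julian dates and compares the correlation between the linear and the squared term before and after centering.

```python
import numpy as np

# 100 consecutive large values, similar in magnitude to Julian day numbers
x = np.arange(2450000.0, 2450100.0)

# Uncentered: x and x**2 are almost perfectly correlated
raw_corr = np.corrcoef(x, x ** 2)[0, 1]

# Centered: subtract the mean first, then square
xc = x - x.mean()
centered_corr = np.corrcoef(xc, xc ** 2)[0, 1]
```

Because these values are symmetric around their mean, the centered linear and squared terms are uncorrelated; for skewed data centering would reduce the correlation substantially without necessarily removing it.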
The importance of residual analysis. Although most assumptions of multiple regression cannot be tested exactly, the researcher can detect deviations from them. In particular, outliers (