TOPIC 3 (2015), class notes in English
TOPIC 3: MULTIPLE REGRESSION
LINEAR REGRESSION WITH MULTIPLE REGRESSORS: ESTIMATION
Topic 2 ended on a worried note:
Although school districts with lower student-teacher ratios (STR) tend to have higher test scores in the California data set, perhaps
students from districts with small classes have other advantages that help them perform well on standardized tests. Could this have
produced misleading results, and, if so, what can be done?
Is our regression correctly predicting the truth? In other words, is the effect of increasing or decreasing STR faithfully reflected in our results?
Omitted factors can, in fact, make the OLS estimator of the effect of class size on test scores misleading or, more precisely, biased. This
topic explains this “omitted variable bias” and introduces multiple regression, a method that can eliminate this bias.
KEY IDEA OF MULTIPLE REGRESSION: If we have data on these omitted variables, then we can include them as additional regressors and thereby estimate the effect of one regressor (the student-teacher ratio) while holding constant the other variables.
ASPECTS IN COMMON WITH REGRESSION WITH A SINGLE REGRESSOR:
o The coefficients can be estimated from data using OLS.
o The OLS estimators are random variables because they depend on data from a random sample.
o In large samples, the sampling distribution of the OLS estimators is approximately normal.
OMITTED VARIABLE BIAS: If the regressor (the student-teacher ratio) is correlated with a variable that has been omitted from the analysis (the percentage of English learners, that is, students who are still learning the language) and that determines, in part, the dependent variable (test scores), then the OLS estimator will have omitted variable bias.
Omitted variable bias occurs when two conditions are both true:
o The omitted variable is correlated with the included regressor X.
o The omitted variable is a determinant of the dependent variable Y.
Laura Aparicio, ECONOMETRICS I
EXAMPLE #1: PERCENTAGE OF ENGLISH LEARNERS
CONDITION 1: There is a small positive correlation [0.19], which suggests that districts with more English learners tend to have a higher student-teacher ratio (larger classes).
CONDITION 2: It is plausible that students who are still learning English will do worse on standardized tests than native English speakers, in which case the percentage of English learners is a determinant of test scores.
Thus omitting the percentage of English learners may introduce omitted variable bias.
EXAMPLE #2: TIME OF DAY OF THE TEST
CONDITION 1: If the time of day of the test varies from one district to the next in a way that is unrelated to class size, then the time of day and class size are uncorrelated, so condition 1 fails.
CONDITION 2: The time of day of the test could affect scores (alertness varies through the school day), so condition 2 holds.
Thus omitting the time of day of the test does NOT result in omitted variable bias, because both conditions are needed.
EXAMPLE #3: PARKING LOT SPACE PER PUPIL
CONDITION 1: Schools with more teachers per pupil probably have more teacher parking space, so condition 1 holds.
CONDITION 2: Under the assumption that learning takes place in the classroom, not the parking lot, parking lot space has no direct effect on learning, so condition 2 fails.
Thus omitting the parking lot space per pupil does NOT result in omitted variable bias.
SUMMARY: OMITTED VARIABLE BIAS AND THE FIRST LEAST SQUARES ASSUMPTION
Omitted variable bias means that the first least squares assumption, $E(u_i \mid X_i) = 0$, does not hold. WHY? The error term u_i collects all the determinants of Y_i that are not included in the regression. If one of these factors is correlated with X_i, then the error term itself is correlated with X_i. Because u_i and X_i are correlated, the conditional mean of u_i given X_i is nonzero.
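To see the mechanism in numbers, here is a minimal simulation sketch. All values are hypothetical and chosen only for illustration: a regressor x, an omitted variable w that is positively correlated with x, and true coefficients beta1 = -1.0 and beta2 = -0.6. Regressing y on x alone leaves w in the error term, and the estimated slope settles near beta1 + beta2·cov(x, w)/var(x) rather than near beta1, even with a large sample.

```python
import random

def ols_slope(x, y):
    """Slope of the OLS regression of y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

random.seed(0)
n = 20000
beta1, beta2 = -1.0, -0.6   # hypothetical true coefficients (assumption)

# w plays the role of the omitted variable; x is built to be correlated with it.
w = [random.gauss(0, 1) for _ in range(n)]
x = [0.5 * wi + random.gauss(0, 1) for wi in w]
y = [5.0 + beta1 * xi + beta2 * wi + random.gauss(0, 1) for xi, wi in zip(x, w)]

# "Short" regression: y on x only, so w ends up in the error term.
slope_short = ols_slope(x, y)

# Population value of the short-regression slope in this design:
# cov(x, w) = 0.5 and var(x) = 0.5**2 + 1 = 1.25.
limit = beta1 + beta2 * 0.5 / 1.25

print("short-regression slope:", round(slope_short, 3))
print("beta1 + beta2*cov(x,w)/var(x):", round(limit, 3))
```

In this design the omitted variable lowers y and is positively correlated with x, so the short regression overstates the (negative) effect of x. Increasing n does not remove the gap, which is what inconsistency means here.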
OLS ESTIMATOR IS BIASED AND INCONSISTENT
o Omitted variable bias is a problem whether the sample size is large or small.
o Whether this bias is large or small in practice depends on the correlation $\rho_{Xu}$ between the regressor and the error term. The larger $|\rho_{Xu}|$ is, the larger the bias.
o The direction of the bias in $\hat{\beta}_1$ depends on whether X and u are positively or negatively correlated.
Omitted variable bias formula: $\hat{\beta}_1 \xrightarrow{p} \beta_1 + \rho_{Xu}\,\dfrac{\sigma_u}{\sigma_X}$
(The proof and the worked example are not reproduced here.)
THREE WAYS TO OVERCOME OMITTED VARIABLE BIAS:
A. Run a randomized controlled experiment in which the treatment (STR) is randomly assigned: then EL_PCT (the percentage of English learners) is still a determinant of TestScore [$\beta_w \neq 0$], but EL_PCT is uncorrelated with STR, as is any other factor in U’.
B. Adopt a “cross tabulation” approach, with finer and finer gradations of STR and EL_PCT. This method consists of dividing the data into as many groups as possible. Some problems: we soon run out of data, and the approach does not account for other determinants such as family income or parental education.
Districts are broken into four categories that correspond to the quartiles of the distribution of the percentage of English learners across districts.
Within each of these four categories, districts are further broken down into two groups, depending on whether the student-teacher ratio is small (STR < 20) or large (STR ≥ 20).
TOTAL: districts are divided into 8 groups.
Over the full sample of 420 districts, the average test score is 7.4 points higher in districts with a low student-teacher ratio than a high one; the t-statistic is 4.04, so the null hypothesis that the mean test score is the same in two groups is rejected at the 1% significance level.
But if we look at the final four rows of the table and hold the percentage of English learners constant, the difference in performance between districts with high and low student-teacher ratios is perhaps half (or less) of the overall estimate of 7.4 points.
C. Multiple regression: the omitted variable is no longer omitted because we include EL_PCT as an additional regressor.
The multiple regression model is a linear regression model that includes multiple regressors, X1, X2, …, Xk. Associated with each regressor is a regression coefficient, β1, β2, …, βk. The coefficient β1 is the expected change in Y associated with a 1-unit change in X1, holding the other regressors constant. The other regression coefficients have an analogous interpretation.
POPULATION MULTIPLE REGRESSION MODEL:
$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + u_i$, where $i = 1, \ldots, n$
POPULATION REGRESSION LINE:
$E(Y_i \mid X_{1i}, \ldots, X_{ki}) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki}$
VARIABLES:
o Y is the dependent variable.
o X1, …, Xk are the independent (or explanatory) variables. One or more of the independent variables in the multiple regression model are sometimes referred to as control variables.
o β0 is the intercept parameter. It is the expected value of Yi when all the regressors (X1 and X2 in the two-regressor case) are zero. In other words, it determines how far up the Y axis the population regression line starts.
o β1 is the slope coefficient of X1: the change in Y for a unit change in X1, holding the other factors constant. The interpretation of β1 is different than it was when there was only one regressor: β1 is the effect on Y of a unit change in X1, holding X2 constant (controlling for X2).
o β2 is the slope coefficient of X2.
o ui are the unobserved factors that influence Y, other than the included regressors. Once a previously omitted variable is included as a regressor, it is no longer part of u.
The error term ui in the multiple regression model is HOMOSKEDASTIC if the variance of the conditional distribution of ui given X1i, …, Xki is constant and thus does not depend on the values of the regressors. Otherwise, the error term is HETEROSKEDASTIC.
All these coefficients can be estimated using OLS.
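The OLS estimators $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ are the values of $b_0, b_1, \ldots, b_k$ that minimize the sum of squared prediction mistakes $\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_{1i} - \cdots - b_k X_{ki})^2$. The OLS predicted values $\hat{Y}_i$ and residuals $\hat{u}_i$ are:
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \cdots + \hat{\beta}_k X_{ki}$, $i = 1, \ldots, n$
$\hat{u}_i = Y_i - \hat{Y}_i$, $i = 1, \ldots, n$
The OLS estimators $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ and residuals $\hat{u}_i$ are computed from a sample of n observations of $(X_{1i}, \ldots, X_{ki}, Y_i)$, $i = 1, \ldots, n$. These are estimators of the unknown true population coefficients $\beta_0, \beta_1, \ldots, \beta_k$ and error term $u_i$.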
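The minimization above can be sketched in a few lines of code. This is only an illustration, not how regression software is actually implemented: the helper name `ols` and the tiny data set are hypothetical, and the coefficients are obtained by solving the normal equations with plain Gauss-Jordan elimination.

```python
def ols(X, y):
    """OLS coefficients [b0, b1, ..., bk] for rows X[i] = [x1i, ..., xki].

    Solves the normal equations (Z'Z) b = Z'y by Gauss-Jordan elimination,
    where Z is X with a leading column of ones for the intercept.
    """
    Z = [[1.0] + list(row) for row in X]
    m = len(Z[0])
    # Augmented matrix [Z'Z | Z'y].
    A = [[sum(z[r] * z[c] for z in Z) for c in range(m)]
         + [sum(z[r] * yi for z, yi in zip(Z, y))] for r in range(m)]
    for col in range(m):
        # Partial pivoting; a (near) zero pivot signals perfect multicollinearity.
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        if abs(A[piv][col]) < 1e-12:
            raise ValueError("regressors are perfectly multicollinear")
        A[col], A[piv] = A[piv], A[col]
        for r in range(m):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[r][m] / A[r][r] for r in range(m)]

# Hypothetical data generated exactly by Y = 1 + 2*X1 - 3*X2 (no error term),
# so OLS should recover these coefficients up to rounding.
X = [[1, 0], [0, 1], [2, 1], [3, 5], [4, 2]]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in X]

b = ols(X, y)
y_hat = [b[0] + b[1] * x1 + b[2] * x2 for x1, x2 in X]   # predicted values
u_hat = [yi - yh for yi, yh in zip(y, y_hat)]            # residuals
print([round(v, 6) for v in b])
```

With real data the residuals are not zero; the example uses error-free data only so that the result is easy to check by eye.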
APPLICATION TO TEST SCORES AND THE STUDENT-TEACHER RATIO
BEFORE: We used OLS to estimate the intercept and slope coefficient of the regression relating TestScore to STR, using our 420 observations for California school districts. (The estimated OLS regression line is not reproduced here.)
PROBLEM: IS THIS RELATIONSHIP MISLEADING BECAUSE STR MIGHT BE PICKING UP THE EFFECT OF HAVING MANY ENGLISH LEARNERS IN DISTRICTS WITH LARGE CLASSES?
NOW: We are now in a position to address this concern by using OLS to estimate a multiple regression in which the dependent variable is the test score (Y) and there are two regressors:
1. X: STR
2. W: PERCENTAGE OF ENGLISH LEARNERS (PctEL)
(The estimated OLS regression line for this multiple regression, for our 420 districts, is not reproduced here.)
The estimated effect on test scores of a change in STR in the multiple regression is approximately half as large as when STR was the only regressor. This difference occurs because the coefficient on STR in the multiple regression is the effect of a change in STR, holding constant (or controlling for) PctEL, whereas in the single-regressor regression, PctEL is not held constant.
These two estimates can be reconciled by concluding that there is omitted variable bias in the estimate from the single-regressor model.
Previously we saw that districts with a high percentage of English learners tend to have: (1) low test scores and (2) a high student-teacher ratio. If the fraction of English learners is omitted from the regression, reducing the STR is estimated to have a larger effect on test scores, but this estimate reflects BOTH the effect of a change in the STR and the omitted effect of having fewer English learners in the district.
MEASURES OF FIT IN MULTIPLE REGRESSION
Three commonly used summary statistics in multiple regression are the standard error of the regression (SER), the regression $R^2$ and the adjusted $R^2$ (also known as $\bar{R}^2$).
All three measure how well the OLS estimate of the multiple regression line describes, or “fits”, the data.
STANDARD ERROR OF THE REGRESSION (SER)
The SER estimates the standard deviation of the error term $u_i$. Thus the SER is a measure of the spread of the distribution of Y around the regression line. In multiple regression, the SER is:
$SER = s_{\hat{u}}$, where $s_{\hat{u}}^2 = \dfrac{1}{n-k-1}\sum_{i=1}^{n} \hat{u}_i^2 = \dfrac{SSR}{n-k-1}$
and SSR is the sum of squared residuals. The only difference between this formula and the SER for the single-regressor model is that here the divisor is n - k - 1 rather than n - 2. As in previous sections, using this divisor instead of n is called a degrees-of-freedom adjustment.
If there is a single regressor, then 𝑘 = 1, so both formulas are the same. When n is large, the effect of the degrees-of-freedom adjustment is negligible.
THE R2
The mathematical definition of the R2 is the same as for regression with a single regressor:
$R^2 = \dfrac{ESS}{TSS} = 1 - \dfrac{SSR}{TSS}$
where ESS is the explained sum of squares, TSS is the total sum of squares and SSR is the sum of squared residuals.
In multiple regression, the R2 increases whenever a regressor is added, unless the estimated coefficient on the added regressor is exactly zero. In practice, it is extremely unusual for an estimated coefficient to be exactly zero, so in general the SSR will decrease when a new regressor is added. This means that the R2 generally increases (and never decreases) when a new regressor is added.
Because the R2 increases when a new variable is added, an increase in the R2 does not mean that adding a variable actually improves the fit of the model. In this sense, the R2 gives an inflated estimate of how well the regression fits the data.
THE ADJUSTED R2
One way to correct for this is to deflate or reduce R2 by some factor, and this is what the adjusted R2 does:
$\bar{R}^2 = 1 - \dfrac{n-1}{n-k-1}\,\dfrac{SSR}{TSS}$
THREE USEFUL THINGS TO KNOW:
o $\bar{R}^2$ is always less than $R^2$ because (n - 1)/(n - k - 1) is always greater than 1.
o Adding a regressor has two opposite effects:
a. The SSR falls, which increases $\bar{R}^2$.
b. The factor (n - 1)/(n - k - 1) increases, which decreases $\bar{R}^2$.
Whether $\bar{R}^2$ increases or decreases depends on which of these two effects is stronger.
o $\bar{R}^2$ can be negative. This happens when the regressors, taken together, reduce the SSR by such a small amount that this reduction fails to offset the factor (n - 1)/(n - k - 1).
The R2 and adjusted R2 are useful because they quantify the extent to which the regressors account for, or explain, the variation in the dependent variable.
Nevertheless, heavy reliance on these statistics can be a trap.
The decision about whether to include a variable in a multiple regression should be based on whether including that variable allows you to better estimate the causal effect of interest. We return later to the issue of how to decide which variables to include and which to exclude.
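As a small numerical sketch of the three measures of fit, the function below computes the SER, the R2 and the adjusted R2 from observed values, fitted values and the number of regressors k. The function name `fit_measures` and the data are hypothetical; the fitted values are simply assumed to come from some regression with k = 2 regressors.

```python
import math

def fit_measures(y, y_hat, k):
    """Return (SER, R2, adjusted R2) for a regression with k regressors."""
    n = len(y)
    y_bar = sum(y) / n
    ssr = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # sum of squared residuals
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    ser = math.sqrt(ssr / (n - k - 1))                     # degrees-of-freedom adjustment
    r2 = 1 - ssr / tss
    adj_r2 = 1 - (n - 1) / (n - k - 1) * ssr / tss
    return ser, r2, adj_r2

# Hypothetical observed values and fitted values (assumed, for illustration only).
y = [3.0, 5.0, 4.0, 8.0, 10.0, 9.0]
y_hat = [3.5, 4.5, 4.5, 7.5, 9.5, 9.5]
ser, r2, adj_r2 = fit_measures(y, y_hat, k=2)
print(round(ser, 3), round(r2, 3), round(adj_r2, 3))

# If the regressors explain nothing (fitted values equal the sample mean),
# R2 is zero and the adjusted R2 is negative.
useless = fit_measures(y, [sum(y) / len(y)] * len(y), k=2)
print(round(useless[1], 3), round(useless[2], 3))
```

The second call illustrates the point made above: the adjusted R2 penalizes regressors that barely reduce the SSR, and it can fall below zero.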
LEAST SQUARES ASSUMPTIONS
ASSUMPTION #1: THE CONDITIONAL DISTRIBUTION OF ui GIVEN X1i, …, Xki HAS A MEAN OF ZERO
This assumption means that sometimes Yi is above the population regression line and sometimes Yi is below the population regression line, but on average over the population Yi falls on the population regression line.
Therefore, for any value of the regressors, the expected value of ui is zero. THIS IS THE KEY ASSUMPTION THAT MAKES THE OLS ESTIMATORS UNBIASED.
ASSUMPTION #2: (X1i, …, Xki, Yi), i = 1, …, n ARE INDEPENDENTLY AND IDENTICALLY DISTRIBUTED
This assumption holds automatically if the data are collected by simple random sampling.
ASSUMPTION #3: LARGE OUTLIERS (= observations with values far outside the usual range of the data) ARE UNLIKELY
This assumption serves as a reminder that, as in the single-regressor case, the OLS estimator of the coefficients in the multiple regression model can be sensitive to large outliers. In other words, we assume that the regressors and the dependent variable have nonzero finite fourth moments.
ASSUMPTION #4: NO PERFECT MULTICOLLINEARITY
The regressors are said to exhibit perfect multicollinearity if one of the regressors is a perfect linear function of the other regressors, which makes it impossible to calculate the OLS estimator. At an intuitive level, perfect multicollinearity is a problem because you are asking the regression to answer an illogical question. In multiple regression, the coefficient on one of the regressors is the effect of a change in that regressor, holding the other regressors constant. In the hypothetical regression of TestScore on STR and STR, the coefficient on the first occurrence of STR is the effect on test scores of a change in STR, holding constant STR. This makes no sense.
In general, the software will do one of two things: either it will drop one of the occurrences of STR or it will refuse to calculate the OLS estimates and give an error message.
SUMMARY
The coefficients in multiple regression can be estimated by OLS. When the four least squares assumptions are satisfied, the OLS estimators are unbiased, consistent and normally distributed in large samples.
EXTRA: IMPLICATIONS OF ASSUMPTION #1 (CONTROL VARIABLES; IMPLICATION ON CONDITIONAL MEAN INDEPENDENCE)
EXTRA: ASSUMPTION #4 IN STATA
(The content of these two extra sections is not reproduced here.)
Because the data differ from one sample to the next, different samples produce different values of the OLS estimators. This variation is summarized in the SAMPLING DISTRIBUTION OF THE OLS ESTIMATORS.
Under the least squares assumptions:
o The OLS estimators $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ are unbiased and consistent estimators of $\beta_0, \beta_1, \ldots, \beta_k$ in the linear multiple regression model.
o In large samples, the joint distribution of $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ is well approximated by a multivariate normal distribution, which is the extension of the bivariate normal distribution to the general case of two or more jointly normal random variables.
As discussed previously, PERFECT MULTICOLLINEARITY arises when one of the regressors is a perfect linear combination of the other regressors. We are going to see:
I. Some examples of perfect multicollinearity.
II. How perfect multicollinearity can arise, and be avoided, in regressions with multiple binary regressors.
III. What imperfect multicollinearity is.
(I) EXAMPLES OF PERFECT MULTICOLLINEARITY
We’ll examine three hypothetical regressions.
EXAMPLE #1. We have three regressors: STR, PctEL (percentage of English learners) and FracEL (fraction of English learners, which varies between 0 and 1). The regressors would be perfectly multicollinear because PctEL = 100·FracEL.
EXAMPLE #2. We have two regressors: STR and NVS (“not very small classes”, a binary variable that equals 1 if STR ≥ 12 and 0 otherwise). But, in fact, there are no districts in our data set with STR < 12; as you can see in the scatterplot, the smallest value of STR is 14. Now recall that the linear regression model with an intercept can equivalently be thought of as including a regressor, X0, that equals 1 for all i. The regressors would be perfectly multicollinear because NVS = X0 for every observation in the data set.
EXAMPLE #3. We have three regressors: STR, PctEL (percentage of English learners) and PctES (percentage of English speakers). The regressors would be perfectly multicollinear because PctEL = 100 - PctES, which can also be written as PctEL = 100·X0 - PctES.
IMPORTANT! Perfect multicollinearity is a feature of the entire set of regressors. In Example #3, if either the intercept (X0) or one of the other regressors involved were excluded from the regression, the regressors would not be perfectly multicollinear.
(II) THE DUMMY VARIABLE TRAP
Another possible source of perfect multicollinearity arises when multiple binary, or dummy, variables are used as regressors.
Imagine you have partitioned the school districts into three categories: rural, suburban and urban. If you include all three variables in the regression along with a constant, the regressors would be perfectly multicollinear because Rural + Suburban + Urban = 1 = X0.
To solve this problem you have to exclude one of these four variables, either one of the binary indicators or the constant term.
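The dummy variable trap can be checked mechanically: a set of regressors is perfectly multicollinear when the matrix of regressors has fewer linearly independent columns than columns. The sketch below is only illustrative. The `rank` helper and the five districts are hypothetical, and the rank is computed with exact fractions to avoid rounding issues.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix (given as a list of rows) by exact Gaussian elimination."""
    M = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for c in range(len(M[0])):
        # Find a row at or below position r with a nonzero entry in column c.
        piv = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(r + 1, len(M)):
            f = M[i][c] / M[r][c]
            M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

# Hypothetical districts: columns are X0 (constant), Rural, Suburban, Urban.
all_dummies = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
]
drop_urban = [row[:3] for row in all_dummies]   # keep X0, Rural, Suburban

print(len(all_dummies[0]), rank(all_dummies))   # number of columns vs. rank
print(len(drop_urban[0]), rank(drop_urban))
```

With all three dummies and the constant there are four columns but only three independent ones, so OLS cannot be computed. After dropping one dummy, the number of columns and the rank agree. The omitted category then acts as the reference group.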
(III) IMPERFECT MULTICOLLINEARITY
Imperfect multicollinearity means that two or more of the regressors are highly correlated. Imperfect multicollinearity does not pose any problems for the theory of the OLS estimators; indeed, a purpose of OLS is to sort out the independent influences of the various regressors when these regressors are potentially correlated.
If the regressors are imperfectly multicollinear, then the coefficient on at least one of the regressors will be imprecisely estimated.
EXAMPLE: Consider the regression of TestScore on STR and PctEL. Suppose we were to add a third regressor, the percentage of the district’s residents who are first-generation immigrants. First-generation immigrants often speak English as a second language, so the variables PctEL and percentage immigrants will be highly correlated.
Districts with many recent immigrants will tend to have many students who are still learning English. Because these two variables are highly correlated, it would be difficult to use these data to estimate the partial effect on test scores of an increase in PctEL, holding constant the percentage of immigrants.
In other words, the data set provides little information about what happens to test scores when the percentage of English learners is low but the fraction of immigrants is high, or vice versa. If the least squares assumptions hold, then the OLS estimator of the coefficient on PctEL in this regression will be unbiased; however, it will have a larger variance than if the regressors PctEL and percentage immigrants were uncorrelated.
More generally, when multiple regressors are imperfectly multicollinear, the coefficients on one or more of these regressors will be imprecisely estimated (that is, they will have a large sampling variance).
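To get a feel for how much precision is lost, the sketch below uses the standard large-sample variance formula for the two-regressor case with homoskedastic errors, var(beta1-hat) = (1/n) · 1/(1 - rho²) · sigma_u²/sigma_X1², where rho is the correlation between the two regressors. That formula is brought in here as an assumption (it is not derived in these notes), and the function name and input values are hypothetical.

```python
def var_beta1(n, sigma_u2, sigma_x1_2, rho):
    """Large-sample variance of beta1-hat with two regressors and
    homoskedastic errors; rho is the correlation between X1 and X2."""
    return (1 / n) * (1 / (1 - rho ** 2)) * (sigma_u2 / sigma_x1_2)

# Hypothetical inputs: n = 420, error variance and regressor variance set to 1.
baseline = var_beta1(420, 1.0, 1.0, 0.0)
for rho in (0.0, 0.5, 0.9, 0.99):
    ratio = var_beta1(420, 1.0, 1.0, rho) / baseline
    print(rho, round(ratio, 2))   # variance relative to uncorrelated regressors
```

The ratio depends only on rho: the closer the correlation between the regressors is to 1 in absolute value, the larger the sampling variance of the coefficient, while the estimator itself remains unbiased.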
PERFECT MULTICOLLINEARITY: It often signals the presence of a logical error. Perfect multicollinearity, which occurs when one regressor is an exact linear function of the other regressors, usually arises from a mistake in choosing which regressors to include in a multiple regression. Solving perfect multicollinearity requires changing the set of regressors.
IMPERFECT MULTICOLLINEARITY: It is not necessarily an error, but rather just a feature of OLS, your data, and the question you are trying to answer.
CONCLUSION
Regression with a single regressor is vulnerable to omitted variable bias: if an omitted variable is a determinant of the dependent variable and is correlated with the regressor, then the OLS estimator of the slope coefficient will be biased and will reflect both the effect of the regressor and the effect of the omitted variable.
Multiple regression makes it possible to mitigate omitted variable bias by including the omitted variable in the regression. The coefficient on a regressor, X, in multiple regression is the partial effect of a change in X, holding constant the other included regressors.
Finally, the least squares assumptions for multiple regression are extensions of the three least squares assumptions for regression with a single regressor, plus a fourth assumption ruling out perfect multicollinearity. Because the regression coefficients are estimated using a single sample, the OLS estimators have a joint sampling distribution and therefore have sampling uncertainty.
This sampling uncertainty must be quantified as part of an empirical study, and the ways to do so in the multiple regression model are the topic of the next chapter.
HYPOTHESIS TESTS AND CONFIDENCE INTERVALS IN MULTIPLE REGRESSION: INFERENCE
First of all, hypothesis tests and confidence intervals for a single regression coefficient are carried out using essentially the same procedures that were used in the one-variable linear regression model of TOPIC 2. For example, a 95% confidence interval for β1 is given by $\hat{\beta}_1 \pm 1.96\,SE(\hat{\beta}_1)$.
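The single-coefficient procedure can be sketched as follows. The function name `t_test` and the numbers passed to it are hypothetical (they are not estimates from these notes); the p-value uses the large-sample normal approximation.

```python
import math

def t_test(beta_hat, se, beta_null=0.0):
    """t-statistic, two-sided large-sample p-value and 95% confidence interval."""
    t = (beta_hat - beta_null) / se
    # For a standard normal, 2 * (1 - Phi(|t|)) equals erfc(|t| / sqrt(2)).
    p_value = math.erfc(abs(t) / math.sqrt(2))
    ci = (beta_hat - 1.96 * se, beta_hat + 1.96 * se)
    return t, p_value, ci

# Hypothetical estimate and standard error, for illustration only.
t, p, ci = t_test(beta_hat=-1.0, se=0.4)
print(round(t, 2), round(p, 4), tuple(round(v, 3) for v in ci))
```

The null hypothesis is rejected at the 5% level when the absolute value of the t-statistic exceeds 1.96, which is the same as the hypothesized value lying outside the 95% confidence interval.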
NEW!! Hypotheses involving more than one restriction on the coefficients are called JOINT HYPOTHESES, which can be tested using an F-statistic.
① HOW TO FORMULATE JOINT HYPOTHESES
A joint hypothesis imposes q restrictions at once; for example, the null that β1 = 0 and β2 = 0 has q = 2. Why can’t we just test the individual coefficients one at a time? Because testing one coefficient at a time does not give the right significance level for the joint null, and coefficients that are individually insignificant can be jointly significant. So the best approach, especially when the regressors are highly correlated, is the F-statistic.
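② HOW TO TEST THEM USING AN F-STATISTIC
When q = 2: the F-statistic combines the two t-statistics, $t_1$ and $t_2$:
$F = \dfrac{1}{2}\left(\dfrac{t_1^2 + t_2^2 - 2\hat{\rho}_{t_1,t_2}\,t_1 t_2}{1 - \hat{\rho}_{t_1,t_2}^2}\right)$
where $\hat{\rho}_{t_1,t_2}$ is an estimator of the correlation between the two t-statistics. If the t-statistics are uncorrelated, the formula simplifies to the average of the squared t-statistics, $F = \frac{1}{2}(t_1^2 + t_2^2)$.
When q = 1: the joint null hypothesis reduces to the null hypothesis on a single regression coefficient, and the F-statistic is the square of the t-statistic.
When there are q restrictions: the general formula is complicated and is normally built into regression software. In large samples, under the null hypothesis, the F-statistic is distributed as $F_{q,\infty}$. Thus the critical values for the F-statistic can be obtained from the tables of the $F_{q,\infty}$ distribution.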
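For the q = 2 case, the calculation can be sketched directly from two t-statistics and their estimated correlation. The function names and the input values are hypothetical. The p-value function relies on the fact that, under the null, 2·F is approximately chi-squared with 2 degrees of freedom in large samples, so the tail probability of $F_{2,\infty}$ is exp(-F); this shortcut only works for q = 2.

```python
import math

def f_stat_two(t1, t2, rho):
    """F-statistic for a joint null with q = 2 restrictions, given the two
    t-statistics and rho, the estimated correlation between them."""
    return 0.5 * (t1 ** 2 + t2 ** 2 - 2 * rho * t1 * t2) / (1 - rho ** 2)

def p_value_q2(f):
    """Large-sample p-value Pr[F_{2,inf} > f] = Pr[chi2_2 > 2f] = exp(-f)."""
    return math.exp(-f)

# Hypothetical t-statistics: each is below 1.96 in absolute value,
# so neither coefficient is individually significant at the 5% level.
f = f_stat_two(t1=1.8, t2=1.9, rho=-0.5)
p = p_value_q2(f)
print(round(f, 2), round(p, 4))

# With uncorrelated t-statistics the F-statistic is the average of the squares.
print(f_stat_two(1.0, 3.0, 0.0))
```

In this made-up example the two coefficients are individually insignificant, yet the joint null is rejected, which is why joint hypotheses should not be tested one coefficient at a time.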
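P-value: $p\text{-value} = \Pr[F_{q,\infty} > F^{act}]$, where $F^{act}$ is the value of the F-statistic actually computed.
Heteroskedasticity-robust F-statistic: In some software packages you must select a “robust” option to obtain it.
Homoskedasticity-only F-statistic: There is a link between the F-statistic and R2: a large F-statistic should be associated with a substantial increase in the R2. This version is valid only if the errors are homoskedastic:
$F = \dfrac{(R^2_{unrestricted} - R^2_{restricted})/q}{(1 - R^2_{unrestricted})/(n - k_{unrestricted} - 1)}$
o Restricted regression: the null hypothesis is forced to be true.
o Unrestricted regression: the alternative hypothesis is allowed to be true.
o $k_{unrestricted}$ = number of regressors in the unrestricted regression.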
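A short sketch of the homoskedasticity-only F-statistic, written both in terms of the R2 of the restricted and unrestricted regressions and in terms of their sums of squared residuals. The function names and all input values are hypothetical; the two versions agree because both regressions share the same total sum of squares.

```python
def f_from_r2(r2_unrestricted, r2_restricted, n, k_unrestricted, q):
    """Homoskedasticity-only F-statistic computed from the two R2 values."""
    num = (r2_unrestricted - r2_restricted) / q
    den = (1 - r2_unrestricted) / (n - k_unrestricted - 1)
    return num / den

def f_from_ssr(ssr_restricted, ssr_unrestricted, n, k_unrestricted, q):
    """The same statistic computed from the sums of squared residuals."""
    num = (ssr_restricted - ssr_unrestricted) / q
    den = ssr_unrestricted / (n - k_unrestricted - 1)
    return num / den

# Hypothetical values: n = 420 observations, 3 regressors in the unrestricted
# regression, q = 2 restrictions, and a total sum of squares of 100.
f_a = f_from_r2(r2_unrestricted=0.44, r2_restricted=0.41, n=420, k_unrestricted=3, q=2)
f_b = f_from_ssr(ssr_restricted=59.0, ssr_unrestricted=56.0, n=420, k_unrestricted=3, q=2)
print(round(f_a, 2), round(f_b, 2))
```

The resulting value is compared with the critical value of the $F_{q,\infty}$ distribution. If the errors may be heteroskedastic, the robust F-statistic reported by the software should be used instead.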
SPECIAL CASE: TEST WITH ONLY ONE RESTRICTION INVOLVING MULTIPLE COEFFICIENTS
APPROACH 1: TEST THE RESTRICTION DIRECTLY (use the F-statistic when q = 1).
APPROACH 2: TRANSFORM THE REGRESSION.
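Example: Suppose there are only two regressors, so the population regression has the form $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i$. To test a single restriction such as $\beta_1 = \beta_2$, add and subtract $\beta_2 X_{1i}$ to obtain $Y_i = \beta_0 + \gamma_1 X_{1i} + \beta_2 W_i + u_i$, where $\gamma_1 = \beta_1 - \beta_2$ and $W_i = X_{1i} + X_{2i}$. The null hypothesis becomes $\gamma_1 = 0$, which can be tested with the usual t-statistic.
MODEL SPECIFICATION FOR MULTIPLE REGRESSION
So far, we’ve distinguished variables of interest and control variables.
o Variable of interest: the regressor whose causal effect we wish to estimate.
o Control variables: regressors included to hold constant factors that, if neglected, could lead the estimated causal effect of interest to suffer from omitted variable bias. Their coefficients need not represent causal effects.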
HOW CAN WE KNOW WHETHER TO INCLUDE A PARTICULAR VARIABLE?
Regression specification proceeds by first determining a base specification chosen to address concern about omitted variable bias. The base specification can be modified by including additional regressors that address other potential sources of omitted variable bias. Simply choosing the specification with the highest R2 can lead to regression models that do not estimate the causal effect of interest.