TOPIC 2 (2015) — Course notes (English)
TOPIC 2: INTRODUCTION TO LINEAR REGRESSION
LINEAR REGRESSION WITH ONE REGRESSOR
POPULATION REGRESSION LINE:
Yi = β0 + β1·Xi + Ui,  where i = 1, …, n
X is the independent (or explanatory) variable.
Y is the dependent variable.
β0 is the intercept parameter (the value of Y when X = 0). Sometimes it has a meaningful interpretation; in other cases it just acts as the level (height) of the regression line.
β1 is the causal effect of X on Y: the parameter that gives the change in Y for a unit change in X, holding other factors constant. In the linear model the change in Y is the same for all changes in X, no matter what the initial level of X.
Ui are the unobserved factors that influence Y other than the variable X. Ui is sometimes also called the "error term" or "disturbance term" (its sample counterpart is the residual).

In the diagram: X1, X2, X3, and X4 are four hypothetical values of the explanatory variable. If the relationship between Y and X were exact, the corresponding values of Y would be represented by the points Q1–Q4 on the line. The disturbance term causes the actual values of Y to be different: it has been assumed to be positive in the first and fourth observations and negative in the other two, with the result that, if one plots the actual values of Y against the values of X, one obtains the points P1–P4.
It must be emphasized that in practice the P points are all one can see. The actual values of β0 and β1, and hence the location of the Q points, are unknown, as are the values of the disturbance term in the observations.
In our example: How can we find the value of these betas? Using the OLS estimators.
OLS ESTIMATORS = the population regression line can be estimated using sample observations by ordinary least squares (OLS). The OLS estimators of the regression intercept and slope are denoted β̂0 and β̂1.
Laura Aparicio 14 ECONOMETRICS I
Suppose that you are given the four observations on X and Y represented in the previous figure and you are asked to obtain estimates of the values of β0 and β1. As a rough approximation, you could do this by plotting the four P points and drawing a line to fit them as best you can.
This has been done in the following figure. The intersection of the line with the Y-axis provides an estimate of the intercept β0, which will be denoted β̂0, and the slope of the line provides an estimate of the slope coefficient β1, which will be denoted β̂1.
The fitted line will be written as: Ŷi = β̂0 + β̂1·Xi
Drawing a regression line by eye is all very well, but it leaves a lot to subjective judgment. The question arises: is there a way of calculating good estimates of β0 and β1 algebraically?
 Define what is known as a residual for each observation: the difference between the actual value of Y in any observation and the fitted value given by the regression line:
ûi = Yi − Ŷi
 Substituting the fitted line: ûi = Yi − β̂0 − β̂1·Xi
 Hence the residual in each observation depends on our choice of β̂0 and β̂1. Obviously, we wish to fit the regression line, that is, choose β̂0 and β̂1, in such a way as to make the residuals as SMALL AS POSSIBLE. We need to devise a criterion of fit that takes account of the size of all the residuals simultaneously. There are a number of possible criteria, some of which work better than others. One way of overcoming the problem is to minimize RSS, the residual sum of squares:
RSS = û1² + û2² + û3² + û4²
The smaller one can make RSS, the better the fit, according to this criterion. If one could reduce RSS to 0, one would have a perfect fit, for this would imply that all the residuals are equal to 0. The line would go through all the points, but of course in general the disturbance term makes this impossible.
WHY USE OLS, RATHER THAN SOME OTHER ESTIMATOR? The OLS estimator has some desirable properties: under certain assumptions, it is unbiased (that is, E(𝛽̂1 ) = β1), and it has a tighter sampling distribution than some other candidate estimators of β1.
Importantly, this is what everyone uses.
Let's now consider the GENERAL CASE where there are n observations on two variables X and Y and, supposing Y to depend on X, we will fit the equation:
Ŷi = β̂0 + β̂1·Xi
The fitted value of the dependent variable in observation i, Ŷi, will be β̂0 + β̂1·Xi, and the residual ûi will be Yi − β̂0 − β̂1·Xi. We wish to choose β̂0 and β̂1 so as to minimize the residual sum of squares, RSS, given by:
RSS = û1² + ⋯ + ûn² = Σⁿᵢ₌₁ ûi²
We will find that RSS is minimized when:
β̂1 = Σⁿᵢ₌₁(Xi − X̄)(Yi − Ȳ) / Σⁿᵢ₌₁(Xi − X̄)² = cov(X, Y) / Var(X)
β̂0 = Ȳ − β̂1·X̄

MAIN CONCEPTS
OBJECTIVE: The task of regression analysis is to obtain estimates of β0 and β1, and hence an estimate of the location of the line, given the P points. Moreover, if you were concerned only with measuring the effect of X on Y, it would be much more convenient if the disturbance term did not exist. But in fact, part of each change in Y is due to a change in u, and this makes life more difficult (u is sometimes described as noise).
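As a sketch, the two closed-form OLS formulas can be computed directly. The four observations below are hypothetical stand-ins for the P points (the numeric values behind the figure are not given in the notes):

```python
# Compute the OLS estimators from the closed-form formulas above.

def ols(x, y):
    """Return (beta0_hat, beta1_hat), the minimizers of RSS."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # beta1_hat = cov(X, Y) / Var(X); the common 1/n factors cancel
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar  # beta0_hat = Y_bar - beta1_hat * X_bar
    return beta0, beta1

# Hypothetical data standing in for the four P points:
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
b0, b1 = ols(x, y)  # b1 ≈ 1.94, b0 ≈ 0.15
```

Any statistical package's OLS routine returns the same coefficients, since they are the unique minimizers of RSS.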
INTERPRETATION
There are two stages in the interpretation of a regression equation: first, turn the equation into words so that it can be understood by a non-econometrician; second, decide whether this literal interpretation should be taken at face value or whether the relationship should be investigated further.
EXAMPLE: SLOPE: It indicates that, as S increases by one unit (of S), EARNINGS increases by 1.07 units (of EARNINGS). Since S is measured in years, and EARNINGS is measured in dollars per hour, the coefficient of S implies that hourly earnings increase by $1.07 for every extra year of schooling.
CONSTANT TERM: Strictly speaking, it indicates the predicted level of EARNINGS when S is 0. Sometimes the constant will have a clear meaning, but sometimes not. If the sample values of the explanatory variable are a long way from 0, extrapolating the regression line back to 0 may be dangerous. Even if the regression line gives a good fit for the sample of observations, there is no guarantee that it will continue to do so when extrapolated to the left or to the right.
In this case a literal interpretation of the constant would lead to the nonsensical conclusion that an individual with no schooling would have hourly earnings of –$1.39. In this data set, no individual had less than six years of schooling and only three failed to complete elementary school, so it is not surprising that extrapolation to 0 leads to trouble.
MEASURES OF FIT = A natural question is how well the regression line “fits” or explains the data. There are two regression statistics that provide complementary measures of the quality of fit: Regression R2: measures the fraction of the variance of Y that is explained by X. It’s unit-less and ranges between 0 (no fit) and 1 (perfect fit).
Standard error of the regression (SER): is an estimator of the standard deviation of the regression error.
REGRESSION R²: First, we need to recall some properties (see the grey box).
 We have seen that, after running a regression, we can split the value of Yi in each observation into two components, Ŷi and ûi:
Yi = Ŷi + ûi
 We can use this to decompose the variance of Y:
Var(Y) = Var(Ŷi + ûi) = Var(Ŷi) + Var(ûi) + 2·Cov(Ŷi, ûi)
 Now it so happens that Cov(Ŷ, û) must be equal to 0 (see the box). Hence we obtain:
Var(Y) = Var(Ŷi) + Var(ûi)
This means that we can decompose the variance of Y into two parts: Var(Ŷ), the part "explained" by the regression line, and Var(û), the "unexplained" part.
 In view of this, Var(Ŷ)/Var(Y) is the proportion of the variance explained by the regression line. This proportion is known as the coefficient of determination or, more usually, R²:
R² = Var(Ŷ) / Var(Y)
The maximum value of R² is 1. This occurs when the regression line fits the observations exactly, so that Ŷi = Yi in all observations and all the residuals are 0. Then Var(Ŷ) = Var(Y) and Var(û) = 0, and one has a perfect fit.
If there is no apparent relationship between the values of Y and X in the sample, R 2 will be close to 0.
Often it is convenient to decompose the variance as "sums of squares".

STANDARD ERROR OF THE REGRESSION (SER)
The standard error of the regression is (almost) the sample standard deviation of the OLS residuals:
SER = √( (1/(n − 2)) · Σⁿᵢ₌₁ ûi² )
CHARACTERISTICS
It has the units of û, which are the units of Y.
It measures the spread of the distribution of û.
It measures the average "size" of the OLS residual (the average "mistake" made by the OLS regression line).
The root mean squared error (RMSE) is closely related to the SER; it divides by n instead of n − 2:
RMSE = √( (1/n) · Σⁿᵢ₌₁ ûi² )
IMPORTANT! A low R² and a large SER do NOT imply that our regression is either "good" or "bad". What they tell us is that other important factors influence Y. Moreover, they do NOT tell us what these factors are, but they do indicate that X alone explains only a small part of the variation in Y in these data.
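A minimal sketch of these measures of fit, using the standard textbook definitions (SER with the n − 2 degrees-of-freedom correction, RMSE with 1/n). The data are hypothetical:

```python
import math

def fit_measures(x, y):
    """Fit OLS, then return (r2, ser, rmse) for the fitted line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) \
         / sum((a - x_bar) ** 2 for a in x)
    b0 = y_bar - b1 * x_bar
    resid = [b - (b0 + b1 * a) for a, b in zip(x, y)]
    ssr = sum(u ** 2 for u in resid)          # residual sum of squares
    tss = sum((b - y_bar) ** 2 for b in y)    # total sum of squares
    r2 = 1 - ssr / tss                        # fraction of variation explained
    ser = math.sqrt(ssr / (n - 2))            # standard error of the regression
    rmse = math.sqrt(ssr / n)                 # root mean squared error
    return r2, ser, rmse

r2, ser, rmse = fit_measures([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
# r2 is unit-free and lies in [0, 1]; ser and rmse are in the units of Y,
# and ser > rmse because it divides by n - 2 rather than n.
```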
APPLICATION TO TEST-SCORES AND CLASS-SIZE Interpretation of the estimated slope and intercept SLOPE: Districts with one more student per teacher on average have test scores that are 2.28 points lower.
INTERCEPT: The intercept (taken literally) means that, according to this estimated line, districts with zero students per teacher would have a (predicted) test score of 698.9. This interpretation of the intercept makes no sense (it extrapolates the line outside the range of the data); in this application, the intercept is not itself economically meaningful.
SPECIAL CASE OF A DUMMY VARIABLE

So far we have seen how to estimate the slope of the population regression function using the OLS estimator. But under what conditions can we interpret β̂1 in a causal way? And how should we interpret it if this condition fails? Moreover, the OLS regression line is an estimate, computed using our sample of data; a different sample would have given a different value of β̂1.
How can we:
quantify the sampling uncertainty associated with β̂1?
use β̂1 to test hypotheses such as β1 = 0?
construct a confidence interval for β1?

KEY ASSUMPTIONS OF THE MODEL = OLS provides an appropriate estimator of the unknown regression coefficients, β0 and β1, under these three assumptions:

ASSUMPTION #1: THE CONDITIONAL DISTRIBUTION OF Ui GIVEN Xi HAS A MEAN OF ZERO
It means that the "other factors" contained in Ui are unrelated to Xi in the sense that, given a value of Xi, the mean of the distribution of these other factors is zero. This assumption is illustrated in Figure 4.4: consider a given value of class size, say 20 students per class.
Sometimes these other factors lead to better performance than predicted (Ui>0) and sometimes to worse performance (Ui<0), but on average over the population the prediction is right.
In other words, given Xi = 20 (and, more generally, at other values x of Xi as well), the mean of the distribution of Ui is zero. This is shown in the figure by the distribution of Ui being centred on the population regression line.
As shown in Figure 4.4, the assumption that 𝐸(𝑢𝑖 |𝑋𝑖 ) = 0 is equivalent to assuming that the population regression line is the conditional mean of Yi given Xi.
Moreover, it could be understood as two conditions in one:
E(ui | X = 1) = E(ui | X = 2) = ⋯ : changes in X (class size) should have no impact on the distribution of Ui.
E(ui | Xi = x) = 0: on average, our regression model predicts the truth.
EXPERIMENTAL DATA: In a randomized controlled experiment, subjects are randomly assigned to the treatment group (X = 1) or to the control group (X = 0). The random assignment is typically done using a computer program that uses no information about the subject, ensuring that X is distributed independently of all personal characteristics of the subject. Random assignment makes X and U independent, which in turn implies that the conditional mean of U given X is zero.

OBSERVATIONAL DATA: In observational data, X is not randomly assigned in an experiment. Instead, the best that can be hoped for is that X is as if randomly assigned, in the precise sense that E(ui | Xi) = 0. Whether this assumption holds in a given empirical application with observational data requires careful thought and judgement, and it is not always realistic: in other words, we can find observational data where the assumption doesn't hold.

CORRELATION AND CONDITIONAL MEAN: If the conditional mean of one random variable given another is zero, then the two random variables have zero covariance and are uncorrelated.
EXTREMELY IMPORTANT!! The implication runs only one way:
Conditional mean = 0 ⟹ Correlation = 0
Correlation = 0 ⟹ Conditional mean can take any value
So if X and U are correlated, then the conditional mean assumption is violated. And if they are uncorrelated, we can't be sure: there are cases where the correlation equals 0 but the conditional mean is nonzero.
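The one-way nature of this implication can be demonstrated by simulation: below, u is built from X² so that E(u | X) is clearly nonzero, yet, because X is symmetric around zero, the covariance between X and u is approximately zero. All numbers are assumptions of this illustrative simulation:

```python
import random

random.seed(0)
n = 100_000
x = [random.gauss(0.0, 1.0) for _ in range(n)]
ex2 = sum(xi * xi for xi in x) / n       # sample estimate of E[X^2] (about 1)
u = [xi * xi - ex2 for xi in x]          # E(u | X = x) = x^2 - 1, NOT zero

x_bar = sum(x) / n
u_bar = sum(u) / n
cov_xu = sum((a - x_bar) * (b - u_bar) for a, b in zip(x, u)) / n

# Cov(X, X^2) = E[X^3] = 0 for a symmetric X, so cov_xu is near zero even
# though the conditional-mean assumption E(u | X) = 0 fails: for instance,
# u is systematically positive when |X| is large.
u_large_x = [b for a, b in zip(x, u) if abs(a) > 1.5]
mean_u_large_x = sum(u_large_x) / len(u_large_x)   # clearly above zero
```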
ASSUMPTION #2: (Xi, Yi), i = 1, …, n, ARE INDEPENDENTLY AND IDENTICALLY DISTRIBUTED
This assumption is a statement about how the sample is drawn. It holds automatically if the entity (individual, district) is sampled by simple random sampling: the entity is selected and then, for that entity, X and Y are observed (recorded).
EXAMPLES: NON-I.I.D SAMPLING  The main place we will encounter non-i.i.d. sampling is when data are recorded over time (“time series data”). This will introduce some extra complications.
Example: data on inventory levels (Y) at a firm and the interest rate at which the firm can borrow (X), where these data are collected over time from a specific firm (four times per year during 30 years). A key feature of time series data is that observations falling close to each other in time are not independent but rather tend to be correlated with each other; if interest rates are low now, they are likely to be low next quarter. This pattern of correlation violates the "independence" part of the i.i.d. assumption.
 Another instance of non-i.i.d. sampling is when observations belonging to a group or cluster have unobservable variables in common.
ASSUMPTION #3: LARGE OUTLIERS ARE UNLIKELY
Large outliers (that is, observations with values of Xi, Yi, or both that are far outside the usual range of the data) are unlikely. Large outliers can make OLS regression results misleading. This potential sensitivity of OLS to extreme outliers is illustrated in the following figure.
Mathematically, we assume X and Y have nonzero finite fourth moments: 0 < E(Xi⁴) < ∞ and 0 < E(Yi⁴) < ∞. This means that our variables take finite, bounded values (example: class size is capped by the physical capacity of a classroom; the best you can do on a standardized test is to get all the questions right and the worst is to get all the questions wrong).
In conclusion, if the assumption of finite fourth moments holds, then it is unlikely that statistical inferences using OLS will be dominated by a few observations.
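The sensitivity to outliers can be seen directly: a single badly recorded observation moves the OLS slope far from the value fitted on clean data. The numbers below are hypothetical:

```python
def slope(x, y):
    """OLS slope for paired data."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) \
           / sum((a - x_bar) ** 2 for a in x)

x = [float(i) for i in range(10)]        # 0, 1, ..., 9
y = [2.0 + 1.0 * xi for xi in x]         # exact line with slope 1
clean_slope = slope(x, y)                # exactly 1.0

y_bad = y[:-1] + [y[-1] + 90.0]          # one data-entry error at x = 9
outlier_slope = slope(x, y_bad)          # pulled far above 1
```

One contaminated point out of ten is enough to multiply the estimated slope several times over, which is why checking for recording errors comes before any inference.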
The least squares assumptions play twin roles:  FIRST ROLE: If these assumptions hold, then, as is shown in the next section, in large samples the OLS estimators have sampling distributions that are normal, which allows us to develop methods for hypothesis testing and to construct confidence intervals.
 SECOND ROLE: They allow us to organize the circumstances that pose difficulties for OLS regression:
a. Assumption #1: It's the most important to consider in practice, because in several cases it may not hold.
b. Assumption #2: Although it holds in many datasets, the independence assumption is inappropriate for time series data. Therefore, in these cases we will need to modify the methods used.
c. Assumption #3: If your dataset contains large outliers, you should examine those outliers carefully to make sure those observations are correctly recorded and belong in the data set (there can be data entry errors, like height recorded in meters vs. centimetres).
Remember that β̂0 and β̂1 are the OLS estimators of the unknown intercept β0 and slope β1 of the population regression line. Because the OLS estimators are calculated using a random sample, β̂0 and β̂1 are random variables that take on different values from one sample to the next; the probability of these different values is summarized in their sampling distributions. Under the three least squares assumptions and when the sample is LARGE:

UNBIASED: The exact (finite-sample) sampling distribution of β̂1 has mean β1 ("β̂1 is an unbiased estimator of β1"), and Var(β̂1) is inversely proportional to n. Other than its mean and variance, the exact distribution of β̂1 is complicated and depends on the distribution of (X, U).
o The larger Var(Xi) is, the smaller the variance of β̂1: it's easier to draw a precise line when X has a large variance.
o The smaller Var(Ui) is, the smaller the variance of β̂1: if the errors are small, the data will be tighter around the line.

CONSISTENT: β̂1 →p β1; in other words, these estimators are consistent (when the sample is large, our estimators will be near the true population coefficients) (LLN).

NORMALLY DISTRIBUTED: (β̂1 − E(β̂1)) / √Var(β̂1) is approximately distributed as N(0, 1) if the sample is sufficiently large, even if the original distribution wasn't normal (CLT).

PROPERTY: Unbiasedness of β̂1, and Var(β̂1) inversely proportional to n
BEFORE: the estimator depends on X and Y.
NOW: after substituting the model for Yi, the estimator depends on X and U. If assumption #1 holds, then cov(X, U) = 0, so the estimator predicts the truth.
Next we calculate the expectation of the expression we've obtained. Finally, regarding the variance: it measures the amount of doubt you have about what you've predicted.
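A small Monte Carlo simulation illustrates the unbiasedness of β̂1 and the 1/n behaviour of its variance; the true coefficients, error distribution, and sample sizes below are assumptions of the demo:

```python
import random
import statistics

TRUE_B0, TRUE_B1 = 2.0, 0.5   # assumed population coefficients

def beta1_hat(n, rng):
    """Draw one sample of size n from the model and return the OLS slope."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [TRUE_B0 + TRUE_B1 * xi + rng.gauss(0.0, 1.0) for xi in x]
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) \
           / sum((a - x_bar) ** 2 for a in x)

rng = random.Random(42)
small = [beta1_hat(20, rng) for _ in range(2000)]    # n = 20
large = [beta1_hat(200, rng) for _ in range(2000)]   # n = 200

# Unbiasedness: both sampling distributions are centred on TRUE_B1 = 0.5.
# Variance: with 10x the sample size, Var(beta1_hat) is roughly 10x smaller.
```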
PROPERTY: Consistency of β̂1: β̂1 →p β1
The population coefficients are fixed parameters, so the covariance between a fixed parameter and a random variable is equal to 0. More important, the expected value of a fixed parameter is exactly the parameter.
By the LLN, the sample moments converge in probability to their population counterparts:
X̄ →p μx    s²x →p σ²x    sxy →p σxy
The CLT and the LLN allow us to combine parameters (fixed values) of a population with sample estimators.

PROPERTY: Approximation of the distribution of β̂1 to a normal distribution with large n (additional notes).

CONCLUSION
Until now we have focused on the use of ordinary least squares to estimate the intercept and slope of a population regression line using a sample of n observations on a dependent variable, Y, and a single regressor, X.
There are many ways to draw a straight line through a scatterplot, but doing so using OLS has several virtues. If the least squares assumptions hold, then the OLS estimators of the slope and the intercept are unbiased, consistent and have sampling distribution with a variance that is inversely proportional to the sample size n. Moreover, if n is large, then the sampling distribution of the OLS estimator is normal.
The results we've obtained describe the sampling distribution of the OLS estimator. By themselves, however, these results are not sufficient to test a hypothesis about the value of β1 or to construct a confidence interval for β1. Doing so requires an estimator of the standard deviation of the sampling distribution (that is, the standard error of the OLS estimator), which is what we will do in the next sections.
SOME ADDITIONAL ALGEBRAIC FACTS ABOUT OLS

(4.32) (1/n)·Σⁿᵢ₌₁ ûi = 0: the SAMPLE AVERAGE of the OLS residuals is zero.
 Start from the estimated model Yi = β̂0 + β̂1·Xi + ûi and isolate ûi:
ûi = Yi − (β̂0 + β̂1·Xi)
 Substitute β̂0 = Ȳ − β̂1·X̄ and rearrange:
ûi = Yi − (Ȳ − β̂1·X̄ + β̂1·Xi) = (Yi − Ȳ) − β̂1·(Xi − X̄)
 Sum over the observations. The summation of a mean is just n times the mean (the mean is a constant inside the sum), so Σ(Yi − Ȳ) = Σ Yi − n·Ȳ = 0 because Ȳ = (1/n)·Σ Yi, and the same happens with X. Therefore:
Σ ûi = Σ(Yi − Ȳ) − β̂1·Σ(Xi − X̄) = 0 − β̂1·0 = 0
 Finally, dividing by n gives (1/n)·Σ ûi = 0.

(4.33) (1/n)·Σⁿᵢ₌₁ Ŷi = Ȳ: the SAMPLE AVERAGE of the OLS predicted values equals Ȳ.
 Write the estimated model as Yi = Ŷi + ûi and sum over the observations:
Σ Yi = Σ Ŷi + Σ ûi
 We already know from (4.32) that Σ ûi = 0, so Σ Yi = Σ Ŷi.
 Finally, using the formula of the mean: (1/n)·Σ Ŷi = (1/n)·Σ Yi = Ȳ.

(4.34) Σⁿᵢ₌₁ ûi·Xi = 0: the SAMPLE COVARIANCE between the OLS residuals and the regressor is zero.
 We know that Σ ûi·Xi = Σ ûi·(Xi − X̄).
a. WHY? Σ ûi·(Xi − X̄) = Σ ûi·Xi − X̄·Σ ûi = Σ ûi·Xi − X̄·0 = Σ ûi·Xi
 Substitute ûi = (Yi − Ȳ) − β̂1·(Xi − X̄) and develop:
Σ ûi·(Xi − X̄) = Σ(Yi − Ȳ)(Xi − X̄) − β̂1·Σ(Xi − X̄)²
 Substitute β̂1 = cov(X, Y)/Var(X) = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²:
Σ ûi·(Xi − X̄) = Σ(Yi − Ȳ)(Xi − X̄) − Σ(Xi − X̄)(Yi − Ȳ) = 0

(4.35) TSS = ExplainedSS + ResidualSS
 We know that TSS = Σⁿᵢ₌₁(Yi − Ȳ)².
 Include Ŷi: TSS = Σ(Yi − Ŷi + Ŷi − Ȳ)².
 Substitute:
a. A = Yi − Ŷi, so Σⁿᵢ₌₁ A² = ResidualSS: the sum of the squared differences between true observations and predicted observations.
b. B = Ŷi − Ȳ, so Σⁿᵢ₌₁ B² = ExplainedSS: the sum of the squared differences between the predicted observations and the mean.
 Develop: TSS = Σ(A + B)² = Σ A² + Σ B² + 2·Σ AB.
c. The cross term is zero: Σ AB = Σ ûi·(Ŷi − Ȳ) = Σ ûi·Ŷi − Ȳ·Σ ûi = Σ ûi·(β̂0 + β̂1·Xi) − 0 = β̂0·Σ ûi + β̂1·Σ ûi·Xi = 0 + 0 = 0
 Hence TSS = ResidualSS + ExplainedSS + 0.

REGRESSION WITH A SINGLE REGRESSOR: HYPOTHESIS TESTS AND CONFIDENCE INTERVALS
Hypothesis testing for regression coefficients is analogous to hypothesis testing for the population mean: use the t-statistic to calculate the p-value and either accept or reject the null hypothesis. Like a confidence interval for the population mean, a 95% confidence interval for a regression coefficient is computed as the estimator ±1.96 standard errors.
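As a quick check before moving on, the algebraic facts (4.32)-(4.35) hold for any data set and can be verified numerically; the values below are hypothetical:

```python
# Verify the OLS algebraic facts (4.32)-(4.35) on a small sample.

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.1, 3.8, 5.3]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) \
     / sum((a - x_bar) ** 2 for a in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * a for a in x]
u_hat = [b - f for b, f in zip(y, y_hat)]

avg_resid = sum(u_hat) / n                      # (4.32): equals 0
avg_fitted = sum(y_hat) / n                     # (4.33): equals y_bar
cross = sum(u * a for u, a in zip(u_hat, x))    # (4.34): equals 0
tss = sum((b - y_bar) ** 2 for b in y)
ess = sum((f - y_bar) ** 2 for f in y_hat)
ssr = sum(u ** 2 for u in u_hat)
# (4.35): tss == ess + ssr, up to floating-point rounding
```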
HYPOTHESIS TESTING
First of all, we need to state precisely the null and alternative hypotheses before starting the test.
STEP 1: Compute the standard error of β̂1, SE(β̂1). Although the formula is complicated, in applications the standard error is computed by regression software.
STEP 2: Compute the t-statistic: t = (β̂1 − β1,0) / SE(β̂1).
STEP 3: Compute the p-value: the probability of observing a value of β̂1 at least as different from β1,0 as the estimate actually computed, assuming that the null hypothesis is correct.

EXAMPLE: TEST SCORES AND STR, CALIFORNIA DATA
Using STATA we obtain the following table, and we observe that there are three equivalent ways to decide whether to reject the null hypothesis:
Looking at the t-statistic: if its absolute value is bigger than 1.96, we reject.
Looking at the p-value: if it is lower than 0.05, we reject.
Looking at the confidence interval: if 0 (the value under the null hypothesis) is not included, we reject.
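The three decision rules can be sketched in a few lines. The slope of −2.28 comes from the example above; the standard error of 0.52 is an assumed value for illustration (the actual number sits in the STATA table, which is not reproduced here):

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

beta1_hat = -2.28     # estimated slope (from the example)
se = 0.52             # ASSUMED standard error, for illustration
null_value = 0.0      # H0: beta1 = 0

t = (beta1_hat - null_value) / se                     # STEP 2: t-statistic
p = 2.0 * normal_cdf(-abs(t))                         # STEP 3: two-sided p-value
ci = (beta1_hat - 1.96 * se, beta1_hat + 1.96 * se)   # 95% confidence interval

reject_t = abs(t) > 1.96
reject_p = p < 0.05
reject_ci = not (ci[0] <= null_value <= ci[1])
# The three rules are equivalent; with these numbers each one rejects H0
# at the 5% significance level.
```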
CONFIDENCE INTERVALS: In 95% of all samples that might be drawn, the confidence interval will contain the true value of the population parameter.
At the same time, it can be defined as the set of values that can't be rejected using a two-sided hypothesis test with a 5% significance level.
RECALL!! When X is binary, the regression model can be used to estimate and test hypotheses about the difference between the population means of the “X=0” and the “X=1” group.
Our only assumption about the distribution of Ui conditional on Xi is that it has a mean of zero (the first least squares assumption). If, furthermore, the variance of this conditional distribution does NOT depend on Xi, then the errors are said to be homoskedastic.
We're going to discuss:
What are heteroskedasticity and homoskedasticity? (their theoretical implications)
The simplified formulas for the standard errors of the OLS estimators (mathematical implications)
The risks you run if you use these simplified formulas in practice (what does this mean in practice?)

RECALL SOME PROPERTIES (Y = wages, X = years of school):
Var(Y): you calculate the variance of all the population's wages.
Var(Y|X = 12): you calculate the variance only for the group that satisfies the condition on X [we're fixing X, but Y keeps changing].
Var(X): you calculate the variance of all the population's years of school.
𝑽𝒂𝒓(𝑿|𝑿 = 𝟏𝟐) = 𝟎 We’re fixing X, so the variance of X is equal to 0.
Therefore, 𝑽𝒂𝒓(𝒀|𝑿) = 𝑽𝒂𝒓(𝜷𝟎 + 𝜷𝟏 𝑿 + 𝑼|𝑿) = 𝑽𝒂𝒓(𝜷𝟎 |𝑿) + 𝑽𝒂𝒓(𝜷𝟏 𝑿|𝑿) + 𝑽𝒂𝒓(𝑼|𝑿) = 𝟎 + 𝟎 + 𝑽𝒂𝒓(𝑼|𝑿) = 𝑽𝒂𝒓(𝑼|𝑿) The variance of fixed parameters is always 0.
 WHAT ARE HETEROSKEDASTICITY AND HOMOSKEDASTICITY?
HOMOSKEDASTICITY: If Var(U|X = x) is constant, that is, the variance of the conditional distribution of U given X does NOT depend on X, then U is said to be homoskedastic. All the conditional distributions are equally wide.
HETEROSKEDASTICITY: If Var(U|X = x) is NOT constant, that is, the variance of the conditional distribution of U given X depends on X, then U is said to be heteroskedastic. For example, the conditional distribution of Ui may spread out as x increases.
Analysis of the variance of the conditional distribution of U given X: If you go less than ten years to school, your wage will be around 0-20.
If you go more than 10 years, your wage will be between 0-60.
Therefore, conditional distribution of Ui spreads out as x increases.
Laura Aparicio 33 ECONOMETRICS I  MATHEMATICAL IMPLICATIONS Heteroskedasticity and homoskedasticity concern only to the variance, in other words, the Standard Error (SE) and all the values calculated using SE (t-statistic, confidence intervals…).
HETEROSKEDASTICITY HETEROSKEDASTICITY X=1 Var = 15 X=2 Var = 20 HOMOSKEDASTICITY HOMOSKEDASTICITY Know due to U and X are independent we can express 𝑉𝑎𝑟(𝑣) = 𝑉𝑎𝑟(𝑋) · 𝑉𝑎𝑟(𝑈) ̂1 ) = 𝑉𝑎𝑟(𝛽 X=1 Var = 15 X=2 Var = 25 𝑉𝑎𝑟(𝑋) · 𝑉𝑎𝑟(𝑈) 𝑉𝑎𝑟(𝑈) = 𝑛 · 𝑉𝑎𝑟(𝑋)2 𝑛 · 𝑉𝑎𝑟(𝑋)  WHAT DOES THIS MEAN IN PRACTICE? If the errors are homoskedastic and you use the heteroskedastic formula for standard errors (the one we derived), you are OK.
If the errors are heteroskedastic and you use the homoskedasticity-only formula for standard errors, the standard errors are WRONG.
Laura Aparicio 34 ECONOMETRICS I The two formulas coincide (when n is large) in the special case of homoskedasticity.
The bottom line: you should ALWAYS use the heteroskedasticity-based formulas – these are conventionally called the heteroskedasticity-robust standard errors.
MAIN IDEA! In general, the error ui is heteroskedastic (that is, the variance of ui at a given value of Xi, 𝑣𝑎𝑟(𝑢𝑖 |𝑋𝑖 = 𝑥), depends on x).
A special case is when the error is homoskedastic (that is 𝑣𝑎𝑟(𝑢𝑖 |𝑋𝑖 = 𝑥) is constant). Homoskedasticity-only standard errors do NOT produce valid statistical inferences when the errors are heteroskedastic, but heteroskedasticity-robust standard errors do.
EXTRA! If the three least squares assumption hold AND if the regression errors are homoskedastic, then, the OLS estimator is BLUE.
Moreover, if the three least squares holds, if the regression errors are homoskedastic AND if the regression errors are normally distributed, then the OLS t-statistic computed using homoskedasticity-only standard errors has a Student t distribution when the null hypothesis is true. The difference between the Student t distribution and normal distribution is negligible if the sample size is moderate or large.
CONCLUSION
Returning to the California test score data set, there is a negative relationship between the student-teacher ratio and test scores, but is this relationship necessarily a causal one? Districts with lower STR have, on average, higher test scores. But does this mean that reducing the STR will, in fact, increase scores? There is, in fact, reason to worry that it might not. Hiring more teachers, after all, costs money, so wealthier school districts can better afford small classes. Moreover, students at wealthier schools also have other advantages over their poorer neighbours, including better facilities, newer books, and better-paid teachers.
What’s more, California has a large immigrant community; these immigrants tend to be poorer than the overall population, and, in many cases, their children are not native English speakers. It thus might be that our negative estimated relationship between test scores and the STR is a consequence of large classes being found in conjunction with many other factors that are, in fact, the real cause of the lower test scores.
These other factors or “omitted variables”, could mean that the OLS analysis done so far has little value. Indeed, it could be misleading: changing the STR alone would not change these other factors that determine a child’s performance at school. To address this problem, we need a method that will allow us to isolate the effect on test scores of changing the STR, holding these other factors constant. The method is MULTIPLE REGRESSION ANALYSIS.