# TOPIC 2 (2015)

Notes in English | Universidad Pompeu Fabra (UPF) |

Degree | Administración y Dirección de Empresas (Business Administration and Management) - 2nd year |

Course | Econometrics I |

Year of the notes | 2015 |

Pages | 22 |

Upload date | 10/04/2016 |

Downloads | 18 |

Uploaded by | laparicioimbuluzqueta |

### Text preview

ECONOMETRICS I
TOPIC 2: INTRODUCTION TO LINEAR REGRESSION

LINEAR REGRESSION WITH ONE REGRESSOR

POPULATION REGRESSION LINE = the simplest model:

Yi = β0 + β1·Xi + Ui , where i = 1, …, n

Here β0 + β1·Xi is the non-random component (β0 and β1 are fixed quantities) and Ui is the disturbance term.

VARIABLES:
X is the independent (or explanatory) variable, also called the regressor.

Y is the dependent variable.

β0 is the intercept parameter (the value of Y when X = 0). Sometimes it has a meaningful interpretation, and in other cases it just acts as the level (height) of the regression line.

β1 is the causal effect of X on Y: the parameter that gives the change in Y for a unit change in X, holding other factors constant. In the linear model the change in Y is the same for all changes in X, no matter what the initial level of X.

Ui captures the unobserved factors that influence Y other than the variable X. Ui is sometimes also called the "error term" or "disturbance".

GRAPHICALLY: X1, X2, X3, and X4 are four hypothetical values of the explanatory variable. If the relationship between Y and X were exact, the corresponding values of Y would be represented by the points Q1–Q4 on the line. The disturbance term causes the actual values of Y to be different. In the diagram, the disturbance term has been assumed to be positive in the first and fourth observations and negative in the other two, with the result that, if one plots the actual values of Y against the values of X, one obtains the points P1–P4.

It must be emphasized that in practice the P points are all one can see. The actual values of β0 and β1, and hence the location of the Q points, are unknown, as are the values of the disturbance term in the observations.

In our example:
How can we find the value of these betas? Using the OLS estimators.

OLS ESTIMATORS = the population regression line can be estimated using sample observations by ordinary least squares (OLS). The
OLS estimators of the regression intercept and slope are denoted 𝛽̂0 and 𝛽̂1 .

Laura Aparicio
Suppose that you are given the four observations on X and Y represented in previous figure and you are asked to obtain estimates of
the values of β0 and β1. As a rough approximation, you could do this by plotting the four P points and drawing a line to fit them as best
you can.

This has been done in the following figure:
The intersection of the line with the Y-axis provides an estimate of the intercept β0, which will be denoted β̂0, and the slope provides an estimate of the slope coefficient β1, which will be denoted β̂1.

The fitted line will be written as: Ŷi = β̂0 + β̂1·Xi
Drawing a regression line by eye is all very well, but it leaves a lot to subjective judgment. The question arises, is there a way of
calculating good estimates of β0 and β1 algebraically?
[1] Define what is known as a residual for each observation: the difference between the actual value of Y in any observation and the fitted value given by the regression line.

ûi = Yi − Ŷi

[2] We substitute the fitted line:

ûi = Yi − β̂0 − β̂1·Xi

[3] Hence the residual in each observation depends on our choice of β̂0 and β̂1. Obviously, we wish to fit the regression line, that is, choose β̂0 and β̂1, in such a way as to make the residuals as SMALL AS POSSIBLE. We need to devise a criterion of fit that takes account of the size of all the residuals simultaneously. There are a number of possible criteria, some of which work better than others. One way of overcoming the problem is to minimize RSS, the sum of the squares of the residuals:

RSS = û1² + û2² + û3² + û4²
The smaller one can make RSS, the better is the fit, according to this criterion. If one could reduce RSS to 0, one would have
a perfect fit, for this would imply that all the residuals are equal to 0. The line would go through all the points, but of course
in general the disturbance term makes this impossible.

WHY USE OLS, RATHER THAN SOME OTHER ESTIMATOR? The OLS estimator has some desirable properties: under certain
assumptions, it is unbiased (that is, E(𝛽̂1 ) = β1), and it has a tighter sampling distribution than some other candidate estimators of β1.

Importantly, this is what everyone uses.

Let's now consider the GENERAL CASE where there are n observations on two variables X and Y and, supposing Y to depend on X, we will fit the equation:

Ŷi = β̂0 + β̂1·Xi

The fitted value of the dependent variable in observation i, Ŷi, will be (β̂0 + β̂1·Xi), and the residual ûi will be (Yi − β̂0 − β̂1·Xi). We wish to choose β̂0 and β̂1 so as to minimize the residual sum of squares, RSS, given by:

RSS = û1² + ⋯ + ûn² = ∑ᵢ₌₁ⁿ ûi²

We will find that RSS is minimized when:

β̂1 = cov(X, Y) / Var(X) = [(1/n)·∑ᵢ₌₁ⁿ (Xi − X̄)(Yi − Ȳ)] / [(1/n)·∑ᵢ₌₁ⁿ (Xi − X̄)²] = ∑ᵢ₌₁ⁿ (Xi − X̄)(Yi − Ȳ) / ∑ᵢ₌₁ⁿ (Xi − X̄)²

β̂0 = Ȳ − β̂1·X̄
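These two formulas can be sketched directly in code. A minimal pure-Python example on hypothetical toy data (not from the course's data set):

```python
# Minimal sketch of the OLS formulas above, in pure Python.
def ols(x, y):
    """Return (beta0_hat, beta1_hat) minimizing the residual sum of squares."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # beta1_hat = sample cov(X, Y) / sample Var(X); the 1/n factors cancel
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = cov_xy / var_x
    beta0 = y_bar - beta1 * x_bar  # the fitted line passes through (x_bar, y_bar)
    return beta0, beta1

x = [1, 2, 3, 4]
y = [2.1, 3.9, 6.2, 7.8]
b0, b1 = ols(x, y)
```

Note that β̂0 follows mechanically from β̂1 and the two sample means, which is exactly the second formula above.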
MAIN CONCEPTS
OBJECTIVE: The task of regression analysis is to obtain estimates of β0 and β1, and hence an estimate of the location of the line, given
the P points. Moreover, if you were concerned only with measuring the effect of X on Y, it would be much more convenient if the
disturbance term did not exist. But in fact, part of each change in Y is due to a change in u, and this makes life more difficult (u is
sometimes described as noise).

INTERPRETATION
There are two stages in the interpretation of a regression equation: first, to turn the equation into words so that it can be understood by a non-econometrician; second, to decide whether this literal interpretation should be taken at face value or whether the relationship should be investigated further.

EXAMPLE:
SLOPE: It indicates that, as S increases by one unit (of S), EARNINGS increases by 1.07 units (of EARNINGS). Since S is measured
in years, and EARNINGS is measured in dollars per hour, the coefficient of S implies that hourly earnings increase by $1.07 for
every extra year of schooling.

CONSTANT TERM: Strictly speaking, it indicates the predicted level of EARNINGS when S is 0. Sometimes the constant will have a clear meaning, but sometimes not. If the sample values of the explanatory variable are a long way from 0, extrapolating the regression line back to 0 may be dangerous. Even if the regression line gives a good fit for the sample of observations, there is no guarantee that it will continue to do so when extrapolated to the left or to the right.

In this case a literal interpretation of the constant would lead to the nonsensical conclusion that an individual with no
schooling would have hourly earnings of –$1.39. In this data set, no individual had less than six years of schooling and only
three failed to complete elementary school, so it is not surprising that extrapolation to 0 leads to trouble.

MEASURES OF FIT = A natural question is how well the regression line “fits” or explains the data. There are two regression statistics
that provide complementary measures of the quality of fit:
Regression R2: measures the fraction of the variance of Y that is explained by X. It’s unit-less and ranges between 0 (no fit)
and 1 (perfect fit).

Standard error of the regression (SER): is an estimator of the standard deviation of the regression error.

REGRESSION (R2)
First, we need to recall some nice properties (grey box).

[1] We have seen that we can split the value of Yi in each observation into two components, Ŷi and ûi, after running a regression:

Yi = Ŷi + ûi

[2] We can use this to decompose the variance of Y:

Var(Y) = Var(Ŷi + ûi) = Var(Ŷi) + Var(ûi) + 2·Cov(Ŷi, ûi)

[3] Now it so happens that Cov(Ŷ, û) must be equal to 0 (see the box). Hence we obtain:

Var(Y) = Var(Ŷi + ûi) = Var(Ŷi) + Var(ûi)

This means that we can decompose the variance of Y into two parts: Var(Ŷ), the part "explained" by the regression line, and Var(û), the "unexplained" part.

[4] In view of this, Var(Ŷ)/Var(Y) is the proportion of the variance explained by the regression line. This proportion is known as the coefficient of determination or, more usually, R2:

R2 = Var(Ŷ) / Var(Y)

The maximum value of R2 is 1. This occurs when the regression line fits the observations exactly, so that Ŷi = Yi in all observations and all the residuals are 0. Then Var(Ŷ) = Var(Y) and Var(û) is 0, and one has a perfect fit.

If there is no apparent relationship between the values of Y and X in the sample, R2 will be close to 0.
Often it is convenient to decompose the variance as "sums of squares":
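This decomposition can be checked numerically. A sketch on hypothetical data: fit OLS, verify TSS = ExplainedSS + ResidualSS, and return R² = ExplainedSS/TSS.

```python
# Sketch of the R^2 decomposition above on hypothetical data,
# reusing the OLS formulas from this topic.
def r_squared(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    beta1 = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / \
            sum((a - x_bar) ** 2 for a in x)
    beta0 = y_bar - beta1 * x_bar
    y_hat = [beta0 + beta1 * a for a in x]
    tss = sum((b - y_bar) ** 2 for b in y)             # total SS
    ess = sum((h - y_bar) ** 2 for h in y_hat)         # explained SS
    rss = sum((b - h) ** 2 for b, h in zip(y, y_hat))  # residual SS
    assert abs(tss - (ess + rss)) < 1e-9               # TSS = ESS + RSS
    return ess / tss

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
r2 = r_squared(x, y)
```

With exact data such as y = 1 + 2x, the function returns 1.0, matching the perfect-fit case described above.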
STANDARD ERROR OF THE REGRESSION (SER)
The standard error of the regression is (almost) the sample standard deviation of the OLS residuals:

SER = √( RSS / (n − 2) )

CHARACTERISTICS
It has the units of û, which are the units of Y.

It measures the spread of the distribution of û.

It measures the average "size" of the OLS residual (the average "mistake" made by the OLS regression line).

The root mean squared error (RMSE) is closely related to the SER; it uses the divisor n instead of n − 2:

RMSE = √( RSS / n )
IMPORTANT! A low R2 and large SER do NOT imply that our regression is either “good” or “bad”. What they tell us is that other
important factors influence Y. Moreover, they do NOT tell us what these factors are, but they do indicate that X alone explains only
a small part of the variation in Y in these data.
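A minimal sketch of the SER (divisor n − 2) and the RMSE (divisor n), using hypothetical fitted values rather than a full regression:

```python
import math

# Hypothetical observed and fitted values from some regression.
y = [3.0, 5.0, 7.0, 9.0]
y_hat = [3.2, 4.8, 7.1, 8.9]
res = [a - b for a, b in zip(y, y_hat)]  # residuals, in the units of Y
n = len(y)

rss = math.fsum(r ** 2 for r in res)
ser = math.sqrt(rss / (n - 2))   # divisor n - 2: two estimated coefficients
rmse = math.sqrt(rss / n)        # divisor n
```

For large n the two statistics are nearly identical; they differ only in the degrees-of-freedom correction.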

APPLICATION TO TEST-SCORES AND CLASS-SIZE
Interpretation of the estimated slope and intercept
SLOPE: Districts with one more student per teacher on
average have test scores that are 2.28 points lower.

INTERCEPT: The intercept (taken literally) means that, according to this estimated line, districts with zero students per teacher would have a (predicted) test score of 698.9. This interpretation of the intercept makes no sense (it extrapolates the line outside the range of the data); in this application, the intercept is not itself economically meaningful.

SPECIAL CASE OF A DUMMY VARIABLE
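The content of this special case (shown as a figure in the original notes) boils down to a standard fact, also recalled later in this topic: with a binary regressor, β̂0 equals the sample mean of the X = 0 group and β̂1 equals the difference between the two group means. A sketch on hypothetical data:

```python
# Dummy-variable special case: the OLS slope on a binary X equals the
# difference in sample means between the X=1 and X=0 groups.
def ols_fit(x, y):
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b1 = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / \
         sum((a - xb) ** 2 for a in x)
    return yb - b1 * xb, b1

x = [0, 0, 0, 1, 1, 1, 1]
y = [10.0, 12.0, 11.0, 15.0, 14.0, 16.0, 15.0]
b0, b1 = ols_fit(x, y)

mean0 = sum(yi for xi, yi in zip(x, y) if xi == 0) / 3
mean1 = sum(yi for xi, yi in zip(x, y) if xi == 1) / 4
# b0 reproduces mean0, and b1 reproduces mean1 - mean0
```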
So far we have seen how to estimate the slope of the population regression function using the OLS estimator. But under what conditions can we interpret β̂1 in a causal way? And how should we interpret it if this condition fails? Moreover, the OLS regression line is an estimate, computed using our sample of data; a different sample would have given a different value of β̂1.

How can we:
quantify the sampling uncertainty associated with β̂1?
use β̂1 to test hypotheses such as β1 = 0?
construct a confidence interval for β1?
KEY ASSUMPTIONS OF THE MODEL = OLS provides an appropriate estimator of the unknown regression coefficients, β0 and β1, under
these three assumptions:
ASSUMPTION #1: THE CONDITIONAL DISTRIBUTION OF Ui GIVEN Xi HAS A MEAN OF ZERO
It means that the “other factors” contained in Ui are unrelated to Xi in the sense that, given a value of Xi, the mean of the distribution
of these other factors is zero. This assumption is illustrated in Figure 4.4:
At a given value of class size, say 20 students per class, sometimes these other factors lead to better performance than predicted (Ui > 0) and sometimes to worse performance (Ui < 0), but on average over the population the prediction is right.

In other words, given Xi = 20, and, more generally, at any other value x of Xi as well, the mean of the distribution of Ui is zero. This is shown by the distribution of Ui being centred on the population regression line.

As shown in Figure 4.4, the assumption that 𝐸(𝑢𝑖 |𝑋𝑖 ) = 0 is equivalent to assuming that the population regression line is the
conditional mean of Yi given Xi.

Moreover, it could be understood as two conditions in one:

E(ui|x = 1) = E(ui|x = 2) = ⋯ : changes in X (class size) should never have an impact on Ui.

E(ui|Xi = x) = 0 : on average our regression model predicts the truth.
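The consequence of violating this assumption can be sketched with a simulation (a hypothetical design, not from the course data): when u is built to be correlated with X, the OLS slope no longer recovers the causal β1.

```python
import random

random.seed(1)

def ols_slope(x, y):
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    return sum((a - xb) * (b - yb) for a, b in zip(x, y)) / \
           sum((a - xb) ** 2 for a in x)

n, beta1 = 50_000, 2.0
x = [random.gauss(0, 1) for _ in range(n)]

# Case 1: assumption #1 holds, u drawn independently of x
u_good = [random.gauss(0, 1) for _ in range(n)]
y_good = [beta1 * xi + ui for xi, ui in zip(x, u_good)]

# Case 2: assumption #1 violated, u = 0.5*x + noise, so E(u|X) = 0.5*X
u_bad = [0.5 * xi + random.gauss(0, 1) for xi in x]
y_bad = [beta1 * xi + ui for xi, ui in zip(x, u_bad)]

slope_good = ols_slope(x, y_good)  # close to the causal beta1 = 2
slope_bad = ols_slope(x, y_bad)    # close to beta1 + cov(X,u)/Var(X) = 2.5
```

The biased slope lands near β1 + cov(X, u)/Var(X), which is exactly the omitted-variable logic discussed in the conclusion of this topic.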

EXPERIMENTAL DATA
In
a
OBSERVATIONAL DATA
randomized
controlled
In observational data, X is not
experiment, subjects are randomly
randomly assigned in an experiment.

assigned to the treatment group (X =
Instead, the best that can be hoped
1) or to the control group (X = 0).

for is that X is as if randomly assigned,
in the precise sense that 𝐸(𝑢𝑖 |𝑋𝑖 ) =
The random assignment typically is
0.

CORRELATION AND CONDITIONAL MEAN:
If the conditional mean of one random
variable given another is zero, then the
two random variables have zero
covariance and are uncorrelated.

done using a computer program that
uses no information about the
Whether this assumption holds in a
subject, ensuring that X is distributed
given
independently
observational data requires careful
of
all
personal
characteristics of the subject.

empirical
application
=
0
Correlation = 0 Conditional
mean can take any value.

realistic!
Random assignment makes X and U
independent, which in turn implies
In other words, we can find cases with
that the conditional mean of U given
observational
X is zero.

assumption doesn’t hold.

EXTREMELY IMPORTANT!!
mean
correlation = 0
with
thought and judgement. It’s not very
Conditional
data
where
Correlation different to 0
conditional mean is nonzero.

this
If X and U are correlated, then the
conditional mean assumption is violated.

And is they are uncorrelated we can’t be
sure.

ASSUMPTION #2: (Xi, Yi), i = 1, …, n, ARE INDEPENDENTLY AND IDENTICALLY DISTRIBUTED
This assumption is a statement about how the sample is drawn. It holds automatically if the entity (individual, district) is sampled by simple random sampling: the entity is selected and then, for that entity, X and Y are observed (recorded).

EXAMPLES: NON-I.I.D SAMPLING
[1] The main place we will encounter non-i.i.d. sampling is when data are recorded over time (“time series data”). This will
introduce some extra complications.

a. Example: data on inventory levels (Y) at a firm and the interest rate at which the firm can borrow (X), where these data are collected over time from a specific firm (four times per year during 30 years). A key feature of time series data is that observations falling close to each other in time are not independent but rather tend to be correlated with each other; if interest rates are low now, they are likely to be low next quarter. This pattern of correlation violates the "independence" part of the i.i.d. assumption.

[2] Another instance of non-i.i.d. sampling is when observations belonging to a group or cluster have unobservable variables in
common.

ASSUMPTION #3: LARGE OUTLIERS ARE UNLIKELY
Large outliers (that is, observations with values of Xi, Yi or both that are far
outside the usual range of the data) are unlikely. Large outliers can make OLS
regression results misleading. This potential sensitivity of OLS to extreme outliers
is illustrated in the following figure:
Mathematically: we assume X and Y have nonzero finite fourth moments:

0 < E(Xi⁴) < ∞ and 0 < E(Yi⁴) < ∞

This means that our variables take bounded, finite values (example: class size is capped by the physical capacity of a classroom; the best you can do on a standardized test is to get all the questions right and the worst you can do is to get all the questions wrong).

In conclusion, if the assumption of finite fourth moments holds, then it is unlikely that statistical inferences using OLS will be dominated
by a few observations.
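The sensitivity of OLS to a single outlier can be sketched with a deliberately simple hypothetical example: ten points lying exactly on a line of slope 1, then one corrupted observation.

```python
# Sketch of assumption #3: one large outlier can swing the OLS slope.
def ols_slope(x, y):
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    return sum((a - xb) * (b - yb) for a, b in zip(x, y)) / \
           sum((a - xb) ** 2 for a in x)

x = list(range(10))            # 0..9
y = [float(v) for v in x]      # perfect line with slope 1

slope_clean = ols_slope(x, y)  # exactly 1.0

# One data-entry error (e.g. a value recorded in the wrong units):
y_outlier = y[:-1] + [900.0]   # last observation corrupted
slope_outlier = ols_slope(x, y_outlier)
```

Because each residual enters RSS squared, the single corrupted point dominates the fit and drags the slope far from 1.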

The least squares assumptions play twin roles:

[1] FIRST ROLE: If these assumptions hold, then, as is shown in the next section, in large samples the OLS estimators have sampling distributions that are normal, which allows us to develop methods for hypothesis testing and to construct confidence intervals.

[2] SECOND ROLE: They allow us to organize the circumstances that pose difficulties for OLS regression:

a. Assumption #1: It's the most important to consider in practice, because in several cases it may not hold.

b. Assumption #2: Although it holds in many datasets, the independence assumption is inappropriate for time series data. Therefore, in these cases we will need to modify the methods used.

c. Assumption #3: If your dataset contains large outliers, you should examine those outliers carefully to make sure those observations are correctly recorded and belong in the data set (there can be data entry errors, like height recorded in metres versus centimetres).

Remember that β̂0 and β̂1 are the OLS estimators of the unknown intercept β0 and slope β1 of the population regression line. Because the OLS estimators are calculated using a random sample, β̂0 and β̂1 are random variables that take on different values from one sample to the next; the probability of these different values is summarized in their sampling distributions. Under the three least squares assumptions and when the sample is LARGE:

UNBIASED: The exact (finite-sample) sampling distribution of β̂1 has mean β1 ("β̂1 is an unbiased estimator of β1"), and Var(β̂1) is inversely proportional to n.

Other than its mean and variance, the exact distribution of β̂1 is complicated and depends on the distribution of (X, U):

o The larger is Var(Xi), the smaller is the variance of β̂1: it's easier to draw a precise line when the regressor has a large variance.

o The smaller is Var(Ui), the smaller is the variance of β̂1: if the errors are small, the data will be tighter around the line.

CONSISTENT: β̂1 →p β1; in other words, these estimators are consistent (when the sample is large, our estimators will be near the true population coefficients) (LLN).

NORMALLY DISTRIBUTED: (β̂1 − E(β̂1)) / √Var(β̂1) is approximately distributed as N(0, 1) if the sample is sufficiently large, even if the original distribution wasn't normal (CLT).
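These three properties can be sketched with a Monte Carlo simulation (a hypothetical design: uniform X, standard normal U, true β1 = 2). Repeated samples give a cloud of β̂1 values whose mean sits at β1, and a single very large sample lands close to β1.

```python
import random

random.seed(0)

def beta1_hat(n, beta0=1.0, beta1=2.0):
    """Draw one sample of size n and return the OLS slope estimate."""
    x = [random.uniform(0, 10) for _ in range(n)]
    u = [random.gauss(0, 1) for _ in range(n)]  # E(u|x) = 0 by construction
    y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]
    xb, yb = sum(x) / n, sum(y) / n
    return sum((a - xb) * (b - yb) for a, b in zip(x, y)) / \
           sum((a - xb) ** 2 for a in x)

# Unbiasedness: average of many estimates is close to the true beta1 = 2
draws = [beta1_hat(100) for _ in range(2000)]
mean_b1 = sum(draws) / len(draws)

# Consistency (LLN): one large sample is already close to the true beta1
b1_large = beta1_hat(100_000)
```

A histogram of `draws` would also look approximately normal, which is the CLT statement above.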
PROPERTY: Unbiasedness of β̂1, and Var(β̂1) inversely proportional to n

BEFORE: the estimator depends on X and Y. Substituting Yi = β0 + β1·Xi + Ui into the formula for β̂1 gives

β̂1 = β1 + ∑ᵢ₌₁ⁿ (Xi − X̄)·Ui / ∑ᵢ₌₁ⁿ (Xi − X̄)²

NOW: the estimator depends on X and U. If assumption #1 holds, then cov(X, U) = 0, so the estimator predicts the truth: taking the expectation of the expression above gives E(β̂1) = β1.

Finally, regarding the variance (the amount of doubt that you have about what you've predicted): under homoskedasticity it simplifies to Var(β̂1) = Var(U) / (n·Var(X)), which is inversely proportional to n.
PROPERTY: Consistency of β̂1: β̂1 →p β1

β0 and β1 are fixed parameters, so the covariance between them and a variable is equal to 0. More important, the expected value of a fixed parameter is exactly the parameter.

By the law of large numbers, the sample moments converge in probability to the population moments:

X̄ →p μx
s²x →p σ²x
sxy →p σxy

The CLT and the LLN allow us to combine parameters (fixed values) of a population with sample estimators.

PROPERTY: Approximation to a normal distribution of β̂1 with large n
Additional notes:
CONCLUSION
Until now we have focused on the use of ordinary least squares to estimate the intercept and slope of a population regression line
using a sample of n observations on a dependent variable, Y, and a single regressor, X.

There are many ways to draw a straight line through a scatterplot, but doing so using OLS has several virtues. If the least squares
assumptions hold, then the OLS estimators of the slope and the intercept are unbiased, consistent and have sampling distribution
with a variance that is inversely proportional to the sample size n. Moreover, if n is large, then the sampling distribution of the OLS
estimator is normal.

The results we've obtained describe the sampling distribution of the OLS estimator. By themselves, however, these results are not sufficient to test a hypothesis about the value of β1 or to construct a confidence interval for β1. Doing so requires an estimator of the standard deviation of the sampling distribution (that is, the standard error of the OLS estimator), which is what we will do in the next sections.

SOME ADDITIONAL ALGEBRAIC FACTS ABOUT OLS

(4.32) (1/n)·∑ᵢ₌₁ⁿ ûi = 0 The SAMPLE AVERAGE of the OLS residuals is zero

[1] Estimated linear regression model: Yi = β̂0 + β̂1·Xi + ûi

[2] Isolate ûi: ûi = Yi − [β̂0 + β̂1·Xi]

[3] Substitute β̂0 thanks to β̂0 = Ȳ − β̂1·X̄: ûi = Yi − [Ȳ − β̂1·X̄ + β̂1·Xi]

[4] Rearrange the expression: ûi = (Yi − Ȳ) − (Xi − X̄)·β̂1

[5] Summation: ∑ᵢ₌₁ⁿ ûi = ∑ᵢ₌₁ⁿ (Yi − Ȳ) − β̂1·∑ᵢ₌₁ⁿ (Xi − X̄) = ∑ᵢ₌₁ⁿ Yi − ∑ᵢ₌₁ⁿ Ȳ − β̂1·(∑ᵢ₌₁ⁿ Xi − ∑ᵢ₌₁ⁿ X̄)

[6] We know that the summation of a mean is just n·mean (the mean is always the same, so it acts as a constant): ∑ᵢ₌₁ⁿ ûi = ∑ᵢ₌₁ⁿ Yi − n·Ȳ − β̂1·(∑ᵢ₌₁ⁿ Xi − n·X̄)

[7] Remove the common factor n: (1/n)·∑ᵢ₌₁ⁿ ûi = (1/n)·(∑ᵢ₌₁ⁿ Yi − n·Ȳ) − β̂1·(1/n)·(∑ᵢ₌₁ⁿ Xi − n·X̄)

[8] It's easily observable that ∑ᵢ₌₁ⁿ Yi − n·Ȳ = 0, because Ȳ = (1/n)·∑ᵢ₌₁ⁿ Yi, and the same happens with X. Hence:

(1/n)·∑ᵢ₌₁ⁿ ûi = 0 − β̂1·0 = 0

(4.33) (1/n)·∑ᵢ₌₁ⁿ Ŷi = Ȳ The SAMPLE AVERAGE of the OLS predicted values equals Ȳ

[1] Estimated linear regression model: Yi = β̂0 + β̂1·Xi + ûi

[2] We know that the estimated regression line is Ŷi = β̂0 + β̂1·Xi, so substituting it into the previous expression: Yi = Ŷi + ûi

[3] Summation: ∑ᵢ₌₁ⁿ Yi = ∑ᵢ₌₁ⁿ Ŷi + ∑ᵢ₌₁ⁿ ûi

[4] We already know from (4.32) that ∑ᵢ₌₁ⁿ ûi = 0, so: ∑ᵢ₌₁ⁿ Yi = ∑ᵢ₌₁ⁿ Ŷi

[5] Finally, we use the formula of the mean, Ȳ = (1/n)·∑ᵢ₌₁ⁿ Yi, to obtain:

Ȳ = (1/n)·∑ᵢ₌₁ⁿ Yi = (1/n)·∑ᵢ₌₁ⁿ Ŷi

(4.34) ∑ᵢ₌₁ⁿ ûi·Xi = 0 The SAMPLE COVARIANCE between the OLS residuals and the regressor is zero

[1] We know that ∑ᵢ₌₁ⁿ ûi·Xi = ∑ᵢ₌₁ⁿ ûi·(Xi − X̄)

a. WHY? ∑ᵢ₌₁ⁿ ûi·(Xi − X̄) = ∑ᵢ₌₁ⁿ (ûi·Xi) − X̄·∑ᵢ₌₁ⁿ ûi = ∑ᵢ₌₁ⁿ (ûi·Xi) − X̄·0 = ∑ᵢ₌₁ⁿ (ûi·Xi)

[2] Substitute ûi = (Yi − Ȳ) − (Xi − X̄)·β̂1:

∑ᵢ₌₁ⁿ ûi·(Xi − X̄) = ∑ᵢ₌₁ⁿ [(Yi − Ȳ) − (Xi − X̄)·β̂1]·(Xi − X̄)

[3] Develop:

= ∑ᵢ₌₁ⁿ (Yi − Ȳ)(Xi − X̄) − β̂1·∑ᵢ₌₁ⁿ (Xi − X̄)²

[4] Substitute β̂1 = cov(X, Y)/Var(X) = ∑ᵢ₌₁ⁿ (Xi − X̄)(Yi − Ȳ) / ∑ᵢ₌₁ⁿ (Xi − X̄)²:

= ∑ᵢ₌₁ⁿ (Yi − Ȳ)(Xi − X̄) − [∑ᵢ₌₁ⁿ (Xi − X̄)(Yi − Ȳ) / ∑ᵢ₌₁ⁿ (Xi − X̄)²]·∑ᵢ₌₁ⁿ (Xi − X̄)²

[5] Finally, develop:

= ∑ᵢ₌₁ⁿ (Yi − Ȳ)(Xi − X̄) − ∑ᵢ₌₁ⁿ (Xi − X̄)(Yi − Ȳ) = 0

(4.35) TSS = ExplainedSS + ResidualSS

[1] We know that TSS = ∑ᵢ₌₁ⁿ (Yi − Ȳ)²

[2] Include Ŷi by adding and subtracting it: TSS = ∑ᵢ₌₁ⁿ (Yi − Ŷi + Ŷi − Ȳ)²

[3] Substitute:

a. A = Yi − Ŷi

b. B = Ŷi − Ȳ

so that TSS = ∑ᵢ₌₁ⁿ (A + B)²

[4] Develop: TSS = ∑ᵢ₌₁ⁿ (A² + B² + 2·A·B) = ∑ᵢ₌₁ⁿ A² + ∑ᵢ₌₁ⁿ B² + 2·∑ᵢ₌₁ⁿ A·B

[5] Finally, we know that:

a. ∑ᵢ₌₁ⁿ A² = ResidualSS: the sum of the squares of the differences between the true observations and the predicted observations.

b. ∑ᵢ₌₁ⁿ B² = ExplainedSS: the sum of the squares of the differences between the predicted observations and the mean.

c. ∑ᵢ₌₁ⁿ A·B = ∑ᵢ₌₁ⁿ (Yi − Ŷi)(Ŷi − Ȳ) = ∑ᵢ₌₁ⁿ ûi·(Ŷi − Ȳ) = ∑ᵢ₌₁ⁿ ûi·Ŷi − Ȳ·∑ᵢ₌₁ⁿ ûi = ∑ᵢ₌₁ⁿ ûi·Ŷi − 0 = ∑ᵢ₌₁ⁿ ûi·(β̂0 + β̂1·Xi) = β̂0·∑ᵢ₌₁ⁿ ûi + β̂1·∑ᵢ₌₁ⁿ ûi·Xi = 0 + 0 = 0

Hence TSS = ResidualSS + ExplainedSS + 0.
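All four algebraic facts can be verified numerically on any data set. A sketch on hypothetical numbers:

```python
# Numerical check of facts (4.32)-(4.35) on hypothetical data.
def ols_fit(x, y):
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    b1 = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / \
         sum((a - xb) ** 2 for a in x)
    return yb - b1 * xb, b1

x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [1.0, 3.5, 4.0, 7.0, 8.5]
b0, b1 = ols_fit(x, y)
n = len(x)
y_hat = [b0 + b1 * a for a in x]
u_hat = [b - h for b, h in zip(y, y_hat)]
y_bar = sum(y) / n

avg_resid = sum(u_hat) / n                      # (4.32): should be 0
avg_fitted = sum(y_hat) / n                     # (4.33): should equal y_bar
resid_x = sum(u * a for u, a in zip(u_hat, x))  # (4.34): should be 0
tss = sum((b - y_bar) ** 2 for b in y)
ess = sum((h - y_bar) ** 2 for h in y_hat)
rss = sum(u ** 2 for u in u_hat)                # (4.35): tss = ess + rss
```

Up to floating-point rounding, all four identities hold exactly, because they are consequences of the OLS first-order conditions rather than of the particular data.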
REGRESSION WITH A SINGLE REGRESSOR: HYPOTHESIS TESTS AND CONFIDENCE
INTERVALS
Hypothesis testing for regression coefficients is analogous to hypothesis testing for the population mean: Use the t-statistic to
calculate the p-values and either accept or reject the null hypothesis. Like a confidence interval for the population mean, a 95%
confidence interval for a regression coefficient is computed as the estimator ±1.96 standard errors.

HYPOTHESIS TESTING
First of all, we need to state precisely the null and alternative hypotheses before starting the test. Recall:

H0: β1 = β1,0 vs. H1: β1 ≠ β1,0

STEP 1: Compute the standard error of β̂1, SE(β̂1). Although the formula is complicated, in applications the standard error is computed by regression software.

STEP 2: Compute the t-statistic:

t = (β̂1 − β1,0) / SE(β̂1)

STEP 3: Compute the p-value: in other words, the probability of observing a value of β̂1 at least as different from β1,0 as the estimate actually obtained, assuming that the null hypothesis is correct.
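The three steps can be sketched numerically. The slope −2.28 is the California class-size estimate quoted in this topic; the standard error 0.52 is the value reported for that regression in Stock and Watson, used here as an assumption since these notes do not print it.

```python
import math

# Hypothetical inputs: beta1_hat = -2.28, SE = 0.52 (assumed), H0: beta1 = 0.
beta1_hat, se_beta1, beta1_null = -2.28, 0.52, 0.0

t_stat = (beta1_hat - beta1_null) / se_beta1  # STEP 2

def std_normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# STEP 3: two-sided p-value from the large-sample normal approximation
p_value = 2.0 * (1.0 - std_normal_cdf(abs(t_stat)))
```

Since |t| far exceeds 1.96, the p-value is far below 0.05 and H0 is rejected, matching the STATA-based discussion below.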

EXAMPLE: TEST SCORES AND STR, CALIFORNIA DATA
Using STATA we obtain the regression table, and we observe that there are three equivalent ways to reject or not reject the null hypothesis:

Looking at the t-statistic: if its absolute value is bigger than 1.96, we reject.

Looking at the p-value: if it's lower than 0.05, we reject.

Looking at the confidence interval: if 0 (the value under the null hypothesis) is not included, then we reject.

CONFIDENCE INTERVALS: In 95% of all samples that might be
drawn, the confidence interval will contain the true value of
the population parameter.

At the same time, it can be defined as the set of values that can't be rejected using a two-sided hypothesis test with a 5% significance level.
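A matching sketch of the "estimator ± 1.96 standard errors" rule, using the same hypothetical numbers as the test above (the standard error 0.52 is an assumed value, not printed in these notes):

```python
# 95% confidence interval for beta1: beta1_hat +/- 1.96 * SE(beta1_hat)
beta1_hat, se_beta1 = -2.28, 0.52

ci_low = beta1_hat - 1.96 * se_beta1
ci_high = beta1_hat + 1.96 * se_beta1
# 0 lies outside [ci_low, ci_high], so H0: beta1 = 0 is rejected at the 5% level
```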

RECALL!! When X is binary, the regression model can be used to estimate and test hypotheses about the difference between the
population means of the “X=0” and the “X=1” group.

Our only assumption about the distribution of Ui conditional on Xi is that it has a mean of zero (the first least squares assumption). If, furthermore, the variance of this conditional distribution does NOT depend on Xi, then the errors are said to be homoskedastic.

We’re going to discuss:
Its theoretical implications: What are heteroskedasticity and homoskedasticity?
The simplified formulas for the standard errors of the OLS estimators: Mathematical implications
The risks you run if you use these simplified formulas in practice: What does this mean in practice?
RECALL SOME PROPERTIES (Y = wages, X = years of school):
Var(Y) You calculate the variance of all the population's wages.

Var(Y|X = 12) You just calculate the variance of the group that satisfies the condition X = 12 [we're fixing X, but Y keeps changing].

Var(X) You calculate the variance of all the population's years of school.

Var(X|X = 12) = 0 We're fixing X, so the variance of X is equal to 0.

Therefore, 𝑽𝒂𝒓(𝒀|𝑿) = 𝑽𝒂𝒓(𝜷𝟎 + 𝜷𝟏 𝑿 + 𝑼|𝑿) = 𝑽𝒂𝒓(𝜷𝟎 |𝑿) + 𝑽𝒂𝒓(𝜷𝟏 𝑿|𝑿) + 𝑽𝒂𝒓(𝑼|𝑿) = 𝟎 + 𝟎 + 𝑽𝒂𝒓(𝑼|𝑿) = 𝑽𝒂𝒓(𝑼|𝑿)
The variance of fixed parameters is always 0.

[1] WHAT ARE HETEROSKEDASTICITY AND HOMOSKEDASTICITY?

HOMOSKEDASTICITY: If Var(U|X = x) is constant, that is, the variance of the conditional distribution of U given X does NOT depend on X, then U is said to be homoskedastic. All the conditional distributions are equally wide.

HETEROSKEDASTICITY: If Var(U|X = x) is NOT constant, that is, the variance of the conditional distribution of U given X depends on X, then U is said to be heteroskedastic. For example, the conditional distribution of Ui may spread out as x increases.

EXAMPLE 1 [NOT SURE, MAYBE HOMOSKEDASTICITY]: California test scores
EXAMPLE 2 [HETEROSKEDASTICITY]: Average hourly earnings vs. years of education
First, on average, the longer you study, the higher the
wage you’ll have.

Analysis of the variance of the conditional distribution of U given X:
If you go to school for fewer than ten years, your wage will be around 0–20.

If you go for more than ten years, your wage will be between 0 and 60.

Therefore, the conditional distribution of Ui spreads out as x increases.

[2] MATHEMATICAL IMPLICATIONS
Heteroskedasticity and homoskedasticity concern only the variance, in other words, the standard error (SE) and all the values calculated using the SE (t-statistic, confidence intervals…).

HETEROSKEDASTICITY (example): Var(U|X = 1) = 15, Var(U|X = 2) = 25. The conditional variance changes with X.

HOMOSKEDASTICITY (example): Var(U|X = 1) = Var(U|X = 2) = 15. The conditional variance is the same at every value of X.

Under homoskedasticity, because U and X are independent, we can express Var(v) = Var(X)·Var(U), where v = (X − μx)·U, and the variance of the estimator simplifies to:

Var(β̂1) = Var(X)·Var(U) / (n·Var(X)²) = Var(U) / (n·Var(X))
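A Monte Carlo sketch of this simplified formula, on a hypothetical homoskedastic design (normal X with variance 4, normal U with variance 1, true slope 0.5):

```python
import random

random.seed(2)

def slope(x, y):
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    return sum((a - xb) * (b - yb) for a, b in zip(x, y)) / \
           sum((a - xb) ** 2 for a in x)

n, var_x, var_u = 100, 4.0, 1.0
reps = 2000
draws = []
for _ in range(reps):
    x = [random.gauss(0.0, var_x ** 0.5) for _ in range(n)]
    u = [random.gauss(0.0, var_u ** 0.5) for _ in range(n)]  # homoskedastic
    y = [3.0 + 0.5 * xi + ui for xi, ui in zip(x, u)]
    draws.append(slope(x, y))

mean_b = sum(draws) / reps
var_b = sum((d - mean_b) ** 2 for d in draws) / reps
theory = var_u / (n * var_x)  # Var(U) / (n * Var(X)) = 0.0025
```

The simulated variance of the slope estimates lands close to the formula's value of 0.0025.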
[3] WHAT DOES THIS MEAN IN PRACTICE?
If the errors are homoskedastic and you
use the heteroskedastic formula for
standard errors (the one we derived), you
are OK.

If the errors are heteroskedastic and you
use the homoskedasticity-only formula
for standard errors, the standard errors
are WRONG.

The two formulas coincide (when n is large) in the special case of homoskedasticity.

The bottom line: you should ALWAYS use the heteroskedasticity-based formulas – these are conventionally called the
heteroskedasticity-robust standard errors.

MAIN IDEA! In general, the error ui is heteroskedastic (that is, the variance of ui at a given value of Xi, 𝑣𝑎𝑟(𝑢𝑖 |𝑋𝑖 = 𝑥), depends on x).

A special case is when the error is homoskedastic (that is 𝑣𝑎𝑟(𝑢𝑖 |𝑋𝑖 = 𝑥) is constant). Homoskedasticity-only standard errors do NOT
produce valid statistical inferences when the errors are heteroskedastic, but heteroskedasticity-robust standard errors do.
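The practical point can be sketched on simulated heteroskedastic data (an assumed design in which the spread of U grows with |X|). In this design the homoskedasticity-only standard error understates the sampling uncertainty, while the Stock–Watson heteroskedasticity-robust formula does not.

```python
import random

random.seed(3)
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
u = [abs(xi) * random.gauss(0, 1) for xi in x]  # Var(U|X) grows with |X|
y = [1.0 + 2.0 * xi + ui for xi, ui in zip(x, u)]

xb, yb = sum(x) / n, sum(y) / n
sxx = sum((a - xb) ** 2 for a in x)
b1 = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / sxx
b0 = yb - b1 * xb
res = [b - (b0 + b1 * a) for a, b in zip(x, y)]

# Homoskedasticity-only SE: s_u^2 / sum((x - xbar)^2)
s2 = sum(r ** 2 for r in res) / (n - 2)
se_homo = (s2 / sxx) ** 0.5

# Heteroskedasticity-robust SE (Stock-Watson form)
num = sum(((a - xb) ** 2) * r ** 2 for a, r in zip(x, res)) / (n - 2)
den = (sxx / n) ** 2
se_robust = ((num / den) / n) ** 0.5
```

Here `se_robust` comes out noticeably larger than `se_homo`, so t-statistics built on the homoskedasticity-only formula would be too large and reject too often.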

EXTRA! If the three least squares assumptions hold AND if the regression errors are homoskedastic, then the OLS estimator is BLUE.

Moreover, if the three least squares assumptions hold, if the regression errors are homoskedastic AND if the regression errors are normally distributed, then the OLS t-statistic computed using homoskedasticity-only standard errors has a Student t distribution when the null hypothesis is true. The difference between the Student t distribution and the normal distribution is negligible if the sample size is moderate or large.

CONCLUSION
Returning to the California test score data set, there is a negative relationship between the student-teacher ratio and test scores, but is this relationship necessarily a causal one? Districts with a lower STR have, on average, higher test scores. But does this mean that reducing the STR will, in fact, increase scores?
There is, in fact, reason to worry that it might not. Hiring more teachers, after all, costs money, so wealthier school districts can
better afford small classes. Moreover, students at wealthier schools also have other advantages over their poorer neighbours,
including better facilities, newer books, and better-paid teachers.

What’s more, California has a large immigrant community; these immigrants tend to be poorer than the overall population, and, in
many cases, their children are not native English speakers. It thus might be that our negative estimated relationship between test
scores and the STR is a consequence of large classes being found in conjunction with many other factors that are, in fact, the
real cause of the lower test scores.

These other factors, or "omitted variables", could mean that the OLS analysis done so far has little value. Indeed, it could be
misleading: changing the STR alone would not change these other factors that determine a child’s performance at school. To
address this problem, we need a method that will allow us to isolate the effect on test scores of changing the STR, holding these
other factors constant. The method is MULTIPLE REGRESSION ANALYSIS.

...