Data Analysis (2016)
Notes in English
University: Universidad Pompeu Fabra (UPF)
Degree: International Business Economics, 1st year
Course: Data Analysis
Year of the notes: 2016
Pages: 38
Upload date: 11/11/2017
Uploaded by: ccastellsfontelles
Data Analysis
Clara Castells
WEEK 1: INTRODUCTION TO STATISTICS
Statistics is the science of obtaining information from numerical data.
Specifically, we focus on applied statistics, which has three main fields of study:
Obtaining data: We collect data to answer specific questions.
Data analysis: We organize and describe the data using graphs, numerical summaries and
mathematical models.
Statistical inference: We interpret the data, draw conclusions that can be applied to a
broader group, and determine how reliable those conclusions are.
Samples
When obtaining data, two key concepts must be taken into account:

A population is the group of individuals that we want to study.
A sample is the part of the population that we actually study.
There are various ways to choose a sample, but some of them can lead to false conclusions. We
divide the methods into biased and unbiased. There are two types of biased sample:

Sample of volunteers: It is made up of people who have a particular interest in the
subject of the study and therefore offer to participate. For example, if a TV programme
asks viewers whether someone should go to prison and this person has many enemies,
the result will clearly not reflect the opinion of the whole population.
Convenience sample: It is made up of the individuals that are easiest to reach. For
example, food-tasting campaigns for a certain type of product.
Unbiased samples are the ones that give every individual of the population the same
probability of being chosen. Depending on the size of the population, we distinguish 3 types:
Simple Random Sample: A sample of size n consists of n individuals of the population
chosen in such a way that they all have the same probability of being selected. The
selection is done at random, either with statistical software or with a table of random
digits.
In this table every entry is equally likely to be any digit between 0 and 9, and the
entries are independent: knowing one part of the table gives no information about the
rest. To use it, we assign a number to every individual, choose a row of the table and
look at the last digits of each number in that row: we select the individual whose
number coincides with those digits, and we repeat this until the sample is complete.
Stratified Random Sample: It divides the population into groups of similar individuals
(strata) and then takes a simple random sample in each stratum; these are combined
to obtain the complete sample.
Multistage Random Sample: It applies simple random sampling in stages: first a city,
then a neighbourhood, then a street and finally a dwelling. The first sample can be
stratified.
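The first two sampling schemes above can be sketched in a few lines of Python. This is only an illustration: the population of 20 numbered individuals, the two strata "A" and "B" and the sample sizes are all hypothetical, and a seeded pseudo-random generator stands in for the table of random digits.

```python
import random

# Hypothetical population: 20 individuals labelled 1..20, split into two strata.
population = list(range(1, 21))
strata = {"A": population[:10], "B": population[10:]}

rng = random.Random(0)  # fixed seed so the draw can be reproduced

# Simple random sample of size 4: every individual has the same chance of selection.
simple_sample = rng.sample(population, 4)

# Stratified random sample: a simple random sample inside each stratum, then combined.
stratified_sample = []
for members in strata.values():
    stratified_sample.extend(rng.sample(members, 2))

print(simple_sample)
print(stratified_sample)
```

Sampling without replacement (`sample`) matches the idea that each individual can appear in the sample only once.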
The survey and its possible problems:
When conducting a survey, in addition to asking questions about what we want to study, we
need to ask about the individual’s characteristics (sex, age, occupation…). Moreover, even
if the selection method is random, some problems can arise:
Undercoverage: Some groups of the population are left out of the process of selecting
the sample. For example, marginalised groups.
Nonresponse: An individual doesn’t want to participate or cannot be contacted.
Response bias: Respondents may lie when asked about unpopular or illegal
behaviours. Poor memory can also play a part.
Wording of the questions: Some questions can cause confusion or lead towards a certain
answer.
Data organization
Once we have the surveys, we have to organize the collected information by creating a database
on the computer. There are some key concepts in this organization.
Individuals: the persons, animals or things that are described in the data set.
Observation or case: in a data set, an individual together with the values of its variables.
In a database, every observation is a row and every variable is a column.
Categorical variable: it indicates to which group or category the individual belongs.
Quantitative variable: it takes numerical values and it is possible to perform
mathematical operations, such as sums or means.
Distribution of a variable: it tells us which values a variable takes and with which
frequency.
Frequency table: It presents the frequencies with which the values or ranges of values
(intervals or classes) of a variable are observed.
Absolute frequency: number of times that we observe a value, or a value within an interval
or class.
Absolute cumulative frequency: Sum of the absolute frequencies of all the values or classes
up to and including the current one. The last cumulative frequency equals the number
of cases; for example, if there are seven cases it will be seven.
Relative frequency: proportion (or percentage) of times that we observe a value, or a value
within an interval or class.
Relative cumulative frequency: Sum of the relative frequencies up to and including the
current class. It shows the proportion of observations with a value below the upper
limit of the interval. For example, if the interval (20,30) has a relative cumulative
frequency of 0,40, then 40% of the values are smaller than 30.
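The four kinds of frequency can be computed together, as in the following minimal sketch. The seven values are hypothetical and ungrouped; for intervals the same logic would apply to classes instead of single values.

```python
from collections import Counter

# Hypothetical data: one value for each of 7 cases.
values = [0, 1, 1, 2, 2, 2, 3]

counts = Counter(values)
n = len(values)

# Each row: (value, absolute, absolute cumulative, relative, relative cumulative)
table = []
cumulative = 0
for value in sorted(counts):
    absolute = counts[value]
    cumulative += absolute
    table.append((value, absolute, cumulative, absolute / n, cumulative / n))

for row in table:
    print(row)
```

The last row's absolute cumulative frequency equals the number of cases (7), and its relative cumulative frequency equals 1.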
WEEK 2: DESCRIPTIVE ANALYSIS OF DATA
Graphs and histograms
The first thing to do with a data set is to describe it. We will first examine graphs and
later other numerical analysis tools. We can distinguish two types of graphs: the
ones used for categorical variables and the ones used for numerical variables.
Categorical variables graphs can be:

Bar charts, if we want to compare the frequencies of different categories.
Pie charts, if we want to show how the whole is divided among all the categories.
Numerical variables graphs can be:

Histograms: graphical representations of a frequency table. To create one we define
intervals of the same length, count how many cases fall in each and draw the
graph. The number of classes must be chosen carefully.
To analyse its shape:
*A histogram is symmetric if the left and the right sides have approximately the same
shape.
*A histogram is asymmetric if one side has a different shape from the other. It is skewed to
the left if the left tail is longer, and skewed to the right if the right tail is longer.
Atypical observations (outliers) must also be identified.

Stemplots: They can be used for small data sets. To build one:
1. Order the values from lowest to highest.
2. Separate each observation into a stem (which contains all the digits except for the
last one) and a leaf (which contains the last digit).
Stems are written vertically from lowest to highest and the leaves are placed next to
their stem (left to right, from lowest to highest).
Note that the unit of the leaves must be specified to avoid errors, that large numbers
can be rounded before being split, that the stems can be divided when there
are too many leaves and that, if we turn the diagram on its side, it resembles a histogram.
Numerical analysis of data: the centre.
For numerical variables, we can describe distributions numerically with the help of a set of
measures. Basically, we can describe the centre and the spread.
To describe the centre, we can use:


The mean: It is obtained by adding all the values and dividing by the number of cases. It
is a good indicator of the centre when the distribution is symmetric.
The median: It is the value of the central observation when the cases are ordered from
lowest to highest, if the number of cases is odd. If the number of cases is even, it is
the mean of the two central cases. We use the median when the distribution
is asymmetric.
The mode: The value with the highest frequency.
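A short sketch with Python's standard `statistics` module shows how the three measures of centre differ on an asymmetric data set. The seven values are hypothetical and were chosen so that one large value pulls the mean upwards.

```python
import statistics

# Hypothetical right-skewed data: one very large value.
data = [2, 3, 3, 4, 5, 6, 40]

mean = statistics.mean(data)      # sum of the values divided by the number of cases
median = statistics.median(data)  # central value of the ordered data (odd number of cases)
mode = statistics.mode(data)      # value with the highest frequency

print(mean, median, mode)
```

Here the mean is 9 while the median is 4, which illustrates why the median is preferred for asymmetric distributions.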
We must also take into account the five-number summary, which describes a numerical
data set. The five numbers are:

The median, or the value that separates the lower 50% of the observations from the upper 50%.
The maximum.
The minimum.
The first quartile (1Q), or the value below which we find 25% of the observations.
The third quartile (3Q), or the value below which we find 75% of the observations.
In addition, we can compute two quantities from these values:

The range: Maximum − Minimum
The interquartile range: 3Q − 1Q
These five numbers also allow us to build the boxplot:
1. Draw an axis and mark the five numbers on it.
2. Draw a “box” that spans the values from 1Q to 3Q.
3. Draw two lines that extend from the box to the maximum and the minimum
values.
4. We also have to watch for outliers: an observation is an outlier if it is greater
than 3Q + (1.5 × interquartile range) or smaller than 1Q − (1.5 × interquartile range).
Outliers are drawn as separate points.
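The five numbers and the 1.5 × interquartile range rule can be checked with a minimal sketch. The eleven values are hypothetical, and the quartile convention is an assumption: `statistics.quantiles` with `method="exclusive"` places the quartiles at positions based on (n + 1), which is one of several conventions in use.

```python
import statistics

# Hypothetical data with one suspiciously large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]

# Quartiles located at positions (n + 1)/4, (n + 1)/2 and 3(n + 1)/4.
q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")

five_numbers = (min(data), q1, median, q3, max(data))
iqr = q3 - q1                     # interquartile range
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print(five_numbers, iqr, outliers)
```

With these values the fences are −6 and 18, so only the observation 30 is flagged as an outlier.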
Numerical analysis of data: The spread and other measures.
To measure the spread we use the standard deviation, which measures the spread
around the mean.
To calculate the standard deviation, follow these steps:
1. Calculate the difference between each observation’s value and the mean.
2. Square each difference.
3. Add all the squared differences.
4. Divide by the number of observations minus 1.
5. Take the square root of the result of step 4 (keeping only the positive root).
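The five steps translate directly into code. In this sketch the eight data values are hypothetical, and the result is compared with `statistics.stdev`, which uses the same n − 1 divisor.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

mean = sum(data) / len(data)
differences = [v - mean for v in data]         # step 1: value minus mean
squared = [d ** 2 for d in differences]        # step 2: square each difference
total = sum(squared)                           # step 3: add the squares
variance = total / (len(data) - 1)             # step 4: divide by n - 1
sd = math.sqrt(variance)                       # step 5: positive square root

print(sd, statistics.stdev(data))
```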
There are also other measures, such as:
p% percentile: the value below which we find p% of the cases.
Coefficient of variation: standard deviation / mean.
Asymmetry measures: (mean − mode) / standard deviation or (mean − median) / standard
deviation.
Kurtosis measures: They measure the degree of concentration of the frequencies
around the mean.
In conclusion, for an asymmetric distribution or one with extreme values we
use the five-number summary. For a symmetric distribution we use the mean and the
standard deviation.
WEEK 3: GROUPED DATA
Grouped data are the data of a numerical variable presented in the form of a
frequency table. We do not know the original information, that is, the data case by case.
Therefore, we need to work with data grouped into intervals or ranges of values.
However, we can still calculate almost every numerical summary and so describe the data set
quite well.
Imagine we have the following data set:
We pay special attention to the 5th column (the blue one), since it is a key tool for calculating
the numerical summaries. The midpoint of an interval is its upper limit (the maximum of the
interval) plus its lower limit (the minimum of the interval), divided by 2. The formula can
be written as follows:
Interval midpoint = (Upper limit + Lower limit)/2
Once we have calculated the midpoints, we can find the five-number summary:
1st Quartile calculation: We take the midpoint of the interval that contains the observation
in position (Number of cases+1)/4.
In this case, we have 280 cases, so 281/4 = 70.25. Therefore, we take 17500
as the value of the first quartile, since position 70.25 lies between observations 70 and
71, located in the third interval.
Median calculation: We take the midpoint of the interval that contains the observation
in position (Number of cases+1)/2.
In this case, we have 280 cases, so 281/2 = 140.5. Therefore, we take 17500 as the
median, since position 140.5 lies between observations 140 and 141, located in
the third interval.
3rd Quartile calculation: We take the midpoint of the interval that contains the observation
in position (Number of cases+1) × 0.75.
In this case, we have 280 cases, so 281 × 0.75 = 210.75. Therefore, we take 25000
as the value of the third quartile, since position 210.75 lies between observations 210
and 211, located in the fourth interval.
Maximum and minimum calculation: We take the lowest value (the lower limit of the
first interval) as the minimum, and the highest value (the upper limit of the last
interval) as the maximum.
We can also calculate the other numerical summaries from the midpoint. We look at the last
column (the orange one), where we have assumed that every value in an interval is
equal to its midpoint, which is why we multiply the midpoint by the absolute
frequency. Once this is done, we can calculate some numerical summaries:
Mean calculation: We add all the values of the orange column and divide by the
total number of observations. This way, we get: 6187500/280 = 22098,21.
Therefore, mean = sum of (midpoint × absolute frequency) over all intervals / number of cases.
Standard deviation calculation: We calculate the standard deviation as we
usually do; but as we do not know the values of the individual observations, we use the
midpoint and multiply each squared difference by the absolute frequency. The rest of
the procedure is the same.
For example, the contribution of the first interval to the standard deviation is:
o Midpoint − mean = 5000 − 22098,21 = −17098,21
o (Midpoint − mean)² = (−17098,21)² = 292348931,76
o (Midpoint − mean)² × Absolute frequency = (−17098,21)² × 15 = 4385233976,4
o We then add the results of all the intervals, divide by the number of
observations minus 1, and take the square root of the result.
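The grouped-data procedure can be sketched as follows. The original table is not reproduced in these notes, so the three intervals and their frequencies below are purely hypothetical; only the method (midpoints weighted by absolute frequency, n − 1 divisor) follows the text.

```python
import math

# Hypothetical frequency table: ((lower limit, upper limit), absolute frequency)
table = [((0, 10), 2), ((10, 20), 5), ((20, 30), 3)]

n = sum(freq for _, freq in table)
midpoints = [(low + high) / 2 for (low, high), _ in table]
freqs = [freq for _, freq in table]

# Mean: every observation in an interval is assumed to equal the midpoint.
mean = sum(m * f for m, f in zip(midpoints, freqs)) / n

# Standard deviation: squared distance of each midpoint to the mean,
# weighted by the absolute frequency, divided by n - 1.
sum_sq = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, freqs))
sd = math.sqrt(sum_sq / (n - 1))

print(midpoints, mean, sd)
```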
Data transformation
If we want to change the units of measurement (from dollars to euros, for example), we must
take into account how this change affects the numerical summaries. Basically, we distinguish
two types of change:
Change of origin: it takes place when we add a number (a constant) to the
original variable or subtract it. Suppose X is our original variable, and a is any constant.
A change of origin of the variable X gives us a transformed
variable, which we call Y. The change is expressed with the following equation: Y=X±a.
The change of origin shifts the graph to the right or left (depending on the sign of a). In
this change, only the position measures change (mean, quartiles…); the spread and the shape
don’t change.
Change of scale: It takes place when we multiply or divide the data by a number. We
suppose X is our original variable, and b is any constant. A change of scale of the X variable
gives us a transformed variable, which we call Y. The change is expressed with the
following equation: Y=X*b (if we multiply) or Y=X/b (if we divide).
This type of change stretches or shrinks the histogram, depending on whether we
multiply or divide. In this change, the position measures (mean, quartiles…) and the
dispersion measures (standard deviation, interquartile range…) vary; only the shape remains
constant.
Linear transformations: The two changes together, which we express with the
equation: Y=(X±a)/b or Y=(X±a)*b. In this change, the position measures (mean,
quartiles…) and the dispersion measures (standard deviation, interquartile range…) vary;
only the shape remains constant.
There is another, less frequent type of transformation: nonlinear
transformations. They are based on nonlinear functions, such as the logarithm or the
exponential, and are used to convert asymmetric distributions into symmetric ones, so that we
can use the numerical summaries that are only appropriate for symmetric distributions (mean,
standard deviation…).
When these transformations are applied, everything changes: shape, spread and position. In
addition, we cannot obtain the new mean and standard deviation from the old ones. That
is, if we apply a logarithmic transformation, the new mean will NOT be the logarithm of
the old mean.
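The effect of each kind of transformation on the mean and the standard deviation can be verified numerically. In this sketch the three data values and the constants a and b are hypothetical, and the base-10 logarithm is just one possible nonlinear transformation.

```python
import math
import statistics

x = [1.0, 10.0, 100.0]   # hypothetical, strongly right-skewed data
a, b = 5.0, 2.0          # arbitrary constants for the example

shifted = [v + a for v in x]           # change of origin: Y = X + a
scaled = [v * b for v in x]            # change of scale:  Y = X * b
logged = [math.log10(v) for v in x]    # nonlinear transformation: Y = log10(X)

mean_x, sd_x = statistics.mean(x), statistics.stdev(x)

# Change of origin: the mean moves by a, the standard deviation is unchanged.
print(statistics.mean(shifted) - mean_x, statistics.stdev(shifted) - sd_x)

# Change of scale: both the mean and the standard deviation are multiplied by b.
print(statistics.mean(scaled) / mean_x, statistics.stdev(scaled) / sd_x)

# Logarithm: the mean of the logs is not the log of the mean.
mean_of_logs = statistics.mean(logged)
log_of_mean = math.log10(mean_x)
print(mean_of_logs, log_of_mean)
```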
OdStatistics allows us to perform all these transformations quickly.
WEEK 4: DENSITY CURVES AND HISTOGRAMS (week 4 slides)
When exploring a numerical or quantitative variable:
1. We draw a graph (a histogram or a stemplot)
2. We analyse the general appearance of the distribution (centre, spread, shape) and the
atypical observations.
3. We choose a numerical summary to describe briefly the centre and the spread of the
distribution.
In addition, histograms with many observations can often be described by a smooth curve. For
this to be possible, the histogram must be regular and, therefore, must satisfy the following:
1. It must be symmetric.
2. Both sides must decrease gradually.
3. There must be no atypical observations or notable gaps.
The density curve (the technical name for the curve just described) is a
mathematical model that provides a good description of the data, even though this description
is idealized, since it ignores the atypical values and the small irregularities.
Finally, note that the histogram depends on the number of classes chosen, while the
density curve does not.
Density curve example
The area under a density curve is exactly equal to 1; that is, the region
under the curve contains the total proportion of observations. This allows us, for example, to
know the proportion of cases below a value or to locate measures of centre such as the median
(which divides the area under the curve into two halves, each containing 50% of the cases). As
with other distributions, the mean and the median coincide if the shape is symmetric, and if it
is asymmetric the mean moves towards the longer tail. We can also locate position measures
(first quartile…).
Normal distribution
Normal density curves are a special class of density curves. They are characterized by:

being symmetric
having a single mode or peak
being bell-shaped
being completely described by their mean 𝜇 and standard deviation 𝜎.
Distributions of this kind are very important, since:

They describe a good part of real data sets well.
They approximate well the results of many random processes.
Many statistical inference procedures are based on their properties.
These curves have two important properties:
1. The mean 𝜇 is located at the centre of the curve.
2. The standard deviation 𝜎 controls the spread of the curve.
In addition, the mean and the standard deviation give us the inflection points of the curve,
which are located at 𝜇 ± 𝜎.
The 68-95-99.7 rule says that:

68% of observations are located between 𝜇 − 𝜎 and 𝜇 + 𝜎.
95% of observations are located between 𝜇 − 2𝜎 and 𝜇 + 2𝜎
99.7% of observations are located between 𝜇 − 3𝜎 and 𝜇 + 3𝜎.
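The three percentages can be recovered from the standard normal curve itself. A minimal check, using `statistics.NormalDist` from the Python standard library in place of a printed table:

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)

# Proportion of the area between -k and +k standard deviations from the mean.
shares = {k: standard.cdf(k) - standard.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.4f}")
```

The exact areas are slightly different from the rounded figures in the rule (for example, about 95.45% rather than 95% within two standard deviations).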
Standard normal distribution
If we want to compare two cases expressed in different systems of measurement, we use a
criterion measured in standard deviations, so that we can tell which case is larger.
This criterion is called the standardized observation (z), and it indicates how many standard
deviations (𝜎) the original observation (x) lies from the mean (µ), and in which direction.
It can be calculated by:
z = (x − 𝜇) / 𝜎
The variable z is a linear transformation of the variable x, so z has mean 0 (the mean µ is 0
standard deviations away from itself) and standard deviation 1. Then, as all normal
distributions share the same properties, we can “standardize” the data and transform any
normal curve N(𝜇, 𝜎) into the standard normal curve N(0,1).
Calculating a value with the standard normal distribution
The standard normal distribution allows us to calculate a percentage (the percentage of cases
below a value x) and a value (the value below which a given percentage of the cases lies). To
do these calculations, we need to standardize and use table A, which we were given
in class.
To calculate a value, for example: in a normal distribution N(72,4) we want to know the value
below which 60% of the class lies. We look in table A for the z whose area is equal or very
close to 0,60, because table A tells us the area of the standard normal distribution to the
left of each z. If we search, we find that the closest entry is
z=0,25, with an area of 0,5987 (59,87% of the observations). Once this is done, we solve for x:
0,25 = (x − 72) / 4
1 = x − 72
x = 73
Therefore, approximately 60% of the cases of this distribution lie below the value 73.
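The same lookup can be done without a table. This sketch assumes, as in the example, that N(72,4) means mean 72 and standard deviation 4, and uses `statistics.NormalDist.inv_cdf` as a stand-in for reading table A backwards.

```python
from statistics import NormalDist

grades = NormalDist(mu=72, sigma=4)

# Value below which 60% of the cases lie, computed directly.
x_exact = grades.inv_cdf(0.60)

# Same calculation via the rounded table entry z = 0.25: x = mu + z * sigma.
x_table = 72 + 0.25 * 4

print(x_exact, x_table)
```

The small difference between the two results comes from rounding z to the two decimals available in the table.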
Calculating a percentage with the standard normal distribution
To calculate a percentage, for example: in a normal distribution N(72,4) we want to know
what percentage of cases have a value greater than 64. First, we need to standardize the value
64:
z = (64 − 72) / 4 = −2
We look for z = −2 in table A and we obtain a value of 0,0228, or 2,28%. This means, as the
table indicates, that 2,28% of the cases are to the left of 64. But remember: we want to know
the percentage of cases with a value greater than 64. Therefore, we compute
100% − 2,28% = 97,72%.
So 97,72% of the cases have a value greater than 64.
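This calculation can be reproduced in a few lines, again assuming N(72,4) means mean 72 and standard deviation 4. The standard normal cumulative distribution function plays the role of table A.

```python
from statistics import NormalDist

mu, sigma = 72, 4
z = (64 - mu) / sigma                 # standardized value: -2.0

share_below = NormalDist().cdf(z)     # area to the left of z in N(0,1)
share_above = 1 - share_below         # area to the right of z

print(z, round(share_below, 4), round(share_above, 4))
```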
Assessing normality
We can check whether a normal distribution is a good approximation to the distribution of our
data by:
Visual diagnosis: symmetric histograms or stemplots without gaps and without
atypical observations.
Numerical diagnosis: the 68-95-99.7 rule and others. We calculate the points 𝜇 − 𝜎,
𝜇 + 𝜎, 𝜇 − 2𝜎, 𝜇 + 2𝜎, 𝜇 − 3𝜎 and 𝜇 + 3𝜎 and count the frequencies between them to
see whether the rule holds.
WEEK 5: DATASETS WITH TWO VARIABLES. CORRELATION AND REGRESSION
Response variable and explanatory variable
We usually find variables that can be related.
A response variable measures a result.
An explanatory variable influences or explains the changes in the response variable.
For example:

The weight and height of a group of people.
The time spent studying for an exam and the grades obtained by a group of people.
How to analyse datasets with two variables:
1. Start with a graph
2. Identify general aspects and main patterns.
3. Use numerical summaries to describe main patterns.
4. See if the patterns of the distribution can be summarized by a smooth or regular
function.
The scatterplot
The first step in finding a relation between two variables is to build a scatterplot, which
lets us describe the general appearance of the relation through:

The form of the relation
We say that the relation between variables is linear if a straight line approximates the
relation.
If the relation can be approximated by another function we say that it is nonlinear.
It is also possible that there is no relation between the variables.

The direction of the relation
The variables are positively associated if above-average values of one variable
tend to come with above-average values of the other variable.
The variables are negatively associated if above-average values of one variable tend to
come with below-average values of the other variable.
The strength of the relation.
Linear associations are especially important.
A linear association is strong if the points of the scatterplot lie close to the
main form (the line that describes the relation).
We can also find outliers.
Correlation
Scatterplot: it lets us identify the direction (positive or negative), the form (linear or
nonlinear) and the strength (fit) of the relation between two variables.
A more objective way to determine the strength of the relation between variables is a
numerical summary: the correlation between the variables.
The correlation coefficient measures the strength and direction of the linear relation
between two numerical variables x and y.
The correlation r is an average of the products of the standardized values of the two variables.
It can also be expressed as:
The numerator is called covariance and is equal to:
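The formulas referred to here did not survive in these notes. As a hedged reconstruction, assuming the usual textbook definitions with the n − 1 divisor (the same divisor used for the standard deviation in Week 2), they would read:

```latex
r = \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i}\, z_{y_i}
  = \frac{1}{n-1}\sum_{i=1}^{n}
    \left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)
  = \frac{s_{xy}}{s_x\, s_y},
\qquad
s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})
```

Here s_x and s_y are the standard deviations of x and y, and s_xy, the numerator of the last expression for r, is the covariance.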
Correlation: properties


r has no units, since zx and zy have none.
r takes values from −1 to 1.
The linear association is stronger when r is near −1 or 1.
If r = −1 or r = 1, all the points of the scatterplot fall on a line.
r only measures the strength of linear associations. An r near 0 indicates that there
is no linear association, but there could be a nonlinear association.
r is strongly affected by a few outliers.
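As a small numerical illustration of these properties, the sketch below computes r as an average of products of standardized values and checks that it agrees with the covariance divided by the two standard deviations. The five (x, y) pairs are hypothetical, and the n − 1 divisor is an assumption consistent with the standard deviation used earlier.

```python
import math

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n
s_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
s_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))

# Standardized values (no units).
z_x = [(v - mean_x) / s_x for v in x]
z_y = [(v - mean_y) / s_y for v in y]

# Correlation: average (with divisor n - 1) of the products of standardized values.
r = sum(a * b for a, b in zip(z_x, z_y)) / (n - 1)

# Covariance: average of the products of deviations from the means.
covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

print(r, covariance / (s_x * s_y))
```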
Covariance
It computes an average of the products of deviations with respect to the mean.
In the previous diagram, given that the most frequent (and larger) products are positive, the
covariance will be positive.
What does covariance measure?
It measures how two variables vary together linearly.
If it is positive, there is a positive linear association.
If it is negative, there is a negative (inverse) linear association.
If it is approximately 0, there is no linear association.
But the value of the covariance depends not only on the strength of the association
but also on the units of measurement (it is a product of deviations).
We have to normalize the measure to be able to interpret it better.
Linear regression
If an explanatory variable x and a response variable y are linearly associated, we want to
summarize their relation by a line.
This line could be used to predict the values of y.
Linear relation: y = a + bx
Motivation:
Simplicity.
Frequently found in practice.
When a variable causes or explains another one:
Response variable: the variable that we want to explain.
Explanatory variable: the variable that we use to explain the response variable.
Example. We want to explain family consumption:
Response variable: consumption.
Explanatory variable: income.
Residual or error = observed value − predicted value.
We want to make the residuals or errors as small as possible (this gives us the a and b of the
line with the best fit):
Prediction:
We must be careful not to use the regression line to predict outside the range of values of
the explanatory variable x.
Given a value of x that is not in the sample, we compute the predicted value of y.
How much gas would be used if 40,000 families used heating (x = 40, in thousands)?
1.23 + 0.202 (40) = 9.31, so 9.31 million cubic metres.
Coefficient of determination
R² = r²
The squared correlation coefficient shows the percentage of the variation of y that we can
explain with the variation of x.
The closer it is to 1, the better the line fits; the closer to 0, the worse the fit.
A word of caution:
Association does not imply causality: if we find a strong correlation between the mathematics
grade and the language grade, we cannot state that one grade causes the other one.
WEEK 6: DATASETS WITH TWO VARIABLES (I)
Residual analysis
As our data are not aligned on a perfect line, there are some residuals. The predicted line
has the form ŷ = a + bx (it is important to put ^ above the y because the value is predicted
by the line; it is not the exact value that we observe).
Residuals: prediction errors (observed (real) value − value predicted by the line). Points
above the line have positive residuals, and points below the line have negative residuals.
Residual plot: we represent the value of the residuals on the vertical axis and the value
of the explanatory variable on the horizontal axis (a scatterplot of the regression residuals
against the explanatory variable).
Here we observe that the residuals are always distributed around their mean, which is always
0.
The sum of the residuals is always 0, however spread out the data are, because the line
is computed so as to balance the distances of the points. The residuals have to be distributed
without any special pattern above and below the horizontal axis.
The positive residuals above the line have to balance the negative residuals below it,
because their sum is always 0.
Residual = yi − ŷi
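The claim that the residuals add up to 0 can be checked numerically. This sketch assumes the line is fitted by least squares and uses five hypothetical data points; the residuals are the vertical coordinates that a residual plot would show against x.

```python
# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares slope and intercept (assumed fitting criterion).
b = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y)) / sum((u - mean_x) ** 2 for u in x)
a = mean_y - b * mean_x

predicted = [a + b * u for u in x]
residuals = [obs - pred for obs, pred in zip(y, predicted)]  # observed minus predicted

print(residuals)
print(sum(residuals))  # zero up to floating-point rounding
```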
Influential observations:
Outliers with respect to x and y.
A point is an outlier depending on how far it is from the mean of x and the mean of y. If it is
very far from the mean of x, it is an outlier in x, and if it is very far from the mean of y,
it is an outlier in y.
An outlier is influential if the regression line changes markedly when we remove it.
How can we check whether an observation is influential or not?
The point (82,93) is an outlier. But is it influential? We have to check. If the line changes a
lot, it is influential.
To check, we have to run the regression with the point and without it:
Here we see that the point (82,93) is an influential outlier because, without it, the
regression line changes a lot to fit the other values of the graph.
When the points removed are not outliers, the regression line doesn't change much; they are
not influential. When a point is not influential, the line gets shorter but keeps its form.
However, when a point is influential, the line moves and changes direction or form.
When we eliminate an outlier with respect to y...
We have our regression line, with an outlier (19) with respect to y (because it is far from the
mean of y).
And we analyse what happens when we eliminate it:
We see that when we eliminate the outlier, the regression line changes, but not much
compared with the example above, so this outlier is not influential: the regression
line stays close to the initial one.
Outliers with respect to y don't produce large changes, but an outlier with respect to x is
usually influential.
Excessive spread: median trace
Median trace: sometimes the variability of Y is so large that it hides the central pattern.
The median trace divides the X axis into several intervals and computes the median of Y for
each interval. This reveals the pattern, and with it the relation between X and Y.
How is it done?
Divide the horizontal axis into equally sized sections and compute the median (or the mean,
for the mean trace) of the dependent variable in each section.
With this graph we cannot see a clear relation, so we compute the median trace, dividing the
horizontal axis into 5 sections.
Then we join the medians of the sections to see a pattern, and with this pattern we can see
the relation between X and Y.
We can divide the X axis into as many sections as we want, for example 5 or 4, depending on
what we are looking for.
In this graph we can see that 90 km/h is the optimal speed because it has the lowest
consumption. So we have found a relation between X and Y and can answer some questions about
the data.
Nonlinear relations:
When we have a nonlinear relation, we can transform the variables with logarithms in
order to obtain and study the relation between the variables X and Y.
A math digression: some properties of logarithms.
1. Exponential function y=e^x. We also write y= exp(x).
2. We define the natural logarithm, written ln, as the inverse of this function: ln(y) = ln
(exp(x)) = x. It is only defined for y > 0.
We have to take into account the elasticity. But why?
An elasticity tells us the effect of the explanatory variable on the dependent variable in
percentage terms.
The result comes from the properties of logarithms.
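As a sketch of why logarithms lead to elasticities, assume a regression in which both variables are in logarithms (this log-log model is an assumption, since the original formula is not shown in these notes; a and b play the same role as in the line y = a + bx):

```latex
\ln y = a + b \ln x
\quad\Longrightarrow\quad
b = \frac{d \ln y}{d \ln x} = \frac{dy / y}{dx / x}
  \approx \frac{\%\,\Delta y}{\%\,\Delta x}
```

Under this assumption, a 1% increase in x is associated with a change of approximately b% in y, so the slope b is the elasticity.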
Problems with regression and correlation:
Extrapolation: predicting very far away from the sample, for example when trying to predict
growth. It is the process of estimating, beyond the
original observation range, the value of a variable on the basis of its relationship with
another variable. It is similar to interpolation, which produces estimates between known
observations, but extrapolation is subject to greater uncertainty and a higher risk of
producing meaningless results. We should predict close to the sample.
Predictions can be very wrong if we use values of the explanatory variable that are very
different from the ones in the sample.
Example: a sample of 1 and 2 year old boys.
Height (cm)=45+20*Age(years)

Height predicted for 1 year old: 45+20*1=65 cm
Height predicted for 2 year old: 45+20*2=85 cm
Height predicted for 18 year old: 45+20*18=405 cm
* Extrapolating gets a prediction for an 18 year old of 4 meters 5 cm.
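A small sketch of the same example that also flags when a prediction is an extrapolation. The function name predict_height and the default sample range are illustrative choices, taken from the ages in the example.

```python
def predict_height(age_years, sample_ages=(1, 2)):
    """Apply Height = 45 + 20 * Age and flag ages outside the sample range."""
    height_cm = 45 + 20 * age_years
    extrapolated = not (min(sample_ages) <= age_years <= max(sample_ages))
    return height_cm, extrapolated

inside = predict_height(2)     # age within the observed range
outside = predict_height(18)   # far beyond the observed range
```

The formula still returns a number for age 18, but the flag warns that the value lies outside the range the regression was fitted on.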
Latent variables: variables that you do not observe but that may be affecting the values of your
regression. They are not directly observed but are rather inferred (through a mathematical model)
from other variables that are observed (directly measured).
There are 2 possible effects:

A relation is suggested but it is false.
A relation is hidden.
Using means of the variables: means are useful for detecting relations, but the relation will look
stronger than it really is, because averaging removes the variability between individuals. If you
use individual data, the relation will be weaker.
Association and causality: Correlation implies association, but not causation. Conversely,
causation implies association, but not necessarily correlation.
When an article says that causation was found, this means that the researchers found
that changes in one variable they measured directly caused changes in the other.
When researchers find a correlation, which can also be called an association, what they are
saying is that they found a relationship between two or more variables. Correlations can be
positive, so that as one variable goes up, so does the other; or they can be negative, which
would mean that as one variable goes up the other goes down.
Association should not be confused with causality; if X causes Y, then the two are associated
(dependent). However, associations can arise between variables both in the presence
(i.e., X causes Y) and in the absence (i.e., they have a common cause) of a causal relationship.
WEEK 7: DATASETS WITH TWO VARIABLES (II)
Relation between two categorical variables
1. Summarize the information in a frequency table that provides the joint distribution of the 2
variables.
It also provides the marginal distributions (the distribution of each variable alone).
It can be useful to represent the marginal distributions with bar charts (even though they do not
show relations between the 2 variables).
2. Calculate the conditional distributions (establish conditions and see the proportions in which
they happen). The process is as follows:
1st We establish a condition: we focus only on one type of individual (for example, children
whose parents both smoke).
2nd We do the same thing for the other categories (for example, families in which
only one parent smokes or families in which neither of them smokes).
Summary of the example:
18.75% of the sample smokes.
Group with two smoking parents: 22.56%
Group with one smoking parent: 18.56%
Group with no smoking parent: 14.9%
As there are differences, we can state that there is a relation between the 2 variables: children
of smoking parents smoke more than children of non-smoking parents.
*It does not matter which variable we use as the condition.
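The two steps can be sketched as follows. The counts are hypothetical and chosen for easy arithmetic; they are not the data behind the percentages of the example.

```python
# Hypothetical counts (not the data behind the percentages above).
# Rows: parents' smoking status; columns: does the child smoke?
table = {
    "both parents smoke": {"smokes": 40, "does not smoke": 160},
    "one parent smokes":  {"smokes": 30, "does not smoke": 170},
    "no parent smokes":   {"smokes": 10, "does not smoke": 90},
}

def conditional_distribution(row):
    """Proportions of each child category within one group of parents."""
    total = sum(row.values())
    return {category: count / total for category, count in row.items()}

conditional = {group: conditional_distribution(row)
               for group, row in table.items()}

# Marginal distribution of the child's variable (all groups pooled).
grand_total = sum(sum(row.values()) for row in table.values())
marginal_smokes = sum(row["smokes"] for row in table.values()) / grand_total
```

With these counts the conditional proportions of smokers are 20%, 15% and 10%, against a marginal 16%; the differences between groups are what suggest a relation.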
Latent variables
The presence of a latent variable can cause a relation or association between two observed
categorical variables to change and even reverse.
Example: In a university, more men are accepted than women. From the calculations, we can state
that there is a relation between sex and admission. However, there is a latent variable called
“studies”, which takes two values: Physics and Chemistry. If we take this variable into account,
the relation changes.
Simpson's paradox
We say that Simpson's paradox takes place when a relation or association that exists in all or
some groups changes direction when the data are combined into a single group.
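A numerical sketch of the paradox, using the admission setting of the example. All counts are hypothetical and were chosen so that the reversal appears.

```python
# Hypothetical admissions data: (applicants, admitted) by studies and sex.
data = {
    "Physics":   {"men": (80, 48), "women": (20, 14)},
    "Chemistry": {"men": (20, 4),  "women": (80, 24)},
}

def rate(applicants, admitted):
    """Proportion of applicants who are admitted."""
    return admitted / applicants

# Admission rate inside each field of study.
by_field = {field: {sex: rate(*counts) for sex, counts in sexes.items()}
            for field, sexes in data.items()}

# Admission rate when both fields are combined into a single group.
overall = {}
for sex in ("men", "women"):
    applicants = sum(data[field][sex][0] for field in data)
    admitted = sum(data[field][sex][1] for field in data)
    overall[sex] = rate(applicants, admitted)
```

In each field women have the higher admission rate (70% vs 60% and 30% vs 20%), yet in the combined data men do (52% vs 38%), because most men apply to the field with the higher admission rate.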
Summary:
If the distributions of a variable X conditional on the different values of a variable Y are very
different, we can say that there is a relation between the 2 variables.
Otherwise, we don’t have evidence of a relation.
*In large samples, a very small difference indicates that there is no relation. In small samples it
is difficult to say.
One numerical and one categorical variable
To analyse this kind of relations, we will build and analyse:
Graphical summaries of the numerical variable for each possible value of the categorical.
Numerical summaries of the numerical variable for each possible value of the categorical.
Example: Money spent during the weekend depending on the sex.
When using OdStats, we will use 1Num1Cat
We must look not only at the mean spending but also at the standard deviation (a greater standard
deviation for men indicates that women’s spending is more concentrated).
*It can be useful to present boxplots instead of the summary table to make the comparison easier
to understand.
When the categorical variable can be ordered, we can talk about an association between the
numerical and the categorical variable (which can be positive or negative). Example of an
ordinal categorical variable: level of studies.
Two numerical and one categorical variable
We will analyse the relation between the two numerical variables separately for each value of the
categorical variable.
Example: There are 100 employees. We want to analyse the relation between their gross annual
salary, the years they have been working for the company and the department where they work.
Relation between their gross salary and the years they have been working for the
company.
Is this relation different in every department?
We draw a scatterplot and look at the regression line, either without taking the categorical
variable into account or, with OdStats, taking it into account.
When the results of the regressions are very different for the different values of the categorical
variable, we can state that the relation depends on the categorical variable.
WEEK 8: TIME SERIES
Time series
A time series is a collection of data on a variable, ordered chronologically. The series can
have annual, quarterly, monthly or daily periodicity, or even hourly or by the minute, as in the
stock market. Time series help us make statistical predictions.
Having historical information helps us make economic decisions.

If we observe a more or less systematic behavior of a variable over time, it is logical to
think that this behavior will continue in the future. This observation is the basis of
statistical forecasting.
Components of a time series
Trend (T): Long-term behavior of the series.
Cycle (C): Medium-term behavior, generally associated with the economic cycle.
Seasonality (S): Short-term behavior (generally within a year) that repeats over longer periods.
It is associated with the climate or with social habits.
Irregular (I): Short-term factor, one-off and unpredictable, not explained by the other
components.
Classical time series theory assumes that every time series is the result of combining these
components, in an additive, multiplicative or mixed form.

Additive: $Y_t = T_t + C_t + S_t + I_t$
Multiplicative: $Y_t = T_t \cdot C_t \cdot S_t \cdot I_t$
(it can be transformed into additive: $\ln(Y_t) = \ln(T_t) + \ln(C_t) + \ln(S_t) + \ln(I_t)$)
Mixed: $Y_t = T_t \cdot C_t + S_t + I_t$ or $Y_t = T_t \cdot C_t \cdot S_t + I_t$, etc.
Example of time series.
*How to find the components of a series
This is important for two reasons:
1. To know which behavior causes the variations of a series (for example, if we are told that
unemployment has decreased 0.7 points since August, how do we know whether this is an
improvement in the trend or a seasonal factor?).
2. To predict the future behavior of the series.
There are two methods:
Function fitting: to find a linear trend, we fit a regression of the series on time; the regression
line is the function that describes the trend.
If the time series looks nonlinear, it can be transformed by taking the log of each
value.
Moving average: isolates the trend and cycle components (both together).
This technique consists of calculating the mean of n consecutive periods of the original series;
each new value of the moving average drops the oldest value and incorporates a new
one. For example:
Year | Original values | Centered moving average of order 3 | Centered moving average of order 5
2007 | 10 | - | -
2008 | 11 | (10+11+12)/3 = 11 | -
2009 | 12 | (11+12+13)/3 = 12 | (10+11+12+13+14)/5 = 12
2010 | 13 | (12+13+14)/3 = 13 | -
2011 | 14 | - | -

The previous table shows how to calculate centered moving averages of odd order (each average
corresponds to the central period). For even orders, an extra step is needed:
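Year | Original values | Non-centered moving average of order 4 | Centered moving average of order 4
2007 | 10 | - | -
2008 | 11 | - | -
(between 2008 and 2009) | | (10+11+12+13)/4 = 11.5 |
2009 | 12 | - | (11.5+12.5)/2 = 12
(between 2009 and 2010) | | (11+12+13+14)/4 = 12.5 |
2010 | 13 | - | -
2011 | 14 | - | -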

As we can see, we first calculate the non-centered moving averages and then obtain the
centered moving average as the mean of two consecutive non-centered averages.
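Both cases (odd and even order) can be sketched with two short functions. The function names are illustrative; the data are the five values 10 to 14 for 2007-2011 used in the tables.

```python
def moving_average(values, order):
    """Non-centered moving averages: the mean of every run of `order`
    consecutive values."""
    return [sum(values[i:i + order]) / order
            for i in range(len(values) - order + 1)]

def centered_moving_average(values, order):
    """Centered moving averages aligned with the original periods
    (None where no average can be computed)."""
    averages = moving_average(values, order)
    if order % 2 == 0:
        # Even order: each average falls between two periods, so take the
        # mean of every pair of consecutive averages to center it.
        averages = moving_average(averages, 2)
    pad = order // 2
    return [None] * pad + averages + [None] * pad

values = [10, 11, 12, 13, 14]  # 2007-2011

ma3 = centered_moving_average(values, 3)
ma4 = centered_moving_average(values, 4)
ma5 = centered_moving_average(values, 5)
```

The None entries mark the periods at both ends of the series for which no centered average exists, which is why a longer order loses more observations.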
The advantage of the moving average is that it follows the actual trend of the data; it does not
depend on a fitted line.
Its disadvantage is that it cannot be used to predict.
Short-term components
If the series is additive, $Y_t = T_t + C_t + S_t + I_t$. Therefore: $S_t + I_t = Y_t - (T_t + C_t)$
And since the moving average ($MM_t$) captures the cycle and trend components, we obtain:
$S_t + I_t = Y_t - MM_t$
If the series is multiplicative, either we apply a logarithmic transformation before starting, or
we compute:
$S_t \cdot I_t = Y_t / MM_t$
Example of a calculation of S+I:
We have an S+I series that mixes the Seasonality (S) and Irregular (I) components.
Is there any way of isolating only the Seasonality (S) component? Yes, there is.
Seasonal components of the same period repeat.
For example: in a quarterly series, the spring component is always the same, year after year.
In the same way, the summer components are all the same, etc.
Seasonal component calculation:
We may be interested in knowing the effect of the seasonal component of a time series. As
it repeats over time, we can isolate it with the following process:
1. We isolate the trend and cycle components with the moving average method.
2. We find the seasonal and irregular components together with the following
operation (additive series):
S + I = Y - (T + C) = Original value - Moving average of order n
To eliminate the irregular component from this result (and thus obtain the seasonal component),
we assume that, as the irregular component is random (we cannot predict it), its mean is 0 in an
additive series and 1 in a multiplicative one. Averaging the S + I values of the same season over
all the years therefore leaves the seasonal component.
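A minimal sketch of this process for an additive quarterly series. The data are hypothetical: a linear trend plus a fixed seasonal pattern and no irregular component, so the result can be checked by eye. The function names are illustrative.

```python
def centered_ma4(values):
    """Centered moving average of order 4; None at the two ends."""
    raw = [sum(values[i:i + 4]) / 4 for i in range(len(values) - 3)]
    centered = [(a + b) / 2 for a, b in zip(raw, raw[1:])]
    return [None, None] + centered + [None, None]

def seasonal_components(values):
    """Estimate S for each quarter of an additive quarterly series."""
    trend_cycle = centered_ma4(values)
    by_quarter = [[] for _ in range(4)]
    for t, (y, mm) in enumerate(zip(values, trend_cycle)):
        if mm is not None:
            by_quarter[t % 4].append(y - mm)   # S + I = Y - MM
    # Averaging removes the irregular component, whose mean is assumed to be 0.
    return [sum(group) / len(group) for group in by_quarter]

# Hypothetical quarterly series (3 years): trend 100 + 2t plus a seasonal pattern.
pattern = [5, -3, -6, 4]
series = [100 + 2 * t + pattern[t % 4] for t in range(12)]

seasonal = seasonal_components(series)
```

Because the hypothetical series has no irregular component, the estimate recovers the pattern exactly; with real data the averages would only approximate it.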
Prediction:
If we use a function to predict the trend component and we know the seasonal component,
we can predict the future values of the series. For example, for an additive series:
Y = (Trend function) + S
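As a sketch, assume a linear trend function and quarterly seasonal components. The intercept, slope and seasonal values below are hypothetical, and t counts quarters starting from 0.

```python
def forecast(t, intercept, slope, seasonal, period=4):
    """Additive forecast: linear trend at time t plus that season's component."""
    return intercept + slope * t + seasonal[t % period]

# Hypothetical trend function 100 + 2t and seasonal components per quarter.
seasonal = [5, -3, -6, 4]
next_q1 = forecast(12, 100, 2, seasonal)   # first quarter of the next year
next_q2 = forecast(13, 100, 2, seasonal)   # second quarter of the next year
```

For a multiplicative series the seasonal component would multiply the trend instead of being added to it.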
WEEK 9: MEASURES OF INEQUALITY AND CONCENTRATION
Inequality and concentration
The inequality and/or concentration indexes provide a summarized measure of the degree of
inequality or concentration.
Inequality and concentration are related concepts.
The Gini index
Example: Imagine an inheritance of 100 million € that is divided among 3 families in the following
way:
Family | Inheritance per person (xi) | Family members (ni) | Family total inheritance (xi·ni) | Percentage of heirs | Percentage of inheritance
A | 2 | 2 | 4 | 20% | 4%
B | 1 | 7 | 7 | 70% | 7%
C | 89 | 1 | 89 | 10% | 89%
We want to calculate a number that indicates the inequality degree of the inheritance
distribution.
This number has to be located in a scale in which the minimum corresponds to the maximum
equality situation (everyone receives the same inheritance), while the maximum corresponds to
complete inequality situation.
Procedure:
1 Order the families, starting with the one in which the individual inheritance is the lowest.
2 We call P the cumulative population percentage (heirs) and Q the cumulative
wealth percentage (inheritance).
3 We calculate the differences P - Q and add them up.
Family | Heirs percentage | Inheritance percentage | P | Q | P - Q
B | 70 | 7 | 70 | 7 | 63
A | 20 | 4 | 90 | 11 | 79
C | 10 | 89 | 100 | 100 | 0
We compare this with the maximum possible inequality situation:
Family | Heirs percentage | Inheritance percentage | P | Q | P - Q
B | 70 | 0 | 70 | 0 | 70
A | 20 | 0 | 90 | 0 | 90
C | 10 | 100 | 100 | 100 | 0
We define the Lorenz-Gini index as the quotient between the sum of the observed P - Q differences
and the sum of the P - Q differences in the maximum inequality situation.
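The whole procedure can be sketched for the inheritance example. The function name lorenz_gini is illustrative, and the maximum inequality situation is taken, as in the example, to be the one where the last group receives everything.

```python
def lorenz_gini(groups):
    """groups: list of (members, wealth per person).
    Returns sum(P - Q) divided by the same sum under maximum inequality."""
    groups = sorted(groups, key=lambda g: g[1])   # lowest wealth per person first
    total_people = sum(n for n, _ in groups)
    total_wealth = sum(n * x for n, x in groups)
    p = q = 0.0
    observed = maximum = 0.0
    for n, x in groups[:-1]:   # the last group always has P = Q = 100
        p += 100 * n / total_people
        q += 100 * n * x / total_wealth
        observed += p - q
        maximum += p           # under maximum inequality Q stays at 0 here
    return observed / maximum

# (members, inheritance per person) for families A, B and C.
families = [(2, 2), (7, 1), (1, 89)]
gini = lorenz_gini(families)
```

For this example the observed differences add up to 63 + 79 = 142 and the maximum ones to 70 + 90 = 160, so the index is 142/160 = 0.8875.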
Lorenz curve
The red line represents the actual inequality. The further apart the red and the blue lines are,
the higher the inequality is.
We observe that the inequality is very high.
The difference index
Another way to measure inequality is to compare the income (or any other characteristic) of
every pair of individuals in the population.
Family (i) | Members (ni) | Wealth (xi)
B | 7 | 1
A | 2 | 2
C | 1 | 89
Now we compare this value with the one that would result in the maximum inequality case.
Family (i) | Members (ni) | Wealth (xi)
B | 7 | 0
A | 2 | 0
C | 1 | 100
It also ranges between 0 (maximum equality) and 1 (maximum inequality).
The indexes are relative measures (they are dimensionless) and are invariant to proportional
changes in the analysed variable.
Concentration index
Where si is the market share of company i. The values of the index lie between k/n and 1.
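The formula itself is not reproduced in these notes. The sketch below assumes the usual k-firm concentration ratio, that is, the sum of the k largest market shares, which is consistent with the stated bounds k/n and 1. The market shares are hypothetical.

```python
def concentration_ratio(shares, k):
    """Sum of the k largest market shares (shares must add up to 1)."""
    return sum(sorted(shares, reverse=True)[:k])

shares = [0.4, 0.3, 0.2, 0.1]   # hypothetical market with n = 4 firms
c2 = concentration_ratio(shares, 2)

# Lower bound: with n equal firms the index equals k/n.
equal_c2 = concentration_ratio([0.25] * 4, 2)
```

Under this assumption the two largest firms hold 70% of the hypothetical market, while four equal firms would give 2/4 = 0.5.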
Herfindahl concentration index
Where si is the market share of firm i and n is the total number of companies.
The index takes a minimum value equal to 1/n and a maximum value equal to 1 (monopoly).
Properties:

Unambiguous behaviour: given 2 markets, the H index can say which of the 2 markets
is more concentrated.

Scale invariance: the H index depends only on the market shares, so the units in which firm size
is measured do not affect its calculation.

Transfer: the H measure increases when the market share of a small firm decreases in
favour of a big firm.

Monotonicity: if the n firms have identical market shares, the H measure decreases
as n increases.
Cardinality: If we divide each firm into k equal firms, the H measure decreases
in the same proportion.
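The formula is not reproduced in these notes. The sketch below assumes the standard definition of the Herfindahl index, the sum of the squared market shares, which matches the bounds 1/n and 1 given above. The market shares and the helper split_each_firm are hypothetical.

```python
def herfindahl(shares):
    """Sum of squared market shares (shares must add up to 1)."""
    return sum(s ** 2 for s in shares)

def split_each_firm(shares, k):
    """Replace every firm by k equal firms, each with share s / k."""
    return [s / k for s in shares for _ in range(k)]

shares = [0.5, 0.3, 0.2]                       # hypothetical market
h = herfindahl(shares)                         # 0.25 + 0.09 + 0.04

h_monopoly = herfindahl([1.0])                 # maximum value
h_equal = herfindahl([0.25] * 4)               # minimum value for n = 4: 1/n
h_transfer = herfindahl([0.6, 0.3, 0.1])       # small firm loses share to the big one
h_split = herfindahl(split_each_firm(shares, 2))
```

With these numbers the transfer raises H from 0.38 to 0.46, and splitting every firm into 2 equal firms halves H, which illustrates the transfer and cardinality properties under the assumed definition.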
...