Data Analysis (2016). Notes (English)
WEEK 1: INTRODUCTION TO STATISTICS
Statistics is the science of obtaining information from numerical data.
Specifically, we focus on applied statistics, which has three main fields of study:
- Obtaining data: we collect data to answer specific questions.
- Data analysis: we organize and describe the data using graphs, numerical summaries and mathematical models.
- Statistical inference: we interpret the data, draw conclusions that can be applied to a broader group, and determine how reliable those conclusions are.
Samples
When obtaining data, two key concepts must be taken into account:
- A population is the whole group of individuals that we want to study.
- A sample is the part of the population that we actually study.
There are several ways to choose a sample, but some of them can lead to false conclusions. We divide the methods into biased and non-biased.
There are two types of biased sample:
- Volunteer sample: made up of people who have a particular interest in the subject of the study and therefore offer to participate. For example, if a TV programme asks viewers whether someone should go to prison and that person has many enemies, the result will clearly not reflect the opinion of the whole population.
- Convenience sample: made up of the individuals who are easiest to reach. For example, tasting campaigns for a certain type of food.
Non-biased samples give every individual in the population the same probability of being selected. Depending on the size of the population, we distinguish three types:
- Simple random sample: a sample of size n consists of n individuals selected so that they all have the same probability of being chosen. The selection is made at random, either with statistical software or with a table of random digits.
This table contains independent digits (knowing one part of the table gives no information about the rest), and each entry is equally likely to be any digit from 0 to 9. To use it, we assign a number to every individual, choose a row of the table and look at the last digits of each number in it: an individual is selected when his or her number coincides with those digits, and we repeat this until the sample is complete.
- Stratified random sample: the population is divided into groups of similar individuals (strata), a simple random sample is drawn within each stratum, and the results are combined into one complete sample.
- Multistage random sample: simple random sampling is applied in stages, first to cities, then to neighbourhoods, then to streets and finally to dwellings. The first stage can be stratified.
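The simple and stratified random samples described above can be sketched in a few lines of Python. This is only an illustration: the population of 100 numbered individuals, the two strata "A" and "B" and the function names are assumptions made up for the example, and the random number generator plays the role of the random digits table.

```python
import random

# Hypothetical population: 100 individuals labelled 1..100,
# each assigned to a made-up stratum ("A" or "B").
population = list(range(1, 101))
stratum_of = {i: ("A" if i <= 60 else "B") for i in population}


def simple_random_sample(individuals, n, rng):
    """Draw n distinct individuals, each equally likely to be chosen."""
    return rng.sample(individuals, n)


def stratified_random_sample(individuals, stratum_of, n_per_stratum, rng):
    """Draw a simple random sample inside each stratum and combine them."""
    strata = {}
    for ind in individuals:
        strata.setdefault(stratum_of[ind], []).append(ind)
    sample = []
    for label in sorted(strata):
        sample.extend(rng.sample(strata[label], n_per_stratum))
    return sample


rng = random.Random(0)  # fixed seed so that the run is reproducible
srs = simple_random_sample(population, 10, rng)
strat = stratified_random_sample(population, stratum_of, 5, rng)
```

Taking the same number of individuals from every stratum is just one possible design; sampling each stratum in proportion to its size is another.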
Data Analysis Clara Castells
The survey and its possible problems
When designing a survey, in addition to the questions about what we want to study, we need to ask about the individual's characteristics (sex, age, occupation…). Moreover, even if the selection method is random, some problems can arise:
- Undercoverage: some groups of the population are left out of the sample selection process. For example, marginalized groups.
- Nonresponse: an individual does not want to participate or cannot be contacted.
- Response bias: respondents may lie when asked about unpopular or illegal behaviour. Faulty memory can also play a part.
- Wording of the questions: some questions can cause confusion or steer the respondent towards a particular answer.
Data organization
Once we have the surveys, we organize the collected information by creating a database on the computer. Some key concepts in this organization are:
- Individuals: the persons, animals or things described in the data set. In a database, each individual occupies a row and each variable a column.
- Observation or case: in a data set, an individual together with the values of its variables. In a database, every observation is a row.
- Categorical variable: it indicates to which group or category the individual belongs.
- Quantitative variable: it takes numerical values on which arithmetic operations, such as sums or means, make sense.
- Distribution of a variable: it tells us which values a variable takes and how often it takes them.
- Frequency table: it shows how often each value or range of values (interval or class) of a variable is observed.
- Absolute frequency: the number of times we observe a value, or a value inside an interval or class.
- Absolute cumulative frequency: the sum of the absolute frequencies of all values or classes up to and including the current one. The last cumulative frequency equals the total number of cases; for example, with seven cases it is seven.
- Relative frequency: the proportion (or percentage) of times that we observe a value, or a value inside an interval or class.
- Relative cumulative frequency: the sum of the relative frequencies up to and including the current class. It gives the proportion of cases whose value lies below the upper limit of the interval. For example, if the interval (20, 30) has a relative cumulative frequency of 0.40, then 40% of the values are smaller than 30.
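As a small illustration of the four kinds of frequency, here is a sketch that builds a frequency table for seven made-up cases (the values are hypothetical, chosen only to keep the arithmetic visible).

```python
from collections import Counter
from itertools import accumulate

values = [2, 3, 3, 1, 2, 3, 1]  # hypothetical data: seven cases

counts = Counter(values)
categories = sorted(counts)                       # distinct values, in order
absolute = [counts[c] for c in categories]        # absolute frequencies
cumulative = list(accumulate(absolute))           # absolute cumulative frequencies
n = len(values)
relative = [f / n for f in absolute]              # relative frequencies
relative_cumulative = [c / n for c in cumulative]  # relative cumulative frequencies

for row in zip(categories, absolute, cumulative, relative, relative_cumulative):
    print(row)
```

The last absolute cumulative frequency equals the number of cases, and the last relative cumulative frequency equals 1.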
WEEK 2: DESCRIPTIVE ANALYSIS OF DATA
Graphs and histograms
The first thing to do with a data set is to describe it. We first examine graphs and then other numerical analysis tools. We distinguish two types of graph: those for categorical variables and those for numerical variables.
Graphs for categorical variables:
- Bar charts, if we want to compare the frequencies of the different categories of a variable.
- Pie charts, if we want to show how all the categories share out the whole.
Graphs for numerical variables:
- Histograms: graphical representations of a frequency table. To build one, we define intervals of equal width, count the cases in each interval and draw the bars. The number of classes must be chosen with care.
To analyse the shape:
* A histogram is symmetric if the left and right sides have approximately the same shape.
* A histogram is asymmetric (skewed) if one side has a different shape from the other. It is skewed to the left if the left tail is longer, and skewed to the right if the right tail is longer.
Atypical observations (outliers) must also be identified.
- Stemplots: suitable for small data sets. To build one:
1. Sort the values from lowest to highest.
2. Split each observation into a stem (all the digits except the last) and a leaf (the last digit).
Stems are written vertically from lowest to highest, and the leaves are placed next to their stem, from left to right in increasing order.
The unit of the leaves must be stated to avoid errors. Note also that large numbers can be rounded before plotting, that stems can be split when they have too many leaves, and that a stemplot turned on its side resembles a histogram.
Numerical analysis of data: the centre.
For numerical variables, we can describe distributions numerically with the help of a set of measures. Basically, we can describe the centre and the spread.
To describe the centre, we can use:
- The mean: obtained by adding all the values and dividing by the number of cases. It is a good indicator of the centre when the distribution is symmetric.
- The median: the value of the central observation when the cases are sorted from lowest to highest, if the number of cases is odd. If the number of cases is even, it is the mean of the two central observations. We use the median when the distribution is asymmetric.
- The mode: the value with the highest frequency.
We must also take into account the five-number summary, which describes a numerical data set. The five numbers are:
- The median, the value below which 50% of the observations lie.
- The maximum.
- The minimum.
- The first quartile (Q1), the value below which 25% of the observations lie.
- The third quartile (Q3), the value below which 75% of the observations lie.
In addition, two measures can be derived from these values:
- The range: maximum − minimum.
- The interquartile range (IQR): Q3 − Q1.
The five numbers also allow us to build the boxplot:
1. Draw an axis and mark the five numbers on it.
2. Draw a box from Q1 to Q3.
3. Draw two lines from the box out to the maximum and the minimum values.
4. Watch out for outliers: an observation is an outlier if it is greater than Q3 + 1.5 × IQR or smaller than Q1 − 1.5 × IQR. Outliers are drawn as separate points.
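The five-number summary and the 1.5 × IQR rule can be checked with Python's standard `statistics` module. The data below are hypothetical. Quartile conventions differ between textbooks and programs; this sketch assumes the "exclusive" method, which places the quartiles at positions (n + 1)/4, (n + 1)/2 and 3(n + 1)/4.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30]  # hypothetical values

# Three cut points that split the sorted data into four parts.
q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
minimum, maximum = min(data), max(data)

data_range = maximum - minimum
iqr = q3 - q1

# Fences of the 1.5 * IQR rule; anything beyond them is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print(minimum, q1, median, q3, maximum, outliers)
```

With these eleven values the quartile positions are 3, 6 and 9, so the quartiles coincide with observed values and only 30 falls outside the fences.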
Numerical analysis of data: The spread and other measures.
To measure the spread we use the standard deviation, which measures the spread around the mean.
To calculate the standard deviation, follow these steps:
1. Calculate the difference between each observation and the mean.
2. Square each difference.
3. Add all the squared differences.
4. Divide the sum by the number of observations minus 1.
5. Take the square root of the result of step 4 (keeping only the positive root).
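The five steps translate directly into code. The data are made up for the example, and the result is compared with `statistics.stdev`, which also divides by the number of observations minus 1.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

mean = sum(data) / len(data)
differences = [x - mean for x in data]        # step 1
squared = [d ** 2 for d in differences]       # step 2
total = sum(squared)                          # step 3
variance = total / (len(data) - 1)            # step 4
sd = math.sqrt(variance)                      # step 5

print(sd, statistics.stdev(data))
```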
There are also other measures, such as:
- Percentile p: the value below which p% of the cases lie.
- Coefficient of variation: standard deviation / mean.
- Measures of asymmetry (skewness): (mean − mode) / standard deviation or (mean − median) / standard deviation.
- Kurtosis: it measures how concentrated the frequencies are around the mean.
In conclusion, for a distribution that is asymmetric or has extreme values we use the five-number summary; for a symmetric distribution we use the mean and the standard deviation.
WEEK 3: GROUPED DATA
Grouped data are the data of a numerical variable presented as a frequency table. We do not know the original information, that is, the data case by case.
Therefore, we have to work with data grouped in intervals or ranges of values.
Even so, we can calculate almost every numerical summary and describe the data set quite well.
Imagine we have the following data set:
We pay special attention to the 5th column (the blue one), since it is the key tool for calculating the numerical summaries. The midpoint of an interval is the sum of its upper limit (the maximum of the interval) and its lower limit (the minimum of the interval), divided by 2:
Interval midpoint = (Upper limit + Lower limit) / 2
Once we have the midpoints, we can find the five-number summary:
- 1st quartile: we take the midpoint of the interval that contains the observation in position (number of cases + 1) / 4.
In this case we have 280 cases, so 281 / 4 = 70.25. We therefore take 17500 as the first quartile, since position 70.25 lies between observations 70 and 71, which are in the third interval.
- Median: we take the midpoint of the interval that contains the observation in position (number of cases + 1) / 2.
In this case we have 280 cases, so 281 / 2 = 140.5. We therefore take 17500 as the median, since position 140.5 lies between observations 140 and 141, which are in the third interval.
- 3rd quartile: we take the midpoint of the interval that contains the observation in position (number of cases + 1) × 0.75.
In this case we have 280 cases, so 281 × 0.75 = 210.75. We therefore take 25000 as the third quartile, since position 210.75 lies between observations 210 and 211, which are in the fourth interval.
- Maximum and minimum: the minimum is the lower limit of the first interval and the maximum is the upper limit of the last interval.
The other numerical summaries can also be calculated from the midpoints. In the last column (the orange one) we assume that every value in an interval is equal to its midpoint, which is why we multiply the midpoint by the absolute frequency. With this we can calculate:
- Mean: we add all the values of the orange column and divide by the total number of observations: 6187500 / 280 = 22098.21.
That is, mean = sum of (midpoint × absolute frequency) over all intervals / number of cases.
- Standard deviation: it is calculated in the usual way, but since we do not know the individual values we use the midpoints and weight each squared deviation by the absolute frequency. The rest of the procedure is the same.
For example, the contribution of the first interval is:
o Midpoint − mean = 5000 − 22098.21 = −17098.21
o (Midpoint − mean)² = 292348931.76 (using the unrounded mean)
o (Midpoint − mean)² × absolute frequency = 292348931.76 × 15 = 4385233976.4
o We then add the results for all the intervals, divide by the number of observations minus 1, and take the square root.
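The whole grouped-data procedure can be sketched as follows. The frequency table here is hypothetical (it is not the 280-case table of the notes, which is not reproduced in this text), and the midpoint is taken as the average of the two limits of each interval. The helper name `midpoint_at` is an illustrative choice.

```python
import math

# Hypothetical frequency table: (lower limit, upper limit, absolute frequency).
table = [(0, 10, 4), (10, 20, 8), (20, 30, 5), (30, 40, 3)]

n = sum(freq for _, _, freq in table)
midpoints = [(lower + upper) / 2 for lower, upper, _ in table]
freqs = [freq for _, _, freq in table]


def midpoint_at(position):
    """Midpoint of the first interval whose cumulative frequency reaches the position.

    A position that falls between two intervals is assigned to the later one,
    which is one possible convention.
    """
    cumulative = 0
    for mid, freq in zip(midpoints, freqs):
        cumulative += freq
        if position <= cumulative:
            return mid
    return midpoints[-1]


q1 = midpoint_at((n + 1) / 4)
median = midpoint_at((n + 1) / 2)
q3 = midpoint_at((n + 1) * 0.75)

# Every case in an interval is treated as if it were equal to the midpoint.
mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
variance = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, freqs)) / (n - 1)
sd = math.sqrt(variance)

print(q1, median, q3, mean, sd)
```

With 20 cases the quartile positions are 5.25, 10.5 and 15.75, which fall in the second, second and third intervals respectively.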
Data transformation
If we want to change the units of measurement (from dollars to euros, for example), we must take into account how the change affects the numerical summaries. We distinguish two basic types of change:
- Change of origin: we add a constant to, or subtract it from, the original variable. Suppose X is the original variable and a is any constant. A change of origin of X gives a transformed variable, which we call Y. The change is expressed by the equation Y = X ± a.
A change of origin shifts the graph to the right or to the left (depending on the sign of a). Only the measures of position (mean, quartiles…) change; the spread and the shape do not.
- Change of scale: we multiply or divide the data by a constant. Suppose X is the original variable and b is any constant. A change of scale of X gives a transformed variable, which we call Y. The change is expressed by the equation Y = X × b (if we multiply) or Y = X / b (if we divide).
This type of change stretches or compresses the histogram, depending on whether we multiply or divide. Both the measures of position (mean, quartiles…) and the measures of spread (standard deviation, interquartile range…) change; only the shape stays the same.
- Linear transformation: the two changes combined, expressed by the equation Y = (X ± a) / b or Y = (X ± a) × b. Again the measures of position and of spread change, and only the shape stays the same.
There is another, less frequent type of transformation: non-linear transformations. They are based on non-linear functions, such as the logarithm or the exponential, and are used to turn asymmetric distributions into symmetric ones, so that we can use the numerical summaries that are only appropriate for symmetric distributions (mean, standard deviation…).
When these transformations are applied, everything changes: shape, spread and position. In addition, the new mean and standard deviation cannot be obtained from the old ones. For example, after a logarithmic transformation the new mean will NOT be the logarithm of the old mean.
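A quick numerical check of these statements, using made-up data and the arbitrary constants a = 5 and b = 2:

```python
import math
import statistics

x = [10.0, 20.0, 30.0, 40.0, 100.0]  # hypothetical data
a, b = 5.0, 2.0                      # arbitrary constants for the example

shifted = [v + a for v in x]         # change of origin: Y = X + a
scaled = [v * b for v in x]          # change of scale:  Y = X * b
logged = [math.log(v) for v in x]    # non-linear (logarithmic) transformation

# Change of origin: the mean moves by a, the standard deviation does not move.
print(statistics.mean(shifted), statistics.stdev(shifted))
# Change of scale: mean and standard deviation are both multiplied by b.
print(statistics.mean(scaled), statistics.stdev(scaled))
# Logarithm: the mean of the logs differs from the log of the mean.
print(statistics.mean(logged), math.log(statistics.mean(x)))
```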
OdStatistics can perform all these transformations quickly.
WEEK 4: DENSITY CURVES AND HISTOGRAMS (week 4 slides)
When exploring a numerical or quantitative variable:
1. We draw a graph (a histogram or a stemplot).
2. We analyse the overall pattern of the distribution (centre, spread, shape) and the atypical observations.
3. We choose a numerical summary to describe briefly the centre and the spread of the distribution.
In addition, a histogram with many observations can sometimes be described by a smooth curve. For this, the histogram must be regular, which means that:
1. It must be symmetric.
2. Both sides must fall off gradually.
3. There must be no atypical observations or notable gaps.
The density curve (the technical name for this smooth curve) is therefore a mathematical model that gives a good description of the data, although an idealized one, since it ignores atypical values and small irregularities.
Finally, the histogram depends on the number of classes chosen, whereas the density curve does not.
Density curve example
The area under a density curve is exactly 1; that is, the region under the curve contains the total proportion of observations. This lets us, for example, find the proportion of cases below a given value or locate measures of centre such as the median (which divides the area under the curve into two halves, each containing 50% of the cases). As with any distribution, the mean and the median coincide if the shape is symmetric, and if it is asymmetric the mean moves towards the longer tail. We can also locate measures of position (first quartile…).
Normal distribution
Normal density curves are a special class of density curves. They are characterized by:
- being symmetric,
- having a single mode or peak,
- being bell-shaped,
- being completely described by their mean μ and standard deviation σ.
Distributions of this kind are very important because:
- They describe some real data sets well.
- They approximate well the results of many random processes.
- Many statistical inference procedures are based on their properties.
These curves have two important properties:
1. The mean μ is located at the centre of the curve.
2. The standard deviation σ controls the spread of the curve.
In addition, the mean and the standard deviation give the inflection points of the curve, which are located at μ ± σ.
The 68-95-99.7 rule says that, approximately:
- 68% of the observations lie between μ − σ and μ + σ.
- 95% of the observations lie between μ − 2σ and μ + 2σ.
- 99.7% of the observations lie between μ − 3σ and μ + 3σ.
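The three percentages of the rule can be reproduced with `statistics.NormalDist` from the Python standard library (available from Python 3.8). The helper name `share_within` is just an illustrative choice.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The rule uses rounded figures: the proportion within two standard deviations is slightly above 95%.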
Standard normal distribution
If we want to compare two cases measured on different scales, we use a criterion expressed in standard deviations, so that we can tell which case is relatively larger.
This criterion is called the standardized observation (z). It indicates how many standard deviations (σ) the original observation (x) lies from the mean (μ), and in which direction.
It is calculated as:
z = (x − μ) / σ
The variable z is a linear transformation of the variable x, so the z-value of the mean μ is 0 (it lies 0 standard deviations from itself) and the standardized variable has standard deviation 1. Since all normal distributions share the same properties, we can standardize the data and transform any normal curve N(μ, σ) into the standard normal curve N(0, 1).
Calculating a value with the standard normal distribution
The standard normal distribution lets us calculate a percentage (the percentage of cases below a value x) and a value (the value below which a given percentage of the cases lie). For these calculations we need to standardize and to use table A, which was handed out in class.
To calculate a value, for example: in a normal distribution N(72, 4) we want to know below which value 60% of the class lies. We look in table A for the z whose area is equal or very close to 0.60, because table A gives, for each z, the area of the standard normal distribution to its left. The closest entry is z = 0.25, with an area of 0.5987 (59.87% of the observations). We then solve for x:
0.25 = (x − 72) / 4
1 = x − 72
x = 73
Therefore, approximately 60% of the cases of this distribution lie below the value 73.
Calculating a percentage with the standard normal distribution
To calculate a percentage, for example: in a normal distribution N(72, 4) we want to know what percentage of cases have a value greater than 64. First we standardize the value 64:
z = (64 − 72) / 4 = −2
We look up z = −2 in table A and obtain 0.0228, or 2.28%. As the table indicates, this means that 2.28% of the cases lie to the left of 64. But remember: we want the percentage of cases with a value greater than 64. Therefore we compute 100% − 2.28% = 97.72%.
So 97.72% of the values are greater than 64.
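Both worked examples can be verified without table A, using `statistics.NormalDist` from the Python standard library. The sketch treats N(72, 4) as mean 72 and standard deviation 4, as the calculations above do.

```python
from statistics import NormalDist

grades = NormalDist(mu=72, sigma=4)

# Value below which 60% of the cases lie (inverse of the cumulative area).
x_60 = grades.inv_cdf(0.60)

# Standardized value of 64 and proportion of cases greater than 64.
z_64 = (64 - 72) / 4
share_above_64 = 1 - grades.cdf(64)

print(round(x_60, 2), z_64, round(share_above_64, 4))
```

The exact value is slightly above 73 because the table entry z = 0.25 is only the closest available approximation to an area of 0.60.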
Assessing normality
We can check whether a normal distribution is a good approximation to the distribution of our data by:
- Visual diagnosis: histograms or stemplots that are symmetric, without gaps and without atypical observations.
- Numerical diagnosis: the 68-95-99.7 rule, among others. We calculate the points μ − σ and μ + σ, μ − 2σ and μ + 2σ, μ − 3σ and μ + 3σ, and count the frequencies between them to see whether the rule holds.
WEEK 5: DATASETS WITH TWO VARIABLES. CORRELATION AND REGRESSION
Response variable and explanatory variable
We often find variables that may be related.
- A response variable measures an outcome.
- An explanatory variable influences or explains the changes in the response variable.
For example:
- The weight and height of a group of people.
- The time spent studying for an exam and the grades obtained by a group of people.
How to analyse datasets with two variables:
1. Start with a graph.
2. Identify the overall pattern and the main features.
3. Use numerical summaries to describe the main patterns.
4. Check whether the pattern can be summarized by a smooth or regular function.
The scatterplot
The first step in looking for a relation between two variables is to draw a scatterplot, which lets us describe the overall pattern of the relation through:
- The form of the relation.
We say that the relation between the variables is linear if a straight line approximates it.
If the relation can be approximated by another function, we say that it is non-linear.
It is also possible that there is no relation between the variables.
- The direction of the relation.
The variables are positively associated if above-average values of one variable tend to come with above-average values of the other.
The variables are negatively associated if above-average values of one variable tend to come with below-average values of the other.
- The strength of the relation.
Linear associations are especially important.
A linear association is strong if the points of the scatterplot lie close to the main pattern (the line that describes the relation).
We can also find outliers.
Correlation
The scatterplot lets us identify the direction (positive or negative), the form (linear or non-linear) and the strength (fit) of the relation between two variables.
A more objective way to measure the strength of the relation is a numerical summary: the correlation between the variables.
The correlation coefficient r measures the strength and direction of the linear relation between two numerical variables x and y.
The correlation r is an average of the products of the standardized variables:
r = (1 / (n − 1)) Σ z_x z_y
It can also be expressed as:
r = s_xy / (s_x s_y)
where s_x and s_y are the standard deviations of x and y. The numerator is called the covariance and is equal to:
s_xy = (1 / (n − 1)) Σ (x_i − x̄)(y_i − ȳ)
Correlation: properties
- r has no units, since z_x and z_y have none.
- r takes values from −1 to 1.
- The linear association is stronger the closer r is to 1 or −1.
- If r = 1 or r = −1, all the points of the scatterplot fall exactly on a line.
- r only measures the strength of linear associations. An r near 0 indicates that there is no linear association, but there could still be a non-linear association.
- r is strongly affected by a few outliers.
Covariance
It is an average of the products of the deviations from the means.
In the previous diagram, given that the most frequent (and largest) products are positive, the covariance will be positive.
What does covariance measure? It measures whether two variables are linearly associated and in which direction.
- If it is positive, there is a positive linear association.
- If it is negative, there is an inverse (negative) linear association.
- If it is approximately 0, there is no linear association.
But the value of the covariance depends not only on the strength of the association but also on the units of measurement (it is a product of deviations).
We have to normalize it to interpret it better, which is what the correlation does by dividing by the standard deviations.
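The dependence of the covariance on the units, and the fact that the correlation removes it, can be seen in a small sketch. The data are hypothetical, and the divisor n − 1 is an assumption chosen to match the standard deviation procedure of Week 2.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Average product of deviations from the means (divisor n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def std_dev(values):
    # The variance is the covariance of a variable with itself.
    return math.sqrt(covariance(values, values))


def correlation(xs, ys):
    """Covariance normalized by the two standard deviations."""
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical data
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

cov_xy = covariance(xs, ys)
r = correlation(xs, ys)

# Same data with x expressed in units 100 times smaller (e.g. metres -> centimetres).
xs_rescaled = [100 * x for x in xs]
cov_rescaled = covariance(xs_rescaled, ys)
r_rescaled = correlation(xs_rescaled, ys)

print(cov_xy, cov_rescaled, r, r_rescaled)
```

The covariance is multiplied by 100 when the units change, while r stays the same.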
Linear regression
If an explanatory variable x and a response variable y are linearly associated, we want to summarize their relation with a line.
This line can be used to predict the values of y.
Linear relation: y = a + bx
Motivation:
- Simplicity.
- Frequently found in practice.
When one variable causes or explains another:
- Response variable: the variable that we want to explain.
- Explanatory variable: the variable that we use to explain the response variable.
Example. We want to explain family consumption:
- Response variable: consumption.
- Explanatory variable: income.
Residual or error = observed value − predicted value.
We want to make the residuals or errors as small as possible; this gives us the a and b of the line with the best fit.
Prediction: given a value of x that is not in the sample, we compute the predicted value of y. We must be careful not to use the regression line to predict outside the range of values of the explanatory variable x.
How much gas would be used if 40,000 families used heating? With x measured in thousands of families, 1.23 + 0.202 × 40 = 9.31, so 9.31 million cubic metres.
Coefficient of determination: R² = r²
The squared correlation coefficient gives the proportion of the variation of y that can be explained by the variation of x.
The closer it is to 1, the better the line fits; the closer to 0, the worse the fit.
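A minimal sketch of fitting the line and computing R², assuming that "as small as possible" is understood in the least-squares sense (minimizing the sum of squared residuals), which is the usual criterion. The five data points and the function name `fit_line` are hypothetical; the last line simply repeats the gas prediction from the text.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b of the line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical explanatory variable
ys = [2.0, 4.0, 5.0, 4.0, 5.0]  # hypothetical response variable

a, b = fit_line(xs, ys)
predicted = [a + b * x for x in xs]

# R^2: share of the variation of y explained by the line.
mean_y = sum(ys) / len(ys)
ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

# Gas example from the notes: a = 1.23, b = 0.202, x = 40 (thousand families).
gas = 1.23 + 0.202 * 40

print(a, b, r_squared, gas)
```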
A word of caution: association does not imply causality. If we find a strong correlation between the mathematics grade and the language grade, we cannot state that one grade causes the other.
WEEK 6: DATASETS WITH TWO VARIABLES (I)
Residual analysis
As our data are not aligned on a perfect line, there are residuals. The fitted line has the form ŷ = a + bx (the ^ above the y is important: it marks the value predicted by the line, not the exact value that we observe).
Residual: the prediction error, that is, the observed (real) value minus the value predicted by the line. Points above the line have positive residuals, and points below the line have negative residuals.
Residual plot: we plot the residuals on the vertical axis against the explanatory variable on the horizontal axis (a scatterplot of the regression residuals against the explanatory variable).
In this plot the residuals are always distributed around their mean, which is always 0.
The sum of the residuals is always 0, however spread out the data are, because the line is computed so as to balance the distances of the points above and below it. The residuals should be scattered above and below the horizontal axis without any particular pattern.
The positive residuals (above the line) and the negative residuals (below the line) must balance each other exactly, since their sum is always 0; the number of points on each side does not have to be the same.
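The zero-sum property of the residuals can be checked numerically. The sketch assumes a least-squares line and uses five hypothetical points.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical data
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# Least-squares slope b and intercept a.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
a = mean_y - b * mean_x

# Residual = observed value - predicted value.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
total = sum(residuals)

# The totals above and below the line cancel; the counts need not match.
n_above = sum(1 for e in residuals if e > 0)
n_below = sum(1 for e in residuals if e < 0)

print(residuals, total, n_above, n_below)
```

In this example two points lie above the line and three below it, yet the residuals still add up to zero (apart from floating-point rounding).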
In symbols, the residual of case i is e_i = y_i − ŷ_i.
Influential observations: outliers with respect to x and y
Observations are outliers depending on how far they are from the mean of x and the mean of y. A point very far from the mean of x is an outlier in x, and a point very far from the mean of y is an outlier in y.
An outlier is influential if removing it changes the regression line markedly.
How can we check whether a point is influential? The point (82, 93) is an outlier. But is it influential? We have to check. If the line changes a lot, it is influential.
To check, we run the regression with the point and without it:
Here we see that the point (82, 93) is an influential outlier, because once it is removed the regression line changes a lot to fit the remaining points.
When the points removed are not outliers, the regression line does not change much; they are not influential. In that case the line gets shorter but keeps its form.
However, when they are influential, the line moves and changes its direction or form.
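The "with and without" check can be sketched as follows. The points are invented for the illustration (they are not the data behind the (82, 93) example), with one point placed far from the others in x, and the line is assumed to be fitted by least squares.

```python
def slope_and_intercept(points):
    """Least-squares slope and intercept for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    b = sxy / sxx
    return b, mean_y - b * mean_x


# Hypothetical points close to a line, plus one point far away in x.
base = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
outlier = (15.0, 2.0)

b_without, a_without = slope_and_intercept(base)
b_with, a_with = slope_and_intercept(base + [outlier])

# A large change in the slope marks the point as influential.
slope_change = abs(b_with - b_without)
print(b_without, b_with, slope_change)
```

Here the single distant point pulls the slope from about 1 down to almost 0, so it is influential.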
When we eliminate an outlier with respect to y...
We have a regression line with an outlier (19) with respect to y (it is far from the mean of y).
We analyse what happens when we eliminate it: the regression line changes, but not much compared with the previous example, so the point is a non-influential outlier.
Outliers with respect to y do not usually change the line much, whereas an outlier with respect to x is usually influential.
Excessive spread: median trace
Median trace: sometimes the variability of Y is so large that we cannot see where its centre lies.
The median trace divides the X axis into several intervals and computes the median of Y within each one. The resulting medians reveal the pattern, and with it the relation between X and Y.
- How is it done? Divide the horizontal axis into equally sized sections and compute the median (or the mean, for the mean trace) of the dependent variable in each section.
From this graph we cannot make out a clear relation, so we build the median trace: we divide the horizontal axis into 5 sections and then join the medians of the sections to reveal a pattern. With this pattern we can see the relation between X and Y.
We can divide the X axis into as many sections as we want, for example 5 or 4, depending on what we are looking for.
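A minimal sketch of the median trace. The speed and consumption figures are hypothetical, the function name `median_trace` is an illustrative choice, and three sections are used only to keep the example short.

```python
import statistics


def median_trace(xs, ys, k):
    """Split the x range into k equal-width sections and return
    (section centre, median of y) for every non-empty section."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / k
    sections = [[] for _ in range(k)]
    for x, y in zip(xs, ys):
        # The largest x would land in section k, so it is put in the last one.
        idx = min(int((x - lo) / width), k - 1)
        sections[idx].append(y)
    trace = []
    for i, section in enumerate(sections):
        if section:
            centre = lo + (i + 0.5) * width
            trace.append((centre, statistics.median(section)))
    return trace


# Hypothetical data: speed (km/h) and fuel consumption (l/100 km).
speed = [40, 50, 60, 70, 80, 90, 100, 110, 120, 130]
consumption = [9.0, 8.0, 7.5, 6.5, 6.0, 6.2, 7.0, 7.8, 8.5, 9.5]

trace = median_trace(speed, consumption, 3)
print(trace)
```

Joining the medians of consecutive sections gives the trace; in this made-up example the middle section has the lowest median consumption.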
In this graph we can see that 90 km/h is the optimal speed, because it has the lowest consumption. We have thus found a relation between X and Y and can answer some questions about the data.
Non-linear relations
When we have a non-linear relation, we can use logarithms to transform the variables, in order to find and study the relation between X and Y.
A math digression: some properties of logarithms.
1. Exponential function: y = e^x. We also write y = exp(x).
2. We define the natural logarithm, written ln, as the inverse of this function: ln(y) = ln(exp(x)) = x. It is only defined for y > 0.
We also have to take elasticity into account. Why? An elasticity tells us the effect of the explanatory variable on the dependent variable in percentage terms.
This result follows from the properties of logarithms.
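One standard way to see the link between logarithms and elasticity, sketched here under the assumption (not stated in the notes) that the relation has the power form y = c · x^β: taking logarithms gives ln(y) = ln(c) + β · ln(x), a straight line whose slope β is the elasticity. The values c = 3 and β = 0.5 are made up for the example.

```python
import math

# Hypothetical power relation y = 3 * x**0.5 (elasticity 0.5).
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [3.0 * x ** 0.5 for x in xs]

# Transform both variables with the natural logarithm.
log_x = [math.log(x) for x in xs]
log_y = [math.log(y) for y in ys]

# Least-squares line on the transformed data: ln(y) = intercept + slope * ln(x).
n = len(xs)
mean_lx = sum(log_x) / n
mean_ly = sum(log_y) / n
slope = sum((u - mean_lx) * (v - mean_ly) for u, v in zip(log_x, log_y)) / sum(
    (u - mean_lx) ** 2 for u in log_x
)
intercept = mean_ly - slope * mean_lx

print(slope, math.exp(intercept))
```

The slope recovers the exponent 0.5: a 1% increase in x goes with approximately a 0.5% increase in y.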
Problems with regression and correlation: - Extrapolation: means predicting very far away from your sample. You are trying to predict the growth. You must predict close to your sample. It is the process of estimating, beyond the original observation range, the value of a variable on the basis of its relationship with another variable. It is similar to interpolation, which produces estimates between known observations, but extrapolation is subject to greater uncertainty and a higher risk of producing meaningless results.
Predictions can be very wrong if we use values for the explanatory variable which are very different from the ones in the sample.
Example: a sample of 1 and 2 year old boys.
Height (cm) = 45 + 20 * Age (years)
- Height predicted for a 1 year old: 45 + 20*1 = 65 cm
- Height predicted for a 2 year old: 45 + 20*2 = 85 cm
- Height predicted for an 18 year old: 45 + 20*18 = 405 cm
* Extrapolating gives a predicted height of 4 meters 5 cm for an 18 year old.
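The same example in Python, with a simple check that flags predictions outside the sampled ages. The function names and the range check are illustrative additions, not part of the notes.

```python
# Regression line from the example: Height (cm) = 45 + 20 * Age (years).
def predicted_height_cm(age_years):
    return 45 + 20 * age_years

# The sample only contains 1 and 2 year old boys.
SAMPLE_AGE_RANGE = (1, 2)

def is_extrapolation(age_years, sample_range=SAMPLE_AGE_RANGE):
    """True when the age lies outside the ages observed in the sample."""
    lo, hi = sample_range
    return age_years < lo or age_years > hi

for age in (1, 2, 18):
    note = "extrapolation" if is_extrapolation(age) else "within sample"
    print(f"age {age}: {predicted_height_cm(age)} cm ({note})")
```

The line is only supported by data between ages 1 and 2, so the value for age 18 is arithmetic on the formula rather than a trustworthy prediction.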
- Latent variables: variables that are not directly observed but may be affecting the values of your regression. They are inferred (through a mathematical model) from other variables that are observed (directly measured).
There are 2 possible effects:
- A relation is suggested but it is false.
- A relation is hidden.
- Using means of the variables: averages are useful for detecting relations, but the relation will appear stronger than it really is, because averaging removes the variability between individuals. With individual data the relation will be weaker.
- Association and causality: Correlation implies association, but not causation. Conversely, causation implies association, but not necessarily (linear) correlation.
When an article says that causation was found, this means that the researchers found that changes in one variable they measured directly caused changes in the other.
When researchers find a correlation, which can also be called an association, what they are saying is that they found a relationship between two, or more, variables. Correlations can be positive - so that as one variable goes up, so does the other; or they can be negative, which would mean that as one variable goes up another goes down.
Association should not be confused with causality; if X causes Y, then the two are associated (dependent). However, associations can arise between variables in the presence (i.e., X causes Y) and absence (i.e., they have a common cause) of a causal relationship.
WEEK 7: DATASETS WITH TWO VARIABLES (II)
Relation between two categorical variables
1- Summarize the information in a frequency table that provides the joint distribution of the 2 variables.
It also provides the marginal distributions (the distribution of each variable alone). It can be useful to represent the marginal distributions with bar charts (even though they do not show relations between the 2 variables).
2- Calculate the conditional distributions (establish conditions and see the proportions in which they occur). The process is as follows:
1st- We establish a condition: we focus only on one type of individual (for example, children whose parents both smoke).
2nd- We do the same for the other categories (for example, families in which only one parent smokes or families in which neither parent smokes).
Summary of the example: 18.75% of the sample smokes.
- Group with two smoking parents: 22.56%
- Group with one smoking parent: 18.56%
- Group with no smoking parents: 14.9%
As there are differences, we can state that there is a relation between the 2 variables: children of smoking parents smoke more than children of non-smoking parents.
*It does not matter which variable we establish as the condition.
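A minimal sketch of how conditional and marginal distributions are obtained from a frequency table. The counts below are hypothetical (the notes only report percentages), so the resulting rates are close to, but not the same as, the ones in the example.

```python
# Hypothetical counts per group: (children who smoke, children who do not smoke).
table = {
    "both parents smoke": (45, 155),
    "one parent smokes": (37, 163),
    "neither parent smokes": (30, 170),
}

def conditional_smoking_rates(table):
    """Proportion of children who smoke, conditional on each parental group."""
    return {
        group: smokers / (smokers + non_smokers)
        for group, (smokers, non_smokers) in table.items()
    }

def marginal_smoking_rate(table):
    """Proportion of children who smoke in the whole sample."""
    smokers = sum(s for s, _ in table.values())
    total = sum(s + n for s, n in table.values())
    return smokers / total

rates = conditional_smoking_rates(table)
overall = marginal_smoking_rate(table)

for group, rate in rates.items():
    print(f"{group}: {rate:.2%}")
print(f"whole sample: {overall:.2%}")
```

If the conditional rates were all equal to the marginal rate, the table would give no evidence of a relation between the two variables.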
Latent variables
The presence of a latent variable can cause a relation or association between two observed categorical variables to change and even reverse.
Example: In a university, more men are accepted than women. From the calculations, we can state that there is a relation between sex and admission. However, there is a latent variable called “studies”, which takes two values: Physics and Chemistry. If we take this variable into account, the relation changes.
Simpson's paradox
We say that Simpson's paradox takes place when a relation or association that exists in all or some groups changes direction when the data are combined into a single group.
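The reversal can be reproduced with a small table. The admission counts below are hypothetical, chosen only so that the paradox appears; they are not the figures from the university example.

```python
# Hypothetical data: (admitted, applicants) by field of study and sex.
applications = {
    "Physics": {"men": (80, 100), "women": (18, 20)},
    "Chemistry": {"men": (4, 20), "women": (30, 100)},
}

def rate(admitted, applicants):
    return admitted / applicants

def overall_rate(sex):
    """Admission rate for one sex when both fields are combined."""
    admitted = sum(field[sex][0] for field in applications.values())
    applicants = sum(field[sex][1] for field in applications.values())
    return rate(admitted, applicants)

# Admission rates conditional on the field of study (the latent variable).
by_field = {
    name: {sex: rate(*counts) for sex, counts in field.items()}
    for name, field in applications.items()
}

print("combined:", {sex: overall_rate(sex) for sex in ("men", "women")})
print("by field:", by_field)
```

In these made-up numbers women have the higher admission rate in each field, yet men have the higher rate overall, because most men apply to the field with the higher admission rate.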
Summary: If the distributions of a variable X conditional on the different values of a variable Y are very different, we can say that there is a relation between the 2 variables.
In the opposite case, we don’t have evidence of a relation.
*In big samples, a very small difference indicates that there is no relation. In small samples it is difficult to say.
One numerical and one categorical variable
To analyse this kind of relation, we build and analyse:
- Graphical summaries of the numerical variable for each possible value of the categorical variable.
- Numerical summaries of the numerical variable for each possible value of the categorical variable.
Example: Money spent during the weekend depending on the sex.
When using OdStats, we will use 1Num1Cat.
We should look not only at the mean spending: the standard deviation must also be taken into account (a greater standard deviation for men indicates that women’s spending is more concentrated).
*It can be useful to present boxplots instead of the summary table in order to make it more understandable.
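A numerical summary by category can be sketched as follows. The spending figures are hypothetical, and this only mimics the kind of table described here; it is not the OdStats output.

```python
from statistics import mean, stdev

# Hypothetical weekend spending (in euros) by sex.
spending = {
    "men": [20, 55, 10, 80, 35],
    "women": [30, 40, 35, 45, 25],
}

# One row of summaries per value of the categorical variable.
summary = {
    group: {"n": len(values), "mean": mean(values), "sd": stdev(values)}
    for group, values in spending.items()
}

for group, stats in summary.items():
    print(f"{group}: n={stats['n']}, mean={stats['mean']:.1f}, sd={stats['sd']:.1f}")
```

In this made-up sample the means are similar, but the standard deviation for men is much larger, which is the kind of difference a comparison of means alone would miss.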
When a categorical variable can be ordered, we can also speak of the direction of the association between the numerical and the categorical variable (which can be positive or negative). Example of an ordinal categorical variable: level of studies.
Two numerical and one categorical variable
We analyse the relation between the two numerical variables separately for each value of the categorical variable.
Example: There are 100 employees. We want to analyse the relation between their gross annual salary, the years they have been working for the company and the department where they work.
- Relation between their gross salary and the years they have been working for the company.
- Is this relation different in every department?
We draw a scatterplot and look at the regression line, first without taking the categorical variable into account and then taking it into account with OdStats.
When the results of the regressions are very different for each value of the categorical variable, we can state that there is a relation with the categorical variable.
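The comparison of regression lines by department can be sketched like this. The eight employees, the department names and the helper functions are hypothetical; the real example has 100 employees and is analysed with OdStats.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical rows: (department, years in the company, gross salary in thousand euros).
employees = [
    ("Sales", 1, 22), ("Sales", 3, 26), ("Sales", 5, 30), ("Sales", 7, 34),
    ("IT", 1, 30), ("IT", 3, 31), ("IT", 5, 32), ("IT", 7, 33),
]

def fit_by_department(rows):
    """Fit one regression line per value of the categorical variable."""
    groups = {}
    for dept, years, salary in rows:
        xs, ys = groups.setdefault(dept, ([], []))
        xs.append(years)
        ys.append(salary)
    return {dept: least_squares(xs, ys) for dept, (xs, ys) in groups.items()}

# Regression ignoring the department, then one regression per department.
pooled = least_squares([r[1] for r in employees], [r[2] for r in employees])
fits = fit_by_department(employees)

print("all employees:", pooled)
for dept, (intercept, slope) in fits.items():
    print(f"{dept}: salary = {intercept:.2f} + {slope:.2f} * years")
```

Here the slopes differ clearly between the two made-up departments, so the pooled line hides a relation that depends on the categorical variable.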
WEEK 8: TIME SERIES
Time series
A time series is a collection of data on a variable, ordered chronologically. The series can have annual, quarterly, monthly, daily or even hourly or minute-by-minute periodicity, as in the stock market. Time series help us make statistical predictions.
Having historical information helps us make economic decisions.
- If we observe a more or less systematic behavior of a variable over time, it is logical to think that this behavior will continue in the future. This observation is the basis of statistical forecasting.
Components of a time series
- Trend (T): Long-term behavior of the series.
- Cycle (C): Mid-term behavior, generally associated with the economic cycle.
- Seasonality (S): Short-term behavior (generally within a year), repeated in every longer period. It is associated with the climate or social habits.
- Irregular (I): Short-term factor, one-off and unpredictable, not explained by the other components.
Classical time series theory assumes that every time series is the result of combining these components, in an additive, multiplicative or mixed form.
- Additive: $Y_t = T_t + C_t + S_t + I_t$
- Multiplicative: $Y_t = T_t \cdot C_t \cdot S_t \cdot I_t$ (it can be transformed into additive: $\ln(Y_t) = \ln(T_t) + \ln(C_t) + \ln(S_t) + \ln(I_t)$)
- Mixed: $Y_t = T_t \cdot C_t + S_t + I_t$ or $Y_t = T_t \cdot C_t \cdot S_t + I_t$, etc.
Example of time series.
*How to find the components of a series
This is important for two reasons:
1. To know which behavior causes the variations of a series (for example, if we are told that unemployment has decreased 0.7 points since August, how do we know whether this is an improvement in the trend or a seasonal effect?).
2. To predict the future behavior of the series.
There are two methods:
Function fitting: to find a linear trend, we fit a regression line, which is the function that describes the series.
If the time series looks non-linear, it can be transformed by taking the log of each value.
Moving average: isolates the trend and cycle components (both together).
This technique consists of calculating the mean of n consecutive periods of the original series; each new value of the moving average drops the oldest value and incorporates a new one. For example:

Year   Original value   Centered moving average of order 3   Centered moving average of order 5
2007   10               -                                    -
2008   11               (10+11+12)/3 = 11                    -
2009   12               (11+12+13)/3 = 12                    (10+11+12+13+14)/5 = 12
2010   13               (12+13+14)/3 = 13                    -
2011   14               -                                    -

The previous table shows how to calculate centered moving averages of odd order (each one corresponds to the central value of its window). For even orders, a more complex method is needed. With the same original values (10, 11, 12, 13, 14 for 2007 to 2011):
- Non-centered moving averages of order 4: (10+11+12+13)/4 = 11.5 and (11+12+13+14)/4 = 12.5
- Centered moving average of order 4 (year 2009): (11.5+12.5)/2 = 12
As we can see, we first calculate the non-centered moving averages and then obtain the centered moving average as the mean of two consecutive non-centered moving averages.
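The calculations above can be sketched in Python. The function name `centered_moving_average` is an illustrative choice; periods without a complete window are returned as `None`, matching the dashes in the example.

```python
def centered_moving_average(values, order):
    """Centered moving average; periods without a full window are None."""
    n = len(values)
    result = [None] * n
    half = order // 2
    if order % 2 == 1:
        # Odd order: the window is symmetric around period t.
        for t in range(half, n - half):
            result[t] = sum(values[t - half : t + half + 1]) / order
    else:
        # Even order: average two consecutive non-centered means,
        # one starting at t - half and one starting at t - half + 1.
        for t in range(half, n - half):
            first = sum(values[t - half : t + half]) / order
            second = sum(values[t - half + 1 : t + half + 1]) / order
            result[t] = (first + second) / 2
    return result

values = [10, 11, 12, 13, 14]  # original values for 2007 to 2011

print(centered_moving_average(values, 3))
print(centered_moving_average(values, 5))
print(centered_moving_average(values, 4))
```

The higher the order, the more periods are lost at both ends of the series, which is visible in the number of `None` entries.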
The advantage of the moving average is that it follows the actual trend of the data: it does not depend on a fitted line.
The disadvantage is that it cannot be used to predict future values.
Short-term components
If the series is additive, $Y_t = T_t + C_t + S_t + I_t$. Therefore: $S_t + I_t = Y_t - (T_t + C_t)$
And since the moving average ($MM_t$) captures the cycle and trend components, we obtain: $S_t + I_t = Y_t - MM_t$
If the series is multiplicative, either we apply a logarithmic transformation before starting, or we compute: $S_t \cdot I_t = Y_t / MM_t$
Example of a calculation of S+I:
We have an S+I series that mixes the Seasonality (S) and Irregular (I) components.
Is there any way of finding and isolating only the Seasonality (S) component? Yes, there is.
Seasonal components of the same period repeat.
For example: in a quarterly series, the spring component is always the same, year after year.
In a similar way, the summer components are all the same, etc.
Seasonal component calculation: We may be interested in knowing the effect of the seasonal component of a time series. As it repeats over time, we can isolate it with the following process:
1. We isolate the trend and cycle components by the moving average method.
2. We find the seasonal and irregular components together; in an additive series: S + I = Y - (T + C) = Original value - Moving average of order n
3. To eliminate the irregular component from this result (and thus obtain the seasonal component), we assume that, as the irregular component is random (we cannot predict it), its mean is 0 in an additive series and 1 in a multiplicative one. Averaging the S + I values of the same season over all the years therefore leaves the seasonal component.
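The process above can be sketched for an additive quarterly series. The series below is hypothetical: it is built from a linear trend plus a fixed seasonal effect and no irregular component, so the estimate can be checked against the known seasonal values.

```python
def centered_ma_order4(values):
    """Centered moving average of order 4 (None where the window is incomplete)."""
    n = len(values)
    out = [None] * n
    for t in range(2, n - 2):
        first = sum(values[t - 2 : t + 2]) / 4
        second = sum(values[t - 1 : t + 3]) / 4
        out[t] = (first + second) / 2
    return out

def seasonal_component(values):
    """Average of S + I = Y - moving average, computed separately for each quarter."""
    trend_cycle = centered_ma_order4(values)
    by_quarter = {q: [] for q in range(4)}
    for t, (y, mm) in enumerate(zip(values, trend_cycle)):
        if mm is not None:
            by_quarter[t % 4].append(y - mm)  # S + I for period t
    # Averaging over the years removes the irregular part (assumed mean 0).
    return [sum(v) / len(v) for v in by_quarter.values()]

# Hypothetical series: three years of quarterly data.
true_seasonal = [5, -3, -6, 4]
series = [100 + 2 * t + true_seasonal[t % 4] for t in range(12)]

estimated = seasonal_component(series)
print(estimated)
```

With real data the irregular component does not vanish exactly, so the averages only approximate the seasonal effect, and more years give a more stable estimate.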
Prediction: If we use a function to predict the trend component and we know the seasonal component, we can predict the future values of the series. For example, for an additive series: Y = (Trend function) + S
WEEK 9: MEASURES OF INEQUALITY AND CONCENTRATION
Inequality and concentration
Inequality and/or concentration indexes provide a summary measure of the degree of inequality or concentration.
Inequality and concentration are related concepts.
The Gini index
Example: Imagine an inheritance of 100 million € that is divided among 3 families in the following way:

Family   Inheritance per person (xi)   Family members (ni)   Family total inheritance (xi*ni)   Percentage of heirs   Percentage of inheritance
A        2                             2                     4                                  20%                   4%
B        1                             7                     7                                  70%                   7%
C        89                            1                     89                                 10%                   89%

We want to calculate a number that indicates the degree of inequality of the inheritance distribution.
This number has to lie on a scale in which the minimum corresponds to the situation of maximum equality (everyone receives the same inheritance), while the maximum corresponds to the situation of complete inequality.
Procedure:
1- Order the families, starting with the one in which the individual inheritance is the lowest.
2- We call P the cumulative percentage of population (heirs) and Q the cumulative percentage of wealth (inheritance).
3- We calculate the differences P-Q and add them up.
Family   Heirs percentage   Inheritance percentage   P     Q     P-Q
B        70                 7                        70    7     63
A        20                 4                        90    11    79
C        10                 89                       100   100   0

We compare this with the situation of maximum possible inequality:

Family   Heirs percentage   Inheritance percentage   P     Q     P-Q
B        70                 0                        70    0     70
A        20                 0                        90    0     90
C        10                 100                      100   100   0

We define the Lorenz-Gini index as the quotient between the sum of the observed P-Q differences and the sum of the P-Q differences in the maximum inequality situation. With these data: (63 + 79) / (70 + 90) = 142/160 ≈ 0.89.
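The procedure can be sketched in Python using the inheritance example. The function name `lorenz_gini` is illustrative, and the code follows the example in assuming that, under maximum inequality, the last group receives everything.

```python
def lorenz_gini(groups):
    """groups: list of (amount per person, number of people).

    Assumption taken from the example: in the maximum inequality case the
    last group receives everything, so Q = 0 and P - Q = P for the others.
    """
    ordered = sorted(groups)  # lowest individual amount first
    total_people = sum(n for _, n in ordered)
    total_wealth = sum(x * n for x, n in ordered)
    cum_people = cum_wealth = 0
    sum_observed = sum_maximum = 0.0
    for x, n in ordered[:-1]:  # for the last group P = Q = 100, so P - Q = 0
        cum_people += n
        cum_wealth += x * n
        p = 100 * cum_people / total_people
        q = 100 * cum_wealth / total_wealth
        sum_observed += p - q
        sum_maximum += p
    return sum_observed / sum_maximum

# Families A, B and C: (inheritance per person, family members).
families = [(2, 2), (1, 7), (89, 1)]
gini = lorenz_gini(families)
print(gini)
```

For these families the function adds 63 + 79 and divides by 70 + 90, the same differences that appear in the P-Q columns.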
Lorenz curve
The red line represents the actual inequality. The farther apart the red and blue lines are, the higher the inequality.
We observe that the inequality is very high.
The difference index
Another way to measure inequality is to compare the income (or any other characteristic) of every pair of individuals in the population.

Family (i)   Members (ni)   Wealth (xi)
B            7              1
A            2              2
C            1              89

Now we compare this value with the one that would result in the maximum inequality case.

Family (i)   Members (ni)   Wealth (xi)
B            7              0
A            2              0
C            1              100

It also ranges between 0 (maximum equality) and 1 (maximum inequality).
The indexes are relative measures (they are dimensionless) and are invariant to proportional changes in the analysed variable.
Concentration index
Here s_i is the market share of company i. The values of the index are between k/n and 1.
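The formula of this index did not survive in these notes. The sketch below assumes it is the usual k-firm concentration ratio, that is, the sum of the k largest market shares, which is consistent with the stated range from k/n to 1. The function name and the shares are hypothetical.

```python
def concentration_ratio(shares, k):
    """Sum of the k largest market shares (assumed definition; shares add up to 1)."""
    return sum(sorted(shares, reverse=True)[:k])

# Hypothetical market with 4 firms.
shares = [0.1, 0.4, 0.2, 0.3]
equal_shares = [0.25] * 4

print(concentration_ratio(shares, 2))        # two largest firms
print(concentration_ratio(equal_shares, 2))  # lower bound k/n when all firms are equal
```

When all n firms are equal the ratio reaches its minimum k/n, and when k firms hold the whole market it equals 1.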
Herfindahl concentration index
Here s_i is the market share of company i and n is the total number of companies.
The index takes a minimum value equal to 1/n and a maximum value equal to 1 (monopoly).
Properties:
- Unambiguous behaviour: given 2 markets, the H index can say which of the 2 markets is more concentrated.
- Scale invariance: the H index depends only on the market shares of the firms, not on the absolute size of the firms or the market.
- Transfer: the H measure increases when the market share of a small firm decreases in favour of a big firm.
- Monotonicity: if the n firms have identical market shares, the H measure has to be decreasing with respect to n.
- Cardinality: if we divide each firm into k equal firms, the H measure decreases in the same proportion (it is divided by k).
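Some of these properties can be checked numerically. The sketch assumes the standard definition of the Herfindahl index, the sum of the squared market shares, which matches the bounds 1/n and 1 given above. The market shares used are hypothetical.

```python
def herfindahl(shares):
    """Herfindahl index: sum of squared market shares (shares add up to 1)."""
    return sum(s ** 2 for s in shares)

# Bounds: n equal firms give 1/n, a monopoly gives 1.
h_equal = herfindahl([0.25] * 4)
h_monopoly = herfindahl([1.0])

# Transfer: a small firm loses 0.1 of market share to the biggest firm.
h_before = herfindahl([0.5, 0.3, 0.2])
h_after = herfindahl([0.6, 0.3, 0.1])

# Cardinality: each firm is split into k = 2 equal firms.
h_split = herfindahl([0.25, 0.25, 0.15, 0.15, 0.1, 0.1])

print(h_equal, h_monopoly)
print(h_before, h_after)
print(h_split, h_before / 2)
```

Monotonicity follows from the first check: with n equal firms the index is 1/n, which decreases as n grows.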