and Stata
Lecture 1. 3-9-2019
Data analysis = “The ability to take data – to be able to understand it, to process it, to extract value
from it, to visualize it, to communicate it – that's going to be a hugely important skill in the next decades,
not only at the professional level but even at the educational level for elementary school kids, for
high school kids, for college kids. Because now we really do have essentially free and ubiquitous
data.”
Hacking skills = being able to use a computer
Math & statistics knowledge = knowing the correct methods
Substantive expertise = knowledge of economic theories and of how people react and behave
Combine all three and you have ‘data science’
Linear Regression (OLS)
Research questions:
- By how much does the value of a variable change when the values of some other variables
change?
- Given the values of some variables, can I predict the value of another variable?
- Which of these variables really matters?
Fundamental principle:
- Description of the (linear) relationship between variables.
- Causality/economic model: Independent Variables (IV) → Dependent Variable (DV).
- Which independent variables influence the dependent variable?
- Regression tests correlation, not causality.
Example:
Countries vary in level of income inequality; How can this variation be explained?
Economic theory → Hypotheses:
- More developed countries are less unequal
- More educated countries are less unequal
- Agricultural countries are more unequal
Superficially all three hypotheses seem correct: Inequality is lower in more developed, more
educated and less agricultural countries
Regression analysis shows that only education is really important
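The gap between the superficial pattern and the regression result can be illustrated with a small simulation. The sketch below uses purely synthetic data, not the country data from the lecture. The variable names, coefficients and sample size are assumptions, chosen so that only education affects inequality while development and agriculture are merely correlated with education.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 5000                        # hypothetical number of observations

# Synthetic data (assumption): education drives inequality; development and
# agriculture are only correlated with education.
education = rng.normal(size=n)
development = 0.8 * education + 0.6 * rng.normal(size=n)
agriculture = -0.7 * education + 0.7 * rng.normal(size=n)
inequality = 1.0 - 2.0 * education + rng.normal(size=n)

# Pairwise correlations: all three variables look related to inequality.
corr = {
    name: float(np.corrcoef(values, inequality)[0, 1])
    for name, values in [("education", education),
                         ("development", development),
                         ("agriculture", agriculture)]
}

# Multiple regression (OLS) with all three variables at once.
X = np.column_stack([np.ones(n), education, development, agriculture])
coef, *_ = np.linalg.lstsq(X, inequality, rcond=None)
b_const, b_edu, b_dev, b_agr = coef

print("correlations:", {k: round(v, 2) for k, v in corr.items()})
print("OLS coefficients:", np.round(coef, 2))
```

In this made-up setting each variable correlates with inequality on its own, but once all three enter the regression together only the education coefficient stays clearly away from zero.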
Simple linear regression model (one variable)
Empirical relationship = the same form as the theoretical relationship. The idea is that we can never
know the theoretical relationship itself. What we can do is observe outcomes of x and y and use them
to estimate the betas as well as we can. These estimates are b1 and b2.
We don’t know 𝛽1 and 𝛽2. 𝛽1 is the intercept and 𝛽2 is the slope.
- Let’s assume that national income Y is a linear function of the population’s educational level
X.
- This linear function has the unknown parameters 𝛽1 and 𝛽2 (which we want to estimate).
- Assume further that we have a sample with four observations.
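How estimates could be obtained from such a four-observation sample is sketched below. It assumes the standard closed-form OLS formulas for one regressor: the slope is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the intercept is the mean of y minus the slope times the mean of x. The four data points are made up for illustration and do not come from the lecture.

```python
# Hypothetical sample with four observations (illustrative numbers only).
x = [1.0, 2.0, 3.0, 4.0]   # educational level
y = [3.0, 5.0, 6.0, 9.0]   # national income


def ols_simple(x, y):
    """Return (b1, b2): OLS intercept and slope for y = b1 + b2 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = s_xy / s_xx          # slope
    b1 = y_bar - b2 * x_bar   # intercept
    return b1, b2


b1, b2 = ols_simple(x, y)
print(f"b1 = {b1:.2f}, b2 = {b2:.2f}")  # b1 = 1.00, b2 = 1.90
```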
In a perfect world all points would be on the line. If we knew x, we would know y.
- If the relationship between x and y is exactly linear, all observations (Q) would be on a
straight line
- The value of y for a given x would be exactly determinable.
- In practice, economic relationships are rarely exactly linear
- They usually also depend on additional factors (e.g., randomness).
- The true values of y for a given x (datapoints P) are then different from the ones on the
straight line.
The points we actually observe are P1 through P4. They differ from the points Q on the line.
The differences between the P’s and the Q’s are the errors e.
Even if the theoretical relationship is true and linear, there will be errors in the data. And that
is fine.
- We extend the model to 𝑦𝑖 = 𝛽1 + 𝛽2𝑥𝑖 + 𝑒𝑖 to account for such deviations, with 𝑒𝑖 being
the error term.
- Every observed value 𝑦𝑖 has...
o a non-random component 𝛽1 + 𝛽2𝑥𝑖 , the value predicted by the theoretical
relationship
o a random component 𝑒𝑖 .
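The split of each observation into a non-random and a random part can be made concrete with a short simulation. The parameter values, the x values and the normally distributed error are assumptions for illustration only. In real data 𝛽1, 𝛽2 and the errors are never observed.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

beta1, beta2 = 2.0, 0.5          # hypothetical "true" parameters
x = [1.0, 2.0, 3.0, 4.0]         # four observations

# Non-random component: the points Q on the theoretical line.
q = [beta1 + beta2 * x_i for x_i in x]

# Random component: error terms with expected value 0 (assumed normal).
e = [random.gauss(0.0, 1.0) for _ in x]

# Observed values: the data points P.
p = [q_i + e_i for q_i, e_i in zip(q, e)]

for x_i, q_i, e_i, p_i in zip(x, q, e, p):
    print(f"x={x_i:.0f}  Q={q_i:.2f}  e={e_i:+.2f}  P={p_i:.2f}")
```

Each printed row shows one observation: Q is what the line predicts, e is the deviation, and P = Q + e is what would be observed.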
Model Assumptions
1. All variables must be measured at interval level (continuous scale) and measured without error
2. For each value of the independent variables, E(𝑒) = 0 (i.e., the mean value of the error term is 0).
In the theoretical relationship, the expected value of the error term must be zero. That does not
mean that the errors actually observed in a sample have to sum to zero.
3. Homoscedasticity (versus heteroscedasticity). The variance of 𝑒𝑖 is equal for any value of X:
var(𝑒𝑖 | 𝑥𝑖) = σ² = constant. The variance of the error term, that is, how much the P’s actually deviate from the Q’s, this is