Book: ISLRv2_website.pdf (su.domains)
Chapter 1: Introduction
Supervised learning = building a statistical model for predicting an output based on one or more
inputs
Regression = predicting a continuous or quantitative output (e.g. price)
Classification = predicting a qualitative output (e.g. gender, up/down)
Unsupervised learning = there is no output variable to supervise the analysis of the inputs
- No outcome variable, just a set of predictors/features measured on a set of samples.
- Objective is more fuzzy: find groups of samples that behave similarly, find features that
behave similarly, find linear combinations of features with the most variation, . . .
- It’s difficult to know how well you are doing.
- Different from supervised learning, but can be useful as a pre-processing step for supervised
learning.
- we lack a response variable that can supervise our analysis
Clustering = grouping individuals according to observed characteristics: here we are not trying to predict an output variable
Association = determining rules that describe large portions of a dataset
ISL (= Introduction to Statistical Learning) is based on 4 premises
- Many statistical learning methods are relevant and useful in a wide range of academic and
non-academic disciplines, beyond just the statistical sciences
- Statistical learning should not be viewed as a series of black boxes : no single approach will
perform well in all possible applications
- While it is important to know what job is performed by each cog, it is not necessary to have
the skills to construct the machine inside the box
- We presume that the reader is interested in applying statistical learning methods to real-
world problems
Chapter 2: Statistical learning
= set of tools for making sense of complex datasets
X = input/predictor/independent variable
Y = output/response/dependent variable
The model is Y = f(X) + ε, where f represents the systematic information that X provides about Y → statistical learning refers to a set of approaches for estimating f
- ε is a random error term, which is independent of X, has mean zero, and captures measurement errors
Why estimate f?
- prediction
- inference
1. prediction
Ŷ = f̂(X) → the error term averages to zero
- f̂ = estimate for f
- Ŷ = resulting prediction for Y → f̂ is often treated as a black box = one is not typically concerned with the exact form of f̂, provided that it yields accurate predictions for Y.
Ideal predictor of Y in terms of mean-squared prediction error: f(x) = E(Y | X = x) is the function that minimizes E[(Y − g(X))² | X = x] over all functions g(·) at all points X = x
The accuracy of 𝑌" as a prediction for Y depends on 2 quantities
- reducible error = we can potentially improve the accuracy of 𝑓$ by using the most appropriate
statistical learning technique to estimate f
- irreducible error = no matter how well we estimate f, we cannot reduce the error introduced by ε (because Y is also a function of ε, which cannot be predicted using X)
o The quantity ε may contain unmeasured variables that are useful in predicting Y: if they are not measured, or are unmeasurable, they cannot be used in the prediction
o Expected value: E[(Y − Ŷ)²] = E[(f(X) + ε − f̂(X))²] = [f(X) − f̂(X)]² + Var(ε), where [f(X) − f̂(X)]² is the reducible error and Var(ε) is the irreducible error
o Goal: minimize the reducible error
→ the irreducible error will always provide an upper bound on the accuracy of our prediction for Y
Proof: decompose the expected squared error; the cross term 2(f(X) − f̂(X))·E[ε] drops out because the expected value of ε is 0, leaving the two terms above (the 2nd term, Var(ε), is the irreducible error). A small simulation illustrating this follows below.
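A minimal Python sketch (my own, not from the book) of this decomposition, assuming a made-up true f(x) = 2x + 1 and Var(ε) = 1: even predicting with the true f leaves an error of about Var(ε), while an imperfect f̂ adds reducible error on top.

```python
import numpy as np

# Simulate Y = f(X) + eps with a known f (assumption: f(x) = 2x + 1, Var(eps) = 1)
rng = np.random.default_rng(0)
n = 100_000
X = rng.uniform(-2, 2, size=n)
eps = rng.normal(0, 1, size=n)          # irreducible noise: E[eps] = 0, Var(eps) = 1
f = lambda x: 2 * x + 1                 # the true (normally unknown) f
Y = f(X) + eps

f_hat = lambda x: 1.5 * x + 1.2         # hypothetical imperfect estimate of f

mse_perfect = np.mean((Y - f(X)) ** 2)      # ~ Var(eps): irreducible error only
mse_crude = np.mean((Y - f_hat(X)) ** 2)    # irreducible + reducible error

print(f"MSE with true f      : {mse_perfect:.3f}  (lower bound ~ Var(eps) = 1)")
print(f"MSE with crude f_hat : {mse_crude:.3f}  (extra part is the reducible error)")
```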
2. Inference
Understand the relationship between X and Y: in this situation we wish to estimate f, but our goal is not necessarily to make predictions for Y → f̂ cannot be treated as a black box: we need to know its exact form (a small sketch follows the list below)
- which predictors are associated with the response variable
o identifying the important predictors
- what is the relationship between the predictor and the response
o positive or negative relationship
- what type of model best explains the relationship?
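A hedged sketch of these inference questions using statsmodels on simulated data (the data and predictors are made up for illustration): the fitted coefficients and p-values indicate which predictors are associated with the response and whether each relationship is positive or negative.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: 3 predictors, only the first two actually affect the response
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Fit a linear model and inspect it: coefficient signs and p-values answer
# "which predictors matter?" and "is the relationship positive or negative?"
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())
```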
How do we estimate f?
Models of estimating f
- parametric
- non-parametric
training data = n different data points/observations that we use to fit our model
⇒ goal = apply a statistical learning method to the training data in order to estimate the unknown function f
parametric
reduces the problem of estimating f down to one of estimating a set of parameters because it assumes a form for f => it simplifies the problem
1. Make an assumption about the functional form of f (e.g. linear: p+1 parameters)
2. After selecting a model, use training data to fit or train the model (e.g. least squares)
parametric and structured models: the linear model is important:
- f(X) = β0 + β1X1 + β2X2 + ... + βpXp, specified in terms of p+1 parameters: {β0, β1, β2, ... , βp}
- estimate the parameters by fitting the model to training data
- almost never correct, but serves as a good and interpretable approximation to the unknown true function → good for inference
disadvantages: the model we choose will usually not match the true unknown form of f
→ choosing more flexible models means estimating a greater number of parameters
→ potential to inaccurately estimate f if the assumed form of f is wrong
→ more complex models → overfitting: they follow the errors, or noise, too closely
advantages: more interpretable (easier to explain the results); a least-squares fitting sketch follows below
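A minimal numpy sketch (my own, on simulated data) of the two-step parametric approach: assume the linear form above, then estimate {β0, β1, β2} by least squares.

```python
import numpy as np

# Step 1: assume a functional form, f(X) = b0 + b1*X1 + ... + bp*Xp  (p + 1 parameters)
# Step 2: fit/train the assumed model on training data by least squares
rng = np.random.default_rng(2)
n, p = 150, 2
X = rng.normal(size=(n, p))                                   # hypothetical training inputs
y = 1.0 + 0.8 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.2, size=n)

design = np.column_stack([np.ones(n), X])                     # add the intercept column
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
print("estimated parameters {b0, b1, b2}:", beta_hat.round(3))
```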
non-parametric
does not make an explicit assumption about the functional form of f → attempts to get as close to the data points as possible, without being too rough or too smooth
advantage: has the potential to fit a wider range of possible shapes of f
disadvantage: does not reduce the problem to estimating a few parameters, so a larger number of observations is needed for an accurate estimate of f
non-parametric model:
thin-plate spline: technique that does not impose any pre-specified model on f. It instead attempts
to produce an estimate for f that is as close as possible to the observed data
- importance of the chosen level of smoothness (a sketch follows below)
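A hedged sketch of a smoothed thin-plate spline fit, using scipy's RBFInterpolator with a thin-plate-spline kernel on simulated data; the data and the smoothing value are arbitrary illustrations, not the book's example.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Non-parametric fit: no functional form is assumed for f.
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(200, 2))                         # two predictors
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# smoothing > 0 trades closeness to the data for smoothness (0 would interpolate exactly)
spline = RBFInterpolator(X, y, kernel="thin_plate_spline", smoothing=0.1)

X_new = rng.uniform(-1, 1, size=(5, 2))                       # new points to predict at
print(spline(X_new))                                          # estimated f at the new points
```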
Trade-offs
Restrictive > flexible
- for inference: more interpretable
Flexible > restrictive
- predictions: interpretability not of interest
- wider range of possible shapes
Prediction accuracy vs interpretability
- lin models are easy to interpret
- thin-plate splines are not
Good fit vs over-fit or under-fit
Parsimony vs black-box
- prefer a simpler model involving fewer variables over a black-box predictor involving them all, if they give the same result
The more performant a method is → the less interpretable it becomes
Supervised vs unsupervised learning
We can seek to understand the relationships between the variables or between the observations
- using cluster analysis or clustering: look whether observations fall into distinct groups
- sometimes difficult, as observations can’t easily be put into groups because the groups overlap
Regression vs classification problems
regression problems: with a quantitative response
- use of least squares
- use of K-nearest neighbors
classification problems: with a qualitative response
- use of logistic regression: binary response
- use of K-nearest neighbors (a small KNN sketch follows below)
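A minimal scikit-learn sketch (my own, on simulated data) of K-nearest neighbors used for both problem types; the choice K = 5 is arbitrary.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 2))

# Classification: qualitative response (two classes), predicted by a majority vote of the K neighbours
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y_class)
print(clf.predict([[0.2, -0.1]]))        # predicted class for one new observation

# Regression: quantitative response, predicted by averaging the K neighbours' responses
y_reg = X[:, 0] ** 2 + rng.normal(scale=0.1, size=300)
reg = KNeighborsRegressor(n_neighbors=5).fit(X, y_reg)
print(reg.predict([[0.2, -0.1]]))        # predicted value for the same observation
```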
Assessing model accuracy
No best method works for every data set → selecting the best approach is therefore very important
Measuring the Quality of Fit
Mean squared error, MSE = (1/n) Σ (yi − f̂(xi))², measures how close the predicted value for a given observation is to the true response value for that observation → does it match the observed data?
MSE = small if the predicted responses are very close to the true responses
MSE = large if for some observations the predicted and true responses differ substantially
- we are interested in the accuracy of the predictions that we obtain when we apply our
method to previously unseen test data à not in the training data
In other words, if we had a large number of test observations (x0, y0), we could compute the average squared prediction error Ave(y0 − f̂(x0))² for these test observations.
- Select the model for which this is as small as possible
- Fundamental problem: there is no guarantee that the method with the lowest training MSE will also have the lowest test MSE (see the sketch below)
o Test MSE is often much larger than training MSE
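A minimal numpy sketch (my own, with made-up data) of this problem: as polynomial flexibility grows, training MSE keeps dropping while test MSE can rise well above it.

```python
import numpy as np

# Training MSE vs test MSE for polynomial fits of increasing flexibility
# (assumption: the true f is a cubic; the degrees and sample sizes are arbitrary)
rng = np.random.default_rng(5)

def make_data(n):
    x = rng.uniform(-2, 2, size=n)
    y = x ** 3 - 2 * x + rng.normal(scale=1.0, size=n)     # Y = f(X) + eps
    return x, y

x_train, y_train = make_data(50)
x_test, y_test = make_data(10_000)                          # previously unseen test data

for degree in (1, 3, 10):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    print(f"degree {degree:2d}: training MSE = {train_mse:.2f}, test MSE = {test_mse:.2f}")
```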