Book: ISLRv2_website.pdf (su.domains)
Chapter 1: Introduction
Supervised learning = building a statistical model for predicting an output based on one or more
inputs
Regression = predicting a continuous or quantitative output (e.g. price)
Classification = predicting a qualitative output (e.g. gender, up/down)
Unsupervised learning = there is no output variable to supervise the analysis of the inputs
- No outcome variable, just a set of predictors/features measured on a set of samples.
- Objective is more fuzzy: find groups of samples that behave similarly, find features that
behave similarly, find linear combinations of features with the most variation, . . .
- It’s difficult to know how well you are doing.
- Different from supervised learning, but can be useful as a pre-processing step for supervised
learning.
- we lack a response variable that can supervise our analysis
Clustering = grouping individuals according to observed characteristics: here we are not trying to predict an output variable
Association = determining rules that describe large portions of a dataset
ISL (= Introduction to Statistical Learning) is based on 4 premises
- Many statistical learning methods are relevant and useful in a wide range of academic and
non-academic disciplines, beyond just the statistical sciences
- Statistical learning should not be viewed as a series of black boxes : no single approach will
perform well in all possible applications
- While it is important to know what job is performed by each cog, it is not necessary to have
the skills to construct the machine inside the box
- We presume that the reader is interested in applying statistical learning methods to real-
world problems
Chapter 2: Statistical learning
= set of tools for making sense of complex datasets
X = input/predictor/independent variable
Y = output/response/dependent variable
The model is Y = f(X) + ε, where f represents the systematic information that X provides about Y → statistical learning refers to a set of approaches for estimating f
- ε is a random error term, which is independent of X, has mean zero, and captures measurement errors
Why estimate f?
- prediction
- inference
1. prediction
Ŷ = f̂(X) → the error term averages to zero
- f̂ = estimate for f
- Ŷ = resulting prediction for Y → f̂ is often treated as a black box = one is not typically concerned with the exact form of f̂, provided that it yields accurate predictions for Y.
Ideal predictor of Y in terms of mean-squared prediction error: f(x) = E(Y | X = x) is the function that minimizes E[(Y − g(X))² | X = x] over all functions g(·) at all points X = x
The accuracy of 𝑌" as a prediction for Y depends on 2 quantities
- reducible error = we can potentially improve the accuracy of 𝑓$ by using the most appropriate
statistical learning technique to estimate f
- irreducible error = no matter how well we estimate f, we cannot reduce the error introduced by ε (because Y is also a function of ε, which cannot be predicted using X)
o The quantity ε may contain unmeasured variables that are useful in predicting Y: if they are not measured, or are unmeasurable, they cannot be used in the prediction
o Expected value: E[(Y − Ŷ)²] = E[(f(X) + ε − f̂(X))²] = [f(X) − f̂(X)]² + Var(ε), where [f(X) − f̂(X)]² is the reducible error and Var(ε) is the irreducible error
o Goal: minimize the reducible error
→ the irreducible error will always provide an upper bound on the accuracy of our prediction for Y
Proof: decompose the expected squared error; the cross term 2(f(X) − f̂(X))·E[ε] drops out because the expected value of ε is 0, leaving the two terms above (the 2nd term, Var(ε), is the irreducible error). A small simulation illustrating this follows below.
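A minimal Python sketch (my own, not from the book) of this decomposition, assuming a made-up true f(x) = 2x + 1 and Var(ε) = 1: even predicting with the true f leaves an error of about Var(ε), while an imperfect f̂ adds reducible error on top.

```python
import numpy as np

# Simulate Y = f(X) + eps with a known f (assumption: f(x) = 2x + 1, Var(eps) = 1)
rng = np.random.default_rng(0)
n = 100_000
X = rng.uniform(-2, 2, size=n)
eps = rng.normal(0, 1, size=n)          # irreducible noise: E[eps] = 0, Var(eps) = 1
f = lambda x: 2 * x + 1                 # the true (normally unknown) f
Y = f(X) + eps

f_hat = lambda x: 1.5 * x + 1.2         # hypothetical imperfect estimate of f

mse_perfect = np.mean((Y - f(X)) ** 2)      # ~ Var(eps): irreducible error only
mse_crude = np.mean((Y - f_hat(X)) ** 2)    # irreducible + reducible error

print(f"MSE with true f      : {mse_perfect:.3f}  (lower bound ~ Var(eps) = 1)")
print(f"MSE with crude f_hat : {mse_crude:.3f}  (extra part is the reducible error)")
```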
2. Inference
Understand the relationship between X and Y: in this situation we wish to estimate f, but our goal is not necessarily to make predictions for Y → f̂ cannot be treated as a black box: we need to know its exact form (a small sketch follows the list below)
- which predictors are associated with the response variable
o identifying the important predictors
- what is the relationship between the predictor and the response
o positive or negative relationship
- what type of model best explains the relationship?
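A hedged sketch of these inference questions using statsmodels on simulated data (the data and predictors are made up for illustration): the fitted coefficients and p-values indicate which predictors are associated with the response and whether each relationship is positive or negative.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: 3 predictors, only the first two actually affect the response
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Fit a linear model and inspect it: coefficient signs and p-values answer
# "which predictors matter?" and "is the relationship positive or negative?"
model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())
```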
How do we estimate f?
Models of estimating f
- parametric
- non-parametric
training data = n different data points/observations that we use to fit our model
⇒ goal = apply a statistical learning method to the training data in order to estimate the unknown function f
parametric
reduces the problem of estimating f down to one of estimating a set of parameters because it assumes a form for f => it simplifies the problem
1. Make an assumption about the functional form of f (e.g. linear: p+1 parameters)
2. After selecting a model, use training data to fit or train the model (e.g. least squares)
parametric and structured models: the linear model is important:
- f(X) = β0 + β1X1 + β2X2 + ... + βpXp, specified in terms of p+1 parameters: {β0, β1, β2, ... , βp}
- estimate the parameters by fitting the model to training data
- almost never correct, but serves as a good and interpretable approximation to the unknown true function → good for inference
disadvantages: the model we choose will usually not match the true unknown form of f
→ choosing more flexible models means estimating a greater number of parameters
→ potential to inaccurately estimate f if the assumed form of f is wrong
→ more complex models → overfitting: they follow the errors, or noise, too closely
advantages: more interpretable (easier to explain the results); a least-squares fitting sketch follows below
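A minimal numpy sketch (my own, on simulated data) of the two-step parametric approach: assume the linear form above, then estimate {β0, β1, β2} by least squares.

```python
import numpy as np

# Step 1: assume a functional form, f(X) = b0 + b1*X1 + ... + bp*Xp  (p + 1 parameters)
# Step 2: fit/train the assumed model on training data by least squares
rng = np.random.default_rng(2)
n, p = 150, 2
X = rng.normal(size=(n, p))                                   # hypothetical training inputs
y = 1.0 + 0.8 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.2, size=n)

design = np.column_stack([np.ones(n), X])                     # add the intercept column
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
print("estimated parameters {b0, b1, b2}:", beta_hat.round(3))
```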
non-parametric
does not make an explicit assumption about the functional form of f → attempts to get as close to the data points as possible, without being too rough or too smooth
advantage: has the potential to fit a wider range of possible shapes of f
disadvantage: does not reduce the problem to estimating a few parameters, so a larger number of observations is needed for an accurate estimate of f
non-parametric model:
thin-plate spline: technique that does not impose any pre-specified model on f. It instead attempts
to produce an estimate for f that is as close as possible to the observed data
- importance of the chosen level of smoothness (a sketch follows below)
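A hedged sketch of a smoothed thin-plate spline fit, using scipy's RBFInterpolator with a thin-plate-spline kernel on simulated data; the data and the smoothing value are arbitrary illustrations, not the book's example.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Non-parametric fit: no functional form is assumed for f.
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(200, 2))                         # two predictors
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# smoothing > 0 trades closeness to the data for smoothness (0 would interpolate exactly)
spline = RBFInterpolator(X, y, kernel="thin_plate_spline", smoothing=0.1)

X_new = rng.uniform(-1, 1, size=(5, 2))                       # new points to predict at
print(spline(X_new))                                          # estimated f at the new points
```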
Trade-offs
Restrictive > flexible
- for inference: more interpretable
Flexible > restrictive
- predictions: interpretability not of interest
- wider range of possible shapes
Prediction accuracy vs interpretability
- lin models are easy to interpret
- thin-plate splines are not
Good fit vs over-fit or under-fit
Parsimony vs black-box
- prefer a simpler model involving fewer variables over a black-box predictor involving them all, if they give the same result
The more performant a method is → the less interpretable it becomes
Supervised vs unsupervised learning
We can seek to understand the relationships between the variables or between the observations
- using cluster analysis or clustering: look whether observations fall into distinct groups
- sometimes difficult, as observations can’t easily be put into groups because the groups overlap
Regression vs classification problems
regression problems: with a quantitative response
- use of least squares
- use of K-nearest neighbors
classification problems: with a qualitative response
- use of logistic regression: binary response
- use of K-nearest neighbors (a small KNN sketch follows below)
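A minimal scikit-learn sketch (my own, on simulated data) of K-nearest neighbors used for both problem types; the choice K = 5 is arbitrary.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 2))

# Classification: qualitative response (two classes), predicted by a majority vote of the K neighbours
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y_class)
print(clf.predict([[0.2, -0.1]]))        # predicted class for one new observation

# Regression: quantitative response, predicted by averaging the K neighbours' responses
y_reg = X[:, 0] ** 2 + rng.normal(scale=0.1, size=300)
reg = KNeighborsRegressor(n_neighbors=5).fit(X, y_reg)
print(reg.predict([[0.2, -0.1]]))        # predicted value for the same observation
```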
Assessing model accuracy
No best method works for every data set → selecting the best approach is therefore very important
Measuring the Quality of Fit
Mean squared error, MSE = (1/n) Σ (yi − f̂(xi))², measures how close the predicted value for a given observation is to the true response value for that observation → does it match the observed data?
MSE = small if the predicted responses are very close to the true responses
MSE = large if for some observations the predicted and true responses differ substantially
- we are interested in the accuracy of the predictions that we obtain when we apply our
method to previously unseen test data à not in the training data
In other words, if we had a large number of test observations (x0, y0), we could compute the average squared prediction error Ave(y0 − f̂(x0))² for these test observations.
- Select the model for which this is as small as possible
- Fundamental problem: there is no guarantee that the method with the lowest training MSE will also have the lowest test MSE (see the sketch below)
o Test MSE is often much larger than training MSE
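A minimal numpy sketch (my own, with made-up data) of this problem: as polynomial flexibility grows, training MSE keeps dropping while test MSE can rise well above it.

```python
import numpy as np

# Training MSE vs test MSE for polynomial fits of increasing flexibility
# (assumption: the true f is a cubic; the degrees and sample sizes are arbitrary)
rng = np.random.default_rng(5)

def make_data(n):
    x = rng.uniform(-2, 2, size=n)
    y = x ** 3 - 2 * x + rng.normal(scale=1.0, size=n)     # Y = f(X) + eps
    return x, y

x_train, y_train = make_data(50)
x_test, y_test = make_data(10_000)                          # previously unseen test data

for degree in (1, 3, 10):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    print(f"degree {degree:2d}: training MSE = {train_mse:.2f}, test MSE = {test_mse:.2f}")
```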