
Data mining summary ch1-3

Type: Summary
Pages: 45
Published: 28-04-2025
Written in: 2023/2024














Data Mining
Chapter 1

Statistical learning = tools for understanding data

• Supervised: building a statistical model for predicting or estimating an output based on one or more inputs
• Unsupervised: inputs without an output; learn relationships and structure from such data


Types of data

• Wage data
o Examine a number of factors that relate to wages for a group of men
o Understand the association between an employee's age, education and the calendar year, and his wage
o Predicting a continuous or quantitative output value → regression problem

• Stock market data
o Predicting a non-numerical value, i.e. a categorical or qualitative output → classification problem
o Goal is to predict whether the index will increase or decrease on a given day, using the past 5 days' percentage changes in the index

• Gene expression data
o Only the input variables are observed, with no corresponding output
o Clustering problem = understand which observations are similar to each other by grouping them according to their observed characteristics (e.g. finding which types of customers are similar)
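
The stock market setup described above (predict the direction of the index from the past 5 days' percentage changes) can be sketched as a small data-preparation step. This is only an illustration: the function name `make_lagged_dataset`, the labels "Up"/"Down" and the numbers in `changes` are assumptions invented here, not taken from the course material.

```python
def make_lagged_dataset(pct_changes, n_lags=5):
    """Turn a series of daily percentage changes into (inputs, label) pairs.

    Each input is the list of the previous `n_lags` percentage changes;
    the label is "Up" if the index rose on that day and "Down" otherwise.
    """
    rows = []
    for t in range(n_lags, len(pct_changes)):
        lags = pct_changes[t - n_lags:t]               # the past n_lags days
        label = "Up" if pct_changes[t] > 0 else "Down"  # qualitative output
        rows.append((lags, label))
    return rows


# Hypothetical daily percentage changes, invented for illustration only.
changes = [0.4, -1.2, 0.3, 0.8, -0.5, 1.1, -0.2, 0.6]
dataset = make_lagged_dataset(changes)

for lags, label in dataset:
    print(lags, "->", label)
```

Because the label is a category rather than a number, any model trained on `dataset` would be solving a classification problem, not a regression problem.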

Chapter 2

2.1 What is statistical learning?

Example: the goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets (TV, radio and newspaper)

Input variables = X (X1, X2, X3, …)
Different names:
- Predictors
- Independent variables
- Features
- Variables

Output variable = Y
Different names:
- Response
- Dependent variable

Relationship between X and Y in its general form:

$$Y = f(X) + \epsilon$$

• f: some fixed but unknown function of X1, …, Xp
o may involve more than one input variable
• ε: random error term
o independent of X
o has mean zero



Example:

Income is a simulated data set, so f is known; it is the blue curve in the right-hand panel.

The vertical lines represent the error terms ε: some observations lie above the blue curve and some below it.

Overall, the errors have approximately mean zero.
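
A minimal simulation can make the idea of a known f plus a mean-zero error concrete. The function `f`, the input range and the noise level below are assumptions chosen for illustration; they are not the actual Income data.

```python
import random
import statistics


def f(x):
    # Hypothetical "true" function, chosen only for this illustration.
    return 20 + 3 * x


rng = random.Random(0)                          # fixed seed for repeatability
xs = [rng.uniform(10, 22) for _ in range(5000)]
eps = [rng.gauss(0, 5) for _ in xs]             # errors: mean zero, independent of x
ys = [f(x) + e for x, e in zip(xs, eps)]        # Y = f(X) + error

# Vertical distance of each observation from the true curve.
residuals = [y - f(x) for x, y in zip(xs, ys)]
mean_error = statistics.fmean(residuals)
share_above = sum(r > 0 for r in residuals) / len(residuals)

print(f"average error: {mean_error:.3f}")
print(f"share of points above the curve: {share_above:.3f}")
```

In a simulation like this, roughly half of the points should fall above the curve and half below it, and the average error should be close to zero.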



2.1.1 Why estimate f?

There are 2 main reasons that we may wish to estimate f: (1) prediction and (2) inference

(1) Prediction
A set of inputs X is readily available, but the output Y cannot be easily obtained.

We predict Y using

$$\hat{Y} = \hat{f}(X)$$

• f̂: the estimate for f → treated as a black box (we are not concerned with the exact form of f̂, provided that it yields accurate predictions for Y)

• Ŷ: the resulting prediction for Y
o Its accuracy depends on two quantities:
- Reducible error = f̂ is not a perfect estimate of f; we can improve the accuracy of f̂ by using the most appropriate statistical learning technique to estimate f
→ BUT there will still be some error in the prediction
- Irreducible error = Y is also a function of ε, which cannot be predicted using X. No matter how well we estimate f, we cannot reduce the error introduced by ε.
It is larger than zero because ε may contain unmeasured variables that are useful in predicting Y. They are unmeasured, so f cannot use them for its prediction.

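Assume for a moment that both f̂ and X are fixed, so that the only variability comes from ε. Then:

$$E(Y - \hat{Y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2 = \underbrace{[f(X) - \hat{f}(X)]^2}_{\text{reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{irreducible}}$$

• E(Y − Ŷ)²: average or expected value of the squared difference between the predicted and actual value of Y

• Var(ε): variance associated with the error term ε

→ Focus of this book: minimize the reducible error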
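
The split of the expected squared error into a reducible and an irreducible part can be checked numerically. The sketch below fixes one input value and one imperfect estimate, then averages the squared prediction error over many draws of the error term. The functions `f` and `f_hat`, the point `x0` and the noise level `sigma` are all hypothetical choices made for this illustration.

```python
import random
import statistics


def f(x):
    # Hypothetical true function.
    return 2.0 * x


def f_hat(x):
    # Hypothetical, deliberately imperfect estimate of f.
    return 1.5 * x + 1.0


x0 = 4.0        # X is held fixed
sigma = 2.0     # standard deviation of the error term, so Var(eps) = 4
rng = random.Random(1)

squared_errors = []
for _ in range(100_000):
    y = f(x0) + rng.gauss(0, sigma)             # only eps varies between draws
    squared_errors.append((y - f_hat(x0)) ** 2)

expected_sq_error = statistics.fmean(squared_errors)
reducible = (f(x0) - f_hat(x0)) ** 2            # [f(X) - f_hat(X)]^2
irreducible = sigma ** 2                        # Var(eps)

print(f"simulated E(Y - Y_hat)^2 : {expected_sq_error:.3f}")
print(f"reducible + irreducible  : {reducible + irreducible:.3f}")
```

A better `f_hat` would shrink the reducible term towards zero, but the simulated error could never fall below `irreducible`.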

(2) Inference

Interested in the association between Y and X1, …, Xp
Answering the following questions:
1. Which predictors are associated with the response?
2. What is the relationship between the response and each predictor?
3. Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?

Some modeling can be conducted both for prediction and for inference (e.g. real estate setting: some are interested in predicting the value of a home from inputs such as the crime rate, others in the association between the price of a house and a view of the river).



2.1.2 How to estimate f?

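Training data = the observations we use to train/teach our method how to estimate f.

It consists of {(x1, y1), (x2, y2), …, (xn, yn)}, where xi = (xi1, xi2, …, xip)^T

xij = value of the jth predictor/input for observation i (i = 1, 2, …, n and j = 1, 2, …, p)
yi = response for observation i

Our goal is to apply a statistical learning method to the training data in order to estimate the unknown function f. In other words, we want to find an f̂ such that Y ≈ f̂(X) for any observation (X, Y).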

(1) Parametric methods
Two steps:
1) Make an assumption about the functional form or shape of f.
Ex. Linear model:

$$f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$

Instead of having to estimate an entirely arbitrary p-dimensional function f(X), we only need to estimate the p + 1 coefficients β0, β1, …, βp.
2) After a model has been selected, we need a procedure that uses the training data to fit or train the model.

Ex. Linear model: find values of the parameters such that

$$Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$

The most common approach to fitting this model is (ordinary) least squares.

It reduces the problem of estimating f down to one of estimating a set of parameters.
Disadvantage: the model we choose will usually not match the true unknown form of f.
→ We can try to address this by choosing a more flexible model, BUT this requires estimating more parameters.
→ These more complex models can lead to overfitting the data (they follow the errors, or noise, too closely)
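
The two parametric steps can be sketched for the simplest case of a single predictor: assume a linear form, then use the training data to estimate its coefficients by least squares. The helper `fit_simple_linear` and the simulated training data are assumptions made for illustration; the closed-form expressions used are the standard least squares estimates for one predictor.

```python
import random


def fit_simple_linear(xs, ys):
    """Least squares estimates (beta0, beta1) for the model y ≈ beta0 + beta1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx                 # slope
    beta0 = y_bar - beta1 * x_bar     # intercept
    return beta0, beta1


# Step 1: assume f is linear. Step 2: fit the two coefficients on training data.
# The training data are simulated from a hypothetical f(x) = 1 + 2x plus noise.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(500)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

beta0_hat, beta1_hat = fit_simple_linear(xs, ys)
print(f"estimated intercept: {beta0_hat:.2f}, estimated slope: {beta1_hat:.2f}")
```

Whatever the data look like, this approach only ever estimates two numbers. That is what makes it simple, and also why it fits poorly when the true f is far from linear.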




[Figures: the true function f; a linear model fit by least squares]


(2) Non-parametric methods
They seek an estimate of f that gets as close to the data points as possible without being too rough or wiggly. No shape is assumed for f, so they have the potential to accurately fit a wider range of possible shapes for f.

Disadvantage: because they do not reduce the problem to a small number of parameters, a very large number of observations is required in order to obtain an accurate estimate for f.

Ex. Thin-plate spline: it does not impose any pre-specified model on f and attempts to produce an estimate for f that is as close as possible to the observed data, subject to the fit being smooth. The data analyst must select a level of smoothness. BUT with a low level of smoothness, the spline fit is far more variable than the true function f (→ overfitting).
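
The effect of the smoothness level can be illustrated with a much simpler non-parametric method than a thin-plate spline. The sketch below uses a k-nearest-neighbour average, swapped in here only because it fits in a few lines; the number of neighbours `k` plays the role of the smoothness level. The true function, the noise level and the values of `k` are hypothetical.

```python
import math
import random


def knn_predict(x0, xs, ys, k):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


def true_f(x):
    # Hypothetical true function, known only because the data are simulated.
    return math.sin(x)


rng = random.Random(7)
xs = [rng.uniform(0, 6) for _ in range(200)]
ys = [true_f(x) + rng.gauss(0, 0.3) for x in xs]


def training_mse(k):
    """Mean squared distance between the fit and the observed training responses."""
    return sum((y - knn_predict(x, xs, ys, k)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_vs_truth(k):
    """Mean squared distance between the fit and the true function on a grid."""
    grid = [0.5 + 0.05 * i for i in range(101)]
    return sum((true_f(x) - knn_predict(x, xs, ys, k)) ** 2 for x in grid) / len(grid)


for k in (1, 15):
    print(f"k={k:2d}  training MSE={training_mse(k):.4f}  MSE vs true f={mse_vs_truth(k):.4f}")
```

With `k = 1` the fit passes through every training point, so its training error is zero, yet it follows the noise and lies further from the true function than the smoother `k = 15` fit. This is the same overfitting behaviour described for the rough spline fit.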




[Figures: the true function f; a thin-plate spline fit]

