Summary

Data mining summary ch1-3

Published on: 28-04-2025
Written in: 2023/2024

The best summary you can get for the first chapters. Success guaranteed for these chapters on the difficult exam!

School, study and subject

Institution: Universiteit Gent (UGent)
Study: Handelsingenieur
Course: Statistisch modelleren en datamining

Document information

Number of pages: 45
Type: Summary

Content preview

Data Mining
Chapter 1

Statistical learning = a set of tools for understanding data

• Supervised: building a statistical model for predicting or estimating an output based on one or more inputs
• Unsupervised: there are inputs but no supervising output; we learn relationships and structure from such data

Example data sets

• Wage data
o Examine a number of factors that relate to wages for a group of men
o Understand the association between an employee's age, education and the calendar year, and his wage
o Predicting a continuous or quantitative output value → regression problem

• Stock market data
o Predicting a non-numerical value, i.e. a categorical or qualitative output → classification problem
o Goal is to predict whether the index will increase or decrease on a given day, using the past 5 days' percentage changes in the index

• Gene expression data
o Only the input variables are observed, with no corresponding output
o Clustering problem = group observations according to their observed characteristics (e.g. in marketing: understand which types of customers are similar to each other); here: determine whether the cell lines form groups based on their gene expression measurements
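As a rough sketch of how the stock market classification problem above can be set up, the snippet below turns a series of daily percentage changes into (inputs, label) pairs. The function name `make_lagged_dataset` and the numbers in `changes` are made up for illustration; they are not taken from the actual stock market data set.

```python
def make_lagged_dataset(pct_changes, n_lags=5):
    """Turn daily percentage changes into (inputs, label) pairs.

    Inputs: the previous n_lags percentage changes.
    Label:  "Up" if the current day's change is positive, else "Down".
    """
    rows = []
    for t in range(n_lags, len(pct_changes)):
        lags = pct_changes[t - n_lags:t]            # the past n_lags days
        label = "Up" if pct_changes[t] > 0 else "Down"
        rows.append((lags, label))
    return rows


# Made-up percentage changes, for illustration only
changes = [0.4, -0.2, 1.1, -0.7, 0.3, 0.5, -0.9, 0.2]
dataset = make_lagged_dataset(changes)
for lags, label in dataset:
    print(lags, "->", label)
```

The label is qualitative (Up/Down), which is what makes this a classification problem rather than a regression problem.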

Chapter 2

2.1 What is statistical learning?

Example: the goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets (TV, radio and newspaper in the Advertising data).

Input variables = X (X1, X2, X3, …)
Different names:
- Predictors
- Independent variable
- Features
- Variables

Output variables = Y
Different names:
- Response
- Dependent variable

Relationship between X and Y in its general form:

$Y = f(X) + \epsilon$

• f: some fixed but unknown function of X1, …, Xp
o may involve more than one input variable
• ε: random error term
o independent of X
o has mean zero

Example:

Income is a simulated data set, so f is known; it is the blue curve in the right-hand panel.

The vertical lines represent the error term ε: some observations lie above the blue curve and some below it.

Overall, the errors have approximately mean zero.
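The idea of a simulated data set with a known f can be sketched as follows. The function `f`, the input range and the noise level are assumptions chosen only for illustration; they are not the values behind the Income data.

```python
import random
import statistics

random.seed(0)


def f(x):
    # Hypothetical "true" function, chosen only for illustration
    return 20 + 3 * x


n = 10_000
xs = [random.uniform(10, 22) for _ in range(n)]
eps = [random.gauss(0, 5) for _ in range(n)]     # mean-zero error, drawn independently of x
ys = [f(x) + e for x, e in zip(xs, eps)]         # Y = f(X) + eps

# Vertical distance of each observation from the true curve
residuals = [y - f(x) for x, y in zip(xs, ys)]
above = sum(r > 0 for r in residuals)

print("mean error:", round(statistics.fmean(residuals), 3))
print("share of points above the curve:", above / n)
```

Because f is known here, the errors can be computed directly: roughly half of the points should lie above the curve and the average error should be close to zero.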

2.1.1 Why estimate f?

There are 2 main reasons that we may wish to estimate f: (1) prediction and (2) inference

(1) Prediction
A set of inputs X is available, but the output Y is not easily obtained.

We predict Y using $\hat{Y} = \hat{f}(X)$

• $\hat{f}$: the estimate for f → treated as a black box (we are not concerned with the exact form of $\hat{f}$, provided that it yields accurate predictions for Y)

• $\hat{Y}$: the resulting prediction for Y
o The accuracy depends on:
▪ Reducible error = $\hat{f}$ is not a perfect estimate of f; we can improve the accuracy of $\hat{f}$ by using the most appropriate statistical learning technique to estimate f
→ BUT there will still be some error in it
▪ Irreducible error = Y is also a function of ε, which cannot be predicted using X. No matter how well we estimate f, we cannot reduce the error introduced by ε.
It is larger than zero because ε may contain unmeasured variables that are useful in predicting Y. They are unmeasured, so f cannot use them for its prediction.

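Assume for a moment that both $\hat{f}$ and X are fixed, so that the only variability comes from ε.

$E(Y - \hat{Y})^2 = E[f(X) + \epsilon - \hat{f}(X)]^2 = [f(X) - \hat{f}(X)]^2 + \mathrm{Var}(\epsilon)$

• $E(Y - \hat{Y})^2$: average or expected value of the squared difference between the predicted and actual value of Y
• $[f(X) - \hat{f}(X)]^2$: reducible error
• $\mathrm{Var}(\epsilon)$: variance associated with the error term ε (irreducible error)

→ Focus of this book: techniques for estimating f that minimize the reducible error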
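The split into reducible and irreducible error can be checked numerically. This is a minimal sketch under assumed values: `f`, `f_hat`, `x0` and `sigma` are hypothetical and chosen only so that both error parts are clearly non-zero.

```python
import random
import statistics

random.seed(1)


def f(x):          # hypothetical true function
    return 2.0 + 0.5 * x


def f_hat(x):      # hypothetical, imperfect estimate of f
    return 1.0 + 0.6 * x


x0 = 4.0           # X is held fixed
sigma = 1.5        # standard deviation of eps, so Var(eps) = 2.25
n = 100_000

# Only eps varies: draw Y repeatedly and record the squared prediction error
sq_errors = []
for _ in range(n):
    y = f(x0) + random.gauss(0, sigma)
    sq_errors.append((y - f_hat(x0)) ** 2)

expected_sq_error = statistics.fmean(sq_errors)   # estimates E(Y - Y_hat)^2
reducible = (f(x0) - f_hat(x0)) ** 2              # [f(X) - f_hat(X)]^2
irreducible = sigma ** 2                          # Var(eps)

print("simulated E(Y - Y_hat)^2:", round(expected_sq_error, 3))
print("reducible + irreducible :", round(reducible + irreducible, 3))
```

The two printed numbers should agree up to simulation noise. Improving `f_hat` can shrink the first part towards zero, but the `sigma ** 2` part remains whatever estimate is used.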

(2) Inference

Interested in the association between Y and X1, …, Xp
Answering the following questions:
1. Which predictors are associated with the response?
2. What is the relationship between the response and each predictor?
3. Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?

Some modelling can be conducted for both prediction and inference (e.g. real estate setting: some are interested in predicting the value of a home from inputs such as the crime rate, others in the association between the price of a house and a view of the river).

2.1.2 How to estimate f?

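Training data = the observations we use to train/teach our method how to estimate f.

It consists of $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$ where $x_i = (x_{i1}, x_{i2}, \dots, x_{ip})^T$

xij = value of the jth predictor/input for observation i (i = 1, …, n and j = 1, …, p)
yi = response for observation i

Our goal is to apply a statistical learning method to the training data in order to estimate the unknown function f. In other words, we want to find an $\hat{f}$ such that $Y \approx \hat{f}(X)$ for any observation (X, Y).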

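(1) Parametric methods
Two steps:
1) Make an assumption about the functional form or shape of f.
Ex. Linear model: $f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$

Instead of having to estimate an entirely arbitrary p-dimensional function f(X), we only need to estimate the p + 1 coefficients β0, β1, …, βp.
2) After a model has been selected, we need a procedure that uses the training data to fit or train the model.

Ex. Linear model: find values of β0, β1, …, βp such that $Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$ (most common approach: least squares)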

It reduces the problem of estimating f down to one of estimating a set of parameters.
Disadvantage: the model we choose will usually not match the true unknown form of f; if it is too far from the true f, the estimate will be poor.

→ Try to address this by choosing a more flexible model, BUT this requires estimating more parameters.
→ These more complex models can lead to overfitting the data (they follow the errors, or noise, too closely)

[Figure: true function f; linear model fit by least squares]
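The two parametric steps can be sketched for the simplest case of one predictor (p = 1). The data are simulated from a hypothetical linear truth (`true_b0`, `true_b1` are assumptions), and `fit_least_squares` is an illustrative helper using the standard closed-form least squares solution, not a library function.

```python
import random

random.seed(2)

# Simulated training data from a hypothetical linear truth plus noise
true_b0, true_b1 = 1.0, 2.0
xs = [random.uniform(0, 10) for _ in range(500)]
ys = [true_b0 + true_b1 * x + random.gauss(0, 1) for x in xs]


def fit_least_squares(xs, ys):
    """Step 2: return (b0, b1) minimising the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Step 1 was the assumption f(X) = b0 + b1 * X; step 2 estimates the two parameters
b0_hat, b1_hat = fit_least_squares(xs, ys)


def f_hat(x):
    return b0_hat + b1_hat * x


print("estimated intercept:", round(b0_hat, 3))
print("estimated slope    :", round(b1_hat, 3))
```

Estimating f has been reduced to estimating two numbers. That works well here because the assumed linear form matches the simulated truth; with a non-linear truth the same fit would be biased however much data is available.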

(2) Non-parametric methods
They make no explicit assumption about the functional form of f. They seek an estimate of f that gets as close to the data points as possible without being too rough or wiggly. Because no shape is chosen in advance, they have the potential to accurately fit a wider range of possible shapes for f.

Disadvantage: since they do not reduce the problem to a small number of parameters, a very large number of observations is required in order to obtain an accurate estimate for f.

Ex. Thin-plate spline: it does not impose any pre-specified model on f and attempts to produce an estimate for f that is as close as possible to the observed data, subject to the fit being smooth. The data analyst must select a level of smoothness. BUT with a low level of smoothness the spline fit is far more variable than the true function f (→ overfitting).

[Figure: true function f; thin-plate spline fit]
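To illustrate the trade-off between following the data closely and overfitting, here is a small sketch. Note that it does not use a thin-plate spline: it swaps in k-nearest-neighbour averaging, a simpler non-parametric method, where k plays a role loosely comparable to the level of smoothness. The true function and the noise level are assumptions for illustration.

```python
import math
import random

random.seed(3)


def f(x):                       # hypothetical non-linear true function
    return math.sin(x)


def simulate(n):
    xs = [random.uniform(0, 6) for _ in range(n)]
    ys = [f(x) + random.gauss(0, 0.3) for x in xs]
    return xs, ys


train_x, train_y = simulate(200)
test_x, test_y = simulate(500)


def knn_predict(x0, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def mse(xs, ys, k):
    return sum((y - knn_predict(x, k)) ** 2 for x, y in zip(xs, ys)) / len(xs)


for k in (1, 15):
    print("k =", k,
          "| training MSE:", round(mse(train_x, train_y, k), 3),
          "| test MSE:", round(mse(test_x, test_y, k), 3))
```

With k = 1 the training error is exactly zero, because each training point is its own nearest neighbour: the fit follows the noise. On new data a smoother fit (larger k) would be expected to do better, which is the overfitting pattern described above.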

Seller: merelgeladi
