Summary Exam Cheatsheet - Data Mining

A4 cheatsheet for the Data Mining exam

Document information
Uploaded on: 5 November 2025
Number of pages: 2
Written in: 2025/2026
Type: Summary

Content preview

1 Classification Trees

Strong and weak points

Strong points:
• Easy to interpret
• Automatically selects relevant attributes
• Can handle both numerical and categorical attributes

Weak points:
• Single trees are usually not among the top performers

1.1 Impurity

Impurity measure = a measure that indicates 'how far' a node is removed from containing only observations of a single class.

Impurity function
The impurity i(t) of a node t is a function of the relative frequencies of the classes in that node:

i(t) = φ(p_1, p_2, ..., p_J)

where p_j (j = 1, ..., J) are the relative frequencies of the J different classes in node t.

Sensible requirements of any quantification of impurity:
• Should be at a maximum when the observations are distributed evenly over all classes
• Should be at a minimum when all observations belong to a single class
• Should be a symmetric function of p_1, ..., p_J

Quality of a split
We define the quality of a binary split s in node t as the reduction of impurity that it achieves:

Δi(s, t) = i(t) − {π(l)i(l) + π(r)i(r)}

where l is the left child of t, r is the right child of t, π(l) is the proportion of cases that is sent to the left, and π(r) the proportion of cases that is sent to the right.

Impurity functions
We define the class F of impurity functions for two-class problems that have this property:
• φ(0) = φ(1) = 0 (minimum at p(0) = 0 and p(0) = 1)
• φ(p(0)) = φ(1 − p(0)) (symmetric)
• φ''(p(0)) < 0 for 0 < p(0) < 1 (strictly concave)

Resubstitution error: measures the fraction of cases that is classified incorrectly if we assign every case in node t to the majority class of that node:

i(t) = 1 − max_j p(j|t)

where p(j|t) is the relative frequency of class j in node t.

Gini index:

i(t) = p(0|t)p(1|t) = p(0|t)(1 − p(0|t))

Entropy:

i(t) = −p(0|t) log p(0|t) − p(1|t) log p(1|t)

Splits on numeric attributes
There is only a finite number of distinct splits, because there are at most n distinct values of a numeric attribute in the training sample.
Shortcut: the optimal split can only occur between consecutive values with different class distributions.

Splits on categorical attributes
For a categorical attribute with L distinct values there are 2^(L−1) − 1 distinct splits to consider, because there are 2^L − 2 non-empty proper subsets of {b_1, b_2, ..., b_L}, and a subset and the complement of that subset result in the same split → divide by 2.
Shortcut: sort p(0|x = b_l); then one of the L − 1 subsets is the optimal split.
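As a concrete illustration of the formulas above, here is a minimal Python sketch (not part of the original cheatsheet; function names are illustrative) that computes the two-class Gini index i(t) = p(0|t)(1 − p(0|t)) and the impurity reduction Δi(s, t) of a candidate binary split:

    # Illustrative sketch: two-class Gini index and impurity reduction of a split.
    def gini2(labels):
        """i(t) = p(0|t) * (1 - p(0|t)) for two-class labels coded as 0/1."""
        p0 = labels.count(0) / len(labels)
        return p0 * (1 - p0)

    def impurity_reduction(parent, left, right):
        """Delta i(s, t) = i(t) - {pi(l) i(l) + pi(r) i(r)}."""
        pi_l, pi_r = len(left) / len(parent), len(right) / len(parent)
        return gini2(parent) - (pi_l * gini2(left) + pi_r * gini2(right))

    parent = [0, 0, 1, 1]
    print(impurity_reduction(parent, [0, 0], [1, 1]))   # 0.25: a pure split of a balanced node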
1.2 Pruning

Cost-complexity pruning: weakest link pruning.

Terminology:
• Tree T
• Collection of leaf nodes T̃

Total cost of a tree: C_α(T) = R(T) + α|T̃|, with resubstitution error R(T) and a penalty α|T̃| (α ≥ 0) for the complexity of the tree.

Note: R(T) = (number of wrong classifications in T) / (number of examples in the training sample).

Depending on α, different pruned subtrees will have the lowest total cost. For α = 0, the tree with the smallest R(T) wins.

Smallest minimizing subtree
Theorem: for any value of α, there exists a smallest minimizing subtree T(α) of T_max that satisfies the following conditions:
• T(α) minimizes total cost for that value of α: C_α(T(α)) = min_{T ≤ T_max} C_α(T)
• T(α) is a pruned subtree of all trees that minimize total cost: if C_α(T) = C_α(T(α)), then T(α) ≤ T

Decomposition of total cost
The total cost of a tree is the sum of the contributions of its leaf nodes:

C_α(T) = R(T) + α|T̃| = Σ_{t∈T̃} (R(t) + α)

where R(t) is the number of errors we make in node t if we predict the majority class, divided by the total number of observations in the training sample.

Finding the T_k and corresponding α_k
T_t: branch of T with root node t.
After pruning in t, its contribution to the total cost is: C_α({t}) = R(t) + α
The contribution of T_t to the total cost is: C_α(T_t) = Σ_{t'∈T̃_t} (R(t') + α) = R(T_t) + α|T̃_t|
The tree obtained by pruning in t becomes better than T when: C_α({t}) = C_α(T_t)

g(t): the critical α value for node t. For each non-terminal node t we compute its critical value:

g(t) = (R(t) − R(T_t)) / (|T̃_t| − 1)
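The weakest-link computation can be sketched in a few lines; the code below is an illustration with hypothetical names (R_node for R(t), R_branch for R(T_t), n_leaves for |T̃_t|), not the cheatsheet's own code:

    # Illustrative sketch: critical alpha g(t) for weakest-link pruning.
    def critical_alpha(R_node, R_branch, n_leaves):
        # Setting R(t) + alpha = R(T_t) + alpha * n_leaves and solving for alpha:
        return (R_node - R_branch) / (n_leaves - 1)

    # The non-terminal node with the smallest g(t) is the weakest link and is pruned first.
    candidates = {"t1": critical_alpha(0.10, 0.06, 3),   # 0.02
                  "t2": critical_alpha(0.08, 0.07, 2)}   # 0.01
    print(min(candidates, key=candidates.get))           # 't2'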
1.3 Regression trees

Apply tree-based models to problems with numeric targets. Three elements are necessary:
• A way to select a split at every non-terminal node.
• A rule for determining when a node is terminal.
• A rule for assigning a predicted value ŷ(t) to every terminal node t.

Prediction rule
In leaf nodes, we predict the average target value of all cases falling into that node:

ŷ(t) = ȳ(t) = (1/N(t)) Σ_{i∈t} y_i

where N(t) is the number of cases falling into node t. We predict the value of c that minimizes the residual sum of squares (RSS):

RSS(t) = Σ_{i∈t} (y_i − c)²

Splitting rule
The mean squared error (MSE) of a tree T is given by:

R(T) = (1/N) Σ_{t∈T̃} Σ_{i∈t} (y_i − ȳ(t))²

where N is the size of the learning sample. The contribution of node t to the MSE of T is:

R(t) = (1/N) Σ_{i∈t} (y_i − ȳ(t))²

so we can write: R(T) = Σ_{t∈T̃} R(t).

The best split s* of t is that split which most decreases R(T). The decrease in R(T) of a split s in node t is given by:

ΔR(s, t) = R(t) − R(l) − R(r)
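A small illustrative sketch (toy data, hypothetical names) of scoring a regression split by the decrease ΔR(s, t) = R(t) − R(l) − R(r):

    # Illustrative sketch: score a regression split by Delta R(s, t).
    def node_contribution(y, N_total):
        """R(t) = (1/N) * sum over the node of (y_i - mean(y))^2."""
        mean = sum(y) / len(y)
        return sum((v - mean) ** 2 for v in y) / N_total

    def split_decrease(y_parent, y_left, y_right, N_total):
        """Delta R(s, t) = R(t) - R(l) - R(r); larger is a better split."""
        return (node_contribution(y_parent, N_total)
                - node_contribution(y_left, N_total)
                - node_contribution(y_right, N_total))

    y = [1.0, 1.2, 4.8, 5.0]
    print(split_decrease(y, y[:2], y[2:], N_total=len(y)))   # large decrease: a good split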
2 Tree ensembles

Reducible and irreducible error

E_{P(Y|x)}[(Y − f̂(x))²] = (f(x) − f̂(x))²  [reducible error]  +  E_{P(Y|x)}[(Y − f(x))²]  [irreducible error]

Bias-variance decomposition

E_D[(f(x) − f̂(x))²] = (f(x) − E_D[f̂(x)])²  [squared bias]  +  E_D[(f̂(x) − E_D[f̂(x)])²]  [variance]

MSE = bias² + variance

2.1 Reducing variance by bagging

Bagging is short for Bootstrap Aggregating.

Training:
1. Draw a sample with replacement from the training set (should be of the same size as the training set).
2. Grow a tree on this bootstrap sample (pruning not necessary).
3. Repeat these steps M times to create M different trees.

Prediction:
1. Predict on the test sample using each of the M trees in turn.
2. Take the majority vote of the M predictions as the final prediction.

For regression problems, generate M bootstrap samples and combine the predictions by averaging:

f̂_BAG(x) = (1/M) Σ_{m=1}^{M} f̂_m(x)

where f̂_m(x) is the prediction of the model trained on the m-th bootstrap sample. Note that we never actually average models; we average their predictions.

The average error made by the models acting individually is:

E_AV = (1/M) Σ_{m=1}^{M} E_{P(x)}[e_m(x)²]

If we assume that the errors have zero mean and are uncorrelated, we obtain:

E_BAG = (1/M²) Σ_{m=1}^{M} E_{P(x)}[e_m(x)²] = (1/M) E_AV

Random forests
Attempt to de-correlate the predictions of individual trees, such that E_{P(x)}[e_m(x) e_n(x)] is closer to zero. Same as bagging, but with random subsets of features.

Boosting
Same as bagging, except that the trees are grown sequentially; each tree is grown using information from previously grown trees. It does so by fitting small trees to the residuals, such that f̂ is slowly improved in the areas where it does not perform well.

Parameters:
• Number of trees B. Cross-validation is used to select it.
• Shrinkage parameter λ. Controls the learning rate. Typical value 0.01 or 0.001. Small λ requires a big B to achieve good performance.
• Number of splits d. Controls the complexity of the boosted ensemble. Often d = 1 works well.
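A minimal sketch of bagging for regression, assuming NumPy and scikit-learn are available (the helper name bagged_predict is illustrative): bootstrap M samples, grow a tree on each, and average the predictions.

    # Illustrative bagging sketch for regression (assumes numpy and scikit-learn).
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def bagged_predict(X_train, y_train, X_test, M=100, seed=0):
        rng = np.random.default_rng(seed)
        n = len(X_train)
        preds = []
        for _ in range(M):
            idx = rng.integers(0, n, size=n)        # bootstrap sample (with replacement)
            tree = DecisionTreeRegressor()          # grown on the bootstrap sample, no pruning
            tree.fit(X_train[idx], y_train[idx])
            preds.append(tree.predict(X_test))      # we average predictions, never the models
        return np.mean(preds, axis=0)               # f_BAG(x) = (1/M) * sum_m f_m(x)

    X = np.arange(10, dtype=float).reshape(-1, 1)   # toy data
    y = 2.0 * X.ravel() + 1.0
    print(bagged_predict(X, y, np.array([[4.5]]), M=25))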
3 Text mining

Rules of probability
Sum rule: P(X) = Σ_Y P(X, Y)
Product rule: P(X, Y) = P(X)P(Y|X)
If X and Y are independent, then: P(X, Y) = P(X)P(Y)

3.1 Naive Bayes

Naive Bayes assumption:

P(x_1, ..., x_m | c) = P(x_1|c) ··· P(x_m|c)

Features are assumed to be independent within each class. For binary features, we only have to estimate m probabilities per class; the curse of dimensionality is avoided.

Multinomial Naive Bayes
Represent document d as a sequence of words d = ⟨w_1, w_2, ..., w_n⟩.

Class priors: P̂(c) = N_c / N_doc

c_NB = argmax_{c∈C} [ log P̂(c) + Σ_{k=1}^{n} log P̂(w_k|c) ]

Predicting word w_i given class:

P̂(w_i|c) = count(w_i, c) / Σ_{w_j∈V} count(w_j, c)

Laplace smoothing: add 1 to the counts to avoid zero probability estimates:

P̂(w_i|c) = (count(w_i, c) + 1) / (Σ_{w_j∈V} count(w_j, c) + |V|)

Multinomial Naive Bayes makes two assumptions:
• Word occurrences are independent within each class.
• Positional independence.

NB works because although the probability estimates may be way off, the order is still correct.

Predicting class c given word(s):

P̂(c|X_i) = P̂(c, X_i) / P̂(X_i), with P̂(c, X_i) = P̂(c) ∏_{w∈X_i} P̂(w|c) and P̂(X_i) = Σ_{c∈C} P̂(c, X_i)
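The multinomial Naive Bayes training and prediction steps above can be sketched as follows (a toy, illustrative implementation that assumes documents are given as lists of tokens; it is not the cheatsheet's code):

    # Illustrative multinomial Naive Bayes sketch with Laplace smoothing.
    import math
    from collections import Counter, defaultdict

    def train_mnb(docs, labels):
        vocab = {w for d in docs for w in d}
        prior = Counter(labels)                                  # N_c per class
        counts = defaultdict(Counter)                            # count(w, c)
        for d, c in zip(docs, labels):
            counts[c].update(d)
        return vocab, prior, counts, len(docs)

    def predict_mnb(doc, vocab, prior, counts, n_docs):
        def score(c):
            total = sum(counts[c].values())
            s = math.log(prior[c] / n_docs)                      # log P(c)
            for w in doc:                                        # Laplace-smoothed log P(w|c)
                s += math.log((counts[c][w] + 1) / (total + len(vocab)))
            return s
        return max(prior, key=score)                             # argmax over classes

    docs = [["cheap", "pills"], ["meeting", "agenda"], ["cheap", "meds"]]
    labels = ["spam", "ham", "spam"]
    print(predict_mnb(["cheap", "agenda"], *train_mnb(docs, labels)))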
3.2 Feature selection

Entropy
Entropy is the average amount of information generated by observing the value of a random variable. It can also be interpreted as a measure of the uncertainty about the value of X prior to the observation.

Conditional entropy:

H(X|Y) = −Σ_{x,y} P(x, y) log₂ P(x|y)

If X and Y are independent, then H(X) = H(X|Y).

Mutual information
For random variables X and Y, their mutual information is given by:

I(X; Y) = H(X) − H(X|Y) = Σ_x Σ_y P(x, y) log₂ [ P(x, y) / (P(x)P(y)) ]

Mutual information measures the reduction in uncertainty about X achieved by observing the value of Y.

To estimate I(X; Y) from data:

Î(X; Y) = Σ_x Σ_y P̂(x, y) log₂ [ P̂(x, y) / (P̂(x)P̂(y)) ]

where P̂(x, y) = n(x, y)/N and P̂(x) = n(x)/N.
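A small illustrative sketch of the plug-in estimate Î(X; Y) from a table of counts n(x, y) (hypothetical names, toy data):

    # Illustrative sketch: plug-in mutual information from a contingency table.
    import math

    def mutual_information(counts):
        """counts[x][y] = n(x, y); returns the estimate of I(X;Y) in bits."""
        N = sum(v for row in counts.values() for v in row.values())
        px = {x: sum(row.values()) / N for x, row in counts.items()}
        py = {}
        for row in counts.values():
            for y, v in row.items():
                py[y] = py.get(y, 0.0) + v / N
        mi = 0.0
        for x, row in counts.items():
            for y, v in row.items():
                if v > 0:
                    pxy = v / N
                    mi += pxy * math.log2(pxy / (px[x] * py[y]))
        return mi

    # Perfectly dependent binary variables give I(X;Y) = 1 bit.
    print(mutual_information({0: {0: 50, 1: 0}, 1: {0: 0, 1: 50}}))   # 1.0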
3.3 Logistic Regression

Generative models = classify based on the probability distribution of X generated by the model (Naive Bayes, Linear/Quadratic Discriminant Analysis).

Discriminative models = classify based on the conditional distribution of Y given X. The probability distribution of X is not modeled.

Logistic response function:

E[Y|x] = P(Y = 1|x) = e^{β⊤x} / (1 + e^{β⊤x}) = 1 / (1 + e^{−β⊤x})

Maximum likelihood estimation:

P(y_i) = p_i^{y_i} (1 − p_i)^{1−y_i},   y_i ∈ {0, 1};  i = 1, ..., n

Fitted response function:

P̂(Y = 1|x) = e^{β̂⊤x} / (1 + e^{β̂⊤x})
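As an illustration (assuming scikit-learn; note that LogisticRegression applies L2 regularization by default, so the fit is penalized maximum likelihood), the logistic response function and a fitted model could look like this:

    # Illustrative sketch: logistic response function and a fitted model.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def logistic_response(beta, x):
        """P(Y=1|x) = 1 / (1 + exp(-beta^T x)); x includes a leading 1 for the intercept."""
        return 1.0 / (1.0 + np.exp(-np.dot(beta, x)))

    X = np.array([[0.0], [1.0], [2.0], [3.0]])       # toy predictor
    y = np.array([0, 0, 1, 1])
    model = LogisticRegression().fit(X, y)           # (penalized) maximum likelihood fit
    beta_hat = np.r_[model.intercept_, model.coef_.ravel()]
    x_new = np.r_[1.0, 1.5]                          # [intercept term, predictor value]
    print(logistic_response(beta_hat, x_new), model.predict_proba([[1.5]])[0, 1])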
4 Undirected graphical models

Saturated model
The saturated model estimates cell probabilities by dividing the cell count by the total number of observations, so the fitted counts are the same as the observed counts:

n̂(G, E) = N · P̂(G, E) = n(G, E)

This gives one parameter per cell, which quickly becomes too many (curse of dimensionality).

4.1 Independence model
Avoid the curse by assuming independence:

P̂(G, E) = P̂(G)P̂(E) = (n(G)/n)(n(E)/n) = n(G)n(E)/n²

Fitted counts: n̂(G, E) = n·P̂(G, E) = n(G)n(E)/n

Factorisation criterion for independence:

log P(x, y) = g*(x) + h*(y)

4.2 Conditional Independence Graph

X and Y are conditionally independent given Z iff:

P(x, y|z) = P(x|z)P(y|z)

Also written as X ⊥ Y|Z.

Factorisation criterion for conditional independence: X ⊥ Y|Z iff there exist functions g and h such that:

log P(x, y, z) = g(x, z) + h(y, z)

The conditional independence graph of X is the undirected graph G = (K, E) where {i, j} is not in the edge set iff:

X_i ⊥ X_j | rest

4.3 2x2 table
Probability function:

log P(x₁, x₂) = log p(0,0) + log( p(1,0)/p(0,0) ) x₁ + log( p(0,1)/p(0,0) ) x₂ + log( p(1,1)p(0,0) / (p(0,1)p(1,0)) ) x₁x₂

Reparameterizing the right-hand side leads to the so-called log-linear expansion:

log P(x₁, x₂) = u_∅ + u₁x₁ + u₂x₂ + u₁₂x₁x₂

4.4 Cross-product ratio
The cross-product ratio between binary variables X₁ and X₂ is:

cpr(X₁, X₂) = p(1,1)p(0,0) / (p(0,1)p(1,0))

cpr(X₁, X₂) > 1: positive association; cpr(X₁, X₂) < 1: negative association; cpr(X₁, X₂) = 1: no association.
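An illustrative sketch (toy counts, hypothetical names) of the independence-model fitted counts of 4.1 and the cross-product ratio of 4.4 for a 2x2 table:

    # Illustrative sketch: independence-model fit and cross-product ratio, 2x2 table.
    n = [[30, 10],
         [20, 40]]                               # observed counts n(x1, x2)
    N = sum(map(sum, n))
    row = [sum(r) for r in n]                    # n(x1)
    col = [sum(c) for c in zip(*n)]              # n(x2)

    # 4.1: fitted counts under independence, n_hat(x1, x2) = n(x1) * n(x2) / N
    fitted = [[row[i] * col[j] / N for j in range(2)] for i in range(2)]

    # 4.4: cross-product ratio on the observed counts (the 1/N factors cancel)
    cpr = (n[1][1] * n[0][0]) / (n[0][1] * n[1][0])
    print(fitted, cpr)                           # cpr > 1 here: positive association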
4.5 Joint distribution of three-dimensional Bernoulli
Probability distribution:

P(x₁, x₂, x₃) = p(0,0,0)^{(1−x₁)(1−x₂)(1−x₃)} · ... · p(1,1,1)^{x₁x₂x₃}

Log-linear expansion:

log P(x₁, x₂, x₃) = u_∅ + u₁x₁ + u₂x₂ + u₃x₃ + u₁₂x₁x₂ + u₁₃x₁x₃ + u₂₃x₂x₃ + u₁₂₃x₁x₂x₃

with:

u₁₂₃ = log( cpr(X₂, X₃|X₁=1) / cpr(X₂, X₃|X₁=0) )

4.6 Log-linear expansion
Let X = (X₁, X₂, ..., X_k) be a vector of discrete random variables, and let K = {1, 2, ..., k} denote the set of indices (coordinates) of X. The log-linear expansion of the probability distribution P_K(X) is:

log P_K(x) = Σ_{a⊆K} u_a(x_a)

where the sum is taken over all possible subsets a of K = {1, 2, ..., k}.

• We get a (set of) u-term(s) for each subset of the variables.
• We code the values of X_i as {0, 1, ..., d_i − 1}, where d_i is the size of the domain of X_i. This is analogous to the case where X is binary.
• Set u_a(x_a) = 0 whenever x_i = 0 for any X_i with i ∈ a, to eliminate redundant parameters.
• There are as many u-terms in the full log-linear expansion as there are cells in the contingency table: n_cells = ∏_i d_i.

Why log-linear representation?
• We are interested in expressing conditional independence constraints.
• There is a straightforward correspondence between such constraints being satisfied and the elimination of u-terms.
• This correspondence is established by applying the factorisation criterion.

Independence and u-terms
If for all s ⊆ K with u_s(x_s) ≠ 0 we have s ⊆ a∪b or s ⊆ a∪c, then:

log P_K(x) = Σ_{s⊆a∪b} u_s(x_s) + Σ_{s⊆a∪c} u_s(x_s) − Σ_{s⊆a} u_s(x_s)

which is of the form g(x_{a∪b}) + h(x_{a∪c}), so X_b ⊥ X_c | X_a by the factorisation criterion.

Hierarchical models
In most applications, it does not make sense to include u₁₂₃ unless u₁₂, u₁₃, and u₂₃ are also present. A log-linear model is hierarchical if the presence of a term implies that all lower-order terms are also present. Hence, a hierarchical model is uniquely identified by listing its highest-order interaction terms.

Graphical models
Given its independence graph G = (K, E), the log-linear model for the random vector X is a graphical model for X if the distribution of X is arbitrary apart from constraints of the form that, for all pairs of coordinates not in the edge set E, the u-terms containing the selected coordinates are equal to zero.

Maximum likelihood
The ML estimator of a graphical log-linear model M satisfies the likelihood equations:

n̂_a^M = N · P̂_a^M = n_a

whenever the subset of vertices a in the graph forms a clique. Observed = fitted for every marginal table corresponding to a complete subgraph.

Decomposable models
Decomposable graphical models have explicit formulas for the maximum likelihood estimates. They have triangulated independence graphs: they have no chordless cycles of length greater than three. A cycle is chordless if only the successive pairs of vertices in the cycle are connected by an edge (there is no "shortcut" in the cycle).

Let C₁, ..., C_m be a RIP ordering of the cliques, with separator sets S₂, ..., S_m.

Maximum likelihood fitted counts:

n̂(x) = ∏_{j=1}^{m} n(x_{C_j}) / ∏_{j=2}^{m} n(x_{S_j})

Maximum likelihood fitted probabilities:

P̂(x) = (1/N) ∏_{j=1}^{m} n(x_{C_j}) / ∏_{j=2}^{m} n(x_{S_j})

Iterative Proportional Fitting (IPF)
If the model is not decomposable, there is no closed-form solution for the ML estimates. IPF is an iterative algorithm to find the ML estimates:

1. Fit row to margin: n̂(x₁, x₂)^(1) = n₁(x₁) · P̂(x₂|x₁)^(0) = n₁(x₁) · n̂(x₁, x₂)^(0) / n̂₁(x₁)^(0)
2. Fit column to margin: n̂(x₁, x₂)^(2) = n₂(x₂) · P̂(x₁|x₂)^(1) = n₂(x₂) · n̂(x₁, x₂)^(1) / n̂₂(x₂)^(1)
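An illustrative IPF sketch for a two-way table (toy data; for the two-way independence model IPF converges after the first sweep, while for non-decomposable models the same margin-fitting cycle is simply repeated until convergence):

    # Illustrative IPF sketch: rescale fitted counts to match row and column margins.
    observed = [[30.0, 10.0],
                [20.0, 40.0]]
    row_m = [sum(r) for r in observed]                 # n1(x1)
    col_m = [sum(c) for c in zip(*observed)]           # n2(x2)

    fitted = [[1.0, 1.0], [1.0, 1.0]]                  # starting table n_hat^(0)
    for _ in range(20):
        # Step 1: fit rows -- n_hat^(1) = n1(x1) * n_hat^(0) / n_hat_1(x1)^(0)
        fitted = [[row_m[i] * fitted[i][j] / sum(fitted[i]) for j in range(2)]
                  for i in range(2)]
        # Step 2: fit columns -- n_hat^(2) = n2(x2) * n_hat^(1) / n_hat_2(x2)^(1)
        col_f = [sum(c) for c in zip(*fitted)]
        fitted = [[col_m[j] * fitted[i][j] / col_f[j] for j in range(2)]
                  for i in range(2)]
    print(fitted)   # matches the independence-model fitted counts n1(x1) n2(x2) / N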
4.7 Model testing

Log-likelihood
The likelihood score of model M is the probability of the observed data using the fitted cell probabilities according to model M. The log-likelihood score of model M is:

L_M = Σ_x n(x) log P̂^M(x)

The log-likelihood score of the saturated model is:

L_sat = Σ_x n(x) log( n(x)/N )

Model deviance
The deviance of model M is:

dev(M) = 2 Σ_{cells} observed · log( observed / fitted )

The deviance difference between M₀ and M₁ for large N is:

2(L_{M₁} − L_{M₀}) ≈ χ²_v   (when M₀ is the true model)

We reject the null hypothesis that M₀ is the true model when:

2(L_{M₁} − L_{M₀}) > χ²_{v;α}

where χ²_{v;α} is the critical value of the chi-square distribution with v degrees of freedom at significance level α, and v is the number of additional restrictions (zero u-terms) of M₀ compared to M₁.

4.8 Model selection
Two components: the lack-of-fit of the model and the complexity of the model.

AIC(M) = dev(M) + 2·dim(M)
BIC(M) = dev(M) + log(N)·dim(M)

where dim(M) is the number of parameters (number of u-terms) of the model.

Search
Exhaustive search is usually not feasible. A straightforward approach is lo…
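Finally, a small illustrative sketch of the deviance and the AIC/BIC scores for a fitted table (toy values; dim(M) counts the u-terms of the model at hand):

    # Illustrative sketch: deviance, AIC and BIC for fitted cell counts.
    import math

    def deviance(observed, fitted):
        """dev(M) = 2 * sum over cells of observed * log(observed / fitted)."""
        return 2.0 * sum(o * math.log(o / f) for o, f in zip(observed, fitted) if o > 0)

    observed = [30, 10, 20, 40]                  # n(x), flattened 2x2 table
    fitted   = [20, 20, 30, 30]                  # independence-model fit from the sketch above
    N, dim_M = sum(observed), 3                  # u_0, u_1, u_2 (no interaction term)
    dev = deviance(observed, fitted)
    print(dev, dev + 2 * dim_M, dev + math.log(N) * dim_M)   # deviance, AIC, BIC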