Summary Exam Cheatsheet - Data Mining


A4 cheatsheet for the Data Mining exam

Document information

Uploaded on: November 5, 2025
Number of pages: 2
Written in: 2025/2026
Type: Summary

Content preview

1 Classification Trees

Strong and weak points

Strong points:
• Easy to interpret
• Automatically selects relevant attributes
• Can handle both numerical and categorical attributes

Weak points:
• Single trees are usually not among the top performers

1.1 Impurity

Impurity measure = a measure that indicates 'how far' a node is removed from containing only observations of a single class.

Impurity function
The impurity i(t) of a node t is a function of the relative frequencies of the classes in that node:

i(t) = φ(p1, p2, ..., pJ)

where pj (j = 1, ..., J) are the relative frequencies of the J different classes in node t.

Sensible requirements of any quantification of impurity:
• Should be at a maximum when the observations are distributed evenly over all classes
• Should be at a minimum when all observations belong to a single class
• Should be a symmetric function of p1, ..., pJ

Quality of a split
We define the quality of a binary split s in node t as the reduction of impurity that it achieves:

Δi(s, t) = i(t) − {π(l)i(l) + π(r)i(r)}

where l is the left child of t, r is the right child of t, π(l) is the proportion of cases that is sent to the left, and π(r) the proportion of cases that is sent to the right.

Impurity functions
We define the class F of impurity functions for two-class problems that have the following properties:
• φ(0) = φ(1) = 0 (minimum at p(0) = 0 and p(0) = 1)
• φ(p(0)) = φ(1 − p(0)) (symmetric)
• φ″(p(0)) < 0 for 0 < p(0) < 1 (strictly concave)

Resubstitution error: measures the fraction of cases that is classified incorrectly if we assign every case in node t to the majority class of that node:

i(t) = 1 − max_j p(j|t)

where p(j|t) is the relative frequency of class j in node t.

Gini index:

i(t) = p(0|t) p(1|t) = p(0|t)(1 − p(0|t))

Entropy:

i(t) = −p(0|t) log p(0|t) − p(1|t) log p(1|t)
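As an illustration (a minimal Python sketch; the function names are ours and the base-2 logarithm is an arbitrary choice), the impurity measures above and the split quality Δi(s, t) can be computed directly from the class frequencies:

import math

# Two-class impurity measures, written directly from the definitions above;
# p0 is the relative frequency p(0|t) of class 0 in node t.
def resubstitution_error(p0):
    return 1 - max(p0, 1 - p0)

def gini(p0):
    return p0 * (1 - p0)

def entropy(p0):
    if p0 in (0.0, 1.0):              # limit p * log p -> 0
        return 0.0
    return -p0 * math.log2(p0) - (1 - p0) * math.log2(1 - p0)

def split_quality(p0_t, p0_l, p0_r, pi_l, impurity=gini):
    # Delta i(s, t) = i(t) - {pi(l) i(l) + pi(r) i(r)} for a binary split s of node t.
    pi_r = 1 - pi_l
    return impurity(p0_t) - (pi_l * impurity(p0_l) + pi_r * impurity(p0_r))

# Example: a 50/50 parent split into two fairly pure children of equal size.
print(split_quality(p0_t=0.5, p0_l=0.9, p0_r=0.1, pi_l=0.5))   # 0.16 with the Gini index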
Splits on numeric attributes
There is only a finite number of distinct splits, because there are at most n distinct values of a numeric attribute in the training sample. Shortcut: the optimal split can only occur between consecutive values with different class distributions.

Splits on categorical attributes
For a categorical attribute with L distinct values there are 2^(L−1) − 1 distinct splits to consider, because there are 2^L − 2 non-empty proper subsets of {b1, b2, ..., bL}, and a subset and the complement of that subset result in the same split → divide by 2. Shortcut: sort p(0|x = b_l); then one of the resulting L − 1 subsets is the optimal split.

1.2 Pruning

Cost-complexity pruning: weakest link pruning.

Terminology:
• Tree T
• Collection of leaf nodes T̃

Total cost of a tree:

Cα(T) = R(T) + α|T̃|

with resubstitution error R(T) and a penalty α|T̃| (α ≥ 0) for the complexity of the tree. Note:

R(T) = (number of wrong classifications in T) / (number of examples in the training sample)

Depending on α, different pruned subtrees will have the lowest total cost. For α = 0, the tree with the smallest R(T) wins.

Smallest minimizing subtree
Theorem: for any value of α, there exists a smallest minimizing subtree T(α) of Tmax that satisfies the following conditions:
• T(α) minimizes the total cost for that value of α: Cα(T(α)) = min_{T ≤ Tmax} Cα(T)
• T(α) is a pruned subtree of all trees that minimize the total cost: if Cα(T) = Cα(T(α)), then T(α) ≤ T

Decomposition of total cost
The total cost of a tree is the sum of the contributions of its leaf nodes:

Cα(T) = R(T) + α|T̃| = Σ_{t∈T̃} (R(t) + α)

where R(t) is the number of errors we make in node t if we predict the majority class, divided by the total number of observations in the training sample.

Finding the T_k and corresponding α_k
T_t: the branch of T with root node t. After pruning in t, its contribution to the total cost is:

Cα({t}) = R(t) + α

The contribution of T_t to the total cost is:

Cα(T_t) = Σ_{t′∈T̃_t} (R(t′) + α) = R(T_t) + α|T̃_t|

The tree obtained by pruning in t becomes better than T when Cα({t}) = Cα(T_t). g(t) is the critical α value for node t; for each non-terminal node t we compute its critical value:

g(t) = (R(t) − R(T_t)) / (|T̃_t| − 1)
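A minimal sketch of how the critical values could be computed, assuming a node is represented as a plain dict with its own resubstitution error and its children (the representation and helper names are illustrative):

# A node is a plain dict: "R" holds the node's own resubstitution error R(t),
# "children" (absent for leaves) holds the child nodes.
def branch_stats(node):
    # Returns (R(T_t), number of leaves of T_t) for the branch rooted at this node.
    if "children" not in node:
        return node["R"], 1
    r_sum, leaves = 0.0, 0
    for child in node["children"]:
        r, l = branch_stats(child)
        r_sum += r
        leaves += l
    return r_sum, leaves

def critical_value(node):
    # g(t) = (R(t) - R(T_t)) / (|leaves of T_t| - 1) for a non-terminal node t.
    r_branch, n_leaves = branch_stats(node)
    return (node["R"] - r_branch) / (n_leaves - 1)

# Weakest-link step: the internal node with the smallest g(t) is pruned first.
tree = {"R": 0.30, "children": [{"R": 0.10},
                                {"R": 0.05, "children": [{"R": 0.02}, {"R": 0.01}]}]}
print(critical_value(tree))   # g(root) = (0.30 - 0.13) / (3 - 1)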
1.3 Regression trees

Apply tree-based models to problems with numeric targets. Three elements are necessary:
• A way to select a split at every non-terminal node.
• A rule for determining when a node is terminal.
• A rule for assigning a predicted value ŷ(t) to every terminal node t.

Prediction rule
In leaf nodes, we predict the average target value of all cases falling into that node:

ŷ(t) = ȳ(t) = (1/N(t)) Σ_{i∈t} y_i

where N(t) is the number of cases falling into node t. This is the value of c that minimizes the residual sum of squares (RSS):

RSS(t) = Σ_{i∈t} (y_i − c)²

Splitting rule
The mean squared error (MSE) of a tree T is given by:

R(T) = (1/N) Σ_{t∈T̃} Σ_{i∈t} (y_i − ȳ(t))²

where N is the size of the learning sample. The contribution of node t to the MSE of T is:

R(t) = (1/N) Σ_{i∈t} (y_i − ȳ(t))²

so we can write R(T) = Σ_{t∈T̃} R(t). The best split s* of t is the split that most decreases R(T). The decrease in R(T) achieved by a split s in node t is given by:

ΔR(s, t) = R(t) − R(l) − R(r)
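A minimal sketch of the split search for a single numeric attribute; the 1/N factor is dropped because it does not change which split is best (names and data are illustrative):

def node_error(y):
    # Within-node sum of squares around the node mean (R(t) up to the constant 1/N factor).
    if not y:
        return 0.0
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)

def best_split(x, y):
    # Try a threshold between each pair of consecutive distinct x-values.
    pairs = sorted(zip(x, y))
    best_threshold, best_gain = None, float("-inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue                                  # same x-value: no split point here
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [v for u, v in pairs if u <= threshold]
        right = [v for u, v in pairs if u > threshold]
        gain = node_error(y) - node_error(left) - node_error(right)   # ~ Delta R(s, t)
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain

x = [1, 2, 3, 10, 11, 12]
y = [1.0, 1.2, 0.9, 5.0, 5.2, 4.9]
print(best_split(x, y))   # chooses the threshold around x = 6.5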
2 Tree ensembles

Reducible and irreducible error

E_{P(Y|x)}[(Y − f̂(x))²] = (f(x) − f̂(x))² + E_{P(Y|x)}[(Y − f(x))²]

The first term is the reducible error, the second the irreducible error.

Bias-variance decomposition

E_D[(f(x) − f̂(x))²] = (f(x) − E_D[f̂(x)])² + E_D[(f̂(x) − E_D[f̂(x)])²]

The first term is the squared bias, the second the variance: MSE = bias² + variance.

2.1 Reducing variance by bagging

Bagging is short for Bootstrap Aggregating.

Training:
1. Draw a sample with replacement from the training set (it should be of the same size as the training set).
2. Grow a tree on this bootstrap sample (pruning not necessary).
3. Repeat these steps M times to create M different trees.

Prediction:
1. Predict on the test sample using each of the M trees in turn.
2. Take the majority vote of the M predictions as the final prediction.

For regression problems, generate M bootstrap samples and combine the predictions by averaging:

f̂_BAG(x) = (1/M) Σ_{m=1}^{M} f̂_m(x)

where f̂_m(x) is the prediction of the model trained on the m-th bootstrap sample. Note that we never actually average models; we average their predictions.

The average error made by the models acting individually is:

E_AV = (1/M) Σ_{m=1}^{M} E_{P(x)}[e_m(x)²]

If we assume that the errors have zero mean and are uncorrelated, we obtain:

E_BAG = (1/M²) Σ_{m=1}^{M} E_{P(x)}[e_m(x)²] = (1/M) E_AV

Random forests
Attempt to de-correlate the predictions of the individual trees, such that E_{P(x)}[e_m(x) e_n(x)] is closer to zero. Same as bagging, but with random subsets of features.

Boosting
Same as bagging, except that the trees are grown sequentially: each tree is grown using information from previously grown trees. It does so by fitting small trees to the residuals, such that f̂ is slowly improved in the areas where it does not perform well.

Parameters:
• Number of trees B. Cross-validation is used to select it.
• Shrinkage parameter λ. Controls the learning rate; typical value 0.01 or 0.001. A small λ requires a big B to achieve good performance.
• Number of splits d. Controls the complexity of the boosted ensemble. Often d = 1 works well.
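A minimal sketch of the bagging steps above, assuming NumPy and scikit-learn are available; the dataset is synthetic and the settings are arbitrary:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)            # toy binary target

M = 25
trees = []
for _ in range(M):
    idx = rng.integers(0, len(X), size=len(X))     # 1. bootstrap sample (with replacement)
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])   # 2. grow an unpruned tree on it
    trees.append(tree)                             # 3. repeat M times

x_new = rng.normal(size=(1, 5))
votes = np.array([t.predict(x_new)[0] for t in trees])
print(np.bincount(votes).argmax())                 # majority vote over the M trees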
3 Text mining

Rules of probability
• Sum rule: P(X) = Σ_Y P(X, Y)
• Product rule: P(X, Y) = P(X) P(Y|X)
• If X and Y are independent, then P(X, Y) = P(X) P(Y)

3.1 Naive Bayes

Naive Bayes assumption:

P(x1, ..., xm | c) = P(x1|c) ··· P(xm|c)

Features are assumed to be independent within each class. For binary features, we only have to estimate m probabilities per class; the curse of dimensionality is avoided.

Multinomial Naive Bayes
Represent document d as a sequence of words d = ⟨w1, w2, ..., wn⟩.

Class priors:

P̂(c) = N_c / N_doc

Classification rule:

c_NB = argmax_{c∈C} [ log P̂(c) + Σ_{k=1}^{n} log P̂(w_k | c) ]

Predicting word w_i given class:

P̂(w_i | c) = count(w_i, c) / Σ_{w_j∈V} count(w_j, c)

Predicting class c given word(s):

P̂(c | X_i) = P̂(c, X_i) / P̂(X_i),  with  P̂(c, X_i) = P̂(c) Π_{w∈X_i} P̂(w | c)  and  P̂(X_i) = Σ_{c∈C} P̂(c, X_i)

Laplace smoothing
Add 1 to the counts count(w_i, c) to avoid zero probability estimates:

P̂(w_i | c) = (count(w_i, c) + 1) / (Σ_{w_j∈V} count(w_j, c) + |V|)

Multinomial Naive Bayes makes two assumptions:
• Word occurrences are independent within each class.
• Positional independence.

NB works because, although the probability estimates may be way off, the order is still correct.
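A minimal sketch of multinomial Naive Bayes with Laplace smoothing; the toy corpus and names are illustrative:

import math
from collections import Counter, defaultdict

docs = [(["cheap", "pills", "now"], "spam"),
        (["meeting", "tomorrow", "now"], "ham"),
        (["cheap", "meeting"], "ham")]

class_counts = Counter(c for _, c in docs)          # N_c
word_counts = defaultdict(Counter)                  # count(w, c)
for words, c in docs:
    word_counts[c].update(words)
vocab = {w for words, _ in docs for w in words}     # V

def log_p_word(w, c):
    # Laplace-smoothed estimate of P(w | c).
    return math.log((word_counts[c][w] + 1) / (sum(word_counts[c].values()) + len(vocab)))

def classify(words):
    # c_NB = argmax_c [ log P(c) + sum_k log P(w_k | c) ]
    def score(c):
        return math.log(class_counts[c] / len(docs)) + sum(log_p_word(w, c) for w in words)
    return max(class_counts, key=score)

print(classify(["cheap", "pills"]))                 # -> "spam"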
3.2 Feature selection

Entropy
Entropy is the average amount of information generated by observing the value of a random variable. It can also be interpreted as a measure of the uncertainty about the value of X prior to the observation.

Conditional entropy:

H(X|Y) = Σ_{x,y} P(x, y) log2 (1 / P(x|y)) = −Σ_{x,y} P(x, y) log2 P(x|y)

If X and Y are independent, then H(X) = H(X|Y).

Mutual information
For random variables X and Y, their mutual information is given by:

I(X; Y) = H(X) − H(X|Y) = Σ_x Σ_y P(x, y) log2 [ P(x, y) / (P(x) P(y)) ]

Mutual information measures the reduction in uncertainty about X achieved by observing the value of Y.

To estimate I(X; Y) from data:

Î(X; Y) = Σ_x Σ_y P̂(x, y) log2 [ P̂(x, y) / (P̂(x) P̂(y)) ]

where P̂(x, y) = n(x, y)/N and P̂(x) = n(x)/N.
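A minimal sketch of the plug-in estimate of I(X; Y) from a table of counts n(x, y); the counts are toy values:

import math

counts = {(0, 0): 30, (0, 1): 10, (1, 0): 10, (1, 1): 50}   # toy counts n(x, y)
N = sum(counts.values())

def marginal(axis, value):
    # Marginal probability estimate, e.g. P_hat(x) = n(x) / N for axis 0.
    return sum(n for cell, n in counts.items() if cell[axis] == value) / N

mi = 0.0
for (x, y), n in counts.items():
    if n == 0:
        continue                                    # 0 * log 0 contributes nothing
    p_xy = n / N
    mi += p_xy * math.log2(p_xy / (marginal(0, x) * marginal(1, y)))
print(mi)                                           # estimated I(X; Y) in bits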
3.3 Logistic Regression

Generative models = classify based on the probability distribution of X generated by the model (Naive Bayes, Linear/Quadratic Discriminant Analysis). Discriminative models = classify based on the conditional distribution of Y given X; the probability distribution of X is not modeled.

Logistic response function:

E[Y|x] = P(Y = 1|x) = e^(β⊤x) / (1 + e^(β⊤x)) = 1 / (1 + e^(−β⊤x))

Maximum likelihood estimation
The coefficients are estimated by maximizing the likelihood of the observed labels, where each observation contributes

P(y_i) = p_i^(y_i) (1 − p_i)^(1−y_i),  y_i ∈ {0, 1},  i = 1, ..., n

with p_i = P(Y = 1 | x_i).

Fitted response function:

P̂(Y = 1|x) = e^(β̂⊤x) / (1 + e^(β̂⊤x))
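A minimal sketch of the logistic response function and the Bernoulli log-likelihood that maximum likelihood estimation maximizes; the data and coefficients are toy values:

import math

def p_hat(beta, x):
    # P(Y = 1 | x) = 1 / (1 + exp(-beta^T x))
    z = sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(beta, X, y):
    # sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ], the quantity ML estimation maximizes.
    total = 0.0
    for xi, yi in zip(X, y):
        p = p_hat(beta, xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

X = [(1.0, 0.5), (1.0, -1.0), (1.0, 2.0)]           # first component acts as the intercept
y = [1, 0, 1]
print(p_hat((0.2, 1.5), X[0]), log_likelihood((0.2, 1.5), X, y))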
4 Undirected graphical models

Saturated model
The saturated model estimates cell probabilities by dividing each cell count by the total number of observations:

P̂(G, E) = n(G, E) / n

For the saturated model the fitted counts n̂(G, E) = n P̂(G, E) = n(G, E) are the same as the observed counts. With k variables this requires m^k possibilities (curse of dimensionality).

4.1 Independence model
Avoid the curse by assuming independence:

P̂(G, E) = P̂(G) P̂(E) = (n(G)/n)(n(E)/n) = n(G) n(E) / n²

Fitted counts:

n̂(G, E) = n P̂(G, E) = n(G) n(E) / n

Factorisation criterion for independence:

log P(x, y) = g*(x) + h*(y)

Conditional independence
X and Y are conditionally independent given Z iff:

P(x, y | z) = P(x | z) P(y | z)

Also written as X ⊥ Y | Z. Factorisation criterion for conditional independence:

log P(x, y, z) = g*(x, z) + h*(y, z)

4.2 Conditional Independence Graph
The conditional independence graph of X is the undirected graph G = (K, E) where {i, j} is not in the edge set iff:

X_i ⊥ X_j | rest

4.3 2×2 table
Probability function:

log P(x1, x2) = log p(0,0) + log( p(1,0)/p(0,0) ) x1 + log( p(0,1)/p(0,0) ) x2 + log( p(1,1)p(0,0) / (p(0,1)p(1,0)) ) x1 x2

Reparameterizing the right-hand side leads to the so-called log-linear expansion:

log P(x1, x2) = u_∅ + u_1 x1 + u_2 x2 + u_12 x1 x2

4.4 Cross-product ratio
The cross-product ratio between binary variables X1 and X2 is:

cpr(X1, X2) = p(1,1) p(0,0) / (p(0,1) p(1,0))

• cpr(X1, X2) > 1: positive association
• cpr(X1, X2) < 1: negative association
• cpr(X1, X2) = 1: no association

4.5 Joint distribution of the three-dimensional Bernoulli
Probability distribution:

P(x1, x2, x3) = p(0,0,0)^((1−x1)(1−x2)(1−x3)) ··· p(1,1,1)^(x1 x2 x3)

Log-linear expansion:

log P(x1, x2, x3) = u_∅ + u_1 x1 + u_2 x2 + u_3 x3 + u_12 x1 x2 + u_13 x1 x3 + u_23 x2 x3 + u_123 x1 x2 x3

with:

u_123 = log( cpr(X2, X3 | X1 = 1) / cpr(X2, X3 | X1 = 0) )
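A minimal sketch contrasting the saturated and independence-model fitted counts for a toy 2×2 table, together with the cross-product ratio:

counts = {(0, 0): 40, (0, 1): 10, (1, 0): 20, (1, 1): 30}   # observed n(x1, x2)
n = sum(counts.values())

def margin1(v):                                    # n(x1 = v)
    return counts[(v, 0)] + counts[(v, 1)]

def margin2(v):                                    # n(x2 = v)
    return counts[(0, v)] + counts[(1, v)]

fitted_saturated = dict(counts)                    # saturated model: fitted = observed
fitted_independence = {(a, b): margin1(a) * margin2(b) / n   # n_hat = n(x1) n(x2) / n
                       for a in (0, 1) for b in (0, 1)}

cpr = (counts[(1, 1)] * counts[(0, 0)]) / (counts[(0, 1)] * counts[(1, 0)])
print(fitted_independence, cpr)                    # cpr = 6 > 1: positive association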
Why log-linear representation?
• We are interested in expressing conditional independence constraints.
• There is a straightforward correspondence between such constraints being satisfied and the elimination of u-terms.
• This correspondence is established by applying the factorisation criterion: X ⊥ Y | Z iff there exist functions g and h such that log P(x, y, z) = g(x, z) + h(y, z).

4.6 Log-linear expansion
Let X = (X1, X2, ..., Xk) be a vector of discrete random variables, and let K = {1, 2, ..., k} denote the set of indices (coordinates) of X. The log-linear expansion of the probability distribution P_K(X) is:

log P_K(x) = Σ_{a⊆K} u_a(x_a)

where the sum is taken over all possible subsets a of K = {1, 2, ..., k}.
• We get a (set of) u-term(s) for each subset of the variables.
• We code the values of X_i as {0, 1, ..., d_i − 1}, where d_i is the size of the domain of X_i.
• Set u_a(x_a) = 0 whenever x_i = 0 for any X_i with i ∈ a, to eliminate redundant parameters. This is analogous to the case where X is binary.
• There are as many u-terms in the full log-linear expansion as there are cells in the contingency table.

Independence and u-terms
Proof: if for all s ⊆ K with u_s(x_s) ≠ 0 we have s ⊆ a ∪ b or s ⊆ a ∪ c, then:

log P_K(x) = Σ_{s⊆a∪b} u_s(x_s) + Σ_{s⊆a∪c} u_s(x_s) − Σ_{s⊆a} u_s(x_s)

Hierarchical models
In most applications it does not make sense to include u_123 unless u_12, u_13, and u_23 are also present. A log-linear model is hierarchical if the presence of a term implies that all lower-order terms are also present. Hence, a hierarchical model is uniquely identified by listing its highest-order interaction terms.

Graphical models
Given its independence graph G = (K, E), the log-linear model for the random vector X is a graphical model for X if the distribution of X is arbitrary apart from constraints of the form that, for all pairs of coordinates not in the edge set E, the u-terms containing the selected coordinates are equal to zero.

Maximum likelihood
The ML estimator of a graphical log-linear model M satisfies the likelihood equations:

n̂_a^M = N P̂_a^M = n_a

whenever the subset of vertices a forms a clique in the graph. Observed = fitted for every marginal table corresponding to a complete subgraph.

Decomposable models
Decomposable graphical models have explicit formulas for the maximum likelihood estimates. They have triangulated independence graphs: they have no chordless cycles of length greater than three. A cycle is chordless if only the successive pairs of vertices in the cycle are connected by an edge (there is no "shortcut" in the cycle).

Let C1, ..., Cm be a RIP ordering of the cliques, with separator sets S1, ..., Sm. Maximum likelihood fitted counts:

n̂(x) = Π_{j=1}^{m} n(x_{Cj}) / Π_{j=2}^{m} n(x_{Sj})

Maximum likelihood fitted probabilities:

P̂(x) = (1/N) Π_{j=1}^{m} n(x_{Cj}) / Π_{j=2}^{m} n(x_{Sj})

Iterative Proportional Fitting (IPF)
If the model is not decomposable, there is no closed-form solution for the ML estimates. IPF is an iterative algorithm to find the ML estimates:
1. Fit row to margin: n̂(x1, x2)^(1) = n_1(x1) · P̂(x2 | x1)^(0) = n_1(x1) · n̂(x1, x2)^(0) / n̂_1(x1)^(0)
2. Fit column to margin: n̂(x1, x2)^(2) = n_2(x2) · P̂(x1 | x2)^(1) = n_2(x2) · n̂(x1, x2)^(1) / n̂_2(x2)^(1)
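A minimal sketch of one IPF sweep (fit rows, then columns) on a toy 2×2 table, following the two update steps above; the starting table is arbitrary:

observed = {(0, 0): 40.0, (0, 1): 10.0, (1, 0): 20.0, (1, 1): 30.0}
fitted = {cell: 25.0 for cell in observed}          # n_hat^(0): any positive starting table

def margin(table, axis, value):
    return sum(v for cell, v in table.items() if cell[axis] == value)

for axis in (0, 1):                                 # axis 0: fit rows, axis 1: fit columns
    fitted = {cell: v * margin(observed, axis, cell[axis]) / margin(fitted, axis, cell[axis])
              for cell, v in fitted.items()}

print(fitted)    # for this start, one sweep already reproduces both observed margins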
4.7 Model testing

Log-likelihood
The likelihood score of model M is the probability of the observed data under the fitted cell probabilities according to model M. The log-likelihood score of model M is:

L_M = Σ_x n(x) log P̂^M(x)

The log-likelihood score of the saturated model is:

L_sat = Σ_x n(x) log( n(x) / N )

Model deviance
The deviance of model M is:

dev(M) = 2 Σ_{cells} observed · log( observed / fitted )

For large N, the deviance difference between M0 and M1 satisfies:

2(L_M1 − L_M0) ≈ χ²_v   under M0

We reject the null hypothesis that M0 is the true model when:

2(L_M1 − L_M0) > χ²_{v;α}

the critical value of the chi-square distribution with v degrees of freedom at significance level α, where v is the number of additional restrictions (zero u-terms) of M0 compared to M1.

4.8 Model selection

Two components: the lack of fit of the model and the complexity of the model.

AIC(M) = dev(M) + 2 dim(M)
BIC(M) = dev(M) + log(N) dim(M)

where dim(M) is the number of parameters (number of u-terms) of the model.

Search
Exhaustive search is usually not feasible. A straightforward approach is lo- [the preview ends here].
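A minimal sketch of the deviance, AIC, and BIC of the independence model for a toy 2×2 table, taking dim(M) = 3 for its u-terms u_∅, u_1, u_2:

import math

observed = {(0, 0): 40, (0, 1): 10, (1, 0): 20, (1, 1): 30}
N = sum(observed.values())

def margin(axis, value):
    return sum(n for cell, n in observed.items() if cell[axis] == value)

# Independence-model fitted counts n(x1) n(x2) / N.
fitted = {cell: margin(0, cell[0]) * margin(1, cell[1]) / N for cell in observed}

dev = 2 * sum(n * math.log(n / fitted[cell]) for cell, n in observed.items())
dim_m = 3                                   # u-terms of the independence model: u_0, u_1, u_2
print(dev, dev + 2 * dim_m, dev + math.log(N) * dim_m)   # deviance, AIC, BIC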