Summary

Summary Cheat Sheet including all equations and examples 2023/2024

Pages: 2
Uploaded on: 16-10-2023
Written in: 2023/2024

Maximize your chances of acing the exam with this meticulously crafted cheat sheet. Designed specifically for exam success, it distills complex data mining concepts into easy-to-understand formulas and techniques, including the XAI lecture. Tailored to fit exam guidelines, this sheet is your allowed companion during the test, ensuring you have all the vital information at your fingertips. Whether you're grappling with decision trees, entropy calculations, or Naïve Bayes, this cheat sheet is your roadmap to exam excellence. (Got a 9.5 using this cheat sheet)

Preview of the content

Week 1

Missing Values: Remove a feature if the majority of its instances are missing. Removing instances is not ideal with limited data. Replacing missing values with the mean, median, or mode introduces noise. Autoencoders, unsupervised neural nets with encoder and decoder components, can also be used.
Normalization: transform values to $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$.

Standardization: shift values by the mean and scale by the standard deviation: $z = \frac{x - \mu}{\sigma}$.
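As a quick illustration of the two rescaling formulas above, here is a minimal NumPy sketch; the feature values are made up for the example.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])  # hypothetical feature values

# Min-max normalization: x' = (x - min) / (max - min), result lies in [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): z = (x - mean) / std
z = (x - x.mean()) / x.std()

print(x_norm)  # [0.   0.25 0.5  1.  ]
print(z)       # zero mean, unit variance
```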
Pearson's Correlation:

$R = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$
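A small sketch checking the correlation formula against NumPy's built-in coefficient; the paired values are invented for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])  # hypothetical paired measurements

# Pearson's R from the formula above
num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))
r_manual = num / den

# Should match NumPy's correlation coefficient
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)  # both ≈ 0.85
```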
Chi-squared:

$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$, where $E_{ij} = \frac{\text{row}_i \times \text{col}_j}{\text{total}}$.

Example: Consider the contingency table:

            Dogs   Cats   Rabbits
Shelter A     5      3       2
Shelter B     6      8       4

To calculate χ²: 1. Compute the row, column and total sums. 2. Calculate the expected frequencies $E_{ij} = \frac{\text{row}_i \times \text{col}_j}{\text{total}}$. 3. Plug them into the chi-squared formula.

Using the formula, for dogs at Shelter A:

$E_{11} = \frac{(5 + 3 + 2) \times (5 + 6)}{28} = \frac{10 \times 11}{28} \approx 3.93$

Continue this process for all cells, then compute $\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$ for every cell and sum the results to get the final statistic.
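A short sketch that reproduces the shelter example; it uses SciPy's chi2_contingency, which returns the expected frequencies and the statistic in one call.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the contingency table above
observed = np.array([[5, 3, 2],    # Shelter A
                     [6, 8, 4]])   # Shelter B

chi2, p_value, dof, expected = chi2_contingency(observed)

print(expected[0, 0])  # E11 ≈ 3.93 (dogs at Shelter A)
print(chi2, dof)       # chi-squared statistic and degrees of freedom (2)
```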
Encoding: Label Encoding for ordinal data, e.g., education level; One-hot Encoding for nominal data, e.g., cat/dog/rabbit. Undersampling of large datasets can cause data loss and lower accuracy; oversampling can lead to overfitting with small datasets.
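An illustrative sketch of both encodings with pandas; the column names and category order are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "education": ["high school", "bachelor", "master", "bachelor"],  # ordinal
    "animal": ["cat", "dog", "rabbit", "cat"],                       # nominal
})

# Label encoding: map ordered categories to integers
order = {"high school": 0, "bachelor": 1, "master": 2}
df["education_enc"] = df["education"].map(order)

# One-hot encoding: one binary column per category
df = pd.get_dummies(df, columns=["animal"])

print(df)
```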
Week 2: Pattern classification

Rule-based Learning: Classify with rules derived from features and values. Decision trees are a popular example. The data is split on the attribute with the highest information gain, i.e., the largest reduction in entropy.

Entropy:

$\text{Entropy}(P) = -\sum_i p_i \log p_i$

Weighted Entropy:

$\text{info}(x) = \sum_i \frac{x_i}{n_{\text{obs}}} \times \text{entropy}_i$

Information Gain:

$\text{gain}(f_i) = \text{info}(\text{root}) - \text{info}(f_i)$

Example: Consider a dataset of sunburned people: 3 out of 8 get sunburned (Entropy = 0.9544). For hair colour, blond covers 4 people with a split of 2 sunburned, 2 not (Entropy = 1); brown covers 3 people, none sunburned (Entropy = 0); red-haired is 1 person, not sunburned (Entropy = 0). Compute the Information Gain for the attribute "Hair" using:

$IG(\text{Hair}) = 0.9544 - \left(\frac{4}{8} \times 1 + \frac{3}{8} \times 0 + \frac{1}{8} \times 0\right)$

Solution: IG(Hair) = 0.4544
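A small sketch that reproduces the sunburn numbers; the class counts are hard-coded from the example above.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

# Root: 3 sunburned out of 8
root = entropy([3, 5])                      # ≈ 0.9544

# Splits for "Hair": blond (2 of 4 sunburned), brown (0 of 3), red (0 of 1)
weighted = (4/8) * entropy([2, 2]) + (3/8) * entropy([0, 3]) + (1/8) * entropy([0, 1])

print(root - weighted)                      # IG(Hair) ≈ 0.4544
```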
Bayesian Learning: Classify based on feature frequencies, assuming independence.

Naive Bayes: Naïve Bayes does not impose any restrictions on the number of decision classes to be produced, although the example in the lecture referred to a binary classification problem. The remaining options hold (this supervised learning method assumes that features have the same importance and are independent).

$P(H|E) = \frac{P(H) \times P(E|H)}{P(E)}$

Example: $P(E_1|H)$ might be $P(\text{animal}=\text{cat} \mid \text{pet}=\text{yes})$. Given a dataset with the following entries for playing outside based on weather outlook and temperature:

Outlook    Temperature   Play
Sunny      Hot           Yes
Overcast   Mild          Yes
Rainy      Cool          No
Sunny      Mild          Yes
Sunny      Cool          Yes
Rainy      Mild          No

From the data, we derive the probabilities: P(Play=Yes) = 4/6, P(Play=No) = 2/6, P(Outlook=Sunny|Play=Yes) = 3/4, P(Temperature=Cool|Play=Yes) = 1/4.

Using the Naïve Bayes formula, the likelihood that someone will play outside when the outlook is sunny and the temperature is cool is P(Play=Yes|Outlook=Sunny, Temperature=Cool) ∝ P(Outlook=Sunny|Play=Yes) × P(Temperature=Cool|Play=Yes) × P(Play=Yes) = 3/4 × 1/4 × 4/6 = 1/8. Then do the same for P(Play=No|Outlook=Sunny, Temperature=Cool) and normalize to get a result between 0 and 1.
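A sketch of the same calculation in plain Python, including the normalization step; the priors and likelihoods are counted directly from the six-row table above.

```python
# Counts from the play-outside table above
# Yes rows: (Sunny,Hot) (Overcast,Mild) (Sunny,Mild) (Sunny,Cool) -> 4
# No rows:  (Rainy,Cool) (Rainy,Mild)                             -> 2

p_yes, p_no = 4/6, 2/6
p_sunny_given_yes, p_cool_given_yes = 3/4, 1/4
p_sunny_given_no, p_cool_given_no = 0/2, 1/2

# Unnormalized posteriors for the evidence (Outlook=Sunny, Temperature=Cool)
score_yes = p_sunny_given_yes * p_cool_given_yes * p_yes   # = 1/8
score_no = p_sunny_given_no * p_cool_given_no * p_no       # = 0
# (in practice, Laplace smoothing would avoid the zero probability)

# Normalize so the two posteriors sum to 1
total = score_yes + score_no
print(score_yes / total, score_no / total)                 # 1.0, 0.0
```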
Lazy Learning: The similarity function is crucial. KNN uses the K nearest neighbours and a distance function. It is sensitive to outliers and to the choice of distance function.

Distance Functions:

$\text{Euclidean}(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$, $\text{Manhattan}(x, y) = \sum_{i=1}^{d} |x_i - y_i|$

$\text{Hamming}(x, y) = \sum_{i=1}^{d} I(x_i \neq y_i)$, $\text{Minkowski}(x, y, p) = \left(\sum_{i=1}^{d} |x_i - y_i|^p\right)^{1/p}$
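The four distances as straightforward NumPy expressions; the two vectors are chosen arbitrarily for the demo.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))          # sqrt(9 + 4 + 0) ≈ 3.61
manhattan = np.sum(np.abs(x - y))                   # 3 + 2 + 0 = 5
hamming   = np.sum(x != y)                          # 2 positions differ
minkowski = np.sum(np.abs(x - y) ** 3) ** (1 / 3)   # p = 3

print(euclidean, manhattan, hamming, minkowski)
```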
Random Forest: Aggregates the outputs of many decision trees through a majority vote; more trees may overfit. It uses bagging. Shallow decision trees might lead to overly simple models unable to fit the data. A model that underfits has high training and high test errors; hence, poor performance on both training and test sets indicates underfitting, meaning the hypotheses are not complex enough to include the true but unknown prediction function. The shallower the tree, the less variance we have in our predictions; however, at some point we start to inject too much bias, as shallow trees (e.g., stumps) cannot capture interactions and complex patterns present in the data.

Bagging generates additional training sets from the dataset using random sampling with replacement; every element is equally likely to be selected. These datasets are used to train multiple models in parallel. The average of the predictions from the different ensemble models is calculated, and the decision class resulting from a majority vote is assigned to the instance. Overall, bagging decreases the variance.

Boosting: Prioritize misclassified instances by adjusting their weights and retrain the model until a criterion is met. Boosting is a sequential ensemble method that iteratively adjusts the weight of each observation according to the last classification: if an instance is incorrectly classified, its weight increases. The term 'boosting' refers to methods that convert a weak learner into a stronger one. It usually decreases the bias error and builds strong predictive models.
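A compact scikit-learn sketch contrasting the two ensemble styles on a synthetic dataset; the hyperparameter values are arbitrary examples, not the lecture's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging-style ensemble: trees trained in parallel on bootstrap samples, majority vote
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Boosting: sequential ensemble that re-weights misclassified instances
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print(rf.score(X_te, y_te), ada.score(X_te, y_te))
```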
Week 3: Evaluation and model selection

Generalization Capability: the model's performance on unseen data. Hyperparameters regulate model construction. Performance measures assess a model's predictive capabilities.

Data Splitting: Hold-out splits the dataset into training, validation, and test sets. Risks: improper representation, unrealistically high results. Stratification matches the sample's class distribution to that of the entire population.

K-fold Cross-validation averages results over k iterations, using k−1 folds for training. Nested K-fold CV: an inner CV tunes the hyperparameters (validation) to obtain the best model for the outer CV.

Hyperparameter Tuning: Grid Search tries all combinations; Random Search tries a subset of combinations.
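A scikit-learn sketch of stratified k-fold evaluation wrapped around a grid search, in the spirit of nested CV; the kNN model and parameter grid are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Inner loop: grid search over k, evaluated with 5-fold CV
inner = GridSearchCV(KNeighborsClassifier(),
                     param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
                     cv=StratifiedKFold(n_splits=5))

# Outer loop: estimate generalization of the tuned model (nested CV)
outer_scores = cross_val_score(inner, X, y, cv=StratifiedKFold(n_splits=5))
print(outer_scores.mean())
```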
Confusion Matrix:

           Predict Cat   Predict Dog
True Cat        TP            FN
True Dog        FP            TN

$\text{Accuracy} = \frac{TP + TN}{\text{Total}}$, $\text{Recall} = \frac{TP}{TP + FN}$

$\text{Precision} = \frac{TP}{TP + FP}$, $F_\beta = \frac{(1 + \beta^2) \times \text{precision} \times \text{recall}}{\beta^2 \times \text{precision} + \text{recall}}$
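A tiny sketch computing the four measures from hand-picked confusion-matrix counts; the counts are invented.

```python
# Hypothetical confusion-matrix counts (cat = positive class)
TP, FN, FP, TN = 40, 10, 5, 45
total = TP + FN + FP + TN

accuracy = (TP + TN) / total            # 0.85
recall = TP / (TP + FN)                 # 0.80
precision = TP / (TP + FP)              # ≈ 0.89
beta = 1                                # F1: precision and recall weighted equally
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(accuracy, recall, precision, f_beta)
```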
Bias-Variance Trade-off: Bias measures fit: low bias means the model fits the data well; high bias means the model is too simple for the data (underfitting). Variance measures consistency from training to test: high variance means the model fits the random noise in the training data (overfitting).

Overfitting Solutions: Pruning in trees simplifies the model by removing less impactful subtrees. A higher k in kNN reduces the risk of overfitting.