Summary

Summary Cheat Sheet including all equations and examples 2023/2024

Pages: 2
Uploaded on: 16-10-2023
Written in: 2023/2024

Maximize your chances of acing the exam with this meticulously crafted cheat sheet. Designed specifically for exam success, it distills complex data mining concepts into easy-to-understand formulas and techniques, including the XAI lecture. Tailored to fit exam guidelines, this sheet is your allowed companion during the test, ensuring you have all the vital information at your fingertips. Whether you're grappling with decision trees, entropy calculations, or Naïve Bayes, this cheat sheet is your roadmap to exam excellence. (Got a 9.5 using this cheat sheet)

Preview of the content

Week 1

Missing Values: Remove a feature if the majority of its instances are missing. Removing instances is not ideal with limited data. Replacing missing values with the mean, median, or mode introduces noise. Autoencoders, unsupervised neural nets with encoder and decoder components, can also be used.
Normalization: transform values to $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$.

Standardization: shift values by the mean and scale by the standard deviation: $z = \frac{x - \mu}{\sigma}$.
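As a quick illustration of the two rescaling formulas above, here is a minimal NumPy sketch; the feature values are made up for the example.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])  # hypothetical feature values

# Min-max normalization: x' = (x - min) / (max - min), result lies in [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): z = (x - mean) / std
z = (x - x.mean()) / x.std()

print(x_norm)  # [0.   0.25 0.5  1.  ]
print(z)       # zero mean, unit variance
```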
Pearson's Correlation:

$R = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$
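A small sketch checking the correlation formula against NumPy's built-in coefficient; the paired values are invented for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])  # hypothetical paired measurements

# Pearson's R from the formula above
num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))
r_manual = num / den

# Should match NumPy's correlation coefficient
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)  # both ≈ 0.85
```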
Chi-squared:

$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$, where $E_{ij} = \frac{\text{row}_i \times \text{col}_j}{\text{total}}$.

Example: Consider the contingency table:

            Dogs   Cats   Rabbits
Shelter A     5      3       2
Shelter B     6      8       4

To calculate χ²: 1. Compute the row, column and total sums. 2. Calculate the expected frequencies $E_{ij} = \frac{\text{row}_i \times \text{col}_j}{\text{total}}$. 3. Plug them into the chi-squared formula.

Using the formula, for dogs at Shelter A:

$E_{11} = \frac{(5 + 3 + 2) \times (5 + 6)}{28} = \frac{10 \times 11}{28} \approx 3.93$

Continue this process for all cells, then compute $\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$ for every cell and sum the results to get the final statistic.
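A short sketch that reproduces the shelter example; it uses SciPy's chi2_contingency, which returns the expected frequencies and the statistic in one call.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the contingency table above
observed = np.array([[5, 3, 2],    # Shelter A
                     [6, 8, 4]])   # Shelter B

chi2, p_value, dof, expected = chi2_contingency(observed)

print(expected[0, 0])  # E11 ≈ 3.93 (dogs at Shelter A)
print(chi2, dof)       # chi-squared statistic and degrees of freedom (2)
```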
Encoding: Label Encoding for ordinal data, e.g., education level; One-hot Encoding for nominal data, e.g., cat/dog/rabbit. Undersampling of large datasets can cause data loss and lower accuracy; oversampling can lead to overfitting with small datasets.
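An illustrative sketch of both encodings with pandas; the column names and category order are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "education": ["high school", "bachelor", "master", "bachelor"],  # ordinal
    "animal": ["cat", "dog", "rabbit", "cat"],                       # nominal
})

# Label encoding: map ordered categories to integers
order = {"high school": 0, "bachelor": 1, "master": 2}
df["education_enc"] = df["education"].map(order)

# One-hot encoding: one binary column per category
df = pd.get_dummies(df, columns=["animal"])

print(df)
```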
Week 2: Pattern classification

Rule-based Learning: Classify with rules derived from features and values. Decision trees are a popular example. The data is split on the attribute with the highest information gain, i.e., the largest reduction in entropy.

Entropy:

$\text{Entropy}(P) = -\sum_i p_i \log p_i$

Weighted Entropy:

$\text{info}(x) = \sum_i \frac{x_i}{n_{\text{obs}}} \times \text{entropy}_i$

Information Gain:

$\text{gain}(f_i) = \text{info}(\text{root}) - \text{info}(f_i)$

Example: Consider a dataset of sunburned people: 3 out of 8 get sunburned (Entropy = 0.9544). For hair colour, blond covers 4 people with a split of 2 sunburned, 2 not (Entropy = 1); brown covers 3 people, none sunburned (Entropy = 0); red-haired is 1 person, not sunburned (Entropy = 0). Compute the Information Gain for the attribute "Hair" using:

$IG(\text{Hair}) = 0.9544 - \left(\frac{4}{8} \times 1 + \frac{3}{8} \times 0 + \frac{1}{8} \times 0\right)$

Solution: IG(Hair) = 0.4544
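A small sketch that reproduces the sunburn numbers; the class counts are hard-coded from the example above.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log2(p))

# Root: 3 sunburned out of 8
root = entropy([3, 5])                      # ≈ 0.9544

# Splits for "Hair": blond (2 of 4 sunburned), brown (0 of 3), red (0 of 1)
weighted = (4/8) * entropy([2, 2]) + (3/8) * entropy([0, 3]) + (1/8) * entropy([0, 1])

print(root - weighted)                      # IG(Hair) ≈ 0.4544
```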
Bayesian Learning: Classify based on feature frequencies, assuming independence.

Naive Bayes: Naïve Bayes does not impose any restrictions on the number of decision classes to be produced, although the example in the lecture referred to a binary classification problem. The remaining options hold (this supervised learning method assumes that features have the same importance and are independent).

$P(H|E) = \frac{P(H) \times P(E|H)}{P(E)}$

Example: $P(E_1|H)$ might be $P(\text{animal}=\text{cat} \mid \text{pet}=\text{yes})$. Given a dataset with the following entries for playing outside based on weather outlook and temperature:

Outlook    Temperature   Play
Sunny      Hot           Yes
Overcast   Mild          Yes
Rainy      Cool          No
Sunny      Mild          Yes
Sunny      Cool          Yes
Rainy      Mild          No

From the data, we derive the probabilities: P(Play=Yes) = 4/6, P(Play=No) = 2/6, P(Outlook=Sunny|Play=Yes) = 3/4, P(Temperature=Cool|Play=Yes) = 1/4.

Using the Naïve Bayes formula, the likelihood that someone will play outside when the outlook is sunny and the temperature is cool is P(Play=Yes|Outlook=Sunny, Temperature=Cool) ∝ P(Outlook=Sunny|Play=Yes) × P(Temperature=Cool|Play=Yes) × P(Play=Yes) = 3/4 × 1/4 × 4/6 = 1/8. Then do the same for P(Play=No|Outlook=Sunny, Temperature=Cool) and normalize to get a result between 0 and 1.
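A sketch of the same calculation in plain Python, including the normalization step; the priors and likelihoods are counted directly from the six-row table above.

```python
# Counts from the play-outside table above
# Yes rows: (Sunny,Hot) (Overcast,Mild) (Sunny,Mild) (Sunny,Cool) -> 4
# No rows:  (Rainy,Cool) (Rainy,Mild)                             -> 2

p_yes, p_no = 4/6, 2/6
p_sunny_given_yes, p_cool_given_yes = 3/4, 1/4
p_sunny_given_no, p_cool_given_no = 0/2, 1/2

# Unnormalized posteriors for the evidence (Outlook=Sunny, Temperature=Cool)
score_yes = p_sunny_given_yes * p_cool_given_yes * p_yes   # = 1/8
score_no = p_sunny_given_no * p_cool_given_no * p_no       # = 0
# (in practice, Laplace smoothing would avoid the zero probability)

# Normalize so the two posteriors sum to 1
total = score_yes + score_no
print(score_yes / total, score_no / total)                 # 1.0, 0.0
```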
Lazy Learning: The similarity function is crucial. KNN uses the K nearest neighbours and a distance function. It is sensitive to outliers and to the choice of distance function.

Distance Functions:

$\text{Euclidean}(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$, $\text{Manhattan}(x, y) = \sum_{i=1}^{d} |x_i - y_i|$

$\text{Hamming}(x, y) = \sum_{i=1}^{d} I(x_i \neq y_i)$, $\text{Minkowski}(x, y, p) = \left(\sum_{i=1}^{d} |x_i - y_i|^p\right)^{1/p}$
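The four distances as straightforward NumPy expressions; the two vectors are chosen arbitrarily for the demo.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))          # sqrt(9 + 4 + 0) ≈ 3.61
manhattan = np.sum(np.abs(x - y))                   # 3 + 2 + 0 = 5
hamming   = np.sum(x != y)                          # 2 positions differ
minkowski = np.sum(np.abs(x - y) ** 3) ** (1 / 3)   # p = 3

print(euclidean, manhattan, hamming, minkowski)
```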
Random Forest: Aggregates the outputs of many decision trees through a majority vote; more trees may overfit. It uses bagging. Shallow decision trees might lead to overly simple models unable to fit the data. A model that underfits has high training and high test errors; hence, poor performance on both training and test sets indicates underfitting, meaning the hypotheses are not complex enough to include the true but unknown prediction function. The shallower the tree, the less variance we have in our predictions; however, at some point we start to inject too much bias, as shallow trees (e.g., stumps) cannot capture interactions and complex patterns present in the data.

Bagging generates additional training sets from the dataset using random sampling with replacement; every element is equally likely to be selected. These datasets are used to train multiple models in parallel. The average of the predictions from the different ensemble models is calculated, and the decision class resulting from a majority vote is assigned to the instance. Overall, bagging decreases the variance.

Boosting: Prioritize misclassified instances by adjusting their weights and retrain the model until a criterion is met. Boosting is a sequential ensemble method that iteratively adjusts the weight of each observation according to the last classification: if an instance is incorrectly classified, its weight increases. The term 'boosting' refers to methods that convert a weak learner into a stronger one. It usually decreases the bias error and builds strong predictive models.
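A compact scikit-learn sketch contrasting the two ensemble styles on a synthetic dataset; the hyperparameter values are arbitrary examples, not the lecture's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging-style ensemble: trees trained in parallel on bootstrap samples, majority vote
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Boosting: sequential ensemble that re-weights misclassified instances
ada = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print(rf.score(X_te, y_te), ada.score(X_te, y_te))
```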
Week 3: Evaluation and model selection

Generalization Capability: the model's performance on unseen data. Hyperparameters regulate model construction. Performance measures assess a model's predictive capabilities.

Data Splitting: Hold-out splits the dataset into training, validation, and test sets. Risks: improper representation, unrealistically high results. Stratification matches the sample's class distribution to that of the entire population.

K-fold Cross-validation averages results over k iterations, using k−1 folds for training. Nested K-fold CV: an inner CV tunes the hyperparameters (validation) to obtain the best model for the outer CV.

Hyperparameter Tuning: Grid Search tries all combinations; Random Search tries a subset of combinations.
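A scikit-learn sketch of stratified k-fold evaluation wrapped around a grid search, in the spirit of nested CV; the kNN model and parameter grid are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Inner loop: grid search over k, evaluated with 5-fold CV
inner = GridSearchCV(KNeighborsClassifier(),
                     param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
                     cv=StratifiedKFold(n_splits=5))

# Outer loop: estimate generalization of the tuned model (nested CV)
outer_scores = cross_val_score(inner, X, y, cv=StratifiedKFold(n_splits=5))
print(outer_scores.mean())
```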
Confusion Matrix:

           Predict Cat   Predict Dog
True Cat        TP            FN
True Dog        FP            TN

$\text{Accuracy} = \frac{TP + TN}{\text{Total}}$, $\text{Recall} = \frac{TP}{TP + FN}$

$\text{Precision} = \frac{TP}{TP + FP}$, $F_\beta = \frac{(1 + \beta^2) \times \text{precision} \times \text{recall}}{\beta^2 \times \text{precision} + \text{recall}}$
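A tiny sketch computing the four measures from hand-picked confusion-matrix counts; the counts are invented.

```python
# Hypothetical confusion-matrix counts (cat = positive class)
TP, FN, FP, TN = 40, 10, 5, 45
total = TP + FN + FP + TN

accuracy = (TP + TN) / total            # 0.85
recall = TP / (TP + FN)                 # 0.80
precision = TP / (TP + FP)              # ≈ 0.89
beta = 1                                # F1: precision and recall weighted equally
f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

print(accuracy, recall, precision, f_beta)
```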
Bias-Variance Trade-off: Bias measures fit: low bias means the model fits the data well; high bias means the model is too simple for the data (underfitting). Variance measures consistency from training to test: high variance means the model fits the random noise in the training data (overfitting).

Overfitting Solutions: Pruning in trees simplifies the model by removing less impactful subtrees. A higher k in kNN reduces the risk of overfitting.