mode. Positively (R) skewed – mean > median > mode. Chebyshev: at least $1 - \frac{1}{k^2}$ of the values lie within $k$ standard deviations of the mean.
Label encoding: assign integer numbers to each category; it only makes sense if there is an ordinal relationship among the categories. One-hot encoding: encode nominal features that lack an ordinal relationship; increases the problem dimensionality. Class imbalance: oversampling; undersampling; SMOTE (might induce noise); see the sketch below.
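A minimal sketch of both preprocessing steps, assuming a DataFrame df with a hypothetical nominal column 'color' plus a feature matrix X and labels y; SMOTE comes from the third-party imbalanced-learn package:
import pandas as pd
from imblearn.over_sampling import SMOTE   # pip install imbalanced-learn
df = pd.get_dummies(df, columns=['color'])                 # one-hot encode a nominal feature
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)   # synthetic minority oversampling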
$\mathrm{Var} = E(x^2) - E(x)^2$. Pearson correlation: $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$. $\chi^2$ association measure: $\chi^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$, with $E_{ij} = \frac{p_i \times p_j}{k}$; $O_{ij}$: observed count of the two categories occurring together; $E_{ij}$: expected count.
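Both statistics are available in scipy; a sketch assuming numeric series x and y plus two hypothetical categorical columns in df:
import pandas as pd
from scipy.stats import pearsonr, chi2_contingency
r, p_value = pearsonr(x, y)                        # Pearson correlation and its p-value
table = pd.crosstab(df['sex'], df['risk'])         # observed counts O_ij
chi2, p, dof, expected = chi2_contingency(table)   # chi-square test of association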
Drop sparse columns (keep only columns with at least 90% non-NA values): df1.dropna(thresh=0.9*len(df), axis=1, inplace=True). Mean imputation: df['f'].fillna(mean_v, inplace=True). Normalization: from sklearn.preprocessing import MinMaxScaler; scaler = MinMaxScaler(); df['f'] = scaler.fit_transform(df[['f']]). Standardization: scaler = StandardScaler(); fit on the training data only, then reuse the fitted scaler: train_df_scaled = scaler.fit_transform(train_df); test_df_scaled = scaler.transform(test_df). Label encoding: encoder = LabelEncoder(); df['sex'] = encoder.fit_transform(df['sex']); likewise label_encoder = LabelEncoder(); encoded_data = label_encoder.fit_transform(Cancer_risk).
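To keep these steps leak-free inside cross-validation, one option (a sketch, not the notes' prescribed method; train_df, train_y, test_df are assumed names) is an sklearn Pipeline:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
pipe = Pipeline([('scale', StandardScaler()),      # scaler is fit on training folds only
                 ('knn', KNeighborsClassifier())])
pipe.fit(train_df, train_y)
preds = pipe.predict(test_df)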
§ Classification Algorithms Rule-based learning: Decision tree – internal node: test on an attribute; branch: outcome of the test; leaf/terminal node: class label; root node: topmost. Entropy: $H(P) = -\sum_i p_i \log_2 p_i$, a measure of disorder ($0 \rightarrow$ pure). Information value: weighted entropy. Info gain: $\mathrm{gain}(f_i) = \mathrm{info}(root) - \mathrm{info}(f_i)$; see the sketch below. Bayesian learning: assume features are independent. Bayes' theorem: $P(C_i \mid X) = \frac{P(X \mid C_i) \cdot P(C_i)}{P(X)}$. Naïve Bayes: $P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \cdot P(x_2 \mid C_i) \cdots P(x_n \mid C_i)$. Normalization: $P(C_1 \mid X)/(P(C_1 \mid X) + P(C_2 \mid X))$. The assumptions of independence and equal importance of features are rarely fulfilled.
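A small numpy sketch of the entropy and information-gain formulas above, assuming integer-encoded class labels:
import numpy as np
def entropy(labels):
    p = np.bincount(labels) / len(labels)   # class proportions p_i
    p = p[p > 0]                            # 0*log2(0) is taken as 0
    return -(p * np.log2(p)).sum()
def info_gain(parent, left, right):
    w = len(left) / len(parent)             # size-weighted child entropy
    return entropy(parent) - (w * entropy(left) + (1 - w) * entropy(right))
# entropy(np.array([0, 0, 1, 1])) -> 1.0 (maximally impure binary node)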
Lazy learning: similar instances should lead to the same decision classes. kNN: works well when the classes are clearly separated; use an odd k; sensitive to outliers, the number of neighbors and the distance function. Minkowski: p=1 Manhattan, p=2 Euclidean; Chebyshev: max difference; cosine similarity = cos(θ); cosine distance = 1 − cos(θ).
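In sklearn these choices map directly onto KNeighborsClassifier parameters; X_train, y_train, X_test are assumed names:
from sklearn.neighbors import KNeighborsClassifier
# metric='minkowski' with p=1 gives Manhattan, p=2 Euclidean; odd k avoids ties
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)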
Ensemble learning: Bagging – bootstrap aggregation, majority vote. Random Forest: build several decision trees, each using a random selection (with replacement) of features and instances. Boosting: after a classifier $M_i$ is learned, update the weights of the difficult instances for the next classifier $M_{i+1}$. Accuracy: (TP + TN)/all; Precision = TP/(TP + FP); Recall = TP/(TP + FN); $F_\beta = (1+\beta^2)pr/(\beta^2 p + r)$; Jaccard Index: IoU, overlap; see the sketch below.
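These metrics all exist in sklearn.metrics; a sketch for a binary task with assumed arrays y_true, y_pred:
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, fbeta_score, jaccard_score)
acc = accuracy_score(y_true, y_pred)       # (TP + TN) / all
prec = precision_score(y_true, y_pred)     # TP / (TP + FP)
rec = recall_score(y_true, y_pred)         # TP / (TP + FN)
f2 = fbeta_score(y_true, y_pred, beta=2)   # beta > 1 weights recall higher
iou = jaccard_score(y_true, y_pred)        # intersection over union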
§ Evaluation and Model Selection Out-of-sample evaluation. Optimizing hyperparameters: three disjoint sets: training, validation and test. Stratification: similar class distribution in every split. k-fold cross-validation: mutually exclusive, equal-size subsets. Nested k-fold CV: OIO. Hyperparameter tuning: random search (sketched below). Bias: predictions vs. ground truth; Variance: consistency of predictions; complexity ↑ ⇒ bias ↓, variance ↑. Decision tree pruning – prepruning: node → leaf; postpruning: branches → leaf. CV: cv_results = cross_validate(RandomForestClassifier(random_state=42), X, y, cv=5). Grid search: grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5).
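A sketch of random search, with the search object wrapped in an outer cross_validate call to approximate nested CV; the parameter grid is illustrative only:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_validate
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions={'n_estimators': [50, 100, 200],
                                                 'max_depth': [None, 5, 10]},
                            n_iter=5, cv=5)    # inner loop tunes hyperparameters
nested = cross_validate(search, X, y, cv=5)    # outer loop estimates performance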
§ XAI Interpretability: implicit capacity to explain its reasoning process. Explainability: provide a justification for the predictions. Transparency: algorithmic transparency, decomposability, and simulatability. Intrinsically interpretable models: linear regression, decision tree, k-nearest neighbors; parsimonious (less is more). Post-hoc explanation methods – Model-agnostic post-hoc: measure how changes in the inputs affect the model's outputs. 1 Partial dependence plots: the marginal effect of a feature on the model's prediction when fixing the feature values → average the class probabilities for a desired decision class; the plot allows inspecting whether the relation between the feature and the target variable is monotonic, linear, etc. 2 Permutation feature importance: compute the feature importance as the increase in the model error when permuting the values of the feature being analyzed (both sketched after this list). Drawback: assumes unrealistic independence between features. 3 Shapley values (SHAP): compute each feature's contribution; usable in both local and global contexts. Cons: computationally expensive. 4 Local surrogates (LIME): generates synthetic instances around small groups of instances. Cons: unstable. 5 Global surrogates: approximate the behavior of the complex model with a transparent model. Cons: describe the black-box model rather than the problem. 6 Counterfactual explanations: describe the smallest change to the feature values that produces a different desired output.
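Methods 1 and 2 are available in sklearn.inspection (the display class needs sklearn ≥ 1.0); a sketch assuming a fitted classifier model and data X, y:
from sklearn.inspection import permutation_importance, PartialDependenceDisplay
result = permutation_importance(model, X, y, n_repeats=10, random_state=42)
print(result.importances_mean)   # mean error increase when each feature is shuffled
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1])   # PDPs for two features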
Model-specific post-hoc: based on the representation structures of the black-box models. 1 Random Forests: compute the importance of each problem feature from their inner knowledge structures. Cons: feature importance based on impurity can be misleading when features have many unique values. 2 Fuzzy Cognitive Maps: recurrent neural networks in which neurons denote variables; feature importance is computed from the absolute values of the weights connected to each neuron in the network. Cons: doesn't consider the activation values of neurons. Evaluation and measures – Function level (number of rules of