mode. Positively (R) skewed – mean > median > mode. Chebyshev: at least $1 - \frac{1}{k^2}$ of the values lie within $k$ standard deviations of the mean.
Label encoding: assign integer numbers to each category; it only makes sense if there is an ordinal relationship among the categories. One-hot encoding: encode nominal features that lack an ordinal relationship; increases the problem dimensionality. Class imbalance: oversampling; undersampling; SMOTE (might induce noise); see the sketch below.
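A minimal sketch of both preprocessing steps, assuming a DataFrame df with a hypothetical nominal column 'color' plus a feature matrix X and labels y; SMOTE comes from the third-party imbalanced-learn package:
import pandas as pd
from imblearn.over_sampling import SMOTE   # pip install imbalanced-learn
df = pd.get_dummies(df, columns=['color'])                 # one-hot encode a nominal feature
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)   # synthetic minority oversampling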
$\mathrm{Var} = E(x^2) - E(x)^2$. Pearson correlation: $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$. $\chi^2$ association measure: $\chi^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$, with $E_{ij} = \frac{p_i \times p_j}{k}$; $O_{ij}$: observed count of the two categories occurring together; $E_{ij}$: expected count.
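Both statistics are available in scipy; a sketch assuming numeric series x and y plus two hypothetical categorical columns in df:
import pandas as pd
from scipy.stats import pearsonr, chi2_contingency
r, p_value = pearsonr(x, y)                        # Pearson correlation and its p-value
table = pd.crosstab(df['sex'], df['risk'])         # observed counts O_ij
chi2, p, dof, expected = chi2_contingency(table)   # chi-square test of association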
Drop sparse columns (keep only columns with at least 90% non-NA values): df1.dropna(thresh=0.9*len(df), axis=1, inplace=True). Mean imputation: df['f'].fillna(mean_v, inplace=True). Normalization: from sklearn.preprocessing import MinMaxScaler; scaler = MinMaxScaler(); df['f'] = scaler.fit_transform(df[['f']]). Standardization: scaler = StandardScaler(); fit on the training data only, then reuse the fitted scaler: train_df_scaled = scaler.fit_transform(train_df); test_df_scaled = scaler.transform(test_df). Label encoding: encoder = LabelEncoder(); df['sex'] = encoder.fit_transform(df['sex']); likewise label_encoder = LabelEncoder(); encoded_data = label_encoder.fit_transform(Cancer_risk).
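To keep these steps leak-free inside cross-validation, one option (a sketch, not the notes' prescribed method; train_df, train_y, test_df are assumed names) is an sklearn Pipeline:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
pipe = Pipeline([('scale', StandardScaler()),      # scaler is fit on training folds only
                 ('knn', KNeighborsClassifier())])
pipe.fit(train_df, train_y)
preds = pipe.predict(test_df)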
§ Classification Algorithms Rule-based learning: Decision tree – internal node: test on an attribute; branch: outcome of the test; leaf/terminal node: class label; root node: topmost. Entropy: $H(P) = -\sum_i p_i \log_2 p_i$, a measure of disorder ($0 \rightarrow$ pure). Information value: weighted entropy. Info gain: $\mathrm{gain}(f_i) = \mathrm{info}(root) - \mathrm{info}(f_i)$; see the sketch below. Bayesian learning: assume features are independent. Bayes' theorem: $P(C_i \mid X) = \frac{P(X \mid C_i) \cdot P(C_i)}{P(X)}$. Naïve Bayes: $P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \cdot P(x_2 \mid C_i) \cdots P(x_n \mid C_i)$. Normalization: $P(C_1 \mid X)/(P(C_1 \mid X) + P(C_2 \mid X))$. The assumptions of independence and equal importance of features are rarely fulfilled.
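A small numpy sketch of the entropy and information-gain formulas above, assuming integer-encoded class labels:
import numpy as np
def entropy(labels):
    p = np.bincount(labels) / len(labels)   # class proportions p_i
    p = p[p > 0]                            # 0*log2(0) is taken as 0
    return -(p * np.log2(p)).sum()
def info_gain(parent, left, right):
    w = len(left) / len(parent)             # size-weighted child entropy
    return entropy(parent) - (w * entropy(left) + (1 - w) * entropy(right))
# entropy(np.array([0, 0, 1, 1])) -> 1.0 (maximally impure binary node)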
Lazy learning: similar instances should lead to the same decision classes. kNN: works well when the classes are clearly separated; use an odd k; sensitive to outliers, the number of neighbors and the distance function. Minkowski: p=1 Manhattan, p=2 Euclidean; Chebyshev: max difference; cosine similarity = cos(θ); cosine distance = 1 − cos(θ).
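In sklearn these choices map directly onto KNeighborsClassifier parameters; X_train, y_train, X_test are assumed names:
from sklearn.neighbors import KNeighborsClassifier
# metric='minkowski' with p=1 gives Manhattan, p=2 Euclidean; odd k avoids ties
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)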
Ensemble learning: Bagging – bootstrap aggregation, majority vote. Random Forest: build several decision trees, each using a random selection (with replacement) of features and instances. Boosting: after a classifier $M_i$ is learned, update the weights of the difficult instances for the next classifier $M_{i+1}$. Accuracy: (TP + TN)/all; Precision = TP/(TP + FP); Recall = TP/(TP + FN); $F_\beta = (1+\beta^2)pr/(\beta^2 p + r)$; Jaccard Index: IoU, overlap; see the sketch below.
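These metrics all exist in sklearn.metrics; a sketch for a binary task with assumed arrays y_true, y_pred:
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, fbeta_score, jaccard_score)
acc = accuracy_score(y_true, y_pred)       # (TP + TN) / all
prec = precision_score(y_true, y_pred)     # TP / (TP + FP)
rec = recall_score(y_true, y_pred)         # TP / (TP + FN)
f2 = fbeta_score(y_true, y_pred, beta=2)   # beta > 1 weights recall higher
iou = jaccard_score(y_true, y_pred)        # intersection over union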
§ Evaluation and Model Selection Out-of-sample evaluation. Optimizing hyperparameters: three disjoint sets: training, validation and test. Stratification: similar class distribution in every split. k-fold cross-validation: mutually exclusive, equal-size subsets. Nested k-fold CV: OIO. Hyperparameter tuning: random search (sketched below). Bias: predictions vs. ground truth; Variance: consistency of predictions; complexity ↑ ⇒ bias ↓, variance ↑. Decision tree pruning – prepruning: node → leaf; postpruning: branches → leaf. CV: cv_results = cross_validate(RandomForestClassifier(random_state=42), X, y, cv=5). Grid search: grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5).
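A sketch of random search, with the search object wrapped in an outer cross_validate call to approximate nested CV; the parameter grid is illustrative only:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_validate
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions={'n_estimators': [50, 100, 200],
                                                 'max_depth': [None, 5, 10]},
                            n_iter=5, cv=5)    # inner loop tunes hyperparameters
nested = cross_validate(search, X, y, cv=5)    # outer loop estimates performance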
§ XAI Interpretability: implicit capacity to explain its reasoning process. Explainability: provide a justification for the predictions. Transparency: algorithmic transparency, decomposability, and simulatability. Intrinsically interpretable models: linear regression, decision tree, k-nearest neighbors; parsimonious (less is more). Post-hoc explanation methods – Model-agnostic post-hoc: measure how changes in the inputs affect the model's outputs. 1 Partial dependence plots: the marginal effect of a feature on the model's prediction when fixing the feature values → average the class probabilities for a desired decision class; the plot allows inspecting whether the relation between the feature and the target variable is monotonic, linear, etc. 2 Permutation feature importance: compute the feature importance as the increase in the model error when permuting the values of the feature being analyzed (both sketched after this list). Drawback: assumes unrealistic independence between features. 3 Shapley values (SHAP): compute each feature's contribution; usable in both local and global contexts. Cons: computationally expensive. 4 Local surrogates (LIME): generates synthetic instances around small groups of instances. Cons: unstable. 5 Global surrogates: approximate the behavior of the complex model with a transparent model. Cons: describe the black-box model rather than the problem. 6 Counterfactual explanations: describe the smallest change to the feature values that produces a different desired output.
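Methods 1 and 2 are available in sklearn.inspection (the display class needs sklearn ≥ 1.0); a sketch assuming a fitted classifier model and data X, y:
from sklearn.inspection import permutation_importance, PartialDependenceDisplay
result = permutation_importance(model, X, y, n_repeats=10, random_state=42)
print(result.importances_mean)   # mean error increase when each feature is shuffled
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1])   # PDPs for two features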
Model-specific post-hoc: based on the representation structures of the black-box models. 1 Random Forests: compute the importance of each problem feature from their inner knowledge structures. Cons: feature importance based on impurity can be misleading when features have many unique values. 2 Fuzzy Cognitive Maps: recurrent neural networks in which neurons denote variables; feature importance is computed from the absolute values of the weights connected to each neuron in the network. Cons: doesn't consider the activation values of neurons. Evaluation and measures – Function level (number of rules of