Summary

Summary Cheatsheet for Data Mining for Business and Governance Exam (2 pages)

Rating
-
Sold
-
Pages
2
Uploaded on
27-05-2024
Written in
2023/2024

Prepare effectively for your Data Mining for Business and Governance exam with this concise and structured cheatsheet. Spanning 2 pages, this resource is tailored for exam success, offering a quick reference guide organized by lecture topics. Featuring LaTeX-rendered mathematical formulas for clarity, the cheatsheet provides a clear overview of essential concepts and formulas crucial for the exam. Designed with blank areas for personal notes, this cheatsheet allows you to customize your study material to suit your learning style. Enhance your exam preparation and boost your confidence with this resource.



Document information

Uploaded on
27 May 2024
Number of pages
2
Written in
2023/2024
Type
Summary

Content preview

§ Basic
Positively (R) skewed: mean > median > mode. At least $(1 - 1/k^2)$ of the values lie within $k$ standard deviations of the mean. Label encoding: assign integer numbers to each category; it only makes sense if there is an ordinal relationship among the categories. One-hot encoding: encode nominal features that lack an ordinal relationship; increases the problem dimensionality. Class imbalance: oversampling, undersampling, SMOTE (might induce noise). $\operatorname{Var}(x) = E(x^2) - E(x)^2$. Pearson correlation $= \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$. $\chi^2$ association measure $= \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$ with $E_{ij} = \frac{p_i \times p_j}{k}$; $O_{ij}$: observed together, $E_{ij}$: expected value.
Drop NAs: df1.dropna(thresh=0.9*len(df), axis=1, inplace=True) (with axis=1 this drops columns that are more than 10% missing). Mean imputation: df['f'].fillna(mean_v, inplace=True). Normalization (sklearn.preprocessing): scaler = MinMaxScaler(); df['f'] = scaler.fit_transform(df[['f']]). Standardization: scaler = StandardScaler(). Label encoding: encoder = LabelEncoder(); df['sex'] = encoder.fit_transform(df['sex']); label_encoder = LabelEncoder(); encoded_data = label_encoder.fit_transform(Cancer_risk).

§ Classification Algorithms
Rule-based learning: decision tree. Internal node: test on an attribute; branch: outcome of the test; leaf (terminal) node: class label; root node: topmost. Entropy: $\operatorname{entropy}(P) = -\sum_i p_i \log_2 p_i$, a measure of disorder ($0 \rightarrow$ pure). Information value: weighted entropy. Information gain: $\operatorname{gain}(f_i) = \operatorname{info}(\text{root}) - \operatorname{info}(f_i)$. Bayesian learning: assume features are independent. Bayes' theorem: $P(C_i \mid X) = \frac{P(X \mid C_i) \cdot P(C_i)}{P(X)}$. Naïve Bayes: $P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \cdot P(x_2 \mid C_i) \cdots P(x_n \mid C_i)$. Normalization: $P(C_1 \mid X) / (P(C_1 \mid X) + P(C_2 \mid X))$. The assumptions of independence and equal importance of features are rarely fulfilled.
Lazy learning: similar instances should lead to the same decision classes. kNN: works well when the classes are clearly separated; use an odd $k$; sensitive to outliers, the number of neighbors, and the distance function. Minkowski distance: $p=1$ Manhattan, $p=2$ Euclidean; Chebyshev: maximum difference; cosine similarity $= \cos(\theta)$, cosine distance $= 1 - \cos(\theta)$.
Ensemble learning. Bagging: bootstrap aggregation, majority vote. Random forest: build several decision trees, each using a random selection (with replacement) of features and instances. Boosting: after a classifier $M_i$ is learned, update the weights of difficult instances for the next classifier $M_{i+1}$. Accuracy = (TP + TN) / all; Precision = TP / (TP + FP); Recall = TP / (TP + FN); $F_\beta = (1 + \beta^2)\,pr / (\beta^2 p + r)$; Jaccard index: IoU, overlap / union.
train_df_scaled = scaler.fit_transform(train_df); test_df_scaled = scaler.transform(test_df)

§ Evaluation and Model Selection
Out-of-sample evaluation; optimizing hyperparameters: three disjoint sets (training, validation and test). Stratification: similar class distribution. k-fold cross-validation: mutually exclusive, equal-size subsets. Nested k-fold CV: OIO. Hyperparameter tuning: random search. Bias: predictions vs. ground truth; variance: consistency in predictions; complexity ↑, bias ↓, variance ↑. Decision tree pruning: pre-pruning (node → leaf), post-pruning (branches → leaf). CV: cv_results = cross_validate(RandomForestClassifier(random_state=42), X, y, cv=5). Grid search: grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5).

§ XAI
Interpretability: implicit capacity to explain its reasoning process. Explainability: provide a justification for the predictions. Transparency: algorithmic transparency, decomposability and simulatability. Intrinsically interpretable models: linear regression, decision tree, k-nearest neighbors; parsimonious (less is more). Post-hoc explanation methods. Model-agnostic post-hoc: measure how changes in the inputs affect the model's outputs. (1) Partial dependence plots: the marginal effect of a feature on the model's prediction when fixing the feature values; average the class probabilities to a desired decision class; the plot allows inspecting whether the relation between the feature and the target variable is monotonic, linear, etc. (2) Permutation feature importance: compute the feature importance as the increase in the model error when permuting the values of the feature being analyzed; drawback: assumes unrealistic independence. (3) Shapley values (SHAP): compute each feature's contribution; can be used in both local and global contexts; cons: computationally expensive. (4) Local surrogates (LIME): generate synthetic instances around small groups of instances; cons: unstable. (5) Global surrogates: approximate the behavior of the complex model with a transparent model; cons: describe the black-box model rather than the problem. (6) Counterfactual explanations: describe the smallest change to the feature values that produces a different desired output. Model-specific post-hoc: based on the representation structures of the black-box models. (1) Random forests: compute the importance of each problem feature from their inner knowledge structures; cons: feature importance based on impurity can be misleading when features have many unique values. (2) Fuzzy cognitive maps: recurrent neural networks in which neurons denote variables; feature importance is computed from the absolute values of the weights connected to each neuron in the network; cons: does not consider the activation values of neurons. Evaluation and measures: function level (number of rules of …
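The preprocessing one-liners in the preview can be wired together into a single runnable sketch. The toy DataFrame and its column names ('sex', 'income') are assumptions made only for illustration; the sklearn calls are the ones quoted above, with the scaler fitted on the training split only so that no test-set information leaks in.

import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Toy data; the column names are invented for this example.
df = pd.DataFrame({
    "sex": ["M", "F", "F", "M"],
    "income": [30000, 52000, 47000, 61000],
})

# Label encoding: integer codes per category (only meaningful for ordinal features).
encoder = LabelEncoder()
df["sex"] = encoder.fit_transform(df["sex"])

# Normalization to [0, 1]: fit the scaler on the training part only,
# then reuse it to transform the held-out part.
train_df, test_df = df.iloc[:3], df.iloc[3:]
scaler = MinMaxScaler()
train_df_scaled = scaler.fit_transform(train_df[["income"]])
test_df_scaled = scaler.transform(test_df[["income"]])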

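A small worked example of the entropy and information-gain formulas from the Classification Algorithms section; the class labels and the candidate split are invented purely to show the arithmetic.

import math

def entropy(labels):
    # entropy(P) = -sum_i p_i * log2(p_i); 0 means a pure node.
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

root = ["yes"] * 5 + ["no"] * 5      # parent node
left = ["yes"] * 4 + ["no"]          # children after a hypothetical split
right = ["yes"] + ["no"] * 4

# Information value of the split = weighted entropy of the children.
info_split = (len(left) / len(root)) * entropy(left) + \
             (len(right) / len(root)) * entropy(right)
gain = entropy(root) - info_split    # about 1.0 - 0.72 = 0.28 bits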
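The cross_validate and GridSearchCV fragments in the Evaluation and Model Selection section can be assembled as below; the synthetic dataset and the param_grid values are illustrative assumptions, not part of the original cheatsheet.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_validate

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Out-of-sample evaluation: 5-fold cross-validation of a fixed model.
cv_results = cross_validate(RandomForestClassifier(random_state=42), X, y, cv=5)
print(cv_results["test_score"].mean())

# Hyperparameter tuning: each grid candidate is scored with its own 5-fold CV.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42),
                           param_grid=param_grid, cv=5)
grid_search.fit(X, y)
print(grid_search.best_params_, grid_search.best_score_)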
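Of the post-hoc methods listed in the XAI section, permutation feature importance is the easiest to sketch with scikit-learn's permutation_importance; the dataset and model below are assumptions chosen only to make the example self-contained.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature several times on held-out data; the mean drop in score
# is reported as that feature's importance (assumes feature independence).
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: {imp:.3f}")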