
Cheatsheet Data Mining Final (summary)

Pages
2
Uploaded on
27-03-2025
Written in
2024/2025

This document contains the perfect cheatsheet for the data mining course. It is allowed to be taken into the test and contains all of the course content, including imputation, so it can also be used as a summary.


Content preview

L1 Introduction and Preliminaries.
Supervised learning uses labelled datasets; unsupervised learning does not, and infers patterns from a dataset without reference to outcomes or decisions. In pattern classification, features/dimensions describe an outcome/decision class, and the goal is to generalize beyond the historical training data.
For missing values, the imputation strategies are: (1) remove the feature, when the majority of instances have a missing value for that feature and/or its variability is very high (− few features left, the feature may be relevant); (2) remove the instance, for missing values scattered across features (− limited instances); (3.1) replace the missing values of a given feature with a representative value such as the mean or mode (− introduces noise); (3.2) neural-network autoencoder replacement: first encoder, then decoder -> output.
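A minimal sketch of strategy (3.1) using scikit-learn's SimpleImputer (the library choice is an assumption; the cheatsheet only names the strategy):

    import numpy as np
    from sklearn.impute import SimpleImputer

    X = np.array([[1.0, np.nan],
                  [2.0, 4.0],
                  [np.nan, 6.0]])
    # Replace each NaN with the mean of its column; strategy="most_frequent" gives the mode.
    X_filled = SimpleImputer(strategy="mean").fit_transform(X)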
Feature Scaling techniques.
Normalization rescales a feature to the [0, 1] range: x' = (x − min) / (max − min).
Standardization rescales a feature to zero mean and unit variance: x' = (x − mean) / std.
Correlation is only defined between numerical features. Pearson's r: r = Σ (x_i − x̄)(y_i − ȳ) / √( Σ (x_i − x̄)² · Σ (y_i − ȳ)² ).
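A small illustration of both scalings with scikit-learn (an assumed choice; any implementation of the two formulas above works):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    X = np.array([[1.0], [2.0], [10.0]])
    X_norm = MinMaxScaler().fit_transform(X)    # (x - min) / (max - min), values in [0, 1]
    X_std = StandardScaler().fit_transform(X)   # (x - mean) / std, zero mean and unit variance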
Association between categorical features: Chi-squared, χ² = Σ (observed − expected)² / expected, computed over a contingency table. The cheatsheet works this out on an example table whose categories are Blue, Green and Brown.
To measure the relationship between a categorical and a numerical feature: transform the numerical feature into a symbolic (categorical) one and use Chi-squared.
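A sketch of the chi-squared test of association on a small contingency table, using scipy (an assumed library choice; the Blue/Green/Brown counts below are hypothetical):

    import numpy as np
    from scipy.stats import chi2_contingency

    # Rows: two groups of a categorical feature; columns: Blue, Green, Brown (hypothetical counts).
    table = np.array([[10, 5, 15],
                      [8, 12, 10]])
    chi2, p, dof, expected = chi2_contingency(table)   # chi2 = sum((observed - expected)^2 / expected)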
Encoding strategies, since many algorithms cannot deal with categorical features: (1) label encoding – assign an integer to each category, when the variable has an ordinal relation (weekdays); (2) one-hot encoding – for nominal features that lack an ordinal relationship, each category is transformed into a binary feature (problem: the number of features, and thus the dimensionality, increases); dog/cat/mouse get a yes or no in 3 new features, as in the sketch below.
Class imbalance is when more instances belong to a certain decision class; classifiers are then tempted to recognize the majority class only. Solutions: (1) under-sampling – select only some instances from the majority class; (2) over-sampling – create new minority-class instances; SMOTE (synthetic minority oversampling technique) creates synthetic instances in the neighbourhoods of minority-class instances (− induces AI-generated noise).
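The dog/cat/mouse case as a minimal one-hot encoding sketch (pandas is an assumed choice):

    import pandas as pd

    df = pd.DataFrame({"animal": ["dog", "cat", "mouse", "dog"]})
    # Each category becomes its own yes/no (0/1) column: animal_cat, animal_dog, animal_mouse.
    one_hot = pd.get_dummies(df["animal"], prefix="animal")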
L2 Dimensionality Reduction.
Visualization allows us to understand the data better. Box plots give the min, max, 1st and 3rd quartile, and median. A histogram shows the distribution (bell curve/Gaussian, skewness). A scatter plot matrix analyses how two numerical features behave when contrasted with each other on a plane.
Rule of dimensions: when we increase the number of dimensions, we increase the odds of good classifications. However, if we increase the dimensions too much, it becomes more time-expensive. We need enough dimensions to solve the problem; more dimensions could be better, but too many can overfit.
Curse of dimensionality: when we add more dimensions, the number of instances required multiplies, so for 5 features we need N^5 instances to get the same coverage.
Feature selection: selecting, from the pool of features, those with the largest information gain. (1) Wrapper-based methods iterate through the features by checking the information gain, deleting the one with the lowest absolute gain, checking the gain again, and so on. (2) Embedded methods: some dimensions get filtered out by building the model itself (not all features are used to build a decision tree).
Feature extraction: creating new, reduced or combined features. (1) Principal components analysis: features can be combined to become new, unrelated ones; a new component is a weighted combination (w · x) of the original features, and its explained variance is taken from the total variance to see what variance is left. A hybrid approach uses both feature selection and feature extraction. Deep neural networks perform feature extraction internally (we don't know what, or how, they do what they do), while in traditional machine learning feature extraction is done manually.
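A minimal PCA sketch with scikit-learn (an assumed implementation; the component weights play the role of w above):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                 # 100 instances, 5 features
    pca = PCA(n_components=2).fit(X)
    X_reduced = pca.transform(X)                  # new, uncorrelated components
    explained = pca.explained_variance_ratio_     # share of the total variance kept per component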
L3 Cluster analysis.
Cluster analysis = divide the population into clusters so that data points in the same group are more similar to other data points in that group than to those in other groups.
Centroid-based clustering: each group is represented by a vector (prototype/centroid/cluster centre); this can be a non-member of the population, created just for the representation of the group. These vectors are used to discover the groups.
(1) Hard clustering: points are part of only a single group. k-means: points are assigned to the nearest cluster so as to minimize distance. Clusters start with randomly selected centres, which are updated iteratively by computing means until the changes are minimal; the prototypes adjust each iteration until they stabilize, forming the final groups. Drawbacks are that clusters are treated as independent and may miss overlaps, and that k must be specified in advance. Quality is assessed by summing the variation within clusters.
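A k-means sketch with scikit-learn (assumed library; k = 3 is fixed here just for illustration):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    labels = km.labels_            # hard assignment: each point belongs to exactly one cluster
    centres = km.cluster_centers_  # prototypes, updated until changes are minimal
    wcss = km.inertia_             # summed within-cluster variation (quality measure)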

(2) Soft clustering allows points to be in multiple groups. Fuzzy logic shows how much a point is associated with a centroid. Fuzzy c-means assigns points to clusters with membership functions on a [0, 1] scale; it minimizes an objective function to determine the memberships, which in turn are used to compute the fuzzy centroids. Each data point is allocated to a few clusters. Membership degree: • calculate the distance functions ||x|| to the centroids; • divide the distance in question by each of the other ones, to the power of m − 1; • 1 / (the result of the previous step) is your answer. Fuzzy prototype: the membership-weighted mean of the data points.
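A numpy sketch of that membership computation (the exponent 2/(m − 1) is the standard fuzzy c-means form and is an assumption here, since the cheatsheet's own formula is not reproduced in the preview):

    import numpy as np

    def fcm_memberships(X, centroids, m=2.0):
        # Distance of every point to every centroid: shape (n_points, n_clusters).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                                   # avoid division by zero
        # u_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1))
        ratios = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
        return 1.0 / ratios.sum(axis=2)                         # rows sum to 1, values in [0, 1]

    X = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]])
    C = np.array([[0.5, 0.5], [9.0, 9.0]])
    U = fcm_memberships(X, C)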
Hierarchical clustering builds a hierarchy of clusters, by either merging small clusters that share similarities or splitting large ones that contain quite dissimilar data points. (1) Agglomerative clustering: a “bottom-up” approach such that each observation (data point) starts in its own cluster, and pairs of similar clusters are merged as one moves up the hierarchy; we finish when all clusters have been merged into a single cluster. (2) Divisive clustering: a “top-down” method such that all observations (data points) start in one big cluster, and splits are performed recursively as one moves down the hierarchy.
Spectral clustering: for clusters with complex geometric shapes, like circles or parabolas. It transforms the eigenvalues of the similarity matrix; the space defined by the eigenvalues of the similarity matrix has well-separated cluster structures. A drawback is that the computation of the eigenvalues is computationally expensive.
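Both are available in scikit-learn, sketched below (an assumed choice; the parameters are illustrative):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering, SpectralClustering

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    agglo = AgglomerativeClustering(n_clusters=3).fit_predict(X)                 # bottom-up merging
    spectral = SpectralClustering(n_clusters=3, random_state=0).fit_predict(X)   # clusters found in the spectral embedding of the similarity matrix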
Evaluation metrics: (1) silhouette coefficient – measures the goodness of a clustering on a scale from −1 to 1, combining how well separated and how compact the clusters are; it can be used to calculate the optimal value of k (the number of groups). (2) Dunn index – a clustering ratio; a larger value is better.
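A sketch of picking k with the silhouette coefficient (scikit-learn assumed):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    scores = {}
    for k in range(2, 7):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)   # in [-1, 1], higher is better
    best_k = max(scores, key=scores.get)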
L4 Association Rules.
Association rule: if something happens, then something else is also likely to happen. Three ingredients are required: causality, implication, patterns. Association rules allow determining in which way two or more categorical variables are associated; they encode causality, implication and association patterns characterizing the data. A rule maps X onto Y, with X and Y being subsets of the set of all items I.
Antecedent → consequent [support, confidence]. Example: liked(‘You’) → liked(‘Murderer’) [20%, 60%] means that 20% of viewers liked both ‘You’ and ‘Murderer’, and 60% of the people who liked ‘You’ also liked ‘Murderer’.
Support(X → Y) = (number of members containing both X & Y) / (total number of members). Confidence(X → Y) = (number of members containing both X & Y) / (number of X members).
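A minimal sketch of computing both measures from a list of transactions (plain Python; the viewing data is hypothetical):

    transactions = [
        {"You", "Murderer"}, {"You"}, {"You", "Murderer"},
        {"Murderer"}, {"You", "Other"},
    ]
    X, Y = {"You"}, {"Murderer"}
    n_total = len(transactions)
    n_xy = sum(1 for t in transactions if X <= t and Y <= t)   # members containing both X & Y
    n_x = sum(1 for t in transactions if X <= t)               # X members
    support = n_xy / n_total       # 2 / 5 = 40%
    confidence = n_xy / n_x        # 2 / 4 = 50%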
Association rule mining finds rules with acceptable support and confidence. To enforce this, users set minimum-support and minimum-confidence thresholds. A brute-force approach, generating all possible rules and filtering those that fail, is computationally prohibitive. Instead, we first find the itemsets meeting the support threshold, then generate rules from those frequent itemsets that meet the confidence threshold. However, this remains expensive, as there are 2^N - 1 possible itemsets for N items. If we put minimum support and minimum confidence at 50%, then for the example transaction table in the cheatsheet we find A→C = [66.6%, 66.6%] and C→A = [66.6%, 100%]; thus A→C ≠ C→A.
The Apriori algorithm is able to generate association rules fulfilling the minimum support and confidence requirements without exploring all possible association rules. The Apriori principle states that any subset of a frequent itemset must be frequent as well; subsets containing non-frequent items are not interesting. For example, if {A, B} is a frequent itemset, both {A} and {B} must be frequent as well. The frequent itemsets can then be used to generate association rules. When using the Apriori algorithm, we use a lattice to visualize how the full itemset space looks, with an orange dotted line marking the frequency border (arbitrarily set at 2 in the cheatsheet's example). An itemset is closed frequent if there are no sets below it in the lattice with the same frequency: C is a closed itemset, since C = 3 while AC = 2, BC = 2, CD = 1 and CE = 2. An itemset is maximal frequent if it has no relatives above the frequency border: AC is a maximal frequent itemset, since ABC, ACD and ACE are all below the orange frequency border.
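A small sketch of the Apriori idea (a simplified, assumed implementation): count supports level by level and only extend itemsets that are already frequent, instead of enumerating all 2^N - 1 candidates.

    from itertools import combinations

    transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    min_count = 2                                    # the frequency border

    def support_count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({i for t in transactions for i in t})
    frequent = [frozenset({i}) for i in items if support_count(frozenset({i})) >= min_count]
    level = frequent[:]
    while level:
        # Candidate (k+1)-itemsets are unions of frequent k-itemsets; by the Apriori
        # principle every subset of a frequent itemset must itself be frequent.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support_count(c) >= min_count]
        frequent.extend(level)
    # frequent now holds {A}, {B}, {C}, {E}, {A,C}, {B,C}, {B,E}, {C,E} and {B,C,E}.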