Supervised learning uses labelled datasets; unsupervised learning does not, and instead infers patterns from a dataset without reference to known outcomes or decisions. In pattern classification, features/dimensions describe the instances and the outcome/decision class is what we want to predict; the goal is to generalize beyond the historical training data.
Missing values: imputation strategies are (1) remove the feature, when the majority of instances have a missing value for that feature and/or its variability is very high (risky when there are few features or the feature is relevant); (2) remove the instance, for scattered missing values across features (risky with limited instances); (3.1) replace missing values of a given feature with a representative value such as the mean or mode (can introduce noise); (3.2) neural-network (autoencoder) replacement: the encoder compresses the instance, the decoder reconstructs it, and the reconstructed output supplies the missing value.
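A minimal pandas sketch of strategy (3.1); the column names and data are hypothetical (mean imputation for the numerical feature, mode for the categorical one):

```python
import pandas as pd
import numpy as np

# Hypothetical toy data with missing values.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "colour": ["blue", "green", None, "blue", "blue"],
})

# (3.1) Representative-value imputation: mean for numerical, mode for categorical.
df["age"] = df["age"].fillna(df["age"].mean())
df["colour"] = df["colour"].fillna(df["colour"].mode()[0])

print(df)
```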
Feature Scaling techniques
Normalization: rescales a feature into the [0,1] range, x' = (x - min) / (max - min).
Standardization: rescales a feature to zero mean and unit variance, z = (x - mean) / standard deviation.
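A minimal NumPy sketch of both scalings on a hypothetical feature vector:

```python
import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0])          # hypothetical feature values

# Normalization (min-max): maps values into [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): zero mean, unit standard deviation.
x_std = (x - x.mean()) / x.std()

print(x_norm, x_std)
```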
Visualization allows us to understand the data better. Box plots give the minimum, maximum, 1st and 3rd quartiles and the median. Histograms show the distribution of a feature (bell curve/Gaussian shape, skewness). A scatter plot matrix analyses how two numerical features behave when contrasted with each other on a plane (see the sketch below).
Rule of dimensions: when we increase the number of dimensions, we increase the odds of good classification, but too many dimensions make training more time-expensive. We need enough dimensions to solve the problem; more dimensions can help, but too many can overfit.
Curse of dimensionality: when we add dimensions, the number of instances needed to keep the same coverage of the feature space grows exponentially; for 5 features we need on the order of N^5 instances instead of N.
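As referenced above, a minimal pandas/matplotlib sketch of the three plot types on a hypothetical DataFrame:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "height": rng.normal(170, 10, 200),   # hypothetical numerical features
    "weight": rng.normal(70, 8, 200),
})

df.boxplot()          # min, max, quartiles, median per feature
df.hist()             # distribution shape (Gaussian, skewness)
scatter_matrix(df)    # pairwise behaviour of numerical features
plt.show()
```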
Feature selection: selecting, from the pool of features, the ones with the largest information gain. (1) Wrapper-based methods iterate through the features by checking the information gain, deleting the feature with the lowest absolute gain, checking the gain again, and so on. (2) Embedded methods: some dimensions get filtered out simply by building the model that is used (e.g. not all features end up being used to build a decision tree).
Correlation is only defined between numerical features. Pearson's r measures linear correlation, r = cov(X, Y) / (std(X) * std(Y)), and ranges from -1 to 1.
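A minimal NumPy check of Pearson's r on hypothetical data, once from the definition and once with np.corrcoef:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)                          # hypothetical numerical features
y = 0.8 * x + rng.normal(scale=0.5, size=100)

# Definition: covariance divided by the product of the standard deviations.
r_manual = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))

# Library equivalent.
r_lib = np.corrcoef(x, y)[0, 1]

print(r_manual, r_lib)
```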
Feature extraction: creating new, reduced or combined features. (1) Principal component analysis (PCA): features are combined into new, uncorrelated components; each component is a weighted combination of the original features (w is the weight vector used to create the component), and the variance explained by a component is compared with the total variance to see how much variance is left unexplained.
Hybrid approach: uses both feature selection and feature extraction.
Deep neural networks perform feature extraction internally; we do not know exactly what they extract or how they do it. In classical machine learning, feature extraction is done manually.
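A minimal NumPy sketch of PCA under this reading: centre the data, take the eigenvectors of the covariance matrix as the weight vectors w, and compare each component's explained variance with the total; the data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))                        # hypothetical data, 3 features
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=200)       # make two features correlated

Xc = X - X.mean(axis=0)                              # centre the data
cov = np.cov(Xc, rowvar=False)                       # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)               # eigenvectors = weight vectors w

order = np.argsort(eigvals)[::-1]                    # sort components by variance
explained = eigvals[order] / eigvals.sum()           # explained / total variance
components = Xc @ eigvecs[:, order]                  # project onto the new components

print(explained)
```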
Association between categorical features
Chi-squared: compares observed and expected counts in a contingency table, chi^2 = sum of (observed - expected)^2 / expected; a large value indicates the two categorical features are associated. (The worked example in the notes uses a contingency table over the categories Blue/Green/Brown, not reproduced here.)
To measure the relationship between a categorical and a numerical feature: transform the numerical feature into a symbolic (categorical) one and then use chi-squared.
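A minimal SciPy sketch, assuming a small hypothetical contingency table with columns Blue/Green/Brown:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = decision class, columns = Blue/Green/Brown.
table = np.array([
    [10, 20, 30],
    [25, 15, 10],
])

chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)   # a small p-value suggests the two categorical features are associated
```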
Encoding strategies (many algorithms cannot deal with categorical features directly): (1) label encoding assigns an integer to each category; use it when the variable has an ordinal relation (e.g. weekdays). (2) One-hot encoding is for nominal features that lack an ordinal relationship: each category becomes a binary feature (a dog/cat/mouse feature turns into three yes/no features), at the cost of increasing the dimensionality by the number of categories.
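A minimal pandas sketch of both encodings on a hypothetical 'animal' column:

```python
import pandas as pd

df = pd.DataFrame({"animal": ["dog", "cat", "mouse", "dog"]})   # hypothetical data

# (1) Label encoding: integer codes (only meaningful if the categories have an order).
df["animal_label"] = df["animal"].astype("category").cat.codes

# (2) One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["animal"], prefix="animal")

print(pd.concat([df, one_hot], axis=1))
```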
Class imbalance: when far more instances belong to one decision class, classifiers are tempted to recognize only the majority class. Solutions: (1) under-sampling, select only some instances from the majority class; (2) over-sampling, create new minority-class instances. SMOTE (synthetic minority oversampling technique) creates synthetic instances in the neighbourhoods of existing minority-class instances (drawback: it introduces artificially generated noise).
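A minimal NumPy sketch of the SMOTE idea (interpolating between a minority instance and one of its nearest minority neighbours), not the full algorithm; data and parameters are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
X_min = rng.normal(size=(10, 2))          # hypothetical minority-class instances

def smote_like(X, n_new, k=3):
    """Create n_new synthetic points between minority instances and their neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)      # distances to all minority points
        neighbours = np.argsort(d)[1:k + 1]       # k nearest, excluding the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                        # interpolation factor in [0, 1)
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)

print(smote_like(X_min, n_new=5))
```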
L3 Cluster analysis.
Cluster analysis = divide the population into clusters so that data points in the same group are more similar to each other than to data points in other groups.
Centroid-based clustering: each group is represented by a vector (prototype/centroid/cluster centre), which can be a non-member of the population created just to represent the group. These vectors are used to discover the groups.
(1) Hard clustering: each point belongs to only a single group. k-means: points are assigned to the nearest cluster centre to minimize distance. The algorithm starts with randomly selected centres (prototypes), which are updated each iteration by computing the mean of the assigned points, until the changes are minimal and the final groups are formed. Drawbacks: the clusters are treated as independent, so overlaps may be missed, and k must be specified in advance. Clustering quality is assessed by summing the variation within the clusters.
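A minimal NumPy k-means sketch following this description (random initial centres, nearest-centre assignment, recompute means until stable); the data and k are hypothetical, and it assumes no cluster ends up empty:

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])   # hypothetical data
k = 2

centres = X[rng.choice(len(X), size=k, replace=False)]   # random initial prototypes
for _ in range(100):
    # Assign each point to its nearest centre.
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute each centre as the mean of its assigned points.
    new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centres, centres):                 # stop when changes are minimal
        break
    centres = new_centres

within_variation = sum(((X[labels == j] - centres[j]) ** 2).sum() for j in range(k))
print(centres, within_variation)
```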
(2) Soft clustering: fuzzy c-means assigns points to clusters with membership functions on a [0,1] scale. It minimizes an objective function to determine the memberships, which are then used to compute fuzzy centroids. Each data point is allocated to several clusters; the membership degree shows how strongly a point is associated with each centroid. To compute the membership of a point in a cluster:
• calculate the distances ||x - c|| from the point to every cluster centre;
• divide the distance to the cluster in question by the distance to each of the centres, raise every ratio to the power 2/(m-1) (m is the fuzzifier) and sum the results;
• the membership is 1 divided by that sum.
Fuzzy prototype: each centroid is the membership-weighted mean of all data points, with the memberships raised to the power m.
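A minimal NumPy sketch of one fuzzy c-means update step (membership update followed by centroid update), assuming fuzzifier m = 2 and hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 2))                          # hypothetical data points
centres = X[rng.choice(len(X), 3, replace=False)]      # 3 initial fuzzy centroids
m = 2.0                                                # fuzzifier

# Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)).
d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-12
ratios = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
u = 1.0 / ratios.sum(axis=2)            # shape (n_points, n_clusters); each row sums to 1

# Centroid update: membership-weighted mean of all points.
w = u ** m
centres = (w.T @ X) / w.sum(axis=0)[:, None]

print(u.sum(axis=1)[:5], centres)
```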
Hierarchical clustering builds a hierarchy of clusters, either by merging small clusters that share similarities or by splitting large ones that contain quite dissimilar data points. (1) Agglomerative clustering is a "bottom-up" approach: each observation (data point) starts in its own cluster, and pairs of similar clusters are merged as one moves up the hierarchy; we finish when all clusters have been merged into a single cluster. (2) Divisive clustering is a "top-down" method: all observations (data points) start in one big cluster, and splits are performed recursively as one moves down the hierarchy.
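A minimal SciPy sketch of agglomerative (bottom-up) clustering; the Ward linkage criterion and the data are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])   # hypothetical data

Z = linkage(X, method="ward")                       # bottom-up merges recorded as a hierarchy
labels = fcluster(Z, t=2, criterion="maxclust")     # cut the hierarchy into 2 clusters
print(labels)
```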
Spectral clustering: used when clusters have complex geometric shapes, like circles or parabolas. It works with the eigenvalues/eigenvectors of the similarity matrix; the space they define has well-separated cluster structures. Drawback: computing the eigenvalues is computationally expensive.
Evaluation metrics: (1) the silhouette coefficient measures the goodness of a clustering on a scale from -1 to 1, combining how well separated and how compact the clusters are; it can be used to find the optimal value of k (the number of groups), as in the sketch below. (2) The Dunn index is a clustering ratio; a larger value is better.
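A minimal scikit-learn sketch for choosing k with the silhouette coefficient; KMeans, the k range and the data are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(5, 1, (50, 2)),
               rng.normal(10, 1, (50, 2))])          # hypothetical data with 3 groups

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))            # pick the k with the highest score
```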
L4 Association Rules.
Association rule: if something happens, then something else is also likely to happen. Association rules allow determining in which way two or more categorical variables are associated; they encode causality, implication and association patterns characterizing the data. A rule X → Y maps itemset X onto itemset Y, with X and Y both subsets of the set of all items I.
A rule is written antecedent → consequent [support, confidence]. Example: liked('You') → liked('Murderer') [20%, 60%] means that 20% of viewers liked both 'You' and 'Murderer' (support), and 60% of the people who liked 'You' also liked 'Murderer' (confidence).
Support(X → Y) = number of transactions containing both X and Y / total number of transactions.
Confidence(X → Y) = number of transactions containing both X and Y / number of transactions containing X.
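A minimal Python sketch of the two formulas over a hypothetical list of 'liked' transactions:

```python
# Hypothetical transactions: each set lists the items one viewer liked.
transactions = [
    {"You", "Murderer"},
    {"You"},
    {"Murderer", "Dark"},
    {"You", "Murderer", "Dark"},
    {"Dark"},
]

def support(x, y):
    both = sum(1 for t in transactions if x in t and y in t)
    return both / len(transactions)

def confidence(x, y):
    both = sum(1 for t in transactions if x in t and y in t)
    has_x = sum(1 for t in transactions if x in t)
    return both / has_x

print(support("You", "Murderer"), confidence("You", "Murderer"))   # 0.4 and 0.666...
```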
Interesting rules need both high support and high confidence. To enforce this, users set minimum-support and minimum-confidence thresholds. A brute-force approach, generating all possible rules and filtering out those that fail, is computationally prohibitive. Instead, we first find the itemsets meeting the support threshold (the frequent itemsets), then generate rules from them that meet the confidence threshold. However, this remains expensive, as there are 2^N - 1 possible itemsets for N items. With minimum support and minimum confidence both set to 50%, the table example in the notes gives A→C = [66.6%, 66.6%] and C→A = [66.6%, 100%]; support is symmetric but confidence is not, so A→C ≠ C→A.
The Apriori algorithm generates the association rules fulfilling the minimum support and confidence requirements without exploring all possible association rules. The Apriori principle states that any subset of a frequent itemset must be frequent as well (for example, if {A,B} is frequent, both {A} and {B} must be frequent); itemsets containing non-frequent items are not interesting and can be pruned. The frequent itemsets are then used to generate the association rules.
We use a lattice to visualize how the full itemset space looks. The orange dotted line in the notes is the frequency border, arbitrarily set at a count of 2. An itemset is closed frequent if no itemset below it in the lattice (i.e. no superset) has the same frequency: C is closed, since C = 3 while AC = 2, BC = 2, CD = 1 and CE = 2. An itemset is maximal frequent if none of its supersets is above the frequency border: AC is maximal frequent, since ABC, ACD and ACE all fall below the orange frequency border.
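A minimal level-wise, Apriori-style sketch: count 1-itemsets, keep the frequent ones, join frequent itemsets into larger candidates, then emit rules that pass both thresholds (the subset-pruning step is omitted for brevity). Transactions and thresholds are hypothetical.

```python
from itertools import combinations

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_support, min_confidence = 0.5, 0.5

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Level-wise search: frequent k-itemsets are built only from frequent (k-1)-itemsets,
# relying on the Apriori principle (every subset of a frequent itemset is frequent).
items = sorted(set().union(*transactions))
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
all_frequent = list(frequent)
while frequent:
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == len(a) + 1}
    frequent = [c for c in candidates if support(c) >= min_support]
    all_frequent += frequent

# Generate rules X -> Y from each frequent itemset and filter by confidence.
for itemset in all_frequent:
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            consequent = itemset - antecedent
            conf = support(itemset) / support(antecedent)
            if conf >= min_confidence:
                print(set(antecedent), "->", set(consequent),
                      f"[{support(itemset):.0%}, {conf:.0%}]")
```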