Supervised learning uses labelled datasets; unsupervised learning does not, and instead infers patterns from a dataset without reference to known outcomes or decisions. In pattern classification, features/dimensions describe the instances and the outcome/decision class is what we want to predict; the goal is to generalize beyond the historical training data.
Missing values: imputation strategies are (1) remove the feature, when the majority of instances have a missing value for that feature and/or its variability is very high (risky when there are few features or the feature is relevant); (2) remove the instance, for scattered missing values across features (risky with limited instances); (3.1) replace missing values of a given feature with a representative value such as the mean or mode (can introduce noise); (3.2) neural-network (autoencoder) replacement: the encoder compresses the instance, the decoder reconstructs it, and the reconstructed output supplies the missing value.
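A minimal pandas sketch of strategy (3.1); the column names and data are hypothetical (mean imputation for the numerical feature, mode for the categorical one):

```python
import pandas as pd
import numpy as np

# Hypothetical toy data with missing values.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, np.nan],
    "colour": ["blue", "green", None, "blue", "blue"],
})

# (3.1) Representative-value imputation: mean for numerical, mode for categorical.
df["age"] = df["age"].fillna(df["age"].mean())
df["colour"] = df["colour"].fillna(df["colour"].mode()[0])

print(df)
```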
Feature Scaling techniques
Normalization: rescales a feature into the [0,1] range, x' = (x - min) / (max - min).
Standardization: rescales a feature to zero mean and unit variance, z = (x - mean) / standard deviation.
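A minimal NumPy sketch of both scalings on a hypothetical feature vector:

```python
import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0])          # hypothetical feature values

# Normalization (min-max): maps values into [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): zero mean, unit standard deviation.
x_std = (x - x.mean()) / x.std()

print(x_norm, x_std)
```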
Visualization allows us to understand the data better. Box plots give the minimum, maximum, 1st and 3rd quartiles and the median. Histograms show the distribution of a feature (bell curve/Gaussian shape, skewness). A scatter plot matrix analyses how two numerical features behave when contrasted with each other on a plane (see the sketch below).
Rule of dimensions: when we increase the number of dimensions, we increase the odds of good classification, but too many dimensions make training more time-expensive. We need enough dimensions to solve the problem; more dimensions can help, but too many can overfit.
Curse of dimensionality: when we add dimensions, the number of instances needed to keep the same coverage of the feature space grows exponentially; for 5 features we need on the order of N^5 instances instead of N.
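As referenced above, a minimal pandas/matplotlib sketch of the three plot types on a hypothetical DataFrame:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "height": rng.normal(170, 10, 200),   # hypothetical numerical features
    "weight": rng.normal(70, 8, 200),
})

df.boxplot()          # min, max, quartiles, median per feature
df.hist()             # distribution shape (Gaussian, skewness)
scatter_matrix(df)    # pairwise behaviour of numerical features
plt.show()
```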
Feature selection: selecting, from the pool of features, the ones with the largest information gain. (1) Wrapper-based methods iterate through the features by checking the information gain, deleting the feature with the lowest absolute gain, checking the gain again, and so on. (2) Embedded methods: some dimensions get filtered out simply by building the model that is used (e.g. not all features end up being used to build a decision tree).
Correlation is only defined between numerical features. Pearson's r measures linear correlation, r = cov(X, Y) / (std(X) * std(Y)), and ranges from -1 to 1.
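A minimal NumPy check of Pearson's r on hypothetical data, once from the definition and once with np.corrcoef:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)                          # hypothetical numerical features
y = 0.8 * x + rng.normal(scale=0.5, size=100)

# Definition: covariance divided by the product of the standard deviations.
r_manual = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))

# Library equivalent.
r_lib = np.corrcoef(x, y)[0, 1]

print(r_manual, r_lib)
```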
Feature extraction: creating new, reduced or combined features. (1) Principal component analysis (PCA): features are combined into new, uncorrelated components; each component is a weighted combination of the original features (w is the weight vector used to create the component), and the variance explained by a component is compared with the total variance to see how much variance is left unexplained.
Hybrid approach: uses both feature selection and feature extraction.
Deep neural networks perform feature extraction internally; we do not know exactly what they extract or how they do it. In classical machine learning, feature extraction is done manually.
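A minimal NumPy sketch of PCA under this reading: centre the data, take the eigenvectors of the covariance matrix as the weight vectors w, and compare each component's explained variance with the total; the data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))                        # hypothetical data, 3 features
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=200)       # make two features correlated

Xc = X - X.mean(axis=0)                              # centre the data
cov = np.cov(Xc, rowvar=False)                       # covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)               # eigenvectors = weight vectors w

order = np.argsort(eigvals)[::-1]                    # sort components by variance
explained = eigvals[order] / eigvals.sum()           # explained / total variance
components = Xc @ eigvecs[:, order]                  # project onto the new components

print(explained)
```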
Association between categorical features
Chi-squared: compares observed and expected counts in a contingency table, chi^2 = sum of (observed - expected)^2 / expected; a large value indicates the two categorical features are associated. (The worked example in the notes uses a contingency table over the categories Blue/Green/Brown, not reproduced here.)
To measure the relationship between a categorical and a numerical feature: transform the numerical feature into a symbolic (categorical) one and then use chi-squared.
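A minimal SciPy sketch, assuming a small hypothetical contingency table with columns Blue/Green/Brown:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = decision class, columns = Blue/Green/Brown.
table = np.array([
    [10, 20, 30],
    [25, 15, 10],
])

chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)   # a small p-value suggests the two categorical features are associated
```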
Encoding strategies (many algorithms cannot deal with categorical features directly): (1) label encoding assigns an integer to each category; use it when the variable has an ordinal relation (e.g. weekdays). (2) One-hot encoding is for nominal features that lack an ordinal relationship: each category becomes a binary feature (a dog/cat/mouse feature turns into three yes/no features), at the cost of increasing the dimensionality by the number of categories.
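A minimal pandas sketch of both encodings on a hypothetical 'animal' column:

```python
import pandas as pd

df = pd.DataFrame({"animal": ["dog", "cat", "mouse", "dog"]})   # hypothetical data

# (1) Label encoding: integer codes (only meaningful if the categories have an order).
df["animal_label"] = df["animal"].astype("category").cat.codes

# (2) One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["animal"], prefix="animal")

print(pd.concat([df, one_hot], axis=1))
```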
Class imbalance: when far more instances belong to one decision class, classifiers are tempted to recognize only the majority class. Solutions: (1) under-sampling, select only some instances from the majority class; (2) over-sampling, create new minority-class instances. SMOTE (synthetic minority oversampling technique) creates synthetic instances in the neighbourhoods of existing minority-class instances (drawback: it introduces artificially generated noise).
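A minimal NumPy sketch of the SMOTE idea (interpolating between a minority instance and one of its nearest minority neighbours), not the full algorithm; data and parameters are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
X_min = rng.normal(size=(10, 2))          # hypothetical minority-class instances

def smote_like(X, n_new, k=3):
    """Create n_new synthetic points between minority instances and their neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        d = np.linalg.norm(X - X[i], axis=1)      # distances to all minority points
        neighbours = np.argsort(d)[1:k + 1]       # k nearest, excluding the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                        # interpolation factor in [0, 1)
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)

print(smote_like(X_min, n_new=5))
```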
L3 Cluster analysis.
Cluster analysis = divide the population into clusters so that data points in the same group are more similar to each other than to data points in other groups.
Centroid-based clustering: each group is represented by a vector (prototype/centroid/cluster centre), which can be a non-member of the population created just to represent the group. These vectors are used to discover the groups.
(1) Hard clustering: each point belongs to only a single group. k-means: points are assigned to the nearest cluster centre to minimize distance. The algorithm starts with randomly selected centres (prototypes), which are updated each iteration by computing the mean of the assigned points, until the changes are minimal and the final groups are formed. Drawbacks: the clusters are treated as independent, so overlaps may be missed, and k must be specified in advance. Clustering quality is assessed by summing the variation within the clusters.
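A minimal NumPy k-means sketch following this description (random initial centres, nearest-centre assignment, recompute means until stable); the data and k are hypothetical, and it assumes no cluster ends up empty:

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])   # hypothetical data
k = 2

centres = X[rng.choice(len(X), size=k, replace=False)]   # random initial prototypes
for _ in range(100):
    # Assign each point to its nearest centre.
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute each centre as the mean of its assigned points.
    new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centres, centres):                 # stop when changes are minimal
        break
    centres = new_centres

within_variation = sum(((X[labels == j] - centres[j]) ** 2).sum() for j in range(k))
print(centres, within_variation)
```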
(2) Soft clustering: fuzzy c-means assigns points to clusters with membership functions on a [0,1] scale. It minimizes an objective function to determine the memberships, which are then used to compute fuzzy centroids. Each data point is allocated to several clusters; the membership degree shows how strongly a point is associated with each centroid. To compute the membership of a point in a cluster:
• calculate the distances ||x - c|| from the point to every cluster centre;
• divide the distance to the cluster in question by the distance to each of the centres, raise every ratio to the power 2/(m-1) (m is the fuzzifier) and sum the results;
• the membership is 1 divided by that sum.
Fuzzy prototype: each centroid is the membership-weighted mean of all data points, with the memberships raised to the power m.
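A minimal NumPy sketch of one fuzzy c-means update step (membership update followed by centroid update), assuming fuzzifier m = 2 and hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 2))                          # hypothetical data points
centres = X[rng.choice(len(X), 3, replace=False)]      # 3 initial fuzzy centroids
m = 2.0                                                # fuzzifier

# Membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)).
d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-12
ratios = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
u = 1.0 / ratios.sum(axis=2)            # shape (n_points, n_clusters); each row sums to 1

# Centroid update: membership-weighted mean of all points.
w = u ** m
centres = (w.T @ X) / w.sum(axis=0)[:, None]

print(u.sum(axis=1)[:5], centres)
```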
Hierarchical clustering builds a hierarchy of clusters, either by merging small clusters that share similarities or by splitting large ones that contain quite dissimilar data points. (1) Agglomerative clustering is a "bottom-up" approach: each observation (data point) starts in its own cluster, and pairs of similar clusters are merged as one moves up the hierarchy; we finish when all clusters have been merged into a single cluster. (2) Divisive clustering is a "top-down" method: all observations (data points) start in one big cluster, and splits are performed recursively as one moves down the hierarchy.
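A minimal SciPy sketch of agglomerative (bottom-up) clustering; the Ward linkage criterion and the data are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])   # hypothetical data

Z = linkage(X, method="ward")                       # bottom-up merges recorded as a hierarchy
labels = fcluster(Z, t=2, criterion="maxclust")     # cut the hierarchy into 2 clusters
print(labels)
```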
Spectral clustering: used when clusters have complex geometric shapes, like circles or parabolas. It works with the eigenvalues/eigenvectors of the similarity matrix; the space they define has well-separated cluster structures. Drawback: computing the eigenvalues is computationally expensive.
Evaluation metrics: (1) the silhouette coefficient measures the goodness of a clustering on a scale from -1 to 1, combining how well separated and how compact the clusters are; it can be used to find the optimal value of k (the number of groups), as in the sketch below. (2) The Dunn index is a clustering ratio; a larger value is better.
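A minimal scikit-learn sketch for choosing k with the silhouette coefficient; KMeans, the k range and the data are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(5, 1, (50, 2)),
               rng.normal(10, 1, (50, 2))])          # hypothetical data with 3 groups

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))            # pick the k with the highest score
```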
L4 Association Rules.
Association rule: if something happens, then something else is also likely to happen. Association rules allow determining in which way two or more categorical variables are associated; they encode causality, implication and association patterns characterizing the data. A rule X → Y maps itemset X onto itemset Y, with X and Y both subsets of the set of all items I.
A rule is written antecedent → consequent [support, confidence]. Example: liked('You') → liked('Murderer') [20%, 60%] means that 20% of viewers liked both 'You' and 'Murderer' (support), and 60% of the people who liked 'You' also liked 'Murderer' (confidence).
Support(X → Y) = number of transactions containing both X and Y / total number of transactions.
Confidence(X → Y) = number of transactions containing both X and Y / number of transactions containing X.
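A minimal Python sketch of the two formulas over a hypothetical list of 'liked' transactions:

```python
# Hypothetical transactions: each set lists the items one viewer liked.
transactions = [
    {"You", "Murderer"},
    {"You"},
    {"Murderer", "Dark"},
    {"You", "Murderer", "Dark"},
    {"Dark"},
]

def support(x, y):
    both = sum(1 for t in transactions if x in t and y in t)
    return both / len(transactions)

def confidence(x, y):
    both = sum(1 for t in transactions if x in t and y in t)
    has_x = sum(1 for t in transactions if x in t)
    return both / has_x

print(support("You", "Murderer"), confidence("You", "Murderer"))   # 0.4 and 0.666...
```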
Interesting rules need both high support and high confidence. To enforce this, users set minimum-support and minimum-confidence thresholds. A brute-force approach, generating all possible rules and filtering out those that fail, is computationally prohibitive. Instead, we first find the itemsets meeting the support threshold (the frequent itemsets), then generate rules from them that meet the confidence threshold. However, this remains expensive, as there are 2^N - 1 possible itemsets for N items. With minimum support and minimum confidence both set to 50%, the table example in the notes gives A→C = [66.6%, 66.6%] and C→A = [66.6%, 100%]; support is symmetric but confidence is not, so A→C ≠ C→A.
The Apriori algorithm generates the association rules fulfilling the minimum support and confidence requirements without exploring all possible association rules. The Apriori principle states that any subset of a frequent itemset must be frequent as well (for example, if {A,B} is frequent, both {A} and {B} must be frequent); itemsets containing non-frequent items are not interesting and can be pruned. The frequent itemsets are then used to generate the association rules.
We use a lattice to visualize how the full itemset space looks. The orange dotted line in the notes is the frequency border, arbitrarily set at a count of 2. An itemset is closed frequent if no itemset below it in the lattice (i.e. no superset) has the same frequency: C is closed, since C = 3 while AC = 2, BC = 2, CD = 1 and CE = 2. An itemset is maximal frequent if none of its supersets is above the frequency border: AC is maximal frequent, since ABC, ACD and ACE all fall below the orange frequency border.
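A minimal level-wise, Apriori-style sketch: count 1-itemsets, keep the frequent ones, join frequent itemsets into larger candidates, then emit rules that pass both thresholds (the subset-pruning step is omitted for brevity). Transactions and thresholds are hypothetical.

```python
from itertools import combinations

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_support, min_confidence = 0.5, 0.5

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Level-wise search: frequent k-itemsets are built only from frequent (k-1)-itemsets,
# relying on the Apriori principle (every subset of a frequent itemset is frequent).
items = sorted(set().union(*transactions))
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
all_frequent = list(frequent)
while frequent:
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == len(a) + 1}
    frequent = [c for c in candidates if support(c) >= min_support]
    all_frequent += frequent

# Generate rules X -> Y from each frequent itemset and filter by confidence.
for itemset in all_frequent:
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            consequent = itemset - antecedent
            conf = support(itemset) / support(antecedent)
            if conf >= min_confidence:
                print(set(antecedent), "->", set(consequent),
                      f"[{support(itemset):.0%}, {conf:.0%}]")
```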