Clustering: grouping data points based on similarity
Data points within a cluster are similar to each other, and dissimilar to data points in other clusters
→ useful to find groups that are assumed to exist in reality (e.g. vegetation type, animal behaviour)
Clustering = partitioning (same term)
- Clustering is not about revealing gradients
→ ordination is about revealing gradients
→ clustering is about detecting discrete groups with small differences between members
- Clustering is not the same as classification
→ classification is about assigning observations to predefined groups based on known labels
Similarity and dissimilarity are essential components of clustering analysis
Distance between pairs of:
→ points
→ clusters of points
Types of clustering:
- Flat clustering (K-means clustering): creates a flat set of clusters without any structure
- Hierarchical clustering: creates a hierarchy of clusters (thus with internal structure)
Flat clustering (K-means)
K-means: the simplest clustering algorithm, where we must define a target number K, which refers to
the number of means (centers) around which we want to partition our dataset.
→ Each observation is assigned to the cluster with the nearest mean
→ Only deals with differences between clusters, not with the structure within clusters
Steps:
1. Randomly locate initial cluster centers
2. Assign records to nearest cluster mean
3. Compute new cluster means
4. Repeats 2 & 3 a few iterations
→ new data points can be assigned to the cluster with the nearest center
→ disadvantage: the number of clusters K has to be chosen by eye
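A minimal NumPy sketch of steps 1–4 above; the toy data, K = 3 and the fixed number of iterations are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))   # toy data: 150 observations, 2 variables (assumption)
K = 3                           # target number of clusters (assumption)

# 1. Randomly locate initial cluster centers (here: K random observations)
centers = X[rng.choice(len(X), size=K, replace=False)]

for _ in range(10):             # 4. repeat steps 2 & 3 a few iterations
    # 2. Assign each record to the nearest cluster mean (Euclidean distance)
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # 3. Compute new cluster means (a full implementation would also handle
    #    the rare case of a cluster losing all its members)
    centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])

# New data points can be assigned to the cluster with the nearest center
new_point = np.array([0.5, -0.2])
new_label = np.linalg.norm(centers - new_point, axis=1).argmin()
```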
Learning algorithm: an algorithm that learns; it tries a few times (iterations from random starting
centers) and then settles on an outcome.
→ does not necessarily result in exactly the same outcome when the analysis is repeated, because the initial centers are random
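In practice the same steps can be run with scikit-learn's KMeans; this sketch assumes toy data and uses n_init and random_state to deal with the random initialisation mentioned above:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))   # toy data (assumption)

# n_init reruns K-means from several random initial centers and keeps the best
# solution; fixing random_state makes a repeated analysis reproducible.
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = km.labels_             # cluster assignment of each observation
centers = km.cluster_centers_   # the K cluster means
```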
Hierarchical clustering
Hierarchical clustering does not require us to pre-specify the number of clusters to be generated, and results in a dendrogram
Dendrogram: tree-like diagram that records the sequence of merges or splits
Root node: top node to which all samples belong
Leaf (terminal node): cluster with only one sample
→ the similarity of two observations is based on the height at which the branches containing those two observations are first fused
→ we cannot use the proximity of two observations along the horizontal axis for similarity
Types of hierarchical clustering:
- Agglomerative clustering (merges): builds nested clusters by merging smaller clusters with a bottom-up approach
- Divisive clustering (splits): builds nested clusters by splitting larger clusters with a top-down approach
Disadvantage: when a new data point is added, the entire dendrogram needs to be recalculated
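A minimal sketch of agglomerative clustering with SciPy; the toy data and the average-linkage choice are assumptions (linkage methods are discussed below):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))            # small toy data set (assumption)

# Bottom-up (agglomerative): every observation starts as its own leaf and the
# two closest clusters are merged step by step until only the root remains.
Z = linkage(X, method="average")

# The dendrogram records the sequence of merges; the fusion height of two
# observations reflects their dissimilarity.
dendrogram(Z)
plt.show()

# The number of clusters is chosen afterwards by cutting the tree, so it does
# not have to be pre-specified.
labels = fcluster(Z, t=3, criterion="maxclust")
```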
Similarity and dissimilarity
Distance between pairs (computed in the sketch below):
- Euclidean: "as the crow flies" (straight-line distance)
- Manhattan: "as the taxi drives" → distance along the axes
- Jaccard: intersection/union, relative similarity → for binary (presence/absence) data
  → Jaccard distance of 0.67 → 4 out of 6 species differ between the sites
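The three distances in a minimal SciPy sketch; the coordinate pair and the species presence/absence vectors are assumptions for illustration:

```python
from scipy.spatial.distance import euclidean, cityblock, jaccard

a, b = [1.0, 2.0], [4.0, 6.0]
euclidean(a, b)    # 5.0 -> "as the crow flies"
cityblock(a, b)    # 7.0 -> Manhattan: distance along the axes

# Jaccard distance on binary presence/absence data: of the 6 species that
# occur at either site, 4 differ between the sites -> 4/6 = 0.67
site1 = [1, 1, 1, 1, 0, 0]
site2 = [1, 1, 0, 0, 1, 1]
jaccard(site1, site2)   # 0.666...
```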
If variables differ in measurement units (e.g. temperature and weight), scale the columns to mean = 0, sd = 1
→ same scale, so no single variable dominates the distances
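A quick sketch of column scaling; the toy temperature/weight matrix is an assumption:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two variables on very different scales: temperature (°C) and weight (g)
X = np.array([[12.0, 5200.0],
              [18.0, 4800.0],
              [25.0, 6100.0]])

# Scale each column to mean = 0, sd = 1
X_scaled = StandardScaler().fit_transform(X)
# by hand: (X - X.mean(axis=0)) / X.std(axis=0)
```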
Linkage: how we quantify the dissimilarity between clusters (compared in the sketch after this list):
- Single: minimum distance between clusters
  - Often leads to clusters of different sizes
  - Shape of clusters can become elongated
- Complete: maximum distance between clusters
  - Clusters become more compact
- Average: average distance between clusters
  - Handles outliers and noise well
- Ward: minimum variance method
  - Leads to more uniformly sized clusters
  - More difficult to compute, thus slower for large datasets
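The four linkage methods side by side in SciPy, on the same assumed toy distance matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))    # toy data (assumption)
D = pdist(X)                    # condensed matrix of pairwise Euclidean distances

# Same distances, different definitions of between-cluster dissimilarity:
Z_single   = linkage(D, method="single")    # minimum distance between clusters
Z_complete = linkage(D, method="complete")  # maximum distance between clusters
Z_average  = linkage(D, method="average")   # average distance between clusters
Z_ward     = linkage(D, method="ward")      # minimum-variance (Ward) criterion
```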