Data Mining
Clustering
Cluster analysis:
Finding groups of objects such that the objects in a group will be similar (or related) to one
another and different from (or unrelated to) the objects in other groups.
Intra-cluster distances (within a cluster) are minimized
Inter-cluster distances (between clusters) are maximized
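The two goals above can be made concrete with a small sketch. It assumes 2-D points, Euclidean distance, and a simple average over point pairs as the summary measure; the toy data and the function names are hypothetical, not part of the notes.

```python
from itertools import combinations
from math import dist

# Hypothetical toy data: two clusters of 2-D points.
clusters = {
    "a": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    "b": [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)],
}

def mean_intra_cluster_distance(clusters):
    """Average distance over all pairs of points that share a cluster."""
    distances = [
        dist(p, q)
        for points in clusters.values()
        for p, q in combinations(points, 2)
    ]
    return sum(distances) / len(distances)

def mean_inter_cluster_distance(clusters):
    """Average distance over all pairs of points in different clusters."""
    distances = [
        dist(p, q)
        for (_, pts_a), (_, pts_b) in combinations(clusters.items(), 2)
        for p in pts_a
        for q in pts_b
    ]
    return sum(distances) / len(distances)

intra = mean_intra_cluster_distance(clusters)
inter = mean_inter_cluster_distance(clusters)
print(f"intra={intra:.3f} inter={inter:.3f}")
```

For a good clustering of this kind of data, the intra-cluster average should come out clearly smaller than the inter-cluster average.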
Applications of Cluster Analysis:
Understanding: group related documents for browsing, group genes and proteins that have
similar functionality, or group stocks with similar price fluctuations.
Summarization: reduce the size of large data sets
What is not Cluster Analysis?
Supervised classification: uses class label information
Simple segmentation: dividing students into different registration groups
alphabetically, by last name
Results of a query: groupings are a result of an external specification
Graph partitioning: some mutual relevance and synergy with clustering, but the two areas are not identical
Types of Clusterings:
- A clustering is a set of clusters
- Important distinction between hierarchical and
partitional sets of clusters
- Partitional clustering:
A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one
subset.
- Hierarchical clustering:
A set of nested clusters organized as a hierarchical tree
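The two definitions above can be checked mechanically. The sketch below assumes clusters are represented as Python sets of object identifiers; `is_partition` and `is_nested` are hypothetical helper names introduced only for illustration.

```python
def is_partition(objects, clusters):
    """True if every object lies in exactly one cluster (partitional clustering)."""
    assigned = [obj for cluster in clusters for obj in cluster]
    no_overlap = len(assigned) == len(set(assigned))
    covers_all = set(assigned) == set(objects)
    return no_overlap and covers_all

def is_nested(clusters):
    """True if any two clusters are either disjoint or one contains the other."""
    sets = [set(c) for c in clusters]
    return all(
        a.isdisjoint(b) or a <= b or b <= a
        for i, a in enumerate(sets)
        for b in sets[i + 1:]
    )

objects = ["p1", "p2", "p3", "p4"]

# Partitional: each object appears in exactly one subset.
partitional = [{"p1", "p2"}, {"p3", "p4"}]

# Hierarchical: clusters nest inside each other like the nodes of a tree.
hierarchical = [
    {"p1", "p2", "p3", "p4"},
    {"p1", "p2"},
    {"p3", "p4"},
    {"p1"},
    {"p2"},
]

print(is_partition(objects, partitional))   # flat, non-overlapping
print(is_partition(objects, hierarchical))  # objects appear at several levels
print(is_nested(hierarchical))              # every pair is disjoint or nested
```

Taking the clusters at a single level of the hierarchical tree, for example `{"p1", "p2"}` and `{"p3", "p4"}`, gives back a partitional clustering.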
Other distinctions between Sets of Clusters:
Exclusive vs. non-exclusive:
In non-exclusive clusterings, points may belong to
multiple clusters.
Can represent multiple classes or ‘border’ points
Fuzzy vs. non-fuzzy:
In fuzzy clustering, a point belongs to every cluster with
some weight between 0 and 1
The weights of each point must sum to 1
Probabilistic clustering has similar characteristics
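As an illustration of fuzzy membership weights, the sketch below assigns a point to every cluster center with a weight proportional to the inverse squared Euclidean distance, then normalizes so the weights sum to 1. This weighting scheme is an assumption chosen for simplicity (the notes do not prescribe one), and `fuzzy_memberships` is a hypothetical name.

```python
from math import dist

def fuzzy_memberships(point, centers):
    """Return one weight per center; weights lie in [0, 1] and sum to 1."""
    squared = [dist(point, c) ** 2 for c in centers]
    # A point sitting exactly on one or more centers belongs fully to them.
    if any(d == 0.0 for d in squared):
        hits = [1.0 if d == 0.0 else 0.0 for d in squared]
        return [h / sum(hits) for h in hits]
    # Assumed weighting: closer centers receive larger weights.
    inverse = [1.0 / d for d in squared]
    total = sum(inverse)
    return [w / total for w in inverse]

centers = [(0.0, 0.0), (4.0, 0.0)]
weights = fuzzy_memberships((1.0, 0.0), centers)
print(weights)
```

Here the point lies at distance 1 from the first center and distance 3 from the second, so it belongs mostly, but not exclusively, to the first cluster.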
Partial vs. complete:
In some cases, we only want to cluster some of the data
Heterogeneous vs. homogeneous:
Clusters of widely different sizes, shapes, and densities