,Contents
Contents 1
1 Data Mining and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.7 Exercises 3
PART I DATA ANALYSIS FOUNDATIONS 5
2 Numeric Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.7 Exercises 7
3 Categorical Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.7 Exercises 16
4 Graph Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.6 Exercises 20
5 Kernel Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.6 Exercises 26
6 High-dimensional Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.9 Exercises 29
7 Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . . . 39
7.6 Exercises 39
PART II FREQUENT PATTERN MINING 45
8 Itemset Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
8.5 Exercises 47
9 Summarizing Itemsets . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
9.6 Exercises 56
1
,2 Contents
10 Sequence Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
10.5 Exercises 63
11 Graph Pattern Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
11.5 Exercises 75
12 Pattern and Rule Assessment . . . . . . . . . . . . . . . . . . . . . . . . 84
12.4 Exercises 84
PART III CLUSTERING 89
13 Representative-based Clustering . . . . . . . . . . . . . . . . . . . . . . 91
13.5 Exercises 91
14 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
14.4 Exercises 99
15 Density-based Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 106
15.5 Exercises 106
16 Spectral and Graph Clustering . . . . . . . . . . . . . . . . . . . . . . . 111
16.5 Exercises 111
17 Clustering Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
17.5 Exercises 118
PART IV CLASSIFICATION 123
18 Probabilistic Classification . . . . . . . . . . . . . . . . . . . . . . . . . 125
18.5 Exercises 125
19 Decision Tree Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
19.4 Exercises 129
20 Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . 137
20.4 Exercises 137
21 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . 141
21.7 Exercises 141
22 Classification Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . 145
22.5 Exercises 145
This book has been published by Cambridge University Press. No unauthorized distribution shall be allowed.
,CHAPTER 1 Data Mining and Analysis
1.7 EXERCISES
Q1. Show that the mean of the centered data matrix Z in Eq. (1.5) is 0.
Answer: Each centered point is given as: zi = xi − µ. Their mean is therefore:
n n
1X 1X
zi = (xi − µ)
n n
i=0 i=0
n
1X 1
= xi − · n · µ
n n
i=0
= µ−µ= 0
Q2. Prove that for the Lp -distance in Eq. (1.2), we have
d
δ∞ (x, y) = lim δp (x, y) = max |xi − yi |
p→∞ i=1
for x, y ∈ Rd .
Answer: We have to show that
d
! p1
X d
lim |xi − yi |p = max |xi − yi |
p→∞ i=1
i=1
Assume that dimension a is the max, and let m = |xa − ya |. For simplicity, we
assume that |xi − yi | < m for all i 6= a.
If we divide and multiply the left hand side with mp we get:
! p1 1
d
X X |xi − yi | p
p
|xi − yi | p
m p
= m 1 +
m m
i=1 i6=a
3
,4 Data Mining and Analysis
p
As p → ∞, each term |xi m−yi |
→ 0, since m > |xi − yi | for all i 6= a. The finite
P |xi −yi | p
summation i6=a m converges to 0 as p → ∞, as does 1/p.
Thus δ∞ (x, y) = m × 10 = m = |xa − ya | = maxdi=1 {|xi − yi |}
Note that the same result is obtained even if we assume that dimensions other
than a achieve the maximum value m. In the worst case, we have m = |xi − yi | for
all d dimensions. In this case, the expression on LHS becomes
d
!1/p
X
lim m p
1 = lim md 1/p = lim md 0 = m
p→∞ p→∞ p→∞
i=1
This book has been published by Cambridge University Press. No unauthorized distribution shall be allowed.
,P A R T ONE DATA ANALYSIS
FOUNDATIONS
,
,CHAPTER 2 Numeric Attributes
2.7 EXERCISES
Q1. True or False:
(a) Mean is robust against outliers.
Answer: False
(b) Median is robust against outliers.
Answer: True
(c) Standard deviation is robust against outliers.
Answer: False
Q2. Let X and Y be two random variables, denoting age and weight, respectively.
Consider a random sample of size n = 20 from these two variables
X = (69, 74, 68, 70, 72, 67, 66, 70, 76, 68, 72, 79, 74, 67, 66, 71, 74, 75, 75, 76)
Y = (153, 175, 155, 135, 172, 150, 115, 137, 200, 130, 140, 265, 185, 112, 140,
150, 165, 185, 210, 220)
(a) Find the mean, median, and mode for X.
Answer: The mean, median, and mode are:
2
1 X
µ= 0xi = 1429/20 = 71.45
20
i=1
median = (71 + 72)/2 = 71.5
mode = 74
(b) What is the variance for Y?
7
,8 Numeric Attributes
Answer: The mean of Y is µY = 3294/20 = 164.7. The variance is:
2
1 X
σY2 = 0yi − µY = 27384.2/20 = 1369.21
20
i=1
(c) Plot the normal distribution for X.
Answer: The mean for X is µX = 71.45, and the variance is σX2 = 13.8475, with
a standard deviation of σX = 3.72.
f (x)
0.10
0.05
0 x
60 65 70 75 80
(d) What is the probability of observing an age of 80 or higher?
Answer: If we leverage the empirical probability mass function, we get:
P (X ≥ 80) = 0/20 = 0
since we do not have anyone with age 80 or more in our sample.
We can use the normal distribution modeling, with parameters µX = 71.45 and
σX2 = 3.72 to get:
Z∞
P (X ≥ 80) = N(x|µX , σX ) = 0.010769
80
b for these two
(e) Find the 2-dimensional mean µ̂ and the covariance matrix 6
variables.
Answer: The mean and covariance matrices are:
µ = (µX , µY )T = (71.45, 164.7)T
2
σX σXY 13.8475 122.435
6= =
σXY σY2 122.435 1369.21
(f) What is the correlation between age and weight?
This book has been published by Cambridge University Press. No unauthorized distribution shall be allowed.
, 2.7 Exercises 9
Answer:
122.435
ρXY = σXY /(σX σY ) = √ = 0.889
13.845 · 1369.21
(g) Draw a scatterplot to show the relationship between age and weight.
Answer: The scatterplot is shown in the figure below.
bC
250
230
bC
bC
210
Y: Weight
bC
190 bC bC
bC
bC
170 bC
bC bC
bC bC
150
bC bC
bCbC
bC
130
bC
bC
110
65 67 69 71 73 75 77 79
X: Age
Q3. Show that the identity in Eq. (2.15) holds, that is,
n
X n
X
(xi − µ)2 = n(µ̂ − µ)2 + (xi − µ̂)2
i=1 i=1
Answer: Consider the RHS
n
X n
X
n(µ̂ − µ)2 + (xi − µ̂)2 = n(µ̂2 − 2µ̂µ + µ2 ) + (xi2 − 2xi µ̂ + µ̂2 )
i=1 i=1
n
!
X
2 2
= nµ̂ − 2nµ̂µ + nµ + xi2 − 2nµ̂2 + nµ̂2
i=1
n
!
X
= xi2 − 2nµ̂µ + nµ2
i=1
n
! Pn n
X X
i=1 xi
= xi2 − 2n µ+ µ2
n
i=1 i=1
n
X
= (xi − µ)2
i=1
This book has been published by Cambridge University Press. No unauthorized distribution shall be allowed.
Contents 1
1 Data Mining and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.7 Exercises 3
PART I DATA ANALYSIS FOUNDATIONS 5
2 Numeric Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.7 Exercises 7
3 Categorical Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.7 Exercises 16
4 Graph Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
4.6 Exercises 20
5 Kernel Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.6 Exercises 26
6 High-dimensional Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.9 Exercises 29
7 Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . . . . . . 39
7.6 Exercises 39
PART II FREQUENT PATTERN MINING 45
8 Itemset Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
8.5 Exercises 47
9 Summarizing Itemsets . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
9.6 Exercises 56
1
,2 Contents
10 Sequence Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
10.5 Exercises 63
11 Graph Pattern Mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
11.5 Exercises 75
12 Pattern and Rule Assessment . . . . . . . . . . . . . . . . . . . . . . . . 84
12.4 Exercises 84
PART III CLUSTERING 89
13 Representative-based Clustering . . . . . . . . . . . . . . . . . . . . . . 91
13.5 Exercises 91
14 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
14.4 Exercises 99
15 Density-based Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 106
15.5 Exercises 106
16 Spectral and Graph Clustering . . . . . . . . . . . . . . . . . . . . . . . 111
16.5 Exercises 111
17 Clustering Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
17.5 Exercises 118
PART IV CLASSIFICATION 123
18 Probabilistic Classification . . . . . . . . . . . . . . . . . . . . . . . . . 125
18.5 Exercises 125
19 Decision Tree Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
19.4 Exercises 129
20 Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . 137
20.4 Exercises 137
21 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . . 141
21.7 Exercises 141
22 Classification Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . 145
22.5 Exercises 145
This book has been published by Cambridge University Press. No unauthorized distribution shall be allowed.
,CHAPTER 1 Data Mining and Analysis
1.7 EXERCISES
Q1. Show that the mean of the centered data matrix Z in Eq. (1.5) is 0.
Answer: Each centered point is given as: zi = xi − µ. Their mean is therefore:
n n
1X 1X
zi = (xi − µ)
n n
i=0 i=0
n
1X 1
= xi − · n · µ
n n
i=0
= µ−µ= 0
Q2. Prove that for the Lp -distance in Eq. (1.2), we have
d
δ∞ (x, y) = lim δp (x, y) = max |xi − yi |
p→∞ i=1
for x, y ∈ Rd .
Answer: We have to show that
d
! p1
X d
lim |xi − yi |p = max |xi − yi |
p→∞ i=1
i=1
Assume that dimension a is the max, and let m = |xa − ya |. For simplicity, we
assume that |xi − yi | < m for all i 6= a.
If we divide and multiply the left hand side with mp we get:
! p1 1
d
X X |xi − yi | p
p
|xi − yi | p
m p
= m 1 +
m m
i=1 i6=a
3
,4 Data Mining and Analysis
p
As p → ∞, each term |xi m−yi |
→ 0, since m > |xi − yi | for all i 6= a. The finite
P |xi −yi | p
summation i6=a m converges to 0 as p → ∞, as does 1/p.
Thus δ∞ (x, y) = m × 10 = m = |xa − ya | = maxdi=1 {|xi − yi |}
Note that the same result is obtained even if we assume that dimensions other
than a achieve the maximum value m. In the worst case, we have m = |xi − yi | for
all d dimensions. In this case, the expression on LHS becomes
d
!1/p
X
lim m p
1 = lim md 1/p = lim md 0 = m
p→∞ p→∞ p→∞
i=1
This book has been published by Cambridge University Press. No unauthorized distribution shall be allowed.
,P A R T ONE DATA ANALYSIS
FOUNDATIONS
,
,CHAPTER 2 Numeric Attributes
2.7 EXERCISES
Q1. True or False:
(a) Mean is robust against outliers.
Answer: False
(b) Median is robust against outliers.
Answer: True
(c) Standard deviation is robust against outliers.
Answer: False
Q2. Let X and Y be two random variables, denoting age and weight, respectively.
Consider a random sample of size n = 20 from these two variables
X = (69, 74, 68, 70, 72, 67, 66, 70, 76, 68, 72, 79, 74, 67, 66, 71, 74, 75, 75, 76)
Y = (153, 175, 155, 135, 172, 150, 115, 137, 200, 130, 140, 265, 185, 112, 140,
150, 165, 185, 210, 220)
(a) Find the mean, median, and mode for X.
Answer: The mean, median, and mode are:
2
1 X
µ= 0xi = 1429/20 = 71.45
20
i=1
median = (71 + 72)/2 = 71.5
mode = 74
(b) What is the variance for Y?
7
,8 Numeric Attributes
Answer: The mean of Y is µY = 3294/20 = 164.7. The variance is:
2
1 X
σY2 = 0yi − µY = 27384.2/20 = 1369.21
20
i=1
(c) Plot the normal distribution for X.
Answer: The mean for X is µX = 71.45, and the variance is σX2 = 13.8475, with
a standard deviation of σX = 3.72.
f (x)
0.10
0.05
0 x
60 65 70 75 80
(d) What is the probability of observing an age of 80 or higher?
Answer: If we leverage the empirical probability mass function, we get:
P (X ≥ 80) = 0/20 = 0
since we do not have anyone with age 80 or more in our sample.
We can use the normal distribution modeling, with parameters µX = 71.45 and
σX2 = 3.72 to get:
Z∞
P (X ≥ 80) = N(x|µX , σX ) = 0.010769
80
b for these two
(e) Find the 2-dimensional mean µ̂ and the covariance matrix 6
variables.
Answer: The mean and covariance matrices are:
µ = (µX , µY )T = (71.45, 164.7)T
2
σX σXY 13.8475 122.435
6= =
σXY σY2 122.435 1369.21
(f) What is the correlation between age and weight?
This book has been published by Cambridge University Press. No unauthorized distribution shall be allowed.
, 2.7 Exercises 9
Answer:
122.435
ρXY = σXY /(σX σY ) = √ = 0.889
13.845 · 1369.21
(g) Draw a scatterplot to show the relationship between age and weight.
Answer: The scatterplot is shown in the figure below.
bC
250
230
bC
bC
210
Y: Weight
bC
190 bC bC
bC
bC
170 bC
bC bC
bC bC
150
bC bC
bCbC
bC
130
bC
bC
110
65 67 69 71 73 75 77 79
X: Age
Q3. Show that the identity in Eq. (2.15) holds, that is,
n
X n
X
(xi − µ)2 = n(µ̂ − µ)2 + (xi − µ̂)2
i=1 i=1
Answer: Consider the RHS
n
X n
X
n(µ̂ − µ)2 + (xi − µ̂)2 = n(µ̂2 − 2µ̂µ + µ2 ) + (xi2 − 2xi µ̂ + µ̂2 )
i=1 i=1
n
!
X
2 2
= nµ̂ − 2nµ̂µ + nµ + xi2 − 2nµ̂2 + nµ̂2
i=1
n
!
X
= xi2 − 2nµ̂µ + nµ2
i=1
n
! Pn n
X X
i=1 xi
= xi2 − 2n µ+ µ2
n
i=1 i=1
n
X
= (xi − µ)2
i=1
This book has been published by Cambridge University Press. No unauthorized distribution shall be allowed.