“Probabilistic Machine Learning: An Introduction”
Kevin Murphy
FULL SOLUTION MANUAL
Part I
Foundations
1 Solutions
1.1 Conditional independence
1. Bayes’ rule gives
$$P(H \mid E_1, E_2) = \frac{P(E_1, E_2 \mid H)\,P(H)}{P(E_1, E_2)} \tag{1}$$
Thus the information in (ii) is sufficient. In fact, we don’t need $P(E_1, E_2)$, because it is just the normalization constant (enforcing the sum-to-one constraint). (i) and (iii) are insufficient.
2. Now the equation simplifies to
$$P(H \mid E_1, E_2) = \frac{P(E_1 \mid H)\,P(E_2 \mid H)\,P(H)}{P(E_1, E_2)} \tag{2}$$
so (i) and (ii) are obviously sufficient. (iii) is also sufficient, because we can compute P (E1, E2) using
normalization.
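As a quick numerical illustration of why $P(E_1, E_2)$ is not needed, the sketch below (with made-up numbers for the prior and likelihood) recovers the posterior from the numerator of Bayes’ rule alone; the normalizer falls out of the sum:

```python
# Hypothetical numbers: prior P(H) and joint likelihood P(E1, E2 | H).
p_h = {True: 0.3, False: 0.7}            # prior P(H)
p_e_given_h = {True: 0.08, False: 0.01}  # P(E1, E2 | H)

# Numerator of Bayes' rule for each hypothesis
unnorm = {h: p_e_given_h[h] * p_h[h] for h in p_h}

# Summing the numerator over H yields P(E1, E2) for free
z = sum(unnorm.values())
posterior = {h: v / z for h, v in unnorm.items()}

print(posterior[True])  # P(H = true | E1, E2)
```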
1.2 Pairwise independence does not imply mutual independence
We provide two counterexamples.
Let X1 and X2 be independent binary random variables, and X3 = X1 ⊕ X2, where ⊕ is the XOR operator. We have p(X3|X1, X2) ≠ p(X3), since X3 can be deterministically calculated from X1 and X2. So the variables {X1, X2, X3} are not mutually independent. However, we also have p(X3|X1) = p(X3), since without X2, X1 provides no information about X3. So X1 ⊥ X3 and similarly X2 ⊥ X3. Since X1 ⊥ X2 by assumption, {X1, X2, X3} are pairwise independent.
Here is a different example. Let there be four balls in a bag, numbered 1 to 4. Suppose we draw one at
random. Define 3 events as follows:
• X1: ball 1 or 2 is drawn.
• X2: ball 2 or 3 is drawn.
• X3: ball 1 or 3 is drawn.
We have p(X1) = p(X2 ) = p(X3) = 0.5. Also, p(X1, X2) = p(X2, X3) = p(X1 , X3 ) = 0.25. Hence
p(X1, X2) = p(X1)p(X2), and similarly for the other pairs. Hence the events are pairwise independent.
However, p(X1, X2, X3) = 0 ≠ 1/8 = p(X1)p(X2)p(X3).
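The XOR counterexample is small enough to verify by exhaustive enumeration; the sketch below builds the joint distribution of (X1, X2, X3) and checks both claims:

```python
import itertools

# Joint distribution: X1, X2 are independent fair bits, X3 = X1 XOR X2.
joint = {}
for x1, x2 in itertools.product([0, 1], repeat=2):
    joint[(x1, x2, x1 ^ x2)] = 0.25

def marginal(idx):
    """Marginalize the joint onto the coordinates in idx."""
    p = {}
    for outcome, pr in joint.items():
        key = tuple(outcome[i] for i in idx)
        p[key] = p.get(key, 0.0) + pr
    return p

p1, p3 = marginal([0]), marginal([2])
p13 = marginal([0, 2])

# Pairwise independence: p(x1, x3) = p(x1) p(x3) for all values.
assert all(abs(p13[(a, c)] - p1[(a,)] * p3[(c,)]) < 1e-12
           for a in [0, 1] for c in [0, 1])

# Not mutually independent: e.g. (0, 0, 1) is impossible, yet the
# product of the three marginals would assign it probability 1/8.
assert joint.get((0, 0, 1), 0.0) == 0.0
```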
1.3 Conditional independence iff joint factorizes
Independence ⇒ Factorization. Let g(x, z) = p(x|z) and h(y, z) = p(y|z). If X ⊥ Y | Z then
$$p(x, y \mid z) = p(x \mid z)\,p(y \mid z) = g(x, z)\,h(y, z) \tag{3}$$
Factorization ⇒ Independence. If p(x, y|z) = g(x, z)h(y, z) then
$$1 = \sum_{x,y} p(x, y \mid z) = \sum_{x,y} g(x, z)h(y, z) = \Big[\sum_x g(x, z)\Big]\Big[\sum_y h(y, z)\Big] \tag{4}$$
$$p(x \mid z) = \sum_y p(x, y \mid z) = \sum_y g(x, z)h(y, z) = g(x, z)\sum_y h(y, z) \tag{5}$$
$$p(y \mid z) = \sum_x p(x, y \mid z) = \sum_x g(x, z)h(y, z) = h(y, z)\sum_x g(x, z) \tag{6}$$
Multiplying (5) and (6), and using (4) to cancel the product of the two sums, gives
$$p(x \mid z)\,p(y \mid z) = g(x, z)h(y, z)\Big[\sum_x g(x, z)\Big]\Big[\sum_y h(y, z)\Big] \tag{7}$$
$$= g(x, z)h(y, z) = p(x, y \mid z) \tag{8}$$
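The second direction can also be checked numerically on a small discrete example. In the sketch below the tables g and h are arbitrary nonnegative functions; normalizing their product into a conditional p(x, y|z) always yields a distribution under which X and Y are independent given Z:

```python
# Arbitrary nonnegative g(x, z) and h(y, z) on binary x, y, z.
g = {(0, 0): 0.2, (1, 0): 0.5, (0, 1): 0.9, (1, 1): 0.1}
h = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.4, (1, 1): 0.6}

for z in [0, 1]:
    # Normalize g * h into a conditional distribution p(x, y | z)
    norm = sum(g[(x, z)] * h[(y, z)] for x in [0, 1] for y in [0, 1])
    p_xy = {(x, y): g[(x, z)] * h[(y, z)] / norm
            for x in [0, 1] for y in [0, 1]}
    # Marginals p(x | z) and p(y | z)
    p_x = {x: sum(p_xy[(x, y)] for y in [0, 1]) for x in [0, 1]}
    p_y = {y: sum(p_xy[(x, y)] for x in [0, 1]) for y in [0, 1]}
    # Conditional independence: joint equals product of marginals
    assert all(abs(p_xy[(x, y)] - p_x[x] * p_y[y]) < 1e-12
               for x in [0, 1] for y in [0, 1])
```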
1.4 Convolution of two Gaussians is a Gaussian
We follow the derivation of [Jay03, p221]. Define
$$\phi(x - \mu \mid \sigma) = \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sigma}\,\phi\!\left(\frac{x - \mu}{\sigma}\right) \tag{9}$$
where $\phi(z)$ is the pdf of the standard normal. We have
$$p(y) = \mathcal{N}(x_1 \mid \mu_1, \sigma_1^2) \otimes \mathcal{N}(x_2 \mid \mu_2, \sigma_2^2) \tag{10}$$
$$= \int dx_1\, \phi(x_1 - \mu_1 \mid \sigma_1)\,\phi(y - x_1 - \mu_2 \mid \sigma_2) \tag{11}$$
Now the product inside the integral can be written as follows:
$$\phi(x_1 - \mu_1 \mid \sigma_1)\,\phi(y - x_1 - \mu_2 \mid \sigma_2) = \frac{1}{2\pi\sigma_1\sigma_2} \exp\left\{-\frac{1}{2}\left[\left(\frac{x_1 - \mu_1}{\sigma_1}\right)^2 + \left(\frac{y - x_1 - \mu_2}{\sigma_2}\right)^2\right]\right\} \tag{12}$$
We can bring out the dependency on $x_1$ by rearranging the quadratic form inside the exponent as follows:
$$\left(\frac{x_1 - \mu_1}{\sigma_1}\right)^2 + \left(\frac{y - x_1 - \mu_2}{\sigma_2}\right)^2 = (w_1 + w_2)(x_1 - \hat{x})^2 + \frac{w_1 w_2}{w_1 + w_2}\,\big(y - (\mu_1 + \mu_2)\big)^2 \tag{13}$$
where we have defined the precision or weighting terms $w_1 = 1/\sigma_1^2$, $w_2 = 1/\sigma_2^2$, and the term
$$\hat{x} = \frac{w_1 \mu_1 + w_2 y - w_2 \mu_2}{w_1 + w_2} \tag{14}$$
Note that
$$\frac{w_1 w_2}{w_1 + w_2} = \frac{1}{\sigma_1^2 \sigma_2^2}\,\frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2} = \frac{1}{\sigma_1^2 + \sigma_2^2} \tag{15}$$
Hence
$$p(y) = \frac{1}{2\pi\sigma_1\sigma_2} \int \exp\left\{-\frac{1}{2}(w_1 + w_2)(x_1 - \hat{x})^2 - \frac{1}{2}\,\frac{w_1 w_2}{w_1 + w_2}\,\big(y - (\mu_1 + \mu_2)\big)^2\right\} dx_1 \tag{16}$$
The integral over $x_1$ becomes one over the normalization constant for the Gaussian:
$$\int \exp\left[-\frac{1}{2}(w_1 + w_2)(x_1 - \hat{x})^2\right] dx_1 = (2\pi)^{\frac{1}{2}} \left(\frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}\right)^{\frac{1}{2}} \tag{17}$$
Hence
$$p(y) = (2\pi)^{-1}\,(\sigma_1^2 \sigma_2^2)^{-\frac{1}{2}}\,(2\pi)^{\frac{1}{2}} \left(\frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}\right)^{\frac{1}{2}} \exp\left[-\frac{1}{2(\sigma_1^2 + \sigma_2^2)}\,(y - \mu_1 - \mu_2)^2\right] \tag{18}$$
$$= (2\pi)^{-\frac{1}{2}}\,(\sigma_1^2 + \sigma_2^2)^{-\frac{1}{2}} \exp\left[-\frac{1}{2(\sigma_1^2 + \sigma_2^2)}\,(y - \mu_1 - \mu_2)^2\right] \tag{19}$$
$$= \mathcal{N}(y \mid \mu_1 + \mu_2,\ \sigma_1^2 + \sigma_2^2) \tag{20}$$
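The result (20) can be sanity-checked numerically: approximating the convolution integral (11) by a Riemann sum at a test point should reproduce $\mathcal{N}(y \mid \mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$. The parameters and test point below are arbitrary:

```python
import math

def norm_pdf(x, mu, var):
    """Gaussian density N(x | mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Arbitrary parameters for the two Gaussians, and an arbitrary test point y.
mu1, v1 = 1.0, 0.5
mu2, v2 = -2.0, 2.0
y = 0.3

# Riemann-sum approximation of integral over x1 of p1(x1) p2(y - x1)
lo, hi, n = -20.0, 20.0, 4000
dx = (hi - lo) / n
conv = sum(norm_pdf(lo + i * dx, mu1, v1) * norm_pdf(y - (lo + i * dx), mu2, v2)
           for i in range(n)) * dx

# Should agree with the closed-form convolution, eq. (20)
assert abs(conv - norm_pdf(y, mu1 + mu2, v1 + v2)) < 1e-6
```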
1.5 Expected value of the minimum of two rv’s
Let Z = min(X, Y), where X and Y are independent Unif(0, 1) random variables. We have P(Z > z) = P(X > z, Y > z) = P(X > z)P(Y > z) = (1 − z)² = 1 + z² − 2z. Hence the cdf of the minimum is P(Z ≤ z) = 2z − z² and the pdf is p(z) = 2 − 2z. Hence the expected value is
$$E[Z] = \int_0^1 z(2 - 2z)\,dz = 1/3 \tag{21}$$
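A quick Monte Carlo sketch of (21); the sample size and seed are arbitrary choices:

```python
import random

# Estimate E[min(X, Y)] for independent X, Y ~ Unif(0, 1); should be near 1/3.
random.seed(0)
n = 200_000
est = sum(min(random.random(), random.random()) for _ in range(n)) / n

assert abs(est - 1 / 3) < 0.01
```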
1.6 Variance of a sum
We have
$$V[X + Y] = E[(X + Y)^2] - (E[X] + E[Y])^2 \tag{22}$$
$$= E[X^2 + Y^2 + 2XY] - (E[X]^2 + E[Y]^2 + 2E[X]E[Y]) \tag{23}$$
$$= E[X^2] - E[X]^2 + E[Y^2] - E[Y]^2 + 2E[XY] - 2E[X]E[Y] \tag{24}$$
$$= V[X] + V[Y] + 2\,\mathrm{Cov}[X, Y] \tag{25}$$
If X and Y are independent, then Cov [X, Y ] = 0, so V [X + Y ] = V [X] + V [Y ].
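Identity (25) can be checked on any discrete joint distribution; the table below is arbitrary (and deliberately correlated, so the covariance term matters):

```python
# Arbitrary joint distribution over binary (X, Y), with X and Y correlated.
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.4}

def expect(f):
    """Expectation of f(X, Y) under the joint."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

ex, ey = expect(lambda x, y: x), expect(lambda x, y: y)
vx = expect(lambda x, y: x * x) - ex ** 2
vy = expect(lambda x, y: y * y) - ey ** 2
cov = expect(lambda x, y: x * y) - ex * ey
v_sum = expect(lambda x, y: (x + y) ** 2) - (ex + ey) ** 2

# V[X + Y] = V[X] + V[Y] + 2 Cov[X, Y]
assert abs(v_sum - (vx + vy + 2 * cov)) < 1e-12
```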
1.7 Deriving the inverse gamma density
Let $x \sim \mathrm{Ga}(a, b)$ (rate parameterization) and $y = 1/x$, so $x = 1/y$. We have
$$p_y(y) = p_x(x)\left|\frac{dx}{dy}\right| \tag{26}$$
where
$$\frac{dx}{dy} = -\frac{1}{y^2} = -x^2 \tag{27}$$
So
$$p_y(y) = x^2\,\frac{b^a}{\Gamma(a)}\,x^{a-1} e^{-xb} \tag{28}$$
$$= \frac{b^a}{\Gamma(a)}\,x^{a+1} e^{-xb} \tag{29}$$
$$= \frac{b^a}{\Gamma(a)}\,y^{-(a+1)} e^{-b/y} = \mathrm{IG}(y \mid a, b) \tag{30}$$
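The change of variables (26)–(30) can be verified pointwise: evaluating $p_x(x)\,|dx/dy|$ at $x = 1/y$ should match the inverse-gamma density in (30). The shape and rate values below are arbitrary:

```python
import math

a, b = 3.0, 2.0  # arbitrary shape and rate

def gamma_pdf(x):
    """Ga(x | a, b), rate parameterization, as in eq. (28)."""
    return b ** a / math.gamma(a) * x ** (a - 1) * math.exp(-b * x)

def invgamma_pdf(y):
    """IG(y | a, b), as in eq. (30)."""
    return b ** a / math.gamma(a) * y ** (-(a + 1)) * math.exp(-b / y)

for y in [0.3, 1.0, 2.5]:
    x = 1.0 / y
    # p_y(y) = p_x(x) |dx/dy|, with |dx/dy| = 1 / y^2
    assert abs(invgamma_pdf(y) - gamma_pdf(x) / y ** 2) < 1e-12
```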