Exercise solutions
Linear text classification
1. Let x be a bag-of-words vector such that $\sum_{j=1}^{V} x_j = 1$. Verify that the multinomial probability $p_{\mathrm{mult}}(x; \phi)$, as defined in Equation 2.12, is identical to the probability of the same document under a categorical distribution, $p_{\mathrm{cat}}(w; \phi)$.
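A quick numerical check of this equivalence, as a minimal Python/NumPy sketch; the vocabulary size V and the parameter vector φ below are arbitrary illustrative values, not taken from the text.

```python
# Verify: for a single-token document (sum of x equals 1), the multinomial
# probability reduces to the categorical probability of that one word.
import numpy as np
from math import factorial

V = 4                                     # illustrative vocabulary size
rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(V))           # a valid probability vector over the vocabulary

for j in range(V):
    x = np.zeros(V, dtype=int)
    x[j] = 1                              # bag-of-words vector with exactly one token, word j

    # multinomial pmf: n! / prod_j(x_j!) * prod_j(phi_j ** x_j), with n = sum(x) = 1
    n = int(x.sum())
    coef = factorial(n) / np.prod([factorial(int(c)) for c in x])
    p_mult = coef * np.prod(phi ** x)

    p_cat = phi[j]                        # categorical probability of word j
    assert np.isclose(p_mult, p_cat)

print("p_mult(x; phi) == p_cat(w; phi) for all single-token documents")
```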
2. Suppose you have a single feature x, with the following conditional distribution:
p(x \mid y) = \begin{cases} \alpha, & X = 0, Y = 0 \\ 1 - \alpha, & X = 1, Y = 0 \\ 1 - \beta, & X = 0, Y = 1 \\ \beta, & X = 1, Y = 1 \end{cases} \qquad [B.23]
Further suppose that the prior is uniform, $\Pr(Y = 0) = \Pr(Y = 1) = \frac{1}{2}$, and that both $\alpha > \frac{1}{2}$ and $\beta > \frac{1}{2}$. Given a Naïve Bayes classifier with accurate parameters,
what is the probability of making an error?
Answer:
\hat{y}(X = 0) = 0 \qquad [B.24]
\hat{y}(X = 1) = 1 \qquad [B.25]
\Pr(\hat{y} = 0 \mid Y = 1) = \Pr(X = 0 \mid Y = 1) = 1 - \beta \qquad [B.26]
\Pr(\hat{y} = 1 \mid Y = 0) = \Pr(X = 1 \mid Y = 0) = 1 - \alpha \qquad [B.27]
\Pr(\hat{y} \neq y) = \frac{1}{2}\left((1 - \beta) + (1 - \alpha)\right) \qquad [B.28]
= 1 - \frac{1}{2}(\alpha + \beta) \qquad [B.29]
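The analytic error rate in [B.29] can be sanity-checked by simulation; the sketch below uses arbitrary values of α and β (both above 1/2) and is not part of the original solution.

```python
# Monte Carlo check of Pr(error) = 1 - (alpha + beta) / 2 under a uniform prior.
import numpy as np

alpha, beta = 0.8, 0.7                    # illustrative values, both > 1/2
rng = np.random.default_rng(1)
n = 200_000

y = rng.integers(0, 2, size=n)            # uniform prior: Pr(Y=0) = Pr(Y=1) = 1/2
p_x1 = np.where(y == 0, 1 - alpha, beta)  # Pr(X = 1 | Y)
x = (rng.random(n) < p_x1).astype(int)

y_hat = x                                 # decision rule from [B.24]-[B.25]: predict y = x
empirical_error = np.mean(y_hat != y)
analytic_error = 1 - 0.5 * (alpha + beta)
print(empirical_error, analytic_error)    # with alpha=0.8, beta=0.7 both are close to 0.25
```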
3. Derive the maximum-likelihood estimate for the parameter µ in Naı̈ve Bayes.
Answer:
L(\mu) = \sum_{i=1}^{N} \log p_{\mathrm{cat}}(y^{(i)}; \mu) \qquad [B.30]
= \sum_{i=1}^{N} \log \mu_{y^{(i)}} \qquad [B.31]
\ell(\mu) = \sum_{i=1}^{N} \log \mu_{y^{(i)}} - \lambda \left( \sum_{y=1}^{K} \mu_y - 1 \right) \qquad [B.32]
\frac{\partial \ell(\mu)}{\partial \mu_y} = \sum_{i=1}^{N} \frac{\delta(y^{(i)} = y)}{\mu_y} - \lambda \qquad [B.33]
Setting this derivative to zero and choosing \lambda so that \sum_{y=1}^{K} \mu_y = 1 gives
\mu_y \propto \sum_{i=1}^{N} \delta(y^{(i)} = y), \qquad [B.34]
i.e., \mu_y is the relative frequency of label y in the training data.
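Equivalently, [B.34] says the estimate is the empirical label frequency. A minimal numerical illustration (the labels below are made up) compares it against randomly drawn alternatives:

```python
# The normalized solution of [B.34]: mu_y = count(y) / N, which attains the
# highest log-likelihood among candidate parameter vectors.
import numpy as np

y = np.array([0, 2, 1, 1, 0, 1, 2, 1])    # hypothetical training labels, K = 3 classes
K = 3

counts = np.bincount(y, minlength=K)
mu_mle = counts / counts.sum()             # relative frequencies: [0.25, 0.5, 0.25]

def log_likelihood(mu):
    return np.sum(np.log(mu[y]))           # sum_i log mu_{y^{(i)}}, as in [B.31]

rng = np.random.default_rng(2)
for _ in range(1000):
    mu_other = rng.dirichlet(np.ones(K))   # an arbitrary alternative parameter vector
    assert log_likelihood(mu_other) <= log_likelihood(mu_mle) + 1e-12

print(mu_mle)
```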
4. The classification models in the text have a vector of weights for each possible label.
While this is notationally convenient, it is overdetermined: for any linear classifier
that can be obtained with K × V weights, an equivalent classifier can be constructed
using (K − 1) × V weights.
a) Describe how to construct this classifier. Specifically, if given a set of weights
θ and a feature function f (x, y), explain how to construct alternative weights
and feature function $\theta'$ and $f'(x, y)$, such that
\forall y, y' \in \mathcal{Y}, \quad \theta \cdot f(x, y) - \theta \cdot f(x, y') = \theta' \cdot f'(x, y) - \theta' \cdot f'(x, y'). \qquad [B.35]
b) Explain how your construction justifies the well-known alternative form for binary logistic regression, $\Pr(Y = 1 \mid x; \theta) = \frac{1}{1 + \exp(-\theta' \cdot x)} = \sigma(\theta' \cdot x)$, where $\sigma$ is the sigmoid function.
Answer:
a) Let $\theta_{K,j}$ indicate the weight for base feature $j$ in class $K$. Then set $\theta'_{k,j} = \theta_{k,j} - \theta_{K,j}$ and $f'(x, y) = f(x, y)$ for all $y < K$. This means that $\theta' \cdot f'(x, K) = 0$.
b) In binary classification, $\theta' = \theta_0 - \theta_1$.
\Pr(Y = 0 \mid x; \theta) = \frac{\exp(\theta \cdot f(x, 0))}{\exp(\theta \cdot f(x, 0)) + \exp(\theta \cdot f(x, 1))} \qquad [B.36]
= \frac{1}{1 + \exp(\theta \cdot f(x, 1) - \theta \cdot f(x, 0))} \qquad [B.37]
= \frac{1}{1 + \exp(-\theta' \cdot x)}. \qquad [B.38]
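The construction can also be checked numerically: shifting every class's weights by the last class's weights leaves the softmax probabilities unchanged, and the two-class case collapses to the sigmoid form in [B.38]. The weights and features below are random, illustrative values, not from the text.

```python
# Softmax probabilities are invariant to the (K-1) x V reparameterization, and the
# binary case reduces to a sigmoid of the weight difference.
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())      # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(3)
K, V = 4, 5
theta = rng.normal(size=(K, V))            # one weight vector per label
x = rng.normal(size=V)                     # base features

p_full = softmax(theta @ x)
theta_prime = theta - theta[-1]            # theta'_k = theta_k - theta_K; last row is now zero
p_reduced = softmax(theta_prime @ x)
assert np.allclose(p_full, p_reduced)

# Binary case: with theta' = theta_0 - theta_1, Pr(Y = 0 | x) = sigmoid(theta' . x)
theta2 = rng.normal(size=(2, V))
p0 = softmax(theta2 @ x)[0]
theta_diff = theta2[0] - theta2[1]
assert np.isclose(p0, 1.0 / (1.0 + np.exp(-theta_diff @ x)))
print("probabilities unchanged under the (K-1) x V construction")
```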
5. Suppose you have two labeled datasets D1 and D2, with the same features and labels.
   • Let θ^{(1)} be the unregularized logistic regression (LR) coefficients from training on dataset D1.
   • Let θ^{(2)} be the unregularized LR coefficients (same model) from training on dataset D2.
   • Let θ^* be the unregularized LR coefficients from training on the combined dataset D1 ∪ D2.
   Under these conditions, prove that for any feature j,
   \theta^*_j \geq \min\left(\theta^{(1)}_j, \theta^{(2)}_j\right)
   \theta^*_j \leq \max\left(\theta^{(1)}_j, \theta^{(2)}_j\right).
6. Let $\hat{\theta}$ be the solution to an unregularized logistic regression problem, and let $\theta^*$ be the solution to the same problem with $L_2$ regularization. Prove that $\|\theta^*\|_2^2 \leq \|\hat{\theta}\|_2^2$.
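A rough numerical illustration of this claim on a toy problem; the data, step size, and regularization strength below are arbitrary choices, and the logistic regression is fit by plain gradient descent rather than any code from the text.

```python
# On non-separable synthetic data, the L2-regularized solution has a smaller
# squared norm than the unregularized one.
import numpy as np

rng = np.random.default_rng(4)
n, d = 500, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])
p = 1.0 / (1.0 + np.exp(-(X @ true_w)))
y = (rng.random(n) < p).astype(float)      # noisy labels, so the data are not separable

def fit_logreg(X, y, l2=0.0, lr=0.1, steps=20_000):
    """Batch gradient descent on the mean negative log-likelihood (+ optional L2 penalty)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        q = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (q - y) / len(y) + l2 * w
        w -= lr * grad
    return w

theta_hat = fit_logreg(X, y, l2=0.0)       # unregularized solution
theta_star = fit_logreg(X, y, l2=0.5)      # L2-regularized solution

print(np.sum(theta_star ** 2), np.sum(theta_hat ** 2))
assert np.sum(theta_star ** 2) <= np.sum(theta_hat ** 2)
```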