CS229T/STATS231 (Fall 2018–2019)
Note: please do not copy or distribute.
Due date: Wed, Oct 10, 11pm
1. Value of labeled data (11 points)
In many applications, labeled data is expensive and therefore limited, while unlabeled data is cheap and
therefore abundant. For example, there are tons of images on the web, but getting labeled images is much
harder. What is the statistical value of having labeled data versus unlabeled data? This problem will explore
this formally using asymptotics.
Specifically, suppose we have an exponential family model over a discrete latent variable h and a discrete
observed variable x:
pθ (h, x) = exp{θ · φ(h, x) − A(θ)},
P
where A(θ) = log h,x exp{θ · φ(h, x)} is the usual log-partition function.
Suppose that n examples (h(1) , x(1) ), . . . , (h(n) , x(n) ) are drawn i.i.d. from some true distribution pθ∗ .
Define the following two estimators:
n
1X
θ̂sup = arg max log pθ (h(i) , x(i) ) (1)
θ∈Rd n i=1
n
1X X
θ̂unsup = arg max log pθ (h, x(i) ). (2)
θ∈Rd n i=1
h
The supervised estimator θ̂sup uses the variable h(i) and maximizes the joint likelihood, while the unsuper-
vised estimator θ̂unsup marginalizes out the latent variable h.
One important caveat: our results will hold when we assume that data is actually generated from our
model family and that unsupervised learning is possible. Otherwise, labeled data is worth a lot more.
a. (2 points) (supervised asymptotic variance) Compute the asymptotic variance of θ̂sup : that is,
√ d
given that n(θ̂sup − θ∗ ) −
→ N (0, Vsup ), write an expression for Vsup that depends on expectations/variances
involving φ.
Solution:
Using notation from class, let ` denote the log-likelihood and L be the expected log-likelihood. Recall that
Vsup = ∇2 L(θ∗ )−1 Covθ∗ [∇`(z, θ∗ )]∇2 L(θ∗ )−1 = Covθ∗ [∇`(z, θ∗ )]−1 ,
where the second equality follows from Bartlett’s identity since we have assumed the model is well specified.
Now, letting z = (h, x),
∇`(z, θ∗ ) = ∇(θ∗ · φ(h, x) − A(θ∗ )) = φ(h, x) − Eθ∗ [φ(h, x)],
so
−1
Vsup = Covθ∗ [φ(h, x) − Eθ∗ [∇`(z, θ∗ )]] = Covθ∗ [φ(h, x)]−1 .
1
, b. (2 points) (unsupervised asymptotic variance) Compute the asymptotic variance of θ̂unsup : that
√ d
is, given that n(θ̂unsup − θ∗ ) −
→ N (0, Vunsup ), write an expression for Vunsup that depends on expectation-
s/variances involving φ.
Solution:
Similar to the previous part, we have
X
∇`(z, θ∗ ) = ∇ log exp{θ∗ · φ(h, x) − A(θ∗ )}
h
P
· (φ(h, x) − Eθ∗ [φ(h, x)])
h pθ (h, x)P
∗
=
h pθ (h, x)
∗
= Eθ∗ [φ(h, x) | x] − Eθ∗ [φ(h, x)].
Therefore,
−1 −1
Vunsup = Covθ∗ [Eθ∗ [φ(h, x) | x] − Eθ∗ [φ(h, x)]] = Covθ∗ [Eθ∗ [φ(h, x) | x]] .
c. (3 points) (comparing estimators) Prove that θ̂sup has lower (or equal) asymptotic variance
compared to θ̂unsup . That is, show that
Vsup Vunsup ,
Solution:
We have,
−1
Vsup = Covθ∗ [φ(h, x)] = Eθ∗ [Covθ∗ [φ(h, x) | x]] + Covθ∗ [Eθ∗ [φ(h, x) | x]]
−1
= Eθ∗ [Covθ∗ [φ(h, x) | x]] + Vunsup
−1
Vunsup ,
where the last inequality follows since the covariance is positive semi-definite. Therefore, Vsup Vunsup ; i.e.,
θ̂sup has lower asymptotic variance.
d. (4 points)
Consider the exponential family
1
pθ (h, x) = exp(θhx),
Z
where h, x ∈ {0, 1} and Z = h,x∈{0,1}2 exp(θhx).1 Essentially, (h, x) is a pair of correlated biased coin
P
flips, where pθ (1, 1) = exp(θ)/Z and pθ (0, 0) = pθ (0, 1) = pθ (1, 0) = 1/Z.
1Z is often referred to as the partition function.
2