Machine Learning: A Probabilistic Perspective by Kevin P. Murphy
Solutions guide, Chapters 1-28
Chapter 1
Introduction
1.1 Solutions
1.1.1 KNN classifier on shuffled MNIST data
We just have to insert the following piece of code.
Listing 1.1: Part of mnistShuffled1NNdemo
... load data
%% permute columns
D = 28*28;
setSeed(0); perm = randperm(D);
Xtrain = Xtrain(:, perm);
Xtest = Xtest(:, perm);
... same as before
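The reason shuffling the pixels leaves 1NN performance unchanged is that Euclidean distance is invariant to any fixed permutation of the coordinates, as long as the same permutation is applied to the training and test sets. A minimal Python sketch of this invariance (a brute-force toy, not the book's MATLAB demo):

```python
import random

def dist2(a, b):
    # squared Euclidean distance between two feature vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nn1_predict(Xtr, ytr, x):
    # brute-force 1-nearest-neighbor prediction
    j = min(range(len(Xtr)), key=lambda i: dist2(Xtr[i], x))
    return ytr[j]

random.seed(0)
D = 16  # toy feature dimension (28*28 = 784 in the MNIST demo)
Xtrain = [[random.random() for _ in range(D)] for _ in range(50)]
ytrain = [random.randint(0, 9) for _ in range(50)]
Xtest = [[random.random() for _ in range(D)] for _ in range(20)]

perm = list(range(D))
random.shuffle(perm)  # a fixed, random column permutation

def shuffle_cols(X):
    return [[row[j] for j in perm] for row in X]

preds = [nn1_predict(Xtrain, ytrain, x) for x in Xtest]
preds_perm = [nn1_predict(shuffle_cols(Xtrain), ytrain, x)
              for x in shuffle_cols(Xtest)]
assert preds == preds_perm  # identical predictions on every test point
```

Because every pairwise distance is preserved exactly, every nearest neighbor (and hence every prediction) is preserved too.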
1.1.2 Approximate KNN classifiers
According to John Chia, the following code will work.
Listing 1.2:
[result, ndists] = flann_search(Xtrain', Xtest', 1, ...
    struct('algorithm', 'kdtree', 'trees', 8, 'checks', 64));
errorRate = mean(ytrain(result) ~= ytest)
He reports the following results on MNIST with 1NN.
             ntests = 1,000         ntests = 10,000
             Err      Time          Err      Time
FLANN        4.8%     17s           3.35%    17.2s
Vanilla      3.8%     3.68s         3.09%    28.36s
So the approximate method is somewhat faster for large test sets, but is slightly less accurate.
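FLANN itself is a C++ library with kd-tree and k-means tree indices; the source of the approximation is that the `checks` parameter caps how much of the tree the search is allowed to explore. As a rough illustration of that idea only, here is a minimal Python kd-tree whose node-visit budget plays the role of `checks` (the `build`/`search` functions are mine, not FLANN's API):

```python
import random

def build(pts, idxs, depth=0):
    # recursively split the point set on alternating coordinates
    if not idxs:
        return None
    d = depth % len(pts[0])
    idxs = sorted(idxs, key=lambda i: pts[i][d])
    m = len(idxs) // 2
    return (idxs[m],
            build(pts, idxs[:m], depth + 1),
            build(pts, idxs[m + 1:], depth + 1))

def search(node, pts, q, budget, depth=0, best=None):
    # 'budget' is a 1-element list capping how many nodes may be visited;
    # a small budget makes the search approximate (cf. FLANN's 'checks')
    if node is None or budget[0] <= 0:
        return best
    budget[0] -= 1
    i, left, right = node
    d2 = sum((a - b) ** 2 for a, b in zip(pts[i], q))
    if best is None or d2 < best[0]:
        best = (d2, i)
    axis = depth % len(q)
    near, far = (left, right) if q[axis] < pts[i][axis] else (right, left)
    best = search(near, pts, q, budget, depth + 1, best)
    if (q[axis] - pts[i][axis]) ** 2 < best[0]:  # ball crosses the split plane
        best = search(far, pts, q, budget, depth + 1, best)
    return best

random.seed(1)
pts = [[random.random() for _ in range(3)] for _ in range(500)]
tree = build(pts, list(range(500)))
q = [0.5, 0.5, 0.5]

exact = min(range(500),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(pts[i], q)))
_, found = search(tree, pts, q, budget=[10**9])
assert found == exact  # with an unlimited budget the search is exact
```

With an unlimited budget this is an exact kd-tree search; shrinking the budget (e.g. `budget=[5]`) trades accuracy for speed, which is exactly the Err/Time tradeoff in the table above.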
1.1.3 CV for KNN
See Figure 1.1(b). The CV estimate is an overestimate of the test error, but has the right shape. Note, however, that the empirical
test error is only based on 500 test points. A better comparison would use a much larger test set.
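The cross-validation procedure itself is simple: hold out each fold in turn, fit on the rest, and average the held-out errors. A self-contained Python sketch (the book's demo is the MATLAB script knnClassifyDemo; the toy data here is made up for illustration):

```python
import random

def knn_predict(Xtr, ytr, x, K):
    # majority vote among the K nearest training points (Euclidean distance)
    near = sorted(range(len(Xtr)),
                  key=lambda j: sum((a - b) ** 2
                                    for a, b in zip(Xtr[j], x)))[:K]
    votes = [ytr[j] for j in near]
    return max(set(votes), key=votes.count)

def cv_error(X, y, K, nfolds=5):
    # nfolds-fold cross-validation estimate of the misclassification rate
    n = len(X)
    errs = 0
    for f in range(nfolds):
        test = set(range(f, n, nfolds))
        Xtr = [X[i] for i in range(n) if i not in test]
        ytr = [y[i] for i in range(n) if i not in test]
        errs += sum(knn_predict(Xtr, ytr, X[i], K) != y[i] for i in test)
    return errs / n

random.seed(0)
# two well-separated 2-d classes, 100 points each
X = ([[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(100)] +
     [[random.gauss(4, 1), random.gauss(4, 1)] for _ in range(100)])
y = [0] * 100 + [1] * 100

for K in (1, 5, 25):
    print(K, cv_error(X, y, K))
```

Sweeping K and picking the minimizer of `cv_error` is the model-selection procedure whose estimate is plotted in Figure 1.1(b).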
Figure 1.1: (a) Misclassification rate vs K in a K-nearest neighbor classifier. On the left, where K is small, the model is complex and hence
we overfit. On the right, where K is large, the model is simple and we underfit. Dotted blue line: training set (size 200). Solid red line: test
set (size 500). (b) 5-fold cross validation estimate of test error. Figure generated by knnClassifyDemo.
Chapter 2
Probability
2.1 Solutions
2.1.1 Probabilities are sensitive to the form of the question that was used to generate the answer
1. The event space is shown below, where X is one child and Y the other.
X Y Prob.
G G 1/4
G B 1/4
B G 1/4
B B 1/4
Let Ng be the number of girls and Nb the number of boys. We have the constraint (side information) that Nb + Ng = 2
and 0 ≤ Nb, Ng ≤ 2. We are told Nb ≥ 1 and are asked to compute the probability of the event Ng = 1 (i.e., one child
is a girl). By Bayes' rule we have

    p(Ng = 1|Nb ≥ 1) = p(Nb ≥ 1|Ng = 1) p(Ng = 1) / p(Nb ≥ 1)    (2.1)
                     = (1 × 1/2) / (3/4) = 2/3                    (2.2)
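The numbers in (2.1)-(2.2) can be checked by direct enumeration over the four equally likely outcomes, using exact arithmetic (a Python sketch with the standard `fractions` module):

```python
from fractions import Fraction

outcomes = [("G", "G"), ("G", "B"), ("B", "G"), ("B", "B")]  # each has prob 1/4
p = Fraction(1, 4)

p_nb_ge_1 = sum(p for o in outcomes if "B" in o)          # p(Nb >= 1) = 3/4
p_both = sum(p for o in outcomes
             if o.count("G") == 1 and "B" in o)           # p(Ng = 1, Nb >= 1) = 1/2
print(p_both / p_nb_ge_1)  # 2/3
```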
2. Let Y be the identity of the observed child and X be the identity of the other child. We want p(X = g|Y = b). By Bayes'
rule we have

    p(X = g|Y = b) = p(Y = b|X = g) p(X = g) / p(Y = b)    (2.3)
                   = ((1/2) × (1/2)) / (1/2) = 1/2         (2.4)
Tom Minka (Minka 1998) has written the following about these results:
This seems like a paradox because it seems that in both cases we could condition on the fact that "at least one child
is a boy." But that is not correct; you must condition on the event actually observed, not its logical implications. In
the first case, the event was "He said yes to my question." In the second case, the event was "One child appeared in
front of me." The generating distribution is different for the two events. Probabilities reflect the number of possible
ways an event can happen, like the number of roads to a town. Logical implications are further down the road and
may be reached in more ways, through different towns. The different number of ways changes the probability.
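Minka's point — that the two observation processes have different generating distributions — can be made concrete by simulating both processes on the same stream of random families (a Python sketch):

```python
import random

random.seed(0)
n = 200_000
asked = asked_girl = 0   # case 1: we ask "is at least one child a boy?"
seen = seen_girl = 0     # case 2: a randomly chosen child appears and is a boy

for _ in range(n):
    kids = [random.choice("GB"), random.choice("GB")]
    if "B" in kids:                   # parent answers "yes" to the question
        asked += 1
        asked_girl += kids.count("G") == 1
    if random.choice(kids) == "B":    # the child we happen to see is a boy
        seen += 1
        seen_girl += kids.count("G") == 1

print(asked_girl / asked, seen_girl / seen)
```

The first ratio converges to 2/3 and the second to 1/2: same logical fact ("at least one boy"), different events, different probabilities.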
2.1.2 Legal reasoning
Let E be the evidence (the observed blood type), I be the event that the defendant is innocent, and G = ¬I be the event
that the defendant is guilty.
1. The prosecutor is confusing p(E|I) with p(I|E). We are told that p(E|I) = 0.01 but the relevant quantity is p(I|E). By
Bayes' rule, this is

    p(I|E) = p(E|I) p(I) / [p(E|I) p(I) + p(E|G) p(G)]
           = 0.01 p(I) / [0.01 p(I) + (1 − p(I))]    (2.5)

since p(E|G) = 1 and p(G) = 1 − p(I). So we cannot determine p(I|E) without knowing the prior
probability p(I). So p(E|I) = p(I|E) only if p(G) = p(I) = 0.5, which is hardly a presumption of
innocence.
To understand this more intuitively, consider the following isomorphic problem (from
http://en.wikipedia.org/wiki/Prosecutor's_fallacy):
A big bowl is filled with a large but unknown number of balls. Some of the balls are made of wood,
and some of them are made of plastic. Of the wooden balls, 100% are white; of the plastic balls, 99%
are red and only 1% are white. A ball is pulled out at random, and observed to be white.
Without knowledge of the relative proportions of wooden and plastic balls, we cannot tell how likely it is that
the ball is wooden. If the number of plastic balls is far larger than the number of wooden balls, for instance,
then a white ball pulled from the bowl at random is far more likely to be a white plastic ball than a white
wooden ball — even though white plastic balls are a minority of the whole set of plastic balls.
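Under the reading that every wooden ball is white while only 1% of plastic balls are white (mirroring p(E|G) = 1 and p(E|I) = 0.01 in the legal problem), a quick Python check shows how strongly the answer depends on the unknown proportion of wooden balls:

```python
def p_wooden_given_white(p_wooden):
    # Bayes' rule with p(white|wooden) = 1.0 and p(white|plastic) = 0.01
    num = 1.0 * p_wooden
    return num / (num + 0.01 * (1.0 - p_wooden))

for p_wooden in (0.5, 0.01, 0.0001):
    print(p_wooden, p_wooden_given_white(p_wooden))
```

If half the balls are wooden, a white draw is almost surely wooden (≈ 0.99); if only 0.01% are wooden, the same white draw is almost surely plastic (posterior below 1%). The likelihood is identical in every case; only the prior changes.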
2. The defender is quoting p(G|E) while ignoring p(G). The prior odds are
    p(G)/p(I) = 1/799,999    (2.6)

The posterior odds are

    p(G|E)/p(I|E) = 1/7999    (2.7)
So the evidence has increased the odds of guilt by a factor of 100. This is clearly relevant, although perhaps still not
enough to find the suspect guilty.
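The odds calculation in (2.6)-(2.7) is a one-liner to verify: multiplying the prior odds by the likelihood ratio p(E|G)/p(E|I) = 1/0.01 = 100 gives the posterior odds (a Python sketch, taking the city population of 800,000 from the exercise):

```python
population = 800_000               # one person in the city is guilty
p_g = 1 / population
prior_odds = p_g / (1 - p_g)       # = 1/799,999

likelihood_ratio = 1 / 0.01        # p(E|G) / p(E|I) = 100
posterior_odds = prior_odds * likelihood_ratio

print(1 / prior_odds, 1 / posterior_odds, posterior_odds / prior_odds)
```

The posterior odds come out as 1/7999.99, i.e. the 1/7999 of (2.7) up to rounding, and the update factor is exactly the likelihood ratio of 100.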
2.1.3 Variance of a sum
We have
    var [X + Y ] = E[(X + Y )²] − (E[X] + E[Y ])²                          (2.8)
                 = E[X² + Y ² + 2XY ] − (E[X]² + E[Y ]² + 2E[X]E[Y ])      (2.9)
                 = E[X²] − E[X]² + E[Y ²] − E[Y ]² + 2E[XY ] − 2E[X]E[Y ]  (2.10)
                 = var [X] + var [Y ] + 2cov [X, Y ]                       (2.11)
If X and Y are independent, then cov [X, Y ] = 0, so var [X + Y ] = var [X] + var [Y ].
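The identity can be verified exactly on any small joint distribution (a Python sketch using `fractions`; the particular correlated distribution below is made up for illustration):

```python
from fractions import Fraction

# a small joint distribution p(x, y) with correlated X and Y
joint = {(0, 0): Fraction(1, 2),
         (1, 0): Fraction(1, 4),
         (1, 1): Fraction(1, 4)}

def E(f):
    # expectation of f(X, Y) under the joint distribution
    return sum(p * f(x, y) for (x, y), p in joint.items())

var_x = E(lambda x, y: x * x) - E(lambda x, y: x) ** 2
var_y = E(lambda x, y: y * y) - E(lambda x, y: y) ** 2
cov = E(lambda x, y: x * y) - E(lambda x, y: x) * E(lambda x, y: y)
var_sum = E(lambda x, y: (x + y) ** 2) - E(lambda x, y: x + y) ** 2

assert var_sum == var_x + var_y + 2 * cov  # (2.11), exactly
```

Here cov [X, Y] = 1/8 ≠ 0, so var [X + Y] = 11/16 differs from var [X] + var [Y] = 7/16, as (2.11) predicts.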
2.1.4 Bayes rule for medical diagnosis
Let T = 1 represent a positive test outcome, T = 0 represent a negative test outcome, D = 1 mean you have the disease, and
D = 0 mean you don’t have the disease. We are told
P (T = 1|D = 1) = 0.99 (2.12)
P (T = 0|D = 0) = 0.99 (2.13)
P (D = 1) = 0.0001 (2.14)
We are asked to compute P (D = 1|T = 1), which we can do using Bayes' rule:

    P (D = 1|T = 1) = P (T = 1|D = 1)P (D = 1) / [P (T = 1|D = 1)P (D = 1) + P (T = 1|D = 0)P (D = 0)]  (2.15)
                    = (0.99 × 0.0001) / (0.99 × 0.0001 + 0.01 × 0.9999)                                 (2.16)
                    = 0.009804                                                                          (2.17)
So although you are much more likely to have the disease (given that you have tested positive) than a random member of the
population, you are still unlikely to have it.
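The arithmetic in (2.15)-(2.17) is easy to check directly (a Python sketch):

```python
p_t1_given_d1 = 0.99   # sensitivity, P(T=1|D=1)
p_t0_given_d0 = 0.99   # specificity, P(T=0|D=0)
p_d1 = 0.0001          # prevalence, P(D=1)

num = p_t1_given_d1 * p_d1
den = num + (1 - p_t0_given_d0) * (1 - p_d1)
p_d1_given_t1 = num / den
print(round(p_d1_given_t1, 6))  # 0.009804
```

Note the posterior is exactly 1/102 ≈ 0.98%: the tiny prior (1 in 10,000) overwhelms the 99%-accurate test, because almost all positives come from the vastly larger disease-free population.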