Bayesian Learning
[Read Ch. 6]
[Suggested exercises: 6.1, 6.2, 6.6]
Bayes Theorem
MAP, ML hypotheses
MAP learners
Minimum description length principle
Bayes optimal classifier
Naive Bayes learner
Example: Learning over text data
Bayesian belief networks
Expectation Maximization algorithm
Lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997
Two Roles for Bayesian Methods
Provides practical learning algorithms:
Naive Bayes learning
Bayesian belief network learning
Combine prior knowledge (prior probabilities) with observed data
Requires prior probabilities
Provides useful conceptual framework
Provides "gold standard" for evaluating other learning algorithms
Additional insight into Occam's razor
Bayes Theorem
P(h|D) = P(D|h) P(h) / P(D)
P(h) = prior probability of hypothesis h
P(D) = prior probability of training data D
P(h|D) = probability of h given D
P(D|h) = probability of D given h
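To make the theorem concrete, here is a minimal Python sketch (not from the text; the prior and likelihood values are invented for illustration) that computes the posterior P(h|D) for a two-hypothesis space, obtaining P(D) by summing P(D|h)P(h) over all hypotheses.

# Bayes theorem over a small hypothesis space (illustrative numbers only).
priors = {"h1": 0.3, "h2": 0.7}        # P(h), assumed prior probabilities
likelihoods = {"h1": 0.8, "h2": 0.1}   # P(D|h), assumed likelihood of the data under each h

# P(D) = sum over h of P(D|h) P(h)  (marginal probability of the data)
p_data = sum(likelihoods[h] * priors[h] for h in priors)

# P(h|D) = P(D|h) P(h) / P(D)
posteriors = {h: likelihoods[h] * priors[h] / p_data for h in priors}

print(posteriors)  # {'h1': 0.774..., 'h2': 0.225...}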
Choosing Hypotheses
P(h|D) = P(D|h) P(h) / P(D)
Generally want the most probable hypothesis given
the training data
Maximum a posteriori hypothesis hMAP:
hMAP = arg max_{h ∈ H} P(h|D)
     = arg max_{h ∈ H} P(D|h) P(h) / P(D)
     = arg max_{h ∈ H} P(D|h) P(h)
If we assume P(hi) = P(hj) for all i and j, we can simplify further and choose the Maximum likelihood (ML) hypothesis:
hML = arg max_{hi ∈ H} P(D|hi)
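A minimal sketch of how these definitions apply over a finite hypothesis space H follows (the priors and likelihoods below are invented for illustration). Note that P(D) can be dropped when maximizing, since it is the same for every hypothesis, and hML coincides with hMAP whenever the prior is uniform.

# Choosing MAP and ML hypotheses from a finite hypothesis space (illustrative numbers only).
priors = {"h1": 0.6, "h2": 0.3, "h3": 0.1}       # P(h)
likelihoods = {"h1": 0.2, "h2": 0.5, "h3": 0.9}  # P(D|h)

# hMAP = arg max_h P(D|h) P(h)   -- P(D) omitted, constant across hypotheses
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])

# hML = arg max_h P(D|h)         -- equivalent to MAP under a uniform prior
h_ml = max(likelihoods, key=likelihoods.get)

print(h_map)  # 'h2' (largest P(D|h)P(h) = 0.15)
print(h_ml)   # 'h3' (largest P(D|h) = 0.9)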