Reinforcement Learning and Dynamic Programming
Final Exam - Summer 2022
Duration: 150 minutes
Name Surname: Bilkent ID: Signature:
Q1: Pacman Bonus Level!
[Figure: a 5 x 1 row of cells numbered 1 through 5 from left to right, each containing a dot.]
Pacman is in a bonus level! With no ghosts around, he can eat as many dots as he wants. He is in
the 5 x 1 grid shown above, where the cells are numbered from left to right, that is, s ∈ {1, ..., 5}.
In cells 1 through 4, the actions available are to move Right (R) or to Fly (F) out of the bonus
level. The action Right deterministically lands Pacman in the cell to the right (and he eats the
dot there), while the Fly action deterministically lands him in a terminal state and ends the game.
From cell 5, Fly is the only action. Eating a dot gives a reward of +10, while flying out gives a
reward of +20.
(a) (4 pts) How many deterministic policies are there in the above MDP?
Each of cells 1 through 4 has two available actions (R or F), and cell 5 has only one, so the number of deterministic policies is 2^4 = 16.
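The count can be double-checked by brute-force enumeration; a minimal Python sketch (the `actions` mapping and `policies` list are illustrative names, not part of the exam):

```python
from itertools import product

# Available actions per cell: cells 1-4 allow Right (R) or Fly (F);
# cell 5 allows only Fly.
actions = {s: ("R", "F") for s in range(1, 5)}
actions[5] = ("F",)

# A deterministic policy fixes one action for every state.
policies = list(product(*(actions[s] for s in range(1, 6))))
print(len(policies))  # 2 * 2 * 2 * 2 * 1 = 16
```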
Consider the following policies for 0 ≤ i ≤ 4: π_i(s) = R if s ≤ i, F otherwise.
(b) (12 pts) Find the values v_{π_0}(1), v_{π_1}(1), and v_{π_2}(1) for the discount factor γ = 1, and
fill out the table. Show your work.
v_{π_0}(1) | 20
v_{π_1}(1) | 30
v_{π_2}(1) | 40

v_{π_0}(1) = 20 + γ(0) = 20
v_{π_1}(1) = 10 + γ(20) = 30
v_{π_2}(1) = 10 + γ(10) + γ^2(20) = 40
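These returns can be verified by rolling out each policy under the deterministic dynamics stated in the problem; a small sketch (the helpers `evaluate` and `pi` are made-up names for illustration):

```python
def evaluate(policy, gamma=1.0, start=1):
    """Discounted return of a deterministic policy from cell `start`.

    policy: dict mapping cell -> "R" or "F".
    Moving Right eats the dot in the next cell (+10); Fly gives +20
    and ends the game.
    """
    s, total, discount = start, 0.0, 1.0
    while True:
        if policy[s] == "F":
            return total + discount * 20
        total += discount * 10   # eat the dot in the cell to the right
        discount *= gamma
        s += 1

def pi(i):
    # pi_i: Right in cells 1..i, Fly everywhere else.
    return {s: ("R" if s <= i else "F") for s in range(1, 6)}

for i in range(3):
    print(evaluate(pi(i)))  # 20.0, 30.0, 40.0
```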
(c) (10 pts) For what range of γ is π_4 the optimal policy (that is, π_4 is strictly better than
π_0, π_1, π_2, and π_3)?
The value of each policy at state 1, as a function of γ:

π_4 → 10 + 10γ + 10γ^2 + 10γ^3 + 20γ^4
π_3 → 10 + 10γ + 10γ^2 + 20γ^3
π_2 → 10 + 10γ + 20γ^2
π_1 → 10 + 20γ
π_0 → 20

For v_{π_4}(1) > v_{π_3}(1) we need 10γ^3 + 20γ^4 > 20γ^3, i.e., γ > 1/2.
For v_{π_3}(1) > v_{π_2}(1) we need 10γ^2 + 20γ^3 > 20γ^2, i.e., γ > 1/2.
For v_{π_2}(1) > v_{π_1}(1) we need 10γ + 20γ^2 > 20γ, i.e., γ > 1/2.
For v_{π_1}(1) > v_{π_0}(1) we need 10 + 20γ > 20, i.e., γ > 1/2.

So, for γ > 1/2, we have v_{π_4}(1) > v_{π_3}(1) > v_{π_2}(1) > v_{π_1}(1) > v_{π_0}(1),
and π_4 is the optimal policy.
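The threshold can be sanity-checked numerically; a short sketch (the helper `v` is a hypothetical name) that evaluates each v_{π_i}(1) as a polynomial in γ:

```python
def v(i, gamma):
    # v_{pi_i}(1): eat i dots (+10 each, discounted), then Fly (+20).
    return sum(10 * gamma**k for k in range(i)) + 20 * gamma**i

# Below gamma = 1/2 the ordering reverses, at gamma = 1/2 all five
# values coincide (= 20), and above it v4 > v3 > v2 > v1 > v0.
for gamma in (0.4, 0.5, 0.6):
    print(gamma, [round(v(i, gamma), 3) for i in range(5)])
```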