EEE 448/548 - Reinforcement Learning & Dynamic Programming
Solutions to Final Exam
Problem 1. (30pt) Consider the following infinite-horizon Markov decision process with discount factor $\gamma = 1$, initialized at state $s_1$: at each step, the agent stays in state $s_1$ and receives reward 1 if it takes action $a_1$, and receives reward 0 and terminates the process otherwise. We focus on (Markov) stationary policies parametrized by a single parameter $\theta$ as follows:
\[
\pi_\theta(a_1 \mid s_1) = \theta \quad \text{and} \quad \pi_\theta(a_2 \mid s_1) = 1 - \theta.
\]
Note that no action is taken in the terminal state $s_F$, since the process has ended.
Compute the policy gradient of the expected return $J(\theta) = \mathbb{E}[R(\tau)]$ with respect to the parameter $\theta$, i.e., $\frac{dJ(\theta)}{d\theta}$, where $R(\tau) = \sum_h r_h$ is the total reward of the trajectory $\tau$ and the expectation is taken with respect to the randomness induced by the policy $\pi_\theta$.
Hint: $\sum_{k=1}^{\infty} k\,\alpha^{k-1} = \sum_{k=1}^{\infty} \frac{d}{d\alpha}\,\alpha^{k} = \frac{d}{d\alpha} \sum_{k=1}^{\infty} \alpha^{k}$.
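Before the formal derivation, a quick empirical sanity check can be helpful. The following is a minimal Python sketch (not part of the exam; the function names and the value $\theta = 0.6$ are illustrative assumptions) that samples trajectories from this MDP under $\pi_\theta$ and estimates $J(\theta)$ by Monte Carlo:

```python
import random

def sample_return(theta: float) -> int:
    """Sample one trajectory of the MDP under pi_theta and return R(tau).

    Starting from s1, action a1 (probability theta) gives reward 1 and
    stays in s1; action a2 (probability 1 - theta) gives reward 0 and
    terminates the process.
    """
    total = 0
    while random.random() < theta:  # agent picks a1
        total += 1                  # reward 1, remain in s1
    return total                    # agent picks a2: reward 0, terminate

def estimate_J(theta: float, n_samples: int = 100_000) -> float:
    """Monte Carlo estimate of J(theta) = E[R(tau)]."""
    return sum(sample_return(theta) for _ in range(n_samples)) / n_samples

if __name__ == "__main__":
    print(estimate_J(0.6))  # should be near theta/(1-theta) = 1.5
```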
Solution: The feasible trajectories of length $n$ have the form $\tau = \{s_1, a_1, 1, \ldots, s_1, a_1, 1, s_1, a_2, 0\}$: the agent takes action $a_1$ (collecting reward 1) for the first $n-1$ steps and action $a_2$ (collecting reward 0 and terminating) at step $n$. Such a trajectory has probability $\theta^{n-1}(1-\theta)$ and total reward $n-1$ (10pt). Therefore, we have (10pt)
\[
\mathbb{E}[R(\tau)] = \sum_{n=1}^{\infty} (n-1)\,\theta^{n-1}(1-\theta) = \sum_{n=1}^{\infty} n\,\theta^{n}(1-\theta).
\]
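For reference (not part of the graded derivation), this series has a closed form via the standard identity $\sum_{n=1}^{\infty} n\,\alpha^{n-1} = \frac{1}{(1-\alpha)^{2}}$ for $|\alpha| < 1$:
\[
J(\theta) = \sum_{n=1}^{\infty} n\,\theta^{n}(1-\theta)
          = \theta(1-\theta) \sum_{n=1}^{\infty} n\,\theta^{n-1}
          = \frac{\theta(1-\theta)}{(1-\theta)^{2}}
          = \frac{\theta}{1-\theta},
\]
so we expect $\frac{dJ(\theta)}{d\theta} = \frac{1}{(1-\theta)^{2}}$, which the step-by-step computation below confirms.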
Then, the gradient is given by (10pt)
\[
\begin{aligned}
\frac{d}{d\theta}\,\mathbb{E}[R(\tau)]
&= \frac{d}{d\theta} \sum_{n=1}^{\infty} n\,\theta^{n}(1-\theta) \\
&= \frac{d}{d\theta} \left( \theta(1-\theta) \sum_{n=1}^{\infty} n\,\theta^{n-1} \right) \\
&= \frac{d}{d\theta} \left( \theta(1-\theta)\, \frac{d}{d\theta} \sum_{n=1}^{\infty} \theta^{n} \right) \\
&= \frac{d}{d\theta} \left( \theta(1-\theta)\, \frac{d}{d\theta}\, \frac{\theta}{1-\theta} \right) \\
&= \frac{d}{d\theta} \left( \theta(1-\theta)\, \frac{(1-\theta) + \theta}{(1-\theta)^{2}} \right) \\
&= \frac{d}{d\theta}\, \frac{\theta}{1-\theta} \\
&= \frac{1}{(1-\theta)^{2}},
\end{aligned}
\]
where the third equality uses the hint and the fourth uses the geometric series $\sum_{n=1}^{\infty} \theta^{n} = \frac{\theta}{1-\theta}$ for $0 \le \theta < 1$.
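As a numerical cross-check (again not part of the graded solution), one can verify $\frac{dJ(\theta)}{d\theta} = \frac{1}{(1-\theta)^{2}}$ with the REINFORCE (score-function) estimator $R(\tau)\,\frac{d}{d\theta} \log p_\theta(\tau)$: a trajectory with $k = R(\tau)$ takes of $a_1$ followed by one $a_2$ has probability $\theta^{k}(1-\theta)$, so $\frac{d}{d\theta} \log p_\theta(\tau) = \frac{k}{\theta} - \frac{1}{1-\theta}$. A sketch with hypothetical helper names:

```python
import random

def grad_sample(theta: float) -> float:
    """One REINFORCE sample: R(tau) * d/dtheta log p_theta(tau).

    A trajectory with k takes of a1 followed by one a2 has probability
    theta**k * (1 - theta), hence
    d/dtheta log p_theta(tau) = k/theta - 1/(1 - theta), and R(tau) = k.
    """
    k = 0
    while random.random() < theta:
        k += 1
    score = k / theta - 1.0 / (1.0 - theta)
    return k * score

def estimate_grad(theta: float, n_samples: int = 200_000) -> float:
    """Monte Carlo average of the REINFORCE samples."""
    return sum(grad_sample(theta) for _ in range(n_samples)) / n_samples

if __name__ == "__main__":
    theta = 0.6
    print(estimate_grad(theta))      # noisy estimate of dJ/dtheta
    print(1.0 / (1.0 - theta) ** 2)  # analytic value: 6.25
```

The estimator is unbiased by the log-derivative trick, though its variance grows as $\theta \to 1$, so more samples are needed near that end of the parameter range.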