Bilkent University Fall 2023

EEE 448/548 - Reinforcement Learning & Dynamic Programming

Solutions to Final Exam
Problem 1. (30pt) Consider the following infinite-horizon Markov decision process with the discount factor γ = 1 and initialized at state s1: At each step, the agent stays in state s1 and receives reward 1 if it takes action a1, and receives reward 0 and terminates the process (entering the terminal state sF) otherwise. We focus on (Markov) stationary policies parametrized by a single parameter θ as follows:

πθ(a1 | s1) = θ and πθ(a2 | s1) = 1 − θ.

Note that there is no action in state sF, as the process has ended.
Compute the policy gradient of the expected return J(θ) = E[R(τ)] with respect to the parameter θ, i.e., dJ(θ)/dθ, where R(τ) = Σ_h r_h is the total reward of the trajectory τ and the expectation is taken with respect to the randomness induced by the policy πθ.

Hint:
$$
\sum_{k=1}^{\infty} k\alpha^{k-1} = \sum_{k=1}^{\infty} \frac{d}{d\alpha}\,\alpha^{k} = \frac{d}{d\alpha}\sum_{k=1}^{\infty} \alpha^{k}.
$$
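As a quick sanity check on the hint, one can truncate the series numerically and compare it with the closed form 1/(1 − α)², which follows from differentiating the geometric series sum α/(1 − α); a minimal Python sketch, with α = 0.3 chosen arbitrarily:

```python
# Check: sum_{k>=1} k * alpha^(k-1) = d/dalpha [alpha / (1 - alpha)]
#                                   = 1 / (1 - alpha)^2   for |alpha| < 1.
# Truncation is safe because the terms decay geometrically.
import numpy as np

alpha = 0.3                          # arbitrary test value with |alpha| < 1
k = np.arange(1, 500)                # 500 terms; the tail is negligible
lhs = np.sum(k * alpha**(k - 1))     # truncated series
rhs = 1.0 / (1.0 - alpha) ** 2       # closed form from the hint

print(lhs, rhs)                      # both approximately 2.0408
```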



Solution: For each n ≥ 1, the feasible trajectory of length n is τ = (s1, a1, 1, . . . , s1, a1, 1, s1, a2, 0), i.e., n − 1 choices of a1 followed by a2, which occurs with probability θ^{n−1}(1 − θ) and has total reward n − 1 (10pt). Therefore, we have (10pt)

$$
E[R(\tau)] = \sum_{n=1}^{\infty} (n-1)\,\theta^{n-1}(1-\theta) = \sum_{n=1}^{\infty} n\,\theta^{n}(1-\theta),
$$

where the second equality follows from shifting the summation index n − 1 → n.
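This series can be cross-checked by simulation: a minimal Python sketch that rolls out the process under πθ (with θ = 0.6 chosen arbitrarily) and compares the Monte Carlo average of R(τ) with a truncation of the sum.

```python
# Monte Carlo sketch of E[R(tau)]: roll out the MDP under pi_theta and
# average the total reward, then compare with the truncated series.
import numpy as np

theta = 0.6                              # arbitrary parameter in (0, 1)
rng = np.random.default_rng(0)

def rollout(theta, rng):
    """One trajectory: reward 1 per step while a1 is chosen; a2 terminates."""
    total = 0
    while rng.random() < theta:          # a1 is taken with probability theta
        total += 1                       # stay in s1 and collect reward 1
    return total                         # a2 is taken: reward 0, process ends

mc = np.mean([rollout(theta, rng) for _ in range(100_000)])
series = sum(n * theta**n * (1 - theta) for n in range(1, 2000))
print(mc, series)                        # both approximately 1.5 = theta / (1 - theta)
```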

Then, the gradient is given by (10pt)

$$
\begin{aligned}
\frac{d}{d\theta} E[R(\tau)]
&= \frac{d}{d\theta} \sum_{n=1}^{\infty} n\theta^{n}(1-\theta) \\
&= \frac{d}{d\theta}\left( \theta(1-\theta) \sum_{n=1}^{\infty} n\theta^{n-1} \right) \\
&= \frac{d}{d\theta}\left( \theta(1-\theta)\,\frac{d}{d\theta} \sum_{n=1}^{\infty} \theta^{n} \right) \\
&= \frac{d}{d\theta}\left( \theta(1-\theta)\,\frac{d}{d\theta}\,\frac{\theta}{1-\theta} \right) \\
&= \frac{d}{d\theta}\left( \theta(1-\theta)\,\frac{(1-\theta)+\theta}{(1-\theta)^{2}} \right) \\
&= \frac{d}{d\theta}\,\frac{\theta}{1-\theta} \\
&= \frac{1}{(1-\theta)^{2}}.
\end{aligned}
$$
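The same value can be recovered with the likelihood-ratio (REINFORCE) form of the policy gradient, dJ(θ)/dθ = E[R(τ) Σ_h (d/dθ) log πθ(a_h | s1)], since for this policy (d/dθ) log πθ(a1 | s1) = 1/θ and (d/dθ) log πθ(a2 | s1) = −1/(1 − θ). A minimal Monte Carlo sketch of that estimator (θ = 0.6 arbitrary), whose sample mean should approach 1/(1 − θ)² = 6.25:

```python
# REINFORCE-style check: estimate dJ/dtheta as the sample mean of
# R(tau) * sum_h d/dtheta log pi_theta(a_h | s1).  For a trajectory with
# m = n - 1 choices of a1 followed by a2:
#   R(tau) = m   and   score = m / theta - 1 / (1 - theta).
import numpy as np

theta = 0.6                              # same arbitrary parameter as above
rng = np.random.default_rng(1)

samples = []
for _ in range(200_000):
    m = 0
    while rng.random() < theta:          # count a1 choices before termination
        m += 1
    score = m / theta - 1.0 / (1.0 - theta)
    samples.append(m * score)            # R(tau) * score function

print(np.mean(samples), 1.0 / (1.0 - theta) ** 2)   # both approximately 6.25
```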


