CS 234 ASSIGNMENT 2 2021/2022 – Stanford University
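CS 234 Winter 2021: Assignment #2

Distributions induced by a policy (13 pts)

In this problem, we'll work with an infinite-horizon MDP $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, R, T, \gamma \rangle$ and consider stochastic policies of the form $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ [1]. Additionally, we'll assume that $\mathcal{M}$ has a single, fixed starting state $s_0 \in \mathcal{S}$ for simplicity.

[1] For a finite set $\mathcal{X}$, $\Delta(\mathcal{X})$ refers to the set of categorical distributions with support on $\mathcal{X}$ or, equivalently, the $\Delta^{|\mathcal{X}|-1}$ probability simplex.

(a) (written, 3 pts) Consider a fixed stochastic policy and imagine running several rollouts of this policy within the environment. Naturally, depending on the stochasticity of the MDP $\mathcal{M}$ and the policy itself, some trajectories are more likely than others. Write down an expression for $\rho^\pi(\tau)$, the likelihood of sampling a trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$ by running $\pi$ in $\mathcal{M}$. To put this distribution in context, recall that

$$V^\pi(s_0) = \mathbb{E}_{\tau \sim \rho^\pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, s_0\right].$$

Solution:

$$\rho^\pi(\tau) = \prod_{t=0}^{\infty} \pi(a_t \mid s_t)\, T(s_{t+1} \mid s_t, a_t)$$

(b) (written, 5 pts) Just as $\rho^\pi$ captures the distribution over trajectories induced by $\pi$, we can also examine the distribution over states induced by $\pi$. In particular, define the discounted, stationary state distribution of a policy $\pi$ as

$$d^\pi(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t\, p(s_t = s),$$

where $p(s_t = s)$ denotes the probability of being in state $s$ at timestep $t$ while following policy $\pi$; your answer to the previous part should help you reason about how you might compute this value. Consider an arbitrary function $f : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. Prove the following identity:

$$\mathbb{E}_{\tau \sim \rho^\pi}\left[\sum_{t=0}^{\infty} \gamma^t f(s_t, a_t)\right] = \frac{1}{1 - \gamma}\, \mathbb{E}_{s \sim d^\pi}\, \mathbb{E}_{a \sim \pi(s)}\left[f(s, a)\right].$$

Hint: You may find it helpful to first consider how things work out for $f(s, a) = 1$, $\forall (s, a) \in \mathcal{S} \times \mathcal{A}$.

Hint: What is $p(s_t = s)$?

Solution:

$$\begin{aligned}
\mathbb{E}_{\tau \sim \rho^\pi}\left[\sum_{t=0}^{\infty} \gamma^t f(s_t, a_t)\right]
&= \sum_{t=0}^{\infty} \gamma^t\, \mathbb{E}_{\tau \sim \rho^\pi}\left[f(s_t, a_t)\right] \\
&= \mathbb{E}_{\tau \sim \rho^\pi}\left[f(s_0, a_0)\right] + \gamma\, \mathbb{E}_{\tau \sim \rho^\pi}\left[f(s_1, a_1)\right] + \gamma^2\, \mathbb{E}_{\tau \sim \rho^\pi}\left[f(s_2, a_2)\right] + \ldots \\
&= \sum_{a_0} \pi(a_0 \mid s_0) f(s_0, a_0) + \gamma \sum_{a_0} \pi(a_0 \mid s_0) \sum_{s_1} T(s_1 \mid s_0, a_0) \sum_{a_1} \pi(a_1 \mid s_1) f(s_1, a_1) + \ldots \\
&= \sum_{s} p(s_0 = s)\, \mathbb{E}_{a \sim \pi(s)}\left[f(s, a)\right] + \gamma \sum_{s} p(s_1 = s)\, \mathbb{E}_{a \sim \pi(s)}\left[f(s, a)\right] + \ldots \\
&= \sum_{s} \sum_{t=0}^{\infty} \gamma^t\, p(s_t = s)\, \mathbb{E}_{a \sim \pi(s)}\left[f(s, a)\right] \\
&= \frac{1}{1 - \gamma} \sum_{s} d^\pi(s)\, \mathbb{E}_{a \sim \pi(s)}\left[f(s, a)\right] \\
&= \frac{1}{1 - \gamma}\, \mathbb{E}_{s \sim d^\pi}\, \mathbb{E}_{a \sim \pi(s)}\left[f(s, a)\right]
\end{aligned}$$

(c) (written, 5 pts) For any policy $\pi$, we define the following function:

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s).$$

Prove the following statement holds for all policies $\pi, \pi'$:

$$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1 - \gamma}\, \mathbb{E}_{s \sim d^\pi}\left[\mathbb{E}_{a \sim \pi(s)}\left[A^{\pi'}(s, a)\right]\right].$$

Solution:

$$\begin{aligned}
V^\pi(s_0) - V^{\pi'}(s_0)
&= \mathbb{E}_{\tau \sim \rho^\pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right] - V^{\pi'}(s_0) \\
&= \mathbb{E}_{\tau \sim \rho^\pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(R(s_t, a_t) + V^{\pi'}(s_t) - V^{\pi'}(s_t)\right)\right] - V^{\pi'}(s_0) \\
&= \mathbb{E}_{\tau \sim \rho^\pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(R(s_t, a_t) + \gamma V^{\pi'}(s_{t+1}) - V^{\pi'}(s_t)\right)\right] \\
&= \mathbb{E}_{\tau \sim \rho^\pi}\left[\sum_{t=0}^{\infty} \gamma^t\, \mathbb{E}\left[R(s_t, a_t) + \gamma V^{\pi'}(s_{t+1}) - V^{\pi'}(s_t) \,\middle|\, s_t, a_t\right]\right] \\
&= \mathbb{E}_{\tau \sim \rho^\pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(R(s_t, a_t) + \gamma\, \mathbb{E}\left[V^{\pi'}(s_{t+1}) \,\middle|\, s_t, a_t\right] - V^{\pi'}(s_t)\right)\right] \\
&= \mathbb{E}_{\tau \sim \rho^\pi}\left[\sum_{t=0}^{\infty} \gamma^t \left(Q^{\pi'}(s_t, a_t) - V^{\pi'}(s_t)\right)\right] \\
&= \mathbb{E}_{\tau \sim \rho^\pi}\left[\sum_{t=0}^{\infty} \gamma^t A^{\pi'}(s_t, a_t)\right] \\
&= \frac{1}{1 - \gamma}\, \mathbb{E}_{s \sim d^\pi}\left[\mathbb{E}_{a \sim \pi(s)}\left[A^{\pi'}(s, a)\right]\right].
\end{aligned}$$

The third line regroups the $V^{\pi'}$ terms: since $s_0$ is fixed, the shifted sum $\sum_{t=0}^{\infty} \gamma^t V^{\pi'}(s_t)$ equals $V^{\pi'}(s_0) + \sum_{t=0}^{\infty} \gamma^{t+1} V^{\pi'}(s_{t+1})$, which absorbs the trailing $-V^{\pi'}(s_0)$. The fourth line applies the tower property of expectation, and the last line applies the identity from part (b) with $f = A^{\pi'}$.

The function $A^\pi(s, a)$ is known as the advantage function. It quantifies how much more advantageous it may (or may not) be to take action $a$ in state $s$ and follow policy $\pi$ thereafter, rather than following policy $\pi$ from state $s$ onward.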
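The identities in parts (b) and (c) can be sanity-checked numerically on a small tabular MDP. The sketch below is not part of the assignment: the random 3-state, 2-action MDP, the helper names (`make_mdp`, `evaluate`, `discounted_state_dist`), the choice of γ = 0.9, and the truncation of the infinite sums at a finite horizon are all assumptions made for illustration. It checks (b) with f = R, which should recover V^π(s_0), and (c) using the advantage of a second random policy π'.

```python
import random

def make_mdp(n_states=3, n_actions=2, seed=0):
    """Hypothetical random tabular MDP plus two random stochastic policies."""
    rng = random.Random(seed)

    def dist(n):
        # Random categorical distribution over n outcomes.
        w = [rng.random() + 1e-3 for _ in range(n)]
        z = sum(w)
        return [x / z for x in w]

    T = [[dist(n_states) for _ in range(n_actions)] for _ in range(n_states)]
    R = [[rng.random() for _ in range(n_actions)] for _ in range(n_states)]
    pi = [dist(n_actions) for _ in range(n_states)]
    pi_prime = [dist(n_actions) for _ in range(n_states)]
    return T, R, pi, pi_prime

def evaluate(T, R, pi, gamma, iters=2000):
    """Iterative policy evaluation; returns (V, Q) for the given policy."""
    n_s, n_a = len(T), len(T[0])
    V = [0.0] * n_s
    for _ in range(iters):
        Q = [[R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
        V = [sum(pi[s][a] * Q[s][a] for a in range(n_a)) for s in range(n_s)]
    return V, Q

def discounted_state_dist(T, pi, gamma, s0=0, horizon=2000):
    """d(s) = (1 - gamma) * sum_t gamma^t p(s_t = s), truncated at `horizon`."""
    n_s, n_a = len(T), len(T[0])
    p = [0.0] * n_s
    p[s0] = 1.0  # fixed start state, so p(s_0 = s0) = 1
    d = [0.0] * n_s
    discount = 1.0
    for _ in range(horizon):
        for s in range(n_s):
            d[s] += (1 - gamma) * discount * p[s]
        # Push the state marginal one step forward under pi and T.
        p_next = [0.0] * n_s
        for s in range(n_s):
            for a in range(n_a):
                for s2 in range(n_s):
                    p_next[s2] += p[s] * pi[s][a] * T[s][a][s2]
        p = p_next
        discount *= gamma
    return d

gamma = 0.9
T, R, pi, pi_prime = make_mdp()
n_s, n_a = len(T), len(T[0])

V_pi, _ = evaluate(T, R, pi, gamma)
V_pp, Q_pp = evaluate(T, R, pi_prime, gamma)
d = discounted_state_dist(T, pi, gamma)

# Part (b) with f = R: the left-hand side is V^pi(s_0).
lhs_b = V_pi[0]
rhs_b = sum(d[s] * sum(pi[s][a] * R[s][a] for a in range(n_a))
            for s in range(n_s)) / (1 - gamma)

# Part (c): value gap equals the scaled expected advantage of pi' under d^pi.
lhs_c = V_pi[0] - V_pp[0]
rhs_c = sum(d[s] * sum(pi[s][a] * (Q_pp[s][a] - V_pp[s]) for a in range(n_a))
            for s in range(n_s)) / (1 - gamma)

print("part (b):", lhs_b, rhs_b)
print("part (c):", lhs_c, rhs_c)
```

With f(s, a) = 1 the same machinery reduces to checking that `d` sums to 1, which is the first hint in part (b). The truncation error is on the order of γ^horizon, so the two sides of each identity should agree up to floating-point precision.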
School, study and subject
- Institution
- Stanford University
- Course
- CS234 (CS234)
Document information
- Uploaded on
- March 11, 2022
- Number of pages
- 13
- Written in
- 2021/2022
- Type
- Exam
- Contains
- Questions and answers