with Verified Answers | 100% Correct | Latest 2025/2026 Update - Georgia Institute of Technology.
Policy Iteration   Policy Evaluation: Compute V(pi)
Policy Refinement: Greedily change the action as per V(pi) at the next states
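A minimal tabular sketch of the two alternating steps (the transition array p[s, a, s'], reward array r[s, a], and discount value are illustrative assumptions, not from the source):

```python
import numpy as np

def policy_iteration(p, r, gamma=0.9, tol=1e-8):
    # p[s, a, s2]: transition probabilities, r[s, a]: expected rewards (assumed given)
    n_states, n_actions = r.shape
    pi = np.zeros(n_states, dtype=int)              # arbitrary initial policy
    while True:
        # Policy Evaluation: compute V(pi) by iterating the Bellman expectation backup
        V = np.zeros(n_states)
        while True:
            V_new = np.array([r[s, pi[s]] + gamma * p[s, pi[s]] @ V
                              for s in range(n_states)])
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        # Policy Refinement: act greedily with respect to Q(s, a) computed from V(pi)
        Q = r + gamma * (p @ V)                     # shape (S, A); p @ V sums over next states
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):              # policy stable -> pi is optimal
            return pi, V
        pi = pi_new
```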
Why do Policy Iteration   pi_i often converges to pi* sooner than V_pi converges to V_pi*
- thus requires fewer iterations
Deep Q-Learning   - Q(s, a; w, b) = w_a^T s + b_a
MSE Loss := (Q_new(s, a) - (r + y * max_a' Q_old(s', a')))^2
- using a single Q function makes the loss unstable (the regression target shifts with every update)
--> use two Q-functions (two networks)
- Freeze Q_old and update Q_new
- Set Q_old = Q_new at regular intervals
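A minimal PyTorch sketch of the frozen-target update described above (network sizes, tensor shapes, and the sync interval are illustrative assumptions):

```python
import copy
import torch
import torch.nn as nn

gamma = 0.99
q_new = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # online network Q_new
q_old = copy.deepcopy(q_new)                                          # frozen target network Q_old
optimizer = torch.optim.Adam(q_new.parameters(), lr=1e-3)

def dqn_update(s, a, r, s_next, done, step, sync_every=1000):
    # s: (B, 4) states, a: (B,) long actions, r: (B,) rewards, done: (B,) float 0/1 flags
    # Target from the frozen network: r + y * max_a' Q_old(s', a'); no gradient flows into Q_old
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * q_old(s_next).max(dim=1).values
    # Prediction from the online network: Q_new(s, a) for the actions actually taken
    pred = q_new(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(pred, target)     # MSE between Q_new and the frozen target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % sync_every == 0:                      # Set Q_old = Q_new at regular intervals
        q_old.load_state_dict(q_new.state_dict())
    return loss.item()
```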
Reinforcement Learning   Sequential decision making in an environment with evaluative feedback
Environment: may be unknown, non-linear, stochastic, and complex
Agent: learns a policy to map states of the environment to actions
- seeks to maximize long-term reward
RL: Evaluative Feedback   - Pick an action, receive a reward
- No supervision for what the correct action is or would have been (unlike supervised learning)
RL: Sequential Decisions   - Plan and execute actions over a sequence of states
- Reward may be delayed, requiring optimization of future rewards (long-term planning)
Signature Challenges in RL   Evaluative Feedback: Need trial and error to find the right action
Delayed Feedback: Actions may not lead to immediate reward
Non-stationarity: Data distribution of visited states changes when the policy changes
Fleeting nature of online data: may only see each data point once
MDP   Framework underlying RL
S: Set of states
A: Set of actions
R: Distribution of rewards
T: Transition probability
y: Discount factor
Markov Property: Current state completely characterizes the state of the environment
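A small sketch of the MDP tuple as a tabular data structure (field names and array shapes are assumptions for illustration):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MDP:
    """Tabular (S, A, R, T, y) tuple; shapes are illustrative assumptions."""
    n_states: int        # S: states indexed 0 .. n_states - 1
    n_actions: int       # A: actions indexed 0 .. n_actions - 1
    r: np.ndarray        # R: expected reward r[s, a]
    p: np.ndarray        # T: transition probability p[s, a, s'] = p(s' | s, a)
    gamma: float = 0.9   # y: discount factor

# Markov property: p conditions only on the current (s, a), never on earlier history.
```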
RL: Equations relating optimal quantities   1. V*(s) = max_a Q*(s, a)
2. pi*(s) = argmax_a Q*(s, a)
V*(s)   max_a ( sum_s' { p(s'|s, a) [ r(s, a) + y V*(s') ] } )
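A minimal value-iteration sketch that applies the Bellman optimality backup above until V converges, then reads off pi* via the argmax relation (array shapes are illustrative assumptions):

```python
import numpy as np

def value_iteration(p, r, gamma=0.9, tol=1e-8):
    # Bellman optimality backup: V*(s) = max_a sum_s' p(s'|s, a) [ r(s, a) + y V*(s') ]
    # Since the probabilities sum to 1, r(s, a) can be pulled out of the sum over s'.
    n_states, n_actions = r.shape
    V = np.zeros(n_states)
    while True:
        Q = r + gamma * (p @ V)                  # estimate of Q*(s, a), shape (S, A)
        V_new = Q.max(axis=1)                    # 1. V*(s) = max_a Q*(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new       # 2. pi*(s) = argmax_a Q*(s, a)
        V = V_new
```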