with Verified Answers | 100% Correct | Latest 2025/2026 Update - Georgia Institute of Technology.
Why use Policy Iteration? - π_i often converges to π* sooner than V_π_i converges to V_π*
- thus requires fewer iterations (see the sketch below)
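A minimal tabular sketch of the idea (illustrative, not from the notes): the MDP is assumed to be given as arrays P[s, a, s'] and R[s, a], and the loop stops as soon as the greedy policy stops changing, which typically happens before the value estimates themselves would have fully converged.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular policy iteration on an MDP with P[s, a, s'] and R[s, a].

    Returns the optimal policy pi* and its value V_pi*.  The greedy
    policy usually stops changing (pi_i == pi*) several iterations
    before the values would have converged to V*.
    """
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)           # arbitrary initial policy

    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly.
        P_pi = P[np.arange(n_states), pi]        # (S, S)
        R_pi = R[np.arange(n_states), pi]        # (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

        # Policy improvement: greedy w.r.t. the one-step lookahead Q.
        Q = R + gamma * P @ V                    # (S, A)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):           # policy stable -> done
            return pi, V
        pi = new_pi
```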
Deep Q-Learning - Q(s, a; w, b) = w_a^T * s + b_a
MSE Loss := (Q_new(s, a) - (r + γ * max_a'(Q_old(s', a'))))^2
- using a single Q function makes the loss unstable
--> use two Q functions (NNs)
- Freeze Q_old and update Q_new
- Set Q_old = Q_new at regular intervals (see the sketch below)
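A minimal sketch of this loss with the linear Q-function above (parameter names and shapes are assumptions, not from the notes); Q_old is a frozen copy used only to compute the target:

```python
import numpy as np

def q_values(s, w, b):
    """Q(s, .) for all actions with a linear Q: w has shape (A, D), b has shape (A,)."""
    return w @ s + b

def dqn_loss(batch, w_new, b_new, w_old, b_old, gamma=0.99):
    """MSE between Q_new(s, a) and the frozen target r + gamma * max_a' Q_old(s', a')."""
    loss = 0.0
    for s, a, r, s_next, done in batch:
        target = r if done else r + gamma * q_values(s_next, w_old, b_old).max()
        pred = q_values(s, w_new, b_new)[a]
        loss += (pred - target) ** 2             # target is a constant: Q_old is frozen
    return loss / len(batch)

# At regular intervals, copy the online parameters into the frozen target:
# w_old, b_old = w_new.copy(), b_new.copy()
```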
Fitted Q-Iteration - Algorithm to optimize the MSE Loss on a fixed dataset (see the sketch below)
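A rough sketch of fitted Q-iteration under assumed conventions (linear-per-action Q, a fixed dataset of (s, a, r, s', done) tuples; all names are illustrative): each iteration recomputes the targets with the previous fit and refits Q to them by least squares.

```python
import numpy as np

def fitted_q_iteration(dataset, n_actions, gamma=0.99, n_iters=50):
    """Fitted Q-Iteration on a fixed dataset of (s, a, r, s', done) tuples.

    Each iteration regresses Q(s, a) onto the frozen targets
    r + gamma * max_a' Q_k(s', a') computed with the previous fit.
    Q is linear per action here: Q(s, a) = w[a] @ s + b[a].
    """
    states = np.array([t[0] for t in dataset], dtype=float)
    actions = np.array([t[1] for t in dataset])
    rewards = np.array([t[2] for t in dataset], dtype=float)
    next_states = np.array([t[3] for t in dataset], dtype=float)
    dones = np.array([t[4] for t in dataset], dtype=bool)

    dim = states.shape[1]
    w = np.zeros((n_actions, dim))
    b = np.zeros(n_actions)

    X = np.hstack([states, np.ones((len(dataset), 1))])     # bias column
    for _ in range(n_iters):
        # Targets use the previous Q (a fixed "old" Q for this iteration).
        next_q = next_states @ w.T + b                       # (N, A)
        targets = rewards + gamma * next_q.max(axis=1) * (~dones)
        # Fit each action's linear Q by least squares on its own transitions.
        for a in range(n_actions):
            mask = actions == a
            if mask.any():
                theta, *_ = np.linalg.lstsq(X[mask], targets[mask], rcond=None)
                w[a], b[a] = theta[:dim], theta[dim]
    return w, b
```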
RL: How to Collect Data - Challenge 1: Exploration vs Exploitation
Challenge 2: Non-iid, highly correlated data
- This leads to high variance in gradients and inefficient learning
- Experience Replay addresses this (see the sketch below):
--> store (s, a, s', r) tuples and continually update episodes (older samples discarded)
--> train the Q-network on random mini-batches of transitions from the replay memory instead of consecutive samples
--> the larger the buffer, the lower the correlation
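A minimal replay-buffer sketch (illustrative; capacity and batch size are arbitrary assumptions): a fixed-size deque drops the oldest transitions automatically, and training samples random mini-batches rather than consecutive steps.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s', done) transitions.

    Older samples are discarded once capacity is reached; sampling
    random mini-batches breaks the correlation between consecutive
    transitions.
    """
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)     # oldest entries dropped first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Typical use inside a training loop (illustrative):
# buffer.push(s, a, r, s_next, done)
# if len(buffer) >= 1_000:
#     batch = buffer.sample(32)    # random transitions, not consecutive ones
#     # ... compute the DQN loss on `batch` and take a gradient step ...
```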
Experience Replay - store (s, a, s', r) tuples and continually update episodes (older samples discarded)
Reinforcement learning - Sequential decision making in an environment with evaluative feedback
Environment: may be unknown, non-linear, stochastic and complex
Agent: learns a policy to map states of the environment to actions
- seeks to maximize long-term reward
RL: Evaluative Feedback - Pick an action, receive a reward
- No supervision for what the correct action is or would have been (unlike supervised learning)
RL: Sequential Decisions - Plan and execute actions over a sequence of states
- Reward may be delayed, requiring optimization of future rewards (long-term planning)
Signature Challenges in RL - Evaluative Feedback: Need trial and error to find the right action
Delayed Feedback: Actions may not lead to immediate reward
Non-stationarity: Data distribution of visited states changes when the policy changes
Fleeting Nature: Online data may only be seen once
MDP - Framework underlying RL
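For reference (standard definition, not spelled out in the card above): an MDP is a tuple (S, A, P, R, γ) with states S, actions A, transition probabilities P(s' | s, a), rewards R(s, a), and discount factor γ in [0, 1).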