1. Sequential Learning & Multi-Armed Bandit Problem
Sequential Decision-making Problem
Bandits
Which action to take next, based on the information gathered so far → how to leverage our experience in an optimal way.
Each arm corresponds to a reward distribution; each time the learner pulls an arm, a reward is drawn from that arm's distribution.
▶ Which one is better?
● Booking: choosing between two layouts.
● Medical: choosing between two medical treatments.
▶ Assumption: all the customers are the same; the only difference is between the arms.
Regret
Measure bandit performance through regret.
𝑅𝑛(π) = 𝑛µ* − 𝐸[𝑆𝑛]
π: the policy / interaction strategy.
µ* = max_{𝑎∈𝒜} µ𝑎: the mean reward of the best arm/interaction.
𝑆𝑛 = ∑_{𝑡=1}^{𝑛} 𝑋𝑡: the total reward collected over 𝑛 rounds.
The faster a policy's average reward converges to µ*, the better the policy/action.
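Illustrative example (numbers assumed for exposition, not from the notes): with two Bernoulli arms with means µ1 = 0.5 and µ2 = 0.7 (so µ* = 0.7), a policy that plays each arm in half of the rounds collects 𝐸[𝑆𝑛] = 0.6𝑛, so 𝑅𝑛(π) = 0.7𝑛 − 0.6𝑛 = 0.1𝑛.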
▶ Regret measures performance relative to a benchmark → the expected reward of always playing the best arm.
▶ Assumptions:
● Regret is non-negative → impossible to outperform the best solution.
● Impossible to get regret 0 (in real life).
● Cannot use supervised learning → the outcome that another (unchosen) action would have produced is unobserved.
▶ Properties:
1. Non-negative: 𝑅𝑛(π) ≥ 0 for all policies π.
2. Best-policy (sufficient): the policy choosing 𝐴𝑡 ∈ arg max_{𝑎∈𝒜} 𝑥𝑡𝑎 for all rounds 𝑡 satisfies 𝑅𝑛(π) = 0.
3. Best-policy (necessary): if 𝑅𝑛(π) = 0 for some policy π, then ℙ(𝐴𝑡 = arg max_{𝑎∈𝒜} 𝑥𝑡𝑎) = 1 for all rounds 𝑡.
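A minimal sketch of estimating 𝑅𝑛(π) by simulation, assuming Bernoulli arms; the arm means, the alternating policy, and all parameter values below are illustrative, not from the notes:

```python
import numpy as np

def estimate_regret(mus, policy, n=1000, repetitions=500, seed=0):
    """Monte Carlo estimate of R_n(pi) = n * mu_star - E[S_n]
    for Bernoulli arms with means `mus` (illustrative setup)."""
    rng = np.random.default_rng(seed)
    mu_star = max(mus)
    total = 0.0
    for _ in range(repetitions):
        for t in range(n):
            a = policy(t)                    # arm the policy chooses at round t
            total += rng.random() < mus[a]   # Bernoulli reward X_t
    expected_sn = total / repetitions        # Monte Carlo estimate of E[S_n]
    return n * mu_star - expected_sn

# Illustrative non-adaptive policy: alternate between the two arms.
alternating = lambda t: t % 2

print(estimate_regret([0.5, 0.7], alternating))  # roughly 0.1 * 1000 = 100
```

For the alternating policy this prints roughly 100, matching the 0.1𝑛 worked example above.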
How Do We Evaluate Regret
1. Analysis: sometimes possible to compute, analytically, the asymptotic regret of a
policy.
○ Reveals the true performance of the policy.
○ Only possible for fairly restricted environments.
○ Often only possible to bound the asymptotic regret.
2. Simulation: create a program that simulates the environment and runs the policy against it (repeatedly); the regret sketch above is an instance of this approach.
○ Comparatively easy to carry out.
○ Still, many restrictions on the environment.
○ Simulation ≠ proof; the simulated rewards may not match the real environment.
3. Offline Evaluation: collect data 𝐷 from the environment using some logging policy π′ → evaluate policy π using the collected data (a sketch follows this list).
○ Relatively easy to carry out if data is available.
○ Provides evaluation for the actual environment (in theory).
○ Need to understand the logging policy π′.
○ Collecting 𝐷 might be expensive.
○ Effective sample size is often much smaller than the size of 𝐷.
4. Online Evaluation: evaluate the policy in a real-life environment.
○ Deploy policy π in the wild.
○ Often a challenging engineering task.
○ Expensive → all errors affect actual business.
○ If done well, it allows for future offline analysis.
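As a concrete illustration of offline evaluation (item 3), a standard estimator is inverse propensity scoring (IPS); it is not spelled out in these notes, and the logged-data format, arm means, and policies below are illustrative assumptions. Each logged reward is reweighted by how likely the target policy π is to pick the logged action relative to the logging policy π′.

```python
import numpy as np

def ips_value(logged, target_probs):
    """Inverse-propensity-scoring estimate of the per-round expected reward
    of a target policy pi, from data logged under a different policy pi'.

    logged:       list of (action, reward, logging_prob) tuples, where
                  logging_prob = pi'(action) for the logged action.
    target_probs: mapping action -> pi(action) for the (stochastic) target policy."""
    weights = np.array([target_probs[a] / p for a, _, p in logged])
    rewards = np.array([r for _, r, _ in logged])
    return float(np.mean(weights * rewards))

# Toy log D: the logging policy pi' picks each of two Bernoulli arms
# (illustrative means 0.5 and 0.7) with probability 0.5.
rng = np.random.default_rng(0)
mus = [0.5, 0.7]
logged = []
for _ in range(10_000):
    a = int(rng.integers(2))
    r = float(rng.random() < mus[a])
    logged.append((a, r, 0.5))

# Evaluate a target policy pi that plays arm 1 with probability 0.9.
print(ips_value(logged, {0: 0.1, 1: 0.9}))  # close to 0.1*0.5 + 0.9*0.7 = 0.68
```

When π concentrates on actions that π′ rarely took, a few large weights dominate the estimate, which is why the effective sample size can be far smaller than the size of 𝐷.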
Explore-First (Explore-Then-Commit) (Non-Adaptive)
1. Explore: play each arm 𝑚 times.
2. Find the arm with the highest empirical average reward µ̂.
3. Exploit: play that empirically best arm in all remaining rounds.
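A minimal sketch of Explore-Then-Commit on Bernoulli arms; the arm means, 𝑚, and horizon 𝑛 below are illustrative assumptions, not from the notes:

```python
import numpy as np

def explore_then_commit(mus, m, n, rng):
    """Run ETC for n rounds on Bernoulli arms with means `mus`; return S_n."""
    k = len(mus)
    sums = np.zeros(k)
    total_reward = 0.0
    # 1. Explore: play each arm m times (round-robin over the k arms).
    for t in range(m * k):
        a = t % k
        x = float(rng.random() < mus[a])
        sums[a] += x
        total_reward += x
    # 2. Commit to the arm with the highest empirical average reward.
    best = int(np.argmax(sums / m))
    # 3. Exploit: play the committed arm in all remaining rounds.
    for _ in range(m * k, n):
        total_reward += float(rng.random() < mus[best])
    return total_reward

rng = np.random.default_rng(0)
mus, m, n = [0.5, 0.7], 100, 10_000
S_n = explore_then_commit(mus, m, n, rng)
print(n * max(mus) - S_n)  # realized regret of this single run
```

Averaging the last line over many independent runs gives a Monte Carlo estimate of 𝑅𝑛(π); a larger 𝑚 makes committing to the wrong arm less likely but spends more rounds exploring.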