
Reinforcement learning lecture material


Summary of all lectures. All of this is exam material.












Document information

Uploaded on
19 October 2024
Number of pages
22
Written in
2023/2024
Type
Lecture notes
Lecturer(s)
Yuzhen Qin
Contains
All lectures

Preview of the content

Lecture notes Reinforcement Learning
Lecture 1: Introduction
Reinforcement Learning: learning to make decisions by interaction.

RL is different from (un)supervised learning:

- Agent sequentially interacts with the environment
- Agent observes a reward and the state of the environment
- Goal-directed: maximize the cumulative reward

Learning is based on the reward hypothesis: any goal can be formalized as the outcome of
maximizing a cumulative reward.

RL is the science and framework of learning to make decisions from interaction:

- It requires us to think about
o Time
o Consequences of actions
o Actively gathering experience
o Predicting the future
- At each step t
o Agent takes an action a_t
o Receives a reward r_t (observed by the agent)
o Environment: state transitions to s_{t+1} (observed by the agent)
o Goal: maximize cumulative reward
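The interaction loop above can be sketched in a few lines of Python. The two-state environment and its reward rule here are hypothetical, chosen only to make the loop runnable:

```python
import random

def step(rng, state, action):
    """Hypothetical environment: reward 1 only for action 1 taken in state 1."""
    reward = 1.0 if (state == 1 and action == 1) else 0.0
    next_state = rng.choice([0, 1])      # the agent observes the next state s_{t+1}
    return next_state, reward

def run_episode(n_steps, seed=0):
    rng = random.Random(seed)
    state, total = 0, 0.0
    for t in range(n_steps):
        action = rng.choice([0, 1])               # agent takes an action a_t
        state, reward = step(rng, state, action)  # observes reward r_t and new state
        total += reward                           # goal: maximize cumulative reward
    return total

total_reward = run_episode(100)
```

This random-action agent ignores what it observes; the rest of the course is about choosing actions better.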

A policy defines the agent’s way of behaving: π(a|s).

Deterministic policy: it specifies exactly one action for every state.
Stochastic policy: it assigns probabilities to actions, so multiple actions are possible in a state.
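A stochastic policy π(a|s) can be represented as a table of action probabilities per state; a minimal sketch with hypothetical states and actions:

```python
import random

# Hypothetical stochastic policy pi(a|s): per state, a probability for each action.
policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(state, rng=random):
    """Draw an action according to pi(a | state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

# A deterministic policy is the special case where one action has probability 1.
```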

Value function: v_π(s)

- Describes how good a state is
- Depends on the state
- Depends on the policy
- It is the expected total reward!
A state might always yield a low immediate reward, but still have a high value because it is
regularly followed by other states that yield high rewards.
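The “expected total reward” view suggests a simple Monte Carlo estimate: average many sampled returns obtained from the same state. A sketch with hypothetical return samples (the low immediate reward of 0.1 plus a noisy high follow-up):

```python
import random

def monte_carlo_value(sample_return, n_episodes=1000):
    """Estimate v_pi(s) by averaging sampled returns G_t from state s."""
    total = 0.0
    for _ in range(n_episodes):
        total += sample_return()   # one full return from an episode starting in s
    return total / n_episodes

# Hypothetical example: a state with a low immediate reward (0.1) but a high
# value, because high-reward states tend to follow it.
rng = random.Random(0)
estimate = monte_carlo_value(lambda: 0.1 + rng.gauss(5.0, 1.0))
```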

A model describes/mimics the behaviour of the environment. A model can:

- assist in estimating value functions
- help make plans for a course of actions

Model-Based RL: learns/uses a model to make decisions.
Model-Free RL: learns to make decisions solely from interactions with the environment.

History of RL
- RL stems from 3 different domains: trial-and-error, optimal control, temporal difference (TD)
o Trial-and-error

 Law of effect: if an association is followed by a “satisfying state of affairs” it will be strengthened, and if it is followed by an “annoying state of affairs” it will be weakened
 Clear parallel with evolution by natural selection
 Selection involves no guidance or supervision: depends only on the
consequences of actions or adaptation and whether these serve the
organism’s biological needs
 Credit assignment: process of attributing or assigning credit or responsibility
to past actions for the rewards received at a later time
o Optimal Control
 Control goal: drive a system from a state to a desired state
 Objective: minimize time/energy
 Optimal control generates a sequence of control inputs to realize the goal.
 Model: s_{t+1} = f(s_t, u_t)
 Goal – find a control policy that optimizes O(s_t, u_t)
 Class of methods:
 Optimal return function – Bellman Equation
 Dynamic programming
 Markovian Decision Processes (MDPs)
o Policy Iteration Method (to solve MDPs)
o Temporal Difference Learning
 Bootstrapping from existing estimates of the value function
 Re-estimate the value functions little by little
 Has a neural basis in the brain
 Fundamental aspect of TD: the calculation of reward prediction errors (i.e.,
difference between expected and actually received rewards).
o In 1989 Modern RL begins
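The TD idea above — bootstrapping from existing estimates and computing reward prediction errors — can be sketched as a single TD(0) update. The states, rewards, step size, and discount here are hypothetical:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: re-estimate V[s] a little, bootstrapping from V[s_next]."""
    delta = r + gamma * V[s_next] - V[s]   # reward prediction error
    V[s] += alpha * delta                  # nudge V[s] toward r + gamma * V[s_next]
    return delta

V = {"A": 0.0, "B": 1.0}             # hypothetical current value estimates
delta = td0_update(V, "A", 0.5, "B")
```

Repeating such small updates over many observed transitions re-estimates the value function little by little, as described above.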

Lecture 2: Markovian Decision Processes
Markov Processes
- Formalization of sequential decision-making, where the choice of actions influence:
o immediate rewards
o subsequent situations/states
o subsequent rewards
- Markov property: P[S_{t+1} | S_1, …, S_t] = P[S_{t+1} | S_t]
- M = ⟨S, P⟩
o P is the transition matrix
o S is the state space (finite or infinite; we usually use finite)
- Markov Reward Process
o M = ⟨S, P, R, γ⟩
 R is the reward function
 γ (gamma) is the discount factor (between 0 and 1)
 Trades off later rewards against earlier ones
o Rewards only depend on the current state!
- Markov Decision Processes
o M = ⟨S, A, P, R, γ⟩
 A is the action space
o At each step t = 1, 2, 3, …, T
 Agent receives some representation of the environment's state
 Based on that information, agent selects an action
 Action modifies system’s state and generates a reward
o p(s', r | s, a) = P[S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a]
 p : S × R × S × A → [0, 1]
 fully defines the dynamics of the MDP
o State-transition probabilities
 p(s' | s, a) = P[S_t = s' | S_{t-1} = s, A_{t-1} = a] = Σ_{r∈R} p(s', r | s, a)
o Expected reward for state–action pairs
 r(s, a) = E[R_t | S_{t-1} = s, A_{t-1} = a] = Σ_{r∈R} r Σ_{s'∈S} p(s', r | s, a)
o Expected reward for state–action–next-state triples
 r(s, a, s') = Σ_{r∈R} r · p(s', r | s, a) / p(s' | s, a)
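The dynamics function p(s', r | s, a) and the derived quantities above can be checked numerically. The probabilities below are hypothetical, for one fixed state–action pair:

```python
# Hypothetical dynamics p(s', r | s, a) for one fixed state-action pair (s, a),
# stored as {(next_state, reward): probability}.
p = {
    ("s1", 1.0): 0.6,
    ("s1", 0.0): 0.1,
    ("s2", 0.0): 0.3,
}

def transition_prob(s_next):
    """p(s' | s, a): sum over rewards r of p(s', r | s, a)."""
    return sum(prob for (sp, _r), prob in p.items() if sp == s_next)

def expected_reward():
    """r(s, a): sum over r of r times the total probability of receiving r."""
    return sum(r * prob for (_sp, r), prob in p.items())
```

Because p fully defines the dynamics, its probabilities must sum to 1 over all (s', r) pairs, and every derived quantity above is just a marginalization of it.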
Goals, Rewards & Returns
- Goal/objective
o Is formalized as a reward signal from the environment to the agent
o What the agent wants to maximize over time (not the immediate reward, but the cumulative reward)
- Rewards
o Formalize the objective: what we want to achieve (not how)
o Customize rewards such that maximizing them allows the agent to achieve an
objective
- Two types of tasks
o Episodic tasks (terminating tasks)
 Agent interacts with the environment for a finite number of steps
 Agent-env interaction breaks naturally into sub-sequences
 Have a clear terminal state
 Time of termination T is usually a random variable that can vary from episode
to episode
o Continuing tasks (non-terminating tasks)
 Agent interacts with the environment for an infinite number of steps
 Agent-env interaction doesn’t break naturally into sub-sequences
 No terminal state
- Returns
o Episodic tasks: G_t = R_{t+1} + R_{t+2} + R_{t+3} + … + R_T
 T: time of termination
o Continuing tasks: G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + …
 Discount factor γ
 0 ≤ γ ≤ 1
 Determines the present value of future rewards
 γ = 0 – Myopic agent: maximizes only the immediate reward
 γ = 1 – Farsighted agent: takes future rewards into account
o Most Markov decision processes are discounted. Why?
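The two return definitions can be sketched as one function, where the episodic case corresponds to γ = 1 over a finite reward list:

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# gamma = 0: myopic agent, only the immediate reward counts
# gamma = 1: farsighted agent, all rewards count equally (the episodic sum)
g_half = discounted_return([1.0, 1.0, 1.0], 0.5)   # 1 + 0.5 + 0.25 = 1.75
```

For a continuing task the list is infinite; γ < 1 is what keeps the sum finite, which is one reason most MDPs are discounted.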
Seller: donjaschipper (Radboud Universiteit Nijmegen)