
Reinforcement learning lecture material


Summary of all lectures. All of this is exam material.












Document information

Uploaded on
19 October 2024
Number of pages
22
Written in
2023/2024
Type
Lecture notes
Lecturer(s)
Yuzhen Qin
Contains
All lectures

Preview of the content

Lecture notes Reinforcement Learning
Lecture 1: Introduction
Reinforcement Learning: learning to make decisions by interaction.

RL is different from (un)supervised learning:

- Agent sequentially interacts with the environment
- Agent observes a reward and the state of the environment
- Goal-directed: maximize the cumulative reward

Learning is based on the reward hypothesis: any goal can be formalized as the outcome of
maximizing a cumulative reward.

RL is the science and framework of learning to make decisions from interaction:

- It requires us to think about
o Time
o Consequences of actions
o Actively gathering experience
o Predicting the future
- At each step t
o Agent takes an action a_t
o Receives a reward r_t (observed by the agent)
o Environment: state transitions to s_{t+1} (observed by the agent)
o Goal: maximize cumulative reward
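The interaction loop above can be sketched in a few lines of Python. The two-state environment and its reward rule here are hypothetical, chosen only to make the loop runnable:

```python
import random

def step(rng, state, action):
    """Hypothetical environment: reward 1 only for action 1 taken in state 1."""
    reward = 1.0 if (state == 1 and action == 1) else 0.0
    next_state = rng.choice([0, 1])      # the agent observes the next state s_{t+1}
    return next_state, reward

def run_episode(n_steps, seed=0):
    rng = random.Random(seed)
    state, total = 0, 0.0
    for t in range(n_steps):
        action = rng.choice([0, 1])               # agent takes an action a_t
        state, reward = step(rng, state, action)  # observes reward r_t and new state
        total += reward                           # goal: maximize cumulative reward
    return total

total_reward = run_episode(100)
```

This random-action agent ignores what it observes; the rest of the course is about choosing actions better.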

A policy defines the agent’s way of behaving: π(a|s).

Deterministic policy: it specifies exactly one action for every state.
Stochastic policy: it assigns probabilities to actions, so multiple actions are possible in a state.
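A stochastic policy π(a|s) can be represented as a table of action probabilities per state; a minimal sketch with hypothetical states and actions:

```python
import random

# Hypothetical stochastic policy pi(a|s): per state, a probability for each action.
policy = {
    "s0": {"left": 0.8, "right": 0.2},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(state, rng=random):
    """Draw an action according to pi(a | state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

# A deterministic policy is the special case where one action has probability 1.
```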

Value function: v_π(s)

- Describes how good a state is
- Depends on the state
- Depends on the policy
- It is the expected total reward!
A state might always yield a low immediate reward, but still have a high value because it is
regularly followed by other states that yield high rewards.
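The “expected total reward” view suggests a simple Monte Carlo estimate: average many sampled returns obtained from the same state. A sketch with hypothetical return samples (the low immediate reward of 0.1 plus a noisy high follow-up):

```python
import random

def monte_carlo_value(sample_return, n_episodes=1000):
    """Estimate v_pi(s) by averaging sampled returns G_t from state s."""
    total = 0.0
    for _ in range(n_episodes):
        total += sample_return()   # one full return from an episode starting in s
    return total / n_episodes

# Hypothetical example: a state with a low immediate reward (0.1) but a high
# value, because high-reward states tend to follow it.
rng = random.Random(0)
estimate = monte_carlo_value(lambda: 0.1 + rng.gauss(5.0, 1.0))
```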

A model describes/mimics the behaviour of the environment. A model can:

- assist in estimating value functions
- help make plans for a course of actions

Model-Based RL: learns/uses a model to make decisions.
Model-Free RL: learns to make decisions solely from interactions with the environment.

History of RL
- RL stems from 3 different domains: trial-and-error, optimal control, temporal difference (TD)
o Trial-and-error

 Law of effect: if an association is followed by a “satisfying state of affairs” it will be strengthened, and if it is followed by an “annoying state of affairs” it will be weakened
 Clear parallel with evolution by natural selection
 Selection involves no guidance or supervision: depends only on the
consequences of actions or adaptation and whether these serve the
organism’s biological needs
 Credit assignment: process of attributing or assigning credit or responsibility
to past actions for the rewards received at a later time
o Optimal Control
 Control goal: drive a system from a state to a desired state
 Objective: minimize time/energy
 Optimal control generates a sequence of control inputs to realize the goal.
 Model: s_{t+1} = f(s_t, u_t)
 Goal – find a control policy that optimizes O(s_t, u_t)
 Class of methods:
 Optimal return function – Bellman Equation
 Dynamic programming
 Markovian Decision Processes (MDPs)
o Policy Iteration Method (to solve MDPs)
o Temporal Difference Learning
 Bootstrapping from existing estimates of the value function
 Re-estimate the value functions little by little
 Has a neural basis in the brain
 Fundamental aspect of TD: the calculation of reward prediction errors (i.e.,
difference between expected and actually received rewards).
o In 1989 Modern RL begins
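The TD idea above — bootstrapping from existing estimates and computing reward prediction errors — can be sketched as a single TD(0) update. The states, rewards, step size, and discount here are hypothetical:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) step: re-estimate V[s] a little, bootstrapping from V[s_next]."""
    delta = r + gamma * V[s_next] - V[s]   # reward prediction error
    V[s] += alpha * delta                  # nudge V[s] toward r + gamma * V[s_next]
    return delta

V = {"A": 0.0, "B": 1.0}             # hypothetical current value estimates
delta = td0_update(V, "A", 0.5, "B")
```

Repeating such small updates over many observed transitions re-estimates the value function little by little, as described above.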

Lecture 2: Markovian Decision Processes
Markov Processes
- Formalization of sequential decision-making, where the choice of actions influence:
o immediate rewards
o subsequent situations/states
o subsequent rewards
- Markov property: P[S_{t+1} | S_1, …, S_t] = P[S_{t+1} | S_t]
- M = ⟨S, P⟩
o P is the transition matrix
o S is the state space (finite or infinite; we usually use finite)
- Markov Reward Process
o M = ⟨S, P, R, γ⟩
 R is the reward function
 γ (gamma) is the discount factor (between 0 and 1)
 Trades off later rewards against earlier ones
o Rewards only depend on the current state!
- Markov Decision Processes
o M = ⟨S, A, P, R, γ⟩
 A is the action space
o At each step t = 1, 2, 3, …, T
 Agent receives some representation of the environment's state
 Based on that information, agent selects an action
 Action modifies system’s state and generates a reward
o p(s', r | s, a) = P[S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a]
 p : S × R × S × A → [0, 1]
 fully defines the dynamics of the MDP
o State-transition probabilities
 p(s' | s, a) = P[S_t = s' | S_{t-1} = s, A_{t-1} = a] = Σ_{r∈R} p(s', r | s, a)
o Expected reward for state–action pairs
 r(s, a) = E[R_t | S_{t-1} = s, A_{t-1} = a] = Σ_{r∈R} r Σ_{s'∈S} p(s', r | s, a)
o Expected reward for state–action–next-state triples
 r(s, a, s') = Σ_{r∈R} r · p(s', r | s, a) / p(s' | s, a)
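The dynamics function p(s', r | s, a) and the derived quantities above can be checked numerically. The probabilities below are hypothetical, for one fixed state–action pair:

```python
# Hypothetical dynamics p(s', r | s, a) for one fixed state-action pair (s, a),
# stored as {(next_state, reward): probability}.
p = {
    ("s1", 1.0): 0.6,
    ("s1", 0.0): 0.1,
    ("s2", 0.0): 0.3,
}

def transition_prob(s_next):
    """p(s' | s, a): sum over rewards r of p(s', r | s, a)."""
    return sum(prob for (sp, _r), prob in p.items() if sp == s_next)

def expected_reward():
    """r(s, a): sum over r of r times the total probability of receiving r."""
    return sum(r * prob for (_sp, r), prob in p.items())
```

Because p fully defines the dynamics, its probabilities must sum to 1 over all (s', r) pairs, and every derived quantity above is just a marginalization of it.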
Goals, Rewards & Returns
- Goal/objective
o Is formalized as a reward signal from the environment to the agent
o What the agent wants to maximize over time (not the immediate reward, but the cumulative reward)
- Rewards
o Formalize the objective: what we want to achieve (not how)
o Customize rewards such that maximizing them allows the agent to achieve an
objective
- Two types of tasks
o Episodic tasks (terminating tasks)
 Agent interacts with the environment for a finite number of steps
 Agent-env interaction breaks naturally into sub-sequences
 Have a clear terminal state
 Time of termination T is usually a random variable that can vary from episode
to episode
o Continuing tasks (non-terminating tasks)
 Agent interacts with the environment for an infinite number of steps
 Agent-env interaction doesn’t break naturally into sub-sequences
 No terminal state
- Returns
o Episodic tasks: G_t = R_{t+1} + R_{t+2} + R_{t+3} + … + R_T
 T: time of termination
o Continuing tasks: G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + …
 Discount factor γ
 0 ≤ γ ≤ 1
 Determines the present value of future rewards
 γ = 0 – Myopic agent: maximizes only the immediate reward
 γ = 1 – Farsighted agent: takes future rewards into account
o Most Markov decision processes are discounted. Why?
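The two return definitions can be sketched as one function, where the episodic case corresponds to γ = 1 over a finite reward list:

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# gamma = 0: myopic agent, only the immediate reward counts
# gamma = 1: farsighted agent, all rewards count equally (the episodic sum)
g_half = discounted_return([1.0, 1.0, 1.0], 0.5)   # 1 + 0.5 + 0.25 = 1.75
```

For a continuing task the list is infinite; γ < 1 is what keeps the sum finite, which is one reason most MDPs are discounted.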
Seller: donjaschipper (Radboud Universiteit Nijmegen)