Lecture notes Reinforcement Learning
Lecture 1: Introduction
Reinforcement Learning: learning to make decisions by interaction.

RL is different from (un)supervised learning:

- Agent sequentially interacts with the environment
- Agent observes a reward and the state of the environment
- Goal-directed: maximize the cumulative reward

Learning is based on the reward hypothesis: any goal or objective can be formalized as the outcome of maximizing a cumulative reward.

Science and framework of learning to make decisions from interaction:

- It requires us to think about
  o Time
  o Consequences of actions
  o Actively gathering experience
  o Predicting the future
- At each step t
  o The agent takes an action a_t
  o Receives a reward r_t (observed by the agent)
  o The environment's state transitions to s_t (observed by the agent)
  o Goal: maximize the cumulative reward
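
As a concrete illustration of this per-step loop, here is a minimal sketch in Python; the `env` object with Gym-style `reset()`/`step()` methods and the `policy` function are assumed placeholders, not part of the lecture material.

```python
# A minimal sketch of the interaction loop above, assuming a hypothetical
# Gym-style environment: `env.reset()` returns the initial state and
# `env.step(action)` returns (next_state, reward, done).

def run_episode(env, policy):
    state = env.reset()                        # observe the initial state s_0
    cumulative_reward = 0.0
    done = False
    while not done:
        action = policy(state)                     # agent takes action a_t
        state, reward, done = env.step(action)     # environment transitions, emits reward r_t
        cumulative_reward += reward                # the quantity the agent tries to maximize
    return cumulative_reward
```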

A policy defines the agent's way of behaving: π(a|s).

Deterministic: for every action, the next state is fully determined.
Stochastic: multiple states can be reached by an action.
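
A small illustrative sketch of the deterministic/stochastic distinction, using an assumed two-state toy environment (not taken from the notes):

```python
import random

# Toy illustration (assumed example): with deterministic dynamics every
# (state, action) pair leads to exactly one next state; with stochastic
# dynamics an action can lead to several possible next states.

deterministic_next = {
    ("s0", "go"): "s1",
    ("s1", "go"): "s0",
}

stochastic_next = {
    ("s0", "go"): {"s1": 0.9, "s0": 0.1},   # distribution over next states
    ("s1", "go"): {"s0": 0.6, "s1": 0.4},
}

def sample_next_state(state, action):
    dist = stochastic_next[(state, action)]
    return random.choices(list(dist), weights=list(dist.values()))[0]
```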

Value function: v_π(s)

- Describes how good a state is
- Depends on the state
- Depends on the policy
- Is the expected total reward!

A state might always yield a low immediate reward, but still have a high value because it is regularly followed by other states that yield high rewards.
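
A tiny worked example of this point, with assumed numbers and an assumed deterministic chain A → B → terminal:

```python
# Worked toy example (numbers assumed): state A gives no immediate reward but
# always leads to B, which gives a large reward, so A still has a high value.

gamma = 0.9
immediate_reward = {"A": 0.0, "B": 10.0}    # reward received on leaving each state

v_B = immediate_reward["B"]                 # B is followed by a terminal state
v_A = immediate_reward["A"] + gamma * v_B   # expected total (discounted) reward from A

print(v_A)   # 9.0 -> high value despite a zero immediate reward
```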

A model describes/mimics the behaviour of the environment. A model can:

- assist in estimating value functions
- be used to plan a course of actions

Two flavours of RL:

- Model-Based RL: learns/uses a model to make decisions
- Model-Free RL: learns to make decisions solely based on interactions with the environment
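
A rough sketch of the model-based/model-free distinction, using a hypothetical tabular model (the states, actions, and rewards are made up for illustration):

```python
# Sketch of the distinction above with a hypothetical tabular model: a model
# predicts what the environment would do, so a model-based agent can "plan"
# (simulate outcomes) without acting in the real environment.

# (state, action) -> (predicted next state, predicted reward)
model = {
    ("s0", "go"): ("s1", 1.0),
    ("s1", "go"): ("s0", 0.0),
}

def simulate(state, action):
    # Model-based RL queries this instead of the real environment when planning.
    return model[(state, action)]

# Model-free RL keeps no such table: it updates its value estimates or policy
# only from transitions actually experienced through interaction.
```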

History of RL
- RL stems from 3 different domains: trial-and-error learning, optimal control, and temporal difference (TD) learning
  o Trial-and-error
    ▪ Law of effect: if an association is followed by a "satisfying state of affairs" it will be strengthened, and if it is followed by an "annoying state of affairs" it will be weakened
    ▪ Clear parallel with evolution by natural selection
    ▪ Selection involves no guidance or supervision: it depends only on the consequences of actions or adaptations and whether these serve the organism's biological needs
    ▪ Credit assignment: the process of attributing credit or responsibility to past actions for rewards received at a later time
  o Optimal Control
    ▪ Control goal: drive a system from a given state to a desired state
    ▪ Objective: minimize time/energy
    ▪ Optimal control generates a sequence of control inputs to realize this goal
    ▪ Model: s_{t+1} = f(s_t, u_t)
    ▪ Goal: find a control policy that optimizes an objective O(s_t, u_t)
    ▪ Class of methods:
      • Optimal return function – Bellman equation
      • Dynamic programming
      • Markov Decision Processes (MDPs)
  o Policy Iteration Method (to solve MDPs)
  o Temporal Difference Learning
    ▪ Bootstrapping from existing estimates of the value function
    ▪ Re-estimates the value function little by little
    ▪ Has a neural basis in the brain
    ▪ A fundamental aspect of TD is the calculation of reward prediction errors (i.e., the difference between expected and actually received rewards); see the TD(0) sketch after this list
  o Modern RL begins in 1989
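
The reward-prediction-error idea mentioned under Temporal Difference Learning can be made concrete with a TD(0) update. This is the standard textbook form, sketched here with assumed step-size and discount values rather than anything specified in the notes:

```python
from collections import defaultdict

# Minimal TD(0) sketch: the value estimate is nudged by the reward prediction
# error delta = r + gamma * V[s_next] - V[s], bootstrapping from the current
# estimate V[s_next].

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    delta = r + gamma * V[s_next] - V[s]   # reward prediction error
    V[s] += alpha * delta                  # re-estimate the value a little
    return V

V = defaultdict(float)                     # value estimates, initialized to 0
V = td0_update(V, "s0", r=1.0, s_next="s1")
print(V["s0"])                             # 0.1
```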

Lecture 2: Markov Decision Processes
Markov Processes
- Formalization of sequential decision-making, where the choice of actions influences:
  o immediate rewards
  o subsequent situations/states
  o subsequent rewards
- Markov property: P[S_{t+1} | S_1, ..., S_t] = P[S_{t+1} | S_t]
- A Markov process is M = ⟨S, P⟩
  o P is the transition matrix
  o S is the state space (finite or infinite; we usually use a finite one)
- Markov Reward Process
  o M = ⟨S, P, R, γ⟩
    ▪ R is the reward function
    ▪ γ is the discount factor (between 0 and 1)
      • It trades off later rewards against earlier ones
  o Rewards only depend on the current state!
- Markov Decision Process
  o M = ⟨S, A, P, R, γ⟩
    ▪ A is the action space
  o At each step t = 1, 2, 3, ..., T
    ▪ The agent receives some representation of the environment's state
    ▪ Based on that information, the agent selects an action
    ▪ The action modifies the system's state and generates a reward
  o p(s', r | s, a) = P[S_{t+1} = s', R_{t+1} = r | S_t = s, A_t = a]
    ▪ p : S × R × S × A → [0, 1]
    ▪ fully defines the dynamics of the MDP
  o State-transition probabilities
    ▪ p(s' | s, a) = P[S_t = s' | S_{t-1} = s, A_{t-1} = a] = Σ_{r∈R} p(s', r | s, a)
  o Expected rewards for state-action pairs
    ▪ r(s, a) = E[R_t | S_{t-1} = s, A_{t-1} = a] = Σ_{r∈R} r Σ_{s'∈S} p(s', r | s, a)
  o Expected rewards for state-action-next-state triples
    ▪ r(s, a, s') = E[R_t | S_{t-1} = s, A_{t-1} = a, S_t = s'] = Σ_{r∈R} r · p(s', r | s, a) / p(s' | s, a)
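
The following toy example (with assumed probabilities) shows how the four-argument dynamics p(s', r | s, a) yields both quantities defined above:

```python
# Toy example (probabilities assumed): the four-argument dynamics
# p(s', r | s, a) determines both the state-transition probabilities and the
# expected rewards, matching the formulas above.

# dynamics[(s, a)][(s_next, r)] = p(s_next, r | s, a)
dynamics = {
    ("s0", "go"): {("s1", 1.0): 0.5, ("s1", 0.0): 0.25, ("s0", 0.0): 0.25},
}

def transition_prob(s, a, s_next):
    # p(s' | s, a) = sum over r of p(s', r | s, a)
    return sum(prob for (sp, _), prob in dynamics[(s, a)].items() if sp == s_next)

def expected_reward(s, a):
    # r(s, a) = sum over r of r * sum over s' of p(s', r | s, a)
    return sum(r * prob for (_, r), prob in dynamics[(s, a)].items())

print(transition_prob("s0", "go", "s1"))   # 0.75
print(expected_reward("s0", "go"))         # 0.5
```
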
Goals, Rewards & Returns
- Goal/objective
  o Is formalized as a reward signal from the environment to the agent
  o Is what the agent wants to maximize over time (not the immediate reward, but the cumulative reward)
- Rewards
  o Formalize the objective: what we want to achieve (not how)
  o Customize rewards such that maximizing them allows the agent to achieve the objective
- Two types of tasks
  o Episodic tasks (terminating tasks)
    ▪ The agent interacts with the environment for a finite number of steps
    ▪ The agent-environment interaction breaks naturally into sub-sequences
    ▪ Have a clear terminal state
    ▪ The time of termination T is usually a random variable that can vary from episode to episode
  o Continuing tasks (non-terminating tasks)
    ▪ The agent interacts with the environment for an infinite number of steps
    ▪ The agent-environment interaction doesn't break naturally into sub-sequences
    ▪ No terminal state
- Returns
  o Episodic tasks: G_t = R_{t+1} + R_{t+2} + R_{t+3} + ... + R_T
    ▪ T: time of termination
  o Continuing tasks: G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ...
    ▪ Discount factor γ
      • 0 ≤ γ ≤ 1
      • Determines the present value of future rewards
      • γ = 0 – myopic agent: maximizes only the immediate reward
      • γ = 1 – farsighted agent: takes future rewards into account
  o Most Markov decision processes are discounted. Why?
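
A short sketch of the return definitions above, with an assumed reward sequence:

```python
# Sketch (reward sequence assumed): with gamma = 1 and a finite reward list
# this is the episodic return; with gamma < 1 it is the (truncated) discounted
# return of a continuing task.

def discounted_return(rewards, gamma):
    # rewards[k] plays the role of R_{t+1+k}
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0], gamma=1.0))   # 3.0 (undiscounted, episodic)
print(discounted_return([1.0, 1.0, 1.0], gamma=0.0))   # 1.0 (myopic: only the immediate reward)
print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))   # ≈ 2.71 (farsighted)
```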