Markov Decision Processes Finals V2

Type: Exam (questions and answers)
Pages: 14
Grade: A
Uploaded on: 30 October 2024
Academic year: 2024/2025


A Markov Process is a process in which states do not depend on the history of previous states and actions. ✔️✔️True,
Markov means that you don't have to condition on anything past the most recent state. A Markov
Decision Process is a set of Markov-property-compliant states, with actions, rewards, and values.
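To make this concrete, here is a minimal sketch of a tabular MDP in Python; the state names, actions, and numbers are invented for illustration:

```python
# A tiny hypothetical MDP (states, actions, and numbers are made up).
# T[s][a] is a list of (probability, next_state) pairs; R[s][a] is the immediate reward.
T = {
    "s0": {"left": [(1.0, "s0")], "right": [(0.9, "s1"), (0.1, "s0")]},
    "s1": {"left": [(1.0, "s0")], "right": [(1.0, "s1")]},
}
R = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 0.0, "right": 1.0},
}
gamma = 0.9  # discount factor

# The Markov property is exactly what the T[s][a] lookup encodes: the next-state
# distribution depends only on the current state and action, never on history.
```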



Decaying Reward encourages the agent to end the game quickly instead of running around and
gathering more reward ✔️✔️True, as reward decays the total reward for the episode decreases, so
the agent is encouraged to maximize total reward by ending the game quickly.
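A quick sketch of why a decaying (discounted) reward favours finishing sooner; the reward sequences below are made up:

```python
def discounted_return(rewards, gamma=0.9):
    """Total reward with decay: sum of gamma**t * r_t over the episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Reaching a +10 reward in 2 steps beats reaching the same reward in 10 steps:
print(discounted_return([0, 10]))           # 9.0
print(discounted_return([0] * 9 + [10]))    # ~3.87
```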



R(s) and R(s,a) are equivalent. ✔️✔️True, it just happens that it's easier to think about one vs the
other in certain situations.



Reinforcement Learning is harder to compute than a simple MDP. ✔️✔️True, you can just use the
Bellman Equations for an MDP, but Reinforcement Learning requires that you make observations and
then summarize those observations as values.



An optimal policy is the best possible sequence of actions for an MDP. ✔️✔️True, with a single caveat.
The optimal policy is a policy that maximizes reward over an entire episode by taking the argmax of
resulting values of actions + rewards. But MDPs are memoryless, so there is no concept of "sequence"
for a policy.
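As a sketch of the argmax described above (reusing the hypothetical T, R, and gamma from the earlier snippet, plus a value table V), note that the resulting policy maps each state to one action and keeps no memory of any sequence:

```python
def greedy_policy(V, T, R, gamma=0.9):
    """For each state, pick the action maximizing reward plus discounted expected next value."""
    return {
        s: max(T[s], key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in T[s][a]))
        for s in T
    }
```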



Temporal Difference Learning is the difference in reward you see on subsequent time steps.
✔️✔️False, Temporal Difference Learning is the difference in value estimates on subsequent time
steps.
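A minimal TD(0) update, showing that the quantity driving learning is the difference between value estimates at successive time steps; all names here are illustrative, and V is assumed to be a dict from state to value:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Move V[s] toward the bootstrapped target r + gamma * V[s_next]."""
    td_error = r + gamma * V[s_next] - V[s]  # difference in value estimates, not in rewards
    V[s] += alpha * td_error
    return td_error
```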



RL falls generally into 3 different categories: Model-Based, Value-Based, and Policy-Based. ✔️✔️True,
Model-Based is essentially using the Bellman Equations to solve a problem, Value-Based is Temporal
Difference Learning, and Policy-Based is similar to Value-Based, but it solves in a finite amount of time
with a certain amount of confidence (in Greedy it's guaranteed).

TD Learning is defined by Incremental Estimates that are Outcome Based. ✔️✔️True, TD Learning
thinks of learning in terms of "episodes", and it updates its value estimates incrementally from the
outcomes it observes rather than from a predefined model.



For a learning rate to guarantee convergence, the sum of the learning rates must be infinite, and the sum
of the squared learning rates must be finite. ✔️✔️True, these are the standard conditions on the
learning-rate schedule, and together with the contraction property of the update they guarantee convergence.
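For example, the common schedule alpha_t = 1/t satisfies both conditions: its sum diverges while the sum of its squares stays finite. A small illustration (partial sums only, since the true sums run over infinitely many steps):

```python
N = 100_000
alphas = [1.0 / t for t in range(1, N + 1)]

print(sum(alphas))                  # keeps growing (roughly ln N): the sum diverges
print(sum(a * a for a in alphas))   # approaches pi**2 / 6 ~ 1.645: the squared sum stays finite
```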



All of the TD learning methods have setbacks: TD(1) is inefficient because it requires too much data and
has high variance, while TD(0) gives a maximum-likelihood estimate but is hard to compute for long episodes.
✔️✔️True, this is why we use TD(λ), which has many of the benefits of TD(0) but is much more
performant. Empirically, values of λ between 0.3 and 0.7 seem to perform best.
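A sketch of a single TD(λ) step with accumulating eligibility traces, which is how the method interpolates between TD(0) (λ = 0) and TD(1) (λ = 1); the names are illustrative, and V and E are assumed to be dicts from state to float:

```python
def td_lambda_step(V, E, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.5):
    """Update every previously visited state in proportion to its eligibility trace."""
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)  # the usual TD error
    E[s] = E.get(s, 0.0) + 1.0                              # bump the trace of the state just left
    for state in list(E):
        V[state] = V.get(state, 0.0) + alpha * delta * E[state]
        E[state] *= gamma * lam                             # traces decay each step
```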



To control learning, you simply have the agent choose actions in addition to learning. ✔️✔️True,
the actions chosen determine which states are experienced as observations during learning, so the agent can influence its own learning.



Q-Learning converges ✔️✔️True, the Bellman operator is a contraction mapping, and provided the learning rates
satisfy the usual conditions (their sum is infinite while the sum of their squares is finite), Q-learning always converges to Q*.
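A minimal tabular Q-learning update, whose fixed point is Q*; the dictionary-based representation and default parameters are illustrative:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Move Q[(s, a)] toward the target r + gamma * max over a' of Q[(s_next, a')]."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
```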



As long as the update operators for Q-learning or Value-iteration are non-expansions, then they will
converge. ✔️✔️True, there are expansions that will converge, but only non-expansions are
guaranteed to converge independent of their starting values.



A convex combination will converge. ✔️✔️False, it must be a fixed convex combination to converge. If
the value can change, like with the Boltzmann exploration, then it is not guaranteed to converge.



In Greedy Policies, the difference between the true value and the current value of the policy is less than
some epsilon value for exploration. ✔️✔️True



This epsilon bound serves as a good check for how long we run value iteration until we're pretty confident that we have
the optimal policy. ✔️✔️True
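A sketch of value iteration with exactly this kind of stopping check, reusing the hypothetical T, R, and gamma from the first snippet: once the largest update falls below epsilon, the greedy policy with respect to V is (close to) optimal.

```python
def value_iteration(T, R, gamma=0.9, epsilon=1e-6):
    """Apply the Bellman optimality backup until the largest change is below epsilon."""
    V = {s: 0.0 for s in T}
    while True:
        delta = 0.0
        for s in T:
            best = max(
                R[s][a] + gamma * sum(p * V[s2] for p, s2 in T[s][a])
                for a in T[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < epsilon:
            return V
```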



For a set of linear equations, the solution can be found in polynomial time. ✔️✔️True
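For a fixed policy, the Bellman equations are linear in V (V = R + gamma * P V), so they can be solved directly with a standard linear solver; the transition matrix and rewards below are made up for illustration:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.0, 1.0]])   # hypothetical transition probabilities under a fixed policy
R = np.array([0.0, 1.0])     # hypothetical per-state rewards
gamma = 0.9

# V = R + gamma * P @ V  rearranges to  (I - gamma * P) V = R
V = np.linalg.solve(np.eye(2) - gamma * P, R)
print(V)
```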
