Reinforcement Learning: Frequently Tested Conceptual Exam Questions with Detailed Answers
Q1. What is reinforcement learning (RL)?
A. Learning from labeled datasets to minimize error.
B. Sequential decision-making with evaluative feedback in an environment.
C. Learning embeddings for words and graphs.
D. Clustering unlabeled data points.
Answer: B
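A minimal sketch of what B describes, with a hypothetical environment object (`env.reset()` / `env.step()` and `choose_action` are illustrative names, not a specific library's API): the agent interacts in a loop and only ever sees a scalar reward as evaluative feedback.

def run_episode(env, choose_action, max_steps=100):
    # Agent-environment interaction loop: act, then observe reward and next state.
    state = env.reset()                         # initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = choose_action(state)           # agent's decision for this state
        state, reward, done = env.step(action)  # evaluative feedback: a scalar reward only
        total_reward += reward                  # the "correct" action is never revealed
        if done:
            break
    return total_reward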
Q2. In RL, what does the agent do?
A. Learns embeddings from data.
B. Learns a policy to map states to actions to maximize long-term rewards.
C. Provides supervision to the environment.
D. Directly controls the reward function.
Answer: B
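To make "a policy that maps states to actions" concrete, here is a small illustrative sketch (the Q-table contents are made up) of reading a greedy tabular policy off estimated action values:

def greedy_policy(q_table):
    # For each state, pick the action with the highest estimated long-term value.
    return {state: max(actions, key=actions.get) for state, actions in q_table.items()}

q_table = {"s0": {"left": 0.1, "right": 0.9},
           "s1": {"left": 0.7, "right": 0.2}}
policy = greedy_policy(q_table)   # {'s0': 'right', 's1': 'left'}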
Q3. In RL, what distinguishes evaluative feedback from supervised learning
feedback?
A. Supervised learning provides rewards; RL provides labels.
B. RL provides correct labels for each action.
C. In RL, the agent only receives a reward but not the correct action.
D. Both provide direct error signals for optimal actions.
Answer: C
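One way to see the contrast in C, sketched with made-up data: a supervised example carries the correct output, while an RL transition records only the action actually taken and the reward received.

# Supervised feedback: the correct label is part of every example.
supervised_example = {"input": [0.2, 0.7], "correct_label": "cat"}

# Evaluative feedback: only the chosen action and a scalar reward are observed;
# whether a better action existed in this state is never stated.
rl_transition = {"state": "s0", "action_taken": "right", "reward": 1.0, "next_state": "s1"}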
Q4. Why is RL considered sequential decision-making?
A. Each decision is independent of prior states.
B. Actions have no long-term consequences.
C. The agent must plan actions over sequences of states, sometimes with delayed
rewards.
D. It only applies to static datasets.
Answer: C
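The "delayed rewards" point can be made concrete with the discounted return: early actions earn credit through rewards that arrive many steps later. A short sketch (the discount factor and reward sequence are illustrative):

def discounted_return(rewards, gamma=0.9):
    # G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# All rewards are zero until the last step, yet the return is nonzero: 0.9**3 ≈ 0.729.
print(discounted_return([0, 0, 0, 1.0]))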
Q5. Which of the following is not a core challenge in RL?
A. Evaluative feedback (trial-and-error learning).
B. Delayed feedback (rewards not immediate).
C. Non-stationarity (policy changes environment distribution).
D. Full supervision (true labels provided at each step).
Answer: D
Q6. What does non-stationarity mean in RL?
A. Rewards are fixed regardless of state.
B. The distribution of visited states changes as the policy evolves.
C. The environment resets after each action.
D. The transition probabilities remain constant.
Answer: B
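A brief sketch of B on a made-up two-state chain: the distribution of visited states depends on the agent's own policy, so as the policy changes during learning, the data distribution changes too, even though the environment itself does not.

import random

def step(state, action):
    # "go" moves s0 -> s1; any other choice (or being in s1) keeps the current state.
    return "s1" if (state == "s0" and action == "go") else state

def visit_counts(policy, episodes=1000, horizon=5):
    counts = {"s0": 0, "s1": 0}
    for _ in range(episodes):
        state = "s0"
        for _ in range(horizon):
            counts[state] += 1
            state = step(state, policy(state))
    return counts

# A cautious policy and a bolder one visit very different state distributions.
print(visit_counts(lambda s: "go" if random.random() < 0.1 else "stay"))
print(visit_counts(lambda s: "go" if random.random() < 0.9 else "stay"))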
Q7. What is the Markov property in RL?
A. The next state depends on the entire history of states and actions.
B. The current state fully characterizes the environment.
C. Rewards are always immediate and fixed.
D. Actions are chosen independently of states.
Answer: B
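In symbols, B says that conditioning on the full history adds nothing beyond the current state and action:

P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} | s_t, a_t)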
Q8. Which components define an MDP?
A. States, Actions, Rewards, Transition probabilities, Discount factor.
B. Loss function, Optimizer, Training data, Validation set.
C. Embeddings, Hidden states, Outputs, Weights.
D. Layers, Activations, Gradients, Loss.
Answer: A
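As a concrete sketch of the five components in A, the toy two-state MDP below (all names and numbers are invented for illustration) bundles states, actions, transition probabilities, rewards, and a discount factor; it is reused in the examples after Q9 and Q10.

states = ["s0", "s1"]
actions = ["stay", "go"]
P = {  # transition probabilities: P[(s, a)] = {next_state: probability}
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s1": 0.9, "s0": 0.1},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 1.0},
}
R = {  # expected immediate reward r(s, a)
    ("s0", "stay"): 0.0, ("s0", "go"): 0.0,
    ("s1", "stay"): 1.0, ("s1", "go"): 0.0,
}
gamma = 0.9  # discount factor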
Q9. The Bellman optimality equation for the state-value function V^*(s) is:
A. V^*(s) = \max_a \sum_{s'} p(s'|s,a) [r(s,a) + \gamma V^*(s')]
B. V^*(s) = \sum_{s'} p(s'|s) r(s)
C. V^*(s) = \min_a Q(s,a)
D. V^*(s) = r(s) + V^*(s)
Answer: A
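Value iteration applies the backup in A until the values stop changing; a minimal sketch, assuming the toy MDP dictionaries (states, actions, P, R, gamma) defined after Q8:

def value_iteration(states, actions, P, R, gamma, tol=1e-8):
    # Repeated Bellman optimality backup:
    # V(s) <- max_a sum_{s'} p(s'|s,a) [ r(s,a) + gamma * V(s') ]
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            backup = max(
                sum(p * (R[(s, a)] + gamma * V[s2]) for s2, p in P[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(backup - V[s]))
            V[s] = backup
        if delta < tol:
            return V

V_star = value_iteration(states, actions, P, R, gamma)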
Q10. The Bellman optimality equation for the action-value function Q^*(s,a) is:
A. Q^*(s,a) = \sum_{s'} p(s'|s,a) [r(s,a)]
B. Q^*(s,a) = \sum_{s'} p(s'|s,a) [r(s,a) + \gamma \max_{a'} Q^*(s',a')]
C. Q^*(s,a) = V^*(s) + r(s)
D. Q^*(s,a) = \max_s V(s)
Answer: B
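The same fixed-point idea works for Q^* using the backup in B; a minimal sketch under the same assumptions as the Q9 example (toy MDP from Q8). A greedy policy, as in the Q2 sketch, can then be read off the resulting table.

def q_value_iteration(states, actions, P, R, gamma, sweeps=200):
    # Repeated Bellman optimality backup:
    # Q(s,a) <- sum_{s'} p(s'|s,a) [ r(s,a) + gamma * max_{a'} Q(s',a') ]
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):
        Q = {
            (s, a): sum(
                p * (R[(s, a)] + gamma * max(Q[(s2, a2)] for a2 in actions))
                for s2, p in P[(s, a)].items()
            )
            for s in states for a in actions
        }
    return Q

Q_star = q_value_iteration(states, actions, P, R, gamma)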