Question 1: In a deep neural network, which of the following best describes the primary cause of the
vanishing gradient problem?
A) The use of ReLU activation functions causing dead neurons B) Gradients becoming exponentially small
as they propagate backward through many layers with sigmoid/tanh activations C) The learning rate
being set too high, causing oscillations around the optimum D) Overfitting due to excessive model
capacity relative to training data
Correct Answer: B
Explanation:
B is correct because: The vanishing gradient problem occurs primarily when using activation functions
like sigmoid or tanh, whose derivatives are bounded: the sigmoid derivative lies in (0, 0.25] and the
tanh derivative in (0, 1]. During backpropagation, these small derivatives are multiplied together across
many layers, causing gradients to shrink exponentially. For a network with n layers using sigmoid, the
gradient can diminish by a factor of up to (0.25)^n, making early layers learn extremely slowly or not at all.
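The (0.25)^n shrinkage is easy to check numerically. A minimal sketch (the 20-layer depth is an arbitrary illustration, and x = 0 is the worst case where the sigmoid derivative peaks):

```python
import math

def sigmoid_derivative(x):
    """Derivative of the logistic sigmoid: s(x) * (1 - s(x)), at most 0.25."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

# The backpropagated gradient factor is the product of per-layer derivatives.
# Even in the best case for sigmoid (derivative = 0.25 at x = 0), the factor
# shrinks as (0.25)^n with depth n.
grad_factor = 1.0
for layer in range(20):
    grad_factor *= sigmoid_derivative(0.0)

print(grad_factor)  # (0.25)^20 ≈ 9.1e-13
```

After only 20 layers the gradient reaching the first layer is roughly 12 orders of magnitude smaller than at the output, which is why early layers barely update.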
A is incorrect because: ReLU activation functions actually help mitigate vanishing gradients, not cause
them. The "dead neuron" problem with ReLU is a separate issue where neurons can become
permanently inactive if they consistently receive negative inputs, but this is distinct from vanishing
gradients.
C is incorrect because: High learning rates cause divergence or oscillation during optimization, but this is
unrelated to the mathematical mechanism of vanishing gradients, which concerns the magnitude of
computed gradients, not how they're applied during parameter updates.
D is incorrect because: Overfitting relates to generalization performance on unseen data, not to the
propagation of gradients during training. A model can overfit while still having healthy gradient flow, or
suffer from vanishing gradients while underfitting.
Question 2: A convolutional neural network uses 64 filters of size 3×3×3 (where the last dimension
represents input channels) applied to an input feature map of dimensions 32×32×3 with stride 1 and
padding 'same'. What is the output volume dimension?
A) 30×30×64 B) 32×32×64 C) 32×32×3 D) 30×30×3
Correct Answer: B
Explanation:
B is correct because: With "same" padding, the spatial dimensions are preserved. The formula for
output spatial dimension with stride s=1, padding p calculated to maintain size, and kernel size k=3 is:
output = (input - k + 2p)/s + 1. For a 32×32 input with a 3×3 kernel and stride 1, padding of 1 pixel on
each side gives (32 - 3 + 2)/1 + 1 = 32. The depth equals the number of filters (64), not the input channels.
Thus: 32×32×64.
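The formula can be wrapped in a small helper to check both padding modes; `conv_output_shape` is a hypothetical function sketched here under the assumption of square inputs, square kernels, and odd kernel size (so "same" padding is symmetric):

```python
def conv_output_shape(h, w, c_in, k, n_filters, stride=1, padding="same"):
    """Output shape of a conv layer: spatial size via (input - k + 2p)/s + 1,
    depth equal to the number of filters."""
    # c_in does not affect the output shape: each filter spans all input
    # channels and produces exactly one output channel.
    if padding == "same":
        p = (k - 1) // 2  # k=3 -> p=1, which preserves size at stride 1
    else:  # "valid": no padding
        p = 0
    out_h = (h - k + 2 * p) // stride + 1
    out_w = (w - k + 2 * p) // stride + 1
    return out_h, out_w, n_filters

print(conv_output_shape(32, 32, 3, 3, 64))                   # (32, 32, 64) -> answer B
print(conv_output_shape(32, 32, 3, 3, 64, padding="valid"))  # (30, 30, 64) -> spatial size of A
```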
A is incorrect because: 30×30 would be the result of "valid" padding (no padding), calculated as (32 -
3)/1 + 1 = 30. However, the question specifies "same" padding, which preserves dimensions.
C is incorrect because: This maintains the spatial dimensions correctly but incorrectly preserves the
input depth (3 channels) rather than using the number of filters (64) as the output depth. Each filter
produces one output channel.
D is incorrect because: This combines both errors—using "valid" padding spatial dimensions (30×30)
while also incorrectly maintaining input channel depth (3) instead of filter count (64).
Question 3: In the Transformer architecture, what is the primary mathematical purpose of the scaling
factor √dk in the scaled dot-product attention mechanism Attention(Q,K,V) = softmax(QK^T/√dk)V?
A) To normalize the attention weights so they sum to 1 B) To prevent the dot products from growing too
large in magnitude, which would push the softmax function into regions with extremely small gradients
C) To ensure that the query and key matrices are orthogonal D) To convert the attention scores into
probability distributions
Correct Answer: B
Explanation:
B is correct because: When dk (the dimension of keys/queries) is large, the dot products in QK^T grow in
magnitude because each one sums dk terms. For random vectors with entries of mean 0 and variance 1, the
dot product has variance dk. Large dot-product values push the softmax function into regions where it
saturates (outputs near 0 or 1), producing extremely small gradients that hinder learning. Dividing by √dk
normalizes the variance to approximately 1, maintaining stable gradients.
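The variance claim can be verified empirically with the standard library alone; a sketch, where the dimension dk = 64, the sample count, and the seed are arbitrary choices for illustration:

```python
import random

random.seed(0)
d_k = 64
n_samples = 2000

# Sample dot products of random vectors whose entries have mean 0, variance 1.
dots = []
for _ in range(n_samples):
    q = [random.gauss(0, 1) for _ in range(d_k)]
    k = [random.gauss(0, 1) for _ in range(d_k)]
    dots.append(sum(qi * ki for qi, ki in zip(q, k)))

# Empirical variance (mean is 0 by construction).
var_raw = sum(d * d for d in dots) / n_samples                      # ≈ d_k
var_scaled = sum((d / d_k ** 0.5) ** 2 for d in dots) / n_samples   # ≈ 1

print(round(var_raw, 1), round(var_scaled, 3))
```

The unscaled dot products have variance close to dk = 64, so pre-softmax logits of magnitude ±15 or more are common, deep in the saturated region; after dividing by √dk the variance returns to roughly 1.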
A is incorrect because: The softmax function itself ensures outputs sum to 1 through its normalization
(dividing by the sum of exponentials). The scaling factor is applied before the softmax, so it doesn't
serve this normalization purpose.
C is incorrect because: The scaling factor doesn't enforce or encourage orthogonality between Q and K
matrices. Orthogonality would require specific constraints on the weight matrices during training, not a
simple scaling of dot products.
D is incorrect because: The conversion to probability distributions is accomplished by the softmax
function's exponential and normalization operations, not by the scaling factor. The scaling occurs before
this conversion and serves a different purpose.
Question 4: Which regularization technique explicitly constrains the L2 norm of the incoming weight
vector for each neuron to be exactly equal to a fixed constant (typically 1)?
A) L2 regularization (weight decay) B) Dropout C) Batch Normalization D) Weight Normalization