Embedding – >>> CORRECT ANSWER
A learned map from entities to vectors that encodes similarity
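A minimal sketch of what this means in practice; the entity names and vectors below are made up (not actually learned), and cosine similarity stands in for "encodes similarity":
```python
import numpy as np

# Hypothetical learned embedding table: entity -> vector (values are made up).
embeddings = {
    "cat": np.array([0.9, 0.1, 0.3]),
    "dog": np.array([0.8, 0.2, 0.4]),
    "car": np.array([0.1, 0.9, 0.7]),
}

def cosine_similarity(a, b):
    """Similarity in embedding space: 1.0 = same direction, 0.0 = orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # high: similar entities
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # lower: dissimilar entities
```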
Graph Embedding – >>> CORRECT ANSWER
Optimize the objective that connected nodes have more similar
embeddings than unconnected nodes.
Task: convert nodes to vectors
Effectively unsupervised learning: nearest neighbors in the embedding space are similar nodes
The learned vectors are useful for downstream tasks
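A rough sketch of that objective, assuming a toy graph and a logistic loss with random negative samples (the graph, embedding size, learning rate, and iteration count are arbitrary choices here, not a specific published method):
```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph: two triangles joined by the edge (2, 3). Node ids are arbitrary.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
num_nodes, dim, lr = 6, 2, 0.1

# Randomly initialized node embeddings, learned by stochastic gradient updates.
Z = rng.normal(scale=0.1, size=(num_nodes, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for step in range(500):
    for u, v in edges:
        # Positive pair: increase sigma(z_u . z_v) so connected nodes get similar embeddings.
        g = 1.0 - sigmoid(Z[u] @ Z[v])
        du, dv = lr * g * Z[v], lr * g * Z[u]
        Z[u] += du
        Z[v] += dv
        # Negative sample: push a random (likely unconnected) node's embedding away.
        n = int(rng.integers(num_nodes))
        if n not in (u, v):
            g = sigmoid(Z[u] @ Z[n])
            du, dn = lr * g * Z[n], lr * g * Z[u]
            Z[u] -= du
            Z[n] -= dn

# Connected nodes end up with larger dot products (more similar embeddings).
print("connected   (0, 1):", Z[0] @ Z[1])
print("unconnected (0, 5):", Z[0] @ Z[5])
```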
Multi-layer Perceptron (MLP) pain points for NLP – >>> CORRECT ANSWER
Cannot easily support variable-sized sequences as inputs or outputs
No inherent temporal structure
No practical way of holding state
The size of the network grows with the maximum allowed size of the
input or output sequences
Truncated Backpropagation through time – >>> CORRECT ANSWER
Only backpropagate the RNN's error through the most recent T time steps; the hidden state is still carried forward, but gradients are cut off beyond T
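A minimal PyTorch-style sketch, assuming a toy next-value prediction task; the model, sizes, and data are placeholders, and the key line is the detach() that cuts gradients after each chunk of T steps:
```python
import torch
import torch.nn as nn

# Toy setup (all sizes and data here are made up for illustration).
T = 20                      # truncation length: backprop only through T steps
rnn = nn.RNN(input_size=1, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

data = torch.sin(torch.linspace(0, 50, 1000)).reshape(1, -1, 1)  # one long sequence
h = None
for start in range(0, data.size(1) - T - 1, T):
    x = data[:, start:start + T]            # current chunk of T steps
    y = data[:, start + 1:start + T + 1]    # next-step targets
    out, h = rnn(x, h)
    loss = nn.functional.mse_loss(head(out), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    h = h.detach()   # keep the hidden state but stop gradients here:
                     # this cut-off is the "truncation" in truncated BPTT
```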
Recurrent Neural Networks (RNN) – >>> CORRECT ANSWER
h(t) = activation(U × input(t) + V × h(t-1) + bias)
y(t) = activation(W × h(t) + bias)
activation is typically the logistic function or tanh
outputs can also simply be h(t)
family of NN architectures for modeling sequences
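A minimal NumPy sketch of the same recurrence; the dimensions, random weights, and toy sequence are arbitrary:
```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal vanilla RNN cell matching the equations above.
input_dim, hidden_dim, output_dim = 3, 4, 2
U = rng.normal(size=(hidden_dim, input_dim))   # input -> hidden
V = rng.normal(size=(hidden_dim, hidden_dim))  # hidden -> hidden (recurrence)
W = rng.normal(size=(output_dim, hidden_dim))  # hidden -> output
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

def rnn_step(x_t, h_prev):
    """One time step: h(t) = tanh(U x(t) + V h(t-1) + b), y(t) = tanh(W h(t) + b)."""
    h_t = np.tanh(U @ x_t + V @ h_prev + b_h)
    y_t = np.tanh(W @ h_t + b_y)
    return h_t, y_t

# Run the same cell (shared weights) over a toy sequence of length 5.
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, y = rnn_step(x_t, h)
    print(y)
```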
Training Vanilla RNNs: difficulties – >>> CORRECT ANSWER
Vanishing and exploding gradients
Backpropagation multiplies by the recurrent weight once per time step, so the gradient across t steps scales like w^t:
if w > 1 → exploding gradients
if w < 1 → vanishing gradients
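A quick numeric illustration of that scaling (scalar toy case, not a real network):
```python
# The same recurrent weight multiplies the gradient once per time step,
# so over t steps the gradient scales roughly like w**t.
for w in (0.9, 1.1):
    print(f"w = {w}: w**10 = {w**10:.3g}, w**100 = {w**100:.3g}")
# w = 0.9 -> gradients shrink toward 0 (vanishing)
# w = 1.1 -> gradients blow up (exploding)
```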
Long Short-Term Memory Network Gates and States – >>> CORRECT ANSWER
f(t) = forget gate
i(t) = input gate
u(t) = candidate update gate
o(t) = output gate
c(t) = cell state: c(t) = f(t) × c(t – 1) + i(t) × u(t)
h(t) = hidden state: h(t) = o(t) × tanh(c(t))
(× denotes elementwise multiplication)
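A single-step NumPy sketch of these equations; the weight shapes and initialization are arbitrary, and each gate here acts on the concatenated [x(t), h(t-1)]:
```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One LSTM step matching the gate/state equations above (toy sizes).
input_dim, hidden_dim = 3, 4
W = {g: rng.normal(scale=0.1, size=(hidden_dim, input_dim + hidden_dim)) for g in "fiuo"}
b = {g: np.zeros(hidden_dim) for g in "fiuo"}

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])          # shared input to every gate
    f = sigmoid(W["f"] @ z + b["f"])           # forget gate
    i = sigmoid(W["i"] @ z + b["i"])           # input gate
    u = np.tanh(W["u"] @ z + b["u"])           # candidate update
    o = sigmoid(W["o"] @ z + b["o"])           # output gate
    c = f * c_prev + i * u                     # cell state
    h = o * np.tanh(c)                         # hidden state
    return h, c

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):    # toy sequence of length 5
    h, c = lstm_step(x_t, h, c)
print(h)
```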
Perplexity(s) – >>> CORRECT ANSWER
= ( Π_i 1 / P(w(i) | w(i-1), ...) )^(1/N)
= b^( –(1/N) Σ_i log_b P(w(i) | w(i-1), ...) )
note: the exponent of b is the per-word cross-entropy loss
perplexity of a discrete uniform distribution over k events = k
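A small check that the two forms agree, using made-up per-word probabilities and b = e:
```python
import numpy as np

# Per-word probabilities assigned by some language model to a sentence
# (the numbers are made up for illustration).
probs = np.array([0.2, 0.1, 0.4, 0.05])
N = len(probs)

# Form 1: geometric mean of inverse probabilities.
ppl_product = np.prod(1.0 / probs) ** (1.0 / N)

# Form 2: b ** (per-word cross-entropy loss), here with b = e.
ppl_entropy = np.exp(-np.mean(np.log(probs)))

print(ppl_product, ppl_entropy)            # identical values

# A discrete uniform distribution over k events has perplexity k.
k = 8
uniform = np.full(16, 1.0 / k)
print(np.exp(-np.mean(np.log(uniform))))   # 8.0
```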
Language Model Goal – >>> CORRECT ANSWER
estimate the probability of sequences of words
p(s) = p(w₁, w₂, …, wₙ)
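A toy sketch of estimating p(s), assuming a bigram approximation of the chain-rule factorization on a made-up corpus (no smoothing, so unseen words or bigrams would get probability 0; real language models are far more sophisticated):
```python
from collections import Counter

# Made-up corpus and bigram counts.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(word, prev):
    """Conditional probability p(word | prev) estimated from bigram counts."""
    return bigrams[(prev, word)] / unigrams[prev]

def p_sequence(words):
    """p(s) = p(w1) * p(w2 | w1) * ... with a bigram approximation."""
    p = unigrams[words[0]] / len(corpus)
    for prev, word in zip(words, words[1:]):
        p *= p_next(word, prev)
    return p

print(p_sequence("the cat sat on the mat".split()))
```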
Masked Language Modeling – >>> CORRECT ANSWER
Mask out some of the input tokens and train the model to predict the original tokens from the surrounding context.
Used as a pre-training task – an auxiliary task different from the final task we're really interested in, but which can help us achieve better performance by finding good initial parameters for the model.
By pre-training on masked language modeling before training on our
final task, it is usually possible to obtain higher performance than by
simply training on the final task
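A minimal sketch of constructing one MLM training example, assuming BERT-style [MASK] tokens and a ~15% masking rate (tokenization is simplified to whitespace splitting):
```python
import random

random.seed(1)

# Hide a fraction of tokens and keep the originals as prediction targets.
tokens = "the quick brown fox jumps over the lazy dog".split()

masked, targets = [], []
for tok in tokens:
    if random.random() < 0.15:          # mask roughly 15% of tokens
        masked.append("[MASK]")
        targets.append(tok)             # the model is trained to recover this token
    else:
        masked.append(tok)
        targets.append(None)            # no loss on unmasked positions

print(masked)
print(targets)
```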
Knowledge Distillation to Reduce Model Sizes – >>> CORRECT ANSWER