Graph and Word Representations, RNNs, LSTMs, Skip-Gram Word2Vec, Masked Language Modeling, Knowledge Distillation, t-SNE, Teacher Forcing, Conditional Language Models, and Evaluation Metrics
Embedding
QUESTION What is an embedding?
A: A learned map from entities to vectors that encodes similarity.
Rationale: Embeddings place similar entities close together in vector space, so distances reflect similarity.
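A minimal Python sketch of the idea (the vocabulary, dimensions, and values below are illustrative, not from the source): an embedding is just a learned lookup table of vectors, and similarity between entities can be read off with cosine similarity.

import numpy as np

rng = np.random.default_rng(0)
vocab = {"cat": 0, "dog": 1, "car": 2}      # hypothetical entity-to-id map
emb = rng.normal(size=(len(vocab), 4))      # one 4-d vector per entity; learned during training in practice

def embed(word):
    return emb[vocab[word]]                 # an embedding lookup is a row lookup

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# After training, similar entities (e.g. "cat" and "dog") should have high cosine similarity.
print(cosine(embed("cat"), embed("dog")))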
Graph Embedding
QUESTION Purpose of graph embeddings?
A: To learn node vectors by optimizing an objective under which connected nodes have more
similar embeddings than unconnected nodes.
Rationale: Converts graph nodes into vectors useful for downstream tasks.
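A toy sketch of such an objective, assuming a hypothetical small graph and plain SGD on a logistic loss over node-pair dot products (node count, edges, and hyperparameters are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n_nodes, dim, lr = 5, 3, 0.1
edges = [(0, 1), (1, 2), (3, 4)]            # hypothetical graph
Z = rng.normal(scale=0.1, size=(n_nodes, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(200):
    for u, v in edges:
        # Connected pair: push the dot product of their embeddings up.
        zu, zv = Z[u].copy(), Z[v].copy()
        g = 1.0 - sigmoid(zu @ zv)
        Z[u] += lr * g * zv
        Z[v] += lr * g * zu
        # Random (likely unconnected) node: push the dot product down.
        neg = rng.integers(n_nodes)         # may occasionally hit a real neighbor; fine for a sketch
        zu, zn = Z[u].copy(), Z[neg].copy()
        g = sigmoid(zu @ zn)
        Z[u] -= lr * g * zn
        Z[neg] -= lr * g * zu

After a few passes, connected nodes such as 0 and 1 end up with larger dot products than unconnected ones such as 0 and 3.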
QUESTION Why are graph embeddings useful?
A: They provide task-agnostic entity representations whose nearest neighbors are semantically
meaningful.
Rationale: Such representations remain useful even with limited labeled data.
MLP Pain Points for NLP
QUESTION Why are MLPs limited for NLP?
A: They cannot handle variable-length sequences, have no notion of temporal structure or
memory, and their size grows with the maximum sequence length.
Rationale: Sequences require context and state, which MLPs lack.
Truncated Backpropagation Through Time (TBPTT)
QUESTION What is TBPTT?
A: Backpropagate the RNN's gradients through only the last T time steps rather than the full sequence.
Rationale: Reduces computational cost and mitigates gradient issues.
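A minimal PyTorch sketch of TBPTT on a toy regression task (shapes, hyperparameters, and the loss are illustrative): the key step is detaching the hidden state between chunks of T steps so gradients stop flowing past the truncation boundary.

import torch
import torch.nn as nn

T, batch, d_in, d_hid = 8, 4, 10, 16
rnn = nn.RNN(d_in, d_hid, batch_first=True)
head = nn.Linear(d_hid, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

x = torch.randn(batch, 64, d_in)            # toy input sequence of length 64
y = torch.randn(batch, 64, 1)               # toy targets
h = torch.zeros(1, batch, d_hid)

for t0 in range(0, x.size(1), T):
    h = h.detach()                          # truncate: no gradient flows past this point
    out, h = rnn(x[:, t0:t0 + T], h)
    loss = ((head(out) - y[:, t0:t0 + T]) ** 2).mean()
    opt.zero_grad()
    loss.backward()                         # backprop only through the last T steps
    opt.step()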
Recurrent Neural Networks (RNN)
QUESTION RNN update equations?
A:
h(t) = activation(U x(t) + V h(t-1) + b_h)
y(t) = activation(W h(t) + b_y)
Rationale: Recursively updates hidden state based on input and previous
hidden state.
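A direct numpy transcription of these update equations, with tanh chosen as the activation and illustrative dimensions:

import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 3, 5, 2
U = rng.normal(size=(d_hid, d_in))
V = rng.normal(size=(d_hid, d_hid))
W = rng.normal(size=(d_out, d_hid))
b_h = np.zeros(d_hid)
b_y = np.zeros(d_out)

def rnn_step(x_t, h_prev):
    h_t = np.tanh(U @ x_t + V @ h_prev + b_h)   # h(t)
    y_t = np.tanh(W @ h_t + b_y)                # y(t)
    return h_t, y_t

h = np.zeros(d_hid)
for x_t in rng.normal(size=(4, d_in)):          # run over a length-4 toy sequence
    h, y = rnn_step(x_t, h)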
QUESTION Training difficulties?
A: Vanishing and exploding gradients.
Rationale: Gradients are multiplied across time steps, so they shrink or grow exponentially,
making long-term dependencies hard to learn.
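A tiny numeric illustration of that rationale (values are illustrative): the gradient picks up roughly one multiplicative factor per time step, so a factor below 1 drives it toward zero while a factor above 1 blows it up.

for w in (0.5, 1.5):
    factors = [w ** k for k in range(0, 60, 10)]
    print(f"w={w}:", [f"{g:.2e}" for g in factors])
# w=0.5 -> the gradient contribution vanishes toward 0; w=1.5 -> it explodes.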
Long Short-Term Memory (LSTM) Networks
QUESTION LSTM gates and states?
A:
f(t) = forget gate (sigmoid)
i(t) = input gate (sigmoid)
u(t) = candidate update (tanh)
o(t) = output gate (sigmoid)
c(t) = f(t) * c(t-1) + i(t) * u(t)
h(t) = o(t) * tanh(c(t))
Rationale: The gates control how information flows into, persists in, and leaves the cell state,
which mitigates the vanishing gradient problem.
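A numpy sketch of a single LSTM step implementing the gate and state equations above, with sigmoid gates and a tanh candidate (weight names and shapes are illustrative; '*' is the element-wise product):

import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 3, 4

def mats():
    # One (input weight, recurrent weight, bias) triple per gate.
    return (rng.normal(size=(d_hid, d_in)),
            rng.normal(size=(d_hid, d_hid)),
            np.zeros(d_hid))

(Wf, Uf, bf), (Wi, Ui, bi), (Wu, Uu, bu), (Wo, Uo, bo) = mats(), mats(), mats(), mats()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev):
    f = sigmoid(Wf @ x_t + Uf @ h_prev + bf)    # forget gate f(t)
    i = sigmoid(Wi @ x_t + Ui @ h_prev + bi)    # input gate i(t)
    u = np.tanh(Wu @ x_t + Uu @ h_prev + bu)    # candidate update u(t)
    o = sigmoid(Wo @ x_t + Uo @ h_prev + bo)    # output gate o(t)
    c = f * c_prev + i * u                      # c(t) = f(t) * c(t-1) + i(t) * u(t)
    h = o * np.tanh(c)                          # h(t) = o(t) * tanh(c(t))
    return h, c

h, c = np.zeros(d_hid), np.zeros(d_hid)
h, c = lstm_step(rng.normal(size=d_in), h, c)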
Perplexity