with Verified Answers | 100% Correct | Latest 2025/2026 Update - Georgia Institute of Technology.
Masked Language Modeling - pre-training task - an auxiliary task different from the final task we're really interested in, but which can help us achieve better performance by finding good initial parameters for the model
- By pre-training on masked language modeling before training on our final task, it is usually possible to obtain higher performance than by simply training on the final task
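The masking step behind this pre-training task can be sketched in plain Python. This is an illustrative sketch only: the function name `mask_tokens` and the 15% masking rate are assumptions (the rate follows BERT-style convention), not details from these notes.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace tokens with [MASK]; the model is then trained
    to predict the original tokens at the masked positions."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok  # prediction target for this position
        else:
            masked.append(tok)
    return masked, targets
```

Because the targets come from the text itself, no human labels are needed, which is what makes this auxiliary task cheap to run at scale.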
Knowledge Distillation to Reduce Model Sizes - Have a fully parameterized teacher model
- Have a much smaller student model
- Student model attempts to minimize prediction error and distance to the teacher model simultaneously
L(dist) = CE b/w student and teacher predictions
L(student) = CE b/w predicted output and actual labels
L = alpha * L(dist) + beta * L(student)
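The combined loss above can be sketched in pure Python. A minimal sketch under the assumption that both models output softmax probability vectors; the function names are illustrative, and real implementations typically also temperature-scale the teacher's logits.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """CE between a target distribution p and a predicted distribution q."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

def distillation_loss(student_probs, teacher_probs, true_label,
                      alpha=0.5, beta=0.5):
    """L = alpha * L(dist) + beta * L(student)."""
    # L(dist): match the teacher's soft predictions
    l_dist = cross_entropy(teacher_probs, student_probs)
    # L(student): match the ground-truth (one-hot) label
    one_hot = [1.0 if i == true_label else 0.0 for i in range(len(student_probs))]
    l_student = cross_entropy(one_hot, student_probs)
    return alpha * l_dist + beta * l_student
```

Note that L(dist) needs only the teacher's outputs, not labels, which is what enables the unlabeled-data advantage listed below.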
Advantages:
- may work well b/c of the soft predictions of the teacher model
- if we don't have enough labeled text, we can still train the student model to align its predictions with the teacher's
Embedding - A learned map from entities to vectors that encodes similarity
Graph Embedding - Optimize an objective such that connected nodes have more similar embeddings than unconnected nodes.
Task: convert nodes to vectors
- effectively unsupervised learning where nearest neighbors are similar
- these learned vectors are useful for downstream tasks
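A toy sketch of this objective, assuming a simple pull-together/push-apart update rather than any specific graph-embedding algorithm (e.g., DeepWalk or node2vec); all names and constants here are illustrative.

```python
import random

def train_graph_embeddings(num_nodes, edges, dim=2, steps=200, lr=0.1, seed=0):
    """Pull connected nodes' embeddings together; push randomly sampled
    (likely unconnected) pairs slightly apart as negative samples."""
    rng = random.Random(seed)
    emb = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_nodes)]
    edge_set = set(edges)
    for _ in range(steps):
        # attract: move each connected pair toward each other
        for u, v in edges:
            for d in range(dim):
                delta = emb[u][d] - emb[v][d]
                emb[u][d] -= lr * delta
                emb[v][d] += lr * delta
        # repel: sample one random pair; if it's not an edge, push it apart
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if u != v and (u, v) not in edge_set and (v, u) not in edge_set:
            for d in range(dim):
                delta = emb[u][d] - emb[v][d]
                emb[u][d] += lr * 0.1 * delta
                emb[v][d] -= lr * 0.1 * delta
    return emb
```

After training, connected nodes end up closer in the embedding space than unconnected ones, so nearest-neighbor lookups on the vectors reflect graph structure.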
Multi-layer Perceptron (MLP) pain points for NLP - Cannot easily support variable-sized sequences as inputs or outputs
- No inherent temporal structure