Embedding - correct answers✔✔A learned map from entities to vectors that encodes similarity
Graph Embedding - correct answers✔✔Optimize the objective that connected nodes have more similar embeddings than unconnected nodes.
Task: convert nodes to vectors
- effectively unsupervised learning where nearest neighbors are similar
- these learned vectors are useful for downstream tasks
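The objective above can be sketched as a tiny contrastive procedure: pull connected nodes' vectors together, push sampled unconnected pairs apart until they are at least a margin away. The graph, margin, and learning rate here are made-up illustrations, not a specific published algorithm.

```python
import numpy as np

# Toy contrastive graph embedding on a hypothetical 5-node graph.
rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (3, 4)]          # connected pairs
non_edges = [(0, 3), (2, 4), (1, 4)]      # sampled unconnected pairs
emb = rng.normal(size=(5, 2))             # 5 nodes -> 2-d vectors

lr = 0.1
for _ in range(200):
    for i, j in edges:                    # pull connected nodes together
        diff = emb[i] - emb[j]
        emb[i] -= lr * diff
        emb[j] += lr * diff
    for i, j in non_edges:                # push unconnected nodes apart,
        diff = emb[i] - emb[j]            # but only inside a margin of 1.0
        if np.linalg.norm(diff) < 1.0:
            emb[i] += lr * diff
            emb[j] -= lr * diff
```

After training, nearest neighbors in embedding space are the graph neighbors, which is exactly the "connected nodes have more similar embeddings" objective.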
Multi-layer Perceptron (MLP) pain points for NLP - correct answers✔✔- Cannot easily support variable-sized sequences as inputs or outputs
- No inherent temporal structure
- No practical way of holding state
- The size of the network grows with the maximum allowed size of the input or output sequences
Truncated Backpropagation through time - correct answers✔✔- Only backpropagate an RNN through T time steps
Recurrent Neural Networks (RNN) - correct answers✔✔h(t) = activation(U*input + V*h(t-1) + bias)
y(t) = activation(W*h(t) + bias)
- activation is typically the logistic function or tanh
- outputs can also simply be h(t)
- family of NN architectures for modeling sequences
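The two recurrence equations above translate directly into a forward pass. This is a minimal sketch with tanh as the activation; the dimensions and random weights are illustrative assumptions.

```python
import numpy as np

# Minimal vanilla RNN forward pass matching the equations above.
rng = np.random.default_rng(1)
d_in, d_h, d_out = 3, 4, 2                # assumed sizes for illustration
U = rng.normal(scale=0.5, size=(d_h, d_in))
V = rng.normal(scale=0.5, size=(d_h, d_h))
b_h = np.zeros(d_h)
W = rng.normal(scale=0.5, size=(d_out, d_h))
b_y = np.zeros(d_out)

def rnn_step(x, h_prev):
    h = np.tanh(U @ x + V @ h_prev + b_h)  # h(t) = tanh(U*x + V*h(t-1) + bias)
    y = np.tanh(W @ h + b_y)               # y(t) = tanh(W*h(t) + bias)
    return h, y

h = np.zeros(d_h)                          # initial state
for x in rng.normal(size=(5, d_in)):       # run over a 5-step sequence
    h, y = rnn_step(x, h)
```

Note the state h carries information forward: each step's output depends on the whole prefix of the sequence, which is what the MLP above cannot do.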
Training Vanilla RNNs difficulties - correct answers✔✔- Vanishing gradients
- Since dh(t)/dh(t-1) ∝ w, the gradient across t steps scales like w^t
- if w > 1: exploding gradients
- if w < 1: vanishing gradients
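The w^t behavior is easy to see numerically: multiplying the same per-step derivative across T unrolled steps either collapses toward zero or blows up. The values of w and T below are arbitrary choices for illustration.

```python
# One factor of w per unrolled time step, as in backprop through time.
T = 50
for w in (0.9, 1.1):
    grad = 1.0
    for _ in range(T):
        grad *= w                         # accumulate dh(t)/dh(t-1) factors
    print(f"w={w}: gradient factor after {T} steps = {grad:.4g}")
```

With w = 0.9 the factor is below 0.01 (vanishing); with w = 1.1 it exceeds 100 (exploding), motivating both truncated BPTT and the LSTM cell below.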
Long Short-Term Memory Network Gates and States - correct answers✔✔- f(t) = forget gate
- i(t) = input gate
- u(t) = candidate update gate
- o(t) = output gate
- c(t) = cell state
- c(t) = f(t) * c(t-1) + i(t) * u(t)
- h(t) = hidden state
- h(t) = o(t) * tanh(c(t))
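One LSTM step implementing the gate and state equations above might look like this. The random weight matrices and sizes are stand-ins; real implementations also add bias terms, omitted here for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
d_in, d_h = 3, 4                          # assumed sizes for illustration
Wf, Wi, Wu, Wo = (rng.normal(scale=0.5, size=(d_h, d_in + d_h)) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])       # gates see input and previous state
    f = sigmoid(Wf @ z)                   # forget gate f(t)
    i = sigmoid(Wi @ z)                   # input gate i(t)
    u = np.tanh(Wu @ z)                   # candidate update u(t)
    o = sigmoid(Wo @ z)                   # output gate o(t)
    c = f * c_prev + i * u                # c(t) = f(t)*c(t-1) + i(t)*u(t)
    h = o * np.tanh(c)                    # h(t) = o(t) * tanh(c(t))
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c)
```

The additive update of c(t) is what softens the w^t problem: when f(t) is near 1, the cell state passes gradients through largely unchanged.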
Perplexity(s) - correct answers✔✔= product( 1 / P(w(i) | w(i-1), ...) ) ^ (1/N)
= b ^ ( -1/N * sum( log_b( P(w(i) | w(i-1), ...) ) ) )
- note the exponent of b is the per-word CE loss
- the perplexity of a discrete uniform distribution over k events is k
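Both forms of the definition can be checked numerically; the per-word probabilities below are made up for illustration.

```python
import numpy as np

def perplexity_product(probs):
    # product( 1 / P(w_i | ...) ) ^ (1/N)
    return np.prod(1.0 / np.array(probs)) ** (1.0 / len(probs))

def perplexity_log(probs, base=2.0):
    # b ^ ( -1/N * sum log_b P(w_i | ...) ); the exponent is per-word CE loss
    return base ** (-np.mean(np.log(np.array(probs)) / np.log(base)))

print(perplexity_product([0.25] * 4))     # uniform over 4 events -> 4.0
```

As the card notes, a uniform distribution over k events has perplexity exactly k, and the two formulas agree for any probability sequence.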
Language Model Goal - correct answers✔✔- estimate the probability of sequences of words
- p(s) = p(w1, w2, ..., wn)
Masked Language Modeling - correct answers✔✔- pre-training task - an auxiliary task different from the final task we're really interested in, but which can help us achieve better performance by finding good initial parameters for the model
- By pre-training on masked language modeling before training on our final task, it is usually possible to obtain higher performance than by simply training on the final task
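Constructing a masked-LM training example amounts to hiding some tokens and keeping their identities as prediction targets. The sentence and the fixed mask positions below are made up; in practice positions are sampled randomly (commonly around 15% of tokens).

```python
# Sketch: build one masked-LM example from a token sequence.
tokens = "the model learns to fill in hidden words".split()
mask_positions = {1, 5}                   # in practice chosen at random
masked = ["[MASK]" if i in mask_positions else t
          for i, t in enumerate(tokens)]
targets = {i: tokens[i] for i in mask_positions}
print(masked)                             # masked input fed to the model
print(targets)                            # tokens the model must recover
```

The model is trained to predict each entry of `targets` from the masked input, which requires no labels beyond raw text, making this usable for large-scale pre-training.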
Knowledge Distillation to Reduce Model Sizes - correct answers✔✔- Have fully parameterized teacher model
- Have a much smaller student model
- Student model attempts to minimize prediction error and distance to teacher model simultaneously
L(dist) = CE between student and teacher predictions
L(student) = CE between predicted output and actual
L = alpha * L(dist) + beta * L(student)
Advantages:
- may work well because of the soft predictions of the teacher model
- if we don't have enough labeled text we can still train the student model to align predictions