NLP - N-Grams and Smoothing
N-gram - answer Literally a sequence of n tokens. It is used as a method of predicting
the next word in a sequence based on the previous n-1 tokens.
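As an illustration, here is a minimal sketch of extracting n-grams from a list of tokens (the function name and the whitespace tokenization are my own assumptions, not from the source):

    # Minimal sketch: extract all n-grams (tuples of n consecutive tokens).
    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "the cat sat on the mat".split()
    print(ngrams(tokens, 2))  # bigrams: [('the', 'cat'), ('cat', 'sat'), ...]
    print(ngrams(tokens, 3))  # trigrams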
Markov Assumption - answer The assumption that the probability of a word occurring
depends only on the previous word (or, for higher-order models, on the previous n-1
words), not on the entire preceding history.
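Written out for the bigram case (a standard statement of the assumption, not copied from the source):

    P(w_n | w_1 w_2 ... w_{n-1}) ≈ P(w_n | w_{n-1})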
Extrinsic evaluation - answer Evaluation of the performance of a language model by
embedding it in an application and measuring how much the application improves. This
is often the only way to know if a particular improvement in a component is really going
to help the task at hand.
Intrinsic evaluation - answer Evaluation of the performance of a language model that
measures the quality of a model independent of any application.
Test / training / dev set - answer Train the model on the training set, evaluate it with the
test set, and use the dev set for tuning so that you don't overfit to the test set.
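A minimal sketch of such a split (the 80/10/10 proportions, the placeholder corpus, and the variable names are assumptions for illustration):

    import random

    sentences = ["sentence %d" % i for i in range(1000)]   # placeholder corpus
    random.shuffle(sentences)
    n = len(sentences)
    train = sentences[:int(0.8 * n)]               # fit the model here
    dev = sentences[int(0.8 * n):int(0.9 * n)]     # tune and check for overfitting
    test = sentences[int(0.9 * n):]                # final evaluation only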
Maximum Likelihood Estimation - answer Estimates probabilities by counting the number
of times a feature (such as an n-gram) appears in the corpus and normalizing the counts
to values between 0 and 1.
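A minimal sketch of the MLE for bigram probabilities, counting and then normalizing (the toy sentence and function name are illustrative assumptions):

    from collections import Counter

    tokens = "the cat sat on the mat the cat".split()
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    unigram_counts = Counter(tokens)

    def p_mle(w_prev, w):
        # P(w | w_prev) = C(w_prev, w) / C(w_prev)
        return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

    print(p_mle("the", "cat"))  # 2/3 for this toy corpus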
Perplexity - answer A metric that describes how surprised a model is by a test sequence;
also a measure of how well the probability distribution predicts a sample. Lower
perplexity is generally better. Mathematically, it is the N-th root of the inverse probability
of the test set, where N is the number of tokens. The formula is as follows:
PP(W) = P(w_1 w_2 ... w_N)^(-1/N)
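As a concrete sketch, assuming a toy model that just supplies per-token probabilities (the numbers are made up for illustration), perplexity is the geometric mean of the inverse probabilities; logs are used to avoid numerical underflow:

    import math

    # Per-token probabilities P(w_i | history) assigned to a toy test sequence.
    token_probs = [0.2, 0.1, 0.25, 0.05]

    N = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)   # log P(w_1 ... w_N)
    perplexity = math.exp(-log_prob / N)               # P(w_1 ... w_N)^(-1/N)
    print(perplexity)                                  # ~8.0 for these values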
Relative Frequency - answer The number of appearances of a sequence divided by the
number of appearances of its prefix.
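In symbols, for the bigram case (a standard way of writing this, matching the MLE sketch above):

    P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})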
Smoothing - answer Since n-gram probabilities are multiplied together, a single
zero-probability event will zero out an entire chain. To avoid this, we apply smoothing, a
slight adjustment of the counts or probability estimates that prevents a language model
from assigning a probability of zero to any event. The simplest technique is Laplace
(add-one) smoothing, which merely adds one to each n-gram count.
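A minimal sketch of add-one (Laplace) smoothing for bigram probabilities; V is the vocabulary size and is added to the denominator so the conditional distribution still sums to 1 (the toy corpus and names are illustrative assumptions):

    from collections import Counter

    tokens = "the cat sat on the mat the cat".split()
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    unigram_counts = Counter(tokens)
    V = len(unigram_counts)  # vocabulary size

    def p_laplace(w_prev, w):
        # Add one to every bigram count; add V to the denominator so the
        # smoothed conditional probabilities still sum to 1.
        return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

    print(p_laplace("the", "cat"))  # seen bigram: (2+1)/(3+5)
    print(p_laplace("cat", "on"))   # unseen bigram: non-zero instead of 0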
Backoff - answer The process of deciding which order of n-gram to use for a prediction,
determined by the largest n-gram with sufficient evidence. If an n-gram has zero
evidence (a zero count), we back off to the next largest ((n-1)-gram) and try again.
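A minimal sketch of the backoff idea: fall from trigram to bigram to unigram when the higher-order count is zero. This is a simplification; a real scheme such as Katz backoff also discounts and redistributes probability mass, which is omitted here.

    from collections import Counter

    tokens = "the cat sat on the mat the cat sat".split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

    def p_backoff(w1, w2, w3):
        # Use the highest-order n-gram that has actually been observed.
        if trigrams[(w1, w2, w3)] > 0:
            return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
        if bigrams[(w2, w3)] > 0:
            return bigrams[(w2, w3)] / unigrams[w2]
        return unigrams[w3] / sum(unigrams.values())

    print(p_backoff("the", "cat", "sat"))  # trigram evidence exists
    print(p_backoff("mat", "the", "sat"))  # backs off to lower orders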
Interpolation - answer An alternative to backoff that combines the estimates from every
n-gram order via a linear combination whose weights (constants) sum to 1.
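A minimal sketch of simple linear interpolation over unigram, bigram, and trigram estimates; the lambda weights here are made-up constants that sum to 1, and in practice they would be tuned on the dev set:

    from collections import Counter

    tokens = "the cat sat on the mat the cat sat".split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

    def p_interpolated(w1, w2, w3, lambdas=(0.5, 0.3, 0.2)):
        # Linear combination of trigram, bigram, and unigram MLE estimates;
        # the weights sum to 1 so the result is still a probability.
        l3, l2, l1 = lambdas
        p_tri = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
        p_bi = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0
        p_uni = unigrams[w3] / sum(unigrams.values())
        return l3 * p_tri + l2 * p_bi + l1 * p_uni

    print(p_interpolated("the", "cat", "sat"))  # 0.5*1.0 + 0.3*(2/2) + 0.2*(2/9)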