COS4861 – Assignment 3 (2025)
Working Towards Encoding Systems in NLP
Unique No:
Due: 10 September 2025
Question 1 — Theory (12)
1.1 What is a corpus, and how does it differ from other data types? (2) A corpus is
a structured and purposefully compiled collection of natural language material—either
written texts or transcribed speech—used for linguistic and NLP research. Unlike
ordinary datasets such as numerical tables or sensor logs, a corpus retains essential
linguistic features including tokens, sentences, documents, genres, and metadata (e.g.,
author, date, and register). This enables the study of language-specific phenomena
such as vocabulary distribution, syntax, and semantic patterns. For this assignment, the
dataset provided was a small English corpus on smoothing algorithms.
1.2 Technical term for splitting a corpus into paragraphs/sentences/words (1) This
process is called tokenization (for words) and sentence segmentation (for boundaries
between sentences). Together, they are key preprocessing steps in NLP.
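As a brief, optional illustration of these two steps, the snippet below applies NLTK's sent_tokenize and word_tokenize to an invented sentence; it assumes the Punkt tokenizer data has already been downloaded.

```python
# Illustrative sketch: sentence segmentation and word tokenization with NLTK.
# Assumes the Punkt tokenizer data is installed, e.g. via nltk.download('punkt').
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Smoothing handles unseen N-grams. It reallocates probability mass."
sentences = sent_tokenize(text)                      # sentence segmentation
tokens = [word_tokenize(s) for s in sentences]       # word-level tokenization

print(sentences)
print(tokens)
```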
1.3 Define N-grams and give peer-reviewed references (2) An N-gram is a
continuous sequence of N items—such as words, subwords, or characters—taken from
a text. N-gram language models estimate conditional probabilities of the form:
$$P(w_i \mid w_{i-N+1}^{i-1})$$
Early influential work includes Brown et al. (1992), which introduced class-based N-gram models, and subsequent comparative studies such as Chen & Goodman (1999), which evaluated different smoothing strategies and demonstrated the importance of N-grams in statistical language modelling.
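As a small illustration of the definition (not drawn from the cited works), the sketch below extracts and counts N-grams from an invented token sequence in plain Python.

```python
# Illustrative sketch: extracting and counting N-grams from a token sequence.
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous N-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["smoothing", "improves", "n-gram", "language", "models"]   # invented example
print(ngrams(tokens, 2))        # bigrams
print(ngrams(tokens, 3))        # trigrams

# Counting N-grams is the first step towards estimating P(w_i | w_{i-N+1}^{i-1}).
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(3))
```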
1.4 Data sparseness in N-gram models; what is smoothing? Name two algorithms
(7) Because language has an extremely large combinatorial space, many valid N-grams
never appear in a given training set. Maximum likelihood estimation (MLE) assigns zero
probability to such unseen sequences and disproportionately favours frequent ones,
creating a data sparsity problem that weakens predictive power.
Smoothing addresses this by reallocating some probability mass from observed N-grams to unseen ones, ensuring more robust generalisation.
Two important smoothing algorithms are:
- Katz back-off: reduces counts for observed N-grams and falls back to lower-order distributions when higher-order contexts are sparse.
- Modified Kneser–Ney: employs absolute discounting combined with continuation probabilities, and is widely regarded as one of the most effective smoothing methods in practice (a minimal sketch appears after this answer).
Another well-known approach is Good–Turing discounting, which adjusts low-frequency counts to better estimate probabilities for unseen events.
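The sketch below is a minimal, illustrative implementation of interpolated Kneser–Ney for bigrams, not the modified variant named above: it uses a single discount d rather than count-dependent discounts, and the toy corpus is invented.

```python
# Minimal sketch of interpolated Kneser-Ney smoothing for bigrams
# (single discount d; the "modified" variant uses several discounts).
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, d=0.75):
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    history_counts = Counter(tokens[:-1])              # C(h): how often each history occurs
    continuation = defaultdict(set)                    # distinct contexts each word follows
    followers = defaultdict(set)                       # distinct words following each history
    for h, w in bigram_counts:
        continuation[w].add(h)
        followers[h].add(w)
    total_bigram_types = len(bigram_counts)

    def prob(h, w):
        c_h = history_counts[h]
        p_cont = len(continuation[w]) / total_bigram_types   # continuation probability
        if c_h == 0:                                   # unseen history: continuation prob only
            return p_cont
        lam = d * len(followers[h]) / c_h              # interpolation weight lambda(h)
        return max(bigram_counts[(h, w)] - d, 0) / c_h + lam * p_cont

    return prob

tokens = "the cat sat on the mat the cat ate".split()
p = kneser_ney_bigram(tokens)
print(p("the", "cat"))   # seen bigram: high probability
print(p("the", "ate"))   # unseen bigram: still receives non-zero mass
```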
Question 2 — Applications & Code Concepts (13)
2.1 How MLE causes data sparseness issues in unsmoothed N-grams (3) Under
MLE, probabilities are calculated as:
$$P(w_i \mid h) = \frac{C(h, w_i)}{C(h)}$$
where $C(h, w_i)$ is the joint count of the history $h$ and word $w_i$. If a plausible word never appears in the training corpus ($C(h, w_i) = 0$), its probability becomes zero. Since natural language has a long tail of rare events, this leads to many zero-probability cases, making predictions unreliable and increasing perplexity.
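The toy example below makes this concrete: an unsmoothed MLE bigram model built from an invented six-word corpus returns zero probability for any bigram it has never seen.

```python
# Sketch: unsmoothed MLE bigram probabilities assign zero to unseen events.
from collections import Counter

tokens = "the cat sat on the mat".split()            # invented toy corpus
bigram_counts = Counter(zip(tokens, tokens[1:]))
history_counts = Counter(tokens[:-1])

def p_mle(h, w):
    """P(w | h) = C(h, w) / C(h); zero when the bigram was never observed."""
    return bigram_counts[(h, w)] / history_counts[h] if history_counts[h] else 0.0

print(p_mle("the", "cat"))   # 0.5  (seen bigram)
print(p_mle("the", "dog"))   # 0.0  (plausible but unseen -> zero probability)
```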
2.2 Is Laplace (add-one) smoothing good enough for modern N-gram models?
Explain how it works and its effect (4) Laplace smoothing adjusts MLE by adding one
to every count:
$$P(w_i \mid h) = \frac{C(h, w_i) + 1}{C(h) + V}$$
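A minimal sketch of this estimator, applied to the same kind of invented toy bigram counts as in 2.1 (with $V$ taken as the vocabulary size), is shown below.

```python
# Sketch: add-one (Laplace) smoothing of the bigram MLE estimate.
from collections import Counter

tokens = "the cat sat on the mat".split()            # invented toy corpus
V = len(set(tokens))                                 # vocabulary size V
bigram_counts = Counter(zip(tokens, tokens[1:]))
history_counts = Counter(tokens[:-1])

def p_laplace(h, w):
    """P(w | h) = (C(h, w) + 1) / (C(h) + V)."""
    return (bigram_counts[(h, w)] + 1) / (history_counts[h] + V)

print(p_laplace("the", "cat"))   # seen bigram: probability shrinks relative to MLE
print(p_laplace("the", "sat"))   # unseen bigram: now receives non-zero probability
```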