ASSIGNMENT 3 2025
UNIQUE NO.
DUE DATE: 10 SEPTEMBER 2025
Natural Language Processing
Question 1 – Theory (12)
1) What is a corpus, and how does it differ from other data types? (2)
A corpus is a large, structured collection of authentic texts (written, spoken, or
transcribed) compiled to support empirical language study and NLP modeling. Unlike
generic datasets (e.g., numeric sensor tables), corpora preserve linguistic form and
sequence (tokens, order, sentence boundaries, discourse) and are often annotated
(e.g., POS tags), enabling probabilistic language models and linguistic analysis
(Jurafsky & Martin, 2023; McEnery & Hardie, 2012).
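To make the contrast concrete, a minimal Python sketch follows (the sentences, tags, and feature rows are invented for illustration): a corpus preserves token order and can carry annotations such as POS tags, whereas a generic dataset is a set of unordered feature rows.

```python
# Illustrative toy data only: a tiny POS-annotated corpus vs. a generic
# numeric dataset.
corpus = [
    # each sentence is an ordered list of (token, POS tag) pairs, keeping
    # linguistic form, word order, and sentence boundaries intact
    [("The", "DT"), ("cat", "NN"), ("sat", "VBD"), (".", ".")],
    [("Cats", "NNS"), ("sleep", "VBP"), (".", ".")],
]
generic_dataset = [
    # unordered feature rows: no sequence or structure to exploit
    {"temperature": 21.3, "humidity": 0.44},
    {"temperature": 19.8, "humidity": 0.51},
]
for sentence in corpus:
    print(" ".join(f"{token}/{tag}" for token, tag in sentence))
# The/DT cat/NN sat/VBD ./.
# Cats/NNS sleep/VBP ./.
```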
2) Technical term for splitting a corpus into linguistic units (1)
Tokenization (and, more broadly, segmentation) — e.g., sentence segmentation
and word tokenization (Manning et al., 2008; Jurafsky & Martin, 2023).
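A minimal sketch of both steps using naive regular-expression rules (the rules and example text are invented for illustration; production tokenizers handle abbreviations, clitics, and many other edge cases):

```python
import re

def segment_sentences(text):
    # naive rule: a sentence ends at ., ! or ? followed by whitespace
    return re.split(r"(?<=[.!?])\s+", text.strip())

def tokenize_words(sentence):
    # a token is a run of word characters or a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Tokenization splits text into units. It looks simple, but it is not!"
for sent in segment_sentences(text):
    print(tokenize_words(sent))
# ['Tokenization', 'splits', 'text', 'into', 'units', '.']
# ['It', 'looks', 'simple', ',', 'but', 'it', 'is', 'not', '!']
```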
3) Define N-grams with peer-reviewed references (2)
An N-gram is a contiguous sequence of N items from a text, where the items are
words (word N-grams) or characters (character N-grams). An N-gram language
model approximates $P(w_t \mid w_{t-(N-1)}, \dots, w_{t-1})$ using observed
counts in a corpus (Shannon, 1948; Chen & Goodman, 2003; Kneser & Ney, 1995).
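As a worked example, here is a minimal bigram (N = 2) model with MLE estimates on a toy corpus (the sentence is invented for illustration):

```python
from collections import Counter

tokens = "the cat sat on the mat".split()

def ngrams(seq, n):
    # all contiguous length-n subsequences, per the definition above
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

bigram_counts = Counter(ngrams(tokens, 2))
context_counts = Counter(tokens[:-1])  # occurrences of w_{t-1} as a context

def p_mle(w_prev, w):
    # MLE estimate: count(w_prev, w) / count(w_prev)
    return bigram_counts[(w_prev, w)] / context_counts[w_prev]

print(p_mle("the", "cat"))  # 0.5 -- "the" occurs twice, "the cat" once
```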
4) Data sparseness in N-gram models; smoothing; name two algorithms (7)
Data sparseness arises because even very large corpora contain only a small
fraction of all possible N-grams, and the problem worsens as N grows. Raw
Maximum Likelihood Estimation (MLE) therefore assigns zero probability to any
N-gram unseen in training, making models brittle, as the sketch below shows.
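A minimal continuation of the bigram sketch above on the same toy corpus, with add-one (Laplace) smoothing shown only as the simplest baseline fix (it is not one of the two stronger algorithms named below):

```python
from collections import Counter

tokens = "the cat sat on the mat".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])
V = len(set(tokens))  # vocabulary size, needed for add-one smoothing

def p_mle(w_prev, w):
    return bigram_counts[(w_prev, w)] / context_counts[w_prev]

def p_laplace(w_prev, w):
    # every bigram count is incremented by 1, so unseen events retain a
    # small, non-zero share of the probability mass
    return (bigram_counts[(w_prev, w)] + 1) / (context_counts[w_prev] + V)

print(p_mle("the", "dog"))      # 0.0    -- unseen bigram: MLE gives zero
print(p_laplace("the", "dog"))  # ~0.143 -- smoothing avoids the zero
```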
Smoothing redistributes some probability mass from seen to unseen events to avoid
zeros and improve generalization. Two well-established algorithms are: