COS4861 ASSIGNMENT 3 2025
WORKING TOWARDS ENCODING SYSTEMS IN NLP
DUE: 10 SEPTEMBER 2025 MARKS: 65
Question 1 — Theory (12)
1.1) What is a corpus, and how does it differ from other data types? (2)
A corpus is a large, curated collection of natural-language text or speech transcripts
organised for linguistic or NLP analysis. Unlike generic datasets (e.g., numeric sensor
tables), a corpus preserves linguistic structure (tokens, sentences, documents, genres)
and metadata such as source, date, and register, so that we can model language
phenomena like vocabulary, syntax, and usage patterns. In this assignment we were
given a small English text corpus about smoothing algorithms to use for all tasks.
1.2) Technical term for splitting a corpus into paragraphs/sentences/words (1)
This process is called tokenization (word tokenization) and sentence segmentation
(sentence boundary detection). Together these are standard text preprocessing steps.
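As a minimal sketch of these two preprocessing steps, the following (illustrative, not the prescribed assignment method) uses simple regular expressions; the function names and the example sentence are my own, and a real pipeline would typically use a library tokenizer instead:

```python
import re

def segment_sentences(text):
    # Naive sentence boundary detection: split after ., ! or ?
    # when followed by whitespace and a capital letter.
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())

def tokenize(sentence):
    # Word tokenization: a word (allowing an internal apostrophe)
    # or any single punctuation mark becomes one token.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

text = "Tokenization splits text into words. Segmentation finds boundaries!"
sentences = segment_sentences(text)
tokens = [tokenize(s) for s in sentences]
# tokens[0] -> ['Tokenization', 'splits', 'text', 'into', 'words', '.']
```

Such rules fail on abbreviations like "Dr." or "e.g.", which is why production systems use trained sentence segmenters.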
1.3) Define N-grams and give peer-reviewed references (2)
An N-gram is a contiguous sequence of N items (characters or words) from a text;
N-gram language models estimate P(w_i | w_{i−N+1}^{i−1}) from counts. Foundational
peer-reviewed sources include Brown et al. (1992), who develop class-based N-gram
models for predicting the next word, and later comparative studies showing their
centrality in language modelling.
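As a hedged sketch of how those counts turn into probabilities, the bigram (N = 2) maximum-likelihood estimate P(w_i | w_{i−1}) = count(w_{i−1}, w_i) / count(w_{i−1}) can be computed like this; the toy corpus and function name are illustrative, not from the assignment:

```python
from collections import Counter

def bigram_mle(tokens):
    # MLE for a bigram model:
    # P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(prev, w): c / unigrams[prev] for (prev, w), c in bigrams.items()}

corpus = "the cat sat on the mat the cat ran".split()
probs = bigram_mle(corpus)
# "the" occurs 3 times, "the cat" occurs 2 times, so
# probs[('the', 'cat')] == 2/3
```

Any bigram absent from the training text gets no entry at all, i.e. probability zero under MLE, which is exactly the sparseness problem smoothing addresses in the next question.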
1.4) Data sparseness in N-gram models; what is smoothing? Name two
algorithms (7)
Because natural language is combinatorially large, many plausible N-grams are unseen
in training. Maximum-likelihood estimates (MLE) assign probability zero to unseen