COS4861
ASSIGNMENT 3 2025
UNIQUE NO.
DUE DATE: 10 SEPTEMBER 2025
Question 1 [12 points] – Theory
1. What is a corpus and how does it differ from other data types? (2)
A corpus is a large, structured, machine-readable collection of texts that is
systematically compiled for linguistic or natural language processing (NLP) research
(Meyer, 2021). Unlike ordinary datasets (e.g., spreadsheets or numerical data), a
corpus contains raw or annotated natural language data, enabling analysis of patterns
in language use.
2. What is the technical term for splitting a corpus into different linguistic units
such as paragraphs, sentences, and words in NLP? (1)
The process is called tokenization (Jurafsky & Martin, 2023).
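As a rough illustration of tokenization, the sketch below splits a short text into sentences and then into words using regular expressions. The helper names split_sentences and split_words, the sample text, and the splitting rules are assumptions made for this example; real tokenizers handle abbreviations, numbers, and punctuation far more carefully.

```python
import re

def split_sentences(text):
    # Naive rule (an assumption): a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    # Words are runs of letters, digits, or apostrophes; other punctuation is dropped.
    return re.findall(r"[A-Za-z0-9']+", sentence.lower())

text = "Data is noisy. Models must cope!"
sentences = split_sentences(text)
tokens = [split_words(s) for s in sentences]

print(sentences)
print(tokens)
```

The same two-level idea extends to paragraphs, for instance by first splitting the raw text on blank lines.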
3. Define N-grams and provide references. (2)
An N-gram is a contiguous sequence of N items (characters, syllables, or words) from
a given text or speech sample (Manning & Schütze, 1999). For example, in the sentence
“data is noisy”:
Unigrams = [“data”, “is”, “noisy”]
Bigrams = [“data is”, “is noisy”]
Trigrams = [“data is noisy”]
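The example above can be reproduced with a few lines of Python. This is a minimal sketch that assumes whitespace tokenization; the function name ngrams is chosen for illustration only.

```python
def ngrams(tokens, n):
    # Slide a window of width n over the token list and join each window into a string.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "data is noisy".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A sentence of k tokens yields k - N + 1 N-grams, so the three-word sentence gives three unigrams, two bigrams, and one trigram.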
4. Describe the problem of data sparseness with regard to an N-gram model.
Explain smoothing and name two algorithms. (7)
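Data sparseness problem: In N-gram models, many possible word
combinations never appear in the training corpus, so their estimated probability
is zero even when the sequence is valid. Any sentence containing such an unseen
N-gram is then also assigned zero probability, which means the model generalizes
poorly to new text (Jurafsky & Martin, 2023).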
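To make the sparseness problem concrete, the sketch below estimates bigram probabilities by relative frequency (maximum likelihood) on a two-sentence toy corpus. The corpus and the function name mle_bigram_prob are made up for this illustration. The bigram "the noisy" is perfectly valid English but never occurs in the toy data, so its estimate is zero.

```python
from collections import Counter

# Toy training corpus (hypothetical data, chosen only for this example).
corpus = [
    ["the", "data", "is", "noisy"],
    ["the", "text", "is", "sparse"],
]

# Count each bigram and how often each word occurs as a bigram context.
bigram_counts = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
context_counts = Counter(a for sent in corpus for a in sent[:-1])

def mle_bigram_prob(prev, word):
    # Relative-frequency estimate: count(prev, word) / count(prev as context).
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

seen = mle_bigram_prob("the", "data")     # bigram observed in training
unseen = mle_bigram_prob("the", "noisy")  # valid bigram, never observed

print(seen, unseen)
```

Because sentence probabilities are products of such terms, a single unseen bigram drives the whole product to zero.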
Smoothing: A statistical technique that adjusts raw frequency counts to avoid
assigning zero probability to unseen events.
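As one possible illustration of this adjustment, the sketch below applies add-one (Laplace) smoothing to bigram counts from a toy corpus. The choice of add-one smoothing, the corpus, and the function name add_one_bigram_prob are assumptions made for this example. Every bigram count is increased by one, and the denominator grows by the vocabulary size V so that the probabilities for each context still sum to one.

```python
from collections import Counter

# Toy training corpus (hypothetical data, chosen only for this example).
corpus = [
    ["the", "data", "is", "noisy"],
    ["the", "text", "is", "sparse"],
]

vocab = sorted({w for sent in corpus for w in sent})
bigram_counts = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
context_counts = Counter(a for sent in corpus for a in sent[:-1])

def add_one_bigram_prob(prev, word):
    # Add-one estimate: (count(prev, word) + 1) / (count(prev as context) + V).
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))

unseen = add_one_bigram_prob("the", "noisy")  # no longer zero
seen = add_one_bigram_prob("the", "data")     # discounted relative to the raw estimate
total = sum(add_one_bigram_prob("the", w) for w in vocab)

print(unseen, seen, total)
```

The unseen bigram now receives a small non-zero probability, paid for by lowering the estimates of the bigrams that were actually observed.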
Two smoothing algorithms: