COS4861 – Assignment 3 (2025)
Working Towards Encoding Systems in NLP
Unique No:
Due: 10 September 2025
Question 1 — Theory (12)
1.1 What is a corpus, and how does it differ from other data types? (2) A corpus is
a structured and purposefully compiled collection of natural language material—either
written texts or transcribed speech—used for linguistic and NLP research. Unlike
ordinary datasets such as numerical tables or sensor logs, a corpus retains essential
linguistic features including tokens, sentences, documents, genres, and metadata (e.g.,
author, date, and register). This enables the study of language-specific phenomena
such as vocabulary distribution, syntax, and semantic patterns. For this assignment, the
dataset provided was a small English corpus on smoothing algorithms.
1.2 Technical term for splitting a corpus into paragraphs/sentences/words (1) This
process is called tokenization (for words) and sentence segmentation (for boundaries
between sentences). Together, they are key preprocessing steps in NLP.
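As a brief, optional illustration of these two steps, the snippet below applies NLTK's sent_tokenize and word_tokenize to an invented sentence; it assumes the Punkt tokenizer data has already been downloaded.

```python
# Illustrative sketch: sentence segmentation and word tokenization with NLTK.
# Assumes the Punkt tokenizer data is installed, e.g. via nltk.download('punkt').
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Smoothing handles unseen N-grams. It reallocates probability mass."
sentences = sent_tokenize(text)                      # sentence segmentation
tokens = [word_tokenize(s) for s in sentences]       # word-level tokenization

print(sentences)
print(tokens)
```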
1.3 Define N-grams and give peer-reviewed references (2) An N-gram is a
continuous sequence of N items—such as words, subwords, or characters—taken from
a text. N-gram language models estimate conditional probabilities of the form:
$$P(w_i \mid w_{i-N+1}^{i-1})$$
Early influential work includes Brown et al. (1992), which introduced class-based N-gram models, and subsequent comparative studies such as Chen & Goodman (1999), which evaluated different smoothing strategies and demonstrated the importance of N-grams in statistical language modelling.
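As a small illustration of the definition (not drawn from the cited works), the sketch below extracts and counts N-grams from an invented token sequence in plain Python.

```python
# Illustrative sketch: extracting and counting N-grams from a token sequence.
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous N-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["smoothing", "improves", "n-gram", "language", "models"]   # invented example
print(ngrams(tokens, 2))        # bigrams
print(ngrams(tokens, 3))        # trigrams

# Counting N-grams is the first step towards estimating P(w_i | w_{i-N+1}^{i-1}).
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(3))
```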
1.4 Data sparseness in N-gram models; what is smoothing? Name two algorithms
(7) Because language has an extremely large combinatorial space, many valid N-grams
never appear in a given training set. Maximum likelihood estimation (MLE) assigns zero
probability to such unseen sequences and disproportionately favours frequent ones,
creating a data sparsity problem that weakens predictive power.
Smoothing addresses this by reallocating some probability mass from observed N-grams to unseen ones, ensuring more robust generalisation.
Two important smoothing algorithms are:
- Katz back-off: reduces counts for observed N-grams and falls back to lower-order distributions when higher-order contexts are sparse.
- Modified Kneser–Ney: employs absolute discounting combined with continuation probabilities, and is widely regarded as one of the most effective smoothing methods in practice (a minimal sketch appears after this answer).
Another well-known approach is Good–Turing discounting, which adjusts low-frequency counts to better estimate probabilities for unseen events.
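The sketch below is a minimal, illustrative implementation of interpolated Kneser–Ney for bigrams, not the modified variant named above: it uses a single discount d rather than count-dependent discounts, and the toy corpus is invented.

```python
# Minimal sketch of interpolated Kneser-Ney smoothing for bigrams
# (single discount d; the "modified" variant uses several discounts).
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, d=0.75):
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    history_counts = Counter(tokens[:-1])              # C(h): how often each history occurs
    continuation = defaultdict(set)                    # distinct contexts each word follows
    followers = defaultdict(set)                       # distinct words following each history
    for h, w in bigram_counts:
        continuation[w].add(h)
        followers[h].add(w)
    total_bigram_types = len(bigram_counts)

    def prob(h, w):
        c_h = history_counts[h]
        p_cont = len(continuation[w]) / total_bigram_types   # continuation probability
        if c_h == 0:                                   # unseen history: continuation prob only
            return p_cont
        lam = d * len(followers[h]) / c_h              # interpolation weight lambda(h)
        return max(bigram_counts[(h, w)] - d, 0) / c_h + lam * p_cont

    return prob

tokens = "the cat sat on the mat the cat ate".split()
p = kneser_ney_bigram(tokens)
print(p("the", "cat"))   # seen bigram: high probability
print(p("the", "ate"))   # unseen bigram: still receives non-zero mass
```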
Question 2 — Applications & Code Concepts (13)
2.1 How MLE causes data sparseness issues in unsmoothed N-grams (3) Under
MLE, probabilities are calculated as:
$$P(w_i \mid h) = \frac{C(h, w_i)}{C(h)}$$
where $C(h, w_i)$ is the joint count of the history $h$ and word $w_i$. If a plausible word never appears in the training corpus ($C(h, w_i) = 0$), its probability becomes zero. Since natural language has a long tail of rare events, this leads to many zero-probability cases, making predictions unreliable and increasing perplexity.
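The toy example below makes this concrete: an unsmoothed MLE bigram model built from an invented six-word corpus returns zero probability for any bigram it has never seen.

```python
# Sketch: unsmoothed MLE bigram probabilities assign zero to unseen events.
from collections import Counter

tokens = "the cat sat on the mat".split()            # invented toy corpus
bigram_counts = Counter(zip(tokens, tokens[1:]))
history_counts = Counter(tokens[:-1])

def p_mle(h, w):
    """P(w | h) = C(h, w) / C(h); zero when the bigram was never observed."""
    return bigram_counts[(h, w)] / history_counts[h] if history_counts[h] else 0.0

print(p_mle("the", "cat"))   # 0.5  (seen bigram)
print(p_mle("the", "dog"))   # 0.0  (plausible but unseen -> zero probability)
```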
2.2 Is Laplace (add-one) smoothing good enough for modern N-gram models?
Explain how it works and its effect (4) Laplace smoothing adjusts MLE by adding one
to every count:
$$P(w_i \mid h) = \frac{C(h, w_i) + 1}{C(h) + V}$$
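A minimal sketch of this estimator, applied to the same kind of invented toy bigram counts as in 2.1 (with $V$ taken as the vocabulary size), is shown below.

```python
# Sketch: add-one (Laplace) smoothing of the bigram MLE estimate.
from collections import Counter

tokens = "the cat sat on the mat".split()            # invented toy corpus
V = len(set(tokens))                                 # vocabulary size V
bigram_counts = Counter(zip(tokens, tokens[1:]))
history_counts = Counter(tokens[:-1])

def p_laplace(h, w):
    """P(w | h) = (C(h, w) + 1) / (C(h) + V)."""
    return (bigram_counts[(h, w)] + 1) / (history_counts[h] + V)

print(p_laplace("the", "cat"))   # seen bigram: probability shrinks relative to MLE
print(p_laplace("the", "sat"))   # unseen bigram: now receives non-zero probability
```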