COS4861 Assignment 3 | EXPERTLY DETAILED ANSWERS | DUE 10 September 2025

COS4861 Assignment 3 | EXPERTLY DETAILED ANSWERS | DUE 10 September 2025. Note 1: This assignment is designed to make you understand the fundamentals of corpus-based Natural Language Processing (NLP) and the techniques applied for pre-processing, analysing, and generating insights from text, such as tokenization and creating encoding systems. This is in no way a definitive list of examples, but the basic components you need to get started.

Content preview

COS4861
Assignment 3
DUE 10 September 2025


WORKING TOWARDS ENCODING SYSTEMS IN NLP

DUE: 10 SEPTEMBER 2025 MARKS: 65




Question 1 — Theory (12)

1.1) What is a corpus, and how does it differ from other data types? (2)

A corpus is a large, curated collection of natural-language text or speech transcripts organised for linguistic or NLP analysis. Unlike generic datasets (e.g., numeric sensor tables), a corpus preserves linguistic structure (tokens, sentences, documents, genres) and metadata (such as source, date, and register) so that we can model language phenomena like vocabulary, syntax, and usage patterns. In this assignment we were given a small English text corpus about smoothing algorithms to use for all tasks.

1.2) Technical term for splitting a corpus into paragraphs/sentences/words (1)

This process is called tokenization (word tokenization) and sentence segmentation
(sentence boundary detection). Together these are standard text preprocessing steps.
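
As an illustration of these two steps, here is a minimal sketch using plain regular expressions (in practice a library such as NLTK or spaCy would normally be used; the sample text below is an invented example, not taken from the assignment corpus):

```python
import re

text = ("Smoothing assigns probability mass to unseen events. "
        "Add-one smoothing is the simplest case.")

# Sentence segmentation: split after sentence-final punctuation followed by whitespace.
# (Real segmenters also handle abbreviations, decimals, quotations, etc.)
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Word tokenization: lowercase each sentence and extract word-like units.
# Note that this simple pattern splits "add-one" into "add" and "one".
tokens = [re.findall(r"[a-z0-9']+", sentence.lower()) for sentence in sentences]

print(sentences)  # two sentences
print(tokens)     # a list of token lists, one per sentence
```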

1.3) Define N-grams and give peer-reviewed references (2)

An N-gram is a contiguous sequence of N items (characters or words) from a text; N-gram language models estimate P(w_i | w_{i-N+1}, ..., w_{i-1}) from counts. Foundational peer-reviewed sources include Brown et al. (1992), who develop class-based N-gram models for predicting the next word, and later comparative studies showing their centrality in language modelling.
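
As a concrete illustration, a short sketch (using a toy token list, not the assignment corpus) of how bigram counts give the maximum-likelihood estimate P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}):

```python
from collections import Counter

# Toy token stream; in practice these come from the tokenized corpus.
tokens = ["the", "model", "assigns", "the", "probability", "the", "model", "needs"]

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_mle(prev, word):
    """MLE estimate P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(bigram_mle("the", "model"))  # 2/3, since ("the", "model") occurs twice and "the" three times
print(bigram_mle("the", "needs"))  # 0.0 -> an unseen bigram gets zero probability under MLE
```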

1.4) Data sparseness in N-gram models; what is smoothing? Name two algorithms (7)

Because natural language is combinatorially large, many plausible N-grams are unseen in training. Maximum-likelihood estimates (MLE) assign probability zero to unseen N-grams.
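
To make the zero-probability problem concrete, here is a minimal sketch of add-one (Laplace) smoothing for bigrams, one standard smoothing technique; the token list and vocabulary are toy assumptions, not the assignment corpus. The smoothed estimate is P(w_i | w_{i-1}) = (count(w_{i-1}, w_i) + 1) / (count(w_{i-1}) + V), where V is the vocabulary size.

```python
from collections import Counter

# Same toy token stream as above; V is the number of distinct word types.
tokens = ["the", "model", "assigns", "the", "probability", "the", "model", "needs"]
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)  # 5 distinct word types

def bigram_add_one(prev, word):
    """Add-one (Laplace) smoothed estimate: (count(prev, word) + 1) / (count(prev) + V)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

# The unseen bigram ("the", "needs") now receives a small non-zero probability
# instead of the zero assigned by the MLE.
print(bigram_add_one("the", "model"))  # (2 + 1) / (3 + 5) = 0.375
print(bigram_add_one("the", "needs"))  # (0 + 1) / (3 + 5) = 0.125
```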
