Exam (elaborations)

COS4861 Assignment 3 (COMPLETE ANSWERS) 2025 - DUE 10 September 2025; 100% correct solutions and explanations.

Pages
20
Grade
A+
Uploaded on
06-09-2025
Written in
2025/2026


Document information

Uploaded on
September 6, 2025
Number of pages
20
Written in
2025/2026
Type
Exam (elaborations)
Contains
Questions & answers

Content preview

Working towards encoding systems in NLP.

Due date: 10 September 2025
Year: 2025
You will learn how to:
- define various encoding techniques (N-grams, ) and smoothing algorithms
- build tokenizers and N-gram models

Note 1: This assignment is designed to help you understand the fundamentals of
corpus-based Natural Language Processing (NLP) and various techniques applied for
preprocessing, analysing, and generating insights from text, such as word clouds,
tokenization, and creating encoding systems. This is in no way a definitive list of examples,
only the basic components you need to get started.


Question 1 — Theory (12 points)
1) What is a corpus and how does it differ from other
data types? (2)
A corpus is a structured body (collection) of natural
language text used for linguistic or NLP analysis. A
corpus is explicitly assembled to represent language use
(e.g., news text, scientific articles, transcribed speech)
and typically annotated or preprocessed for analysis. It
differs from other data types (images, tabular sensor
data, audio without transcription) in that its primary unit
is textual linguistic data (tokens, sentences, documents)
and analyses focus on linguistic phenomena (syntax,
semantics, frequencies, collocations, etc.).
2) Technical term for splitting a corpus into
paragraphs, sentences, words in NLP (1)
That process is called tokenization (with sentence
segmentation / sentence boundary detection and
word/token segmentation as sub-tasks).
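
As a rough illustration of the two sub-tasks named above, the following Python sketch splits text into sentences and then into word tokens using regular expressions. The function names `split_sentences` and `tokenize_words` and the sample text are assumptions made for this example; real tokenizers handle abbreviations, numbers and quotes far more carefully.

```python
import re

def split_sentences(text):
    """Naive sentence segmentation: split on whitespace that follows ., ! or ?"""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize_words(sentence):
    """Naive word tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

# Hypothetical sample text, not taken from the assignment
text = "NLP is fun. Tokenization splits text into units!"
sentences = split_sentences(text)
tokens = [tokenize_words(s) for s in sentences]

for sentence, words in zip(sentences, tokens):
    print(sentence, "->", words)
```

Here punctuation marks are kept as separate tokens; whether to keep or drop them is a design choice that depends on the downstream task.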

3) Define N-grams and provide references from peer-reviewed articles (2)
An n-gram is a contiguous sequence of n items
(characters or words) from text. In language modeling,
an n-gram model predicts the probability of a token
given the previous n−1 tokens (e.g., a bigram uses 1
previous token, a trigram uses 2). Peer-reviewed or
authoritative references that define and use n-grams
include Jurafsky & Martin (introduction to N-gram
language models) and review material in ScienceDirect
and SAGE journals on n-gram models and text mining.
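
The definition above can be sketched in a few lines of Python. The helper `ngrams` and the sample sentence are assumptions for illustration only; the sketch extracts word and character n-grams and shows how a trigram splits into a context of n−1 tokens and the token to be predicted.

```python
def ngrams(items, n):
    """Return all contiguous sequences of n items as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Hypothetical sample sentence, not taken from the assignment
words = "the cat sat on the mat".split()

word_bigrams = ngrams(words, 2)
word_trigrams = ngrams(words, 3)

# Character n-grams work the same way on a sequence of characters
char_trigrams = ["".join(g) for g in ngrams(list("corpus"), 3)]

# For a trigram model, the first two tokens are the context
# and the last token is the one being predicted
context_target = [(g[:-1], g[-1]) for g in word_trigrams]

print(word_bigrams)
print(char_trigrams)
print(context_target[0])
```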
4) Data sparseness in N-grams, define smoothing
and name two smoothing algorithms (7)
- Data sparseness problem (for N-gram models): As
n grows, the number of possible n-grams explodes
(V^n for a vocabulary of size V). Many valid n-grams
will have zero or very low counts in any finite corpus.
Under straightforward Maximum Likelihood Estimation
(MLE) this leads to zero-probability estimates, causing
a model to assign zero probability to plausible events
(poor generalization).
- Smoothing (definition): Smoothing refers to
techniques that adjust observed frequency counts
or probabilities so that unseen or rare n-grams
receive non-zero probability mass and so that
probability mass is redistributed from seen to
unseen events in a principled way.
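
To make the sparseness problem and the effect of smoothing concrete, here is a minimal Python sketch on a toy corpus. The corpus, the function names `p_mle` and `p_add_one`, and the choice of add-one (Laplace) smoothing are assumptions for illustration; add-one is only one simple technique, used here because it is easy to verify by hand.

```python
from collections import Counter

# Hypothetical toy corpus, not taken from the assignment
tokens = "the dog chased the cat and the cat chased the mouse".split()
vocab = sorted(set(tokens))
V = len(vocab)

bigram_counts = Counter(zip(tokens, tokens[1:]))
history_counts = Counter(tokens[:-1])  # how often each word occurs as a history

# Sparseness: compare distinct bigrams seen with the V^2 bigrams that are possible
possible_bigrams = V ** 2
observed_bigrams = len(bigram_counts)

def p_mle(prev, word):
    """Unsmoothed estimate: count(prev, word) / count(prev)."""
    total = history_counts[prev]
    return bigram_counts[(prev, word)] / total if total else 0.0

def p_add_one(prev, word):
    """Add-one (Laplace) estimate: (count(prev, word) + 1) / (count(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + V)

print(f"{observed_bigrams} of {possible_bigrams} possible bigrams observed")
print("MLE     P(cat | dog) =", p_mle("dog", "cat"))
print("Add-one P(cat | dog) =", p_add_one("dog", "cat"))
```

The bigram "dog cat" never occurs in the toy corpus, so the MLE estimate is zero, while the add-one estimate gives it a small non-zero probability. The smoothed probabilities for a given history still sum to one over the vocabulary.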
