Exam (elaborations)

COS4861 Assignment 3 due 10 September 2025

Pages
12
Grade
A+
Uploaded on
06-09-2025
Written in
2025/2026

COS4861 Assignment 3 2025 - Due 10 September 2025; 100% TRUSTED workings with detailed Answers for A+ Grade.











Document information

Type
Exam (elaborations)
Contains
Questions & answers

Subjects

Content preview

COS4861
ASSIGNMENT 3 2025

UNIQUE NO.
DUE DATE: 10 SEPTEMBER 2025


Question 1 [12 points] – Theory

1. What is a corpus and how does it differ from other data types? (2)

A corpus is a large, structured, machine-readable collection of texts that is
systematically compiled for linguistic or natural language processing (NLP) research
(Meyer, 2021). Unlike ordinary datasets (e.g., spreadsheets or numerical data), a
corpus contains raw or annotated natural language data, which makes it possible to
analyse patterns in language use.

2. What is the technical term for splitting a corpus into different linguistic units
such as paragraphs, sentences, and words in NLP? (1)

The process is called tokenization (Jurafsky & Martin, 2023). Splitting text into
sentences specifically is also known as sentence segmentation.
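
As a rough illustration of tokenization, the sketch below splits a short text into sentences and then into words using only regular expressions. The helper names `split_sentences` and `split_words` and the splitting rules are assumptions made for this example; they are deliberately naive (abbreviations such as "e.g." would be split incorrectly), and real NLP toolkits use more careful rules.

```python
import re

def split_sentences(text):
    # Naive rule (an assumption): a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    # Words are runs of letters, digits, or apostrophes; punctuation is dropped.
    return re.findall(r"[A-Za-z0-9']+", sentence)

text = "Data is noisy. Models must cope!"
sentences = split_sentences(text)
tokens = [split_words(s) for s in sentences]

print(sentences)  # ['Data is noisy.', 'Models must cope!']
print(tokens)     # [['Data', 'is', 'noisy'], ['Models', 'must', 'cope']]
```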

3. Define N-grams and provide references. (2)

An N-gram is a contiguous sequence of N items (characters, syllables, or words) from
a given text or speech sample (Manning & Schütze, 1999). For example, in the
sentence “data is noisy”:

• Unigrams = [“data”, “is”, “noisy”]
• Bigrams = [“data is”, “is noisy”]
• Trigrams = [“data is noisy”]
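
The example above can be reproduced with a short sliding-window function. This is a minimal sketch; the function name `ngrams` is chosen for illustration, and whitespace splitting stands in for a proper tokenizer.

```python
def ngrams(tokens, n):
    # Slide a window of width n over the token list and join each window.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "data is noisy".split()

print(ngrams(tokens, 1))  # ['data', 'is', 'noisy']
print(ngrams(tokens, 2))  # ['data is', 'is noisy']
print(ngrams(tokens, 3))  # ['data is noisy']
```

A sentence of k tokens yields k - N + 1 N-grams, so asking for N larger than the sentence length returns an empty list.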

4. Describe the problem of data sparseness with regard to an N-gram model.
Explain smoothing and name two algorithms. (7)

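• Data sparseness problem: In N-gram models, many possible word combinations
never appear in the training corpus, so valid but unseen sequences receive a
probability of zero. Because the probability of a sentence is a product of its
N-gram probabilities, a single unseen N-gram makes the whole sentence impossible
under the model, which weakens generalization (Jurafsky & Martin, 2023).

• Smoothing: A statistical technique that adjusts raw frequency counts, shifting a
small amount of probability mass from seen events to unseen ones, to avoid
assigning zero probability to unseen events.

• Two smoothing algorithms: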
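
The preview ends before the two algorithms are listed, so the sketch below is not taken from the original answer. It illustrates the sparseness problem on a toy corpus and then applies add-one (Laplace) smoothing, one widely used technique, chosen here purely as an example. The corpus and the names `mle_prob` and `add_one_prob` are assumptions made for this illustration.

```python
from collections import Counter

# Toy training corpus (an assumption for illustration only).
corpus = "data is noisy . data is sparse .".split()
vocab = sorted(set(corpus))
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def mle_prob(prev, word):
    # Unsmoothed estimate: count(prev, word) / count(prev).
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def add_one_prob(prev, word):
    # Add-one (Laplace): add 1 to every bigram count and |V| to the denominator.
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + len(vocab))

print(mle_prob("is", "noisy"))        # 0.5
print(mle_prob("noisy", "data"))      # 0.0, an unseen but plausible bigram
print(add_one_prob("noisy", "data"))  # 1/6, small but no longer zero
```

The unseen bigram "noisy data" gets zero probability under the raw counts and a small non-zero probability after smoothing, at the cost of lowering the estimates for seen bigrams such as "is noisy" (from 1/2 to 2/7 here).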
