Exam (elaborations)

COS4861_Assignment_3_EXPERTLY_DETAILED_ANSWERS_DUE_10_September


Document information

Uploaded on
26 August 2025
Number of pages
26
Written in
2025/2026
Type
Exam (elaborations)
Contains
Questions and answers


Preview of the content

COS4861 – Assignment 3 (2025)
Unique No:

Working Towards Encoding Systems in NLP

Due: 10 September 2025

Question 1 — Theory (12)

1.1 What is a corpus, and how does it differ from other data types? (2) A corpus is
a structured and purposefully compiled collection of natural language material—either
written texts or transcribed speech—used for linguistic and NLP research. Unlike
ordinary datasets such as numerical tables or sensor logs, a corpus retains essential
linguistic features including tokens, sentences, documents, genres, and metadata (e.g.,
author, date, and register). This enables the study of language-specific phenomena
such as vocabulary distribution, syntax, and semantic patterns. For this assignment, the
dataset provided was a small English corpus on smoothing algorithms.
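As a concrete illustration, a corpus entry can be represented as raw text paired with the kind of metadata listed above; the following sketch is hypothetical, with invented field names and values.

```python
# A minimal sketch of a corpus entry: raw text plus the kind of metadata
# (author, date, register, genre) that distinguishes a corpus from an
# ordinary dataset. Field names and values are invented for illustration.
corpus = [
    {
        "text": "Smoothing reallocates probability mass to unseen N-grams.",
        "author": "unknown",
        "date": "2025",
        "register": "academic",
        "genre": "technical note",
    },
]

for doc in corpus:
    tokens = doc["text"].split()  # naive whitespace tokenization
    print(f"{len(tokens)} tokens, register={doc['register']}")
```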

1.2 Technical term for splitting a corpus into paragraphs/sentences/words (1) This
process is called tokenization (for words) and sentence segmentation (for boundaries
between sentences). Together, they are key preprocessing steps in NLP.
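To illustrate both steps, here is a minimal sketch using NLTK (assuming nltk is installed and the punkt tokenizer models have been downloaded); the sample text is invented.

```python
# Sentence segmentation followed by word tokenization with NLTK.
# Assumes: pip install nltk, then nltk.download("punkt") has been run.
import nltk

text = "A corpus is split into sentences. Each sentence is split into tokens."

sentences = nltk.sent_tokenize(text)                 # sentence segmentation
tokens = [nltk.word_tokenize(s) for s in sentences]  # word tokenization

print(sentences)
print(tokens)
```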

1.3 Define N-grams and give peer-reviewed references (2) An N-gram is a contiguous sequence of N items (words, subwords, or characters) taken from a text. N-gram language models estimate conditional probabilities of the form:

$P(w_i \mid w_{i-N+1}^{i-1})$

Early influential work includes Brown et al. (1992), which introduced class-based N-gram models, and subsequent comparative studies such as Chen & Goodman (1999), which evaluated different smoothing strategies and demonstrated the importance of N-grams in statistical language modelling.
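As a sketch of how such probabilities are estimated in practice, the following pure-Python example extracts bigrams from an invented toy corpus and computes the maximum likelihood conditional probability; it is illustrative, not a prescribed implementation.

```python
# Extract bigrams (N = 2) and estimate P(w_i | w_{i-1}) by maximum
# likelihood. The toy corpus is invented for illustration.
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()
N = 2

ngrams = [tuple(tokens[i:i + N]) for i in range(len(tokens) - N + 1)]
ngram_counts = Counter(ngrams)
context_counts = Counter(ng[:-1] for ng in ngrams)

def p_mle(word, context):
    """P(word | context) = C(context, word) / C(context)."""
    return ngram_counts[context + (word,)] / context_counts[context]

print(p_mle("cat", ("the",)))  # 2/3: 'the' is followed by 'cat' twice, 'mat' once
```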

1.4 Data sparseness in N-gram models; what is smoothing? Name two algorithms (7) Because language has an extremely large combinatorial space, many valid N-grams never appear in a given training set. Maximum likelihood estimation (MLE) assigns zero probability to such unseen sequences and disproportionately favours frequent ones, creating a data sparsity problem that weakens predictive power.

Smoothing addresses this by reallocating some probability mass from observed N-grams to unseen ones, ensuring more robust generalisation.

Two important smoothing algorithms are:

- Katz Back-off: reduces counts for observed N-grams and falls back to lower-order distributions when higher-order contexts are sparse.
- Modified Kneser–Ney: employs absolute discounting combined with continuation probabilities, and is widely regarded as one of the most effective smoothing methods in practice.

Another well-known approach is Good-Turing discounting, which adjusts low-frequency counts to better estimate probabilities for unseen events.
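To make the reallocation of probability mass concrete, here is a minimal sketch of absolute discounting with a back-off to the unigram distribution: a simplified relative of Katz back-off and Kneser-Ney, not a full implementation of either. The corpus and discount value are invented.

```python
# Absolute discounting with a unigram back-off: subtract a fixed discount d
# from every observed bigram count, then give the freed probability mass to
# unseen continuations in proportion to their unigram frequency.
from collections import Counter

tokens = "the cat sat on the mat the cat ran".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)
vocab = set(tokens)
d = 0.75  # fixed discount, a typical value

def p_backoff(word, prev):
    context_total = sum(c for (w1, _), c in bigram_counts.items() if w1 == prev)
    seen = {w2 for (w1, w2) in bigram_counts if w1 == prev}
    if word in seen:
        return (bigram_counts[(prev, word)] - d) / context_total
    # mass freed by discounting, spread over unseen words by unigram frequency
    reserved = d * len(seen) / context_total
    unseen_total = sum(unigram_counts[w] for w in vocab - seen)
    return reserved * unigram_counts[word] / unseen_total

print(sum(p_backoff(w, "the") for w in vocab))  # ~1.0: mass is conserved
```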

Question 2 — Applications & Code Concepts (13)

2.1 How MLE causes data sparseness issues in unsmoothed N-grams (3) Under
MLE, probabilities are calculated as:

$\hat{P}(w_i \mid h) = \frac{C(h, w_i)}{C(h)}$

where $C(h, w_i)$ is the joint count of the history $h$ and word $w_i$. If a plausible word never appears in the training corpus ($C(h, w_i) = 0$), its probability becomes zero. Since natural language has a long tail of rare events, this leads to many zero-probability cases, making predictions unreliable and increasing perplexity.
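A short sketch of this failure mode, on an invented toy corpus:

```python
# Unsmoothed MLE gives zero probability to any bigram absent from
# training, even when the continuation is plausible. Corpus is invented.
from collections import Counter

tokens = "the cat sat on the mat".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])

def p_mle(word, prev):
    return bigram_counts[(prev, word)] / context_counts[prev]

print(p_mle("mat", "the"))  # 0.5: seen in training
print(p_mle("sat", "the"))  # 0.0: plausible but unseen, so MLE assigns zero
```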

2.2 Is Laplace (add-one) smoothing good enough for modern N-gram models?
Explain how it works and its effect (4) Laplace smoothing adjusts MLE by adding one
to every count:

$\hat{P}(w_i \mid h) = \frac{C(h, w_i) + 1}{C(h) + V}$
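Here $V$ is the vocabulary size. A minimal sketch on the same invented toy corpus as above, showing how add-one smoothing removes the zeros from the MLE estimate:

```python
# Laplace (add-one) smoothing: add 1 to every bigram count and V to the
# context count, so every continuation gets non-zero probability.
from collections import Counter

tokens = "the cat sat on the mat".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])
V = len(set(tokens))  # vocabulary size (5 here)

def p_laplace(word, prev):
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + V)

print(p_laplace("mat", "the"))  # (1 + 1) / (2 + 5) ≈ 0.286, down from MLE's 0.5
print(p_laplace("sat", "the"))  # (0 + 1) / (2 + 5) ≈ 0.143, no longer zero
```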
