Summary: Natural Language Processing (NLP)

Summary of the course Natural Language Processing, taught at Maastricht University.

Pages: 23
Uploaded on: 23-10-2021
Written in: 2018/2019
Natural Language Processing Famke Nouwens

Week 1
NLP – algorithms to process, analyse and understand the structure and meaning of texts in a natural
language. Key aspects are: information extraction and retrieval, text classification, sentiment
extraction, et cetera.
However, interpreting language is hard because of:
- Ambiguity: sentences/words can have multiple meanings
- Compositionality: certain expressions need to be composed in a set way (e.g. gin & tonic
rather than tonic & gin)
- Co-reference: to which object/person does a pronoun refer
- Non-standard language: symbols, abbreviations and acronyms
- Neologisms: newly coined words or expressions (e.g. bromance, retweet)
Morphology
Every NLP task needs to do text normalization:
- Tokenizing words (separating the individual words/symbols in sentences)
- Normalizing word formats
- Segmenting sentences
Lemma – words that share the same stem (the base of the word) and roughly the same word sense (e.g. cat – cats,
be – is)
Type – an element of the vocabulary (distinct words, distinct symbols et cetera). |V| is the size of
the vocabulary.
Token – an instance of a type in running text (an occurrence of that type). N is the
number of tokens.
The number of tokens is always at least the number of types, and the type/token ratio is a measure of
complexity.
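The type/token counts above can be computed in a few lines; the toy sentence below is a made-up example, not from the course material:

```python
# Count types and tokens in a toy corpus and compute the type/token ratio.
text = "the cat sat on the mat because the cat was tired"

tokens = text.split()   # every word occurrence is a token
types = set(tokens)     # the vocabulary: each distinct word once

N = len(tokens)         # number of tokens
V = len(types)          # |V|, the size of the vocabulary

print(N, V, V / N)      # N >= |V| always holds
```

Here N = 11 and |V| = 8, because "the" occurs three times and "cat" twice.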
Tokenization
Compound splitter – splits a compound word into its component words (useful in German)
Often, before we can start tokenizing, we first need to pre-process the data:
- Normalization
• Text and query terms must have the same form (e.g. USA must match U.S.A.)
• Define equivalence classes (e.g. terms may not contain a .)
- Case folding: reduce all letters to lower case. Disadvantage: we lose information
- Remove spaces in compound words or collocations (e.g. New York → NewYork)
- Lemmatization: change every word to its base form (e.g. am/is/are → be). Preferred over
stemming with small training sets.
- Morphology: reduce words to their morphemes (the smallest meaningful units of a word, e.g. walked =
walk + ed)
- Stemming: reduce terms to their stem (chop off the affixes). It is aggressive because it can
chop off too much, so that the word loses its meaning.
Different ways to form a word:
- Inflection: modifying a word with an affix to change its grammatical function (book → books)
- Derivation: adding an affix to a stem to create a new word (great → greatly)
- Compounding: combining two stems (key+board = keyboard, lawsuit, bookcase)

Another problem with tokenization is deciding whether punctuation marks the end of a sentence. For
! or ? this is fairly easy, but for the . it can be difficult, since it is also often used in abbreviations. To solve this,
a decision tree or binary classifier can be implemented.
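A minimal rule-based sketch of this period disambiguation (the abbreviation list and the function name are illustrative assumptions, not from the course):

```python
import re

# Treat "." as a sentence boundary unless it directly follows a known
# abbreviation; ! and ? always end a sentence.
ABBREVIATIONS = {"dr", "mr", "mrs", "etc", "e.g"}

def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]", text):
        end = match.end()
        if match.group() == ".":
            # word immediately before the period, lower-cased
            words = text[start:end - 1].split()
            if words and words[-1].lower() in ABBREVIATIONS:
                continue  # abbreviation: not a sentence boundary
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived. Did he stay? Yes!"))
# → ['Dr. Smith arrived.', 'Did he stay?', 'Yes!']
```

A trained binary classifier would replace the fixed abbreviation list with learned features (case of the following word, word length, et cetera).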
Basic text processing
Regular expressions – a formal language for specifying text strings.
Special symbols in regular expressions:
Symbol    Meaning
[…]       any one of the characters inside can be matched, or a range […-…]
[^…]      negation of whatever comes after the ^
…?        whatever comes before ? is optional (0 or 1 occurrences)
…*        0 or more occurrences of whatever comes before *
…+        1 or more occurrences of whatever comes before +
.         any single character can appear at the position of .
^…        start of string/line
…$        end of string/line
\.$       the string/line must end with the escaped character after \ (here a literal .)
…|…       disjunction: either alternative can match
…[…]…     whatever is outside the brackets must occur literally in the string/line
\b…\b     word boundary: no other word characters directly before or after whatever is inside


Regular expressions are often used as features in machine learning classifiers and for
capturing generalizations. They're often the first model for any text processing task.
False positive error (Type I): matching strings that we should not have matched. This is reduced by
increasing accuracy or precision.
False negative error (Type II): not matching things that we should have matched. This is reduced by
increasing coverage or recall.
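The symbols from the table can be tried out with Python's re module; the patterns below are illustrative examples:

```python
import re

# One example per special symbol from the table above.
assert re.search(r"[Tt]he", "The end")        # character set: T or t
assert re.search(r"[^0-9]", "abc")            # negation: any non-digit
assert re.search(r"colou?r", "color")         # ?: the "u" is optional
assert re.search(r"ab*c", "ac")               # *: zero or more b's
assert re.search(r"ab+c", "abbc")             # +: one or more b's
assert re.search(r"^The", "The start")        # ^: anchored at line start
assert re.search(r"end\.$", "the end.")       # \.$: line must end with a literal .
assert re.search(r"cat|dog", "hotdog")        # |: disjunction
assert re.search(r"\bthe\b", "in the box")    # \b: word boundaries around "the"
print("all patterns matched")
```

Note how `\bthe\b` avoids the false positives ("other", "theology") that a plain `the` pattern would produce.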
Similarity of strings
Minimum edit distance – minimum number of editing operations (insertion, deletion, substitution) to
change one string into the other. Levenshtein distance – substitution has a cost of 2.
To find the minimum edit distance we search for a path (sequence of edits) from the start string to
the final string:
- Initial state: word we’re transforming
- Operators: insert, delete, substitute
- Goal state: word we’re trying to get
- Path cost: the number of edits (that we try to minimize)
For string x of length n and string y of length m, D(i, j) is the edit distance between X[1…i] and Y[1…j].
The edit distance between X and Y is thus D(n, m).
Dynamic programming – a tabular computation of D(n, m) that solves problems by combining
solutions to subproblems. It’s a bottom-up algorithm that computes D(i, j) for small i, j and computes
a larger D(i, j) based on previously computed smaller values.
DP for Levenshtein minimum edit distance:
- Initialization:
D(i, 0) = i
D(0, j) = j

- Recurrence relation:
For each i = 1…N:
For each j = 1…M:
D(i, j) = min of:
D(i − 1, j) + 1                               (deletion)
D(i, j − 1) + 1                               (insertion)
D(i − 1, j − 1) + 2 if X(i) ≠ Y(j), else + 0  (substitution)
ptr(i, j) = LEFT, DOWN or DIAGONAL, depending on which case gave the minimum
- Termination:
D(N, M) is the distance
The edit distance alone is not sufficient, since each character often also needs to be aligned
to the other string. This is done by keeping a backtrace (the ptr term in the formula): every
time we enter a cell of the edit distance table, we remember where we came from, and once we
reach the end we trace the path back from the upper-right corner to read off the alignment.
Performance of minimum edit distance: O(nm) time and space; performance of the backtrace: O(n+m).
An optimal alignment is composed of optimal subalignments.
Weighted edit distance – weights are added to computation. This is useful for spell correction (some
letters are more likely to be mistyped) and biology (certain kinds of deletions or insertions are more
likely). So instead of the + 1 and + 2/0, we have the weight of the insertion, deletion and substitution.
Language Models
Probabilistic language models compute the probability of a sentence or a sequence of words,
P(w1, w2, …, wn):
- Chain rule: P(w1 w2 … wn) = ∏i P(wi | w1 w2 … wi−1)
- This follows from P(B|A) = P(A, B) / P(A) ⟺ P(A, B) = P(A) P(B|A)

Markov assumption – we can approximate the probability of a sentence using only the conditional
probability of the last few words: P(w1 w2 … wn) ≈ ∏i P(wi | wi−k … wi−1). Example: P(the|water
is so transparent that) ≈ P(the|that). Multiple models:
- Unigram model: P(w1 w2 … wn) ≈ ∏i P(wi)
- Bigram model: P(w1 w2 … wn) ≈ ∏i P(wi | wi−1)
- Trigram model: P(w1 w2 … wn) ≈ ∏i P(wi | wi−2, wi−1)
- Where P(wi | wi−1) = count(wi−1, wi) / count(wi−1) (the maximum likelihood estimate)
- Transform probabilities to log space to avoid underflow: log(p1 × p2) = log(p1) + log(p2).
N-gram models are insufficient, because language has long-distance dependencies. Also, they only
work well for word prediction if the test corpus looks like the training corpus.
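A minimal bigram model with maximum likelihood estimates, on a made-up toy corpus (sentence boundaries marked with <s> and </s>), using log space for the product:

```python
import math
from collections import Counter

# Toy corpus, made up for illustration.
corpus = [["<s>", "i", "am", "sam", "</s>"],
          ["<s>", "sam", "i", "am", "</s>"],
          ["<s>", "i", "do", "not", "like", "green", "eggs", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus
                  for i in range(len(sent) - 1))

def p(w, prev):
    # MLE: P(w | prev) = count(prev, w) / count(prev)
    return bigrams[(prev, w)] / unigrams[prev]

# Probability of "<s> i am </s>" via the chain of bigram probabilities,
# summed in log space to avoid underflow.
sent = ["<s>", "i", "am", "</s>"]
log_p = sum(math.log(p(sent[i + 1], sent[i])) for i in range(len(sent) - 1))
print(math.exp(log_p))  # → 2/3 * 2/3 * 1/2 = 2/9 ≈ 0.222
```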
Smoothing – a method used when the test set is not (as) similar to the training set: bigrams that
never occur in training get probability 0, which zeroes out the whole sentence probability (and an
unseen history word even makes the MLE undefined, because we can't divide by 0)
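One common way to do this is add-one (Laplace) smoothing, sketched below; add-one is not named here in the notes, and the toy counts are illustrative assumptions:

```python
from collections import Counter

# Add-one (Laplace) smoothing: every bigram count is incremented by 1, so
# unseen bigrams get a small nonzero probability instead of 0.
unigram_counts = Counter({"the": 3, "cat": 2, "sat": 1})
bigram_counts = Counter({("the", "cat"): 2, ("cat", "sat"): 1})
V = len(unigram_counts)  # vocabulary size

def p_laplace(w, prev):
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

print(p_laplace("cat", "the"))  # seen bigram:   (2+1)/(3+3) = 0.5
print(p_laplace("sat", "the"))  # unseen bigram: (0+1)/(3+3) ≈ 0.167
```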
