Summary of the course Natural Language Processing, taught at Maastricht University
Written in 2018/2019 · 23 pages · uploaded 23-10-2021
Natural Language Processing – Famke Nouwens

Week 1
NLP – algorithms to process, analyse and understand the structure and meaning of texts in a natural
language. Key aspects are: information extraction and retrieval, text classification, sentiment
extraction, et cetera.
However, interpreting language is hard because of:
- Ambiguity: sentences/words can have multiple meanings
- Compositionality: certain expressions need to be composed in a set way (eg gin & tonic
instead of tonic & gin).
- Co-reference: to which object/person does the pronoun belong
- Non-standard language: symbols, abbreviations and acronyms
- Neologisms: newly coined words or expressions (eg bromance/retweet)
Morphology
Every NLP task needs to do text normalization:
- Tokenizing words (separating the individual words/symbols in sentences)
- Normalizing word formats
- Segmenting sentences
Lemma – words that share the same stem (base form of the word) and roughly the same word sense (eg cat –
cats, be – is)
Type – an element of the vocabulary (different words, different symbols et cetera). |V| is the size of
the vocabulary.
Token – an instance of that type in running text (the number of occurrences of that type). N is the
number of tokens.
The number of tokens is always at least the number of types, and the type/token ratio is a measure of
complexity.
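As a quick illustration of types vs. tokens, a minimal sketch in Python (whitespace tokenization only; the example sentence is made up):

```python
# Sketch: counting types and tokens in a toy corpus. Real tokenizers
# also handle punctuation and clitics; split() is enough to show the idea.
text = "the cat sat on the mat because the cat was tired"
tokens = text.split()          # every occurrence counts as a token
types = set(tokens)            # each distinct word form is a type

N = len(tokens)                # number of tokens
V = len(types)                 # |V|, size of the vocabulary
print(N, V, V / N)             # N >= V always; V/N is the type/token ratio
```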
Tokenization
Compound splitter – splits a single word into main words (useful in German)
Often before we can start tokenizing, we first need to pre-process the data:
- Normalization
• Text and query terms must have the same form (eg. USA must match U.S.A.)
• Define equivalence classes (eg. Terms may not have a . in them)
- Case folding: reduce all letters to lower case. Disadvantage: lose information
- Remove spaces for compound words or collocations (eg New York → NewYork)
- Lemmatization: change every word to its base form (eg. Am/is/are → be). Preferred to
stemming with small training sets.
- Morphology: reduce words to their morphemes (the smallest meaningful units of a word, eg. walked =
walk + ed).
- Stemming: reduce terms to their stem (chop off the affixes). It is aggressive because it can
chop off too much, so that the word loses its meaning.
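Why stemming counts as "aggressive" can be shown with a deliberately crude suffix-chopper (a toy sketch, not a real stemmer such as Porter's):

```python
# Toy stemmer: chop a few common suffixes. Illustrative only - it
# over-chops exactly the way the notes warn about.
def crude_stem(word):
    for suffix in ("ing", "ed", "s"):
        # only strip if a reasonably long stem remains
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(crude_stem("walked"))   # 'walk'  - fine
print(crude_stem("running"))  # 'runn'  - over-chopped, meaning damaged
```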
Different ways to form a word:
- Inflection: modifying a word with an affix to change its grammatical function (book → books)
- Derivation: adding an affix to a stem to create a new word (great → greatly)
- Compounding: combining two stems (key+board = keyboard, lawsuit, bookcase)

Another problem with tokenization is deciding if punctuation means the end of a sentence or not. For
! or ? this is fairly easy, but for the . it can be difficult → it is often used in abbreviations. To solve this,
a decision tree or binary classifier can be implemented.
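The abbreviation problem can be sketched as a tiny rule-based classifier (a stand-in for the decision tree mentioned above; the abbreviation list and heuristics are made up for illustration):

```python
# Toy "does this '.' end a sentence?" decision. A real system would
# learn these rules as a decision tree or binary classifier.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def is_sentence_end(token, next_token=None):
    """Return True if a token ending in '.' likely closes a sentence."""
    if not token.endswith("."):
        return False
    if token.lower() in ABBREVIATIONS:
        return False                     # known abbreviation, not a boundary
    # A following capitalised word (or no next word) suggests a boundary.
    return next_token is None or next_token[0].isupper()

print(is_sentence_end("Dr.", "Smith"))       # False: abbreviation
print(is_sentence_end("yesterday.", "He"))   # True: sentence boundary
```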
Basic text processing
Regular expressions – a formal language for specifying text strings.
Special symbols in regular expressions:
Symbol   Meaning
[…]      Matches one of the letters/digits inside the brackets, or one in the range […-…]
[^…]     Negation: matches anything except what comes after the ^
…?       Whatever comes before ? is optional (0 or 1 occurrences)
…*       0 or more occurrences of whatever comes before *
…+       1 or more occurrences of whatever comes before +
.        Matches any single character at that position
^…       Start of string/line
…$       End of string/line
\.$      The string/line must end with a literal . (the \ escapes its special meaning)
…|…      Disjunction: matches either alternative
…[…]…    The parts outside the brackets must occur literally, with one bracketed character in between
\b…\b    Word boundary: the match may not be preceded or followed by other word characters
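A few of the patterns from the table, tried out with Python's re module (sample strings are made up):

```python
import re

# Each assertion checks that the pattern matches (or fails to match)
# the sample string, mirroring a row of the table above.
assert re.search(r"[Tt]he", "The end")            # character class
assert re.search(r"colou?r", "color")             # optional 'u'
assert re.search(r"ba+!", "baaa!")                # one or more 'a'
assert re.search(r"^\d+$", "2021")                # whole line is digits
assert re.search(r"\.$", "End of sentence.")      # literal '.' at the end
assert re.search(r"cat|dog", "hot dog")           # disjunction
assert re.search(r"\bcat\b", "the cat sat")       # 'cat' as a whole word
assert not re.search(r"\bcat\b", "concatenate")   # not inside another word
print("all patterns behaved as expected")
```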


Regular expressions are often used as features in machine learning classifiers. They can also be used
to capture generalizations, and they are often the first model for any text processing task.
False positive error (Type I): matching strings that we should not have matched. This is reduced by
increasing accuracy or precision.
False negative error (Type II): not matching things that we should have matched. This is reduced by
increasing coverage or recall.
Similarity of strings
Minimum edit distance – minimum number of editing operations (insertion, deletion, substitution) to
change one string into the other. Levenshtein distance – substitution has a cost of 2.
To find the minimum edit distance we search for a path (sequence of edits) from the start string to
the final string:
- Initial state: word we’re transforming
- Operators: insert, delete, substitute
- Goal state: word we’re trying to get
- Path cost: the number of edits (that we try to minimize)
For a string X of length N and a string Y of length M, D(i, j) is the edit distance between X[1…i] and Y[1…j].
The edit distance between X and Y is thus D(N, M).
Dynamic programming – a tabular computation of D(n, m) that solves problems by combining
solutions to subproblems. It’s a bottom-up algorithm that computes D(i, j) for small i, j and computes
a larger D(i, j) based on previously computed smaller values.
DP for Levenshtein minimum edit distance:
- Initialization:
  D(i, 0) = i
  D(0, j) = j
- Recurrence relation:
  For each i = 1…N:
    For each j = 1…M:
      D(i, j) = min of
        D(i − 1, j) + 1                                           (deletion)
        D(i, j − 1) + 1                                           (insertion)
        D(i − 1, j − 1) + 2 if X(i) ≠ Y(j), + 0 if X(i) = Y(j)    (substitution/match)
      ptr(i, j) = LEFT, DOWN or DIAGONAL, pointing to the cell that produced the minimum
- Termination:
  D(N, M) is the distance
[Figure: the edit distance table]
The edit distance alone is not sufficient, since each character often needs to be aligned to the
corresponding character in the other string.
This is done by keeping a backtrace (the ptr part of the formula): every time we enter a cell, we
remember where we came from, and once we reach the end we trace back the path from the final cell
D(N, M) to read off the alignment.
Performance of minimum edit distance: O(N·M) time and space. Performance of the backtrace: O(N + M).
An optimal alignment is composed of optimal subalignments.
Weighted edit distance – weights are added to computation. This is useful for spell correction (some
letters are more likely to be mistyped) and biology (certain kinds of deletions or insertions are more
likely). So instead of the + 1 and + 2/0, we have the weight of the insertion, deletion and substitution.
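The DP above can be sketched as a minimal implementation; sub_cost = 2 gives Levenshtein distance as defined here, and sub_cost = 1 (or per-character weights) gives the plain or weighted variants:

```python
# Minimum edit distance via the tabular DP from the notes.
def min_edit_distance(x, y, sub_cost=2):
    n, m = len(x), len(y)
    # D[i][j] = edit distance between x[:i] and y[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                      # initialization: i deletions
    for j in range(1, m + 1):
        D[0][j] = j                      # initialization: j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = x[i - 1] == y[j - 1]
            D[i][j] = min(
                D[i - 1][j] + 1,                           # deletion
                D[i][j - 1] + 1,                           # insertion
                D[i - 1][j - 1] + (0 if same else sub_cost),  # substitution/match
            )
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # 8 with cost-2 substitutions
```

A full backtrace would additionally store, per cell, which of the three options was taken, then walk the pointers back from D(N, M).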
Language Models
Probabilistic language models compute a probability of a sentence or a sequence of words. Usually
it's P(w1, w2, …, wn−1, wn):
- Chain rule: P(w1 w2 … wn) = ∏i P(wi | w1 w2 … wi−1)
- Conditional probability: P(B | A) = P(A, B) / P(A) ⟺ P(A, B) = P(A) P(B | A)

Markov assumption – we can approximate the probability of a sentence using only the conditional
probability of the last few words: P(w1 w2 … wn) ≈ ∏i P(wi | wi−k … wi−1). Example: P(the | water
is so transparent that) ≈ P(the | that). Multiple models:
- Unigram model: P(w1 w2 … wn) ≈ ∏i P(wi)
- Bigram model: P(w1 w2 … wn) ≈ ∏i P(wi | wi−1)
- Trigram model: P(w1 w2 … wn) ≈ ∏i P(wi | wi−2, wi−1)
- Where P(wi | wi−1) = count(wi−1, wi) / count(wi−1) (the maximum likelihood estimate)
- Transform probabilities to log space to avoid underflow: log(p1 × p2) = log(p1) + log(p2).
N-gram models are insufficient, because language has long-distance dependencies. Also, they only
work well for word prediction if the test corpus looks like the training corpus.
Smoothing – a method used when the test set is not (as) similar to the training set: bigrams that
never occurred in training get probability 0 (making the whole sentence probability 0), and unseen
contexts make the MLE divide by 0.
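The MLE bigram estimates and the log-space trick can be sketched on a toy corpus (the <s>/</s> sentence-boundary markers are an added convention, not fixed by the notes):

```python
import math
from collections import Counter

# Toy bigram model: MLE estimates plus log-space sentence scoring.
corpus = [
    "<s> i am sam </s>",
    "<s> sam i am </s>",
    "<s> i do not like green eggs and ham </s>",
]
unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_mle(w, prev):
    # P(w | prev) = count(prev, w) / count(prev)
    return bigrams[(prev, w)] / unigrams[prev]

def log_prob(sentence):
    words = sentence.split()
    # sum of logs instead of a product of probabilities: avoids underflow
    return sum(math.log(p_mle(w, prev)) for prev, w in zip(words, words[1:]))

print(p_mle("i", "<s>"))               # 2/3
print(log_prob("<s> i am sam </s>"))   # negative log probability
```

Note that log_prob crashes (log of 0) on any bigram unseen in training, which is exactly the problem smoothing addresses.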
