Summary of the course Natural Language Processing, taught at Maastricht University
Written in 2018/2019 · 23 pages · uploaded 23-10-2021
Natural Language Processing – Famke Nouwens

Week 1
NLP – algorithms to process, analyse and understand the structure and meaning of texts in a natural
language. Key aspects are: information extraction and retrieval, text classification, sentiment
extraction, et cetera.
However, interpreting language is hard because of:
- Ambiguity: sentences/words can have multiple meanings
- Compositionality: certain expressions need to be composed in a set way (eg gin & tonic
instead of tonic & gin).
- Co-reference: to which object/person does the pronoun belong
- Non-standard language: symbols, abbreviations and acronyms
- Neologisms: newly coined words or expressions (eg bromance/retweet)
Morphology
Every NLP task needs to do text normalization:
- Tokenizing words (separating the individual words/symbols in sentences)
- Normalizing word formats
- Segmenting sentences
Lemma – words that share the same stem (base form of the word) and roughly the same word sense (eg cat –
cats, be – is)
Type – an element of the vocabulary (different words, different symbols et cetera). |V| is the size of
the vocabulary.
Token – an instance of that type in running text (the number of occurrences of that type). N is the
number of tokens.
The number of tokens is always at least the number of types, and the type/token ratio is a measure of
complexity.
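As a quick illustration of types vs. tokens, a minimal sketch in Python (whitespace tokenization only; the example sentence is made up):

```python
# Sketch: counting types and tokens in a toy corpus. Real tokenizers
# also handle punctuation and clitics; split() is enough to show the idea.
text = "the cat sat on the mat because the cat was tired"
tokens = text.split()          # every occurrence counts as a token
types = set(tokens)            # each distinct word form is a type

N = len(tokens)                # number of tokens
V = len(types)                 # |V|, size of the vocabulary
print(N, V, V / N)             # N >= V always; V/N is the type/token ratio
```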
Tokenization
Compound splitter – splits a single word into main words (useful in German)
Often before we can start tokenizing, we first need to pre-process the data:
- Normalization
• Text and query terms must have the same form (eg. USA must match U.S.A.)
• Define equivalence classes (eg. Terms may not have a . in them)
- Case folding: reduce all letters to lower case. Disadvantage: lose information
- Remove spaces for compound words or collocations (eg New York → NewYork)
- Lemmatization: change every word to its base form (eg. Am/is/are → be). Preferred to
stemming with small training sets.
- Morphology: reduce words to their morphemes (the smallest meaningful units of a word, eg. walked =
walk + ed).
- Stemming: reduce terms to their stem (chop off the affixes). It is aggressive because it can
chop off too much, so that the word loses its meaning.
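Why stemming counts as "aggressive" can be shown with a deliberately crude suffix-chopper (a toy sketch, not a real stemmer such as Porter's):

```python
# Toy stemmer: chop a few common suffixes. Illustrative only - it
# over-chops exactly the way the notes warn about.
def crude_stem(word):
    for suffix in ("ing", "ed", "s"):
        # only strip if a reasonably long stem remains
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(crude_stem("walked"))   # 'walk'  - fine
print(crude_stem("running"))  # 'runn'  - over-chopped, meaning damaged
```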
Different ways to form a word:
- Inflection: modifying a word with an affix to change its grammatical function (book → books)
- Derivation: adding an affix to a stem to create a new word (great → greatly)
- Compounding: combining two stems (key+board = keyboard, lawsuit, bookcase)

Another problem with tokenization is deciding if punctuation means the end of a sentence or not. For
! or ? this is fairly easy, but for the . it can be difficult → it is often used in abbreviations. To solve this,
a decision tree or binary classifier can be implemented.
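The abbreviation problem can be sketched as a tiny rule-based classifier (a stand-in for the decision tree mentioned above; the abbreviation list and heuristics are made up for illustration):

```python
# Toy "does this '.' end a sentence?" decision. A real system would
# learn these rules as a decision tree or binary classifier.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def is_sentence_end(token, next_token=None):
    """Return True if a token ending in '.' likely closes a sentence."""
    if not token.endswith("."):
        return False
    if token.lower() in ABBREVIATIONS:
        return False                     # known abbreviation, not a boundary
    # A following capitalised word (or no next word) suggests a boundary.
    return next_token is None or next_token[0].isupper()

print(is_sentence_end("Dr.", "Smith"))       # False: abbreviation
print(is_sentence_end("yesterday.", "He"))   # True: sentence boundary
```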
Basic text processing
Regular expressions – a formal language for specifying text strings.
Special symbols in regular expressions:
Symbol   Meaning
[…]      Matches one of the letters/digits inside the brackets, or one in the range […-…]
[^…]     Negation: matches anything except what comes after the ^
…?       Whatever comes before ? is optional (0 or 1 occurrences)
…*       0 or more occurrences of whatever comes before *
…+       1 or more occurrences of whatever comes before +
.        Matches any single character at that position
^…       Start of string/line
…$       End of string/line
\.$      The string/line must end with a literal . (the \ escapes its special meaning)
…|…      Disjunction: matches either alternative
…[…]…    The parts outside the brackets must occur literally, with one bracketed character in between
\b…\b    Word boundary: the match may not be preceded or followed by other word characters
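A few of the patterns from the table, tried out with Python's re module (sample strings are made up):

```python
import re

# Each assertion checks that the pattern matches (or fails to match)
# the sample string, mirroring a row of the table above.
assert re.search(r"[Tt]he", "The end")            # character class
assert re.search(r"colou?r", "color")             # optional 'u'
assert re.search(r"ba+!", "baaa!")                # one or more 'a'
assert re.search(r"^\d+$", "2021")                # whole line is digits
assert re.search(r"\.$", "End of sentence.")      # literal '.' at the end
assert re.search(r"cat|dog", "hot dog")           # disjunction
assert re.search(r"\bcat\b", "the cat sat")       # 'cat' as a whole word
assert not re.search(r"\bcat\b", "concatenate")   # not inside another word
print("all patterns behaved as expected")
```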


Regular expressions are often used as features in machine learning classifiers. They can also be used
to capture generalizations, and they are often the first model for any text processing task.
False positive error (Type I): matching strings that we should not have matched. This is reduced by
increasing accuracy or precision.
False negative error (Type II): not matching things that we should have matched. This is reduced by
increasing coverage or recall.
Similarity of strings
Minimum edit distance – minimum number of editing operations (insertion, deletion, substitution) to
change one string into the other. Levenshtein distance – substitution has a cost of 2.
To find the minimum edit distance we search for a path (sequence of edits) from the start string to
the final string:
- Initial state: word we’re transforming
- Operators: insert, delete, substitute
- Goal state: word we’re trying to get
- Path cost: the number of edits (that we try to minimize)
For a string X of length N and a string Y of length M, D(i, j) is the edit distance between X[1…i] and Y[1…j].
The edit distance between X and Y is thus D(N, M).
Dynamic programming – a tabular computation of D(n, m) that solves problems by combining
solutions to subproblems. It’s a bottom-up algorithm that computes D(i, j) for small i, j and computes
a larger D(i, j) based on previously computed smaller values.
DP for Levenshtein minimum edit distance:
- Initialization:
  D(i, 0) = i
  D(0, j) = j
- Recurrence relation:
  For each i = 1…N:
    For each j = 1…M:
      D(i, j) = min of
        D(i − 1, j) + 1                                           (deletion)
        D(i, j − 1) + 1                                           (insertion)
        D(i − 1, j − 1) + 2 if X(i) ≠ Y(j), + 0 if X(i) = Y(j)    (substitution/match)
      ptr(i, j) = LEFT, DOWN or DIAGONAL, pointing to the cell that produced the minimum
- Termination:
  D(N, M) is the distance
[Figure: the edit distance table]
The edit distance alone is not sufficient, since each character often needs to be aligned to the
corresponding character in the other string.
This is done by keeping a backtrace (the ptr part of the formula): every time we enter a cell, we
remember where we came from, and once we reach the end we trace back the path from the final cell
D(N, M) to read off the alignment.
Performance of minimum edit distance: O(N·M) time and space. Performance of the backtrace: O(N + M).
An optimal alignment is composed of optimal subalignments.
Weighted edit distance – weights are added to computation. This is useful for spell correction (some
letters are more likely to be mistyped) and biology (certain kinds of deletions or insertions are more
likely). So instead of the + 1 and + 2/0, we have the weight of the insertion, deletion and substitution.
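The DP above can be sketched as a minimal implementation; sub_cost = 2 gives Levenshtein distance as defined here, and sub_cost = 1 (or per-character weights) gives the plain or weighted variants:

```python
# Minimum edit distance via the tabular DP from the notes.
def min_edit_distance(x, y, sub_cost=2):
    n, m = len(x), len(y)
    # D[i][j] = edit distance between x[:i] and y[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                      # initialization: i deletions
    for j in range(1, m + 1):
        D[0][j] = j                      # initialization: j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = x[i - 1] == y[j - 1]
            D[i][j] = min(
                D[i - 1][j] + 1,                           # deletion
                D[i][j - 1] + 1,                           # insertion
                D[i - 1][j - 1] + (0 if same else sub_cost),  # substitution/match
            )
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # 8 with cost-2 substitutions
```

A full backtrace would additionally store, per cell, which of the three options was taken, then walk the pointers back from D(N, M).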
Language Models
Probabilistic language models compute a probability of a sentence or a sequence of words. Usually
it's P(w1, w2, …, wn−1, wn):
- Chain rule: P(w1 w2 … wn) = ∏i P(wi | w1 w2 … wi−1)
- Conditional probability: P(B | A) = P(A, B) / P(A) ⟺ P(A, B) = P(A) P(B | A)

Markov assumption – we can approximate the probability of a sentence using only the conditional
probability of the last few words: P(w1 w2 … wn) ≈ ∏i P(wi | wi−k … wi−1). Example: P(the | water
is so transparent that) ≈ P(the | that). Multiple models:
- Unigram model: P(w1 w2 … wn) ≈ ∏i P(wi)
- Bigram model: P(w1 w2 … wn) ≈ ∏i P(wi | wi−1)
- Trigram model: P(w1 w2 … wn) ≈ ∏i P(wi | wi−2, wi−1)
- Where P(wi | wi−1) = count(wi−1, wi) / count(wi−1) (the maximum likelihood estimate)
- Transform probabilities to log space to avoid underflow: log(p1 × p2) = log(p1) + log(p2).
N-gram models are insufficient, because language has long-distance dependencies. Also, they only
work well for word prediction if the test corpus looks like the training corpus.
Smoothing – a method used when the test set is not (as) similar to the training set: bigrams that
never occurred in training get probability 0 (making the whole sentence probability 0), and unseen
contexts make the MLE divide by 0.
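The MLE bigram estimates and the log-space trick can be sketched on a toy corpus (the <s>/</s> sentence-boundary markers are an added convention, not fixed by the notes):

```python
import math
from collections import Counter

# Toy bigram model: MLE estimates plus log-space sentence scoring.
corpus = [
    "<s> i am sam </s>",
    "<s> sam i am </s>",
    "<s> i do not like green eggs and ham </s>",
]
unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_mle(w, prev):
    # P(w | prev) = count(prev, w) / count(prev)
    return bigrams[(prev, w)] / unigrams[prev]

def log_prob(sentence):
    words = sentence.split()
    # sum of logs instead of a product of probabilities: avoids underflow
    return sum(math.log(p_mle(w, prev)) for prev, w in zip(words, words[1:]))

print(p_mle("i", "<s>"))               # 2/3
print(log_prob("<s> i am sam </s>"))   # negative log probability
```

Note that log_prob crashes (log of 0) on any bigram unseen in training, which is exactly the problem smoothing addresses.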
