Natural Language Processing Technology
Created @March 24, 2021 2:40 PM
Class S5
Type S5
Materials
Lecture 1
Introduction
NLP:
represent language in a way that a computer can process → representing input
process language in a way that is useful for humans → generating output
understand language structure and language use → computational modelling
Analyzing Language
linguistic pre-processing steps
standardizing the input
normalization and cleaning
remove layout (paragraphs, underlined, bold, italics)
remove/replace emojis and URLs (e.g. replace each URL with a placeholder such as URLS)
replace numbers with NUM
anonymization: replacing phone numbers/passwords
unless you need them!
casing: uppercase vs lowercase vs true case (for example, keeping uppercase for names but
not at sentence beginnings)
sentence segmentation: What are indicators for sentence boundaries?
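The standardization steps above can be sketched with a few regular expressions. This is a minimal sketch under assumptions: the patterns, the placeholder tokens URL/PHONE/NUM and the function names are illustrative choices, not a complete or robust cleaner.

```python
import re
from typing import List

# Assumed, deliberately simple patterns.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PHONE_RE = re.compile(r"\+?\d[\d\s\-]{7,}\d")
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")

def normalize(text: str) -> str:
    # Order matters: URLs and phone numbers first, then remaining numbers.
    text = URL_RE.sub("URL", text)
    text = PHONE_RE.sub("PHONE", text)
    text = NUM_RE.sub("NUM", text)
    return text

def split_sentences(text: str) -> List[str]:
    # Naive indicator: . ! or ? followed by whitespace and an uppercase letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

print(normalize("Call +31 6 1234 5678 or visit https://example.com for 25 euros."))
print(split_sentences("Dr. Smith arrived. He was late."))
```

The second call splits after "Dr.", which shows why a period plus capital letter is not a reliable sentence boundary indicator on its own.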
Linguistic pre-processing
fast developments: what used to be huge research projects can now be done with a single Python package.
performance: very good for generic language, problematic for domain-specific data or
small (low-resource) languages.
word segmentation: how can I decompose a sentence into its words?
tokenization: tokens are all occurrences; types are the distinct tokens (number of types = number of different tokens)
morphological analysis: lemmatization, sub-words, ... [read chapter 2]
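To make the token/type distinction concrete, here is a minimal sketch. The regex tokenizer and the example sentence are assumptions for illustration only; real tokenizers handle many more cases (abbreviations, clitics, URLs).

```python
import re
from collections import Counter

def tokenize(sentence):
    # Assumed rule: a token is a run of word characters (optionally with
    # internal apostrophes) or a single punctuation mark.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

tokens = tokenize("The cat sat on the mat, and the dog sat too.")

# Types are counted case-insensitively here; that is a design choice.
types = Counter(t.lower() for t in tokens)

print(len(tokens), "tokens")   # every occurrence counts
print(len(types), "types")     # only distinct items count
print(types.most_common(2))
```

Whether "The" and "the" count as one type depends on the casing decision made during normalization.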
morphological analysis
we want to decompose a word into its morphemes (units as small as possible): unhappier → un-happy-er
(difficult in Turkish, for example)
highly challenging, because most languages contain many exceptions and morpheme
boundaries can be ambiguous
subwords:
frequent tokens get their own unique vocabulary entry
less frequent tokens are decomposed into subwords
the approaches are purely statistical, not linguistically motivated!
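One common statistical way to get this behaviour is a byte-pair-encoding style merge procedure. The sketch below is a toy version under stated assumptions: the names `learn_merges`, `merge_pair` and `segment` and the word frequencies are made up for illustration, and this is not necessarily the method used in the lecture.

```python
from collections import Counter

def merge_pair(symbols, pair):
    # Replace every adjacent occurrence of `pair` by one merged symbol.
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_merges(word_freqs, num_merges):
    # Start from characters and repeatedly merge the most frequent adjacent pair.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {tuple(merge_pair(list(s), best)): f for s, f in vocab.items()}
    return merges

def segment(word, merges):
    # Apply the learned merges, in order, to a new word.
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

# Hypothetical counts: "the" is frequent, "then" is rare.
merges = learn_merges({"the": 50, "then": 2}, num_merges=2)
print(segment("the", merges))   # frequent word ends up as one unit
print(segment("then", merges))  # rare word is split into subwords
```

Nothing in the procedure knows about morphemes; the splits come only from the pair counts.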
lemmatization: map a word to its dictionary form, happier/happiest → happy. There are ambiguities:
saw → see or saw?
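The saw ambiguity can be illustrated with a toy lookup that also takes the part of speech into account. The table `LEMMA_TABLE`, the function `lemmatize` and the tag names are hypothetical; real lemmatizers use large dictionaries or learned models.

```python
# Hypothetical mini-dictionary keyed by (word form, part of speech).
LEMMA_TABLE = {
    ("happier", "ADJ"): "happy",
    ("happiest", "ADJ"): "happy",
    ("saw", "VERB"): "see",   # "I saw the film"
    ("saw", "NOUN"): "saw",   # "a rusty saw"
}

def lemmatize(word, pos):
    # Fall back to the lowercased word when the pair is unknown.
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(lemmatize("happiest", "ADJ"))
print(lemmatize("saw", "VERB"))
print(lemmatize("saw", "NOUN"))
```

The word form alone is not enough here: the lemma depends on context, which is why lemmatization is usually combined with part-of-speech information.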
now we analyze the lemma
Penn Treebank: 36 part-of-speech labels
note on image above: left more complex, deeper structure
error propagation: an error made in one of the steps carries over into the next step
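A rough way to see why error propagation matters: if every step needs correct input and errors are independent (both are simplifying assumptions), the accuracy of the whole pipeline is at most the product of the step accuracies. The numbers below are made up for illustration.

```python
def pipeline_accuracy(step_accuracies):
    # Assumes independent errors and that a later step cannot recover
    # from a mistake made by an earlier step.
    result = 1.0
    for acc in step_accuracies:
        result *= acc
    return result

# Hypothetical accuracies for tokenization, tagging and parsing.
steps = [0.99, 0.97, 0.95]
print(round(pipeline_accuracy(steps), 4))
```

Even when each step looks very good on its own, the end-to-end figure is lower than the weakest single step.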
corpora and shared tasks
1) How did automated linguistic preprocessing become so good? Tools were trained on manually
annotated corpora, tuned on development data and evaluated on test data. Machine learning and
neural networks boosted performance and facilitated transfer across languages.
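The train / development / test setup can be sketched as a simple random split. The function name `split_corpus`, the 80/10/10 proportions and the seed are assumptions; shared tasks normally ship fixed, predefined splits.

```python
import random

def split_corpus(sentences, dev_fraction=0.1, test_fraction=0.1, seed=0):
    # Shuffle a copy with a fixed seed so the split is reproducible.
    items = list(sentences)
    random.Random(seed).shuffle(items)
    n_dev = int(len(items) * dev_fraction)
    n_test = int(len(items) * test_fraction)
    test = items[:n_test]                  # touched only for the final evaluation
    dev = items[n_test:n_test + n_dev]     # used for tuning
    train = items[n_test + n_dev:]         # used for training
    return train, dev, test

corpus = [f"sentence {i}" for i in range(100)]  # stand-in for annotated data
train, dev, test = split_corpus(corpus)
print(len(train), len(dev), len(test))
```

Keeping the test portion untouched until the end is what makes the reported score a fair estimate on unseen data.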
nlpprogress.com → a good place to look up which package performs best for which task