Summary

JADS Master - Natural Language Processing Summary

Pages: 31
Uploaded on: 09-01-2023
Written in: 2021/2022

Summary for the Natural Language Processing course of the Master Data Science and Entrepreneurship.













1. Introduction
Natural Language Processing Approaches
● Rule-based (rationalism): hand-crafted rules, symbolic manipulation.
● Statistical (empiricism): data-driven (probabilistic or otherwise), shallow machine
learning.
● Massively parallel processing (deep learning): representation learning, human-like
performance.

▶ Natural language processing is about finding patterns in text and explaining them.

Natural Language Processing History
● 1950-1990: Symbolic NLP.
○ Using a collection of rules, a computer can emulate natural language
understanding by applying those rules to the data it is confronted with.
● 1990-2010: Statistical NLP.
○ Apply machine learning techniques to natural language processing.
● 2010-present: Neural NLP.
○ Extension of statistical methods with representation learning and deep neural
networks.

Structured
Labeled data in a (relational) database.

Unstructured
Free text.

Semi-Structured
A mixture of structured and unstructured data (e.g. a database plus free-text notes).

Natural Language Processing Challenge
● Ambiguity (open for interpretation).
● Variation: direct variation, spelling variation, synonyms & syntactic variation.
● World knowledge.
● Context:
○ Domain: document context, genre, purpose, and characteristics.
○ Knowledge: general and domain knowledge resources.
○ Text: use of linguistic information.

Natural Language Processing Tasks
● Text classification: spam filtering, topic modeling, sentiment analysis.
● Information retrieval: recommender systems, search engine, question answering,
summarization.
● Information extraction: template-filling, named entity recognition (NER), relationship
extraction, ontology extraction.





Text Analysis Techniques




2. Text Analysis
Machine Learning
The use and development of computer systems that can learn and adapt without following
explicit instructions, by using algorithms and statistical models to analyze and draw
inferences from patterns in data.

▶ Types of learning:
● Basic: supervised, unsupervised, reinforcement learning.
● Other: semi-supervised, transfer learning, active learning.
▶ Types of tasks:
● Classification.
● Regression.
● Similarity matching.
● Clustering.
● Anomaly detection.
● Co-occurrence grouping.
● Profiling.
● Link prediction.
● Data reduction.
● Causal modeling.
▶ Modeling methods:
● Linear regression.
● Logistic regression.
● Decision trees.
● K-nearest neighbors.
● Naive Bayes classification.
● Mixture models.
● Support vector machines.
● Neural networks.
● Fuzzy inference systems.
● Bayesian networks.

Measuring Classifier Performance
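$$\text{accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$
$$\text{error} = 1 - \text{accuracy}$$
$$\text{precision} = \frac{TP}{TP + FP}$$
$$\text{recall} = \frac{TP}{TP + FN}$$
$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$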
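
As a minimal sketch, the metrics above can be computed directly from the four confusion-matrix counts. The function name `classification_metrics` and the example counts are hypothetical, chosen only for illustration; returning 0.0 when a denominator is zero is a convention assumed here, not something the summary states.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, error, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Assumption: report 0.0 when a metric is undefined (zero denominator).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)
```

With these counts, accuracy is 0.7 while recall is only about 0.67, which shows why a single metric rarely tells the whole story.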





Kappa Statistic
Used to measure inter-rater reliability for qualitative (categorical) items.

$$k = \frac{a - p}{1 - p}$$
● 𝑎: accuracy
● 𝑝: the probability of predicting the correct class due to chance.

▶ If 𝑘 = 1 → perfect model.
▶ If 𝑘 ≈ 0 → no better than random guessing.
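
The summary does not say how the chance term 𝑝 is obtained. The sketch below assumes the usual Cohen's kappa convention for a binary classifier: 𝑝 is the agreement expected if predictions and true labels were independent, estimated from the row and column totals of the confusion matrix. The function name `kappa` and the counts are hypothetical.

```python
def kappa(tp, fp, fn, tn):
    """Kappa for a binary confusion matrix (Cohen's convention assumed for p)."""
    total = tp + fp + fn + tn
    a = (tp + tn) / total  # observed accuracy
    # Assumption: chance agreement = P(pred pos) * P(true pos) + P(pred neg) * P(true neg).
    p = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2
    if p == 1:
        # Degenerate case (only one class present and predicted): kappa is undefined.
        return 0.0
    return (a - p) / (1 - p)


print(kappa(tp=40, fp=10, fn=20, tn=30))  # accuracy 0.7, chance 0.5, kappa about 0.4
print(kappa(tp=50, fp=0, fn=0, tn=50))    # perfect model
print(kappa(tp=25, fp=25, fn=25, tn=25))  # no better than chance
```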

Kappa Curves
Used to select the optimal prediction threshold.

▶ AUK: area under the kappa curve.
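
One plausible reading of threshold selection with kappa is sketched below: compute kappa at each candidate threshold and keep the threshold with the highest value. This is an illustration under assumptions, not the course's exact procedure. The helper names, the scores, and the Cohen-style estimate of chance agreement are all hypothetical, and the sketch does not compute the AUK.

```python
def kappa(tp, fp, fn, tn):
    """Kappa for a binary confusion matrix (Cohen's convention assumed for chance)."""
    total = tp + fp + fn + tn
    a = (tp + tn) / total
    p = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2
    return 0.0 if p == 1 else (a - p) / (1 - p)


def best_threshold(scores, labels, thresholds):
    """Return (threshold, kappa) for the candidate threshold with the highest kappa."""
    best_t, best_k = None, float("-inf")
    for t in thresholds:
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for y_hat, y in zip(preds, labels) if y_hat == 1 and y == 1)
        fp = sum(1 for y_hat, y in zip(preds, labels) if y_hat == 1 and y == 0)
        fn = sum(1 for y_hat, y in zip(preds, labels) if y_hat == 0 and y == 1)
        tn = sum(1 for y_hat, y in zip(preds, labels) if y_hat == 0 and y == 0)
        k = kappa(tp, fp, fn, tn)
        if k > best_k:  # ties keep the earliest threshold
            best_t, best_k = t, k
    return best_t, best_k


# Hypothetical classifier scores and true labels.
scores = [0.1, 0.3, 0.6, 0.9]
labels = [0, 0, 1, 1]
print(best_threshold(scores, labels, thresholds=[0.2, 0.5, 0.8]))
```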

Experiment Design




Cross Validation
● Split data into groups of the same size.
● Hold aside one group for testing and use the remainder for training.
● Repeat for all groups.
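
The three steps above can be sketched as a small index generator. The name `k_fold_splits` is hypothetical, and the sketch assumes the data is already shuffled; when the number of items is not divisible by the number of groups, the group sizes differ by at most one.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) once for each of the k groups."""
    indices = list(range(n_items))
    # Spread any remainder over the first groups so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                    # group held aside for testing
        train = indices[:start] + indices[start + size:]      # remainder used for training
        yield train, test
        start += size


splits = list(k_fold_splits(n_items=10, k=5))
for train, test in splits:
    print("train:", train, "test:", test)
```

Every item ends up in the test group exactly once, so the averaged test score uses all of the data without ever testing on a training item.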

CRISP-DM Framework




Natural Language Processing Terminology
● Text: series of symbols and characters.
● Token: a sequence of symbols (characters) that form a useful semantic unit of
processing.
● Document: a collection of tokens.
● Corpus: a collection of documents.



▶ Fix ambiguity → domain application.
▶ Fix variation → text normalization.

Domain Application
● Text type or communication context (e.g. letters, tweets, chats, reports, news stories,
scientific articles).
● Application domain: area of application.
○ Topics & content.
○ Vocabulary use: terminology, jargon, general.
○ Writing style: formal, informal.
○ Languages.
● Corpus characteristics:
○ Text format: annotations, text, XML, HTML.
○ Text encoding: ASCII, UTF-8.
○ Text unit: documents, paragraphs, sentences, phrases.
○ Text unit length.
○ Vocabulary richness/variations.
○ Document structure (e.g. articles, Wikipedia, etc.).
○ Corpus homogeneity (e.g. Wikipedia, news, etc.).

Domain Considerations
● Data size.
● Private & sensitive data.
● Ethical issues.

Corpus Statistics
● Document count.
● Word count.
● Word frequency.
● Lexical variation in the text (unique words / total words).
● Average sentence length.
● Average document length.
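
A minimal sketch of most of these statistics follows. It assumes a non-empty corpus given as a list of strings and uses lowercased whitespace splitting as a stand-in for real tokenization; the function name `corpus_statistics` and the two example documents are hypothetical. Average sentence length is left out because it needs a sentence splitter.

```python
from collections import Counter


def corpus_statistics(corpus):
    """Basic statistics for a corpus given as a non-empty list of document strings."""
    # Assumption: lowercased whitespace splitting is good enough as tokenization here.
    tokenized = [doc.lower().split() for doc in corpus]
    all_tokens = [tok for doc in tokenized for tok in doc]
    counts = Counter(all_tokens)
    return {
        "document_count": len(corpus),
        "word_count": len(all_tokens),
        "word_frequency": counts,
        "lexical_variation": len(counts) / len(all_tokens),  # unique words / total words
        "avg_document_length": len(all_tokens) / len(corpus),
    }


corpus = ["the cat sat", "the dog sat down"]
stats = corpus_statistics(corpus)
print(stats["word_frequency"].most_common(2))
print(stats["lexical_variation"], stats["avg_document_length"])
```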

▶ For a good understanding, read some documents and look for patterns.

Preprocessing Text




Document Filtering
Select relevant documents (e.g. retrieve tweets with a certain hashtag).
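
The hashtag case can be sketched with a regular expression. The function name `filter_by_hashtag` and the sample tweets are hypothetical; the sketch assumes case-insensitive matching and treats a hashtag as ending at the first non-word character, so `#nlproc` does not count as `#nlp`.

```python
import re


def filter_by_hashtag(documents, hashtag):
    """Keep only the documents that contain the given hashtag as a whole tag."""
    # The lookarounds stop '#nlp' from matching inside a longer tag such as '#nlproc'.
    pattern = re.compile(r"(?<!\w)#" + re.escape(hashtag) + r"(?!\w)", re.IGNORECASE)
    return [doc for doc in documents if pattern.search(doc)]


tweets = [
    "Loving #NLP today",
    "No tags here",
    "#nlproc is not the same tag",
    "Reading about #nlp!",
]
selected = filter_by_hashtag(tweets, "nlp")
print(selected)
```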

Optical Character Recognition (OCR)
Converts scanned text images into text → may introduce a lot of errors.


Uploaded by: tomdewildt, Jheronimus Academy of Data Science