
JADS Master - Natural Language Processing Summary

Pages: 31
Published: 9 January 2023
Written in: 2021/2022

Summary for the Natural Language Processing course of the Master Data Science and Entrepreneurship.


1. Introduction
Natural Language Processing Approaches
● Rule-based (rationalism): hand-crafted rules, symbolic manipulation.
● Statistical (empiricism): data-driven (probabilistic or otherwise), shallow machine
learning.
● Massively parallel processing (deep learning): representation learning, human-like
performance.

▶ Natural language processing is about finding patterns in text and explaining them.

Natural Language Processing History
● 1950-1990: Symbolic NLP.
○ Using a collection of rules, a computer can emulate natural language
understanding by applying those rules to the data it is given.
● 1990-2010: Statistical NLP.
○ Apply machine learning techniques to natural language processing.
● 2010-present: Neural NLP.
○ Extension of statistical methods with representation learning and deep neural
networks.

Structured
Labeled data in a (relational) database.

Unstructured
Free text.

Semi-Structured
A mixture of structured and unstructured data (e.g. a database + free-text notes).

Natural Language Processing Challenges
● Ambiguity (open for interpretation).
● Variation: direct variation, spelling variation, synonyms & syntactic variation.
● World knowledge.
● Context:
○ Domain: document context, genre, purpose, and characteristics.
○ Knowledge: general and domain knowledge resources.
○ Text: use of linguistic information.

Natural Language Processing Tasks
● Text classification: spam filtering, topic modeling, sentiment analysis.
● Information retrieval: recommender systems, search engine, question answering,
summarization.
● Information extraction: template-filling, named entity recognition (NER), relationship
extraction, ontology extraction.





Text Analysis Techniques




2. Text Analysis
Machine Learning
Use and develop computer systems that can learn and adapt without following explicit
instructions by using algorithms and statistical models to analyze and draw inferences from
patterns in data.

▶ Types of learning:
● Basic: supervised, unsupervised, reinforcement learning.
● Other: semi-supervised, transfer learning, active learning.
▶ Types of tasks:
● Classification.
● Regression.
● Similarity matching.
● Clustering.
● Anomaly detection.
● Co-occurrence grouping.
● Profiling.
● Link prediction.
● Data reduction.
● Causal modeling.
▶ Modeling methods:
● Linear regression.
● Logistic regression.
● Decision trees.
● K-nearest neighbors.
● Naive Bayes classification.
● Mixture models.
● Support vector machines.
● Neural networks.
● Fuzzy inference systems.
● Bayesian networks.

Measuring Classifier Performance
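$$\text{accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$
$$\text{error} = 1 - \text{accuracy}$$
$$\text{precision} = \frac{TP}{TP + FP}$$
$$\text{recall} = \frac{TP}{TP + FN}$$
$$f_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$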
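
As a minimal sketch, the measures above can be computed directly from the four confusion-matrix counts (TP, FP, FN, TN: true/false positives and negatives). The function name `classification_metrics` and the example counts are illustrative assumptions, not part of the course material; returning 0.0 when a denominator is zero is also just one possible convention.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, error, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (assumed convention: report 0.0).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, chosen only for illustration.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```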





Kappa Statistic
Used to measure inter-rater reliability for qualitative (categorical) items.

$$k = \frac{a - p}{1 - p}$$
● 𝑎: accuracy
● 𝑝: the probability of predicting the correct class due to chance.

▶ If 𝑘 = 1 → perfect model.
▶ If 𝑘 ≈ 0 → no better than random guessing.
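
The sketch below computes the kappa statistic for a binary classifier. The notes do not say how 𝑝 is estimated; this example assumes the usual Cohen's kappa estimate, where chance agreement is derived from the marginal frequencies of the predicted and actual classes. The function name `kappa` and the counts are illustrative.

```python
def kappa(tp, fp, fn, tn):
    """Kappa statistic k = (a - p) / (1 - p) from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    a = (tp + tn) / total  # observed accuracy
    # Assumed estimate of chance agreement (Cohen's kappa):
    # P(predicted positive) * P(actual positive) + the same for negatives.
    p_pos = ((tp + fp) / total) * ((tp + fn) / total)
    p_neg = ((fn + tn) / total) * ((fp + tn) / total)
    p = p_pos + p_neg
    if p == 1:
        return 0.0  # degenerate case: only one class present and predicted
    return (a - p) / (1 - p)


print(kappa(tp=50, fp=0, fn=0, tn=50))    # perfect model
print(kappa(tp=25, fp=25, fn=25, tn=25))  # no better than chance
print(kappa(tp=40, fp=10, fn=20, tn=30))  # somewhere in between
```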

Kappa Curves
Used to select the optimal prediction threshold.

▶ AUK: area under the kappa curve.
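
One possible way to use kappa for threshold selection is sketched below: sweep a set of candidate thresholds over the classifier's scores, compute kappa at each, and keep the threshold with the highest value. This is an assumption about how the selection could be done, not the course's exact procedure, and it does not compute the AUK. All names and data are hypothetical, and 𝑝 is again estimated from class marginals (Cohen's kappa).

```python
def kappa(tp, fp, fn, tn):
    """Kappa from binary confusion-matrix counts (chance term from marginals)."""
    total = tp + fp + fn + tn
    a = (tp + tn) / total
    p = (((tp + fp) / total) * ((tp + fn) / total)
         + ((fn + tn) / total) * ((fp + tn) / total))
    return 0.0 if p == 1 else (a - p) / (1 - p)


def best_threshold(scores, labels, thresholds):
    """Return (threshold, kappa) for the candidate threshold with the highest kappa."""
    best = None
    for t in thresholds:
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
        tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
        k = kappa(tp, fp, fn, tn)
        if best is None or k > best[1]:
            best = (t, k)
    return best


# Hypothetical classifier scores and true labels.
scores = [0.1, 0.4, 0.55, 0.8, 0.7, 0.2]
labels = [0, 0, 1, 1, 1, 0]
print(best_threshold(scores, labels, thresholds=[0.1, 0.3, 0.5, 0.7, 0.9]))
```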

Experiment Design




Cross Validation
● Split data into groups of the same size.
● Hold aside one group for testing and use the remainder for training.
● Repeat for all groups.
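
The three steps above can be sketched in plain Python as follows. The helper name `k_fold_splits` is an assumption; when the data size is not divisible by k, this sketch lets group sizes differ by at most one. In practice the data is usually shuffled before splitting, which is omitted here.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs, holding out each of k contiguous groups once."""
    fold_size, remainder = divmod(len(items), k)
    start = 0
    for i in range(k):
        # The first `remainder` groups get one extra item.
        end = start + fold_size + (1 if i < remainder else 0)
        test = items[start:end]
        train = items[:start] + items[end:]
        yield train, test
        start = end


data = list(range(10))  # stand-in for 10 labeled examples
for train, test in k_fold_splits(data, k=5):
    print("train:", train, "test:", test)
```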

CRISP-DM Framework




Natural Language Processing Terminology
● Text: series of symbols and characters.
● Token: a sequence of symbols (characters) that forms a useful semantic unit of
processing.
● Document: a collection of tokens.
● Corpus: a collection of documents.
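
To make the terminology concrete, here is a small sketch that turns raw texts into tokens, documents and a corpus. The regular-expression tokenizer (runs of word characters) is a deliberately naive assumption; real tokenizers handle punctuation, contractions and languages without whitespace far more carefully.

```python
import re


def tokenize(text):
    """Split a text into tokens, here naively defined as runs of word characters."""
    return re.findall(r"\w+", text)


# Text: raw strings. Document: a list of tokens. Corpus: a list of documents.
texts = [
    "NLP finds patterns in text.",
    "A corpus is a collection of documents.",
]
corpus = [tokenize(text) for text in texts]

for document in corpus:
    print(len(document), "tokens:", document)
```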



▶ Fix ambiguity → domain application.
▶ Fix variation → text normalization.
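
As an illustration of text normalization, the sketch below applies a few common steps: Unicode normalization, lowercasing, punctuation removal and whitespace collapsing. Which steps are appropriate depends on the application; this selection is an assumption, and it does not address spelling variants or synonyms.

```python
import re
import unicodedata


def normalize(text):
    """Reduce surface variation: Unicode form, case, punctuation, whitespace."""
    text = unicodedata.normalize("NFKC", text)  # unify equivalent Unicode forms
    text = text.lower()                         # remove case variation
    text = re.sub(r"[^\w\s]", " ", text)        # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    return text


print(normalize("  The COLOUR,  the colour! "))
```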

Domain Application
● Text type or communication context (e.g. letters, tweets, chats, reports, news stories,
scientific articles).
● Application domain: area of application.
○ Topics & content.
○ Vocabulary use: terminology, jargon, general.
○ Writing style: formal, informal.
○ Languages.
● Corpus characteristics:
○ Text format: annotations, text, XML, HTML.
○ Text encoding: ASCII, UTF-8.
○ Text unit: documents, paragraphs, sentences, phrases.
○ Text unit length.
○ Vocabulary richness/variations.
○ Document structure (e.g. articles, Wikipedia, etc.).
○ Corpus homogeneity (e.g. Wikipedia, news, etc.).

Domain Considerations
● Data size.
● Private & sensitive data.
● Ethical issues.

Corpus Statistics
● Document count.
● Word count.
● Word frequency.
● Lexical variation in the text (unique words / total words).
● Average sentence length.
● Average document length.
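
The statistics listed above can be computed with the standard library alone, as in the sketch below. The tokenizer (lowercased runs of word characters) and the sentence splitter (split on `.`, `!`, `?`) are simplifying assumptions, and lengths are measured in words; the function name and sample documents are illustrative.

```python
import re
from collections import Counter


def corpus_statistics(documents):
    """Compute basic corpus statistics for a list of raw document strings."""
    doc_tokens = [re.findall(r"\w+", d.lower()) for d in documents]
    all_tokens = [t for tokens in doc_tokens for t in tokens]
    # Naive sentence splitting on ., ! and ? (an assumption of this sketch).
    sentences = [s for d in documents for s in re.split(r"[.!?]+", d) if s.strip()]
    sentence_lengths = [len(re.findall(r"\w+", s)) for s in sentences]
    return {
        "document_count": len(documents),
        "word_count": len(all_tokens),
        "word_frequency": Counter(all_tokens),
        "lexical_variation": len(set(all_tokens)) / max(len(all_tokens), 1),
        "avg_sentence_length": sum(sentence_lengths) / max(len(sentences), 1),
        "avg_document_length": len(all_tokens) / max(len(documents), 1),
    }


docs = ["The cat sat. The cat slept.", "A dog barked."]
stats = corpus_statistics(docs)
for name, value in stats.items():
    print(name, "=", value)
```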

▶ For a good understanding of the corpus, read some documents and look for patterns.

Preprocessing Text




Document Filtering
Select relevant documents (e.g. retrieve tweets with a certain hashtag).
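
A minimal sketch of hashtag-based filtering is shown below. The function name `filter_by_hashtag` and the sample tweets are hypothetical; matching is case-insensitive and requires the hashtag to end at a word boundary, so `#nlp` does not match `#nlproc`.

```python
import re


def filter_by_hashtag(documents, hashtag):
    """Keep only documents that contain the given hashtag as a whole tag."""
    tag = re.escape(hashtag.lstrip("#"))
    # '#' must not be preceded by a word character; the tag must end at a word boundary.
    pattern = re.compile(r"(?<!\w)#" + tag + r"\b", re.IGNORECASE)
    return [d for d in documents if pattern.search(d)]


tweets = [
    "Loving the #NLP course",
    "Lunch time #food",
    "New paper on #nlproc",
    "#nlp rocks",
]
print(filter_by_hashtag(tweets, "#nlp"))
```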

Optical Character Recognition (OCR)
Converts scanned text images into text → may introduce a lot of errors.

