
JADS Master - Natural Language Processing Summary

Pages: 31
Uploaded: 09-01-2023
Written in: 2021/2022

Summary for the Natural Language Processing course of the Master Data Science and Entrepreneurship.


1. Introduction
Natural Language Processing Approaches
● Rule-based (rationalism): hand-crafted rules, symbolic manipulation.
● Statistical (empiricism): data-driven (probabilistic or otherwise), shallow machine
learning.
● Massively parallel processing (deep learning): representation learning, human-like
performance.

▶ Natural language processing is about finding patterns in text and explaining them.

Natural Language Processing History
● 1950-1990: Symbolic NLP.
○ Using a collection of rules, a computer can emulate natural language
understanding by applying those rules to the data it is given.
● 1990-2010: Statistical NLP.
○ Apply machine learning techniques to natural language processing.
● 2010-present: Neural NLP.
○ Extension of statistical methods with representation learning and deep neural
networks.

Structured
Labeled data in a (relational) database.

Unstructured
Free text.

Semi-Structured
A mixture of structured and unstructured data (e.g. a database + free-text notes).

Natural Language Processing Challenges
● Ambiguity (open for interpretation).
● Variation: direct variation, spelling variation, synonyms & syntactic variation.
● World knowledge.
● Context:
○ Domain: document context, genre, purpose, and characteristics.
○ Knowledge: general and domain knowledge resources.
○ Text: use of linguistic information.

Natural Language Processing Tasks
● Text classification: spam filtering, topic modeling, sentiment analysis.
● Information retrieval: recommender systems, search engine, question answering,
summarization.
● Information extraction: template-filling, named entity recognition (NER), relationship
extraction, ontology extraction.




Text Analysis Techniques




2. Text Analysis
Machine Learning
Use and develop computer systems that can learn and adapt without following explicit
instructions by using algorithms and statistical models to analyze and draw inferences from
patterns in data.

▶ Types of learning:
● Basic: supervised, unsupervised, reinforcement learning.
● Other: semi-supervised, transfer learning, active learning.
▶ Types of tasks:
● Classification.
● Regression.
● Similarity matching.
● Clustering.
● Anomaly detection.
● Co-occurrence grouping.
● Profiling.
● Link prediction.
● Data reduction.
● Causal modeling.
▶ Modeling methods:
● Linear regression.
● Logistic regression.
● Decision trees.
● K-nearest neighbors.
● Naive Bayes classification.
● Mixture models.
● Support vector machines.
● Neural networks.
● Fuzzy inference systems.
● Bayesian networks.

Measuring Classifier Performance
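$$\text{accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$
$$\text{error} = 1 - \text{accuracy}$$
$$\text{precision} = \frac{TP}{TP + FP}$$
$$\text{recall} = \frac{TP}{TP + FN}$$
$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$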
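
A minimal sketch of these metrics, computed from the four confusion-matrix counts. The function name `classification_metrics` and the example counts are made up for illustration; undefined ratios (zero denominators) are returned as 0.0 here, which is an assumption rather than a rule from the course.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, error, precision, recall and F1 from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Assumption: report 0.0 when a ratio is undefined (zero denominator).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    return {
        "accuracy": accuracy,
        "error": 1 - accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for a binary classifier evaluated on 100 examples.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision and recall pull in different directions, so F1 (their harmonic mean) is useful when a single number has to summarise both.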




Kappa Statistic
Used to measure inter-rater reliability for qualitative (categorical) items.

$$k = \frac{a - p}{1 - p}$$
● 𝑎: accuracy
● 𝑝: the probability of predicting the correct class due to chance.

▶ If 𝑘 = 1 → perfect model.
▶ If 𝑘 ≈ 0 → no better than random guessing.
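
As a sketch, the kappa statistic for a binary classifier can be computed from confusion-matrix counts. The notes do not say how 𝑝 is obtained; the code below assumes the usual Cohen's kappa estimate, where chance agreement comes from the marginal rates of the predicted and the actual classes. The function name `kappa` is hypothetical.

```python
def kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Kappa statistic k = (a - p) / (1 - p) for a binary confusion matrix."""
    n = tp + fp + fn + tn
    a = (tp + tn) / n  # observed accuracy

    # Assumption: chance agreement p is estimated as in Cohen's kappa, i.e.
    # P(predicted positive) * P(actual positive)
    # + P(predicted negative) * P(actual negative).
    p_pos = ((tp + fp) / n) * ((tp + fn) / n)
    p_neg = ((fn + tn) / n) * ((fp + tn) / n)
    p = p_pos + p_neg

    if p == 1.0:
        # Degenerate case: only one class occurs, so kappa is undefined.
        return 0.0
    return (a - p) / (1 - p)


print(kappa(tp=40, fp=10, fn=20, tn=30))  # accuracy 0.7, chance agreement 0.5
print(kappa(tp=50, fp=0, fn=0, tn=50))    # perfect model
```

In the first call the accuracy of 0.7 looks reasonable, but half of that agreement is expected by chance, so kappa is noticeably lower than the accuracy.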

Kappa Curves
Used to select the optimal prediction threshold.

▶ AUK: area under the kappa curve.
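
One way to use kappa for threshold selection, sketched under assumptions: evaluate kappa at each candidate threshold and keep the threshold with the highest value. The scores, labels, candidate thresholds and function names below are invented for illustration, chance agreement is estimated as in Cohen's kappa, and the AUK itself is not computed.

```python
def kappa_at_threshold(scores, labels, threshold):
    """Kappa of the binary predictions obtained by thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    n = len(labels)
    tp = sum(1 for yhat, y in zip(preds, labels) if yhat == 1 and y == 1)
    fp = sum(1 for yhat, y in zip(preds, labels) if yhat == 1 and y == 0)
    fn = sum(1 for yhat, y in zip(preds, labels) if yhat == 0 and y == 1)
    tn = n - tp - fp - fn

    a = (tp + tn) / n
    # Assumption: chance agreement from the marginals, as in Cohen's kappa.
    p = ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
    return 0.0 if p == 1.0 else (a - p) / (1 - p)


def best_threshold(scores, labels, thresholds):
    """Return the candidate threshold with the highest kappa."""
    return max(thresholds, key=lambda t: kappa_at_threshold(scores, labels, t))


# Hypothetical classifier scores with their true labels.
scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
labels = [0, 0, 0, 1, 1, 1]
candidates = [0.25, 0.5, 0.75]

for t in candidates:
    print(t, round(kappa_at_threshold(scores, labels, t), 3))
print("best:", best_threshold(scores, labels, candidates))
```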

Experiment Design




Cross Validation
● Split data into groups of the same size.
● Hold aside one group for testing and use the remainder for training.
● Repeat for all groups.

CRISP-DM Framework




Natural Language Processing Terminology
● Text: series of symbols and characters.
● Token: a sequence of symbols (characters) that form a useful semantic unit of
processing.
● Document: a collection of tokens.
● Corpus: a collection of documents.


▶ Fix ambiguity → domain application.
▶ Fix variation → text normalization.

Domain Application
● Text type or communication context (e.g. letters, tweets, chats, reports, news stories,
scientific articles).
● Application domain: area of application.
○ Topics & content.
○ Vocabulary use: terminology, jargon, general.
○ Writing style: formal, informal.
○ Languages.
● Corpus characteristics:
○ Text format: annotations, text, XML, HTML.
○ Text encoding: ASCII, UTF-8.
○ Text unit: documents, paragraphs, sentences, phrases.
○ Text unit length.
○ Vocabulary richness/variations.
○ Document structure (e.g. articles, Wikipedia, etc.).
○ Corpus homogeneity (e.g. Wikipedia, news, etc.).

Domain Considerations
● Data size.
● Private & sensitive data.
● Ethical issues.

Corpus Statistics
● Document count.
● Word count.
● Word frequency.
● Lexical variation in the text (unique words / total words).
● Average sentence length.
● Average document length.

▶ For a good understanding, read some documents → look for patterns.
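
A rough sketch of how the corpus statistics listed above could be computed. The two-document corpus is invented, and the regex-based `tokenize` and `split_sentences` helpers are simplifying assumptions (lowercased word characters as tokens, sentences split on `.`, `!` and `?`), not the course's preprocessing.

```python
import re
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: lowercase runs of letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def split_sentences(text):
    """Naive sentence splitter on ., ! and ?."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def corpus_statistics(documents):
    """Compute basic statistics; assumes a non-empty corpus with some text."""
    tokens = [t for doc in documents for t in tokenize(doc)]
    sentences = [s for doc in documents for s in split_sentences(doc)]
    frequency = Counter(tokens)
    return {
        "document_count": len(documents),
        "word_count": len(tokens),
        "word_frequency": frequency,
        # Lexical variation: unique words / total words.
        "lexical_variation": len(frequency) / len(tokens),
        "avg_sentence_length": len(tokens) / len(sentences),
        "avg_document_length": len(tokens) / len(documents),
    }


# Hypothetical toy corpus.
corpus = [
    "The cat sat on the mat. The dog barked.",
    "A dog chased the cat.",
]
stats = corpus_statistics(corpus)
print(stats["word_count"], stats["word_frequency"].most_common(3))
print(round(stats["lexical_variation"], 3))
```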

Preprocessing Text




Document Filtering
Select relevant documents (e.g. retrieve tweets with a certain hashtag).
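
A small sketch of the hashtag example. The tweets and the helper names `has_hashtag` and `filter_documents` are made up, and matching is assumed to be case-insensitive on whole hashtags.

```python
import re


def has_hashtag(text, hashtag):
    """True if the text contains the given hashtag as a whole tag."""
    wanted = hashtag.lstrip("#").lower()
    found = {tag.lower() for tag in re.findall(r"#(\w+)", text)}
    return wanted in found


def filter_documents(documents, hashtag):
    """Keep only the documents that carry the hashtag."""
    return [doc for doc in documents if has_hashtag(doc, hashtag)]


# Hypothetical tweets.
tweets = [
    "Loving the #NLP course",
    "Lunch time #food",
    "#nlp and #ml reading list",
    "nlp without a tag",
    "New #nlproc paper",
]
relevant = filter_documents(tweets, "#nlp")
print(relevant)
```

Matching whole tags rather than substrings keeps `#nlproc` out of the result when filtering for `#nlp`.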

Optical Character Recognition (OCR)
Converts scanned text images into text → may introduce a lot of errors.


Seller: tomdewildt, Jheronimus Academy of Data Science