Summary Week 5-7 Data Science

Summary of the main material from the second part of Data Science. Perfect for studying before the final exam.


Week 5 - Text Data Processing

Natural Language Processing (NLP):

Examples:
● text preprocessing
● bag of words and TF-IDF
● topic modeling
● word embeddings and Word2Vec
● sentence/document representations
● attention mechanism

● before the deep learning era, text had to be preprocessed using tokenization and normalization
● tokenization → separates a sentence into word fragments
○ we can lowercase the text first, before tokenization
○ e.g. tokens = nltk.tokenize.word_tokenize(s.lower())
● during tokenization we can also remove unwanted tokens → such as punctuation, digits, symbols, emojis, and stop words
○ e.g.:
■ >>> stws = nltk.corpus.stopwords.words('english')
■ >>> tokens = [t for t in tokens if t.isalpha() and t not in stws]
● normalization →
○ stemming → chops off or replaces word endings with the goal of approximating the word's base form
○ e.g.:
■ >>> stemmer = nltk.stem.porter.PorterStemmer()
■ >>> clean_tokens = [stemmer.stem(t) for t in tokens]
○ lemmatization → uses dictionaries and full morphological analysis to correctly identify the lemma of each word
○ e.g.:
■ >>> from nltk.corpus import wordnet
■ >>> lemmatizer = nltk.stem.WordNetLemmatizer()
■ >>> pos = [wordnet_pos(p) for p in nltk.pos_tag(tokens)]
■ >>> clean_tokens = [lemmatizer.lemmatize(t, p) for t, p in pos]
■ to perform lemmatization appropriately we need POS (Part Of Speech) tagging, which means labeling each word with its grammatical role in the sentence
■ e.g.:
● def wordnet_pos(nltk_pos):
●     if nltk_pos[1].startswith('V'): return (nltk_pos[0], wordnet.VERB)
●     if nltk_pos[1].startswith('J'): return (nltk_pos[0], wordnet.ADJ)
●     if nltk_pos[1].startswith('R'): return (nltk_pos[0], wordnet.ADV)
●     else: return (nltk_pos[0], wordnet.NOUN)
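Putting the steps above together, a minimal end-to-end preprocessing sketch (assuming the NLTK data packages punkt, stopwords, wordnet and averaged_perceptron_tagger have been downloaded; the example sentence is made up):

import nltk
from nltk.corpus import wordnet

def wordnet_pos(nltk_pos):
    # map an NLTK POS tag onto the tag set the WordNet lemmatizer expects
    if nltk_pos[1].startswith('V'): return (nltk_pos[0], wordnet.VERB)
    if nltk_pos[1].startswith('J'): return (nltk_pos[0], wordnet.ADJ)
    if nltk_pos[1].startswith('R'): return (nltk_pos[0], wordnet.ADV)
    else: return (nltk_pos[0], wordnet.NOUN)

s = "The striped bats were hanging on their feet."  # made-up example sentence

# 1. tokenization (lowercase first)
tokens = nltk.tokenize.word_tokenize(s.lower())

# 2. remove punctuation, digits and stop words
stws = nltk.corpus.stopwords.words('english')
tokens = [t for t in tokens if t.isalpha() and t not in stws]

# 3. lemmatization with POS tags
lemmatizer = nltk.stem.WordNetLemmatizer()
pos = [wordnet_pos(p) for p in nltk.pos_tag(tokens)]
clean_tokens = [lemmatizer.lemmatize(t, p) for t, p in pos]
print(clean_tokens)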

● now we have the cleaned tokens that represent a sentence. We need to transform them into data points in some high-dimensional space
○​ Bag of Words
○​ vectors → data points → arrays of numbers that encode both the direction
and length information
○​ Bag of Words → can be problematic because it weights all words equally,
even after removing stop words
○ solution → use TF-IDF (term frequency-inverse document frequency) to transform sentences or documents into vectors (a worked sketch follows after this list)
■​ weighted Bag of Words
■ w(t, d) = tf(t, d) × idf(t, D)
●​ Term Frequency (TF) → measures how frequently a term
(word) appears in a document. There are different
implementations such as using a log function to scale it down
○ tf(t, d) = f(t, d) → the raw count of term t in document d
○ alternative implementation: tf(t, d) = log10(f(t, d) + 1)
● Inverse Document Frequency (IDF) → weights each word by considering how frequently it appears across different documents. IDF is higher when the term appears in fewer documents
● idf(t, D) = log10(N / n_t)

○ N → number of documents
○ n_t → number of documents t appears in
○ we can also use topic modeling to encode a sentence/document into a distribution of topics
■ e.g. → Latent Dirichlet Allocation
● each topic vector is represented by a list of words with different weights
● after transforming text into vectors, we can use these vectors for natural language processing tasks, such as sentence/document classification
○ we can use one-hot encoding → inefficient because it creates long vectors with many zeros, which uses a lot of computer memory
■ it also does not encode similarity, e.g. the cosine similarity between two different one-hot encoded vectors is always zero
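A minimal sketch of the TF-IDF weighting described above, computed by hand on a tiny made-up corpus (the documents, the log-scaled tf variant, and the helper names tf/idf are illustrative assumptions, not from the course material):

import math
from collections import Counter

# tiny made-up corpus of already-tokenized documents
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "dogs and cats are pets".split(),
]

N = len(docs)                        # number of documents
counts = [Counter(d) for d in docs]  # raw term counts f(t, d) per document

def tf(t, d):
    # log-scaled term frequency: log10(f(t, d) + 1)
    return math.log10(counts[d][t] + 1)

def idf(t):
    # inverse document frequency: log10(N / n_t), higher for rarer terms
    n_t = sum(1 for c in counts if t in c)
    return math.log10(N / n_t)

# TF-IDF weight of "cat" and "the" in document 0:
# "the" appears in most documents, so its weight is much lower
for term in ("cat", "the"):
    print(term, tf(term, 0) * idf(term))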

● the dot product of two vectors can also be used to measure similarity; it considers both the angle and the vector lengths. Cosine similarity is a normalized dot product
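A short sketch of this difference with NumPy (assumed available); the two vectors are made-up examples:

import numpy as np

# two made-up document vectors (e.g. TF-IDF weights)
a = np.array([1.0, 2.0, 0.0, 3.0])
b = np.array([2.0, 1.0, 1.0, 0.0])

dot = np.dot(a, b)  # depends on both the angle between the vectors and their lengths
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # normalized dot product: angle only

print(dot, cosine)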




exercise 7.1

● we can use word embeddings to efficiently represent text as vectors, in which similar words have a similar encoding in a high-dimensional space
● position (distance and direction) in the word embedding vector space can encode semantic relations, such as the relation between a country and its capital
○ we can represent words by their context
○ Word2Vec → method to train word embeddings by context. The goal is to use the center word to predict nearby words as accurately as possible, based on probabilities
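A minimal sketch of training such embeddings with gensim's Word2Vec (assuming gensim is installed; the toy corpus and all parameter values are illustrative only, and far too small to produce meaningful embeddings):

from gensim.models import Word2Vec

# tiny made-up corpus of pre-tokenized sentences
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["paris", "is", "the", "capital", "of", "france"],
    ["berlin", "is", "the", "capital", "of", "germany"],
]

# sg=1 selects the skip-gram variant: the center word is used to predict nearby context words
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

vector = model.wv["cat"]                     # 50-dimensional embedding of "cat"
print(model.wv.most_similar("cat", topn=3))  # nearest words in the embedding space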