Week 5 - Text Data Processing
Natural Language Processing (NLP):
Examples:
● Text preprocessing
● Bag of Words and TF-IDF
● Topic modeling
● Word embeddings and Word2Vec
● Sentence/document representations
● Attention mechanism
● before the deep learning era, text needed to be preprocessed using tokenization and
normalization
● tokenization → splits a sentence into word fragments (tokens)
○ we can lowercase the text before tokenization
○ e.g. tokens = nltk.tokenize.word_tokenize(s.lower())
● during tokenization we can also remove unwanted tokens → such as punctuation,
digits, symbols, emojis, and stop words
○ e.g.:
■ >>> stws = nltk.corpus.stopwords.words('english')
■ >>> tokens = [t for t in tokens if t.isalpha() and t not in stws]
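A minimal, self-contained version of these two steps (a sketch; it assumes the NLTK tokenizer model and stop word lists can be downloaded, and the example sentence is made up):

import nltk

nltk.download('punkt')      # tokenizer model (newer NLTK versions may need 'punkt_tab' instead)
nltk.download('stopwords')  # stop word lists (one-time download)

s = "The 2 cats sat on the mat!"                 # toy example sentence
tokens = nltk.tokenize.word_tokenize(s.lower())  # lowercase, then tokenize

stws = set(nltk.corpus.stopwords.words('english'))
# keep only alphabetic tokens that are not stop words
tokens = [t for t in tokens if t.isalpha() and t not in stws]
print(tokens)  # ['cats', 'sat', 'mat']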
● normalization →
○ stemming → chops off or replaces word endings with the goal of approximating the
word's root form
○ e.g.:
■ >>> stemmer = nltk.stem.porter.PorterStemmer()
■ >>> clean_tokens = [stemmer.stem(t) for t in tokens]
○ lemmatization → uses dictionaries and full morphological analysis to correctly
identify the lemma for each word
○ e.g.:
■ >>> from nltk.corpus import wordnet
■ >>> lemmatizer = nltk.stem.WordNetLemmatizer()
■ >>> pos = [wordnet_pos(p) for p in nltk.pos_tag(tokens)]
■ >>> clean_tokens = [lemmatizer.lemmatize(t, p) for t, p in pos]
■ to perform lemmatization appropriately we need POS (part-of-speech)
tagging, which labels the grammatical role each word plays in the
sentence
■ e.g.:
● def wordnet_pos(nltk_pos):
●     if nltk_pos[1].startswith('V'): return (nltk_pos[0], wordnet.VERB)
●     if nltk_pos[1].startswith('J'): return (nltk_pos[0], wordnet.ADJ)
●     if nltk_pos[1].startswith('R'): return (nltk_pos[0], wordnet.ADV)
●     return (nltk_pos[0], wordnet.NOUN)
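■ a quick illustration of why the POS tag matters (assuming the WordNet data is
downloaded; 'running' is just an example token):
● >>> lemmatizer.lemmatize('running')                 # defaults to noun → 'running'
● >>> lemmatizer.lemmatize('running', wordnet.VERB)   # tagged as a verb → 'run'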
● now we have the cleaned tokens that represent a sentence; we need to transform
them into data points in some high-dimensional space
○ vectors → data points → arrays of numbers that encode both direction
and length information
○ one option is Bag of Words → represent a document by how often each
vocabulary word occurs in it
○ Bag of Words can be problematic because it weights all words equally,
even after removing stop words
○ solution → use TF-IDF (term frequency-inverse document frequency) to
transform sentences or documents into vectors
■ weighted Bag of Words
■ w_{t,d} = tf(t, d) × idf(t, D)
● Term Frequency (TF) → measures how frequently a term
(word) appears in a document. There are different
implementations such as using a log function to scale it down
○ tf(t, d) = f_{t,d} (the raw count of term t in document d)
○ alternative implementation: tf(t, d) = log10(f_{t,d} + 1)
● Inverse Document Frequency (IDF) → weights each word by
considering how frequently it appears across different documents.
IDF is higher when the term appears in fewer documents
● idf(t, D) = log10(N / n_t)
○ N → total number of documents in the collection D
○ n_t → number of documents in which term t appears
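A minimal sketch of both representations with scikit-learn (an assumed dependency; the two toy documents are made up). Note that scikit-learn's TF-IDF uses a smoothed, natural-log variant of the idf formula above:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]  # toy documents

# Bag of Words: one raw term-count vector per document
bow = CountVectorizer()
X_counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())  # the learned vocabulary
print(X_counts.toarray())

# TF-IDF: counts reweighted so that terms occurring in many documents count less
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)
print(X_tfidf.toarray())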
○ we can also use topic modeling to encode a sentence/document into a
distribution of topics
■ e.g. Latent Dirichlet Allocation (LDA) → see the sketch after this list
● each topic is represented by a list of words with different
weights
● after transforming text into vectors, we can use these vectors
for natural language processing tasks, such as
sentence/document classification
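A minimal topic-modeling sketch; scikit-learn's LatentDirichletAllocation is an assumed choice here (gensim is a common alternative), and the four toy documents are made up:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat chased the mouse", "stocks and bonds fell today",
        "the dog chased the cat", "markets rallied on bond news"]

counts = CountVectorizer(stop_words="english")
X = counts.fit_transform(docs)

# learn 2 topics; each topic assigns a weight to every vocabulary word
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

words = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:3]  # indices of the 3 heaviest words per topic
    print(f"topic {k}:", [words[i] for i in top])

# a document is then encoded as a distribution over the learned topics
print(lda.transform(X[:1]))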
○ we can use one-hot encoding → inefficient because it creates long vectors
with many zeros, which wastes memory
■ it also does not encode similarity, e.g. the cosine similarity between two
different one-hot encoded vectors is always zero
● the dot product of two vectors can also be used to measure similarity; it
considers both the angle and the vector lengths. Cosine similarity is a
normalized dot product (see the sketch below)
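A small NumPy sketch of both measures (the vectors are made up), including the one-hot case above:

import numpy as np

def cosine_similarity(a, b):
    # dot product normalized by the two vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 3.0, 1.0])
print(np.dot(a, b))             # dot product: depends on angle and lengths
print(cosine_similarity(a, b))  # cosine similarity: depends on the angle only

# two different one-hot vectors never overlap, so their similarity is 0
u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])
print(cosine_similarity(u, v))  # 0.0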
exercise 7.1
● we can use word embeddings to efficiently represent text as vectors, in which similar
words have a similar encoding in a high dimensional space
● position (distance and direction) in the word embedding vector space can encode
semantic relations, such as the relation between a country and its capital
○ we can represent words by their context
○ Word2Vec → a method to train word embeddings from context; in the
skip-gram variant, the goal is to use the center word to predict nearby
words as accurately as possible, based on probabilities (see the sketch below)
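A minimal Word2Vec training sketch with gensim (an assumed dependency; the tiny corpus below is made up, so the resulting embeddings are only illustrative):

from gensim.models import Word2Vec

# each sentence is a list of preprocessed tokens (toy corpus)
sentences = [
    ["paris", "is", "the", "capital", "of", "france"],
    ["berlin", "is", "the", "capital", "of", "germany"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

# sg=1 selects the skip-gram objective: predict nearby words from the center word
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["paris"].shape)                 # a 50-dimensional embedding vector
print(model.wv.similarity("paris", "berlin"))  # cosine similarity of two embeddings
print(model.wv.most_similar("capital", topn=3))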