Week 5 - Text Data Processing
Natural Language Processing (NLP):
Examples:
● Text preprocessing
● Bag of Words and TF-IDF
● Topic modeling
● Word embeddings and Word2Vec
● Sentence/document representations
● Attention mechanism
● before the deep learning era, text needed to be preprocessed using tokenization and
normalization
● tokenization → splits a sentence into word fragments (tokens)
○ we can lowercase the text before tokenization
○ e.g. tokens = nltk.tokenize.word_tokenize(s.lower())
● during tokenization we can also remove unwanted tokens → such as punctuation,
digits, symbols, emojis, and stop words
○ e.g.:
■ >>> stws = nltk.corpus.stopwords.words('english')
■ >>> tokens = [t for t in tokens if t.isalpha() and t not in stws]
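A minimal, self-contained version of these two steps (a sketch; it assumes the NLTK tokenizer model and stop word lists can be downloaded, and the example sentence is made up):

import nltk

nltk.download('punkt')      # tokenizer model (newer NLTK versions may need 'punkt_tab' instead)
nltk.download('stopwords')  # stop word lists (one-time download)

s = "The 2 cats sat on the mat!"                 # toy example sentence
tokens = nltk.tokenize.word_tokenize(s.lower())  # lowercase, then tokenize

stws = set(nltk.corpus.stopwords.words('english'))
# keep only alphabetic tokens that are not stop words
tokens = [t for t in tokens if t.isalpha() and t not in stws]
print(tokens)  # ['cats', 'sat', 'mat']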
● normalization →
○ stemming → chops off or replaces word endings with the goal of approximating the
word's root form
○ e.g.:
■ >>> stemmer = nltk.stem.porter.PorterStemmer()
■ >>> clean_tokens = [stemmer.stem(t) for t in tokens]
○ lemmatization → uses dictionaries and full morphological analysis to correctly
identify the lemma for each word
○ e.g.:
■ >>> from nltk.corpus import wordnet
■ >>> lemmatizer = nltk.stem.WordNetLemmatizer()
■ >>> pos = [wordnet_pos(p) for p in nltk.pos_tag(tokens)]
■ >>> clean_tokens = [lemmatizer.lemmatize(t, p) for t, p in pos]
■ to perform lemmatization appropriately we need POS (part-of-speech)
tagging, which labels the grammatical role each word plays in the
sentence
■ e.g.:
● def wordnet_pos(nltk_pos):
●     if nltk_pos[1].startswith('V'): return (nltk_pos[0], wordnet.VERB)
●     if nltk_pos[1].startswith('J'): return (nltk_pos[0], wordnet.ADJ)
●     if nltk_pos[1].startswith('R'): return (nltk_pos[0], wordnet.ADV)
●     return (nltk_pos[0], wordnet.NOUN)
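■ a quick illustration of why the POS tag matters (assuming the WordNet data is
downloaded; 'running' is just an example token):
● >>> lemmatizer.lemmatize('running')                 # defaults to noun → 'running'
● >>> lemmatizer.lemmatize('running', wordnet.VERB)   # tagged as a verb → 'run'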
● now we have the cleaned tokens that represent a sentence; we need to transform
them into data points in some high-dimensional space
○ vectors → data points → arrays of numbers that encode both direction
and length information
○ one option is Bag of Words → represent a document by how often each
vocabulary word occurs in it
○ Bag of Words can be problematic because it weights all words equally,
even after removing stop words
○ solution → use TF-IDF (term frequency-inverse document frequency) to
transform sentences or documents into vectors
■ weighted Bag of Words
■ w_{t,d} = tf(t, d) × idf(t, D)
● Term Frequency (TF) → measures how frequently a term
(word) appears in a document. There are different
implementations such as using a log function to scale it down
○ tf(t, d) = f_{t,d} (the raw count of term t in document d)
○ alternative implementation: tf(t, d) = log10(f_{t,d} + 1)
● Inverse Document Frequency (IDF) → weights each word by
considering how frequently it appears across different documents.
IDF is higher when the term appears in fewer documents
● idf(t, D) = log10(N / n_t)
○ N → total number of documents in the collection D
○ n_t → number of documents in which term t appears
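A minimal sketch of both representations with scikit-learn (an assumed dependency; the two toy documents are made up). Note that scikit-learn's TF-IDF uses a smoothed, natural-log variant of the idf formula above:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]  # toy documents

# Bag of Words: one raw term-count vector per document
bow = CountVectorizer()
X_counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())  # the learned vocabulary
print(X_counts.toarray())

# TF-IDF: counts reweighted so that terms occurring in many documents count less
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)
print(X_tfidf.toarray())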
○ we can also use topic modeling to encode a sentence/document into a
distribution of topics
■ e.g. Latent Dirichlet Allocation (LDA) → see the sketch after this list
● each topic is represented by a list of words with different
weights
● after transforming text into vectors, we can use these vectors
for natural language processing tasks, such as
sentence/document classification
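A minimal topic-modeling sketch; scikit-learn's LatentDirichletAllocation is an assumed choice here (gensim is a common alternative), and the four toy documents are made up:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat chased the mouse", "stocks and bonds fell today",
        "the dog chased the cat", "markets rallied on bond news"]

counts = CountVectorizer(stop_words="english")
X = counts.fit_transform(docs)

# learn 2 topics; each topic assigns a weight to every vocabulary word
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

words = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:3]  # indices of the 3 heaviest words per topic
    print(f"topic {k}:", [words[i] for i in top])

# a document is then encoded as a distribution over the learned topics
print(lda.transform(X[:1]))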
○ we can use one-hot encoding → inefficient because it creates long vectors
with many zeros, which wastes memory
■ it also does not encode similarity, e.g. the cosine similarity between two
different one-hot encoded vectors is always zero
● the dot product of two vectors can also be used to measure similarity; it
considers both the angle and the vector lengths. Cosine similarity is a
normalized dot product (see the sketch below)
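A small NumPy sketch of both measures (the vectors are made up), including the one-hot case above:

import numpy as np

def cosine_similarity(a, b):
    # dot product normalized by the two vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 3.0, 1.0])
print(np.dot(a, b))             # dot product: depends on angle and lengths
print(cosine_similarity(a, b))  # cosine similarity: depends on the angle only

# two different one-hot vectors never overlap, so their similarity is 0
u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])
print(cosine_similarity(u, v))  # 0.0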
exercise 7.1
● we can use word embeddings to efficiently represent text as vectors, in which similar
words have a similar encoding in a high dimensional space
● position (distance and direction) in the word embedding vector space can encode
semantic relations, such as the relation between a country and its capital
○ we can represent words by their context
○ Word2Vec → a method to train word embeddings from context; in the
skip-gram variant, the goal is to use the center word to predict nearby
words as accurately as possible, based on probabilities (see the sketch below)
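A minimal Word2Vec training sketch with gensim (an assumed dependency; the tiny corpus below is made up, so the resulting embeddings are only illustrative):

from gensim.models import Word2Vec

# each sentence is a list of preprocessed tokens (toy corpus)
sentences = [
    ["paris", "is", "the", "capital", "of", "france"],
    ["berlin", "is", "the", "capital", "of", "germany"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

# sg=1 selects the skip-gram objective: predict nearby words from the center word
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["paris"].shape)                 # a 50-dimensional embedding vector
print(model.wv.similarity("paris", "berlin"))  # cosine similarity of two embeddings
print(model.wv.most_similar("capital", topn=3))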