Problem Figuring out which movie reviews are positive/negative without having to read them.
Search for and count positive and negative words.
Example 1
Positive: good, excellent
Negative: bad, worst
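A minimal sketch of this counting approach, using only the two example words of each polarity as a toy lexicon (real sentiment lexicons are much larger):

```python
# Toy lexicons: just the example words above, purely for illustration.
POSITIVE = {"good", "excellent"}
NEGATIVE = {"bad", "worst"}

def naive_sentiment(review: str) -> str:
    # Count how many lexicon words appear in the review.
    words = review.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"
```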
Logistic Regression
To compute a 'positiveness score' based on the features: a way to learn the weights of the different words to measure the positiveness of the review.
Example 1
Pr[x ∈ pos] = σ(α_good · x_good + α_excellent · x_excellent + α_bad · x_bad + α_worst · x_worst)
with σ the sigmoid function, α_w the learned weight of word w, and x_w the count of word w in the review.
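A sketch of this scoring with hypothetical weights (the α values below are made up for illustration; in practice they are learned from labeled reviews):

```python
import math

def sigmoid(z: float) -> float:
    # The logistic function: maps any real score into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned weights alpha_w; positive words get positive weights.
alpha = {"good": 1.2, "excellent": 2.0, "bad": -1.5, "worst": -2.5}

def prob_positive(counts: dict) -> float:
    # Weighted sum of word counts, squashed by the sigmoid.
    z = sum(alpha[w] * counts.get(w, 0) for w in alpha)
    return sigmoid(z)
```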
Bag of Words
You can transform any text of any length into a vector of fixed size.
Figure out the words used by reviewers and count each one.
Corpus: all the reviews
Token: unit of text; word, punctuation, …
Vocabulary/Dictionary: all unique words appearing in the corpus
V: size of the vocabulary, the number of unique words
Corpus Frequency: number of times the word appears in all reviews
Term Frequency (in a document): number of times the word appears in one review
Document Frequency: number of reviews the word appears in
For each document:
- Create a vector of dimension V (#unique words = #positions in this new vector);
- In position i, write the number of times the token (smallest unit of text) at position i in the dictionary appears in the document;
- Many words of the Vocabulary are not mentioned in this particular document/review, so their count will be 0 and the vector will be mostly zeros.
Example
Sentence 1: ‘The cat sat on the hat’
Sentence 2: ‘the dog ate the cat and the hat’
Vocabulary: 8 words, so number of dimensions of BoW is 8.
and, ate, cat, dog, hat, on, sat, the
BoW 1: [ 0, 0, 1, 0, 1, 1, 1, 2 ]
BoW 2: [ 1, 1, 1, 1, 1, 0, 0, 3 ]
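The example above can be reproduced in plain Python, without sklearn (vocabulary sorted alphabetically, as shown):

```python
# The two example sentences.
docs = ["The cat sat on the hat", "the dog ate the cat and the hat"]

# Vocabulary: unique lowercase words, sorted alphabetically.
vocab = sorted({w for d in docs for w in d.lower().split()})
# → ['and', 'ate', 'cat', 'dog', 'hat', 'on', 'sat', 'the']

def bow(text: str) -> list:
    # Count, for each vocabulary word, how often it appears in the text.
    words = text.lower().split()
    return [words.count(v) for v in vocab]
```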
NOTEBOOK 01 Bag of Words
Document: the smallest unit of text for your use case (paper, paragraph, recipe)
Use case: the typical question you are looking for the answer to
Query: the text you will use to search in your corpus
Examples
Use case 1: 'Which academic papers are about black holes?'
Corpus: academic papers uploaded to ArXiv
Document: 1 paper
Query: 'black hole'
Use case 2: 'Where does Victor Hugo mention Notre-Dame?'
Corpus: the entire works of Victor Hugo
Document: 1 paragraph
Query: 'notre dame'
Tokenizer: program that takes text and splits it into smaller units, e.g. a book into chapters, paragraphs, sentences or words.
NLTK and SpaCy are Python libraries for text analytics, but they might produce different text splits.
The sentence tokenizer will split a text into sentences
The word tokenizer will split a text into words
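A rough sketch of what these two tokenizers do, using only the standard library (NLTK's and SpaCy's actual tokenizers handle many more edge cases, such as abbreviations and contractions):

```python
import re

def sentence_tokenize(text: str) -> list:
    # Split on sentence-ending punctuation followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize(text: str) -> list:
    # Words and standalone punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)
```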
SKLEARN Generalities
Classes like 'CountVectorizer' or 'TfidfVectorizer' work in the following way:
Instantiate an object with specific parameters: v = CountVectorizer(...)
Fit this object to your corpus (= learn the vocabulary): method v.fit(...)
Transform any piece of text you have into a vector: method v.transform(...)
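A minimal pure-Python mimic of this fit/transform pattern (the class TinyCountVectorizer and its internals are illustrative, not sklearn's actual implementation, which lives in sklearn.feature_extraction.text):

```python
class TinyCountVectorizer:
    def __init__(self, lowercase=True):
        self.lowercase = lowercase
        self.vocabulary_ = {}  # word -> index, filled in by fit()

    def _tokens(self, text):
        return (text.lower() if self.lowercase else text).split()

    def fit(self, corpus):
        # Learn the vocabulary: one index per unique token, alphabetical.
        words = sorted({w for doc in corpus for w in self._tokens(doc)})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, texts):
        # One count vector of dimension V per text; unknown words are ignored.
        rows = []
        for text in texts:
            row = [0] * len(self.vocabulary_)
            for w in self._tokens(text):
                if w in self.vocabulary_:
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows
```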
Raw Count
Take a text and represent it as a vector:
Each text is represented by a vector with V dimensions;
Each dimension corresponds to 1 word of the Vocabulary;
The coefficient in dimension k is the number of times the word at index k in the Vocabulary appears in the represented text.
Example code
Do we consider 'And' differently than 'and'? Use lowercase=True to get around this problem:
lowercase=False gives 134 unique words; lowercase=True gives 127 unique words.
S is the sentence we are looking for; its BoW has shape (1, 127) (127 unique words), and bow contains the counts at the positions where the words of the sentence appear in the Vocabulary.
Use 'show_bow(count_small, bow[0])' to also see which words are where in this BoW vector, including the counts.
Search Engine
If we want to create a search engine, letting the user enter a text query, we can use:
query = input("Type your query: ")
query_bow = count.transform([query])
Then search through the corpus for the documents that are similar to the query.
Similarity: we use the cosine similarity of the BoW vectors of two texts to evaluate their similarity.
Example code
The similarity matrix has D rows (#documents in the corpus) and 1 column. The coefficient at row k is the cosine similarity between the document at index k in the corpus and the query.
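A sketch of this similarity computation in plain Python, reusing the BoW vectors from the sentence example earlier; the query "cat hat" is illustrative (sklearn's cosine_similarity does the same over whole matrices):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# BoW vectors of the two example sentences, plus a toy query.
vocab = ["and", "ate", "cat", "dog", "hat", "on", "sat", "the"]
docs_bow = [[0, 0, 1, 0, 1, 1, 1, 2], [1, 1, 1, 1, 1, 0, 0, 3]]
query_bow = [1 if w in ("cat", "hat") else 0 for w in vocab]

# One similarity per document: this is the column described above.
sims = [cosine(d, query_bow) for d in docs_bow]
```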
TF-IDF
Adjust the raw count to favor words that appear a lot in a few documents, as opposed to those that appear a lot in all documents.
Consider a word in a document in a corpus (all reviews):
Term Frequency (TF): number of times the word appears in the document
Document Frequency (DF): number of documents in which the word appears in the whole corpus
Inverse DF (IDF): inverse of DF
Then TF-IDF(term, document, corpus) = TF × IDF
High TF-IDF value: uncommon words; a word that appears in the document but not a lot overall in the corpus (so it's a specific word).
Low TF-IDF value: common words; a word that appears in the document but also in a lot of other documents in the corpus.
So, we measure the specificity of words against the other documents in the corpus.
For the TF, the IDF and the normalization, specific formulas are given to calculate the TF-IDF.
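A minimal sketch of the computation, using the plain TF × log(N/DF) variant on a toy corpus of pre-tokenized documents; note that sklearn's TfidfVectorizer uses a smoothed IDF and L2 normalization, so its numbers will differ:

```python
import math

# Toy corpus: three already-tokenized documents, purely for illustration.
corpus = [["the", "cat", "sat"], ["the", "dog", "ate", "the", "cat"], ["the", "hat"]]
N = len(corpus)  # number of documents in the corpus

def tf(term, doc):
    # Term Frequency: how often the term appears in this document.
    return doc.count(term)

def df(term):
    # Document Frequency: in how many documents the term appears.
    return sum(term in doc for doc in corpus)

def tf_idf(term, doc):
    # Plain variant: TF * log(N / DF). A corpus-wide word gets IDF log(1) = 0.
    return tf(term, doc) * math.log(N / df(term))
```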