Summary - Text Retrieval and Mining (6013B0801Y)

Pages: 9
Uploaded on: 09-04-2024
Written in: 2023/2024

Text Retrieval and Mining summary based on the lectures of the University of Amsterdam (UvA - Universiteit van Amsterdam) course 6013B0801Y of the study programmes Business Analytics and Econometrics and Data Science, year 2023/2024. Covers Bag of Words, Cosine Similarity, TF-IDF, Text Processing, Text Mining, Part of Speech, Constituency Parsing, Named Entity Recognition (NER), Entity Linking, Topic Modeling, Latent Dirichlet Allocation (LDA), BERTopic, Word Embeddings, Word Co-Occurrence Matrix, Word Analogy, GloVe, Word2Vec, Neural Networks, Language Models, N-Gram Language Models, Greedy Generation, RNN (Recurrent Neural Network), Encoder, Decoder, BERT, Masked Language Models, Pre-Training, Fine-Tuning, Relevance Score, Recall@K, Precision@K.


Content preview

Lecture 1: Working with Words (Bag of Words)
Bag of Words
Corpus: all the documents
Token: unit of text; word, punctuation, etc.
Vocabulary/Dictionary: all unique words appearing in the corpus
V: size of the vocabulary, i.e. the number of words
Corpus Frequency: number of times the word appears across all documents
Term Frequency (in a document): number of times the word appears in one document
Document Frequency: number of documents the word appears in
Tokenizer: program that takes text and splits it into smaller units

BoW: You can transform any text of any length into a vector of fixed size.
We figure out which words are used in the documents and count each one. Then, for each
document we create a BoW vector: the coefficient at position k counts how often the token at
index k in the dictionary appears in that specific document. A BoW vector therefore has a lot of
zeros, because many words in the dictionary won't be used in any given document.
So, shape BoW = (1, #words in Vocabulary)

Raw Count, BoW:
- Each text is represented by a vector with V dimensions (BoW with shape (1, #words in
Vocabulary)),
- The coefficient at position k = the number of times the token at index k in the Vocabulary
appears in the specific text/document (so integers).
- Many words of the Vocabulary are not mentioned in this particular document/review, so
their count will be 0 and they will not appear in the token list.
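As a minimal sketch of the raw-count BoW above (the toy corpus and whitespace tokenizer are assumptions, not from the notes):

```python
from collections import Counter

# Hypothetical toy corpus; each document is one string.
corpus = ["the black hole", "the white dwarf", "black hole physics"]

# Tokenize by whitespace (a real tokenizer would also handle punctuation).
tokenized = [doc.split() for doc in corpus]

# Vocabulary: all unique tokens in the corpus, in a fixed order.
vocab = sorted({tok for doc in tokenized for tok in doc})

def bow_vector(tokens, vocab):
    """Raw-count BoW: coefficient at position k = count of vocab[k] in this document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

# One (1, V) vector per document; most entries are 0 (sparse).
vectors = [bow_vector(doc, vocab) for doc in tokenized]
```

Note that every vector has the same length V regardless of how long the original text was, which is exactly what makes the documents comparable.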

Document: the smallest unit of text of your use case (paper, paragraph, recipe)
Use case: the typical question for which you are searching an answer
Query: the text you will use to search in your corpus
Example Use case 1: 'Which academic papers are about black holes?'
Corpus: academic papers uploaded to ArXiv
Document: 1 paper
Query: 'black hole'

Cosine Similarity
Searching through the corpus for the documents that are similar to the query.
For this we use the Cosine Similarity of the BoW vectors of two texts to evaluate their similarity.
Shape Similarity matrix = (#documents in corpus, 1)
The coefficient at row k = the cosine similarity between the document at index k and the
query.
Cosine Similarity with Raw Count coefficients puts too much emphasis on the number of occurrences
of a word within a document. That's why we look at TF-IDF.
If the Cosine Similarity of 2 text embeddings is 0.0, those texts had no words in common.
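A small sketch of this ranking step (the vectors are hypothetical BoW vectors over a shared vocabulary, not from the notes):

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|); 0.0 means no shared nonzero dimensions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

query = [1, 1, 0, 0]
doc_a = [2, 1, 0, 0]   # shares words with the query
doc_b = [0, 0, 3, 1]   # no words in common with the query -> similarity 0.0

# Scoring every document against one query yields the
# (#documents in corpus, 1) similarity column described above.
scores = [cosine_similarity(doc, query) for doc in (doc_a, doc_b)]
```

Because cosine similarity only looks at the angle between vectors, a document that repeats the same words many times is not automatically favored over a shorter one with the same word mix.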

TF-IDF
Adjust the raw count to favor words that appear a lot in a few documents, as opposed to those that
appear a lot in all documents. Each document is represented as a sparse embedding.
So, we calculate the specificity of words against other documents in the corpus.

TF-IDF(term, document, corpus) = TF * IDF

Term Frequency (TF): number of times the word appears in the document,
Document Frequency (DF): number of documents in which the word appears in the corpus,
Inverse DF (IDF): inverse of DF, commonly computed as log(N / DF) with N the number of documents in the corpus.

High TF-IDF: uncommon words,
the word appears in the document, but not a lot overall in the corpus.
Low TF-IDF: common words,
the word appears in the document, but also in a lot of other documents in the corpus.
The coefficient at dimension k = the TF-IDF score of the token at index k in the Vocabulary.
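Putting TF, DF and IDF together, a minimal sketch (toy corpus assumed; IDF here is the plain log(N/DF) variant, while libraries often use smoothed variants such as log((1+N)/(1+DF)) + 1):

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized.
docs = [["black", "hole", "black"], ["white", "dwarf"], ["black", "star"]]
vocab = sorted({t for d in docs for t in d})
N = len(docs)

# Document Frequency: in how many documents each term appears.
df = {t: sum(1 for d in docs if t in d) for t in vocab}

def tfidf_vector(doc):
    """TF-IDF(term, doc, corpus) = TF * IDF, with IDF = log(N / DF)."""
    tf = Counter(doc)
    return [tf[t] * math.log(N / df[t]) for t in vocab]

vec = tfidf_vector(docs[0])
# "hole" (rare in the corpus) ends up weighted higher than "black",
# even though "black" has the larger raw count in this document.
```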


Text Processing
Stopping: removing stop words,
Filter by Token Pattern: tokens made of letters or numbers, tokens of at least n characters,
Filter by Frequency: retain only the top N tokens, based on the number of times they appear in the corpus,
Stemming: remove plurals, conjugation; collapse all different forms of a word into one token,
Lemmatizing: as stemming, but the result must be a real word.
N-Grams: collecting groups of N consecutive tokens in the text, to see which combinations repeat.

Filter by Document Frequency: number of documents in which a token appears.
A word appears in nearly all documents: it doesn't actively help to differentiate
between documents.
A word appears in only 1 or 2 documents: same; it's likely a typo or artefact.
Code: min_df = 3 — only words that appear in at least 3 docs will be in the Vocabulary.
max_df = 0.9 — only words that appear in at most 90% of the documents will be in the Vocabulary.
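A pure-Python sketch of this document-frequency filter, mimicking the min_df / max_df parameters of scikit-learn's CountVectorizer (the toy documents are assumptions):

```python
# Toy tokenized documents: "the" is in every document, "typo" and
# "star" are each in only one.
docs = [
    ["the", "hole"],
    ["the", "hole"],
    ["the", "hole", "typo"],
    ["the", "star"],
]

def filter_vocab(docs, min_df=3, max_df=0.9):
    """Keep terms with document frequency >= min_df and DF/N <= max_df."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):          # count each term once per document
            df[term] = df.get(term, 0) + 1
    return sorted(t for t, c in df.items() if c >= min_df and c / n <= max_df)

vocab = filter_vocab(docs)  # "the" is too common; "typo" and "star" too rare
```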



Logistic Regression
- Logistic Regression: there is 1 weight per input feature (here 1 weight per topic) and per
target class (1 weight per gender).
A positive weight: presence of a topic in the description is correlated with users being of the
target gender.
A strong weight (+5.1 compared to others): users enjoying Cinema seem to have a high
chance of belonging to gender 1,
A large negative weight (-4.9): liking Cinema is a strong indication of not belonging to
gender 3.
- The fact that we can predict someone's gender so well from the topics tackled in their
hobbies and tastes indicates a strong link between one's behaviour and their gender.
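A sketch of how such a fitted weight matrix (one weight per topic per gender) scores a user; the topic names and weight values below are hypothetical, chosen to echo the +5.1 / -4.9 Cinema example:

```python
import math

topics = ["Cinema", "Sports", "Travel"]
classes = ["gender 1", "gender 2", "gender 3"]

# weights[c][k]: weight of topic k for class c  (shape: #classes x #topics)
weights = [
    [ 5.1, 0.2, -0.3],   # gender 1: Cinema strongly positive
    [ 0.1, 1.0,  0.4],   # gender 2
    [-4.9, 0.3,  0.2],   # gender 3: Cinema strongly negative
]

def predict(x):
    """Softmax over per-class scores; x is the user's topic-presence vector."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = predict([1, 0, 0])  # a user whose description mentions Cinema
```

With these weights, the Cinema-only user gets by far the highest probability for gender 1 and a near-zero probability for gender 3, which is exactly the interpretation of the strong positive and negative weights above.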