Summary

Summary - Text Retrieval and Mining (6013B0801Y)

Pages
9
Uploaded on
09-04-2024
Written in
2023/2024

Text Retrieval and Mining summary based on the lectures of the University of Amsterdam (UvA - Universiteit van Amsterdam) course 6013B0801Y of the study programmes Business Analytics and Econometrics and Data Science, year 2023/2024. Covers Bag of Words, Cosine Similarity, TF-IDF, Text Processing, Text Mining, Part of Speech, Constituency Parsing, Named Entity Recognition (NER), Entity Linking, Topic Modeling, Latent Dirichlet Allocation (LDA), BERTopic, Word Embeddings, Word Co-Occurrence Matrix, Word Analogy, GloVe, Word2Vec, Neural Networks, Language Models, N-Gram Language Models, Greedy Generation, RNN (Recurrent Neural Network), Encoder, Decoder, BERT, Masked Language Models, Pre-Training, Fine-Tuning, Relevance Score, Recall@K, Precision@K.


Preview of the content

Lecture 1: Working with Words (Bag of Words)
Bag of Words
Corpus: all the documents
Token: unit of text; word, punctuation, etc.
Vocabulary/Dictionary: all unique words appearing in the corpus
V: size of the vocabulary, the number of words
Corpus Frequency: number of times the word appears in all documents
Term Frequency (in a document): number of times the word appears in one document
Document Frequency: number of documents the word appears in
Tokenizer: program that takes text and splits it into smaller units

BoW: You can transform any text of any length into a vector of fixed size.
We figure out which words are used in the documents and count each one. Then, for the different
documents we create different BoW vectors, adding 1 at position k each time the token at position k in
the dictionary appears in the specific document. So, a BoW vector has a lot of zeros, because many
words in the dictionary won't be used in the specific document.
So, shape BoW = (1, #words in Vocabulary)

Raw Count, BoW:
- Each text is represented by a vector with V dimensions (BoW with shape (1, #words in
Vocabulary)),
- The coefficient at position k = the number of times the token at index k in the Vocabulary
appears in the specific text/document (so integers),
- Many words of the Vocabulary are not mentioned in this particular document/review, so
their count will be 0 and they will not appear in this token list.
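The raw-count construction above can be sketched in a few lines of Python; the toy corpus and whitespace tokenizer are illustrative assumptions, not the course's code.

```python
# Minimal sketch of a raw-count Bag of Words over a tiny toy corpus.
from collections import Counter

corpus = [
    "black holes bend light",
    "light travels fast",
]

# Tokenizer: split on whitespace (real tokenizers also handle punctuation).
tokenized = [doc.split() for doc in corpus]

# Vocabulary: all unique tokens in the corpus, in a fixed (sorted) order.
vocabulary = sorted({tok for doc in tokenized for tok in doc})

def bow_vector(tokens, vocabulary):
    """Raw-count vector of shape (1, V): entry k is how many times the
    token at index k of the vocabulary appears in this document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [bow_vector(doc, vocabulary) for doc in tokenized]
```

Most entries of each vector are 0, since most vocabulary words do not occur in any single document.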

Document: the smallest unit of text of your use case (paper, paragraph, recipe)
Use case: the typical question for which you are searching for an answer
Query: the text you will use to search in your corpus
Example Use case 1: 'Which academic papers are about black holes?'
Corpus: academic papers uploaded to ArXiv
Document: 1 paper
Query: 'black hole'

Cosine Similarity
Searching through the corpus for the documents that are similar to the query.
For this we use the Cosine Similarity of the BoW vectors of two texts to evaluate their similarity.
Shape of the Similarity matrix = (#documents in corpus, 1)
The coefficient at row k = the cosine similarity between the document at index k and the
query.
Cosine Similarity with Raw Count coefficients puts too much emphasis on the number of occurrences
of a word within a document. That's why we look at TF-IDF.
If the Cosine Similarity of 2 text embeddings is 0.0, those texts have no words in common.
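A minimal sketch of this similarity computation, assuming toy raw-count BoW vectors over a shared vocabulary (the vectors are made up for illustration):

```python
# Cosine similarity between two BoW vectors: dot product divided by
# the product of the vector norms.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = [1, 0, 0]        # toy BoW vectors over a 3-word vocabulary
doc = [1, 2, 0]          # shares one word with the query
unrelated = [0, 0, 3]    # no words in common with the query
```

With these vectors, `cosine_similarity(doc, query)` is positive while `cosine_similarity(unrelated, query)` is exactly 0.0, matching the "no words in common" case above.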

TF-IDF
Adjust the raw count to favor words that appear a lot in a few documents, as opposed to those that
appear a lot in all documents. Each document is represented as a sparse embedding.
So, we calculate the specificity of words against the other documents in the corpus.


TF-IDF(term, document, corpus) = TF × IDF

Term Frequency (TF): number of times the word appears in the document,
Document Frequency (DF): number of documents in which the word appears in the corpus,
Inverse DF (IDF): inverse of DF.

High TF-IDF: uncommon words,
the word appears in the document, but not a lot overall in the corpus.
Low TF-IDF: common words,
the word appears in the document, but also in a lot of other documents in the corpus.
The coefficient at dimension k = the TF-IDF score of the token at index k in the Vocabulary.
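A sketch of the formula above, using the plain variant IDF = log(N / DF); libraries such as scikit-learn use slightly different smoothing, and the toy corpus is an illustrative assumption.

```python
# TF-IDF = TF * IDF, with IDF = log(N / DF): high for words concentrated
# in few documents, low for words spread across the whole corpus.
import math

def tf_idf(term, document, corpus):
    tf = document.count(term)                     # term frequency in this document
    df = sum(1 for doc in corpus if term in doc)  # document frequency in the corpus
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)              # inverse document frequency
    return tf * idf

corpus = [
    ["black", "hole", "physics"],
    ["hole", "in", "the", "wall"],
    ["physics", "of", "light"],
]
```

Here "black" (in 1 of 3 documents) scores higher in the first document than "hole" (in 2 of 3), showing how TF-IDF favors words specific to few documents.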


Text Processing
Stopping: removing stop words,
Filter by Token Pattern: tokens made of letters or numbers, tokens of at least n characters,
Filter by Frequency: retain only the top N tokens, based on the number of times they appear in the corpus,
Stemming: remove plurals, conjugation; merge all different forms of a word into one token,
Lemmatizing: as stemming, but the result must be a real word,
N-Grams: collecting groups of N consecutive tokens in the text, to see which combinations repeat.

Filter by Document Frequency: the number of documents in which a token appears.
A word appears in nearly all documents: it doesn't actively participate in making a difference
between documents.
A word appears in only 1 or 2 documents: same; it's likely a typo or artefact.
Code: min_df = 3: only words that appear in at least 3 documents are kept in the Vocabulary.
max_df = 0.9: only words that appear in at most 90% of the documents are kept in the Vocabulary.
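This document-frequency filter can be sketched in plain Python; the function name and toy documents are illustrative, mirroring the behaviour of the `min_df`/`max_df` parameters of scikit-learn's CountVectorizer.

```python
# Keep only tokens whose document frequency is at least min_df (absolute
# count) and at most max_df (fraction of the corpus).
def filter_vocabulary(tokenized_docs, min_df=1, max_df=1.0):
    n_docs = len(tokenized_docs)
    df = {}
    for doc in tokenized_docs:
        for token in set(doc):          # count each token once per document
            df[token] = df.get(token, 0) + 1
    return sorted(
        token for token, count in df.items()
        if count >= min_df and count / n_docs <= max_df
    )

docs = [["a", "b"], ["a", "c"], ["a", "b", "d"]]
```

With `min_df=2`, the rare tokens "c" and "d" are dropped; adding `max_df=0.9` also drops "a", which appears in every document.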



Logistic Regression
- Logistic Regression: there is 1 weight per input feature (here 1 weight per topic) and per
target class (1 weight per gender).
A positive weight: the presence of a topic in the description is correlated with users being of the
target gender.
Strong weight (+5.1 compared to others): it seems that users enjoying Cinema have a high
chance of belonging to gender 1,
Large negative weight (-4.9): liking the cinema is a strong indication of not belonging to
gender 3.
- The fact that we can predict someone's gender so well from the topics tackled in their
hobbies and tastes indicates a strong link between one's behaviour and their gender.
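The "one weight per feature per class" setup can be sketched as a multinomial logistic regression with a fixed weight matrix; the weights below are made up to mirror the +5.1/-4.9 example in the notes, not fitted values.

```python
# Multinomial logistic regression: a weight matrix of shape
# (n_classes, n_features), scores = W @ x, probabilities via softmax.
import math

def softmax(scores):
    m = max(scores)                         # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Rows: gender classes 1..3; columns: topics [Cinema, Sports] (illustrative).
weights = [
    [5.1, 0.2],    # gender 1: Cinema strongly positive
    [0.3, 1.0],    # gender 2
    [-4.9, 0.5],   # gender 3: Cinema strongly negative
]

features = [1.0, 0.0]   # a user whose description mentions Cinema only
scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
probs = softmax(scores)
```

For this user the strong +5.1 Cinema weight makes gender 1 the most probable class, and the -4.9 weight makes gender 3 the least probable.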
