Summary

Summary - Text Retrieval and Mining (6013B0801Y)

Pages
9
Uploaded on
09-04-2024
Written in
2023/2024

Text Retrieval and Mining summary based on the lectures of the University of Amsterdam (UvA - Universiteit van Amsterdam) course 6013B0801Y of the study programmes Business Analytics and Econometrics and Data Science, year 2023/2024. Covers Bag of Words, Cosine Similarity, TF-IDF, Text Processing, Text Mining, Part of Speech, Constituency Parsing, Named Entity Recognition (NER), Entity Linking, Topic Modeling, Latent Dirichlet Allocation (LDA), BERTopic, Word Embeddings, Word Co-Occurrence Matrix, Word Analogy, GloVe, Word2Vec, Neural Networks, Language Models, N-Gram Language Models, Greedy Generation, RNN (Recurrent Neural Network), Encoder, Decoder, BERT, Masked Language Models, Pre-Training, Fine-Tuning, Relevance Score, Recall@K, Precision@K.


Preview of the content

Lecture 1: Working with Words (Bag of Words)
Bag of Words
Corpus: all the documents
Token: unit of text; word, punctuation, etc.
Vocabulary/Dictionary: all unique words appearing in the corpus
V: size of the vocabulary, the number of words
Corpus Frequency: number of times the word appears in all documents
Term Frequency (in a document): number of times the word appears in one document
Document Frequency: number of documents the word appears in
Tokenizer: program that takes text and splits it into smaller units

BoW: You can transform any text of any length into a vector of fixed size.
We figure out which words are used in the documents and count each one. Then, for the different
documents we create different BoW vectors, adding 1 at position k each time the token at position k in
the dictionary appears in the specific document. So, a BoW vector has a lot of zeros, because many
words in the dictionary won't be used in the specific document.
So, shape BoW = (1, #words in Vocabulary)

Raw Count, BoW:
- Each text is represented by a vector with V dimensions (BoW with shape (1, #words in
Vocabulary)),
- The coefficient at position k = the number of times the token at index k in the Vocabulary
appears in the specific text/document (so integers),
- Many words of the Vocabulary are not mentioned in this particular document/review, so
their count will be 0 and they will not appear in this token list.
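The raw-count construction above can be sketched in a few lines of Python; the toy corpus and whitespace tokenizer are illustrative assumptions, not the course's code.

```python
# Minimal sketch of a raw-count Bag of Words over a tiny toy corpus.
from collections import Counter

corpus = [
    "black holes bend light",
    "light travels fast",
]

# Tokenizer: split on whitespace (real tokenizers also handle punctuation).
tokenized = [doc.split() for doc in corpus]

# Vocabulary: all unique tokens in the corpus, in a fixed (sorted) order.
vocabulary = sorted({tok for doc in tokenized for tok in doc})

def bow_vector(tokens, vocabulary):
    """Raw-count vector of shape (1, V): entry k is how many times the
    token at index k of the vocabulary appears in this document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [bow_vector(doc, vocabulary) for doc in tokenized]
```

Most entries of each vector are 0, since most vocabulary words do not occur in any single document.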

Document: the smallest unit of text of your use case (paper, paragraph, recipe)
Use case: the typical question for which you are searching for an answer
Query: the text you will use to search in your corpus
Example Use case 1: 'Which academic papers are about black holes?'
Corpus: academic papers uploaded to ArXiv
Document: 1 paper
Query: 'black hole'

Cosine Similarity
Searching through the corpus for the documents that are similar to the query.
For this we use the Cosine Similarity of the BoW vectors of two texts to evaluate their similarity.
Shape of the Similarity matrix = (#documents in corpus, 1)
The coefficient at row k = the cosine similarity between the document at index k and the
query.
Cosine Similarity with Raw Count coefficients puts too much emphasis on the number of occurrences
of a word within a document. That's why we look at TF-IDF.
If the Cosine Similarity of 2 text embeddings is 0.0, those texts have no words in common.
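A minimal sketch of this similarity computation, assuming toy raw-count BoW vectors over a shared vocabulary (the vectors are made up for illustration):

```python
# Cosine similarity between two BoW vectors: dot product divided by
# the product of the vector norms.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

query = [1, 0, 0]        # toy BoW vectors over a 3-word vocabulary
doc = [1, 2, 0]          # shares one word with the query
unrelated = [0, 0, 3]    # no words in common with the query
```

With these vectors, `cosine_similarity(doc, query)` is positive while `cosine_similarity(unrelated, query)` is exactly 0.0, matching the "no words in common" case above.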

TF-IDF
Adjust the raw count to favor words that appear a lot in a few documents, as opposed to those that
appear a lot in all documents. Each document is represented as a sparse embedding.
So, we calculate the specificity of words against the other documents in the corpus.


TF-IDF(term, document, corpus) = TF × IDF

Term Frequency (TF): number of times the word appears in the document,
Document Frequency (DF): number of documents in which the word appears in the corpus,
Inverse DF (IDF): inverse of DF.

High TF-IDF: uncommon words,
the word appears in the document, but not a lot overall in the corpus.
Low TF-IDF: common words,
the word appears in the document, but also in a lot of other documents in the corpus.
The coefficient at dimension k = the TF-IDF score of the token at index k in the Vocabulary.
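A sketch of the formula above, using the plain variant IDF = log(N / DF); libraries such as scikit-learn use slightly different smoothing, and the toy corpus is an illustrative assumption.

```python
# TF-IDF = TF * IDF, with IDF = log(N / DF): high for words concentrated
# in few documents, low for words spread across the whole corpus.
import math

def tf_idf(term, document, corpus):
    tf = document.count(term)                     # term frequency in this document
    df = sum(1 for doc in corpus if term in doc)  # document frequency in the corpus
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)              # inverse document frequency
    return tf * idf

corpus = [
    ["black", "hole", "physics"],
    ["hole", "in", "the", "wall"],
    ["physics", "of", "light"],
]
```

Here "black" (in 1 of 3 documents) scores higher in the first document than "hole" (in 2 of 3), showing how TF-IDF favors words specific to few documents.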


Text Processing
Stopping: removing stop words,
Filter by Token Pattern: tokens made of letters or numbers, tokens of at least n characters,
Filter by Frequency: retain only the top N tokens, based on the number of times they appear in the corpus,
Stemming: remove plurals, conjugation; merge all different forms of a word into one token,
Lemmatizing: as stemming, but the result must be a real word,
N-Grams: collecting groups of N consecutive tokens in the text, to see which combinations repeat.

Filter by Document Frequency: the number of documents in which a token appears.
A word appears in nearly all documents: it doesn't actively participate in making a difference
between documents.
A word appears in only 1 or 2 documents: same; it's likely a typo or artefact.
Code: min_df = 3: only words that appear in at least 3 documents are kept in the Vocabulary.
max_df = 0.9: only words that appear in at most 90% of the documents are kept in the Vocabulary.
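This document-frequency filter can be sketched in plain Python; the function name and toy documents are illustrative, mirroring the behaviour of the `min_df`/`max_df` parameters of scikit-learn's CountVectorizer.

```python
# Keep only tokens whose document frequency is at least min_df (absolute
# count) and at most max_df (fraction of the corpus).
def filter_vocabulary(tokenized_docs, min_df=1, max_df=1.0):
    n_docs = len(tokenized_docs)
    df = {}
    for doc in tokenized_docs:
        for token in set(doc):          # count each token once per document
            df[token] = df.get(token, 0) + 1
    return sorted(
        token for token, count in df.items()
        if count >= min_df and count / n_docs <= max_df
    )

docs = [["a", "b"], ["a", "c"], ["a", "b", "d"]]
```

With `min_df=2`, the rare tokens "c" and "d" are dropped; adding `max_df=0.9` also drops "a", which appears in every document.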



Logistic Regression
- Logistic Regression: there is 1 weight per input feature (here 1 weight per topic) and per
target class (1 weight per gender).
A positive weight: the presence of a topic in the description is correlated with users being of the
target gender.
Strong weight (+5.1 compared to others): it seems that users enjoying Cinema have a high
chance of belonging to gender 1,
Large negative weight (-4.9): liking the cinema is a strong indication of not belonging to
gender 3.
- The fact that we can predict someone's gender so well from the topics tackled in their
hobbies and tastes indicates a strong link between one's behaviour and their gender.
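The "one weight per feature per class" setup can be sketched as a multinomial logistic regression with a fixed weight matrix; the weights below are made up to mirror the +5.1/-4.9 example in the notes, not fitted values.

```python
# Multinomial logistic regression: a weight matrix of shape
# (n_classes, n_features), scores = W @ x, probabilities via softmax.
import math

def softmax(scores):
    m = max(scores)                         # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Rows: gender classes 1..3; columns: topics [Cinema, Sports] (illustrative).
weights = [
    [5.1, 0.2],    # gender 1: Cinema strongly positive
    [0.3, 1.0],    # gender 2
    [-4.9, 0.5],   # gender 3: Cinema strongly negative
]

features = [1.0, 0.0]   # a user whose description mentions Cinema only
scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
probs = softmax(scores)
```

For this user the strong +5.1 Cinema weight makes gender 1 the most probable class, and the -4.9 weight makes gender 3 the least probable.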
