Summary - Text Retrieval and Mining (6013B0801Y)

Pages
9
Uploaded on
09-04-2024
Written in
2023/2024

Text Retrieval and Mining summary based on Lectures of University of Amsterdam (UvA - Universiteit van Amsterdam) course 6013B0801Y of the study programme Business Analytics and programme Econometrics and Data Science, year 2023/2024. Covers Bag of Words, Cosine Similarity, TF-IDF, Text Processing, Text Mining, Part of Speech, Constituency Parsing, Named Entity Recognition (NER), Entity Linking, Topic Modeling, Latent Dirichlet Allocation (LDA), BERTopic, Word Embeddings, Word Co-Occurrence Matrix, Word Analogy, GloVe, Word2Vec, Neural Network, Language model, N-Gram Language model, Greedy Generation, RNN (Recurrent Neural Network), Encoder, Decoder, BERT, Masked Language model, Pre-Training, Fine-Tuning, Relevance Score, Recall@K, Precision@K.

Content preview

Lecture 1: Working with Words (Bag of Words)
Bag of Words
Corpus: all the documents
Token: unit of text; word, punctuation, etc.
Vocabulary/Dictionary: all unique words appearing in the corpus
V: size of the vocabulary, i.e. the number of unique words
Corpus Frequency: number of times the word appears across all documents
Term Frequency (in a document): number of times the word appears in one document
Document Frequency: number of documents in which the word appears
Tokenizer: program that takes text and splits it into smaller units

BoW: you can transform any text of any length into a vector of fixed size.
We figure out which words are used in the documents and count each one. Then, for each document we create a BoW vector, counting at position k how often the token at index k of the dictionary appears in that specific document. A BoW vector therefore contains a lot of zeros, because many words in the dictionary won't be used in any specific document.
So, shape BoW = (1, #words in Vocabulary)

Raw Count BoW:
- Each text is represented by a vector with V dimensions (a BoW with shape (1, #words in Vocabulary)),
- The coefficient at position k = the number of times the token at index k in the Vocabulary appears in the specific text/document (so integers).
- Many words of the Vocabulary are not mentioned in this particular document/review, so their count will be 0 and they will not appear in the token list.
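The raw-count construction above can be sketched in a few lines of plain Python (a minimal illustration; the `bag_of_words` helper and the toy corpus are not from the lecture):

```python
def bag_of_words(doc_tokens, vocabulary):
    """Map a token list to a raw-count vector of shape (1, V)."""
    index = {tok: k for k, tok in enumerate(vocabulary)}
    vec = [0] * len(vocabulary)
    for tok in doc_tokens:
        if tok in index:
            vec[index[tok]] += 1  # count occurrences at the token's index
    return vec

# Toy corpus of two already-tokenized documents
corpus = [["black", "hole", "physics"], ["hole", "in", "one"]]
vocab = sorted({t for doc in corpus for t in doc})  # all unique tokens
vectors = [bag_of_words(doc, vocab) for doc in corpus]
```

Each vector has one dimension per vocabulary word, and most entries are 0, as described above.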

Document: the smallest unit of text for your use case (paper, paragraph, recipe)
Use case: the typical question for which you are searching an answer
Query: the text you will use to search in your corpus
Example Use case 1: ‘Which academic papers are about black holes?’
Corpus: academic papers uploaded in ArXiv
Document: 1 paper
Query: ‘black hole’

Cosine Similarity
Searching through the corpus for the documents that are similar to the query.
For this we use the Cosine Similarity of the BoW vectors of two texts to evaluate their similarity.
Shape of the Similarity matrix = (#documents in corpus, 1)
The coefficient at row k = the cosine similarity between the document at index k and the query.
Cosine Similarity with Raw Count coefficients puts too much emphasis on the number of occurrences of a word within a document. That's why we look at TF-IDF.
If the Cosine Similarity of two text embeddings is 0.0, those texts have no words in common.
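A minimal sketch of cosine similarity between two BoW vectors (the function and the toy vectors are illustrative, not from the lecture):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query = [1, 1, 0]  # e.g. BoW of a short query
doc = [2, 1, 1]    # BoW of a document over the same vocabulary
sim = cosine_similarity(query, doc)
```

Ranking the corpus then amounts to computing this score for every document against the query and sorting.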

TF-IDF
Adjust the raw count to favor words that appear a lot in a few documents, as opposed to those that appear a lot in all documents. Each document is represented as a sparse embedding.
So, we calculate the specificity of words against other documents in the corpus.


TF-IDF(term, document, corpus) = TF × IDF

Term Frequency (TF): number of times the word appears in the document,
Document Frequency (DF): number of documents in which the word appears in the corpus,
Inverse DF (IDF): inverse of DF.

High TF-IDF: uncommon words;
the word appears in the document, but not a lot overall in the corpus.
Low TF-IDF: common words;
the word appears in the document, but also in a lot of other documents in the corpus.
The coefficient at dimension k = the TF-IDF score of the token at index k in the Vocabulary.
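A sketch of the TF × IDF computation. Note the log-scaled IDF used here is an assumption (a common variant); the notes only state that IDF is the inverse of DF:

```python
import math

def tf_idf(term, document, corpus):
    """TF-IDF sketch. The log in the IDF is an assumed common variant;
    the lecture only defines IDF as the inverse of DF."""
    tf = document.count(term)                     # Term Frequency
    df = sum(1 for doc in corpus if term in doc)  # Document Frequency
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [["black", "hole", "physics"],
          ["black", "cat"],
          ["hole", "in", "the", "ground"]]
# "physics" is rare (df=1) -> higher score than "black" (df=2)
```

As described above, a term appearing in many documents gets a low IDF and hence a low score, while a rare term scores high in the documents that do contain it.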


Text Processing
Stopping: removing stop words,
Filter by Token Pattern: tokens made of letters or numbers, tokens of at least n characters,
Filter by Frequency: retain only the top N tokens, based on the number of times they appear in the corpus,
Stemming: remove plurals, conjugations; collapse all different forms of a word into one token,
Lemmatizing: like stemming, but the result must be a real word.
N-Grams: collecting groups of N consecutive tokens in text, to see which combinations repeat.
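The N-gram step above can be sketched in one line of Python (the `ngrams` helper is illustrative, not from the lecture):

```python
def ngrams(tokens, n):
    """Collect every group of n consecutive tokens from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["new", "york", "city"], 2)
# bigrams like ('new', 'york') capture combinations that single tokens miss
```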

Filter by Document Frequency: number of documents in which a token appears.
A word appears in nearly all documents: it doesn't actively help to distinguish between documents.
A word appears in only 1 or 2 documents: same; it's likely a typo or artefact.
Code: min_df=3 keeps in the Vocabulary only words that appear in at least 3 documents;
max_df=0.9 keeps only words that appear in at most 90% of the documents.
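The document-frequency filtering can be sketched without any library (the `filter_vocabulary` helper is hypothetical, mirroring the min_df/max_df parameters of a typical vectorizer):

```python
def filter_vocabulary(corpus, min_df, max_df):
    """Keep tokens whose document frequency is between the thresholds:
    min_df is an absolute document count, max_df a fraction of the corpus."""
    n_docs = len(corpus)
    df = {}
    for doc in corpus:
        for tok in set(doc):  # count each token at most once per document
            df[tok] = df.get(tok, 0) + 1
    return sorted(tok for tok, d in df.items()
                  if d >= min_df and d / n_docs <= max_df)

corpus = [["the", "hole"], ["the", "black", "hole"],
          ["the", "hole"], ["the", "cat"]]
vocab = filter_vocabulary(corpus, min_df=3, max_df=0.9)
# "the" appears in 100% of documents; "black"/"cat" in only one: all dropped
```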



Logistic Regression
- Logistic Regression: there is 1 weight per input feature (here, 1 weight per topic) and per target class (1 weight per gender).
A positive weight: presence of a topic in the description is correlated with users being of the target gender.
Strong weight (+5.1 compared to others): users enjoying Cinema seem to have a high chance of belonging to gender 1.
Large negative weight (-4.9): liking Cinema is a strong indication of not belonging to gender 3.
- The fact that we can predict someone's gender so well from the topics tackled in their hobbies and tastes indicates a strong link between one's behaviour and their gender.
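The one-weight-per-feature-per-class idea can be illustrated with a hypothetical weight matrix (all numbers are invented for illustration; in practice these would be the coefficients of a fitted model):

```python
# Rows = gender classes, columns = topic features (hypothetical values).
topics = ["Cinema", "Sports", "Music"]
weights = [
    [5.1, 0.2, -0.3],   # gender 1: strong positive weight on Cinema
    [0.1, 1.0, 0.4],    # gender 2
    [-4.9, 0.3, 0.8],   # gender 3: strong negative weight on Cinema
]

features = [1, 0, 1]    # this user's description mentions Cinema and Music

# Score per class = weighted sum of the features; argmax gives the prediction.
scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
predicted_class = scores.index(max(scores))  # the Cinema weight dominates
```

The +5.1 weight pulls the Cinema-liking user strongly toward gender 1, and the -4.9 weight pushes them away from gender 3, exactly the reading of the weights described above.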
