Summary - Text Retrieval and Mining (6013B0801Y)

Pages
9
Uploaded on
09-04-2024
Written in
2023/2024

Text Retrieval and Mining summary based on Lectures of University of Amsterdam (UvA - Universiteit van Amsterdam) course 6013B0801Y of the study programme Business Analytics and programme Econometrics and Data Science, year 2023/2024. Covers Bag of Words, Cosine Similarity, TF-IDF, Text Processing, Text Mining, Part of Speech, Constituency Parsing, Named Entity Recognition (NER), Entity Linking, Topic Modeling, Latent Dirichlet Allocation (LDA), BERTopic, Word Embeddings, Word Co-Occurrence Matrix, Word Analogy, GloVe, Word2Vec, Neural Network, Language model, N-Gram Language model, Greedy Generation, RNN (Recurrent Neural Network), Encoder, Decoder, BERT, Masked Language model, Pre-Training, Fine-Tuning, Relevance Score, Recall@K, Precision@K.

Content preview

Lecture 1: Working with Words (Bag of Words)
Bag of Words
Corpus: all the documents
Token: unit of text; word, punctuation, etc.
Vocabulary/Dictionary: all unique words appearing in the corpus
V: size of the vocabulary, i.e. the number of unique words
Corpus Frequency: number of times the word appears across all documents
Term Frequency (in a document): number of times the word appears in one document
Document Frequency: number of documents in which the word appears
Tokenizer: program that takes text and splits it into smaller units

BoW: you can transform any text of any length into a vector of fixed size.
We figure out which words are used in the documents and count each one. Then, for each document we create a BoW vector, counting at position k how often the token at index k of the dictionary appears in that specific document. A BoW vector therefore contains a lot of zeros, because many words in the dictionary won't be used in any specific document.
So, shape BoW = (1, #words in Vocabulary)

Raw Count BoW:
- Each text is represented by a vector with V dimensions (a BoW with shape (1, #words in Vocabulary)),
- The coefficient at position k = the number of times the token at index k in the Vocabulary appears in the specific text/document (so integers).
- Many words of the Vocabulary are not mentioned in this particular document/review, so their count will be 0 and they will not appear in the token list.
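The raw-count construction above can be sketched in a few lines of plain Python (a minimal illustration; the `bag_of_words` helper and the toy corpus are not from the lecture):

```python
def bag_of_words(doc_tokens, vocabulary):
    """Map a token list to a raw-count vector of shape (1, V)."""
    index = {tok: k for k, tok in enumerate(vocabulary)}
    vec = [0] * len(vocabulary)
    for tok in doc_tokens:
        if tok in index:
            vec[index[tok]] += 1  # count occurrences at the token's index
    return vec

# Toy corpus of two already-tokenized documents
corpus = [["black", "hole", "physics"], ["hole", "in", "one"]]
vocab = sorted({t for doc in corpus for t in doc})  # all unique tokens
vectors = [bag_of_words(doc, vocab) for doc in corpus]
```

Each vector has one dimension per vocabulary word, and most entries are 0, as described above.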

Document: the smallest unit of text for your use case (paper, paragraph, recipe)
Use case: the typical question for which you are searching an answer
Query: the text you will use to search in your corpus
Example Use case 1: ‘Which academic papers are about black holes?’
Corpus: academic papers uploaded in ArXiv
Document: 1 paper
Query: ‘black hole’

Cosine Similarity
Searching through the corpus for the documents that are similar to the query.
For this we use the Cosine Similarity of the BoW vectors of two texts to evaluate their similarity.
Shape of the Similarity matrix = (#documents in corpus, 1)
The coefficient at row k = the cosine similarity between the document at index k and the query.
Cosine Similarity with Raw Count coefficients puts too much emphasis on the number of occurrences of a word within a document. That's why we look at TF-IDF.
If the Cosine Similarity of two text embeddings is 0.0, those texts have no words in common.
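A minimal sketch of cosine similarity between two BoW vectors (the function and the toy vectors are illustrative, not from the lecture):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query = [1, 1, 0]  # e.g. BoW of a short query
doc = [2, 1, 1]    # BoW of a document over the same vocabulary
sim = cosine_similarity(query, doc)
```

Ranking the corpus then amounts to computing this score for every document against the query and sorting.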

TF-IDF
Adjust the raw count to favor words that appear a lot in a few documents, as opposed to those that appear a lot in all documents. Each document is represented as a sparse embedding.
So, we calculate the specificity of words against other documents in the corpus.


TF-IDF(term, document, corpus) = TF × IDF

Term Frequency (TF): number of times the word appears in the document,
Document Frequency (DF): number of documents in which the word appears in the corpus,
Inverse DF (IDF): inverse of DF.

High TF-IDF: uncommon words;
the word appears in the document, but not a lot overall in the corpus.
Low TF-IDF: common words;
the word appears in the document, but also in a lot of other documents in the corpus.
The coefficient at dimension k = the TF-IDF score of the token at index k in the Vocabulary.
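A sketch of the TF × IDF computation. Note the log-scaled IDF used here is an assumption (a common variant); the notes only state that IDF is the inverse of DF:

```python
import math

def tf_idf(term, document, corpus):
    """TF-IDF sketch. The log in the IDF is an assumed common variant;
    the lecture only defines IDF as the inverse of DF."""
    tf = document.count(term)                     # Term Frequency
    df = sum(1 for doc in corpus if term in doc)  # Document Frequency
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [["black", "hole", "physics"],
          ["black", "cat"],
          ["hole", "in", "the", "ground"]]
# "physics" is rare (df=1) -> higher score than "black" (df=2)
```

As described above, a term appearing in many documents gets a low IDF and hence a low score, while a rare term scores high in the documents that do contain it.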


Text Processing
Stopping: removing stop words,
Filter by Token Pattern: tokens made of letters or numbers, tokens of at least n characters,
Filter by Frequency: retain only the top N tokens, based on the number of times they appear in the corpus,
Stemming: remove plurals, conjugations; collapse all different forms of a word into one token,
Lemmatizing: like stemming, but the result must be a real word.
N-Grams: collecting groups of N consecutive tokens in text, to see which combinations repeat.
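The N-gram step above can be sketched in one line of Python (the `ngrams` helper is illustrative, not from the lecture):

```python
def ngrams(tokens, n):
    """Collect every group of n consecutive tokens from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["new", "york", "city"], 2)
# bigrams like ('new', 'york') capture combinations that single tokens miss
```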

Filter by Document Frequency: number of documents in which a token appears.
A word appears in nearly all documents: it doesn't actively help to distinguish between documents.
A word appears in only 1 or 2 documents: same; it's likely a typo or artefact.
Code: min_df=3 keeps in the Vocabulary only words that appear in at least 3 documents;
max_df=0.9 keeps only words that appear in at most 90% of the documents.
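The document-frequency filtering can be sketched without any library (the `filter_vocabulary` helper is hypothetical, mirroring the min_df/max_df parameters of a typical vectorizer):

```python
def filter_vocabulary(corpus, min_df, max_df):
    """Keep tokens whose document frequency is between the thresholds:
    min_df is an absolute document count, max_df a fraction of the corpus."""
    n_docs = len(corpus)
    df = {}
    for doc in corpus:
        for tok in set(doc):  # count each token at most once per document
            df[tok] = df.get(tok, 0) + 1
    return sorted(tok for tok, d in df.items()
                  if d >= min_df and d / n_docs <= max_df)

corpus = [["the", "hole"], ["the", "black", "hole"],
          ["the", "hole"], ["the", "cat"]]
vocab = filter_vocabulary(corpus, min_df=3, max_df=0.9)
# "the" appears in 100% of documents; "black"/"cat" in only one: all dropped
```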



Logistic Regression
- Logistic Regression: there is 1 weight per input feature (here, 1 weight per topic) and per target class (1 weight per gender).
A positive weight: presence of a topic in the description is correlated with users being of the target gender.
Strong weight (+5.1 compared to others): users enjoying Cinema seem to have a high chance of belonging to gender 1.
Large negative weight (-4.9): liking Cinema is a strong indication of not belonging to gender 3.
- The fact that we can predict someone's gender so well from the topics tackled in their hobbies and tastes indicates a strong link between one's behaviour and their gender.
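The one-weight-per-feature-per-class idea can be illustrated with a hypothetical weight matrix (all numbers are invented for illustration; in practice these would be the coefficients of a fitted model):

```python
# Rows = gender classes, columns = topic features (hypothetical values).
topics = ["Cinema", "Sports", "Music"]
weights = [
    [5.1, 0.2, -0.3],   # gender 1: strong positive weight on Cinema
    [0.1, 1.0, 0.4],    # gender 2
    [-4.9, 0.3, 0.8],   # gender 3: strong negative weight on Cinema
]

features = [1, 0, 1]    # this user's description mentions Cinema and Music

# Score per class = weighted sum of the features; argmax gives the prediction.
scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
predicted_class = scores.index(max(scores))  # the Cinema weight dominates
```

The +5.1 weight pulls the Cinema-liking user strongly toward gender 1, and the -4.9 weight pushes them away from gender 3, exactly the reading of the weights described above.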
