Problem Figuring out which movie reviews are positive/negative without having to read them.
Search for and count positive and negative words.
Example 1
Positive: good, excellent
Negative: bad, worst
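A minimal sketch of this counting approach, using only the two example words of each polarity as a toy lexicon (real sentiment lexicons are much larger):

```python
# Toy lexicons: just the example words above, purely for illustration.
POSITIVE = {"good", "excellent"}
NEGATIVE = {"bad", "worst"}

def naive_sentiment(review: str) -> str:
    # Count how many lexicon words appear in the review.
    words = review.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"
```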
Logistic Regression
To compute a 'positiveness score' based on the features: a way to learn the weights of the different words to measure the positiveness of the review.
Example 1
Pr[x ∈ pos] = σ(α_good · x_good + α_excellent · x_excellent + α_bad · x_bad + α_worst · x_worst)
with σ the sigmoid function, α_w the learned weight of word w, and x_w the count of word w in the review.
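A sketch of this scoring with hypothetical weights (the α values below are made up for illustration; in practice they are learned from labeled reviews):

```python
import math

def sigmoid(z: float) -> float:
    # The logistic function: maps any real score into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical learned weights alpha_w; positive words get positive weights.
alpha = {"good": 1.2, "excellent": 2.0, "bad": -1.5, "worst": -2.5}

def prob_positive(counts: dict) -> float:
    # Weighted sum of word counts, squashed by the sigmoid.
    z = sum(alpha[w] * counts.get(w, 0) for w in alpha)
    return sigmoid(z)
```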
Bag of Words
You can transform any text of any length into a vector of fixed size.
Figure out the words used by reviewers and count each one.
Corpus: all the reviews
Token: unit of text; word, punctuation, …
Vocabulary/Dictionary: all unique words appearing in the corpus
V: size of the vocabulary, the number of unique words
Corpus Frequency: number of times the word appears in all reviews
Term Frequency (in a document): number of times the word appears in one review
Document Frequency: number of reviews the word appears in
For each document:
- Create a vector of dimension V (#unique words = #positions in this new vector);
- In position i, write the number of times the token (smallest unit of text) at position i in the dictionary appears in the document;
- Many words of the Vocabulary are not mentioned in this particular document/review, so their count will be 0 and the vector will be mostly zeros.
Example
Sentence 1: ‘The cat sat on the hat’
Sentence 2: ‘the dog ate the cat and the hat’
Vocabulary: 8 words, so number of dimensions of BoW is 8.
and, ate, cat, dog, hat, on, sat, the
BoW 1: [ 0, 0, 1, 0, 1, 1, 1, 2 ]
BoW 2: [ 1, 1, 1, 1, 1, 0, 0, 3 ]
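The example above can be reproduced in plain Python, without sklearn (vocabulary sorted alphabetically, as shown):

```python
# The two example sentences.
docs = ["The cat sat on the hat", "the dog ate the cat and the hat"]

# Vocabulary: unique lowercase words, sorted alphabetically.
vocab = sorted({w for d in docs for w in d.lower().split()})
# → ['and', 'ate', 'cat', 'dog', 'hat', 'on', 'sat', 'the']

def bow(text: str) -> list:
    # Count, for each vocabulary word, how often it appears in the text.
    words = text.lower().split()
    return [words.count(v) for v in vocab]
```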
NOTEBOOK 01 Bag of Words
Document: the smallest unit of text for your use case (paper, paragraph, recipe)
Use case: the typical question you are looking for the answer to
Query: the text you will use to search in your corpus
Examples
Use case 1: 'Which academic papers are about black holes?'
Corpus: academic papers uploaded to ArXiv
Document: 1 paper
Query: 'black hole'
Use case 2: 'Where does Victor Hugo mention Notre-Dame?'
Corpus: the entire works of Victor Hugo
Document: 1 paragraph
Query: 'notre dame'
Tokenizer: program that takes text and splits it into smaller units, e.g. a book into chapters, paragraphs, sentences or words.
NLTK and SpaCy are Python libraries for text analytics, but they might produce different text splits.
The sentence tokenizer will split a text into sentences
The word tokenizer will split a text into words
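A rough sketch of what these two tokenizers do, using only the standard library (NLTK's and SpaCy's actual tokenizers handle many more edge cases, such as abbreviations and contractions):

```python
import re

def sentence_tokenize(text: str) -> list:
    # Split on sentence-ending punctuation followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize(text: str) -> list:
    # Words and standalone punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)
```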
SKLEARN Generalities
Classes like 'CountVectorizer' or 'TfidfVectorizer' work in the following way:
Instantiate an object with specific parameters: v = CountVectorizer(...)
Fit this object to your corpus (= learn the vocabulary): method v.fit(...)
Transform any piece of text you have into a vector: method v.transform(...)
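A minimal pure-Python mimic of this fit/transform pattern (the class TinyCountVectorizer and its internals are illustrative, not sklearn's actual implementation, which lives in sklearn.feature_extraction.text):

```python
class TinyCountVectorizer:
    def __init__(self, lowercase=True):
        self.lowercase = lowercase
        self.vocabulary_ = {}  # word -> index, filled in by fit()

    def _tokens(self, text):
        return (text.lower() if self.lowercase else text).split()

    def fit(self, corpus):
        # Learn the vocabulary: one index per unique token, alphabetical.
        words = sorted({w for doc in corpus for w in self._tokens(doc)})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, texts):
        # One count vector of dimension V per text; unknown words are ignored.
        rows = []
        for text in texts:
            row = [0] * len(self.vocabulary_)
            for w in self._tokens(text):
                if w in self.vocabulary_:
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows
```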
Raw Count
Take a text and represent it as a vector:
Each text is represented by a vector with V dimensions;
Each dimension corresponds to 1 word of the Vocabulary;
The coefficient in dimension k is the number of times the word at index k in the Vocabulary appears in the represented text.
Example code
Do we consider 'And' differently than 'and'? Use lowercase=True to get around this problem:
lowercase=False gives 134 unique words; lowercase=True gives 127 unique words.
S is the sentence we are looking for; its BoW has shape (1, 127) (127 unique words), and bow contains the counts at the positions where the words of the sentence appear in the Vocabulary.
Use 'show_bow(count_small, bow[0])' to also see which words are where in this BoW vector, including the counts.
Search Engine
If we want to create a search engine, letting the user enter a text query, we can use:
query = input("Type your query: ")
query_bow = count.transform([query])
Then search through the corpus for the documents that are similar to the query.
Similarity: we use the cosine similarity of the BoW vectors of two texts to evaluate their similarity.
Example code
The similarity matrix has D rows (#documents in the corpus) and 1 column. The coefficient at row k is the cosine similarity between the document at index k in the corpus and the query.
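A sketch of this similarity computation in plain Python, reusing the BoW vectors from the sentence example earlier; the query "cat hat" is illustrative (sklearn's cosine_similarity does the same over whole matrices):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# BoW vectors of the two example sentences, plus a toy query.
vocab = ["and", "ate", "cat", "dog", "hat", "on", "sat", "the"]
docs_bow = [[0, 0, 1, 0, 1, 1, 1, 2], [1, 1, 1, 1, 1, 0, 0, 3]]
query_bow = [1 if w in ("cat", "hat") else 0 for w in vocab]

# One similarity per document: this is the column described above.
sims = [cosine(d, query_bow) for d in docs_bow]
```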
TF-IDF
Adjust the raw count to favor words that appear a lot in a few documents, as opposed to those that appear a lot in all documents.
Consider a word in a document in a corpus (all reviews):
Term Frequency (TF): number of times the word appears in the document
Document Frequency (DF): number of documents in which the word appears in the whole corpus
Inverse DF (IDF): inverse of DF
Then TF-IDF(term, document, corpus) = TF × IDF
High TF-IDF value: uncommon words; a word that appears in the document but not a lot overall in the corpus (so it's a specific word).
Low TF-IDF value: common words; a word that appears in the document but also in a lot of other documents in the corpus.
So, we measure the specificity of words against the other documents in the corpus.
For the TF, the IDF and the normalization, specific formulas are given to calculate the TF-IDF.
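A minimal sketch of the computation, using the plain TF × log(N/DF) variant on a toy corpus of pre-tokenized documents; note that sklearn's TfidfVectorizer uses a smoothed IDF and L2 normalization, so its numbers will differ:

```python
import math

# Toy corpus: three already-tokenized documents, purely for illustration.
corpus = [["the", "cat", "sat"], ["the", "dog", "ate", "the", "cat"], ["the", "hat"]]
N = len(corpus)  # number of documents in the corpus

def tf(term, doc):
    # Term Frequency: how often the term appears in this document.
    return doc.count(term)

def df(term):
    # Document Frequency: in how many documents the term appears.
    return sum(term in doc for doc in corpus)

def tf_idf(term, doc):
    # Plain variant: TF * log(N / DF). A corpus-wide word gets IDF log(1) = 0.
    return tf(term, doc) * math.log(N / df(term))
```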