Lecture notes - Text Retrieval and Mining (6013B0801Y)

Pages: 34
Uploaded: 09-04-2024
Written in: 2023/2024

Text Retrieval and Mining (course 6013B0801Y) course at University of Amsterdam (UvA - Universiteit van Amsterdam) given at programme Business Analytics and programme Econometrics and Data Science, year 2023/2024. Information about Bag of Words, Cosine Similarity, TF-IDF, Text Processing, Text Mining, Part of Speech, Constituency Parsing, Named Entity Recognition (NER), Entity Linking, Topic Modeling, Latent Dirichlet Allocation (LDA), BERTopic, Word Embeddings, Word Co-Occurrence Matrix, Word Analogy, GloVe, Word2Vec, Neural Network, Language model, N-Gram Language model, Greedy Generation, RNN (Recurrent Neural Network), Encoder, Decoder, BERT, Masked Language model, Pre-Training, Fine-Tuning, Relevance Score, Recall@K, Precision@K.

Lecture 1: Working with Words (Bag of Words)
Problem: movie reviews; figuring out which are positive/negative without having to read them.

Search for and count positive and negative words.

Example 1
Positive: good, excellent
Negative: bad, worst




Logistic Regression
To figure out a 'positiveness score' based on the features: a way to figure out the weights of the different words to measure the positiveness of the review.

Example 1

Pr[x ∈ pos] = σ(α_good · x_good + α_excellent · x_excellent + α_bad · x_bad + α_worst · x_worst)

with σ the sigmoid function, σ(z) = 1 / (1 + e^(−z)), which maps the weighted sum to a probability between 0 and 1.
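A minimal sketch of this positiveness score in plain Python. The weights here are made up for illustration; in the course they would be fitted by logistic regression, not chosen by hand.

```python
import math

def sigmoid(z):
    # Squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def positiveness(counts, weights):
    # counts: word counts in one review; weights: one alpha per feature word
    z = sum(weights[w] * counts.get(w, 0) for w in weights)
    return sigmoid(z)

# Illustrative weights (invented, not the course's fitted values):
# positive words get positive weights, negative words negative ones.
weights = {"good": 1.2, "excellent": 2.0, "bad": -1.5, "worst": -2.5}

review = {"good": 2, "excellent": 1}   # word counts in one review
print(positiveness(review, weights))   # ≈ 0.988, so likely positive
```

A review dominated by negative words (e.g. {"worst": 1}) would score below 0.5, on the negative side.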

Bag of Words
You can transform any text of any length into a vector of fixed size.
Figure out the words used by reviewers and count each one.

Corpus: all the reviews
Token: unit of text (word, punctuation, …)
Vocabulary/Dictionary: all unique words appearing in the corpus
V: size of the vocabulary, the number of unique words
Corpus Frequency: number of times the word appears in all reviews
Term Frequency (in a document): number of times the word appears in one review
Document Frequency: number of reviews the word appears in

For each document:
- Create a vector of dimension V (#unique words = #positions in this new vector);
- In position i, write the number of times the token (smallest unit of text) at position i in the dictionary appears in the document;
- Many words of the vocabulary are not mentioned in this particular document/review, so their count will be 0 and they will not be mentioned in this token list.
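The steps above can be sketched in plain Python. This is a deliberately naive version (whitespace split, lowercasing, no punctuation handling); the vocabulary is sorted alphabetically so each word keeps a fixed position.

```python
def build_vocabulary(corpus):
    # All unique lowercase words, sorted so each word has a fixed index
    words = set()
    for doc in corpus:
        words.update(doc.lower().split())
    return sorted(words)

def bow_vector(doc, vocabulary):
    # Position i holds the count of vocabulary[i] in this document
    tokens = doc.lower().split()
    return [tokens.count(word) for word in vocabulary]

corpus = ["The cat sat on the hat", "the dog ate the cat and the hat"]
vocab = build_vocabulary(corpus)
print(vocab)                         # ['and', 'ate', 'cat', 'dog', 'hat', 'on', 'sat', 'the']
print(bow_vector(corpus[0], vocab))  # [0, 0, 1, 0, 1, 1, 1, 2]
```

Running this on the two example sentences below reproduces the hand-counted BoW vectors.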


Example
Sentence 1: ‘The cat sat on the hat’
Sentence 2: ‘the dog ate the cat and the hat’

Vocabulary: 8 words, so number of dimensions of BoW is 8.
and, ate, cat, dog, hat, on, sat, the
BoW 1: [ 0, 0, 1, 0, 1, 1, 1, 2 ]
BoW 2: [ 1, 1, 1, 1, 1, 0, 0, 3 ]



NOTEBOOK 01 Bag of Words
Document: the smallest unit of text of your use case (paper, paragraph, recipe)
Use case: the typical question you are looking for the answer to
Query: the text you will use to search in your corpus


Examples
Use case 1: 'Which academic papers are about black holes?'
Corpus: academic papers uploaded to arXiv
Document: 1 paper
Query: 'black hole'

Use case 2: 'Where does Victor Hugo mention Notre-Dame?'
Corpus: the entire works of Victor Hugo
Document: 1 paragraph
Query: 'notre dame'

Tokenizer: a program that takes text and splits it into smaller units: a book into chapters, paragraphs, sentences or words.

NLTK and spaCy are Python libraries for text analytics but might produce different text breaks.
The sentence tokenizer will split a text into sentences.
The word tokenizer will split a text into words.
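NLTK and spaCy ship real tokenizers (e.g. nltk.tokenize.sent_tokenize and word_tokenize); as a rough sketch of what they do, here are naive regex-based versions. These are illustrations only: the real libraries handle abbreviations, contractions, and many edge cases these rules miss.

```python
import re

def sentence_tokenize(text):
    # Naive rule: split after ., ! or ? followed by whitespace
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def word_tokenize(text):
    # Runs of word characters, or single punctuation marks, become tokens
    return re.findall(r"\w+|[^\w\s]", text)

text = "The cat sat. The dog barked!"
print(sentence_tokenize(text))         # ['The cat sat.', 'The dog barked!']
print(word_tokenize("The cat sat."))   # ['The', 'cat', 'sat', '.']
```

Note how the word tokenizer keeps the final '.' as its own token, matching the definition of a token as a word, punctuation mark, etc.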

SKLEARN Generalities
Classes like CountVectorizer or TfidfVectorizer work in the following way:
Instantiate an object with specific parameters
    v = CountVectorizer(...)
Fit this object to your corpus (learn the vocabulary)
    v.fit(...)
Transform any piece of text you have into a vector
    v.transform(...)

Raw Count
Take a text and represent it as a vector:
Each text is represented by a vector with V dimensions.
Each dimension is representative of 1 word of the vocabulary.
The coefficient in dimension k is the number of times the word at index k in the vocabulary is seen in the represented text.

Example code

Do we consider 'And' differently than 'and'? Use lowercase=True to get around this problem.
lowercase=False gives 134 unique words; lowercase=True gives 127 unique words.




S is the sentence we are looking for; its BoW has shape (1, 127), with 127 unique words in the vocabulary. The bow vector holds the counts at the positions where the words of the sentence appear in the vocabulary.
Use show_bow(count_small, bow[0]) to also see which words sit where in this BoW vector, including the counts given below.
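The 134-vs-127 counts above come from the course corpus, which is not reproduced here, but the lowercase effect shows up on any corpus with mixed-case duplicates. A tiny invented example:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus: 'And' and 'and' are the same word in different cases
corpus = ["And the cat sat", "and the dog ran"]

sizes = {}
for lowercase in (False, True):
    v = CountVectorizer(lowercase=lowercase)
    v.fit(corpus)
    sizes[lowercase] = len(v.vocabulary_)

print(sizes)
# lowercase=False keeps 'And' and 'and' as two vocabulary entries (7 words);
# lowercase=True merges them (6 words)
```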

Search Engine
If we want to create a search engine, letting the user enter a text query, we can use:
    query = input("Type your query: ")
    query_bow = count.transform([query])

Then search through the corpus for the documents that are similar to the query.
Similarity: we use the cosine similarity of the BoW vectors of two texts to evaluate their similarity.

Example code



The similarity matrix has D rows (#documents in the corpus) and 1 column.

The coefficient at row k is the cosine similarity between the document at index k in the corpus and the query.
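A minimal end-to-end sketch of such a search engine, using sklearn's cosine_similarity on a small invented corpus (the query is hard-coded here as a stand-in for input(...)):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "The cat sat on the hat",
    "the dog ate the cat and the hat",
    "stock markets rallied today",
]

count = CountVectorizer()
docs_bow = count.fit_transform(corpus)   # D x V count matrix

query = "cat hat"                        # stand-in for input("Type your query: ")
query_bow = count.transform([query])     # 1 x V

# D rows, 1 column: row k = cosine similarity of document k to the query
sim = cosine_similarity(docs_bow, query_bow)
print(sim)

best = sim.ravel().argmax()
print(corpus[best])  # most similar document: 'The cat sat on the hat'
```

The third document shares no words with the query, so its similarity is exactly 0; ranking documents by this column gives the search results.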




TF-IDF

Adjust the raw count to favor words that appear a lot in a few documents, as opposed to those that appear a lot in all documents.
Consider a word in a document in a corpus (all reviews):
Term Frequency (TF): number of times the word appears in the document
Document Frequency (DF): number of documents in which the word appears in the whole corpus
Inverse DF (IDF): inverse of DF

Then TF-IDF(term, document, corpus) = TF × IDF

High TF-IDF value: uncommon words; a word that appears in the document but not a lot overall in the corpus (so it's a specific word).
Low TF-IDF value: common words; a word that appears in the document but also in a lot of other documents in the corpus.

So, we calculate the specificity of words against the other documents in the corpus.

For the TF, IDF and normalization we have specific given formulas to calculate the TF-IDF.

Professor(s): J. Rossi (Julien Rossi)
Contains: all lectures