Escrito por estudiantes que aprobaron Inmediatamente disponible después del pago Leer en línea o como PDF ¿Documento equivocado? Cámbialo gratis 4,6 TrustPilot
logo-home
Resumen

Summary Data Analysis & Retrieval Midterm

Puntuación
-
Vendido
1
Páginas
22
Subido en
01-06-2022
Escrito en
2021/2022

All subjects that are discussed in the first part of the Data Analysis & Retrieval (INFOB3DAR) course for the midterm, clearly summarized. Based on the lectures and the book.

Institución
Grado

Vista previa del contenido

Data-analysis & retrieval
Midterm

Indexing 2

MapReduce 5

Scoring & Ranking 7
Top-k searching 9

Frequent item sets 11

Linear algebra 13
Gaussian elimination 14

PageRank 15

Approximate string matching 16

,Indexing
Text searching
- Collection: Fixed set of documents
- Goal: retrieving documents relevant to the user’s information need
- User’s need for information usually expressed by one or more search terms

Quality measures:
- Precision: fraction of retrieved documents that is relevant to user’s information need
(also called selectivity)
- Recall: fraction of relevant docs in collection that are retrieved (also called sensitivity)

Boolean retrieval
- Basic model for IR (Information Retrieval)
- Uses logical operators (AND, OR, NOT) and brackets
- Term-document incidence matrix: matrix that shows if a term appears in a document
(with 1 or 0 for true or false)
- For each term, you get a bit array where each bit is determined by whether or not
the term is contained in the corresponding document
- You can use the above mentioned bit array in bitwise operations to run queries
- For example, get documents from a collection of 6 that meet the query ‘Brutus
AND Caesar BUT NOT Calpurnia’ could correspond to the bitwise operation
110100 && 110111 && (NOT 010000) = 100100
- Problem: collections are often rather large, too large for the use of such a matrix
- Solved by the sparse matrix approach
- Sparse matrix: documents identified by unique docID and terms are organised in a
dictionary, with each term having its own posting list
- Posting list: ordered list of documents containing the corresponding term
- Dictionary is implemented as a hash table or tree like structure
- Implementations of postings lists:
- Internal memory, static situation: arrays
- Internal memory, dynamic situation: linked list
- External memory: linked list (block structure)




Tree like structures
- B-tree: binary tree, but with a maximum of 4 branches leading out from a node (4
branches needs 3 values to determine which branch to follow
- Trie (prefix tree): leafs are the terms, built up by various nodes adding prefixes

, Indexing process
1. From the documents to be indexed, get the relevant tokens (terms)
2. Modify the tokens to be more general (no capital letters etc.)
3. Get posting lists for the terms using indexer

Boolean query processing
- Query = term1 AND term2
1. Locate postings list for p1 for term1
2. Locate postings list p2 for term2
3. Calculate the intersection of p1 and p2 by list merging
- List merging: keep only the docID’s that occur in all input lists (intersection)
1. Get docID of both lists
2. If they are equal, add to result
3. Otherwise, get next docID of list where current docID is lower than other
4. Repeat steps 2 and 3 with new docID
- Query = term1 AND NOT term2
- Go through the first list and add only the docID’s that do not occur in the other
input list to the result
- Query = term1 AND term2 AND … AND termn
- 𝑛! possibilities of merging one by one
- Use the length of postings list, merge lists with smallest length first
- Skip pointers: pointer to docID further down the list
- Use when the docID that the pointer points to
is lower than the current docID of the other list
- For example, on the right, the pointer from 11
to 31 will be used when merging the two lists
- Many skip pointers → more comparisons,
more skips, higher memory cost
- Few skip pointers → fewer comparisons, less frequent skips, longer
jumps, lower memory cost
- Rule of thumb: 𝑛 skip pointers for postings list of length 𝑛
- Instead of merging the lists one by one, you can merge them all at the same time
by making a slight adjustment to the algorithm for list merging
- This approach allows more efficient use of skip pointers

Phrase queries
- Juxtapositions of terms: difference between “fight club” and “fight” AND “club”
- Solution 1: biword index (make an index for each term)

Libro relacionado

Escuela, estudio y materia

Institución
Estudio
Grado

Información del documento

¿Un libro?
No
¿Qué capítulos están resumidos?
Desconocido
Subido en
1 de junio de 2022
Número de páginas
22
Escrito en
2021/2022
Tipo
RESUMEN

Temas

$11.28
Accede al documento completo:

¿Documento equivocado? Cámbialo gratis Dentro de los 14 días posteriores a la compra y antes de descargarlo, puedes elegir otro documento. Puedes gastar el importe de nuevo.
Escrito por estudiantes que aprobaron
Inmediatamente disponible después del pago
Leer en línea o como PDF

Conoce al vendedor

Seller avatar
Los indicadores de reputación están sujetos a la cantidad de artículos vendidos por una tarifa y las reseñas que ha recibido por esos documentos. Hay tres niveles: Bronce, Plata y Oro. Cuanto mayor reputación, más podrás confiar en la calidad del trabajo del vendedor.
Suniht Universiteit Utrecht
Seguir Necesitas iniciar sesión para seguir a otros usuarios o asignaturas
Vendido
94
Miembro desde
4 año
Número de seguidores
55
Documentos
19
Última venta
3 meses hace

3.9

13 reseñas

5
7
4
2
3
2
2
0
1
2

Documentos populares

Recientemente visto por ti

Por qué los estudiantes eligen Stuvia

Creado por compañeros estudiantes, verificado por reseñas

Calidad en la que puedes confiar: escrito por estudiantes que aprobaron y evaluado por otros que han usado estos resúmenes.

¿No estás satisfecho? Elige otro documento

¡No te preocupes! Puedes elegir directamente otro documento que se ajuste mejor a lo que buscas.

Paga como quieras, empieza a estudiar al instante

Sin suscripción, sin compromisos. Paga como estés acostumbrado con tarjeta de crédito y descarga tu documento PDF inmediatamente.

Student with book image

“Comprado, descargado y aprobado. Así de fácil puede ser.”

Alisha Student

Preguntas frecuentes