100% satisfaction guarantee Immediately available after payment Both online and in PDF No strings attached 4.2 TrustPilot
logo-home
Summary

Summary Data Analysis & Retrieval Midterm

Rating
-
Sold
1
Pages
22
Uploaded on
01-06-2022
Written in
2021/2022

All subjects that are discussed in the first part of the Data Analysis & Retrieval (INFOB3DAR) course for the midterm, clearly summarized. Based on the lectures and the book.

Institution
Course










Whoops! We can’t load your doc right now. Try again or contact support.

Connected book

Written for

Institution
Study
Course

Document information

Summarized whole book?
No
Which chapters are summarized?
Unknown
Uploaded on
June 1, 2022
Number of pages
22
Written in
2021/2022
Type
Summary

Subjects

Content preview

Data-analysis & retrieval
Midterm

Indexing 2

MapReduce 5

Scoring & Ranking 7
Top-k searching 9

Frequent item sets 11

Linear algebra 13
Gaussian elimination 14

PageRank 15

Approximate string matching 16

,Indexing
Text searching
- Collection: Fixed set of documents
- Goal: retrieving documents relevant to the user’s information need
- User’s need for information usually expressed by one or more search terms

Quality measures:
- Precision: fraction of retrieved documents that is relevant to user’s information need
(also called selectivity)
- Recall: fraction of relevant docs in collection that are retrieved (also called sensitivity)

Boolean retrieval
- Basic model for IR (Information Retrieval)
- Uses logical operators (AND, OR, NOT) and brackets
- Term-document incidence matrix: matrix that shows if a term appears in a document
(with 1 or 0 for true or false)
- For each term, you get a bit array where each bit is determined by whether or not
the term is contained in the corresponding document
- You can use the above mentioned bit array in bitwise operations to run queries
- For example, get documents from a collection of 6 that meet the query ‘Brutus
AND Caesar BUT NOT Calpurnia’ could correspond to the bitwise operation
110100 && 110111 && (NOT 010000) = 100100
- Problem: collections are often rather large, too large for the use of such a matrix
- Solved by the sparse matrix approach
- Sparse matrix: documents identified by unique docID and terms are organised in a
dictionary, with each term having its own posting list
- Posting list: ordered list of documents containing the corresponding term
- Dictionary is implemented as a hash table or tree like structure
- Implementations of postings lists:
- Internal memory, static situation: arrays
- Internal memory, dynamic situation: linked list
- External memory: linked list (block structure)




Tree like structures
- B-tree: binary tree, but with a maximum of 4 branches leading out from a node (4
branches needs 3 values to determine which branch to follow
- Trie (prefix tree): leafs are the terms, built up by various nodes adding prefixes

, Indexing process
1. From the documents to be indexed, get the relevant tokens (terms)
2. Modify the tokens to be more general (no capital letters etc.)
3. Get posting lists for the terms using indexer

Boolean query processing
- Query = term1 AND term2
1. Locate postings list for p1 for term1
2. Locate postings list p2 for term2
3. Calculate the intersection of p1 and p2 by list merging
- List merging: keep only the docID’s that occur in all input lists (intersection)
1. Get docID of both lists
2. If they are equal, add to result
3. Otherwise, get next docID of list where current docID is lower than other
4. Repeat steps 2 and 3 with new docID
- Query = term1 AND NOT term2
- Go through the first list and add only the docID’s that do not occur in the other
input list to the result
- Query = term1 AND term2 AND … AND termn
- 𝑛! possibilities of merging one by one
- Use the length of postings list, merge lists with smallest length first
- Skip pointers: pointer to docID further down the list
- Use when the docID that the pointer points to
is lower than the current docID of the other list
- For example, on the right, the pointer from 11
to 31 will be used when merging the two lists
- Many skip pointers → more comparisons,
more skips, higher memory cost
- Few skip pointers → fewer comparisons, less frequent skips, longer
jumps, lower memory cost
- Rule of thumb: 𝑛 skip pointers for postings list of length 𝑛
- Instead of merging the lists one by one, you can merge them all at the same time
by making a slight adjustment to the algorithm for list merging
- This approach allows more efficient use of skip pointers

Phrase queries
- Juxtapositions of terms: difference between “fight club” and “fight” AND “club”
- Solution 1: biword index (make an index for each term)

Get to know the seller

Seller avatar
Reputation scores are based on the amount of documents a seller has sold for a fee and the reviews they have received for those documents. There are three levels: Bronze, Silver and Gold. The better the reputation, the more your can rely on the quality of the sellers work.
Suniht Universiteit Utrecht
Follow You need to be logged in order to follow users or courses
Sold
93
Member since
4 year
Number of followers
55
Documents
19
Last sold
2 hours ago

3.9

13 reviews

5
7
4
2
3
2
2
0
1
2

Recently viewed by you

Why students choose Stuvia

Created by fellow students, verified by reviews

Quality you can trust: written by students who passed their tests and reviewed by others who've used these notes.

Didn't get what you expected? Choose another document

No worries! You can instantly pick a different document that better fits what you're looking for.

Pay as you like, start learning right away

No subscription, no commitments. Pay the way you're used to via credit card and download your PDF document instantly.

Student with book image

“Bought, downloaded, and aced it. It really can be that simple.”

Alisha Student

Frequently asked questions