Summary

Summary Data Analysis & Retrieval Midterm

Uploaded on

01-06-2022

Written in

2021/2022

All subjects that are discussed in the first part of the Data Analysis & Retrieval (INFOB3DAR) course for the midterm, clearly summarized. Based on the lectures and the book.

Connected book

Christopher D. Manning, Prabhakar Raghavan Introduction to Information Retrieval

Edition:2008
ISBN:9780521865715

Written for

Institution: Universiteit Utrecht (UU)
Study: Informatica
Course: Data Analyse En Retrieval (INFOB3DAR)

Document information

Summarized whole book?: No
Which chapters are summarized?: Unknown
Uploaded on: June 1, 2022
Number of pages: 22
Written in: 2021/2022
Type: Summary

Subjects

indexing
information retrieval
mapreduce
map reduce
top k searching
top k querying
top k scoring
scoring and ranking
frequent item sets
linear algebra
gaussian elimination
pagerank
page rank
approxima

Content preview

Data-analysis & retrieval
Midterm

Indexing

MapReduce

Scoring & Ranking
Top-k searching

Frequent item sets

Linear algebra
Gaussian elimination

PageRank

Approximate string matching

Indexing
Text searching
- Collection: Fixed set of documents
- Goal: retrieving documents relevant to the user’s information need
- User’s need for information usually expressed by one or more search terms

Quality measures:
- Precision: fraction of retrieved documents that is relevant to user’s information need
(also called selectivity)
- Recall: fraction of relevant docs in collection that are retrieved (also called sensitivity)
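
Both measures can be computed directly from the set of retrieved docIDs and the set of relevant docIDs. The sketch below is only an illustration; the function name and the example docIDs are made up for this purpose.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of docIDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 of them relevant,
# and 6 relevant documents in the whole collection.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 3, 4, 7, 8, 9])
print(p, r)  # 0.75 0.5
```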

Boolean retrieval
- Basic model for IR (Information Retrieval)
- Uses logical operators (AND, OR, NOT) and brackets
- Term-document incidence matrix: matrix that shows if a term appears in a document
(with 1 or 0 for true or false)
- For each term, you get a bit array where each bit is determined by whether or not
the term is contained in the corresponding document
- You can use these bit arrays in bitwise operations to run queries
- For example, for a collection of 6 documents, the query ‘Brutus
AND Caesar BUT NOT Calpurnia’ corresponds to the bitwise operation
110100 AND 110111 AND (NOT 010000) = 100100
- Problem: collections are often rather large, too large for the use of such a matrix
- Solved by the sparse matrix approach
- Sparse matrix: documents identified by unique docID and terms are organised in a
dictionary, with each term having its own posting list
- Posting list: ordered list of documents containing the corresponding term
- Dictionary is implemented as a hash table or tree-like structure
- Implementations of postings lists:
- Internal memory, static situation: arrays
- Internal memory, dynamic situation: linked list
- External memory: linked list (block structure)
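
The bit-array query from the Brutus/Caesar/Calpurnia example can be tried out as a minimal sketch with Python integers as bit arrays. The helper names (`bits`, `to_postings`) and the convention that the leftmost bit is document 1 are assumptions made for this illustration.

```python
N_DOCS = 6
MASK = (1 << N_DOCS) - 1  # keeps NOT restricted to the 6 document bits

# One row of the term-document incidence matrix per term
# (assumed convention: leftmost bit = document 1).
incidence = {
    "Brutus": int("110100", 2),
    "Caesar": int("110111", 2),
    "Calpurnia": int("010000", 2),
}


def bits(row):
    """Format a row as a fixed-width bit string."""
    return format(row, f"0{N_DOCS}b")


def to_postings(row):
    """Sparse view of a row: sorted docIDs whose bit is set."""
    return [doc_id for doc_id, bit in enumerate(bits(row), start=1) if bit == "1"]


# Brutus AND Caesar AND NOT Calpurnia
result = incidence["Brutus"] & incidence["Caesar"] & (~incidence["Calpurnia"] & MASK)
print(bits(result))         # 100100
print(to_postings(result))  # [1, 4]
```

The `to_postings` helper shows the step from a matrix row to a postings list: only the docIDs with a 1 are stored, which is what makes the sparse representation feasible for large collections.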

Tree-like structures
- B-tree: balanced search tree that generalises the binary search tree; a node can
have more than two branches, here a maximum of 4 (4 branches need 3 values to
determine which branch to follow)
- Trie (prefix tree): leaves are the terms, built up by various nodes adding prefixes,
so terms that share a prefix share the same path from the root
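
A trie used as a dictionary can be sketched as follows. This is a simplified illustration, not the structure from the lectures: it stores one character per edge and marks complete terms with a flag; all names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_term = False  # True if the path from the root spells a complete term


def insert(root, term):
    """Add a term by walking (and creating) one node per character."""
    node = root
    for ch in term:
        node = node.children.setdefault(ch, TrieNode())
    node.is_term = True


def terms_with_prefix(root, prefix):
    """Return all stored terms that start with the given prefix, sorted."""
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    found, stack = [], [(node, prefix)]
    while stack:
        current, text = stack.pop()
        if current.is_term:
            found.append(text)
        for ch, child in current.children.items():
            stack.append((child, text + ch))
    return sorted(found)


root = TrieNode()
for term in ["brutus", "brute", "caesar", "calpurnia"]:
    insert(root, term)
print(terms_with_prefix(root, "bru"))  # ['brute', 'brutus']
```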

Indexing process
1. From the documents to be indexed, get the relevant tokens (terms)
2. Modify the tokens to be more general (no capital letters etc.)
3. Get posting lists for the terms using indexer
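
The three steps can be sketched as a small in-memory indexer. The tokenisation rule (split on anything that is not a letter or digit) and the normalisation (lowercasing only) are simplifying assumptions, and the example documents are made up.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Step 1: extract tokens; here simply runs of letters and digits.
    return re.findall(r"[A-Za-z0-9]+", text)


def normalize(token):
    # Step 2: generalise the token; only lowercasing in this sketch.
    return token.lower()


def build_index(docs):
    # Step 3: map each term to a sorted postings list of docIDs.
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {1: "Brutus killed Caesar", 2: "Caesar and Calpurnia", 3: "brutus spoke"}
index = build_index(docs)
print(index["brutus"])  # [1, 3]
print(index["caesar"])  # [1, 2]
```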

Boolean query processing
- Query = term1 AND term2
1. Locate postings list p1 for term1
2. Locate postings list p2 for term2
3. Calculate the intersection of p1 and p2 by list merging
- List merging: keep only the docIDs that occur in all input lists (intersection)
1. Get the first docID of both lists
2. If they are equal, add the docID to the result and advance both lists
3. Otherwise, advance the list whose current docID is the lower of the two
4. Repeat steps 2 and 3 until one of the lists is exhausted
- Query = term1 AND NOT term2
- Go through the first list and add only the docIDs that do not occur in the other
input list to the result
- Query = term1 AND term2 AND … AND termn
- 𝑛! possible orders in which to merge the lists one by one
- Use the lengths of the postings lists: merge the shortest lists first
- Skip pointers: pointer to a docID further down the same list
- Use when the docID that the pointer points to is lower than or equal to the
current docID of the other list
- For example, in the accompanying figure, the pointer from 11 to 31 will be used
when merging the two lists
- Many skip pointers → more comparisons, more frequent skips, shorter jumps,
higher memory cost
- Few skip pointers → fewer comparisons, less frequent skips, longer
jumps, lower memory cost
- Rule of thumb: √𝑛 evenly spaced skip pointers for a postings list of length 𝑛
- Instead of merging the lists one by one, you can merge them all at the same time
by making a slight adjustment to the algorithm for list merging
- This approach allows more efficient use of skip pointers
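
The query-processing steps above can be sketched on sorted Python lists of docIDs. This is a minimal illustration under assumptions: skip pointers are not stored explicitly but simulated as a jump of about √𝑛 positions from every √𝑛-th posting, and all function names and example lists are hypothetical.

```python
import math


def intersect(p1, p2):
    """term1 AND term2: merge two sorted postings lists."""
    i, j, result = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


def and_not(p1, p2):
    """term1 AND NOT term2: keep docIDs of p1 that do not occur in p2."""
    j, result = 0, []
    for doc_id in p1:
        while j < len(p2) and p2[j] < doc_id:
            j += 1
        if j == len(p2) or p2[j] != doc_id:
            result.append(doc_id)
    return result


def intersect_many(lists):
    """term1 AND ... AND termn: merge the shortest lists first."""
    if not lists:
        return []
    ordered = sorted(lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        result = intersect(result, postings)
    return result


def intersect_with_skips(p1, p2):
    """Like intersect, but follows simulated skip pointers where possible."""
    s1 = max(1, math.isqrt(len(p1)))  # skip length for p1
    s2 = max(1, math.isqrt(len(p2)))  # skip length for p2
    i, j, result = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow the skip pointer only if its target does not pass p2[j].
            if i % s1 == 0 and i + s1 < len(p1) and p1[i + s1] <= p2[j]:
                i += s1
            else:
                i += 1
        else:
            if j % s2 == 0 and j + s2 < len(p2) and p2[j + s2] <= p1[i]:
                j += s2
            else:
                j += 1
    return result


a = [2, 4, 8, 41, 48, 64, 128]
b = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect(a, b))             # [2, 8]
print(intersect_with_skips(a, b))  # [2, 8]
print(and_not([1, 2, 4], [2]))     # [1, 4]
```

Merging the shortest lists first keeps every intermediate result at most as long as the shortest input, which limits the work done in the later merges.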

Phrase queries
- Juxtapositions of terms: difference between “fight club” and “fight” AND “club”
- Solution 1: biword index (index every pair of consecutive terms as a single term)
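
A biword index treats each pair of consecutive terms as one dictionary entry, so a two-word phrase becomes a single lookup. The sketch below assumes whitespace tokenisation and lowercasing; the function name and the example documents are hypothetical.

```python
from collections import defaultdict


def build_biword_index(docs):
    """Map each pair of consecutive terms to a sorted list of docIDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        terms = text.lower().split()
        for first, second in zip(terms, terms[1:]):
            index[(first, second)].add(doc_id)
    return {pair: sorted(ids) for pair, ids in index.items()}


docs = {1: "the fight club", 2: "a club fight", 3: "fight at the club"}
biwords = build_biword_index(docs)

# The phrase "fight club" only matches document 1, although
# "fight" AND "club" would match all three documents.
print(biwords.get(("fight", "club"), []))  # [1]
```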

Get to know the seller

Suniht