Summary

Summary for Information Retrieval Exam (X_400435)

Uploaded on

30-05-2023

Written in

2022/2023

Summary for the Information Retrieval exam (X_400435), for the Data Science minor at the VU. Contents:
Lecture 1: Introduction (book chapter 1)
Lecture 2: Indexing and Boolean Retrieval (book chapters 2 and 4)
Lecture 3: What to Index? (book chapters 2 and 3)
Lecture 4: Beyond Simple Queries (book chapters 3 and 6)
Lecture 5: Vector Space Model (book chapter 6)
Lecture 6: Index Compression (book chapter 5)
Lecture 7: Evaluation (book chapter 8)
Lecture 8: Link Analysis (book chapter 21)
Lecture 9: Link Analysis Continued and Web Structure (book chapters 21.3 and 19)
Lecture 10: Web Crawling (book chapter 20)
Lecture 11: Clustering and Topic Modeling (book chapters 16-17)
Lecture 12: Classification (book chapters 13-15)


Connected book

Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: Introduction to Information Retrieval

Edition: 2008
ISBN: 9780521865715

Written for

Institution: Vrije Universiteit Amsterdam (VU)
Study: Artificial Intelligence
Module: Information Retrieval (X_400435)

Document information

Summarized whole book?: Yes
Uploaded on: May 30, 2023
Number of pages: 31
Written in: 2022/2023
Type: Summary

Subjects

data science
artificial intelligence
computer science
queries
vector space model
index compression
link analysis
web crawling
classification
indexing and boolean retrieval
link analysis continued and web structure
clustering and topic modeling

Content preview

Information retrieval 1

Information retrieval (IR) is finding material of an unstructured nature that satisfies an information need from within large collections (usually stored on computers).

The first idea for an automated system came in 1945 from Vannevar Bush, in "As We May Think".
In the 1960s the field of Information Retrieval emerged.

Evolution of IR
1960s-70s: era of Boolean Retrieval
1975: first Vector Space Model
1980s: large document database systems run by companies became available (LexisNexis, MEDLINE)
1990s: FTP search and the dawn of Web search (Lycos, Yahoo)

IR in 2000s
Google
- Link analysis & ranking
- Multimedia IR (image and video analysis)
- Cross-language IR
- Semantic Web technologies (DBpedia)

IR since 2010s
Categorization and clustering, and recommendation:
- iTunes “Top Songs”
- Amazon “people who bought this also bought …”
- IBM's Watson system (business related: predict future outcomes)
- Recommendations in Netflix, Spotify, YouTube

IR versus DB (databases)
IR                                                     | DB
Unstructured data                                      | Structured data
Set of keywords (loose semantics)                      | Well-defined query (SQL)
Incomplete query specification, partial matching       | Complete query specification, exact matching
Relevant items for result, errors tolerable            | Single error results in failure
Probabilistic models                                   | Deterministic models

What is needed to build a search engine

What makes a search engine good?
Speed + user happiness
Which of the following actions is fastest and which is slowest? (The rank after each item runs from 1 = fastest to 6 = slowest.)
1 – Main memory reference (read a random byte from memory): rank 1
2 – Hard disk seek (read a random byte from hard disk): rank 5
3 – SSD random read (read a random byte from a solid-state drive): rank 3
4 – Zip 1 KB of data (compress 1000 bytes in memory): rank 2
5 – Round trip within the same datacenter (send one byte to another computer in the same fast datacenter network and back): rank 4
6 – Send one byte from the Netherlands to California and back: rank 6
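
The ranking can be reproduced by sorting rough latency figures. The sketch below is only an illustration: the nanosecond values are assumptions based on commonly quoted order-of-magnitude estimates and do not come from these notes, and the name `rank_by_latency` is made up for the example.

```python
# Approximate latencies in nanoseconds. ASSUMPTION: rough, commonly quoted
# order-of-magnitude estimates; real hardware varies a lot.
LATENCY_NS = {
    "main memory reference": 100,
    "zip 1 KB of data": 3_000,
    "SSD random read": 150_000,
    "round trip within same datacenter": 500_000,
    "hard disk seek": 10_000_000,
    "round trip Netherlands-California": 150_000_000,
}


def rank_by_latency(latencies):
    """Return the action names ordered from fastest to slowest."""
    return sorted(latencies, key=latencies.get)


ranking = rank_by_latency(LATENCY_NS)
for rank, action in enumerate(ranking, start=1):
    print(f"{rank}. {action} (~{LATENCY_NS[action]:,} ns)")
```

Only the ordering matters here: memory is orders of magnitude faster than disk, and anything crossing an ocean is slower still.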

2:

In memory, postings lists can be stored as linked lists or variable-length arrays

Token = an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing
Type = the class of all tokens consisting of exactly the same character sequence
Term = a (perhaps normalized) type that is included in the IR system's dictionary
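
A small sketch may help keep the three notions apart. It assumes a very simple tokenizer, lowercasing as the only normalization, and a one-word stop list; `tokenize`, `normalize` and `STOP_WORDS` are illustrative names, not part of the notes.

```python
import re

STOP_WORDS = {"to"}  # ASSUMPTION: a toy stop list, for illustration only


def tokenize(text):
    """Split text into tokens: maximal runs of letters, digits or apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


def normalize(token):
    """Hypothetical normalization step: lowercase only."""
    return token.lower()


text = "to sleep perchance to dream"
tokens = tokenize(text)                              # every occurrence counts
types = set(tokens)                                  # distinct character sequences
terms = {normalize(t) for t in types} - STOP_WORDS   # what ends up in the dictionary

print(len(tokens), "tokens;", len(types), "types;", len(terms), "terms")
```

The phrase has five tokens but only four types, because "to" occurs twice; dropping "to" as a stop word leaves three terms.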

Bottleneck: sorting
Sorting a lot of records on disk is much too slow, in particular on hard disks but also on SSDs. The data is too large to fit in memory, so we need an external sorting algorithm
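
One common way to sort externally is to sort memory-sized blocks, write each as a sorted run to disk, and then merge the runs while streaming. The following is a minimal sketch of that idea, not the course's implementation; the function names and the tab-separated run format are assumptions made for the example.

```python
import heapq
import os
import tempfile


def write_run(records):
    """Sort one memory-sized block and write it to a temporary file (a 'run')."""
    with tempfile.NamedTemporaryFile("w", delete=False, suffix=".run") as f:
        for term, doc_id in sorted(records):
            f.write(f"{term}\t{doc_id}\n")
        return f.name


def read_run(path):
    """Stream (term, doc_id) pairs back from a sorted run file."""
    with open(path) as f:
        for line in f:
            term, doc_id = line.rstrip("\n").split("\t")
            yield term, int(doc_id)


def external_sort(records, block_size):
    """Yield records in sorted order, buffering at most block_size of them."""
    run_paths, block = [], []
    for record in records:
        block.append(record)
        if len(block) == block_size:
            run_paths.append(write_run(block))
            block = []
    if block:
        run_paths.append(write_run(block))
    try:
        # heapq.merge reads the runs lazily, holding one record per run.
        yield from heapq.merge(*(read_run(p) for p in run_paths))
    finally:
        for p in run_paths:
            os.remove(p)


pairs = [("romans", 2), ("friends", 1), ("countrymen", 3),
         ("friends", 2), ("caesar", 1)]
sorted_pairs = list(external_sort(pairs, block_size=2))
print(sorted_pairs)
```

The design point is that each run is written and read sequentially, which avoids the random disk seeks that make naive on-disk sorting slow.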

Summary
Boolean retrieval:
- A simple and well-understood retrieval model
Inverted indexes:
- Building an inverted index demands a lot of resources
- Sorting the index is the critical step
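
To make the two ideas concrete, here is a minimal in-memory sketch of an inverted index and a Boolean AND query answered by merging two sorted postings lists. The tokenization (lowercase, split on whitespace) and the three toy documents are assumptions made for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to a sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(p1, p2):
    """Merge two sorted postings lists in O(len(p1) + len(p2)) steps."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


docs = {
    1: "brutus killed caesar",
    2: "caesar praised calpurnia",
    3: "brutus and caesar and calpurnia",
}
index = build_index(docs)
hits = intersect(index["brutus"], index["caesar"])  # brutus AND caesar
print(hits)
```

The linear-time merge only works because the postings lists are sorted by document ID, which is one reason the sorting step is so central.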

3:
Biword indexes: index every consecutive pair of terms in the text as a phrase
So "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen"
Two-word phrase queries can now be processed in a straightforward manner
Longer phrase queries are broken into biwords that are combined with AND. This can produce false positives: without looking at the documents themselves, we cannot verify that a match actually contains the full phrase

Problems:
- False positives in the answer set
- In particular for phrases with frequent words, like "beer of the month"
- Index blow-up due to a bigger dictionary
- Infeasible for more than biwords
Biword indexes are therefore not the standard solution, but they can be part of a compound strategy
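
The false-positive problem can be seen in a small sketch. It assumes a toy tokenizer (split on whitespace, strip basic punctuation, lowercase) and two made-up documents; the function names are illustrative only.

```python
from collections import defaultdict


def biwords(text):
    """Return the consecutive word pairs of a text as 'w1 w2' strings."""
    words = [w for w in (t.strip(",.!?").lower() for t in text.split()) if w]
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def build_biword_index(docs):
    """Map each biword to the set of document IDs in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for bw in biwords(text):
            index[bw].add(doc_id)
    return index


def phrase_candidates(index, phrase):
    """AND together the postings of every biword in the phrase."""
    sets = [index.get(bw, set()) for bw in biwords(phrase)]
    return set.intersection(*sets) if sets else set()


docs = {
    1: "Friends, Romans, Countrymen",
    2: "Romans countrymen and old friends romans",  # made-up text
}
index = build_biword_index(docs)
candidates = phrase_candidates(index, "friends romans countrymen")
print(sorted(candidates))
```

Document 2 contains both biwords but never the three-word phrase itself, so it is returned as a false positive; ruling it out would require checking the documents or using positional information.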


Get to know the seller

simonvanrens
