College aantekeningen

Lectures Information Retrieval

Beoordeling

Verkocht

Pagina's

Geüpload op

13-04-2020

Geschreven in

2019/2020

Hoorcolleges van het vak Information Retrieval

Instelling

Vak

Oeps! We kunnen je document nu niet laden. Probeer het nog eens of neem contact op met support.

Meld schending auteursrecht

Gekoppeld boek

Christopher D. Manning, Prabhakar Raghavan Introduction to Information Retrieval

Uitgave:2008
ISBN:9780521865715
Druk:Onbekend

Geschreven voor

Instelling: Vrije Universiteit Amsterdam (VU)
Studie: Artificial Intelligence
Vak: Information Retrieval

Alle documenten voor dit vak (3)

Documentinformatie

Geüpload op: 13 april 2020
Aantal pagina's: 30
Geschreven in: 2019/2020
Type: College aantekeningen
Docent(en): Onbekend
Bevat: Alle colleges

Onderwerpen

Voorbeeld van de inhoud

Information Retrieval
HC01 Introduction
Information retrieval (IR) is finding material (usually documents) of an unstructured nature
(usually text) that satisfies an information need from within large collections (usually
stored on computers). In this course, focusing on automated information retrieval.

History of Information Retrieval
Many years ago, books were sorted by author and the other books are sorted by title. Lots of
manual effort. Library was much more useful, and you could find stuff. Nowadays, we think
this method is not so user-friendly.

Automated Information Retrieval
The first idea for an automated system was described in 1945 by Vannevar Bush. He didn’t
actually build the system. In the 1960s the field of Information Retrieval emerged. Computers
became a bit more widespread.

Evolution of Information Retrieval
1960-70s: the era of Boolean Retrieval
- the simplest type of retrieval
- either true or false, what the query asks for or not
1975: The first Vector Space Model
- Still these days seen as the main model
1980s: Large document database systems run by companies became available (LexisNexis,
Medline)
1990s: FTP search (different protocol than HTTP) and the dawn of Web search (Lycos,
Yahoo, Altavista)
- Altavista is the most popular one. Enable web-users to search the whole web.

Information Retrieval in the 2000s
Google came along a revolutionized many things.
● Link Analysis and Ranking
● Question Answering
○ allow users to actually ask questions
● Multimedia IR (image and video analysis
● Cross-language IR
● Document Summarization
● Semantic Web Technologies (DBpedia)

Information Retrieval in the 2010s
Categorization and clustering, and recommendation systems:
● iTunes “Top Songs”
● Amazon “people who bought this also bought”
● IBM’s Watson system
● Netflix “Top Picks”
○ based on all data they have and your past behavior
What is Needed to Build a Search Engine?
Search Engine Components

, - What (software) components are needed to build a search engine such as Google?
- User Interface (list of links)
- Crawler
- needs a big database, stores all the pages so we have them locally.
- Document cache: replicating something from something else
- Gets all these documents and stores it here
- Preprocess
- Preprocess query into something you can work with
- Query execution
- Other type of structured database, index; purpose: optimizing query execution
- List of documents
- Post-processing

The Information Retrieval Framework
[model]

Information Retrieval versus Databases
Databases results are exact/clearly defined/structured data. Query is very well defined.
Single error results in a failure.

Information Retrieval results are not perfect/not what we are looking for/doesn’t necessarily
has to be the right result/unstructured data. Set of keywords (loose semantics). Errors are
tolerable.

What makes search engines good?
Search speed: Latency
Latency of actions (from fastest to slowest)
1. Main memory reference (read random byte from memory)
2. Zip 1KB of data (compress 1000 bytes in memory)
3. SSD random read (read random byte from solid-state drive)
4. Round trip within same datacenter (send one byte to another computer in the same
fast datacenter network and back
5. 2
6. Send packet CA → Netherlands → CA 6

, HC02 Indexing and Boolean Retrieval

Why can we be sure that the matrix is sparse?
- 1000 out of 500.000 = 0.2% meaning that 99.8% are zeros.

Inverted index
For each term (word) t, we must store a list of all documents that contain t
Identify each by a document ID: a serial document number. So you do not have to reference
to the title.
What data structure could we use for these identifier lists?
- Hash table/map, Dictionary (python)

We need variable-size lists (called “posting lists”):
- On disk, a continuous run of postings is best
- In memory, can use linked lists or variable-length arrays

Dictionary postings:
“ Brutus → “ 1,2,4 …
links: “→”

Inverted Index Construction
Input text: Friends, Romans, countrymen…
Tokenizer
Tokens Friends|Romans|countrymen|..
Normalization
Normalized: friend|roman|countryman|
Indexer
Inverted Index: friend → 1,2,4,5
roman → 1,5,11
countryman → 1, 32, 87, ..

We only want to go through every document once.

Indexer Step 1: Token Sequence
Extract sequence of <normalized tokens, document ID> pairs.

Indexer Step 2: Sort (main indexing step)
Sort by terms (and then by document ID)

Indexer Step 3: Dictionary and Postings
Multiple term entries in a single documents are merged:
- Split into Dictionary and Postings
- Document frequency information is added

€3,49

Krijg toegang tot het volledige document:

100% tevredenheidsgarantie

Direct beschikbaar na je betaling

Lees online óf als PDF

Geen vaste maandelijkse kosten

Maak kennis met de verkoper

cdh

4,0

(1)

Ook beschikbaar in voordeelbundel

Maak kennis met de verkoper

cdh Vrije Universiteit Amsterdam

Bekijk profiel

Volgen

Verkocht

Lid sinds

6 jaar

Aantal volgers

Documenten

Laatst verkocht

2 jaar geleden

4,0

1 beoordelingen

Recent door jou bekeken

Waarom studenten kiezen voor Stuvia

Gemaakt door medestudenten, geverifieerd door reviews

Kwaliteit die je kunt vertrouwen: geschreven door studenten die slaagden en beoordeeld door anderen die dit document gebruikten.

Niet tevreden? Kies een ander document

Geen zorgen! Je kunt voor hetzelfde geld direct een ander document kiezen dat beter past bij wat je zoekt.

Betaal zoals je wilt, start meteen met leren

Geen abonnement, geen verplichtingen. Betaal zoals je gewend bent via iDeal of creditcard en download je PDF-document meteen.

“Gekocht, gedownload en geslaagd. Zo makkelijk kan het dus zijn.”

Alisha Student

Veelgestelde vragen

Wat krijg ik als ik dit document koop?

Je krijgt een PDF, die direct beschikbaar is na je aankoop. Het gekochte document is altijd, overal en oneindig toegankelijk via je profiel.

Tevredenheidsgarantie: hoe werkt dat?

Onze tevredenheidsgarantie zorgt ervoor dat je altijd een studiedocument vindt dat goed bij je past. Je vult een formulier in en onze klantenservice regelt de rest.

Van wie koop ik deze samenvatting?

Stuvia is een marktplaats, je koop dit document dus niet van ons, maar van verkoper cdh. Stuvia faciliteert de betaling aan de verkoper.

Zit ik meteen vast aan een abonnement?

Nee, je koopt alleen deze samenvatting voor €3,49. Je zit daarna nergens aan vast.

Is Stuvia te vertrouwen?

4,6 sterren op Google & Trustpilot (+1000 reviews) Afgelopen 30 dagen zijn er 49388 samenvattingen verkocht Opgericht in 2010, al 15 jaar dé plek om samenvattingen te kopen

Lectures Information Retrieval

Gekoppeld boek

Geschreven voor

Documentinformatie

Onderwerpen

Voorbeeld van de inhoud

Meer vakken binnen Vrije Universiteit Amsterdam (VU) > Artificial Intelligence

Ook beschikbaar in voordeelbundel

Maak kennis met de verkoper

Recent door jou bekeken

Waarom studenten kiezen voor Stuvia

Gemaakt door medestudenten, geverifieerd door reviews

Niet tevreden? Kies een ander document

Betaal zoals je wilt, start meteen met leren

Veelgestelde vragen

Wat krijg ik als ik dit document koop?

Tevredenheidsgarantie: hoe werkt dat?

Van wie koop ik deze samenvatting?

Zit ik meteen vast aan een abonnement?

Is Stuvia te vertrouwen?