Notas de lectura

Information Retrieval Summary

Name: Information Retrieval Summary 2021-2022
SKU: doc_1444004
Rating: 3.80 (5 reviews)
Author: bastudent

Puntuación

3.8

(5)

Vendido

Páginas

Subido en

15-12-2021

Escrito en

2021/2022

Detailed document on all the lectures 1-13.

Institución

Grado

Vista previa del contenido

INFORMATION RETRIEVAL

WEEK 1
LECTURE 1: Introduction

What is information retrieval?
Information retrieval: information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text)
that satisfies an information need from within large collections (usually stored on computers).

History of information retrieval
- Information retrieval used to be an activity that only a few people engaged in: reference librarians, paralegals, and similar
professional searchers. Nowadays, everyone engages in IR every day e.g. search engines or searching an inbox.
- Before computers information retrieval existed in some form. E.g. at the library, one could find books by searching through
index cards.
- Automated information retrieval:
o The first idea for an automated information retrieval system was described in 1945 by Vannevar Bush in ‘As We May
Think’.
o Microfilm stored in a large desk that could be accessed through knobs and screens that would show the microfilm
recordings. It was a mechanical process that could pull out the right book. It also made use of hyperlinks that you could
use to get from one book to the next or find books about a certain topic.
o In the 1960s the field of information retrieval emerged when text documents became available in digital form.
- Evolution of information retrieval:
o 1960-70s: the era of Boolean retrieval
o 1975: the first vector space model
o 1980s: large document database systems run by companies became available e.g. LexisNexis and MedLine
o 1990s: FTP search and the dawn of Web search e.g. Lycos, Yahoo, and Altavista.
- Information retrieval in the 2000s:
o Link analysis and ranking e.g. Google
o Question answering
o Multimedia IR (image and video analysis)
o Cross-language IR
o Semantic Web Technologies e.g. DBpedia
- Information retrieval since 2010:
o Categorization and clustering, and recommendation:
• iTunes - “top songs"
• Amazon - “people who bought this also bought"
• IBM's Watson system recommendations in Netflix, Spotify, YouTube, etc.
o Machine Learning for IR: learn to rank
o Knowledge graphs using Semantic Web Technologies e.g. the structured facts and photos that appear when you search
for someone on the Web.

Information retrieval vs. databases
- Information retrieval is fast becoming the dominant form of information access, overtaking traditional database-style
searching.
Information retrieval Databases
(Mostly) unstructured data e.g. text Structured data e.g. tables
Set of keywords (loose semantics) Well defined query (regular expression, SQL)
Incomplete query specification, partial matching Complete query specification, exact match
Relevant items for result, errors are tolerable Single error results in a failure
Probabilistic models Deterministic models
- Unstructured data: data which does not have clear, semantically overt, easy-for-a-computer structure. No data is truly
unstructured e.g. headings, paragraphs etc.
- Structured data: a relational database such as those used to maintain product inventories and personnel records.

IR scale
1. Web search: the system has to provide search over billions of documents stored on millions of computers.
2. Enterprise, institutional, and domain-specific search: retrieval might be provided for collections such as a corporation’s
internal documents, a database of patents, or research articles on biochemistry.
3. Personal information retrieval: consumer operating systems have integrated IR e.g. email provide search but also text
classification.

1

,What is needed to build a search engine?
- The most important part of a search engine or the information retrieval framework is the index.

The information retrieval framework

What makes a search engine good?
- Search speed:
o How fast does it search?
o Latency as a function of index size.
- Index speed:
o How fast does it index?
o Number of documents per hour.
- Expressiveness of query language:
o Ability to express complex information needs.
o Speed on complex queries.
- Disk space requirements
- User interface features
- User happiness

Search speed (latency) from fastest to slowest:
1. Main memory reference (read random byte from memory).
2. Zip 1KB of data (compress 1000 bytes in memory).
3. SSD random read (read random byte from solid-state drive).
4. Round trip within same datacenter (send one byte to another
computer in the same fast datacenter network and back).
5. Hard disk seek (read random byte from hard disk).
6. Send one byte from the Netherlands to California and back.

2

,LECTURE 2: Indexing and Boolean retrieval

Searching unstructured data
- Example of a Boolean query: Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
- A simple and naïve first attempt to process this query: brute force-approach in the form of a linear search through text
documents, called grepping. Useful for simple querying of words however, not efficient for large document collections,
flexible matching operations, or ranked retrieval.
- Therefore, we need the information retrieval framework, where a lot of preliminary effort is required to create a structure
that makes it efficient to answer a query.
- The index is the main structure to be created.

Term-document incidence matrix
- Incidence matrix: a binary-term document.
o Terms are the indexed units that are usually words.
o Depending on whether we look at the matrix rows or columns, we can have a vector for each term, which shows the
documents it appears in, or a vector for each document, showing the terms that occur in it.

- If we take the rows, we have a vector with 0/1 values for each term.
- To answer the query:
o We take the vectors for Brutus and the vector for Caesar and the
complemented (NOT = 0 and 1 switched) vector for Calpurnia.
o We then apply a bit-wise logical AND
110100 AND 110111 AND 101111 = 100100
o The answer is the first and the fourth document: Antony and Cleopatra and Hamlet.

Bigger collections
- With a slightly more realistic example: if N = 1 000 000 documents of each 1 000 words (tokens) long, and 1 word is on
average 6 bytes, then the document collection is 6 GB in size (1 000 000 000 bytes = 1 GB).
- In this case, our vocabulary V consists of about |V|= 500 000 distinct words (terms).

The matrix is getting very big
- 500 000 (row for each distinct term) x 1 000 000 (column for each document) matrix has half-a-trillion 0s and 1s, which is
too many to fit in a computer’s memory. 500 000 000 000 = 500Gb/8 = 62.5GB (1 byte (B) = 8 bits (b)).
- The matrix is extremely sparse.
- Sparse matrix: a matrix with few non-zero entries, thus comprised mostly of zeros.
o The number of words in each document is relatively small compared to the number of distinct terms. E.g. each column
can have a maximum of 1000 1s and there are at least 499 000 0s. Therefore, 99.8% of the cells are zero.
o A term-document matrix is extremely sparse.
- A much better representation is to record only the things that do occur, that is, the 1 position. This is central to the idea of
inverted index.

Inverted index
- Records only the 1 positions and is also known as an inverted file.
- There is a dictionary of items.
- For each term (word) t, we store a list of all documents that contain t, called
a postings list, which is a variable-size list.
- Data structure of a postings list:
o On disk: a continuous run of postings without explicit pointers is best.
o In memory: can use linked lists or variable-length arrays.

3

, - The only way to access a list is to go through the list from the start till the end.
- Identify each document by a document ID: a unique serial document number.
- Each item in the list, which records that a term appeared in a document is called a posting.
- Although there are still 500 000 terms, the length of each postings list is much smaller than before, hence the size is smaller.

Inverted index construction

- Steps for building the index after collecting the documents to be indexed:
1. Token sequence: tokenize the text, turning each document into a list of tokens.
o (Normalized token, document ID) pairs
2. Sort: do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms.
o Sort by terms (alphabetically) and then by document ID.

3. Dictionary and postings: index the documents that each term occurs in by creating an inverted index, consisting of a
dictionary and postings.
o Multiple term entries in a single document are merged.
o Split into dictionary and postings.
o Document frequency information is added.
• Document frequency: the number of documents which contain the term. Same as the length of each posting.
We use this information for improving query time efficiency and, later, for weighting in ranked retrieval
models.

- Since a term often occurs in a number of documents, this data organization reduces the storage requirements of the index.
- This inverted index structure is essentially without rivals as the most efficient structure for supporting ad hoc text search.
- We pay for storage of both the dictionary and the postings lists:
o Dictionary is kept in memory.
o Postings lists are kept on disk since it is larger.

4

Informar violación de derechos de autor

Escuela, estudio y materia

Institución: Vrije Universiteit Amsterdam (VU)
Estudio: Business Analytics and Data Science Minor
Grado: Information Retrieval

Todos documentos para esta materia (1)

Información del documento

Subido en: 15 de diciembre de 2021
Número de páginas: 44
Escrito en: 2021/2022
Tipo: NOTAS DE LECTURA
Profesor(es): Tobia kuhn
Contiene: Todas las clases

Temas

information retrieval
ir

$9.44

Accede al documento completo:

Comprado por 22 estudiantes

100% de satisfacción garantizada

Inmediatamente disponible después del pago

Tanto en línea como en PDF

No estas atado a nada

Conoce al vendedor

bastudent

3.4

(8)

Reseñas de compradores verificados

Se muestran los 5 comentarios

waterdijk Business Analytics · 13 reseñas

2 año hace

It is not a summary. It is basically a document that has all the slides written on it without any reasoning between them. The slides were more useful to study.

ilmajaganjac1 Computer Science · 4 reseñas

3 año hace

ramapamudji Vwo 6 · 7 reseñas

3 año hace

berendwiewel Biomedische Wetenschappen · 3 reseñas

4 año hace

jorissnl X · 18 reseñas

4 año hace

3.8

5 reseñas

Reseñas confiables sobre Stuvia

Todas las reseñas las realizan usuarios reales de Stuvia después de compras verificadas.

Conoce al vendedor

bastudent Universiteit van Amsterdam

Ver perfil

Seguir

Vendido

Miembro desde

5 año

Número de seguidores

Documentos

Última venta

4 meses hace

3.4

8 reseñas

Documentos populares

Recientemente visto por ti

Por qué los estudiantes eligen Stuvia

Creado por compañeros estudiantes, verificado por reseñas

Calidad en la que puedes confiar: escrito por estudiantes que aprobaron y evaluado por otros que han usado estos resúmenes.

¿No estás satisfecho? Elige otro documento

¡No te preocupes! Puedes elegir directamente otro documento que se ajuste mejor a lo que buscas.

Paga como quieras, empieza a estudiar al instante

Sin suscripción, sin compromisos. Paga como estés acostumbrado con tarjeta de crédito y descarga tu documento PDF inmediatamente.

“Comprado, descargado y aprobado. Así de fácil puede ser.”

Alisha Student

Preguntas frecuentes

What do I get when I buy this document?

You get a PDF, available immediately after your purchase. The purchased document is accessible anytime, anywhere and indefinitely through your profile.

100% de satisfacción garantizada: ¿Cómo funciona?

Nuestra garantía de satisfacción le asegura que siempre encontrará un documento de estudio a tu medida. Tu rellenas un formulario y nuestro equipo de atención al cliente se encarga del resto.

Who am I buying this summary from?

Stuvia is a marketplace, so you are not buying this document from us, but from seller bastudent. Stuvia facilitates payment to the seller.

Will I be stuck with a subscription?

No, you only buy this summary for $9.44. You're not tied to anything after your purchase.

Can Stuvia be trusted?

4.6 stars on Google & Trustpilot (+1000 reviews) 45,681 summaries were sold in the last 30 days Founded in 2010, the go-to place to buy summaries for 16 years now