100% de satisfacción garantizada Inmediatamente disponible después del pago Tanto en línea como en PDF No estas atado a nada 4,6 TrustPilot
logo-home
Notas de lectura

Information Retrieval Summary

Puntuación
3.8
(5)
Vendido
22
Páginas
44
Subido en
15-12-2021
Escrito en
2021/2022

Detailed document on all the lectures 1-13.

Institución
Grado











Ups! No podemos cargar tu documento ahora. Inténtalo de nuevo o contacta con soporte.

Escuela, estudio y materia

Institución
Estudio
Grado

Información del documento

Subido en
15 de diciembre de 2021
Número de páginas
44
Escrito en
2021/2022
Tipo
Notas de lectura
Profesor(es)
Tobia kuhn
Contiene
Todas las clases

Temas

Vista previa del contenido

INFORMATION RETRIEVAL

WEEK 1
LECTURE 1: Introduction

What is information retrieval?
Information retrieval: information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text)
that satisfies an information need from within large collections (usually stored on computers).

History of information retrieval
- Information retrieval used to be an activity that only a few people engaged in: reference librarians, paralegals, and similar
professional searchers. Nowadays, everyone engages in IR every day e.g. search engines or searching an inbox.
- Before computers information retrieval existed in some form. E.g. at the library, one could find books by searching through
index cards.
- Automated information retrieval:
o The first idea for an automated information retrieval system was described in 1945 by Vannevar Bush in ‘As We May
Think’.
o Microfilm stored in a large desk that could be accessed through knobs and screens that would show the microfilm
recordings. It was a mechanical process that could pull out the right book. It also made use of hyperlinks that you could
use to get from one book to the next or find books about a certain topic.
o In the 1960s the field of information retrieval emerged when text documents became available in digital form.
- Evolution of information retrieval:
o 1960-70s: the era of Boolean retrieval
o 1975: the first vector space model
o 1980s: large document database systems run by companies became available e.g. LexisNexis and MedLine
o 1990s: FTP search and the dawn of Web search e.g. Lycos, Yahoo, and Altavista.
- Information retrieval in the 2000s:
o Link analysis and ranking e.g. Google
o Question answering
o Multimedia IR (image and video analysis)
o Cross-language IR
o Semantic Web Technologies e.g. DBpedia
- Information retrieval since 2010:
o Categorization and clustering, and recommendation:
• iTunes - “top songs"
• Amazon - “people who bought this also bought"
• IBM's Watson system recommendations in Netflix, Spotify, YouTube, etc.
o Machine Learning for IR: learn to rank
o Knowledge graphs using Semantic Web Technologies e.g. the structured facts and photos that appear when you search
for someone on the Web.

Information retrieval vs. databases
- Information retrieval is fast becoming the dominant form of information access, overtaking traditional database-style
searching.
Information retrieval Databases
(Mostly) unstructured data e.g. text Structured data e.g. tables
Set of keywords (loose semantics) Well defined query (regular expression, SQL)
Incomplete query specification, partial matching Complete query specification, exact match
Relevant items for result, errors are tolerable Single error results in a failure
Probabilistic models Deterministic models
- Unstructured data: data which does not have clear, semantically overt, easy-for-a-computer structure. No data is truly
unstructured e.g. headings, paragraphs etc.
- Structured data: a relational database such as those used to maintain product inventories and personnel records.

IR scale
1. Web search: the system has to provide search over billions of documents stored on millions of computers.
2. Enterprise, institutional, and domain-specific search: retrieval might be provided for collections such as a corporation’s
internal documents, a database of patents, or research articles on biochemistry.
3. Personal information retrieval: consumer operating systems have integrated IR e.g. email provide search but also text
classification.


1

,What is needed to build a search engine?
- The most important part of a search engine or the information retrieval framework is the index.

The information retrieval framework




What makes a search engine good?
- Search speed:
o How fast does it search?
o Latency as a function of index size.
- Index speed:
o How fast does it index?
o Number of documents per hour.
- Expressiveness of query language:
o Ability to express complex information needs.
o Speed on complex queries.
- Disk space requirements
- User interface features
- User happiness

Search speed (latency) from fastest to slowest:
1. Main memory reference (read random byte from memory).
2. Zip 1KB of data (compress 1000 bytes in memory).
3. SSD random read (read random byte from solid-state drive).
4. Round trip within same datacenter (send one byte to another
computer in the same fast datacenter network and back).
5. Hard disk seek (read random byte from hard disk).
6. Send one byte from the Netherlands to California and back.




2

,LECTURE 2: Indexing and Boolean retrieval

Searching unstructured data
- Example of a Boolean query: Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
- A simple and naïve first attempt to process this query: brute force-approach in the form of a linear search through text
documents, called grepping. Useful for simple querying of words however, not efficient for large document collections,
flexible matching operations, or ranked retrieval.
- Therefore, we need the information retrieval framework, where a lot of preliminary effort is required to create a structure
that makes it efficient to answer a query.
- The index is the main structure to be created.

Term-document incidence matrix
- Incidence matrix: a binary-term document.
o Terms are the indexed units that are usually words.
o Depending on whether we look at the matrix rows or columns, we can have a vector for each term, which shows the
documents it appears in, or a vector for each document, showing the terms that occur in it.




- If we take the rows, we have a vector with 0/1 values for each term.
- To answer the query:
o We take the vectors for Brutus and the vector for Caesar and the
complemented (NOT = 0 and 1 switched) vector for Calpurnia.
o We then apply a bit-wise logical AND
110100 AND 110111 AND 101111 = 100100
o The answer is the first and the fourth document: Antony and Cleopatra and Hamlet.

Bigger collections
- With a slightly more realistic example: if N = 1 000 000 documents of each 1 000 words (tokens) long, and 1 word is on
average 6 bytes, then the document collection is 6 GB in size (1 000 000 000 bytes = 1 GB).
- In this case, our vocabulary V consists of about |V|= 500 000 distinct words (terms).

The matrix is getting very big
- 500 000 (row for each distinct term) x 1 000 000 (column for each document) matrix has half-a-trillion 0s and 1s, which is
too many to fit in a computer’s memory. 500 000 000 000 = 500Gb/8 = 62.5GB (1 byte (B) = 8 bits (b)).
- The matrix is extremely sparse.
- Sparse matrix: a matrix with few non-zero entries, thus comprised mostly of zeros.
o The number of words in each document is relatively small compared to the number of distinct terms. E.g. each column
can have a maximum of 1000 1s and there are at least 499 000 0s. Therefore, 99.8% of the cells are zero.
o A term-document matrix is extremely sparse.
- A much better representation is to record only the things that do occur, that is, the 1 position. This is central to the idea of
inverted index.

Inverted index
- Records only the 1 positions and is also known as an inverted file.
- There is a dictionary of items.
- For each term (word) t, we store a list of all documents that contain t, called
a postings list, which is a variable-size list.
- Data structure of a postings list:
o On disk: a continuous run of postings without explicit pointers is best.
o In memory: can use linked lists or variable-length arrays.

3

, - The only way to access a list is to go through the list from the start till the end.
- Identify each document by a document ID: a unique serial document number.
- Each item in the list, which records that a term appeared in a document is called a posting.
- Although there are still 500 000 terms, the length of each postings list is much smaller than before, hence the size is smaller.

Inverted index construction




- Steps for building the index after collecting the documents to be indexed:
1. Token sequence: tokenize the text, turning each document into a list of tokens.
o (Normalized token, document ID) pairs
2. Sort: do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms.
o Sort by terms (alphabetically) and then by document ID.




3. Dictionary and postings: index the documents that each term occurs in by creating an inverted index, consisting of a
dictionary and postings.
o Multiple term entries in a single document are merged.
o Split into dictionary and postings.
o Document frequency information is added.
• Document frequency: the number of documents which contain the term. Same as the length of each posting.
We use this information for improving query time efficiency and, later, for weighting in ranked retrieval
models.




- Since a term often occurs in a number of documents, this data organization reduces the storage requirements of the index.
- This inverted index structure is essentially without rivals as the most efficient structure for supporting ad hoc text search.
- We pay for storage of both the dictionary and the postings lists:
o Dictionary is kept in memory.
o Postings lists are kept on disk since it is larger.




4
$9.76
Accede al documento completo:
Comprado por 22 estudiantes

100% de satisfacción garantizada
Inmediatamente disponible después del pago
Tanto en línea como en PDF
No estas atado a nada

Reseñas de compradores verificados

Se muestran los 5 comentarios
2 año hace

It is not a summary. It is basically a document that has all the slides written on it without any reasoning between them. The slides were more useful to study.

3 año hace

3 año hace

4 año hace

4 año hace

3.8

5 reseñas

5
3
4
0
3
1
2
0
1
1
Reseñas confiables sobre Stuvia

Todas las reseñas las realizan usuarios reales de Stuvia después de compras verificadas.

Conoce al vendedor

Seller avatar
Los indicadores de reputación están sujetos a la cantidad de artículos vendidos por una tarifa y las reseñas que ha recibido por esos documentos. Hay tres niveles: Bronce, Plata y Oro. Cuanto mayor reputación, más podrás confiar en la calidad del trabajo del vendedor.
bastudent Universiteit van Amsterdam
Seguir Necesitas iniciar sesión para seguir a otros usuarios o asignaturas
Vendido
60
Miembro desde
5 año
Número de seguidores
51
Documentos
0
Última venta
3 meses hace

3.4

8 reseñas

5
3
4
1
3
2
2
0
1
2

Recientemente visto por ti

Por qué los estudiantes eligen Stuvia

Creado por compañeros estudiantes, verificado por reseñas

Calidad en la que puedes confiar: escrito por estudiantes que aprobaron y evaluado por otros que han usado estos resúmenes.

¿No estás satisfecho? Elige otro documento

¡No te preocupes! Puedes elegir directamente otro documento que se ajuste mejor a lo que buscas.

Paga como quieras, empieza a estudiar al instante

Sin suscripción, sin compromisos. Paga como estés acostumbrado con tarjeta de crédito y descarga tu documento PDF inmediatamente.

Student with book image

“Comprado, descargado y aprobado. Así de fácil puede ser.”

Alisha Student

Preguntas frecuentes