History of Information Retrieval

Automated IR
- first idea in 1945: "As We May Think" (Vannevar Bush)
  ↳ microfilm mechanical recordings with links between books
- in the 60's, the field of IR emerged
Evolution of Information Retrieval
- 1960-1970's: the era of Boolean Retrieval
- 1975: first vector space model
- 1980's: large DBs made available
- 1990's: FTP search and the dawn of the web
IR in the 2000's
- link analysis & retrieval (Google)
- Q&A
- multimedia IR (image/video)
- cross-language retrieval
- semantic web (DBpedia)
IR since the 2010's
- categorization, clustering, and recommendations:
  ↳ "Top songs"
  ↳ "People who bought this also bought"
- IBM Watson
- video/media recommendations (YouTube, Netflix)
- knowledge graphs
IR vs DB

IR:
- (mostly) unstructured data
- keywords (loose semantics)
- incomplete query specification
- partial matching
- relevant items for result; errors are tolerable
- probabilistic models

DBs:
- structured data
- defined query language
- complete query specification
- exact matching
- single error = failure
- deterministic models
The Information Retrieval Framework

Criteria for Good Search Engines
- Search speed: how fast does it search?
  ↳ latency as a function of index size
- Indexing speed: how fast does it index?
  ↳ number of documents/hour
- Expressiveness of the query language
  ↳ ability to express complex information needs
  ↳ speed on complex queries
- Disk space requirements
- User happiness
Lecture 2: Indexing and Boolean Retrieval

Term-Document Incidence Matrix
- incidence vectors: vectors with 0/1 for each term
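With 0/1 incidence vectors, a Boolean AND query is just an element-wise AND of the terms' vectors. A minimal sketch (the toy collection below is invented for illustration, not from the notes):

```python
# Build a term-document incidence matrix for a tiny toy collection,
# then answer an AND query by intersecting 0/1 incidence vectors.

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

# vocabulary = all distinct terms in the collection
vocab = sorted({t for text in docs.values() for t in text.split()})

# incidence[term] = 0/1 vector over documents (one entry per doc ID)
incidence = {
    t: [1 if t in docs[d].split() else 0 for d in sorted(docs)]
    for t in vocab
}

def boolean_and(*terms):
    """AND query: bitwise-AND the incidence vectors, return matching doc IDs."""
    result = [1] * len(docs)
    for t in terms:
        vec = incidence.get(t, [0] * len(docs))
        result = [a & b for a, b in zip(result, vec)]
    return [d for d, bit in zip(sorted(docs), result) if bit]

print(boolean_and("home", "sales", "july"))  # → [2, 3]
```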
Bigger Collections
- consider a text collection of N = 1 million documents, each with
  about 1000 words (tokens)
- on average, we might have 6 bytes/word (including spaces and
  punctuation)
  ↳ 6 GB of data in the documents
- our vocabulary would consist of about |V| = 500,000 distinct
  words (terms)
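The 6 GB figure follows directly from the numbers above (all values taken from the notes):

```python
# Back-of-the-envelope size estimate for the toy collection.
N_docs = 1_000_000        # documents in the collection
words_per_doc = 1000      # tokens per document
bytes_per_word = 6        # incl. spaces and punctuation (an average)

total_bytes = N_docs * words_per_doc * bytes_per_word
print(total_bytes / 1e9)  # → 6.0, i.e. 6 GB of raw text
```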
Matrix Is Getting Very Big
- 500K × 1M = half a trillion 0/1 entries
- but the matrix is very sparse: mostly zeros
  ↳ per column, there is a maximum of 1000 1's compared to 500K words
- i.e., a better representation: only record the 1's!
Inverted Index
- for each term (word) t, we store a list of all documents that
  contain t
- identify each document by a document ID: a serial document number
- we need variable-size posting lists
  ↳ on disk, a continuous run of postings is best
  ↳ in memory, we can use linked lists or variable-length arrays
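With an inverted index, an AND query becomes an intersection (merge) of two sorted postings lists. A sketch of the standard two-pointer walk (the example postings lists are made up):

```python
def intersect(p1, p2):
    """Intersect two sorted docID lists in O(len(p1) + len(p2)) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the pointer on the smaller docID
        else:
            j += 1
    return answer

print(intersect([1, 2, 4, 11, 31], [2, 31, 54]))  # → [2, 31]
```

Because both lists are sorted, each pointer only moves forward, so the merge is linear in the total postings length rather than quadratic.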
Inverted Index Construction
- Indexer step 1: Token sequence
  ↳ exact sequence of <normalized token, document ID> pairs
- Indexer step 2: Sort
- Indexer step 3: Dictionary and Postings
  ↳ multiple term entries in a single document are merged
  ↳ split into dictionary and postings
  ↳ document frequency information is added
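The three indexer steps above can be sketched as follows (the toy documents are invented; a real indexer would also normalize tokens, e.g. case-folding and stemming):

```python
# Step 1-3 of inverted index construction on a toy collection.

docs = {
    1: "i did enact julius caesar",
    2: "so let it be with caesar",
}

# Step 1: token sequence — (normalized token, docID) pairs
pairs = [(tok, doc_id) for doc_id, text in docs.items()
         for tok in text.split()]

# Step 2: sort by term, then by docID
pairs.sort()

# Step 3: merge duplicate entries, split into dictionary and postings;
# the dictionary keeps the document frequency (df) of each term
dictionary = {}   # term -> df (number of documents containing the term)
postings = {}     # term -> sorted list of docIDs
for term, doc_id in pairs:
    plist = postings.setdefault(term, [])
    if not plist or plist[-1] != doc_id:   # merge repeats within a doc
        plist.append(doc_id)
        dictionary[term] = dictionary.get(term, 0) + 1

print(postings["caesar"])    # → [1, 2]
print(dictionary["caesar"])  # → 2
```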