Integrating the Probabilistic Model BM25/BM25F into Lucene

Joaquín Pérez-Iglesias[1], José R. Pérez-Agüera[2], Víctor Fresno[1] and Yuval Z. Feinstein[3]

[1] NLP&IR Group, Universidad Nacional de Educación a Distancia, Spain
[2] University of North Carolina at Chapel Hill, USA
[3] Answers Corporation, Jerusalem 91481, Israel

arXiv:0911.5046v2 [cs.IR] 1 Dec 2009

Abstract. This document describes the BM25 and BM25F implementation using the Lucene Java Framework. The implementation described here can be downloaded from [Pérez-Iglesias 08a]. Both models have stood out at TREC for their performance and are considered state-of-the-art in the IR community. BM25 is applied to retrieval on plain-text documents, that is, documents that do not contain fields, while BM25F is applied to documents with structure.



Introduction

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search. Lucene is scalable, offers high-performance indexing, and has become one of the most widely used search engine libraries in both academia and industry [Lucene 09].

Lucene's ranking function, the core of any search engine, determines how relevant a document is to a given query. It is built on a combination of the Vector Space Model (VSM) and the Boolean model of Information Retrieval. The main idea behind the Lucene approach is that the more times a query term appears in a document, relative to the number of times the term appears in the whole collection, the more relevant that document will be to the query [Lucene 09]. Lucene also uses the Boolean model to first narrow down the documents that need to be scored, based on the boolean logic in the query specification.

In this paper, the implementation of the BM25 probabilistic model and its extension for semi-structured IR, BM25F, is described in detail.

One of Lucene's main constraints to wider adoption by the IR community is the lack of implementations of different retrieval models. Our goal with this work is to offer the IR community a more advanced ranking model that can be compared with other IR software, such as Terrier, Lemur, CLAIRlib or Xapian.

Motivation

There exist previous implementations of alternative Information Retrieval models for Lucene. The most representative case is the Language Model implementation[4] from the Intelligent Systems Lab Amsterdam. Another example is described in [Doron 07], where Lucene is compared with the Juru system; in this case, Lucene's document length normalization is changed in order to improve the performance of the Lucene ranking function.

BM25 has been widely used by IR researchers and engineers to improve search engine relevance, so from our point of view, a BM25/BM25F implementation is necessary to make Lucene more attractive to the IR community.


Included Models

The developed models are based on the information that can be found in [Robertson 07]. More specifically, the implemented ranking functions are as follows:


BM25
R(q, d) = \sum_{t \in q} \frac{occurs_t^d}{k_1 \left( (1 - b) + b \, \frac{l_d}{avl_d} \right) + occurs_t^d}

where occurs_t^d is the term frequency of t in d; l_d is the length of document d; avl_d is the average document length across the collection; k_1 is a free parameter usually chosen as 2; and b ∈ [0, 1] (usually 0.75). Setting b to 0 is equivalent to avoiding length normalisation, so document length will not affect the final score; setting b to 1 carries out full length normalisation.
The classical inverse document frequency is computed as follows:

idf(t) = \log \frac{N - df(t) + 0.5}{df(t) + 0.5}

where N is the number of documents in the collection and df(t) is the number of documents in which the term t appears.

A different version of this formula, as can be found on Wikipedia[5], multiplies the obtained BM25 weight by the constant (k_1 + 1) in order to normalise the weight of a term with a frequency of 1 occurring in a document of average length.
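A quick numerical check of the idf formula and of the (k_1 + 1) variant just mentioned; as before, this is a sketch with illustrative names, not code from the described implementation:

```java
// Sketch of the classical BM25 idf and the (k1 + 1) variant above.
public class Bm25Idf {
    // n: number of documents in the collection; df: documents containing t.
    static double idf(long n, long df) {
        return Math.log((n - df + 0.5) / (df + 0.5));
    }

    // Term-saturation component, with lenRatio = l_d / avl_d.
    static double saturation(double occurs, double lenRatio, double k1, double b) {
        return occurs / (k1 * ((1 - b) + b * lenRatio) + occurs);
    }

    public static void main(String[] args) {
        // Rarer terms receive a higher idf:
        System.out.println(Bm25Idf.idf(1000, 5) > Bm25Idf.idf(1000, 500)); // true
        // Multiplying by (k1 + 1) makes the weight of a term with
        // frequency 1 in an average-length document exactly 1:
        double k1 = 2.0, b = 0.75;
        System.out.println((k1 + 1) * saturation(1, 1.0, k1, b)); // 1.0
    }
}
```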


BM25F

First we obtain the accumulated weight of a term over all fields, as follows:
[4] http://ilps.science.uva.nl/resources/lm-lucene
[5] http://en.wikipedia.org/wiki/Probabilistic_relevance_model_(BM25)
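The preview ends before the accumulation formula itself. As a hedged sketch only: in the standard BM25F formulation of [Robertson 07], each field's term frequency is length-normalised per field, scaled by a field boost, and summed into a single pseudo-frequency. The field names, boosts and parameter values below are assumptions for illustration, not taken from the described implementation:

```java
// Sketch of the standard BM25F accumulated term weight (per-field
// length normalisation and boost, summed across fields). All names
// and numbers here are illustrative assumptions.
public class Bm25fAccum {
    static double fieldContribution(double occursInField, double fieldLen,
                                    double avgFieldLen, double bField,
                                    double boost) {
        double lengthNorm = (1 - bField) + bField * (fieldLen / avgFieldLen);
        return boost * occursInField / lengthNorm;
    }

    public static void main(String[] args) {
        // A term appearing twice in a boosted "title" field and once in "body":
        double accumulated =
              fieldContribution(2, 10, 8, 0.6, 2.0)    // title, boost 2.0
            + fieldContribution(1, 200, 150, 0.7, 1.0); // body, boost 1.0
        System.out.println(accumulated);
    }
}
```

The accumulated value then plays the role of occurs_t^d in a BM25-style saturation, which is how BM25F combines fields without saturating each field independently.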