Integrating the Probabilistic Model BM25/BM25F into Lucene

Joaquín Pérez-Iglesias[1], José R. Pérez-Agüera[2], Víctor Fresno[1] and Yuval Z. Feinstein[3]

[1] NLP&IR Group, Universidad Nacional de Educación a Distancia, Spain
[2] University of North Carolina at Chapel Hill, USA
[3] Answers Corporation, Jerusalem 91481, Israel

arXiv:0911.5046v2 [cs.IR] 1 Dec 2009

Abstract. This document describes the BM25 and BM25F implementation using the Lucene Java Framework. The implementation described here can be downloaded from [Pérez-Iglesias 08a]. Both models have stood out at TREC for their performance and are considered state-of-the-art in the IR community. BM25 is applied to retrieval on plain-text documents, that is, documents that do not contain fields, while BM25F is applied to documents with structure.



Introduction

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search. Lucene is scalable, offers high-performance indexing, and has become one of the most widely used search engine libraries in both academia and industry [Lucene 09].

Lucene's ranking function, the core of any search engine, determines how relevant a document is to a given query. It is built on a combination of the Vector Space Model (VSM) and the Boolean model of Information Retrieval. The main idea behind the Lucene approach is that the more times a query term appears in a document, relative to the number of times the term appears in the whole collection, the more relevant that document will be to the query [Lucene 09]. Lucene also uses the Boolean model to first narrow down the documents that need to be scored, based on the boolean logic in the query specification.

In this paper, the implementation of the BM25 probabilistic model and its extension for semi-structured IR, BM25F, is described in detail.

One of Lucene's main constraints to wider adoption by the IR community is the lack of implementations of different retrieval models. Our goal with this work is to offer the IR community a more advanced ranking model that can be compared with other IR software, such as Terrier, Lemur, CLAIRlib or Xapian.

Motivation

There exist previous implementations of alternative Information Retrieval models for Lucene. The most representative case is the Language Model implementation[4] from the Intelligent Systems Lab Amsterdam. Another example is described in [Doron 07], where Lucene is compared with the Juru system; in this case, Lucene's document length normalization is changed in order to improve the performance of the Lucene ranking function.

BM25 has been widely used by IR researchers and engineers to improve search engine relevance, so from our point of view, a BM25/BM25F implementation is necessary to make Lucene more attractive to the IR community.


Included Models

The developed models are based on the information that can be found in [Robertson 07]. More specifically, the implemented ranking functions are as follows:


BM25
R(q, d) = \sum_{t \in q} \frac{occurs_t^d}{k_1 \left( (1 - b) + b \, \frac{l_d}{avl_d} \right) + occurs_t^d}

where occurs_t^d is the term frequency of t in d; l_d is the length of document d; avl_d is the average document length across the collection; k_1 is a free parameter usually chosen as 2; and b ∈ [0, 1] (usually 0.75). Setting b to 0 is equivalent to avoiding length normalisation, so document length will not affect the final score; setting b to 1 carries out full length normalisation.
The classical inverse document frequency is computed as follows:

idf(t) = \log \frac{N - df(t) + 0.5}{df(t) + 0.5}

where N is the number of documents in the collection and df(t) is the number of documents in which the term t appears.

A different version of this formula, as can be found on Wikipedia[5], multiplies the obtained BM25 weight by the constant (k_1 + 1) in order to normalise the weight of a term with a frequency of 1 occurring in a document of average length.
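A quick numerical check of the idf formula and of the (k_1 + 1) variant just mentioned; as before, this is a sketch with illustrative names, not code from the described implementation:

```java
// Sketch of the classical BM25 idf and the (k1 + 1) variant above.
public class Bm25Idf {
    // n: number of documents in the collection; df: documents containing t.
    static double idf(long n, long df) {
        return Math.log((n - df + 0.5) / (df + 0.5));
    }

    // Term-saturation component, with lenRatio = l_d / avl_d.
    static double saturation(double occurs, double lenRatio, double k1, double b) {
        return occurs / (k1 * ((1 - b) + b * lenRatio) + occurs);
    }

    public static void main(String[] args) {
        // Rarer terms receive a higher idf:
        System.out.println(Bm25Idf.idf(1000, 5) > Bm25Idf.idf(1000, 500)); // true
        // Multiplying by (k1 + 1) makes the weight of a term with
        // frequency 1 in an average-length document exactly 1:
        double k1 = 2.0, b = 0.75;
        System.out.println((k1 + 1) * saturation(1, 1.0, k1, b)); // 1.0
    }
}
```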


BM25F

First we obtain the accumulated weight of a term over all fields, as follows:
[4] http://ilps.science.uva.nl/resources/lm-lucene
[5] http://en.wikipedia.org/wiki/Probabilistic_relevance_model_(BM25)
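The preview ends before the accumulation formula itself. As a hedged sketch only: in the standard BM25F formulation of [Robertson 07], each field's term frequency is length-normalised per field, scaled by a field boost, and summed into a single pseudo-frequency. The field names, boosts and parameter values below are assumptions for illustration, not taken from the described implementation:

```java
// Sketch of the standard BM25F accumulated term weight (per-field
// length normalisation and boost, summed across fields). All names
// and numbers here are illustrative assumptions.
public class Bm25fAccum {
    static double fieldContribution(double occursInField, double fieldLen,
                                    double avgFieldLen, double bField,
                                    double boost) {
        double lengthNorm = (1 - bField) + bField * (fieldLen / avgFieldLen);
        return boost * occursInField / lengthNorm;
    }

    public static void main(String[] args) {
        // A term appearing twice in a boosted "title" field and once in "body":
        double accumulated =
              fieldContribution(2, 10, 8, 0.6, 2.0)    // title, boost 2.0
            + fieldContribution(1, 200, 150, 0.7, 1.0); // body, boost 1.0
        System.out.println(accumulated);
    }
}
```

The accumulated value then plays the role of occurs_t^d in a BM25-style saturation, which is how BM25F combines fields without saturating each field independently.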