Integrating the Probabilistic Model
BM25/BM25F into Lucene.

Joaquín Pérez-Iglesias¹, José R. Pérez-Agüera², Víctor Fresno¹ and
Yuval Z. Feinstein³

¹ NLP&IR Group, Universidad Nacional de Educación a Distancia, Spain
² University of North Carolina at Chapel Hill, USA
³ Answers Corporation, Jerusalem 91481, Israel

arXiv:0911.5046v2 [cs.IR] 1 Dec 2009

Abstract. This document describes the BM25 and BM25F implementation
using the Lucene Java Framework. The implementation described here can
be downloaded from [Pérez-Iglesias 08a]. Both models have stood out at
TREC for their performance and are considered state of the art in the IR
community. BM25 is applied to retrieval on plain text documents, that is,
documents that do not contain fields, while BM25F is applied to documents
with structure.

Introduction

Apache Lucene is a high-performance and full-featured text search engine library
written entirely in Java. It is a technology suitable for nearly any application
that requires full-text search. Lucene is scalable and offers high-performance
indexing, and has become one of the most used search engine libraries in both
academia and industry [Lucene 09].
Lucene's ranking function, the core of any search engine, determines how
relevant a document is to a given query. It is built on a combination of the
Vector Space Model (VSM) and the Boolean model of Information Retrieval.
The main idea behind Lucene's approach is that the more times a query term
appears in a document relative to the number of times the term appears in the
whole collection, the more relevant that document is to the query [Lucene 09].
Lucene also uses the Boolean model to first narrow down the documents that
need to be scored, based on the Boolean logic in the query specification.
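
The two-stage idea described above (Boolean narrowing, then vector-space-style scoring) can be illustrated with a toy sketch. This is not Lucene's actual scoring formula or API: the collection, the function names and the particular tf-idf weighting are all assumptions made for illustration, and document frequency is used as a simple stand-in for how often a term appears in the collection.

```python
import math
from collections import Counter

# Toy collection, made up for illustration only.
DOCS = {
    "d1": "lucene is a search library written in java",
    "d2": "bm25 is a probabilistic ranking model",
    "d3": "lucene ranking combines the vector space model and the boolean model",
}


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def boolean_filter(docs, required_terms):
    """Boolean stage: keep only documents that contain every required term."""
    return {
        doc_id: text
        for doc_id, text in docs.items()
        if all(term in tokenize(text) for term in required_terms)
    }


def tf_idf_score(query_terms, text, docs):
    """Scoring stage: reward terms frequent in the document but rare in the collection."""
    counts = Counter(tokenize(text))
    n_docs = len(docs)
    score = 0.0
    for term in query_terms:
        df = sum(1 for other in docs.values() if term in tokenize(other))
        if df == 0:
            continue  # term absent from the collection contributes nothing
        score += counts[term] * math.log(1 + n_docs / df)
    return score


def search(docs, required_terms, query_terms):
    """Filter with Boolean logic first, then rank the survivors by score."""
    candidates = boolean_filter(docs, required_terms)
    return sorted(
        candidates,
        key=lambda doc_id: tf_idf_score(query_terms, candidates[doc_id], docs),
        reverse=True,
    )
```

Calling `search(DOCS, ["lucene"], ["lucene", "model"])` scores only the documents that contain "lucene"; the document that also mentions "model" twice ranks first.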
This paper describes in detail the implementation of the BM25 probabilistic
model and of BM25F, its extension for semi-structured IR.
One of the main obstacles to wider adoption of Lucene in the IR community is
the lack of implementations of different retrieval models. Our goal with this
work is to offer the IR community a more advanced ranking model that can be
compared with other IR software, such as Terrier, Lemur, CLAIRlib or Xapian.

Motivation

There are previous implementations of alternative Information Retrieval models
for Lucene. The most representative is the Language Model implementation⁴
from the Intelligent Systems Lab Amsterdam. Another example is described in
[Doron 07], where Lucene is compared with the Juru system. In that case,
Lucene's document length normalization is changed in order to improve the
performance of the Lucene ranking function.
BM25 has been widely used by IR researchers and engineers to improve search
engine relevance, so from our point of view a BM25/BM25F implementation for
Lucene is necessary to make Lucene more popular in the IR community.

Included Models

The developed models are based on the information in [Robertson 07].
More specifically, the implemented ranking functions are as follows:

BM25
$$R(q, d) = \sum_{t \in q} \frac{occurs_t^d}{k_1 \left( (1 - b) + b \, \frac{l_d}{avl_d} \right) + occurs_t^d}$$

where $occurs_t^d$ is the term frequency of $t$ in $d$; $l_d$ is the length of
document $d$; $avl_d$ is the average document length across the collection;
$k_1$ is a free parameter, usually set to 2; and $b \in [0, 1]$ (usually 0.75).
Setting $b = 0$ disables length normalisation, so the document length does not
affect the final score. Setting $b = 1$ applies full length normalisation.
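
As a minimal sketch of the ranking function above, the following Python code computes the per-term weight and sums it over the query terms. The function and parameter names are hypothetical, not taken from the described implementation, and documents are represented as plain token lists for simplicity.

```python
def bm25_term_weight(occurs, doc_len, avg_doc_len, k1=2.0, b=0.75):
    """Per-term weight: occurs / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + occurs)."""
    # Length normalisation factor; b interpolates between none (0) and full (1).
    norm = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    return occurs / (norm + occurs)


def bm25_score(query_terms, doc_terms, avg_doc_len, k1=2.0, b=0.75):
    """R(q, d): sum of the per-term weights over the query terms."""
    doc_len = len(doc_terms)
    return sum(
        bm25_term_weight(doc_terms.count(term), doc_len, avg_doc_len, k1, b)
        for term in query_terms
    )


# With b = 0 the document length has no effect on the weight.
short_b0 = bm25_term_weight(3, doc_len=50, avg_doc_len=100, b=0.0)
long_b0 = bm25_term_weight(3, doc_len=200, avg_doc_len=100, b=0.0)

# With b = 1 the longer document is penalised relative to the shorter one.
short_b1 = bm25_term_weight(3, doc_len=50, avg_doc_len=100, b=1.0)
long_b1 = bm25_term_weight(3, doc_len=200, avg_doc_len=100, b=1.0)
```

Here `short_b0` and `long_b0` are equal, while `long_b1` is smaller than `short_b1`, which matches the behaviour of $b$ described above.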
The classical inverse document frequency is computed as follows:

$$idf(t) = \log \frac{N - df(t) + 0.5}{df(t) + 0.5}$$

where $N$ is the number of documents in the collection and $df(t)$ is the number
of documents in which the term $t$ appears.
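
The inverse document frequency can be sketched in a few lines of Python. The function name is hypothetical, and the natural logarithm is an assumption, since the text does not state the base.

```python
import math


def idf(n_docs, df):
    """Classical idf: log((N - df + 0.5) / (df + 0.5)); natural log assumed."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))


rare = idf(1000, 10)         # term in 10 of 1000 documents: large positive value
common = idf(1000, 100)      # term in 100 of 1000 documents: smaller positive value
very_common = idf(1000, 600) # term in more than half the documents: negative value
```

Rarer terms receive a higher idf. Note that with this form of the formula the value becomes negative for a term that appears in more than half of the documents.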
A different version of this formula, found on Wikipedia⁵, multiplies the BM25
weight by the constant $(k_1 + 1)$, so that a term with frequency 1 in a
document of average length receives a weight of 1.
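
The effect of the $(k_1 + 1)$ factor can be checked with a short worked example, assuming a term frequency of 1 and a document of exactly average length ($l_d = avl_d$):

```latex
% Worked check, assuming occurs_t^d = 1 and l_d = avl_d (average-length document)
\[
\frac{1}{k_1\left((1-b) + b\cdot\frac{avl_d}{avl_d}\right) + 1}
  = \frac{1}{k_1 + 1},
\qquad
(k_1 + 1)\cdot\frac{1}{k_1 + 1} = 1 .
\]
```

With the usual $k_1 = 2$, the weight is $1/3$ without the factor and $1$ with it, independently of $b$.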

BM25F

First we obtain the accumulated weight of a term over all fields as follows:
⁴ http://ilps.science.uva.nl/resources/lm-lucene
⁵ http://en.wikipedia.org/wiki/Probabilistic_relevance_model_(BM25)
