3 The Answer Extractor System

Once a natural-language query triggers our QA system (QA-SYSTEM), it is sent to Google in order to retrieve a small number of snippets (usually 30), which are then normalized and cleaned of math symbols and HTML tags. Next, the system performs query analysis in order to determine the EAT by simple Wh-keyword matching. If the EAT is a location, it triggers our answer extraction module based on the acquisition of distributional syntactic patterns (SCV-AE).

3.1 Answer Extraction by Acquiring Syntactic Patterns

First, the answer extractor (SCV-AE) extracts WEB ENTITIES from the retrieved snippets. A WEB ENTITY is a stream of words in which every word of the sequence starts with a capital letter, for instance "Robbie Williams" or "London". Then, negative evidence is used for filtering the WEB ENTITIES according to a list of banned WEB ENTITIES, which consists of banned words, query terms, and words that usually start with a capital letter in web snippets (e.g., page, home, link). This rule for distinguishing WEB ENTITIES is considered part of the innate knowledge.

From here, the set of sentences (delimited by punctuation marks) and the WEB ENTITIES are passed on to our Automatic Annotator (AA), which returns at most three ranked answers and the sentences where they occur. The annotated sentences are used by SCV-AE for updating the syntactic context vectors and for computing the value of the likelihood L for every WEB ENTITY. The learning strategy is thus a synergy between the annotator and the answer extractor. The ranking of answers returned by AA is used only as a baseline.

AA is based on a strategy for extracting answers to questions that aim at a location as the answer. It measures the similarity between the query and each sentence by aligning characters of the query in each sentence. In addition, AA validates each WEB ENTITY using a lexical database of locations (WordNet). This strategy can identify answers if and only if they are in the lexical database.

Let Q be the set of all questions that triggered the QA-SYSTEM and aimed at the same EAT, and let A be the set of answers to the questions in Q. Each component φ_i of the syntactic context vectors of the EAT of Q is given by:

    \phi^l_i(EAT) = \sum_{\forall A_j \in A} freq(w_i, A_j)
    \phi^r_i(EAT) = \sum_{\forall A_j \in A} freq(A_j, w_i)

where freq(w_i, A_j) is the frequency with which w_i occurs immediately to the left of A_j, so that the sum over all A_j ∈ A gives the frequency of w_i to the left of the EAT, and freq(A_j, w_i) is the homologue to the right. The vectors φ^l(EAT) and φ^r(EAT) thus capture the role of the EAT in its local context. For simplicity's sake, φ^l and φ^r refer to the syntactic context vectors φ^l(EAT) and φ^r(EAT), respectively. For our illustrative example, φ^l(LOCATION) and φ^r(LOCATION) are shown in Table 3. Note that φ^r is the null vector, as no word occurs to the right of the EAT LOCATION. Then, the Syntactic Likelihood of an answer A' is computed as follows:

    L(A') = \phi^l \cdot \phi^l(A') + \phi^r \cdot \phi^r(A')
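To make the pipeline above concrete, the following is a minimal sketch in Python. It is not the authors' implementation: the Wh-keyword table, the banned list, the whitespace tokenization, and the toy data are all our own illustrative assumptions. It shows the Wh-keyword EAT lookup, WEB ENTITY extraction with banned-list filtering, the accumulation of the left and right syntactic context vectors φ^l and φ^r from annotated (question, sentence, answer) tuples, and the dot-product likelihood L(A').

```python
import re
from collections import Counter

# Simple Wh-keyword matching for the EAT (illustrative assumption).
WH_TO_EAT = {"where": "LOCATION", "who": "PERSON", "when": "DATE"}

# Banned WEB ENTITIES: words that usually start with a capital letter in
# web snippets; the paper's examples are page, home and link.
BANNED = {"Page", "Home", "Link"}

def expected_answer_type(question):
    """Return the EAT by matching the question's Wh-keyword, if any."""
    return WH_TO_EAT.get(question.lower().split()[0])

def extract_web_entities(snippet, query_terms=()):
    """A WEB ENTITY is a maximal run of words that each start with a capital letter."""
    entities = re.findall(r"[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*", snippet)
    # Negative evidence: drop entities containing banned words or query terms.
    banned = BANNED | {t.capitalize() for t in query_terms}
    return [e for e in entities if not any(w in banned for w in e.split())]

def add_contexts(phi_l, phi_r, sentence, answer):
    """Accumulate freq(w_i, A_j) and freq(A_j, w_i): the words occurring
    immediately to the left and to the right of an answer occurrence."""
    tokens, ans = sentence.split(), answer.split()
    for i in range(len(tokens) - len(ans) + 1):
        if tokens[i:i + len(ans)] == ans:
            if i > 0:
                phi_l[tokens[i - 1]] += 1
            if i + len(ans) < len(tokens):
                phi_r[tokens[i + len(ans)]] += 1

def likelihood(phi_l, phi_r, cand_l, cand_r):
    """Syntactic Likelihood L(A') = phi^l . phi^l(A') + phi^r . phi^r(A')."""
    return (sum(phi_l[w] * c for w, c in cand_l.items())
            + sum(phi_r[w] * c for w, c in cand_r.items()))

# Toy run on invented data: learn phi^l / phi^r for LOCATION from a single
# annotated tuple, then score the candidates found in a new snippet sentence.
print(expected_answer_type("Where was Robbie Williams born?"))  # LOCATION
phi_l, phi_r = Counter(), Counter()
add_contexts(phi_l, phi_r, "Lennon was shot in New York", "New York")

sentence = "Robbie Williams was born in Stoke-on-Trent"
for cand in extract_web_entities(sentence, query_terms=("Robbie", "Williams")):
    cand_l, cand_r = Counter(), Counter()
    add_contexts(cand_l, cand_r, sentence, cand)
    print(cand, likelihood(phi_l, phi_r, cand_l, cand_r))  # Stoke-on-Trent 1
```

With this single training tuple, all the mass of φ^l sits on the word "in", so the surviving candidate "Stoke-on-Trent", which also follows "in", scores 1, while the query terms never compete because they were filtered out as negative evidence.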

Using Syntactic Distributional Patterns for
Data-Driven Answer Extraction from the Web

Alejandro Figueroa¹ and John Atkinson²,⋆

¹ Deutsches Forschungszentrum für Künstliche Intelligenz - DFKI,
  Stuhlsatzenhausweg 3, D-66123, Saarbrücken, Germany
² Department of Computer Sciences, Universidad de Concepción, Concepción, Chile

⋆ This research is sponsored by FONDECYT, Chile, under grant number 1040469, "Un Modelo Evolucionario de Descubrimiento de Conocimiento Explicativo desde Textos con Base Semántica con Implicaciones para el Análisis de Inteligencia."


Abstract. In this work, a data-driven approach for extracting answers from web snippets is presented. Answers are identified by matching contextual distributional patterns of the expected answer type (EAT) and answer candidates. These distributional patterns are learnt directly from previously annotated tuples {question, sentence, answer}, and the learning mechanism is based on the principles of language acquisition. Results show that this linguistically motivated data-driven approach is encouraging.

Keywords: Natural Language Processing, Question Answering.


1 Introduction

The increasing amount of information on the Web has forced search engines to deal with huge amounts of data, as users have become retrievers of all sorts of content. Nowadays, search engines do not focus only on retrieving relevant documents for a user's particular request; they also provide other services (e.g., Group Search, News Search, Glossary). The growing complexity of users' requests has thus directed research towards Question Answering (QA) systems, which aim to answer natural language (NL) questions posed by users by searching for the answer in a set of available documents on the Web. QA is a challenging task due to the ambiguity of language and the complexity of the linguistic phenomena found in NL documents.

Typical questions are those that look for named entities as answers (e.g., locations, persons, dates, organizations). Nevertheless, QA systems are not restricted to these kinds of questions; they also try to deal with more complex ones that may require demanding reasoning tasks while the system is looking for the answer [11].
Usually, QA systems start by analyzing the query [4,7] in order to determine the EAT. The EAT allows the QA system to narrow the search space [8] while it ranks documents, sentences, or sequences of words in which the answer is
supposed to be. This set of likely answers is called the answer candidates. In this last step of the zooming process, the QA system must decide which are the most suitable answers for the triggering query. The extraction and ranking of answer candidates is traditionally based on frequency counting, pattern matching, and detecting different orderings of the query words, called paraphrases [6,7,8]. Answer extraction modules attempt to take advantage of the redundancy provided by different information sources; this redundancy significantly increases the probability of finding a paraphrase in which the answer can be readily identified.
Normally, QA systems extract these paraphrases at the sentence level [10]. The rules for identifying paraphrases can be written manually or learnt automatically [6,10], and they can consist of pre-parsed trees [10] or simple string-based manipulations [6]. In general, paraphrases are learnt by retrieving sentences that contain previously annotated question-answer pairs. For example, in [10], anchor terms (e.g., "Lennon 1980") are sent to the Web in order to retrieve sentences that contain query and answer terms. Patterns are then extracted from this set of sentences, with their likelihood being proportional to their redundancy on the Web [7]. In most cases, the new set of retrieved sentences is matched against the paraphrases in order to extract new answers. At the same time, a huge set of paraphrases [6] considerably decreases the need for deep linguistic processing such as anaphora or synonym resolution; in some cases, it reduces extraction to pattern matching by means of regular expressions [10]. As a result, strategies based on paraphrases tend to perform better when questions aim for a named entity as an answer (locations, names, organizations), but they perform poorly when they aim for noun phrases [10].
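As a rough illustration of such surface pattern matching (a toy re-writing of our own, not one of the learnt patterns of [6] or [10]), a paraphrase like "<QUERY> was born in <ANSWER>" can be compiled into a regular expression whose capture group yields the answer candidate:

```python
import re

def extract_by_pattern(query_term, sentence):
    # Toy learnt re-writing "<QUERY> was born in <ANSWER>": the capture
    # group grabs a run of capitalized words as the answer candidate.
    pattern = re.compile(re.escape(query_term)
                         + r"\s+was\s+born\s+in\s+([A-Z][\w-]*(?:\s+[A-Z][\w-]*)*)")
    match = pattern.search(sentence)
    return match.group(1) if match else None

print(extract_by_pattern("Robbie Williams",
                         "Robbie Williams was born in Stoke-on-Trent in 1974."))
# -> Stoke-on-Trent
```

The example also hints at why such patterns favour named entities: capitalization delimits the answer cleanly, whereas an arbitrary noun phrase offers no such surface boundary.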
Due to the huge number of possible paraphrases, statistical methods are also used for extracting answers. In [5], a strategy for answering questions is learnt directly from data. It conceives the answer extraction problem as a binary classification problem in which text snippets are labelled as correct or incorrect; the classifier is based on a set of features ranging from lexical n-grams to parse trees. The major problem of statistically based approaches is that they frequently return inexact answers, which usually consist of substrings of the answer, the answer surrounded by some context words, or strings very close to the answer.
Nevertheless, it is still unclear how each technique contributes to dealing with the linguistic phenomena that QA systems face while searching for the answer. One solution may involve a trade-off between rule-based systems and easily re-trainable data-driven ones. In [10], a strategy for combining the output of different kinds of answer extractors is introduced. This re-ranker is based on a Maximum Entropy linear classifier trained on a set of 48 different types of features, such as the ranking in the answer extraction modules, redundancy, and negative feedback. Results show that a good strategy for combining answer extractors based on different underlying strategies can significantly improve the overall performance of QA systems [11].
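A minimal sketch of this kind of combination, with invented feature names and hand-set weights standing in for what Maximum Entropy training over labelled data would produce (our illustration, not the 48-feature re-ranker of [10]):

```python
# Hand-set weights (our assumption): a trained linear model, such as the
# Maximum Entropy classifier described above, would learn these from data.
WEIGHTS = {"extractor_rank": -0.8, "redundancy": 1.2, "negative_feedback": -1.5}

def rerank(candidates):
    """candidates: (answer, features) pairs; return them by descending linear score."""
    def score(features):
        return sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
    return sorted(candidates, key=lambda c: score(c[1]), reverse=True)

# Toy data: 'London' ranks lower inside its module but is far more redundant.
print(rerank([("London", {"extractor_rank": 2, "redundancy": 5}),
              ("Paris",  {"extractor_rank": 1, "redundancy": 1})]))
# London scores -0.8*2 + 1.2*5 = 4.4 and beats Paris at -0.8*1 + 1.2*1 = 0.4
```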
Strategies based on paraphrases aim to find a re-writing of the query within the text in which the answer is easily identified. Their main drawback is that whenever the answer occurs in a context that does not match any re-writing rule, it will
