
versations useful to construct and update a model of the user’s interests, goals and level of understanding. From a QA point of view, the main goal of the dialogue component is to provide users with a friendly interface to build their requests. A typical scenario would start this way:

— System: Hi, how can I help you?

— User: I would like to know what books Roald Dahl wrote.

The query sentence “what books Roald Dahl wrote” is thus extracted and handed to the QA module. In a second phase, the dialogue module is responsible for providing the answer to the user once the QA module has generated it. The dialogue manager consults the UM to decide on the most suitable formulation of the answer (e.g. short sentences) and produce the final answer accordingly, e.g.:

— System: Roald Dahl wrote many books for kids and adults, including: “The Witches”, “Charlie and the Chocolate Factory”, and “James and the Giant Peach”.

4.2 Retrieval

Document retrieval We retrieve the top 20 documents returned by Google for each query produced via query expansion. These are processed in the following steps, which progressively narrow the part of the text containing relevant information.

Keyphrase extraction Once the documents are retrieved, we perform keyphrase extraction to determine their three most relevant topics using Kea (Witten et al., 1999), an extractor based on Naïve Bayes classification.

Estimation of reading levels To adapt the readability of the results to the user, we estimate the reading difficulty of the retrieved documents using the Smoothed Unigram Model (Collins-Thompson and Callan, 2004), which proceeds in


two phases. 1) In the training phase, sets of representative documents are collected for a given number of reading levels. Then, a unigram language model is created for each set, i.e. a list of (word stem, probability) entries for the words appearing in its documents. Our models account for the following reading levels: poor (suitable for ages 7–

ter is assigned a score equal to the maximum score of the documents composing it. This allows us to rank not only documents, but also clusters, and to present results grouped by cluster in decreasing order of document score.

Answer presentation We present our answers in an HTML page, where results are listed follow-