SECTION 5 GIVES THE RESULTS OF A NUMBER OF EXPERI-KNOWLEDGE ABOUT THE...

4.1 Querying the Web

We use a Web-mining algorithm that considers the number of pages retrieved by the search engine. In contrast, qualitative approaches to Web mining (e.g. Brill et al., 2001) analyze the document content and, as a result, consider only a relatively small number of pages. For information retrieval we used the AltaVista search engine. Its advanced syntax allows the use of operators that implement the idea of validation patterns introduced in Section 2. Queries are composed using the NEAR, OR and AND boolean operators. The NEAR operator searches for pages where two words appear within a distance of no more than 10 tokens: it is used to put together the question and the answer sub-patterns in a single validation pattern. The OR operator introduces variations in the word order and verb forms. Finally, the AND operator is used as an alternative to NEAR, allowing more distance among pattern elements.

If the question sub-pattern does not return any document, or returns fewer documents than a certain threshold (experimentally set to 7), the question pattern is relaxed by cutting one word; in this way a new query is formulated and submitted to the search engine. This is repeated until no more words can be cut or the number of returned documents becomes higher than the threshold. Pattern relaxation is performed using word-ignoring rules in a specified order. Such rules, for instance, ignore the focus of the question, because it is unlikely to occur in a validation fragment; ignore adverbs and adjectives, because they are less significant; and ignore nouns belonging to the WordNet classes “abstraction”, “psychological feature” or “group”, because they usually specify finer details and human attitudes. Names, numbers and measures are preferred over all the lower-case

\[ P(x) = \frac{\mathit{hits}(x)}{\mathit{MaxPages}} \]

where hits(x) is the number of pages in the Web where x appears and MaxPages is the maximum number of pages that can be returned by the search engine. We set this constant experimentally. However, in two of the formulas we use (i.e. Pointwise Mutual Information and Corrected Conditional Probability) MaxPages may be ignored.

The joint probability P(Qsp, Asp) is calculated by means of the validation pattern probability:

\[ P(\mathit{Qsp}, \mathit{Asp}) = P(\mathit{Qsp}\ \mathrm{NEAR}\ \mathit{Asp}) \]

We have tested three alternative measures to estimate the degree of relevance of Web searches: Pointwise Mutual Information, Maximal Likelihood Ratio and Corrected Conditional Probability, a variant of Conditional Probability which considers the asymmetry of the question-answer relation. Each measure provides an answer validity score: high values are interpreted as strong evidence that the validation pattern is consistent. This is a clue to the fact that the Web pages where this pattern appears contain validation fragments, which imply answer accuracy.

Pointwise Mutual Information (PMI) (Manning and Schütze, 1999) has been widely used to find co-occurrence in large corpora.

\[ \mathit{PMI}(\mathit{Qsp}, \mathit{Asp}) = \frac{P(\mathit{Qsp}, \mathit{Asp})}{P(\mathit{Qsp}) \cdot P(\mathit{Asp})} \]

PMI(Qsp, Asp) is used as a clue to the internal coherence of the question-answer validation pattern QAp. Substituting the probabilities in the PMI formula with the previously introduced Web statistics, we obtain:
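In code, this substitution and the relaxation loop of Section 4.1 might look like the following minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation. The probability of a pattern is taken to be its page count divided by the maximum number of pages the engine can return. The names `relax_question_pattern`, `pmi_from_hits`, `count_hits`, `drop_order` and `max_pages` are hypothetical, and `count_hits` stands in for a real search-engine call. Whether a count exactly equal to the threshold triggers relaxation is not stated in the text, so the `<` comparison is a guess.

```python
from typing import Callable, List, Sequence, Tuple

HITS_THRESHOLD = 7  # threshold value reported in the text


def relax_question_pattern(
    words: Sequence[str],
    drop_order: Sequence[str],
    count_hits: Callable[[List[str]], int],
    threshold: int = HITS_THRESHOLD,
) -> Tuple[List[str], int]:
    """Cut one word at a time until enough pages are returned.

    `drop_order` lists the words in the order the word-ignoring rules
    would discard them (hypothetical input; the rules themselves are
    not modelled here). `count_hits` is a stand-in for the search engine.
    """
    current = list(words)
    pending = [w for w in drop_order if w in current]
    hits = count_hits(current)
    # Stop when the count reaches the threshold or nothing is left to cut.
    while hits < threshold and pending and len(current) > 1:
        current.remove(pending.pop(0))
        hits = count_hits(current)
    return current, hits


def pmi_from_hits(
    hits_q: int, hits_a: int, hits_q_near_a: int, max_pages: int
) -> float:
    """PMI as a ratio of page-count probabilities, P(x) = hits(x) / max_pages."""
    if hits_q == 0 or hits_a == 0:
        return 0.0  # assumption: score an unattested sub-pattern as 0
    p_q = hits_q / max_pages
    p_a = hits_a / max_pages
    p_qa = hits_q_near_a / max_pages  # joint probability via the NEAR query
    return p_qa / (p_q * p_a)


if __name__ == "__main__":
    # Fake page counts keyed by pattern length, for demonstration only.
    fake_counts = {3: 2, 2: 120}
    pattern, hits = relax_question_pattern(
        ["river", "longest", "Africa"],
        ["river", "longest"],
        lambda ws: fake_counts[len(ws)],
    )
    print(pattern, hits)
    print(pmi_from_hits(hits_q=100, hits_a=200, hits_q_near_a=50, max_pages=1000))
```

Because `max_pages` enters the PMI score only as a constant factor, it does not change the ranking of candidate answers for the same question, which is consistent with the remark that the constant may be ignored for this measure.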