- "Inexact": if some data is missing from or added to the answer (criteria 3 & 4).
- "Unsupported": if the answer is not supported via other documents (criterion 5).

The main challenge of most QA systems is to retrieve chunks of 50 bytes, called "short answers", or of 250 bytes, called "long answers", as required by the TREC QA Track [10]. In order to allow automated evaluation of these answers, each question is paired with "answer patterns" and "supporting document identifiers". There are therefore two main types of evaluation, namely "lenient" and "strict" evaluation. "Lenient" evaluation uses only the answer patterns, without the supporting document identifiers, and hence does not ensure that the retrieved document actually states the answer. "Strict" evaluation, on the other hand, uses the answer patterns together with the supporting document identifiers.
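To make the distinction concrete, the following Python sketch shows how a single system response could be judged under the two schemes. The answer patterns, document identifiers and response values are hypothetical placeholders, not actual TREC judgement data.

import re

# Hypothetical judgement data for one question (TREC-style): regular-expression
# answer patterns plus the identifiers of documents known to support the answer.
ANSWER_PATTERNS = [r"\bmount\s+everest\b"]
SUPPORTING_DOC_IDS = {"LA070189-0123", "FT934-5418"}

def judge(answer_string, source_doc_id, strict=False):
    """Lenient: the returned string matches one of the answer patterns.
    Strict: in addition, the document the answer was drawn from must be one of
    the supporting documents, so the answer is actually stated there."""
    if not any(re.search(p, answer_string, re.IGNORECASE) for p in ANSWER_PATTERNS):
        return False
    return source_doc_id in SUPPORTING_DOC_IDS if strict else True

print(judge("Mount Everest", "XIE19990325.0123", strict=False))  # True  (lenient)
print(judge("Mount Everest", "XIE19990325.0123", strict=True))   # False (document not supporting)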
There are several evaluation metrics that differ from one QA campaign to another (e.g. TREC, CLEF, NTCIR). Moreover, some researchers develop and utilize their own customized metrics. The measures that recur most often in automated evaluation, and throughout the experimental results reported below, are the mean reciprocal rank (MRR), precision, recall, accuracy and the F-measure.
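The MRR reported in several of the tables below can be computed as in the short sketch that follows, assuming each question contributes the reciprocal rank of its first correct answer; the judged answer lists are illustrative.

def mean_reciprocal_rank(judged_runs):
    """judged_runs: one list of booleans per question, giving the correctness of
    that question's ranked answers (e.g. the top five answers returned in the
    early TREC QA tracks)."""
    total = 0.0
    for judgements in judged_runs:
        for rank, is_correct in enumerate(judgements, start=1):
            if is_correct:
                total += 1.0 / rank   # only the first correct answer counts
                break
    return total / len(judged_runs)

# Three toy questions: answered at rank 1, answered at rank 3, and not answered.
print(mean_reciprocal_rank([[True, False], [False, False, True], [False]]))
# -> (1.0 + 1/3 + 0.0) / 3, i.e. about 0.444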
4.2 Proposed Research Contributions, Experiments and Limitations

• Moldovan et al. (LASSO) [8], 1999:

Contribution
Their research relied on NLP techniques in novel ways to find answers in large collections of documents. The question was processed by combining syntactic information with semantic information that characterized the question (e.g. question type or question focus), and eight heuristic rules were defined to extract the keywords used for identifying the answer. The research also introduced paragraph indexing, where the retrieved documents were first filtered into paragraphs and then ordered.

Experimental environment and results
The experimental environment was composed of the 200 questions of the TREC8 corpus, in which all questions were classified according to a hierarchy of Q-subclasses.

Table 6: Experimental Results – Moldovan et al.
                          Answers in top 5   MRR score (strict)
Short answer (50-bytes)   68.1%              55.5%
Long answer (250-bytes)   77.7%              64.5%

Limitations
A question was considered to be answered correctly only if the answer was among the top five ranked long answers. Although this was not considered a problem at that time, starting from TREC-2002 all QA systems were required to provide only one answer.

• Harabagiu et al. (FALCON) [16], 2000:

The same developers of LASSO [8] continued their work and proposed another QA system, called FALCON, which adapted the architecture of LASSO and was characterized by additional features and components. They generated a retrieval model for boosting knowledge in the answer engine by using WordNet for semantic processing of the questions. Also, in order to overcome the main limitation of LASSO, they provided a justification option that rules out erroneous answers so that only one answer is returned.

The experiments were held on the TREC9 corpus, whose questions and document collection were larger than those of TREC8 and of a higher degree of difficulty. The experimental results of FALCON outperformed those of LASSO, which proved that the added features had enhanced the preceding model.

Table 7: Experimental Results – Harabagiu et al.
                          MRR score (lenient)   MRR score (strict)
Short answer (50-bytes)   59.9%                 58.0%
Long answer (250-bytes)   77.8%                 76.0%
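FALCON's WordNet-based semantic processing is only summarized above; as a rough illustration of how such lexico-semantic knowledge can broaden retrieval, the sketch below expands question keywords with WordNet synonyms via NLTK. The keyword list and the expansion policy are assumptions for illustration, not FALCON's actual implementation.

# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

def expand_keywords(keywords, max_synonyms=3):
    """Attach a few WordNet synonyms to each question keyword so the
    retrieval step can match documents that use different wording."""
    expanded = {}
    for word in keywords:
        synonyms = []
        for synset in wn.synsets(word):
            for lemma in synset.lemma_names():
                lemma = lemma.replace("_", " ").lower()
                if lemma != word and lemma not in synonyms:
                    synonyms.append(lemma)
        expanded[word] = synonyms[:max_synonyms]
    return expanded

# Example: expand the content words of "Who invented the telephone?"
print(expand_keywords(["invented", "telephone"]))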
• Gaizauskas and Humphreys (QA-LaSIE) [20], 2000:

The research presented an approach based on linking an IR system with an NLP system that performed linguistic analysis. The IR system treated the question as a query and returned a set of ranked documents or passages. The NLP system parsed the question and analyzed the returned documents or passages, yielding a semantic representation for each. A privileged query variable within the semantic representation of the question was then instantiated against the semantic representations of the analyzed documents to discover the answer.

Their proposed approach was evaluated in the TREC8 QA Track. They tested the system with two different IR engines under different environments, and the best achieved results were as follows:

Table 8: Experimental Results – Gaizauskas & Humphreys
                           Precision   Recall
Short answers (50-bytes)   26.67%      16.67%
Long answers (250-bytes)   53.33%      33.33%

The overall success of the approach was limited, as only two-thirds of the test-set questions were parsed. Also, QA-LaSIE employed a small business-domain ontology although the system was intended to be general (open-domain).

• Hermjakob [18], 2001:

The research showed that parsing improved dramatically when the Penn Treebank training corpus was enriched with an additional Questions Treebank, in which the parse trees were semantically enriched to facilitate question–answer matching. The research also described the hierarchical structure of the different answer types ("Qtargets") into which questions were classified.

In the first two test runs, the system was trained on 2000 and 3000 Wall Street Journal (WSJ) sentences (the enriched Penn Treebank). In the third and fourth runs, the parser was trained on the same WSJ sentences augmented by 38 treebanked pre-TREC8 questions. For the fifth run, the 200 TREC8 questions were added as training sentences, with testing on the TREC9 questions. In the final run, the TREC8 and TREC9 questions were divided into five subsets of about 179 questions each, and the system was trained on 2000 WSJ sentences plus 975 questions.

Table 9: Experimental Results – Hermjakob
Penn sentences   Added Q.   Labeled precision   Labeled recall   Tagging accuracy   Qtarget acc. (strict)   Qtarget acc. (lenient)
2000             0          83.47%              82.49%           94.65%             63.0%                   65.5%
3000             0          84.74%              84.16%           94.51%             65.3%                   67.4%
2000             38         91.20%              89.37%           97.63%             85.9%                   87.2%
3000             38         91.52%              90.09%           97.29%             86.4%                   87.8%
2000             238        94.16%              93.39%           98.46%             91.9%                   93.1%
2000             975        95.71%              95.45%           98.83%             96.1%                   97.3%

• Radev et al. (NSIR) [15], 2002:

They presented a probabilistic method for Web-based Natural Language Question Answering, called probabilistic phrase reranking (PPR). Their web-based system, NSIR, utilized a flat taxonomy of 17 classes, and two methods were used to classify the questions: a machine learning approach using a decision tree classifier, and a heuristic rule-based approach.

The system was evaluated on the 200 questions from TREC8, on which it achieved a total reciprocal document rank of 0.20. The accuracy of question classification was greatly improved by the heuristics: with machine learning, the training error rate was around 20% and the test error rate reached 30%, while the training error of the heuristic approach never exceeded 8% and its testing error was around 18%.

The PPR approach did not achieve the expected promising results, owing to simplistic sentence segmentation, part-of-speech (POS) tagging and text chunking. Also, their QA system did not reformulate the query submitted by the user.

• Ravichandran & Hovy [22], 2002:

They presented a method that learns patterns from online data using some seed questions and answer anchors, without needing human annotation.

Using the TREC10 question set, two sets of experiments were performed. In the first, the TREC corpus was used as the input source, employing an IR component of their QA system; in the second, the web was used as the input source, with the AltaVista search engine performing the retrieval.

Table 10: Experimental Results – Ravichandran & Hovy
Question Type   No. of questions   MRR on TREC docs   MRR on the web
BIRTHYEAR       8                  48%                69%
INVENTOR        6                  17%                58%
DISCOVERER      4                  13%                88%
DEFINITION      102                34%                39%
WHY-FAMOUS      3                  33%                0%
LOCATION        16                 75%                86%

The method only worked for question types that have fixed anchors, such as "where was X born". It therefore performed badly on general definitional questions, since the patterns did not handle long-distance dependencies.
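In the spirit of this bootstrapping idea, the sketch below derives surface patterns from text snippets that contain both a seed question term and its known answer. The snippets and placeholder names are invented for illustration; the authors' actual pipeline (large-scale web harvesting and pattern precision scoring) is considerably more elaborate.

import re

def learn_patterns(snippets, question_term, answer_term):
    """Turn sentences containing both seed anchors into surface answer
    patterns by replacing the anchors with placeholders."""
    patterns = set()
    for text in snippets:
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if question_term in sentence and answer_term in sentence:
                pattern = sentence.replace(question_term, "<NAME>")
                pattern = pattern.replace(answer_term, "<ANSWER>")
                patterns.add(pattern.strip())
    return patterns

# Hypothetical seed for a BIRTHYEAR-type question.
snippets = [
    "Mozart was born in 1756 in Salzburg.",
    "The composer Mozart (1756-1791) wrote over 600 works.",
]
for p in learn_patterns(snippets, "Mozart", "1756"):
    print(p)
# e.g. "<NAME> was born in <ANSWER> in Salzburg."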
• Li & Roth [17], 2002:

Their main contribution was proposing a hierarchical taxonomy in which questions were classified and answers were identified upon that taxonomy. Li and Roth used and tested a machine learning technique called SNoW in order to classify questions into the coarse and fine classes of the taxonomy. They also showed, through another experiment, the differences between hierarchical and flat classification of a question.

Their experiments used about 5500 questions, divided into five training datasets of different sizes (1000, 2000, 3000, 4000 and 5500 questions) and collected from four different sources. These datasets were used to train their classifier, which was then tested on 500 other questions collected from TREC10. Their experimental results proved that the question classification problem can be solved quite accurately using a learning approach.

The research did not consider or test other machine learning classifiers that could have achieved more accurate results than SNoW, and at the same time it did not provide any reason for choosing SNoW in particular over other machine learning algorithms.

• Zhang and Lee [19], 2003:

This research addressed the limitation of the aforementioned research [17] and carried out a comparison between five machine learning algorithms: Support Vector Machines (SVM), Nearest Neighbors (NN), Naïve Bayes (NB), Decision Tree (DT) and Sparse Network of Winnows (SNoW). Furthermore, they proposed a special kernel function, called the tree kernel, that was computed efficiently by dynamic programming to enable the SVM to take advantage of the syntactic structures of questions, which were helpful for question classification.

Under the same experimental environment used by Li and Roth [17], all learning algorithms were trained on the five training datasets of different sizes and were then tested on the TREC10 questions. The experimental results proved that the SVM outperformed the four other methods in classifying questions under both the coarse-grained categories (Table 11) and the fine-grained categories (Table 12). Question classification performance was measured by accuracy, i.e. the proportion of correctly classified questions among all test questions.

Table 11: Experimental Results (coarse-grained) – Zhang & Lee
Algorithm   1000     2000     3000     4000     5500
NN          70.0%    73.6%    74.8%    74.8%    75.6%
NB          53.8%    60.4%    74.2%    76.0%    77.4%
DT          78.8%    79.8%    82.0%    83.4%    84.2%
SNoW        71.8%    73.4%    74.2%    78.2%    66.8%
SVM         76.8%    83.4%    87.2%    87.4%    85.8%

Table 12: Experimental Results (fine-grained) – Zhang & Lee
Algorithm   1000     2000     3000     4000     5500
NN          57.4%    62.8%    65.2%    67.2%    68.4%
NB          48.8%    52.8%    56.6%    56.2%    58.4%
DT          67.0%    70.0%    73.6%    75.4%    77.0%
SNoW        42.2%    66.2%    69.0%    66.6%    74.0%
SVM         68.0%    75.0%    77.2%    77.4%    80.2%
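The tree-kernel idea can be illustrated with a Collins–Duffy style subtree kernel computed by memoised dynamic programming over pairs of nodes, the general family to which Zhang and Lee's kernel belongs. The toy question parse trees and the decay parameter below are illustrative assumptions, not the authors' exact formulation.

from functools import lru_cache

# A parse tree is a (label, child, child, ...) tuple; leaves are plain strings.
Q1 = ("SBARQ", ("WHNP", ("WP", "who")), ("SQ", ("VBD", "invented"), ("NP", ("NN", "radio"))))
Q2 = ("SBARQ", ("WHNP", ("WP", "who")), ("SQ", ("VBD", "discovered"), ("NP", ("NN", "radium"))))

def nodes(tree):
    """All non-leaf nodes of a tree."""
    result = [tree]
    for child in tree[1:]:
        if isinstance(child, tuple):
            result.extend(nodes(child))
    return result

def production(node):
    """The grammar production at a node: label -> child labels or words."""
    return (node[0],) + tuple(c[0] if isinstance(c, tuple) else c for c in node[1:])

def tree_kernel(t1, t2, decay=0.5):
    """Count (down-weighted) common subtree fragments of two parse trees,
    using memoised dynamic programming over node pairs."""
    @lru_cache(maxsize=None)
    def common(n1, n2):
        if production(n1) != production(n2):
            return 0.0
        score = decay
        for c1, c2 in zip(n1[1:], n2[1:]):
            if isinstance(c1, tuple) and isinstance(c2, tuple):
                score *= 1.0 + common(c1, c2)
        return score
    return sum(common(a, b) for a in nodes(t1) for b in nodes(t2))

# Structurally similar questions get a non-zero kernel value even though
# their words differ, which is what the SVM exploits for classification.
print(tree_kernel(Q1, Q2))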
• Xu et al. [23], 2003:

For definitional QA, they adopted a hybrid approach that used various complementary components, including information retrieval and various linguistic and extraction tools such as name finding, parsing, co-reference resolution, proposition extraction, relation extraction and extraction of structured patterns.

They performed three runs, using the F-metric for evaluation. In the first run, BBN2003A, the web was not used in answer finding. In the second run, BBN2003B, answers to factoid questions were found using both the TREC corpus and the web, while answers to list questions were found as in BBN2003A. Finally, BBN2003C was the same as BBN2003B, except that if the answer to a factoid question was found multiple times in the corpus, its score was boosted.

Table 13: Experimental Results (Definitional QA) – Xu et al.
            BBN2003A   BBN2003B   BBN2003C   Baseline