- "Inexact": if some data is missing from or added to the answer (criteria 3 & 4).
- "Unsupported": if the answer is not supported via other documents (criterion 5).

The main challenge of most QA systems is to retrieve chunks of 50 bytes, called "short answers", or of 250 bytes, called "long answers", as required by the TREC QA Track [10]. In order to allow automated evaluation of these answers, each question is paired with "answer patterns" and "supporting document identifiers". There are therefore two main types of evaluation, namely "lenient" and "strict" evaluation. "Lenient" evaluation uses only the answer patterns, without the supporting document identifiers, and hence does not ensure that the retrieved document actually states the answer. "Strict" evaluation, on the other hand, uses the answer patterns together with the supporting document identifiers.
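To make the distinction concrete, the following Python sketch shows how a single system response could be judged under the two schemes. The answer patterns, document identifiers and response values are hypothetical placeholders, not actual TREC judgement data.

import re

# Hypothetical judgement data for one question (TREC-style): regular-expression
# answer patterns plus the identifiers of documents known to support the answer.
ANSWER_PATTERNS = [r"\bmount\s+everest\b"]
SUPPORTING_DOC_IDS = {"LA070189-0123", "FT934-5418"}

def judge(answer_string, source_doc_id, strict=False):
    """Lenient: the returned string matches one of the answer patterns.
    Strict: in addition, the document the answer was drawn from must be one of
    the supporting documents, so the answer is actually stated there."""
    if not any(re.search(p, answer_string, re.IGNORECASE) for p in ANSWER_PATTERNS):
        return False
    return source_doc_id in SUPPORTING_DOC_IDS if strict else True

print(judge("Mount Everest", "XIE19990325.0123", strict=False))  # True  (lenient)
print(judge("Mount Everest", "XIE19990325.0123", strict=True))   # False (document not supporting)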
There are several evaluation metrics that differ from one QA campaign to another (e.g. TREC, CLEF, NTCIR). Moreover, some researchers develop and utilize their own customized metrics. The measures that recur most often in automated evaluation, and throughout the experimental results reported below, are the mean reciprocal rank (MRR), precision, recall, accuracy and the F-measure.
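The MRR reported in several of the tables below can be computed as in the short sketch that follows, assuming each question contributes the reciprocal rank of its first correct answer; the judged answer lists are illustrative.

def mean_reciprocal_rank(judged_runs):
    """judged_runs: one list of booleans per question, giving the correctness of
    that question's ranked answers (e.g. the top five answers returned in the
    early TREC QA tracks)."""
    total = 0.0
    for judgements in judged_runs:
        for rank, is_correct in enumerate(judgements, start=1):
            if is_correct:
                total += 1.0 / rank   # only the first correct answer counts
                break
    return total / len(judged_runs)

# Three toy questions: answered at rank 1, answered at rank 3, and not answered.
print(mean_reciprocal_rank([[True, False], [False, False, True], [False]]))
# -> (1.0 + 1/3 + 0.0) / 3, i.e. about 0.444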
4.2 Proposed Research Contributions, Experiments and Limitations

• Moldovan et al. (LASSO) [8], 1999:

Contribution
Their research relied on NLP techniques in novel ways to find answers in large collections of documents. The question was processed by combining syntactic information with semantic information that characterized the question (e.g. question type or question focus), and eight heuristic rules were defined to extract the keywords used for identifying the answer. The research also introduced paragraph indexing, where the retrieved documents were first filtered into paragraphs and then ordered.

Experimental environment and results
The experimental environment was composed of the 200 questions of the TREC8 corpus, in which all questions were classified according to a hierarchy of Q-subclasses.

Table 6: Experimental Results – Moldovan et al.
                          Answers in top 5   MRR score (strict)
Short answer (50-bytes)   68.1%              55.5%
Long answer (250-bytes)   77.7%              64.5%

Limitations
A question was considered to be answered correctly only if the answer was among the top five ranked long answers. Although this was not considered a problem at that time, starting from TREC-2002 all QA systems were required to provide only one answer.

• Harabagiu et al. (FALCON) [16], 2000:

The same developers of LASSO [8] continued their work and proposed another QA system, called FALCON, which adapted the architecture of LASSO and was characterized by additional features and components. They generated a retrieval model for boosting knowledge in the answer engine by using WordNet for semantic processing of the questions. Also, in order to overcome the main limitation of LASSO, they provided a justification option that rules out erroneous answers so that only one answer is returned.

The experiments were held on the TREC9 corpus, whose questions and document collection were larger than those of TREC8 and of a higher degree of difficulty. The experimental results of FALCON outperformed those of LASSO, which proved that the added features had enhanced the preceding model.

Table 7: Experimental Results – Harabagiu et al.
                          MRR score (lenient)   MRR score (strict)
Short answer (50-bytes)   59.9%                 58.0%
Long answer (250-bytes)   77.8%                 76.0%
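FALCON's WordNet-based semantic processing is only summarized above; as a rough illustration of how such lexico-semantic knowledge can broaden retrieval, the sketch below expands question keywords with WordNet synonyms via NLTK. The keyword list and the expansion policy are assumptions for illustration, not FALCON's actual implementation.

# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

def expand_keywords(keywords, max_synonyms=3):
    """Attach a few WordNet synonyms to each question keyword so the
    retrieval step can match documents that use different wording."""
    expanded = {}
    for word in keywords:
        synonyms = []
        for synset in wn.synsets(word):
            for lemma in synset.lemma_names():
                lemma = lemma.replace("_", " ").lower()
                if lemma != word and lemma not in synonyms:
                    synonyms.append(lemma)
        expanded[word] = synonyms[:max_synonyms]
    return expanded

# Example: expand the content words of "Who invented the telephone?"
print(expand_keywords(["invented", "telephone"]))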
• Gaizauskas and Humphreys (QA-LaSIE) [20], 2000:

The research presented an approach based on linking an IR system with an NLP system that performed linguistic analysis. The IR system treated the question as a query and returned a set of ranked documents or passages. The NLP system parsed the question and analyzed the returned documents or passages, yielding a semantic representation for each. A privileged query variable within the semantic representation of the question was then instantiated against the semantic representations of the analyzed documents to discover the answer.

Their proposed approach was evaluated in the TREC8 QA Track. They tested the system with two different IR engines under different environments, and the best achieved results were as follows:

Table 8: Experimental Results – Gaizauskas & Humphreys
                           Precision   Recall
Short answers (50-bytes)   26.67%      16.67%
Long answers (250-bytes)   53.33%      33.33%

The overall success of the approach was limited, as only two-thirds of the test-set questions were parsed. Also, QA-LaSIE employed a small business-domain ontology although the system was intended to be general (open-domain).

• Hermjakob [18], 2001:

The research showed that parsing improved dramatically when the Penn Treebank training corpus was enriched with an additional Questions Treebank, in which the parse trees were semantically enriched to facilitate question–answer matching. The research also described the hierarchical structure of the different answer types ("Qtargets") into which questions were classified.

In the first two test runs, the system was trained on 2000 and 3000 Wall Street Journal (WSJ) sentences (the enriched Penn Treebank). In the third and fourth runs, the parser was trained on the same WSJ sentences augmented by 38 treebanked pre-TREC8 questions. For the fifth run, the 200 TREC8 questions were added as training sentences, with testing on the TREC9 questions. In the final run, the TREC8 and TREC9 questions were divided into five subsets of about 179 questions each, and the system was trained on 2000 WSJ sentences plus 975 questions.

Table 9: Experimental Results – Hermjakob
Penn sentences   Added Q.   Labeled precision   Labeled recall   Tagging accuracy   Qtarget acc. (strict)   Qtarget acc. (lenient)
2000             0          83.47%              82.49%           94.65%             63.0%                   65.5%
3000             0          84.74%              84.16%           94.51%             65.3%                   67.4%
2000             38         91.20%              89.37%           97.63%             85.9%                   87.2%
3000             38         91.52%              90.09%           97.29%             86.4%                   87.8%
2000             238        94.16%              93.39%           98.46%             91.9%                   93.1%
2000             975        95.71%              95.45%           98.83%             96.1%                   97.3%

• Radev et al. (NSIR) [15], 2002:

They presented a probabilistic method for Web-based Natural Language Question Answering, called probabilistic phrase reranking (PPR). Their web-based system, NSIR, utilized a flat taxonomy of 17 classes, and two methods were used to classify the questions: a machine learning approach using a decision tree classifier, and a heuristic rule-based approach.

The system was evaluated on the 200 questions from TREC8, on which it achieved a total reciprocal document rank of 0.20. The accuracy of question classification was greatly improved by the heuristics: with machine learning, the training error rate was around 20% and the test error rate reached 30%, while the training error of the heuristic approach never exceeded 8% and its testing error was around 18%.

The PPR approach did not achieve the expected promising results, owing to simplistic sentence segmentation, part-of-speech (POS) tagging and text chunking. Also, their QA system did not reformulate the query submitted by the user.

• Ravichandran & Hovy [22], 2002:

They presented a method that learns patterns from online data using some seed questions and answer anchors, without needing human annotation.

Using the TREC10 question set, two sets of experiments were performed. In the first, the TREC corpus was used as the input source, employing an IR component of their QA system; in the second, the web was used as the input source, with the AltaVista search engine performing the retrieval.

Table 10: Experimental Results – Ravichandran & Hovy
Question Type   No. of questions   MRR on TREC docs   MRR on the web
BIRTHYEAR       8                  48%                69%
INVENTOR        6                  17%                58%
DISCOVERER      4                  13%                88%
DEFINITION      102                34%                39%
WHY-FAMOUS      3                  33%                0%
LOCATION        16                 75%                86%

The method only worked for question types that have fixed anchors, such as "where was X born". It therefore performed badly on general definitional questions, since the patterns did not handle long-distance dependencies.
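In the spirit of this bootstrapping idea, the sketch below derives surface patterns from text snippets that contain both a seed question term and its known answer. The snippets and placeholder names are invented for illustration; the authors' actual pipeline (large-scale web harvesting and pattern precision scoring) is considerably more elaborate.

import re

def learn_patterns(snippets, question_term, answer_term):
    """Turn sentences containing both seed anchors into surface answer
    patterns by replacing the anchors with placeholders."""
    patterns = set()
    for text in snippets:
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if question_term in sentence and answer_term in sentence:
                pattern = sentence.replace(question_term, "<NAME>")
                pattern = pattern.replace(answer_term, "<ANSWER>")
                patterns.add(pattern.strip())
    return patterns

# Hypothetical seed for a BIRTHYEAR-type question.
snippets = [
    "Mozart was born in 1756 in Salzburg.",
    "The composer Mozart (1756-1791) wrote over 600 works.",
]
for p in learn_patterns(snippets, "Mozart", "1756"):
    print(p)
# e.g. "<NAME> was born in <ANSWER> in Salzburg."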
• Li & Roth [17], 2002:

Their main contribution was proposing a hierarchical taxonomy in which questions were classified and answers were identified upon that taxonomy. Li and Roth used and tested a machine learning technique called SNoW in order to classify questions into the coarse and fine classes of the taxonomy. They also showed, through another experiment, the differences between hierarchical and flat classification of a question.

Their experiments used about 5500 questions, divided into five training datasets of different sizes (1000, 2000, 3000, 4000 and 5500 questions) and collected from four different sources. These datasets were used to train their classifier, which was then tested on 500 other questions collected from TREC10. Their experimental results proved that the question classification problem can be solved quite accurately using a learning approach.

The research did not consider or test other machine learning classifiers that could have achieved more accurate results than SNoW, and at the same time it did not provide any reason for choosing SNoW in particular over other machine learning algorithms.

• Zhang and Lee [19], 2003:

This research addressed the limitation of the aforementioned research [17] and carried out a comparison between five machine learning algorithms: Support Vector Machines (SVM), Nearest Neighbors (NN), Naïve Bayes (NB), Decision Tree (DT) and Sparse Network of Winnows (SNoW). Furthermore, they proposed a special kernel function, called the tree kernel, that was computed efficiently by dynamic programming to enable the SVM to take advantage of the syntactic structures of questions, which were helpful for question classification.

Under the same experimental environment used by Li and Roth [17], all learning algorithms were trained on the five training datasets of different sizes and were then tested on the TREC10 questions. The experimental results proved that the SVM outperformed the four other methods in classifying questions under both the coarse-grained categories (Table 11) and the fine-grained categories (Table 12). Question classification performance was measured by accuracy, i.e. the proportion of correctly classified questions among all test questions.

Table 11: Experimental Results (coarse-grained) – Zhang & Lee
Algorithm   1000     2000     3000     4000     5500
NN          70.0%    73.6%    74.8%    74.8%    75.6%
NB          53.8%    60.4%    74.2%    76.0%    77.4%
DT          78.8%    79.8%    82.0%    83.4%    84.2%
SNoW        71.8%    73.4%    74.2%    78.2%    66.8%
SVM         76.8%    83.4%    87.2%    87.4%    85.8%

Table 12: Experimental Results (fine-grained) – Zhang & Lee
Algorithm   1000     2000     3000     4000     5500
NN          57.4%    62.8%    65.2%    67.2%    68.4%
NB          48.8%    52.8%    56.6%    56.2%    58.4%
DT          67.0%    70.0%    73.6%    75.4%    77.0%
SNoW        42.2%    66.2%    69.0%    66.6%    74.0%
SVM         68.0%    75.0%    77.2%    77.4%    80.2%
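The tree-kernel idea can be illustrated with a Collins–Duffy style subtree kernel computed by memoised dynamic programming over pairs of nodes, the general family to which Zhang and Lee's kernel belongs. The toy question parse trees and the decay parameter below are illustrative assumptions, not the authors' exact formulation.

from functools import lru_cache

# A parse tree is a (label, child, child, ...) tuple; leaves are plain strings.
Q1 = ("SBARQ", ("WHNP", ("WP", "who")), ("SQ", ("VBD", "invented"), ("NP", ("NN", "radio"))))
Q2 = ("SBARQ", ("WHNP", ("WP", "who")), ("SQ", ("VBD", "discovered"), ("NP", ("NN", "radium"))))

def nodes(tree):
    """All non-leaf nodes of a tree."""
    result = [tree]
    for child in tree[1:]:
        if isinstance(child, tuple):
            result.extend(nodes(child))
    return result

def production(node):
    """The grammar production at a node: label -> child labels or words."""
    return (node[0],) + tuple(c[0] if isinstance(c, tuple) else c for c in node[1:])

def tree_kernel(t1, t2, decay=0.5):
    """Count (down-weighted) common subtree fragments of two parse trees,
    using memoised dynamic programming over node pairs."""
    @lru_cache(maxsize=None)
    def common(n1, n2):
        if production(n1) != production(n2):
            return 0.0
        score = decay
        for c1, c2 in zip(n1[1:], n2[1:]):
            if isinstance(c1, tuple) and isinstance(c2, tuple):
                score *= 1.0 + common(c1, c2)
        return score
    return sum(common(a, b) for a in nodes(t1) for b in nodes(t2))

# Structurally similar questions get a non-zero kernel value even though
# their words differ, which is what the SVM exploits for classification.
print(tree_kernel(Q1, Q2))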
• Xu et al. [23], 2003:

For definitional QA, they adopted a hybrid approach that used various complementary components, including information retrieval and various linguistic and extraction tools such as name finding, parsing, co-reference resolution, proposition extraction, relation extraction and extraction of structured patterns.

They performed three runs, using the F-metric for evaluation. In the first run, BBN2003A, the web was not used in answer finding. In the second run, BBN2003B, answers to factoid questions were found using both the TREC corpus and the web, while answers to list questions were found as in BBN2003A. Finally, BBN2003C was the same as BBN2003B, except that if the answer to a factoid question was found multiple times in the corpus, its score was boosted.

Table 13: Experimental Results (Definitional QA) – Xu et al.
            BBN2003A   BBN2003B   BBN2003C   Baseline