I have a large XML file with the following structure:
<brain>
  <q>
    <question> What are your hobbies? </question>
    <question> What do you do for sake of fun? </question>
    <question> How do you spend your spare time? </question>
    <question> What are your interests? </question>
    <question> What do you enjoy most? </question>
    <answer> I like [personal_info/hobby] </answer>
    <answer>[personal_info/hobby]</answer>
    <answer>I enjoy [personal_info/hobby] </answer>
  </q>
  <q>
    <question> Where do you live? </question>
    <question> What city do you live in? </question>
    <question> Where are you from? </question>
    <question> Where are you living? </question>
    <question> Where is your residence? </question>
    <answer> I live at [personal_info/loc] </answer>
    <answer> I am living in [personal_info/loc]</answer>
    <answer> At [personal_info/loc]</answer>
    <answer> [personal_info/loc]</answer>
  </q>
  .
  .
  .
</brain>
As you might have guessed, it is a database for a chatbot. The idea is that the user enters a question (or any sentence, for that matter) and our Java-based chatbot runs an XQuery over this file. The XQuery implementation I am using (Nux) provides fuzzy matching based on sentence similarity, so it can return sentences that only partially match. Here is some code to illustrate this:
Nodes results = XQueryUtil.xquery(doc, "declare namespace lucene = \"java:nux.xom.pool.FullTextUtil\"; "
+ "for $q in /brain/q "
+ " for $question in $q/question"
+ " let $score := lucene:match($question, \"How are you\") "
+ " where $score > 0.1 "
+ " order by $score descending "
+ "return $q/answer");
This code is supposed to loop through each /brain/q and then each q/question; if a question's similarity score is greater than 0.1, it should return the <answer> elements that belong to that same <q>. The problem is that it returns ALL <answer> elements. For example, if "What are your hobbies?" is asked, it should return
<answer> I like [personal_info/hobby] </answer>
<answer>[personal_info/hobby]</answer>
<answer>I enjoy [personal_info/hobby] </answer>
but instead it returns all the <answer> elements found in the file. It also repeats them over and over, an unpredictable number of times.
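To make the expected behavior concrete, here is a minimal sketch in plain Java, without Nux. The class name `ExpectedAnswers`, the `score` function, and the 0.5 threshold are placeholders I made up for illustration: `score` is a crude word-overlap stand-in for `lucene:match`, not its real scoring, so its values are not comparable to the 0.1 threshold in my query. The sketch visits each <q> once and returns that group's answers only if at least one of its questions matches:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ExpectedAnswers {

    // Two <q> groups taken from the file above, shortened.
    static final String SAMPLE =
          "<brain>"
        + "<q>"
        + "<question> What are your hobbies? </question>"
        + "<question> What are your interests? </question>"
        + "<answer> I like [personal_info/hobby] </answer>"
        + "<answer>[personal_info/hobby]</answer>"
        + "<answer>I enjoy [personal_info/hobby] </answer>"
        + "</q>"
        + "<q>"
        + "<question> Where do you live? </question>"
        + "<question> Where are you from? </question>"
        + "<answer> I live at [personal_info/loc] </answer>"
        + "<answer> [personal_info/loc]</answer>"
        + "</q>"
        + "</brain>";

    // Lower-case the text, drop punctuation, and split it into words.
    static String[] normalize(String s) {
        String cleaned = s.toLowerCase().replaceAll("[^a-z0-9 ]", " ").trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    // ASSUMPTION: stand-in for lucene:match, NOT the real Lucene score.
    // Returns the fraction of query words that also occur in the text.
    static double score(String text, String query) {
        Set<String> textWords = new HashSet<>(Arrays.asList(normalize(text)));
        String[] queryWords = normalize(query);
        if (queryWords.length == 0) {
            return 0.0;
        }
        int hits = 0;
        for (String word : queryWords) {
            if (textWords.contains(word)) {
                hits++;
            }
        }
        return (double) hits / queryWords.length;
    }

    // Returns the answers of every <q> that has at least one matching
    // question. Each <q> contributes its answers at most once.
    static List<String> answersFor(String xml, String userInput, double threshold) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        List<String> answers = new ArrayList<>();
        NodeList groups = doc.getElementsByTagName("q");
        for (int i = 0; i < groups.getLength(); i++) {
            Element q = (Element) groups.item(i);
            NodeList questions = q.getElementsByTagName("question");
            boolean matched = false;
            for (int j = 0; j < questions.getLength() && !matched; j++) {
                matched = score(questions.item(j).getTextContent(), userInput) > threshold;
            }
            if (matched) {
                NodeList found = q.getElementsByTagName("answer");
                for (int k = 0; k < found.getLength(); k++) {
                    answers.add(found.item(k).getTextContent().trim());
                }
            }
        }
        return answers;
    }

    public static void main(String[] args) {
        // Should list only the three hobby answers, each exactly once.
        System.out.println(answersFor(SAMPLE, "What are your hobbies?", 0.5));
    }
}
```

This is the grouping I am after: the match test decides whether a whole <q> qualifies, and the answers are emitted once per qualifying <q>, not once per matching question.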
Can you please help me with this?
The dataset was generated by running various scripts, then collected and manually checked by me. If necessary, I can change the structure of the XML to solve this problem, but I would prefer not to if possible.
Thanks for taking the time to read my question.