I'm using libxml2 to parse the following XML string:
<?xml version="1.0"?>
<note>
    <to>
        <name>Tove</name>
        <name>Tovi</name>
    </to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>
Formatted as a C-style string:
"<?xml version=\"1.0\"?><note><to><name>Tove</name><name>Tovi</name></to><from>Jani</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>"
This is based on the example from the W3C's site on XML; I only added the nested names in the "to" field.
I have the following recursive code in C++ to parse it into an object tree:
RBCXMLNode * RBCXMLDoc::recursiveProcess(xmlNodePtr node) {
    RBCXMLNode *rNode = new RBCXMLNode();
    xmlNodePtr childIterator = node->xmlChildrenNode;

    // Copy the node's name into a std::string.
    const char *chars = (const char *)(node->name);
    string name(chars);

    // Note: xmlNodeGetContent returns a newly allocated buffer;
    // it is never passed to xmlFree here.
    const char *content = (const char *)xmlNodeGetContent(node);

    rNode->setName(name);
    rNode->setUTF8Data(content);
    cout << "Just parsed " << rNode->name() << ": " << rNode->stringData() << endl;

    // Recurse into every child and attach the result to this node.
    while (childIterator != NULL) {
        RBCXMLNode *rNode2 = recursiveProcess(childIterator);
        rNode->addChild(rNode2);
        childIterator = childIterator->next;
    }
    return rNode;
}
So for each node it creates the matching object, sets its name and content, then recurses into its children. Note that each node is processed only once. However, I get the following output, which makes no sense to me:
Just parsed note: ToveToviJaniReminderDon't forget me this weekend!
Just parsed to: ToveTovi
Just parsed name: Tove
Just parsed text: Tove
Just parsed name: Tovi
Just parsed text: Tovi
Just parsed from: Jani
Just parsed text: Jani
Just parsed heading: Reminder
Just parsed text: Reminder
Just parsed body: Don't forget me this weekend!
Just parsed text: Don't forget me this weekend!
Note that each item appears to be parsed twice: once with the name "text" and once with the name it should have. The "note" root node is also having its data parsed, which is undesirable. Unlike the others, the root node is not parsed twice.
So I have two questions:
- How do I avoid parsing the root node's data, so that I get just its name and not its content? Presumably the same thing will happen with more deeply nested nodes.
- How do I avoid the duplicate parsing on the other nodes? Obviously, I want to keep the properly named versions, while maintaining the (unlikely) possibility that a node actually is named "text". Also, there may be duplicate nodes that are desired, so just checking to see if the node has been parsed already is not an option.
Thanks in advance.