XML Parsing results in Duplicates using libxml2

Question

I'm using libxml2 to parse the following XML string:

<?xml version="1.0"?>
<note>
    <to>
        <name>Tove</name>
        <name>Tovi</name>
    </to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>

Formatted as a C-style string:

"<?xml version=\"1.0\"?><note><to><name>Tove</name><name>Tovi</name></to><from>Jani</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>"

This is based on the example from the W3C's site on XML; I only added the nested names in the "to" field.

I have the following recursive code in C++ to parse it into an object tree:

RBCXMLNode * RBCXMLDoc::recursiveProcess(xmlNodePtr node) {
    RBCXMLNode *rNode = new RBCXMLNode();
    xmlNodePtr childIterator = node->xmlChildrenNode;

    const char *chars = (const char *)(node->name);
    string name(chars);
    const char *content = (const char *)xmlNodeGetContent(node);
    rNode->setName(name);
    rNode->setUTF8Data(content);
    cout << "Just parsed " << rNode->name() << ": " << rNode->stringData() << endl;
    while (childIterator != NULL) {
        RBCXMLNode *rNode2 = recursiveProcess(childIterator);
        rNode->addChild(rNode2);
        childIterator = childIterator->next;
    }
    return rNode;
}

So for each node it creates the matching object, sets its name and content, then recurses for its children. Note that each node is only processed once. However, I get the following (nonsensical, to me at least) output:

Just parsed note: ToveToviJaniReminderDon't forget me this weekend!
Just parsed to: ToveTovi
Just parsed name: Tove
Just parsed text: Tove
Just parsed name: Tovi
Just parsed text: Tovi
Just parsed from: Jani
Just parsed text: Jani
Just parsed heading: Reminder
Just parsed text: Reminder
Just parsed body: Don't forget me this weekend!
Just parsed text: Don't forget me this weekend!

Note that each item is being parsed twice: once with the name "text" and once with its proper name. The "note" root node is also having its data parsed, which is undesirable. Unlike the others, this root node is not parsed twice.

So I have two questions:

1. How do I avoid parsing the root node's data, and just have its name and not its content? This will presumably happen with more deeply nested nodes as well.
2. How do I avoid the duplicate parsing on the other nodes? Obviously, I want to keep the properly named versions, while maintaining the (unlikely) possibility that a node actually is named "text". Also, there may be duplicate nodes that are desired, so just checking to see if the node has been parsed already is not an option.

Thanks in advance.

Diego Sevilla · Accepted Answer · 2010-11-10 18:01:14Z

The main problem I see in your code is that you're calling xmlNodeGetContent() on every node. For an element, this returns all the text between its opening and closing tags, including the text of every descendant.

When parsing with libxml2, some nodes have complex content (child elements as well as text), so you cannot rely on xmlNodeGetContent() alone to retrieve a node's own content. You have to write the recursive function differently. For instance, the quickest fix would be to print only the node name for nodes that are not text (tested with xmlNodeIsText()), and to print the result of xmlNodeGetContent() only for nodes that are text. This would give you output something like:

Just parsed note
Just parsed to
Just parsed name
Just parsed text: Tove
Just parsed name
Just parsed text: Tovi
...

Note that now you print only the name for elements, and you print text only when the node is a text node.

This also makes sense conceptually: the content of a non-text node can be arbitrarily complex, so there is no single obvious way to print it, and its label (name) is all you can print directly. Text nodes, by contrast, are simple enough that you can print their content.
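
The traversal described in this answer can be sketched roughly as follows. This is a minimal illustration, not libxml2 code: Node, NodeType, text(), element(), sampleNote() and walk() are hypothetical stand-ins, chosen so the sketch compiles without the library. With libxml2 itself, the text test would be xmlNodeIsText(node), and the children would be reached through the node's children and next pointers rather than a vector.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a parsed XML node (not libxml2's xmlNode).
enum class NodeType { Element, Text };

struct Node {
    NodeType type;
    std::string name;            // element name, or "text" for text nodes
    std::string content;         // used by text nodes only
    std::vector<Node> children;  // libxml2 uses a linked list instead
};

Node text(const std::string& s) {
    return Node{NodeType::Text, "text", s, {}};
}

Node element(const std::string& name, std::vector<Node> kids) {
    return Node{NodeType::Element, name, "", std::move(kids)};
}

// The sample document from the question, built by hand.
Node sampleNote() {
    return element("note", {
        element("to", {element("name", {text("Tove")}),
                       element("name", {text("Tovi")})}),
        element("from", {text("Jani")}),
        element("heading", {text("Reminder")}),
        element("body", {text("Don't forget me this weekend!")}),
    });
}

// Log the name of each element and the content of each text node.
void walk(const Node& node, std::vector<std::string>& log) {
    if (node.type == NodeType::Text) {
        assert(node.children.empty());  // text nodes are leaves
        log.push_back("Just parsed text: " + node.content);
    } else {
        log.push_back("Just parsed " + node.name);
    }
    for (const Node& child : node.children) {
        walk(child, log);
    }
}
```

Calling walk(sampleNote(), log) fills log with one line per node, in document order. The decision is made on the node's type rather than its name, so an element that happens to be called "text" would still be logged as an element.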

6 Comments

jstm88 Over a year ago

That would work except that I do need to get all types of content; I need to be able to get the actual XML string for all types of values. So, for example, I may have a node that contains a floating point number; that needs to go in so I can then parse it in my code. Also, your method associates each element with the name "text", which doesn't work. I need to, for example, search for an object by its name, then retrieve its data.

Diego Sevilla Over a year ago

It was just an example. You have to test manually whether a node is "text" and then act accordingly (I just copied and edited your output). If you have floating-point values, you have to know where they are (the name of the tag, for example), because in XML they appear as text and you have to decode them. Note that you are processing everything, but the content of a complex node is itself complex, so you cannot treat it as a single string.

Pete Kirkham Over a year ago

@jfm429 XML is a text mark-up language. It has no concept of floating point numbers. Either you have <a>1.23</a> which is an element whose tag name is "a" with a single child text node containing the text "1.23", or you have an element with named attributes. I haven't used libxml2, but there does seem to be a name member in xmlNode.
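
Since a value such as 1.23 arrives only as text, the decoding step has to happen in application code. A minimal sketch of that step, assuming the text has already been extracted from the text node; parseDouble is a hypothetical helper name, and std::strtod is just one of several standard ways to do the conversion:

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Try to decode the text of an element such as <a>1.23</a> as a double.
// Returns false if the text is empty, is not a number, or has trailing
// characters after the number (including trailing whitespace).
// Note: std::strtod follows the current C locale's decimal separator.
bool parseDouble(const std::string& text, double* out) {
    assert(out != nullptr);
    if (text.empty()) {
        return false;
    }
    char* end = nullptr;
    const double value = std::strtod(text.c_str(), &end);
    if (end == text.c_str() || *end != '\0') {
        return false;  // nothing parsed, or unparsed characters remain
    }
    *out = value;
    return true;
}
```

The caller still has to know which elements are expected to hold numbers, for example by tag name, because nothing in the parsed tree marks the text as numeric.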

jstm88 Over a year ago

I think I see... libXML transforms note -> to -> name -> "Tove" to note -> to -> name -> text -> "Tove"? Is this the case with all XML parsers?

Diego Sevilla Over a year ago

@jfm429: I don't know about all of them, but most do, because in XML you may need information at every level: element names, attributes, inner text, and even the whitespace between tags. So each piece is returned as its own node (don't forget comments, too :)
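
Given that tree shape, one possible way to get "name plus data" objects is to fold each element's direct text children into the element and recurse only into child elements. The sketch below is an illustration under assumptions, not the asker's or libxml2's code: SrcNode, Kind, Record, textNode(), elementNode(), sampleTree() and toRecord() are hypothetical names, with Record standing in for something like RBCXMLNode. With libxml2, the kind check would be made on the node's type field (XML_TEXT_NODE versus XML_ELEMENT_NODE).

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins: SrcNode models what the parser hands back,
// Record models an application object such as the question's RBCXMLNode.
enum class Kind { Element, Text };

struct SrcNode {
    Kind kind;
    std::string name;
    std::string content;  // used by text nodes only
    std::vector<SrcNode> children;
};

struct Record {
    std::string name;
    std::string data;              // the element's own text only
    std::vector<Record> children;  // child elements only
};

SrcNode textNode(const std::string& s) {
    return SrcNode{Kind::Text, "text", s, {}};
}

SrcNode elementNode(const std::string& name, std::vector<SrcNode> kids) {
    return SrcNode{Kind::Element, name, "", std::move(kids)};
}

// A shortened version of the question's document.
SrcNode sampleTree() {
    return elementNode("note", {
        elementNode("to", {elementNode("name", {textNode("Tove")}),
                           elementNode("name", {textNode("Tovi")})}),
        elementNode("from", {textNode("Jani")}),
    });
}

// Build a Record for an element: text children become its data,
// element children become child Records.
Record toRecord(const SrcNode& element) {
    assert(element.kind == Kind::Element);  // callers pass elements only
    Record rec{element.name, "", {}};
    for (const SrcNode& child : element.children) {
        if (child.kind == Kind::Text) {
            rec.data += child.content;
        } else {
            rec.children.push_back(toRecord(child));
        }
    }
    return rec;
}
```

With this shape, the root "note" record ends up with empty data, each "name" record carries its own text, and no record named "text" is created for the text nodes. Repeated elements such as the two "name" entries are kept, since nothing is deduplicated.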
