3

I have an XML file that I am pulling from the web and parsing. One of the items in the XML is a 'content' value that has HTML. I am using XML::Simple::XMLin to parse the file like so:

$xml = eval { $data->XMLin($xmldata, forcearray => 1, suppressempty => '') };

When I dumped the hash with Data::Dumper, I discovered that XML::Simple is parsing the HTML into the hash tree:

'content' => {
      'div' => [
                 {
                   'xmlns' => 'http://www.w3.org/1999/xhtml',
                   'p' => [
                       {
                         'a' => [
                             {
                                'href' => 'http://miamiherald.typepad.com/.a/6a00d83451b26169e20133ec6f4491970b-pi',
                               'style' => 'FLOAT: left',
                               'img' => [
                                   etc.....

This is not what I want. I just want to grab the HTML inside this entry as a string, without it being parsed. How do I do this?

3
  • 2
    What does the original XML look like? Is the HTML in a CDATA section? Commented Apr 14, 2010 at 20:26
  • 1
    Why exactly are you using XML::Simple? Commented Apr 15, 2010 at 0:36
  • @Sinan - does XML::LibXML or XML::Parser include some fancy hook which allows manual treatment of content as CDATA? Commented Apr 15, 2010 at 5:38

4 Answers

3

My general rule is that when XML::Simple starts to fail, it's time to move on to another XML processing module. XML::Simple is really supposed to be for situations that you don't need to think about. Once you have a weird case that you have to think about, you're going to have to do some extra work that I usually find quite kludgey to integrate with XML::Simple.


3
#!/usr/bin/perl

use strict; use warnings;

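use XML::LibXML::Reader;

# Stream the document from the __DATA__ section below.
my $reader = XML::LibXML::Reader->new(IO => \*DATA)
    or die "Cannot read XML\n";

# Skip ahead to the first <content> element and print its
# children as an unparsed markup string.
if ( $reader->nextElement('content') ) {
    print $reader->readInnerXml;
}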

__DATA__
<content>
<div xmlns="http://www.w3.org/1999/xhtml">
<p><a href="http://miamiherald.typepad.com/" style="float:left"><img
src="tada"/></a></p>
</div>
</content>

Output:

<div xmlns="http://www.w3.org/1999/xhtml">
<p><a href="http://miamiherald.typepad.com/" style="float:left"><img src="tada"/></a></p>
</div>
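
If the XML arrives as a string and holds more than one content element, the same reader calls can be run in a loop. This is only a sketch: the feed/entry wrapper and the sample markup are made up for illustration, and it assumes content elements are never nested inside each other.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical feed; in practice this would be the string fetched from the web.
my $xmldata = <<'XML';
<feed>
  <entry><content><div xmlns="http://www.w3.org/1999/xhtml"><p>first</p></div></content></entry>
  <entry><content><div xmlns="http://www.w3.org/1999/xhtml"><p>second</p></div></content></entry>
</feed>
XML

my $reader = XML::LibXML::Reader->new(string => $xmldata)
    or die "Cannot read XML\n";

# nextElement returns 1 while it keeps finding a matching element,
# so compare against 1 instead of relying on truthiness.
my @contents;
while ( $reader->nextElement('content') == 1 ) {
    # readInnerXml returns the element's children as a markup string
    # without advancing the reader.
    push @contents, $reader->readInnerXml;
}

print scalar(@contents), " content element(s) found\n";
print "$_\n" for @contents;
```

Each entry in @contents is then the raw markup of one content element, ready to be stored or passed to an HTML-aware module.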


2

If the HTML is included directly in the XML (rather than being escaped or wrapped in a CDATA section), then XML::Simple has no way to know where to stop parsing.

However, you can reconstitute just the HTML by passing that section of the data structure to XML::Simple's XMLout() function.
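
A rough sketch of that round trip follows. The entry wrapper and sample markup are invented for illustration, and the path into the hash assumes ForceArray => 1 as in the question. Bear in mind that XML::Simple does not preserve mixed content or the original order of attributes and sibling elements, so the result is an approximation of the HTML, not a byte-for-byte copy.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin XMLout);

# Hypothetical document; only the shape of <content> mirrors the question.
my $xmldata = <<'XML';
<entry>
  <content>
    <div xmlns="http://www.w3.org/1999/xhtml">
      <p><a href="http://miamiherald.typepad.com/" style="float:left"><img src="tada"/></a></p>
    </div>
  </content>
</entry>
XML

# KeyAttr => [] turns off folding on name/key/id attributes,
# so the structure stays as plain nested arrays and hashes.
my $xml = XMLin($xmldata, ForceArray => 1, KeyAttr => []);

# With ForceArray => 1 every child element is an array ref.
my $div = $xml->{content}[0]{div}[0];

# Serialize just that subtree, naming the root element explicitly.
my $html = XMLout($div, RootName => 'div', KeyAttr => []);
print $html;
```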


0

If the HTML is not inside a CDATA section or otherwise encoded, what you can do is a slight hack.

Before processing with XML::Simple, find the contents of the <my_html> tag, which presumably holds the suspect HTML, and pass them through an HTML entity encoder such as HTML::Entities ("<" => "&lt;", etc.). Then substitute the encoded content for the original content of the <my_html> tag.

This is VERY hacky, VERY easy to do incorrectly unless you know 100% what you're doing with regular expressions, and should not be done.

Having said that, it WILL solve your problem.
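
For what it's worth, here is a minimal sketch of the pre-processing step. To stay dependency-free it escapes only &, < and > by hand instead of calling HTML::Entities, and the sample document is invented. The regex assumes a <my_html> tag with no attributes that is never nested, which is exactly the kind of assumption that makes this approach fragile.

```perl
use strict;
use warnings;

# Hypothetical input; <my_html> stands in for whatever tag holds the markup.
my $xmldata = <<'XML';
<entry>
  <title>Example</title>
  <my_html><div><p>Hello <b>world</b> &amp; more</p></div></my_html>
</entry>
XML

# Escape the characters that are significant to an XML parser.
sub escape_markup {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # first, so the entities added below are not re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Non-greedy match across newlines; the tag's body is replaced by its escaped form.
my $escaped = $xmldata;
$escaped =~ s{(<my_html>)(.*?)(</my_html>)}{
    my ($open, $body, $close) = ($1, $2, $3);
    $open . escape_markup($body) . $close;
}gse;

print $escaped;
```

After this step an XML parser sees the body of <my_html> as plain text, and unescaping that text once gives back the original markup.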
