How can Perl's XML::Simple ignore HTML embedded in XML?

Question

I have an XML file that I am pulling from the web and parsing. One of the items in the XML is a 'content' value that has HTML. I am using XML::Simple::XMLin to parse the file like so:

$xml = eval { $data->XMLin($xmldata, forcearray => 1, suppressempty => '') };

When I dump the hash with Data::Dumper, I see that XML::Simple is parsing the HTML into the hash tree:

'content' => {
      'div' => [
                 {
                   'xmlns' => 'http://www.w3.org/1999/xhtml',
                   'p' => [
                       {
                         'a' => [
                             {
                                'href' => 'http://miamiherald.typepad.com/.a/6a00d83451b26169e20133ec6f4491970b-pi',
                               'style' => 'FLOAT: left',
                               'img' => [
                                   etc.....

This is not what I want. I just want to grab the content inside this entry as is, without it being parsed. How do I do this?

What does the original XML look like? Is the HTML in a CDATA section?
– friedo, Commented Apr 14, 2010 at 20:26
@Sinan - does XML::LibXML or XML::Parser include some fancy hook which allows manual treatment of content as CDATA?
– DVK, Commented Apr 15, 2010 at 5:38

brian d foy · Accepted Answer · 2010-04-16 04:19:00Z

3

My general rule is that when XML::Simple starts to fail, it's time to move on to another XML processing module. XML::Simple is really supposed to be for situations that you don't need to think about. Once you have a weird case that you have to think about, you're going to have to do some extra work that I usually find quite kludgey to integrate with XML::Simple.
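
As one illustration of moving to another module, here is a minimal sketch using XML::LibXML's DOM interface. The sample document and the unnamespaced <content> element are assumptions made for the example; a real feed may put <content> in a namespace, in which case the XPath lookup would need XML::LibXML::XPathContext with a registered prefix.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample standing in for the real feed.
my $xmldata = <<'XML';
<entry>
  <title>Example</title>
  <content><div xmlns="http://www.w3.org/1999/xhtml"><p><a href="http://example.com/"><img src="tada"/></a></p></div></content>
</entry>
XML

my $doc = XML::LibXML->load_xml(string => $xmldata);

# Take the first <content> element (assumed to be in no namespace).
my ($content) = $doc->findnodes('//content');
die "No <content> element found\n" unless $content;

# Serialize each child node back to markup and join the pieces.
my $html = join '', map { $_->toString } $content->childNodes;
print "$html\n";
```

Unlike the XML::Simple hash, the DOM keeps element order and mixed text, so the serialized string stays close to the markup in the source.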

answered Apr 16, 2010 at 4:19

brian d foy


Sinan Ünür · Accepted Answer · 2010-04-15 10:36:10Z

3

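#!/usr/bin/perl

use strict;
use warnings;

use XML::LibXML::Reader;

# Read the sample document from the __DATA__ section below
my $reader = XML::LibXML::Reader->new(IO => \*DATA)
    or die "Cannot read XML\n";

# Skip ahead to the first <content> element and print its inner markup
if ( $reader->nextElement('content') ) {
    print $reader->readInnerXml;
}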

__DATA__
<content>
<div xmlns="http://www.w3.org/1999/xhtml">
<p><a href="http://miamiherald.typepad.com/" style="float:left"><img
src="tada"/></a></p>
</div>
</content>

Output:

<div xmlns="http://www.w3.org/1999/xhtml">
<p><a href="http://miamiherald.typepad.com/" style="float:left"><img src="tada"/></a></p>
</div>

edited Apr 15, 2010 at 10:36

answered Apr 15, 2010 at 10:29

Sinan Ünür


marnanel (edited by brian d foy) · Accepted Answer · 2010-04-16 04:16:45Z

2

If the HTML is included directly in the XML (rather than being escaped or inside a CDATA section) then there is no way for XML::Simple to know where to stop parsing.

However, you can reconstitute just the HTML by passing that section of the data structure to XML::Simple's XMLout() function.
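
A minimal sketch of that round trip, assuming a hypothetical sample document and the ForceArray layout shown in the question's dump (the variable names and the sample XML are illustrative only):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple qw(XMLin XMLout);

# Hypothetical sample standing in for the real feed.
my $xmldata = <<'XML';
<entry>
  <title>Example</title>
  <content>
    <div xmlns="http://www.w3.org/1999/xhtml">
      <p><a href="http://example.com/"><img src="tada"/></a></p>
    </div>
  </content>
</entry>
XML

# KeyAttr => [] turns off folding on name/key/id attributes.
my $data = XMLin($xmldata, ForceArray => 1, KeyAttr => []);

# With ForceArray => 1 every child element is an array reference.
my $div = $data->{content}[0]{div}[0];

# Serialize just that subtree, naming the root element explicitly.
my $html = XMLout($div, RootName => 'div', KeyAttr => []);
print $html;
```

The result is rebuilt from the hash rather than copied from the source, so whitespace and attribute order can differ, and XML::Simple does not keep the relative order of differently named sibling elements or text mixed in between child elements. For HTML with real prose in it, that loss may be unacceptable.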

edited Apr 16, 2010 at 4:16

brian d foy


answered Apr 14, 2010 at 20:34

marnanel


DVK · Accepted Answer · 2010-04-14 20:38:22Z

0

If the HTML is not inside a CDATA section or otherwise encoded, you can use a slight hack.

Before processing with XML::Simple, find the contents of the <my_html> tag, which are presumably the suspect HTML, and pass them through an HTML entity encoder ("<" => "&lt;" etc.) such as HTML::Entities. Then insert the encoded content in place of the original content of the <my_html> tag.

This is VERY hacky, VERY easy to do incorrectly unless you know 100% what you're doing with regular expressions, and should not be done.

Having said that, it WILL solve your problem.
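
A rough sketch of the hack, assuming the tag is literally <my_html> with no attributes, never nests inside itself, and that the sample document below resembles the real data (all of these are assumptions for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);
use XML::Simple qw(XMLin);

# Hypothetical sample standing in for the real feed.
my $xmldata = <<'XML';
<entry>
  <title>Example</title>
  <my_html><div><p><a href="http://example.com/">link</a> &amp; text</p></div></my_html>
</entry>
XML

# Escape the markup characters between <my_html> and </my_html>.
# Captures are copied into lexicals before calling the encoder.
$xmldata =~ s{(<my_html>)(.*?)(</my_html>)}{
    my ($open, $body, $close) = ($1, $2, $3);
    $open . encode_entities($body, '<>&"') . $close;
}gse;

# The parser now sees plain text inside <my_html> and decodes it
# back to the original markup as a single string.
my $data = XMLin($xmldata, ForceArray => 1, KeyAttr => []);
my $html = $data->{my_html}[0];
print "$html\n";
```

Because the ampersand is escaped as well, entities already present in the HTML survive the round trip. The regular expression is the fragile part: a tag with attributes, a nested <my_html>, or the same text appearing inside a comment would all break it.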

answered Apr 14, 2010 at 20:38

DVK
