I made a VERY PRAGMATIC solution to this. It's far from perfect, it uses a lot of things that I would not want to use in production code, and it probably breaks on some of the things your real data has. It does work for the example, though.
Before looking at the code, let's notice a few things that make the XML hard to parse:
- your CDATA opening is wrong. You are using <![[CDATA[. There is one [ too many. It's supposed to be <![CDATA[.
- the double-quotes within the attribute break XML parsers
I fixed these issues by simply repairing them with a regex. As I said, it is very pragmatic. I do not claim that this is a very good solution.
So here's the code:
use strict;
use warnings;
use XML::Simple;

my $html = <<HTML;
<p>Hello world, mixed html and xml content</p>
<a href="http://google.com/">google</a>
<fw:blog id="title" content="hellow world" size="30" width="200px" />
<b>First content section</b>
<fw:content id="middle" width="400px" height="300px">Here is the first content section</fw:content>
<b>Second content section</b>
<fw:content id="left-part" width="400px" height="300px"><![[CDATA[ Here is the first content section]]></fw:content>
<b>Attributes may contains single or double quotes, can we skip double quotes in attributes</b>
<fw:blog id="title" content="what's your name, I may"" be cool" size="30" width="200px" />
<fw:lang id="home" />
HTML

# dispatch table: instruction name => coderef that renders the replacement
my %dispatch = (
    content => sub {
        my ($attr) = @_;
        return qq{<div width="$attr->{width}" id="$attr->{id}">Content: $attr->{content}</div>};
    },
    blog => sub {
        my ($attr) = @_;
        return qq{<p width="$attr->{width}" id="$attr->{id}">Blog: $attr->{content}</p>};
    },
    lang => sub {
        my ($attr) = @_;
        return "<p>FooLanguage</p>";
    },
);

# pragmatic repairs based on the example given:
# a CDATA opening has two brackets, not three; the closing ]]> is already right
$html =~ s/<!\[\[CDATA\[/<![CDATA[/g;

# replace tags that do not have a closing tag
$html =~ s{(<fw:[^>]+/>)}{parse($1)}ge;

# replace tags with a closing tag (see http://regex101.com/r/bB0kB5)
# note: no $1/$2 inside the pattern comments, they would be interpolated
$html =~ s{
    (                           # capture group 1: the whole element
        <
        (                       # capture group 2, referenced below as \2
            fw:                 # start with namespace-prefix
            [a-zA-Z]+           # find tagname
        )                       # end of group 2
        [^>]*                   # match everything until the next > (or nothing)
        >                       # end of tag
        (?:
            [^<]+               # all the stuff before the closing tag
        |                       # or
            <!\[CDATA\[.+?\]\]> # a CDATA section
        )
        </ \2 >                 # the closing tag is the same as the opening (\2)
    )
}
{
    parse($1) # dispatch
}gex; # /x allows whitespace and comments in the pattern, /e runs the replacement as code

print $html;

sub parse {
    my ($string) = @_;

    # pragmatic repairs based on the example given:
    # there can be no unescaped quotes within quotes,
    # but there are no empty attributes either
    $string =~ s/""/{double-double-quote}/g;

    # read with XML::Simple and fetch tagname as well as attributes
    my ( $name, $attr ) = each %{ XMLin( $string, KeepRoot => 1 ) };

    # get rid of the namespace
    $name =~ s/^[^:]+://;

    # restore quotes
    s/\{double-double-quote\}/""/ for values %$attr;

    # dispatch
    return $dispatch{$name}->($attr);
}
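One thing the dispatch step at the end of parse() does not cover: if the markup contains an fw: tag with no entry in %dispatch, calling the missing coderef makes the script die. A minimal sketch of a guard, assuming you would rather leave unknown tags untouched; the helper name dispatch_or_keep and the cut-down table are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical, cut-down dispatch table for illustration only.
my %dispatch = (
    lang => sub { return "<p>FooLanguage</p>" },
    blog => sub {
        my ($attr) = @_;
        return qq{<p id="$attr->{id}">Blog: $attr->{content}</p>};
    },
);

# dispatch_or_keep() is a made-up helper: it calls the handler for $name
# if one exists and otherwise returns the original markup unchanged.
sub dispatch_or_keep {
    my ( $name, $attr, $original ) = @_;
    my $handler = $dispatch{$name};
    return $handler ? $handler->($attr) : $original;
}

# known instruction: the handler's output is used
print dispatch_or_keep( 'lang', { id => 'home' }, '<fw:lang id="home" />' ), "\n";

# unknown instruction: the tag is passed through as-is
print dispatch_or_keep( 'menu', { id => 'top' }, '<fw:menu id="top" />' ), "\n";
```

In parse() this would mean passing $string along as the third argument. Whether an unknown tag should be kept, dropped or reported is your call.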
How does this work?
- I'm assuming all the processing instructions are within tags that have the fw: namespace prefix.
- There are three types of instruction in the example: content, blog and lang. I have no idea what they are supposed to do, so I made that up.
- I created a dispatch table. That's a hash with the instructions as keys and coderefs as values. A very good resource on this is the book Higher Order Perl by Mark Jason Dominus.
- I fixed the CDATA problem globally in the HTML/XML string.
- There are two regexes that take care of substituting the instructions with the actual content. They are using the /e flag, which executes Perl code in the substitution part of the s///.
  - The first one finds all tags that do not have a closing tag, i.e. <foo />.
  - The second one is more complicated. It deals with <foo>...</foo> and also handles the CDATA in the content. There is no support for CDATA in attributes! The regex uses the /x flag to allow for comments and indentation. For an explanation of the regex, see http://regex101.com/r/bB0kB5.
- My parse() sub takes the complete matched tag and does stuff to it:
  - Replace the double-double-quotes with a placeholder. If there is a real instance of quoted stuff inside an attribute, it will break! <foo attr="this is "quoted" stuff"> will not work. You will have to find a way of dealing with these.
  - It uses XML::Simple to break down the tag into a hashref with attributes. The KeepRoot option puts the tag name as the key, so we get { foo => { attr1 => 'bar', attr2 => 'baz' }}. I'm using the each built-in to split this up into key and value directly.
  - Swap the placeholder back for the original double-double-quotes.
  - Dispatch the instruction (which is in $name) through the dispatch table. The syntax to invoke a coderef with params is $coderef->($arg), but here the coderef is a hash value, so it becomes $dispatch{$name}->($attr). We pass the hashref that XML::Simple created from the attributes (and content, but it ends up like an attribute named content).
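For the quoted-stuff-inside-an-attribute case mentioned above, one possible way out is a heuristic that decides which double quotes are real delimiters and escapes the rest. This is only a sketch under assumptions: escape_inner_quotes is a made-up helper, it is meant for the opening tag only, and it treats a quote as a delimiter when it directly follows = or is followed by another attribute or the end of the tag. A value that itself contains something like ="... will fool it.

```perl
use strict;
use warnings;

# escape_inner_quotes() is a hypothetical helper, not part of the code above.
# Heuristic: a double quote delimits an attribute value only if it directly
# follows "=" (opening) or is followed by another attribute or the end of
# the tag (closing). Every other double quote is rewritten to &quot;.
sub escape_inner_quotes {
    my ($tag) = @_;
    $tag =~ s{
        (?<! = )                # not preceded by "=", so not an opening quote
        "
        (?!                     # and not followed by ...
            \s+ [\w:.-]+ = "    #   ... the start of the next attribute
          | \s* /? >            #   ... or the end of the tag
        )
    }{&quot;}gx;
    return $tag;
}

print escape_inner_quotes('<foo attr="this is "quoted" stuff">'), "\n";
print escape_inner_quotes('<fw:blog id="title" content="I may"" be cool" size="30" />'), "\n";
```

An XML parser reads &quot; in an attribute value back as a literal double quote, so the handlers would still see the original text.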
I'd like to stress again that this will probably not even work on your real data, but it might give some ideas as to how to solve it pragmatically.
"what's your name, I may"" be cool"part. That one breaks everything? Where are you getting this snippet from?