Perl parse xml tags manually using regular expression

Question

I have an HTML content snippet that contains custom XML tags with attributes or CDATA, and may have text nodes.

The snippet is not well-formed XML, so I think I cannot use XML parser modules.

Here is a sample HTML content snippet:

<p>Hello world, mixed html and xml content</p>
<a href="http://google.com/">google</a>
<fw:blog id="title" content="hellow world" size="30" width="200px" />
<b>First content section</b>
<fw:content id="middle" width="400px" height="300px">Here is the first content section</fw:content>
<b>Second content section</b>
<fw:content id="left-part" width="400px" height="300px"><![[CDATA[ Here is the first content section]]></fw:content>
<b>Attributes may contains single or double quotes, can we skip double quotes in attributes</b>
<fw:blog id="title" content="what's your name, I may"" be cool" size="30" width="200px" />
<fw:lang id="home" />

Assuming I have the namespace fw, I need to find all fw XML tags and replace each one with the program output for that tag.

It's a bad idea to parse XML manually. Use a specialized library.
– Dan Dascalescu, Commented Mar 2, 2014 at 4:00
I got this to work with XML::Twig, except for the "what's your name, I may"" be cool" part. That one breaks everything. Where are you getting this snippet from?
– simbabque, Commented Mar 2, 2014 at 9:53

Community · Accepted Answer · 2020-06-20 09:12:55Z

I made a VERY PRAGMATIC solution to this. It's far from perfect, it uses a lot of things that I would not want to use in production code, and it probably breaks on some of the things your real data has. It does work for the example, though.

Before looking at the code, let's notice a few things that make the XML hard to parse:

- Your CDATA opening is wrong. You are using <![[CDATA[, which has one [ too many. It is supposed to be <![CDATA[.
- The doubled double-quotes inside the attribute value break XML parsers.

I fixed these issues by simply repairing them with a regex. As I said, it is very pragmatic. I do not claim that this is a very good solution.

So here's the code:

use strict; use warnings;
use XML::Simple;

my $html = <<HTML;
<p>Hello world, mixed html and xml content</p>
<a href="http://google.com/">google</a>
<fw:blog id="title" content="hellow world" size="30" width="200px" />
<b>First content section</b>
<fw:content id="middle" width="400px" height="300px">Here is the first content section</fw:content>
<b>Second content section</b>
<fw:content id="left-part" width="400px" height="300px"><![[CDATA[ Here is the first content section]]></fw:content>
<b>Attributes may contains single or double quotes, can we skip double quotes in attributes</b>
<fw:blog id="title" content="what's your name, I may"" be cool" size="30" width="200px" />
<fw:lang id="home" />
HTML

# dispatch table
my %dispatch = (
  content => sub {
    my ($attr) = @_;
    return qq{<div width="$attr->{width}" id="$attr->{id}">Content: $attr->{content}</div>};
  },
  blog => sub {
    my ($attr) = @_;
    return qq{<p width="$attr->{width}" id="$attr->{id}">Blog: $attr->{content}</p>};
  },
  lang => sub {
    my ($attr) = @_;
    return "<p>FooLanguage</p>";
  }
);

# pragmatic repairs based on the example given:
# a CDATA opening has two brackets, not three; the closing ]]> is already right
# (/g so that every broken opening is repaired, not just the first)
$html =~ s/<!\[\[CDATA\[/<![CDATA[/g;


# replace tags that do not have a closing tag
$html =~ s{(<fw:[^>]+/>)}{parse($1)}ge;
# replace tags with a closing tag (see http://regex101.com/r/bB0kB5)
$html =~ s{
  (                # group to $1
    <
      (            # group to $2 and \2
        fw:        # start with namespace-prefix
        [a-zA-Z]+  # find tagname
      )            # end of $2
      [^>]*        # match everything until the next > (or nothing)
    >              # end of tag
    (?:
      [^<]+                 # all the stuff before the closing tag
      |                       # or
      <!\[CDATA\[.+?\]\]>   # a CDATA section
    )
    </  \2  >      # the closing tag is the same as the opening (\2)
  )
}
{
  parse($1)        # dispatch
}gex; # /x allows whitespace and comments inside the pattern


print $html;

sub parse {
  my ($string) = @_;

  # pragmatic repairs based on the example given:
  # there can be no unescaped quotes within quotes,
  # but there are no empty attributes either
  $string =~ s/""/{double-double-quote}/g;

  # read with XML::Simple and fetch tagname as well as attributes
  my ( $name, $attr ) = each %{ XMLin($string, KeepRoot => 1 ) };
  
  # get rid of the namespace
  $name =~ s/^[^:]+://;
  
  # restore quotes (braces escaped so they are unambiguous literals)
  s/\{double-double-quote\}/""/ for values %$attr;
  
  # dispatch
  return $dispatch{$name}->($attr);
}

How does this work?

- I'm assuming all the processing instructions are within tags that have the fw: namespace.
- There are three types of instruction in the example: content, blog and lang. I have no idea what they are supposed to do, so I made that up.
- I created a dispatch table. That's a hash with the instruction names as keys and coderefs as values. A very good resource on this is the book Higher-Order Perl by Mark Jason Dominus.
- I fixed the CDATA problem globally in the HTML/XML string.
- There are two regexes that take care of substituting the instructions with the actual content. They use the /e flag, which evaluates the replacement part of the s/// as Perl code.
  - The first one finds all self-closing tags, i.e. <foo />.
  - The second one is more complicated. It deals with <foo>...</foo> and also handles CDATA in the content. There is no support for CDATA in attributes! The regex uses the /x flag to allow for comments and indentation. For an explanation of the regex, see http://regex101.com/r/bB0kB5.
- My parse() sub takes the complete matched tag and does the following:
  - It replaces the double-double-quotes with a placeholder. If there is a real instance of quoted stuff inside an attribute, it will break! <foo attr="this is "quoted" stuff"> will not work. You will have to find a way of dealing with these.
  - It uses XML::Simple to break down the tag into a hashref with attributes. The KeepRoot option puts the tag name as the key, so we get { foo => { attr1 => 'bar', attr2 => 'baz' }}. I'm using the each built-in to split this up into key and value directly.
  - It puts the double-double-quotes back in place of the placeholder.
  - It dispatches the instruction (which is in $name) through the dispatch table. The general syntax to invoke a coderef with params is $coderef->($arg); here the coderef is a hash value, so the call is $dispatch{$name}->($attr). We pass the hashref that XML::Simple created from the attributes (and content, but it ends up like an attribute named content).
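The combination of a dispatch table and s///e described in the list can be seen in isolation in the following minimal sketch. It uses core Perl only. The names %handlers, attrs_of and render are made up for this illustration, and attrs_of is a naive regex stand-in for XML::Simple that only understands double-quoted attributes without embedded quotes or entities.

```perl
use strict;
use warnings;

# Hypothetical handlers, one per tag name; each gets a hashref of attributes.
my %handlers = (
    blog => sub { my ($attr) = @_; return qq{<p id="$attr->{id}">Blog: $attr->{content}</p>} },
    lang => sub { return '<p>FooLanguage</p>' },
);

# Hypothetical stand-in for XML::Simple: collect name="value" pairs from one tag.
sub attrs_of {
    my ($tag) = @_;
    my %attr;
    $attr{$1} = $2 while $tag =~ /([\w-]+)="([^"]*)"/g;
    return \%attr;
}

# Look up the handler by tag name; tags without a handler pass through unchanged.
sub render {
    my ($tag) = @_;
    my ($name) = $tag =~ /^<fw:([A-Za-z]+)/ or return $tag;
    my $handler = $handlers{$name} or return $tag;
    return $handler->( attrs_of($tag) );
}

my $html = qq{<b>Hi</b> <fw:blog id="title" content="hello" /> <fw:lang id="home" /> <fw:nope />};

# /e evaluates render($1) as Perl code and splices in its return value.
( my $out = $html ) =~ s{(<fw:[^>]+/>)}{render($1)}ge;

print "$out\n";
```

Returning the tag unchanged for unknown names is a design choice of this sketch; the answer's parse() would instead die when $dispatch{$name} is undefined.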

I'd like to stress again that this will probably not even work on your real data, but it might give some ideas as to how to solve it pragmatically.

Really nice code and documentation. I am studying it, and as a beginner it will take me some time to understand. What I really want is an array of hashes of all matched tags with their attributes, so I can process them almost as you did. As you expected, namespace tags such as fw:blog will be processed with a blog function or class.
@daliaessam: You should have said what you wanted to do in the first place. Would have saved some guessing. But you would still need to replace them, so doing so as you parse through the document is the most sensible approach, I believe. Basically you have your own little template engine. How about using one from CPAN?
Sorry about that. Should I post a new question asking for the same thing, but getting the matched tags as an array of hashes with the tags' attributes? And yes, the CDATA was a typo. Can the double quote inside the attribute be escaped with a backslash, like \", or with an entity?
For XML::Simple you can't use a backslash. I tried that already. ;-) Entities would work. Regarding the change you want to make, try it. If you can't get it to work, create a new question with a short example.
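As a rough illustration of the two points discussed in these comments (collecting the tags into an array of hashes, and writing quotes inside attributes as entities), here is a minimal sketch in core Perl. It is not the answer's code: collect_fw_tags and decode_entities are hypothetical helpers, the regex assumes double-quoted attributes that contain no > character, and a CDATA body would be returned with its wrapper intact.

```perl
use strict;
use warnings;

# Hypothetical helper: decode the five predefined XML entities in one pass.
my %entity = ( quot => '"', apos => "'", lt => '<', gt => '>', amp => '&' );
sub decode_entities {
    my ($text) = @_;
    $text =~ s/&(quot|apos|lt|gt|amp);/$entity{$1}/g;
    return $text;
}

# Collect every fw: tag as { name, attr, body } without replacing anything.
sub collect_fw_tags {
    my ($html) = @_;
    my @tags;
    while ( $html =~ m{<fw:([A-Za-z]+)([^>]*?)(?:/>|>(.*?)</fw:\1>)}gs ) {
        my ( $name, $raw_attrs, $body ) = ( $1, $2, $3 );   # body is undef for <fw:x />
        my %attr;
        while ( $raw_attrs =~ /([\w-]+)="([^"]*)"/g ) {
            my ( $key, $value ) = ( $1, $2 );
            $attr{$key} = decode_entities($value);
        }
        push @tags, { name => $name, attr => \%attr, body => $body };
    }
    return @tags;
}

my $html = <<'HTML';
<p>intro</p>
<fw:blog id="title" content="what's your name, I may &quot;&quot; be cool" size="30" />
<fw:content id="middle" width="400px">Here is the first content section</fw:content>
<fw:lang id="home" />
HTML

my @tags = collect_fw_tags($html);
print "$_->{name} (", join( ', ', sort keys %{ $_->{attr} } ), ")\n" for @tags;
```

Each element of @tags can then be handed to a per-tag function, for example via a dispatch table keyed on $_->{name}.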
