Find and replace xml tags in html string in Perl with regex

Question

I need to find and replace XML tags inside an HTML string. The string is not complete XML, so I cannot use an XML parser on it. Instead, I need to find the XML tags manually and replace them with content.

Example of an HTML string containing the XML tags:

some text<p>hello p</p>
<vars type="text" name="fname" age="64" style="<b>color='red'</b>
Class::SubClass->color" /> other text or html open tags like <p><table><tr>

So I need to find the XML "vars" tags, each with a variable number of optional attributes, and replace them with content.

Miller · Accepted Answer · 2014-05-25 21:05:02Z

Do not use regular expressions for parsing HTML. Instead, use an actual HTML parser like Mojo::DOM. There's a nice 8-minute video about using this module at Mojocast episode 5.

The following takes your html, and translates your special vars tag into some new text.

use strict;
use warnings;

use Mojo::DOM;

# Parse
my $dom = Mojo::DOM->new(do {local $/; <DATA>});

for my $var ($dom->find('vars')->each) {
    my $type = $var->{type};
    my $name = $var->{name};

    $var->replace("<b>name is $name</b> + <i>type is $type</i>");
}

print $dom;

__DATA__
<html>
<head>
<title>Always use a parser, not a regex</title>
</head>
<body>
some text<p>hello p</p>
<vars type="text" name="fname" age="64" style="<b>color='red'</b>
Class::SubClass->color" /> other text or html open tags like <p><table><tr><td></td></tr></table>

</body></html>

Outputs:

<html>
<head>
<title>Always use a parser, not a regex</title>
</head>
<body>
some text<p>hello p</p>
<b>name is fname</b> + <i>type is text</i> other text or html open tags like <p></p><table><tr><td></td></tr></table>

</body></html>

3 Comments

daliaessam Over a year ago

Your solution looks very nice. The only downside is that this module will load the entire Mojo framework. I am already using Moose in my app, so it would be heavy to load another framework for such a small thing. Do you know of any other standalone Perl module that I could use instead? The reason I am trying to use a regex is to avoid pulling in tons of modules for a small task.

daliaessam Over a year ago

Regarding "Do not use regular expressions for parsing HTML": I looked at Mojo::DOM::HTML and Mojo::DOM::CSS, which Mojo::DOM is built on, and they both use regular expressions to parse the input. I also looked at XML::TreePP and found that it uses regexes to parse the entire document as well. So I think using a regex is fine.

Miller Over a year ago

Yes, of course using a regex to parse HTML is doable. And if you know enough to look at the source of those modules, you're more capable than most who'd try. Overall, it's just wiser not to reinvent the wheel, especially since this is a much harder project to do well than it initially seems. Nevertheless, if you have performance concerns, then I can understand your desire to try.

daliaessam · Accepted Answer · 2014-05-26 01:23:35Z

I looked at some Perl parsers for XML and HTML, such as Mojo::DOM (as pointed out in Miller's answer above) and XML::TreePP, and found that they use regexes to parse the entire contents. So I tried their regexes and got good results, although they may need some optimization.

Here is what I did:

my $text =<<'XHTML';
some text
<p>hello p</p>
<vars  type="text" name= "fname" single='single quoted' unqouted=noquotes hastags=" <b>color='red'</b> Class::SubClass->color"/>
other text or html open tags like
<vars type="text" name= "lname" single1='single quoted' unqouted1=noquotes hastags1=" <b>bgcolor='red'</b> Class::SubClass->bgcolor">
<table><tr>
<vars name="mname" />
XHTML

while ( $text =~ m{(<vars\s+([^\!\?\s<>](?:"[^"]*"|'[^']*'|[^"'<>\/])*)/?>)}sxgi ) {
    my $match = $1;
    my $args = $2;
    #print "[[$match]] \n{{$args}}\n\n";

    #parse name=value attributes, values may be double or single quoted or unquoted
    while ( $args =~ m/([^<>=\s\/]+|\/)(?:\s*=\s*(?:"([^"]*?)"|'([^']*?)'|([^>\s\/]*)))?\s*/sxgi ) {
        my $name = $1;
        #any better solution with regex above to just get $2
        my $value = $2? $2: ($3? $3 : $4);
        print "$name=$value\n";
    }
    print "\n";
}

And here is the output, as expected:

type=text
name=fname
single=single quoted
unqouted=noquotes
hastags= <b>color='red'</b> Class::SubClass->color

type=text
name=lname
single1=single quoted
unqouted1=noquotes
hastags1= <b>bgcolor='red'</b> Class::SubClass->bgcolor

name=mname

Of course, the variable $match in the code holds the entire matched tag, so I can replace it with my own content.
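As a rough sketch of that replacement step (not part of the code above), the tag regex can be used in a substitution with the /e modifier, so that each vars tag is replaced by text computed from its attributes. The helper render_vars, its output format, and the shortened input are hypothetical and only serve as illustration:

```perl
use strict;
use warnings;

# Hypothetical input, shortened from the sample text above.
my $text = <<'XHTML';
some text <vars type="text" name="fname" /> other text
<table><tr><vars name="mname">
XHTML

# Hypothetical helper: build the replacement text from the parsed attributes.
sub render_vars {
    my (%attr) = @_;
    my $type = defined $attr{type} ? $attr{type} : 'none';
    return "[name=$attr{name} type=$type]";
}

# Same patterns as above, precompiled; $1 of $vars_re is the attribute string.
my $vars_re = qr{<vars\s+([^!?\s<>](?:"[^"]*"|'[^']*'|[^"'<>/])*)/?>}i;
my $attr_re = qr{([^<>=\s/]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^>\s/]*)))?\s*};

# Work on a copy; /e evaluates the replacement block as Perl code.
(my $out = $text) =~ s{$vars_re}{
    my $args = $1;    # save before the inner match overwrites $1
    my %attr;
    while ( $args =~ /$attr_re/g ) {
        my ($value) = grep { defined } $2, $3, $4;
        $attr{ lc $1 } = defined $value ? $value : '';
    }
    render_vars(%attr);
}ge;

print $out;
```

Copying $1 into $args first matters here, because the inner attribute match resets the capture variables.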

The second regex, which matches the attributes, needs optimization. I am not satisfied with this line:

my $value = $2? $2: ($3? $3 : $4);

Can the regex be modified so that the attribute value always ends up in $2?
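One possible approach, assuming Perl 5.10 or newer, is the branch reset group (?|...), which numbers the capture groups in each alternative from the same starting point, so whichever alternative matches fills $2. The sketch below uses a hypothetical attribute string and drops the bare-slash key alternative for brevity. It also handles a value of "0", which the truthiness test in the line above would lose:

```perl
use strict;
use warnings;
use 5.010;    # branch reset groups need Perl 5.10 or newer

# Hypothetical attribute string, modelled on the sample input.
my $args = q{type="text" single='single quoted' unquoted=noquotes zero="0"};

my @pairs;
while (
    $args =~ m{
        ([^<>=\s/]+)          # attribute name, first capture
        (?:
            \s*=\s*
            (?|               # branch reset: each alternative is capture two
                "([^"]*)"     # double-quoted value
              | '([^']*)'     # single-quoted value
              | ([^>\s/]*)    # unquoted value
            )
        )?
        \s*
    }gx
  )
{
    my ( $name, $value ) = ( $1, $2 );
    $value //= '';            # attribute written without any value
    push @pairs, "$name=$value";
}

print "$_\n" for @pairs;
```

On older Perls without branch reset, testing the three captures with defined rather than for truth avoids the same problem.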

The regexes as used in Mojo::DOM are:

my $ATTR_RE = qr/
  ([^<>=\s\/]+|\/)   # Key
  (?:
    \s*=\s*
    (?:
      "([^"]*?)"     # Quotation marks
    |
      '([^']*?)'     # Apostrophes
    |
      ([^>\s\/]*)    # Unquoted
    )
  )?
  \s*
/x;
my $END_RE   = qr!^\s*/\s*(.+)!;
my $TOKEN_RE = qr/
  ([^<]+)?                                          # Text
  (?:
    <\?(.*?)\?>                                     # Processing Instruction
  |
    <!--(.*?)--\s*>                                 # Comment
  |
    <!\[CDATA\[(.*?)\]\]>                           # CDATA
  |
    <!DOCTYPE(
      \s+\w+
      (?:(?:\s+\w+)?(?:\s+(?:"[^"]*"|'[^']*'))+)?   # External ID
      (?:\s+\[.+?\])?                               # Int Subset
      \s*
    )>
  |
    <(
      \s*
      [^<>\s]+                                      # Tag
      \s*
      (?:$ATTR_RE)*                                 # Attributes
    )>
  |
    (<)                                             # Runaway "<"
  )??
/xis;

I only modified it so that it matches a tag whether it is closed with > or with />.
