
I need to find and replace XML tags inside an HTML string. The string is not well-formed XML, so I cannot use an XML parser on it. I therefore need to locate the XML tags manually and replace them with content inside these HTML strings.

Example of html string containing the xml tags:

some text<p>hello p</p>
<vars type="text" name="fname" age="64" style="<b>color='red'</b>
Class::SubClass->color" /> other text or html open tags like <p><table><tr>

So I need to find the XML "vars" tags, each with a variable number of optional attributes, and replace them with content.

2 Answers


Do not use regular expressions for parsing HTML. Instead use an actual HTML Parser like Mojo::DOM. There's a nice 8 minute video about using this module at mojocast episode 5.

The following takes your html, and translates your special vars tag into some new text.

use strict;
use warnings;

use Mojo::DOM;

# Parse
my $dom = Mojo::DOM->new(do {local $/; <DATA>});

for my $var ($dom->find('vars')->each) {
    my $type = $var->{type};
    my $name = $var->{name};

    $var->replace("<b>name is $name</b> + <i>type is $type</i>");
}

print $dom;

__DATA__
<html>
<head>
<title>Always use a parser, not a regex</title>
</head>
<body>
some text<p>hello p</p>
<vars type="text" name="fname" age="64" style="<b>color='red'</b>
Class::SubClass->color" /> other text or html open tags like <p><table><tr><td></td></tr></table>

</body></html>

Outputs:

<html>
<head>
<title>Always use a parser, not a regex</title>
</head>
<body>
some text<p>hello p</p>
<b>name is fname</b> + <i>type is text</i> other text or html open tags like <p></p><table><tr><td></td></tr></table>

</body></html>

3 Comments

Your solution looks very nice. The only downside is that this module loads the entire Mojo framework. I am already using Moose in my app, so loading another framework for such a small thing would be heavy. Do you know of any other standalone Perl module that I could use instead? The reason I am trying to use a regex is to avoid pulling in tons of modules for a small task.
Regarding "Do not use regular expressions for parsing HTML": I looked at Mojo::DOM::HTML and Mojo::DOM::CSS, which Mojo::DOM is built on, and they both use regular expressions to parse the input. I also looked at XML::TreePP and found that it uses regexes to parse the entire document as well. So I think using a regex is fine.
Yes, of course using a regex to parse HTML is doable. And if you know enough to look at the source of those modules, you're more capable than most who'd try. Overall, it's just wiser not to reinvent the wheel, especially since this is a much harder project to do well than it initially seems. Nevertheless, if you have performance concerns, then I can understand your desire to try.

I looked at some Perl parsers for XML and HTML, such as Mojo::DOM (suggested in Miller's answer above) and XML::TreePP, and found that they use regexes to parse the entire content. So I tried their regexes and got good results, although they may need some optimization.

Here is what I did:

my $text =<<'XHTML';
some text
<p>hello p</p>
<vars  type="text" name= "fname" single='single quoted' unqouted=noquotes hastags=" <b>color='red'</b> Class::SubClass->color"/>
other text or html open tags like
<vars type="text" name= "lname" single1='single quoted' unqouted1=noquotes hastags1=" <b>bgcolor='red'</b> Class::SubClass->bgcolor">
<table><tr>
<vars name="mname" />
XHTML

while ( $text =~ m{(<vars\s+([^\!\?\s<>](?:"[^"]*"|'[^']*'|[^"'<>\/])*)/?>)}sxgi ) {
    my $match = $1;
    my $args = $2;
    #print "[[$match]] \n{{$args}}\n\n";

    #parse name=value attributes, values may be double or single quoted or unquoted
    while ( $args =~ m/([^<>=\s\/]+|\/)(?:\s*=\s*(?:"([^"]*?)"|'([^']*?)'|([^>\s\/]*)))?\s*/sxgi ) {
        my $name = $1;
        #any better solution with regex above to just get $2
        my $value = $2? $2: ($3? $3 : $4);
        print "$name=$value\n";
    }
    print "\n";
}

And here is the output, as expected:

type=text
name=fname
single=single quoted
unqouted=noquotes
hastags= <b>color='red'</b> Class::SubClass->color

type=text
name=lname
single1=single quoted
unqouted1=noquotes
hastags1= <b>bgcolor='red'</b> Class::SubClass->bgcolor

name=mname

Of course, the variable $match in the code holds the entire matched tag, so I can replace it with my own content.
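As a rough sketch of that replacement step, the tag pattern can be used directly in a substitution with the /e modifier, so each matched tag is swapped for computed content. The %content lookup table, the replace_vars helper, and the sample names are hypothetical, and for brevity only a double-quoted name attribute is recognised here:

```perl
use strict;
use warnings;

# Hypothetical lookup table: value of the "name" attribute => replacement content.
my %content = ( fname => 'John', lname => 'Smith' );

# Same tag pattern as in the loop above, compiled once.
my $VARS_RE = qr{<vars\s+([^\!\?\s<>](?:"[^"]*"|'[^']*'|[^"'<>/])*)/?>}si;

sub replace_vars {
    my ($text) = @_;
    $text =~ s{$VARS_RE}{
        my $args = $1;    # copy first: the inner match below resets the captures
        my ($name) = $args =~ /\bname\s*=\s*"([^"]*)"/;
        defined $name && exists $content{$name} ? $content{$name} : '';
    }ge;
    return $text;
}

print replace_vars(qq{Hi <vars type="text" name="fname" />, <vars name= "lname">!\n});
# prints: Hi John, Smith!
```

Tags whose name is missing or unknown are replaced with an empty string in this sketch; a real application would probably want to log or keep them instead.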

The second regex, which matches the attributes, needs optimizing. I am not satisfied with this line:

my $value = $2? $2: ($3? $3 : $4);

Can the regex be modified so that the attribute value always ends up in $2?
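One possible approach, assuming Perl 5.10 or later, is the branch-reset group (?|...), which restarts capture numbering in each alternative so that all three value forms land in the same capture group. The parse_attrs helper and the sample attribute string below are made up for illustration. A side benefit is that it avoids the truthiness test, which silently drops a value such as "0":

```perl
use strict;
use warnings;

# Hypothetical helper: split an attribute string into [name, value] pairs.
# Inside the (?|...) branch-reset group every alternative numbers its
# capture as group 2, so no fallback chain over groups 2, 3 and 4 is needed.
sub parse_attrs {
    my ($args) = @_;
    my @pairs;
    while ( $args =~ m{
        ([^<>=\s/]+)            # attribute name, group 1
        (?:
            \s*=\s*
            (?| "([^"]*)"       # double-quoted value, group 2
              | '([^']*)'       # single-quoted value, also group 2
              | ([^>\s/]*)      # unquoted value, also group 2
            )
        )?
        \s*
    }xg ) {
        push @pairs, [ $1, $2 ];    # value is undef for a bare attribute
    }
    return @pairs;
}

for my $pair ( parse_attrs(q{type="text" single='single quoted' plain=noquotes zero="0"}) ) {
    my ( $name, $value ) = @$pair;
    print "$name=", ( defined $value ? $value : '' ), "\n";
}
```

The regex comments deliberately avoid writing capture variables with a dollar sign, because variables are interpolated into a pattern even inside /x comments.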

The regexes as used in Mojo::DOM are:

my $ATTR_RE = qr/
  ([^<>=\s\/]+|\/)   # Key
  (?:
    \s*=\s*
    (?:
      "([^"]*?)"     # Quotation marks
    |
      '([^']*?)'     # Apostrophes
    |
      ([^>\s\/]*)    # Unquoted
    )
  )?
  \s*
/x;
my $END_RE   = qr!^\s*/\s*(.+)!;
my $TOKEN_RE = qr/
  ([^<]+)?                                          # Text
  (?:
    <\?(.*?)\?>                                     # Processing Instruction
  |
    <!--(.*?)--\s*>                                 # Comment
  |
    <!\[CDATA\[(.*?)\]\]>                           # CDATA
  |
    <!DOCTYPE(
      \s+\w+
      (?:(?:\s+\w+)?(?:\s+(?:"[^"]*"|'[^']*'))+)?   # External ID
      (?:\s+\[.+?\])?                               # Int Subset
      \s*
    )>
  |
    <(
      \s*
      [^<>\s]+                                      # Tag
      \s*
      (?:$ATTR_RE)*                                 # Attributes
    )>
  |
    (<)                                             # Runaway "<"
  )??
/xis;

I just tweaked it so that it matches a tag whether it closes with or without a slash, that is, with > or />.
