Looking at some Perl parsers for XML and HTML, such as Mojo::DOM (pointed out in Miller's answer above) and XML::TreePP, I found that they use regexes to parse the entire content. So I tried their regexes and got good results, although they may need some optimization.
Here is what I did:
my $text =<<'XHTML';
some text
<p>hello p</p>
<vars type="text" name= "fname" single='single quoted' unqouted=noquotes hastags=" <b>color='red'</b> Class::SubClass->color"/>
other text or html open tags like
<vars type="text" name= "lname" single1='single quoted' unqouted1=noquotes hastags1=" <b>bgcolor='red'</b> Class::SubClass->bgcolor">
<table><tr>
<vars name="mname" />
XHTML
while ( $text =~ m{(<vars\s+([^\!\?\s<>](?:"[^"]*"|'[^']*'|[^"'<>\/])*)/?>)}sxgi ) {
    my $match = $1;    # the whole <vars ...> tag
    my $args  = $2;    # everything between "<vars" and the closing ">" or "/>"
    #print "[[$match]] \n{{$args}}\n\n";
    # Parse name=value attributes; values may be double-quoted, single-quoted, or unquoted
    while ( $args =~ m/([^<>=\s\/]+|\/)(?:\s*=\s*(?:"([^"]*?)"|'([^']*?)'|([^>\s\/]*)))?\s*/sxgi ) {
        my $name = $1;
        # Is there a better regex than the one above that puts the value in $2 alone?
        my $value = $2? $2: ($3? $3 : $4);
        print "$name=$value\n";
    }
    print "\n";
}
And here is the output, as expected:
type=text
name=fname
single=single quoted
unqouted=noquotes
hastags= <b>color='red'</b> Class::SubClass->color

type=text
name=lname
single1=single quoted
unqouted1=noquotes
hastags1= <b>bgcolor='red'</b> Class::SubClass->bgcolor

name=mname
Of course, the variable $match in the code holds the entire matched tag, so I can replace it with my own content.
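As a minimal sketch of that replacement step, the same tag regex can be used with s///ge so that each matched tag is swapped for a value looked up by its name attribute. The %values hash and the render_vars helper are hypothetical names made up for illustration, and for brevity the helper only understands a double-quoted name="..." attribute.

```perl
use strict;
use warnings;

# Hypothetical lookup table: replacement content keyed by the tag's name attribute.
my %values = ( fname => 'John', lname => 'Smith' );

# Hypothetical helper: returns the replacement text for one matched <vars ...> tag.
sub render_vars {
    my ($match, $args) = @_;
    # Simplification: only a double-quoted name="..." attribute is recognised here.
    my ($name) = $args =~ m/\bname\s*=\s*"([^"]*)"/;
    return $values{$name} if defined $name && exists $values{$name};
    return $match;    # leave the tag untouched when no replacement is known
}

my $html = qq{Hello <vars type="text" name="fname"/> <vars name="lname">!};

# Same tag regex as in the loop; the /e flag evaluates the replacement as Perl code.
$html =~ s{(<vars\s+([^\!\?\s<>](?:"[^"]*"|'[^']*'|[^"'<>\/])*)/?>)}{render_vars($1, $2)}sxgie;

print "$html\n";    # prints: Hello John Smith!
```

Returning $match for unknown names keeps unrecognised tags in the text instead of silently deleting them, which makes missing values easier to spot.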
The second regex, which matches the attributes, needs optimization. I am not satisfied with this line:
my $value = $2? $2: ($3? $3 : $4);
Can the regex be modified so that the attribute value always ends up in $2?
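One possible answer, assuming Perl 5.10 or newer: a branch-reset group, written (?|...), makes every alternative number its capture groups from the same starting point, so the double-quoted, single-quoted, and unquoted values all land in $2. The sketch below is derived from the attribute regex above and is not how Mojo::DOM itself does it; parse_attrs is a hypothetical helper name. It also avoids a pitfall of the truthiness test in the line above, which would turn a quoted value of "0" into an undefined value.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [name, value] pairs found in $args.
sub parse_attrs {
    my ($args) = @_;
    my @pairs;
    while ( $args =~ m/
        ([^<>=\s\/]+|\/)        # attribute name (group 1)
        (?:
            \s*=\s*
            (?|                 # branch reset: every alternative captures into group 2
                "([^"]*)"       # double-quoted value
              |
                '([^']*)'       # single-quoted value
              |
                ([^>\s\/]*)     # unquoted value
            )
        )?
        \s*
    /sxg ) {
        push @pairs, [ $1, $2 ];    # $2 is undef when the attribute has no value
    }
    return @pairs;
}

for my $pair ( parse_attrs(q{type="text" name= 'fname' size="0" checked}) ) {
    my ($name, $value) = @$pair;
    print "$name=", ( defined $value ? $value : '' ), "\n";
}
```

On Perl 5.10 and later, the fallback chain could also be written as $2 // $3 // $4 without touching the regex, since the defined-or operator tests for definedness rather than truth.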
The regexes as used in Mojo::DOM are:
my $ATTR_RE = qr/
  ([^<>=\s\/]+|\/)   # Key
  (?:
    \s*=\s*
    (?:
      "([^"]*?)"     # Quotation marks
    |
      '([^']*?)'     # Apostrophes
    |
      ([^>\s\/]*)    # Unquoted
    )
  )?
  \s*
/x;
my $END_RE = qr!^\s*/\s*(.+)!;
my $TOKEN_RE = qr/
  ([^<]+)?                                          # Text
  (?:
    <\?(.*?)\?>                                     # Processing Instruction
  |
    <!--(.*?)--\s*>                                 # Comment
  |
    <!\[CDATA\[(.*?)\]\]>                           # CDATA
  |
    <!DOCTYPE(
      \s+\w+
      (?:(?:\s+\w+)?(?:\s+(?:"[^"]*"|'[^']*'))+)?   # External ID
      (?:\s+\[.+?\])?                               # Int Subset
      \s*
    )>
  |
    <(
      \s*
      [^<>\s]+                                      # Tag
      \s*
      (?:$ATTR_RE)*                                 # Attributes
    )>
  |
    (<)                                             # Runaway "<"
  )??
/xis;
I only tweaked it so that a tag matches whether it closes with > or with />.