Perl Regular Expression to extract value from nested html tags

Question

$match = q(<a href="#google"><h1><b>Google</b></h1></a>);
if ($match =~ /<a.*?href.*?><.?>(.*?)<\/a>/) {
    $title = $1;
} else {
    $title = "";
}
print "$title";

OUTPUT: Google</b></h1>

It should be: Google

I am unable to extract the link text with a regex in Perl. The nesting could be one level deeper or shallower, for example:

<h1><b><i>Google</i></b></h1>

Please try these inputs:

1) <td><a href="/wiki/Unix_shell" title="Unix shell">Unix shell</a>

2) <a href="http://www.hp.com"><h1><b>HP</b></h1></a>

3) <a href="/wiki/Generic_programming" title="Generic programming">generic</a></td>);

4) <a href="#cite_note-1"><span>[</span>1<span>]</span></a>

Expected output:

Unix shell

HP

generic

[1]

Your expression says "take everything until the closing </a>", and that's what you get. You need to use <\/b>.
– Floris, Commented Aug 28, 2013 at 12:59
Perl has many fine HTML parsers (such as this one). Don't use regex.
– Quentin, Commented Aug 28, 2013 at 12:59
I know, but for extracting the value it is almost working; it only fails to exclude the closing tags. Any idea?
– Hmnshu, Commented Aug 28, 2013 at 12:59

amon · Accepted Answer · 2013-08-28 13:11:47Z

5

Don't use regexes, as mentioned in the comments. I am especially fond of the Mojo suite, which allows me to use CSS selectors:

use Mojo::DOM;

my $dom = Mojo::DOM->new(q(<a href="#google"><h1><b>Google</b></h1></a>));

print $dom->at('a[href="#google"]')->all_text, "\n";

Or with HTML::TreeBuilder::XPath:

use HTML::TreeBuilder::XPath;

my $dom = HTML::TreeBuilder::XPath->new_from_content(q(<a href="#google"><h1><b>Google</b></h1></a>));

print $dom->findvalue('//a[@href="#google"]'), "\n";

edited Aug 28, 2013 at 13:11

answered Aug 28, 2013 at 13:04


Floris · Answer · 2013-08-28 13:08:17Z

2

Try this:

if($match =~ /<a.*?href.*?><b>(.*?)<\/b>/)

That should take "everything after the href and between the <b>...</b> tags".

Alternatively, to get "everything after the last > and before the first </", you can use:

<a.*?href.*?>([^>]*?)<\/
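As a quick check, the second expression can be run against the four sample inputs from the question. This is only a minimal sketch: the helper name `first_text` and the `@samples` array are made up for illustration and are not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the second expression and return the
# captured text, or an empty string when nothing matches.
sub first_text {
    my ($html) = @_;
    return $html =~ /<a.*?href.*?>([^>]*?)<\// ? $1 : "";
}

# The four sample inputs from the question. q{} is used instead of q()
# because the third input contains an unbalanced ")".
my @samples = (
    q{<td><a href="/wiki/Unix_shell" title="Unix shell">Unix shell</a>},
    q{<a href="http://www.hp.com"><h1><b>HP</b></h1></a>},
    q{<a href="/wiki/Generic_programming" title="Generic programming">generic</a></td>);},
    q{<a href="#cite_note-1"><span>[</span>1<span>]</span></a>},
);

print first_text($_), "\n" for @samples;
```

The first three inputs give `Unix shell`, `HP` and `generic`. The fourth gives only `[`, because the capture stops at the first closing tag it meets, so text that is split across several child elements is not collected.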

edited Aug 28, 2013 at 13:08

answered Aug 28, 2013 at 13:01


5 Comments

Hmnshu Over a year ago

Floris, it could have one level of nesting more or less: <a href="#google"><h1><b><i>Google</i></b></h1></a>

Floris Over a year ago

@user1239790 - I have given a second expression to handle "any nesting".

Hmnshu Over a year ago

I tried the following, but it does not work: if($match =~ /<a.*?href.*?>([^>]*?)<\/a>/){ $title = $1; }else { $title=""; }

Floris Over a year ago

The expression you used is not the expression I gave. What result do you get when you use my second expression? I tested it on regexplanet.com/advanced/perl/index.html and it was fine.

Hmnshu Over a year ago

It's working... I just need to remove any HTML tags inside <a>***</a> and get the value.
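One way to read this last comment is as a two-step approach: capture everything between the opening `<a ...>` tag and its closing `</a>`, then delete any tags left inside the capture. The sketch below assumes well-behaved input (no nested `<a>` elements, no `>` inside attribute values, no comments or scripts), and the helper name `link_text` is hypothetical. It is not a substitute for a real HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: return the visible text of the first <a>...</a>
# in $html, or an empty string if there is no such element.
sub link_text {
    my ($html) = @_;

    # Step 1: capture everything between the opening <a ...> tag and the
    # first closing </a>. The \b keeps <abbr>, <address> etc. from matching.
    my ($inner) = $html =~ m{<a\b[^>]*>(.*?)</a>}is
        or return "";

    # Step 2: delete whatever tags remain inside the capture.
    $inner =~ s{<[^>]*>}{}g;
    return $inner;
}

# The four sample inputs from the question.
my @samples = (
    q{<td><a href="/wiki/Unix_shell" title="Unix shell">Unix shell</a>},
    q{<a href="http://www.hp.com"><h1><b>HP</b></h1></a>},
    q{<a href="/wiki/Generic_programming" title="Generic programming">generic</a></td>);},
    q{<a href="#cite_note-1"><span>[</span>1<span>]</span></a>},
);

print link_text($_), "\n" for @samples;
```

Under those assumptions the four inputs give `Unix shell`, `HP`, `generic` and `[1]`, and the nesting depth inside the link no longer matters because the tags are removed after the capture instead of being matched one by one.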

RobEarl · Answer · 2013-08-28 14:02:59Z

0

~~For this simple case you could use:~~ The requirements are no longer simple; see @amon's answer for how to use an HTML parser.

/<a.*?>([^<]+)</

Match an opening a tag, followed by anything until you find something between > and <.

Though, as others have mentioned, you should generally use an HTML parser.

echo '<td><a href="/wiki/Unix_shell" title="Unix shell">Unix shell</a>
<a href="http://www.hp.com"><h1><b>HP</b></h1></a>
<a href="/wiki/Generic_programming" title="Generic programming">generic</a></td>);' | perl -ne '/<a.*?>([^<]+)</; print "$1\n"'
Unix shell
HP
generic

edited Aug 28, 2013 at 14:02

answered Aug 28, 2013 at 13:04


10 Comments

RobEarl Over a year ago

@user1239790 are you starting to see why you shouldn't use regex?

Floris Over a year ago

Shouldn't your expression be /<a.*?>([^>]+)</ rather than /<a.*?>([^<]+)</, i.e. "no more closing brackets, and then the start of an end tag"?

Hmnshu Over a year ago

Yes, I see the issues regex has with tags. Can't we have a regex solution for the above?

hwnd Over a year ago

This will grab your text between your nested tags. /(?<=^|>)([^><]+?)(?=<|$)/

Floris Over a year ago

@hwnd yes it does - but it (the expression given in the answer, not the one in your comment) would fail with multiple nested tags.


user3676926 · Answer · 2014-05-26 19:09:21Z

0

I came up with this regex that works for all your sample inputs under PCRE. This regex is equivalent to a regular grammar with the tail-recursive pattern (?1)*:

(?<=>)((?:\w+)(?:\s*))(?1)*

Just take the first element of the returned array, i.e. array[0].

edited May 26, 2014 at 19:09

answered May 26, 2014 at 16:40
