1
$match = q(<a href="#google"><h1><b>Google</b></h1></a>);
if ($match =~ /<a.*?href.*?><.?>(.*?)<\/a>/) {
    $title = $1;
} else {
    $title = "";
}
print "$title";

OUTPUT: Google</b></h1>

It should be: Google

I am unable to extract the link text using a regex in Perl. The text could be nested one level deeper or shallower, for example:

<h1><b><i>Google</i></b></h1>

Please try it with these inputs:

1) <td><a href="/wiki/Unix_shell" title="Unix shell">Unix shell</a>

2) <a href="http://www.hp.com"><h1><b>HP</b></h1></a>

3) <a href="/wiki/Generic_programming" title="Generic programming">generic</a></td>);

4) <a href="#cite_note-1"><span>[</span>1<span>]</span></a>

Expected output:

Unix shell

HP

generic

[1]

5
  • Don't use Regex to parse HTML. It's a Bad Idea™. Commented Aug 28, 2013 at 12:57
  • Your expression says "take everything until the closing </a>", and that's what you get. You need to use <\/b>. Commented Aug 28, 2013 at 12:59
  • Perl has many fine HTML parsers (such as this one). Don't use regex. Commented Aug 28, 2013 at 12:59
  • I know, but the regex for extracting the value is almost working; it only fails to exclude the closing tags. Any ideas? Commented Aug 28, 2013 at 12:59
  • stackoverflow.com/a/1732454 Commented Aug 28, 2013 at 13:11

4 Answers

5

Don't use regexes, as mentioned in the comments. I am especially fond of the Mojo suite, which allows me to use CSS selectors:

use Mojo::DOM;

my $dom = Mojo::DOM->new(q(<a href="#google"><h1><b>Google</b></h1></a>));

print $dom->at('a[href="#google"]')->all_text, "\n";

Or with HTML::TreeBuilder::XPath:

use HTML::TreeBuilder::XPath;

my $dom = HTML::TreeBuilder::XPath->new_from_content(q(<a href="#google"><h1><b>Google</b></h1></a>));

print $dom->findvalue('//a[@href="#google"]'), "\n";
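As a rough sketch, the same Mojo::DOM approach could be applied to all four sample inputs from the question in one pass. This assumes the Mojolicious distribution is installed; the variable names `$html` and `@titles` are illustrative and not from the question.

```perl
use strict;
use warnings;
use Mojo::DOM;

# The four sample inputs from the question, joined into one fragment.
my $html = join "\n",
    q(<td><a href="/wiki/Unix_shell" title="Unix shell">Unix shell</a>),
    q(<a href="http://www.hp.com"><h1><b>HP</b></h1></a>),
    q(<a href="/wiki/Generic_programming" title="Generic programming">generic</a></td>);),
    q(<a href="#cite_note-1"><span>[</span>1<span>]</span></a>);

my $dom = Mojo::DOM->new($html);

# find('a') returns a collection of every <a> element; all_text gathers
# the text of each element and all of its descendants.
my @titles = $dom->find('a')->map('all_text')->each;

print "$_\n" for @titles;
```

Because the selector is just `a`, the nesting depth inside each link does not matter here.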

2

Try this:

if($match =~ /<a.*?href.*?><b>(.*?)<\/b>/)

That should capture everything after the href and between the <b>...</b> tags.

Alternatively, to get everything after the last > and before the first </, you can use:

<a.*?href.*?>([^>]*?)<\/
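For inputs where tags and text are interleaved inside the link (such as `<span>[</span>1<span>]</span>`), a single capture between two tags only returns part of the text. One possible workaround, sketched here in core Perl, is to capture the whole body of the link first and strip the tags in a second step. The helper name `link_text` is hypothetical, and the sketch assumes no attribute value contains a `>` and that links are not nested; an HTML parser remains the more robust choice.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the visible text of
# the first <a ...>...</a> element in $html, or '' if there is none.
sub link_text {
    my ($html) = @_;

    # Step 1: capture everything between the opening <a ...> tag and the
    # first closing </a>. /s lets . match newlines, /i ignores case.
    my ($inner) = $html =~ m{<a\b[^>]*>(.*?)</a>}si
        or return '';

    # Step 2: delete every tag nested inside the link, keeping the text.
    $inner =~ s/<[^>]+>//g;

    return $inner;
}

my @samples = (
    q(<td><a href="/wiki/Unix_shell" title="Unix shell">Unix shell</a>),
    q(<a href="http://www.hp.com"><h1><b>HP</b></h1></a>),
    q(<a href="/wiki/Generic_programming" title="Generic programming">generic</a></td>);),
    q(<a href="#cite_note-1"><span>[</span>1<span>]</span></a>),
);

print link_text($_), "\n" for @samples;
```

Splitting the work into two steps keeps each regex simple, at the cost of the assumptions listed above.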

5 Comments

Floris, it could have one more or one fewer level of nesting: <a href="#google"><h1><b><i>Google</i></b></h1></a>
@user1239790 - I have given a second expression to handle "any nesting".
I tried the following, but it does not work: if($match =~ /<a.*?href.*?>([^>]*?)<\/a>/){ $title = $1; }else { $title=""; }
The expression you used is not the expression I gave. What result do you get when you use my second expression? I tested it on regexplanet.com/advanced/perl/index.html and it was fine.
It's working. I just need to remove any HTML tags inside <a>***</a> and get the value.
0

For the original, simple case you could use the regex below. The requirements are no longer simple, though; look at @amon's answer for how to use an HTML parser.


/<a.*?>([^<]+)</

This matches an opening a tag, followed by anything, until it finds non-empty text between a > and the next <.

Though, as others have mentioned, you should generally use an HTML parser.

echo '<td><a href="/wiki/Unix_shell" title="Unix shell">Unix shell</a>
<a href="http://www.hp.com"><h1><b>HP</b></h1></a>
<a href="/wiki/Generic_programming" title="Generic programming">generic</a></td>);' | perl -ne '/<a.*?>([^<]+)</; print "$1\n"'
Unix shell
HP
generic

10 Comments

@user1239790 are you starting to see why you shouldn't use regex?
Shouldn't your expression be /<a.*?>([^>]+)</ rather than /<a.*?>([^<]+)</, i.e. "no more close brackets, and then an open end-of-tag"?
Yes, I saw the regex issues with tags. Can't we have a regex solution for the above?
This will grab your text between your nested tags. /(?<=^|>)([^><]+?)(?=<|$)/
@hwnd yes it does - but it (the expression given in the answer, not the one in your comment) would fail with multiple nested tags.
0

I came up with this regex, which works for all your sample inputs under PCRE. This regex is equivalent to a regular grammar with the tail-recursive pattern (?1)*:

(?<=>)((?:\w+)(?:\s*))(?1)*

Just take the first element of the returned array, i.e. array[0].
