Java Regex to get the text from HTML anchor (<a>...</a>) tags

Question

I'm trying to get the text within a certain tag. So if I have:

<a href="http://something.com">Found</a>

I want to be able to retrieve the Found text.

I'm trying to do it using regex. I am able to do it if the <a href="http://something.com"> part stays the same, but it doesn't.

So far I have this:

Pattern titleFinder = Pattern.compile( ".*[a-zA-Z0-9 ]* ([a-zA-Z0-9 ]*)</a>.*" );

I think the last two parts, ([a-zA-Z0-9 ]*)</a> and .*, are OK, but I don't know what to do for the first part.

Don't parse HTML with regex. Use a proper XML/HTML parser...
– ircmaxell, Commented Jan 7, 2011 at 18:05
Thanks for the reply, I'll look into it =D but I'm not doing it for a lot of HTML tags; it's only for this one tag, which occurs 15 times... is that still bad?
– BeginnerPro, Commented Jan 7, 2011 at 18:24
Java’s regexes are not powerful enough to parse HTML; other languages’, however, are. Why anyone in their right mind would use Java for regex work is utterly beyond me.
– tchrist, Commented Feb 16, 2011 at 17:10

Alexander Zhugastrov · Accepted Answer · 2011-02-16 15:07:19Z

6

As they said, don't use regex to parse HTML. If you are aware of the shortcomings, you might get away with it, though. Try:

Pattern titleFinder = Pattern.compile("<a[^>]*>(.*?)</a>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
Matcher regexMatcher = titleFinder.matcher(subjectString);
while (regexMatcher.find()) {
    // matched text: regexMatcher.group(1)
}

This will iterate over all matches in the string.

It won't handle nested <a> tags and ignores all the attributes inside the tag.
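For anyone who wants to try this out, here is a minimal self-contained sketch of the same approach. The class name AnchorTextExtractor, the method extractAnchorTexts and the sample input are made up for illustration. The only change to the pattern is an added \b after the a, which is my own addition so that tags such as <abbr> are not mistaken for anchors. The limitations above (nested tags, ignored attributes) still apply.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorTextExtractor {
    // Opening <a ...> tag, lazily captured body, closing </a>.
    // The \b is an addition to the answer's pattern: it stops <abbr>, <address>, etc. from matching.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\b[^>]*>(.*?)</a>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the text between each <a ...> and the next </a>, in document order.
    public static List<String> extractAnchorTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher matcher = ANCHOR.matcher(html);
        while (matcher.find()) {
            texts.add(matcher.group(1)); // group(1) is the captured body
        }
        return texts;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not taken from the question.
        String html = "<a href=\"http://something.com\">Found</a> and <A HREF='x'>Also found</A>";
        System.out.println(extractAnchorTexts(html)); // prints [Found, Also found]
    }
}
```

Because the body is captured lazily with (.*?), each match stops at the first </a> it meets, which is why nested anchors are not handled correctly.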

edited Feb 16, 2011 at 15:07

Alexander Zhugastrov

answered Jan 7, 2011 at 18:17

Tim Pietzcker

Community · Accepted Answer · 2017-05-23 12:04:02Z

0

str.replaceAll("</?a>", "");

Here is an online ideone demo.

Here is a similar topic: How to remove the tags only from a text?
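Note that the pattern "</?a>" only matches bare <a> and </a> tags, so an opening tag with an href is left in place. A possible variant that also strips anchors carrying attributes is sketched below. The class and method names are hypothetical, and the [^>]* part assumes no attribute value contains a literal > character.

```java
public class AnchorTagStripper {
    // Removes opening and closing anchor tags, with or without attributes,
    // and leaves everything else (including the link text) untouched.
    public static String stripAnchorTags(String html) {
        // (?i) = case-insensitive; \b keeps <abbr>, <address>, etc. from matching.
        return html.replaceAll("(?i)</?a\\b[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://something.com\">Found</a>";
        System.out.println(stripAnchorTags(html)); // prints Found
    }
}
```

This only removes the tags. Any text surrounding the anchor stays in the result, so it returns just the link text only when the input is the anchor alone.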

edited May 23, 2017 at 12:04

CommunityBot

answered Jan 7, 2011 at 18:16

user467871

1 Comment

Bill the Lizard Over a year ago

This ignores the href and any other attributes.
