6

I'm trying to get the text inside a certain tag. So if I have:

<a href="http://something.com">Found</a>

I want to be able to retrieve the Found text.

I'm trying to do it using regex. I am able to do it if the <a href="http://something.com"> part stays the same, but it doesn't.

So far I have this:

Pattern titleFinder = Pattern.compile( ".*[a-zA-Z0-9 ]* ([a-zA-Z0-9 ]*)</a>.*" );

I think the last two parts, the ([a-zA-Z0-9 ]*)</a>.* , are OK, but I don't know what to do for the first part.

3
  • 9
    Don't parse HTML with regex. Use a proper XML/HTML parser... Commented Jan 7, 2011 at 18:05
  • Thanks for the reply, I'll look into it =D But I'm not doing it for a lot of HTML tags; it's only for this one tag, which occurs 15 times... is that still bad? Commented Jan 7, 2011 at 18:24
  • Java’s regexes are not powerful enough to parse HTML; other languages’, however, are. Why anyone in their right mind would use Java for regex work is utterly beyond me. Commented Feb 16, 2011 at 17:10

2 Answers

6

As they said, don't use regex to parse HTML. If you are aware of the shortcomings, you might get away with it, though. The following will iterate over all matches in a string:

// subjectString holds the HTML to search
Pattern titleFinder = Pattern.compile("<a[^>]*>(.*?)</a>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
Matcher regexMatcher = titleFinder.matcher(subjectString);
while (regexMatcher.find()) {
    // matched text: regexMatcher.group(1)
}

It won't handle nested <a> tags and ignores all the attributes inside the tag.
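
For anyone who wants to try this out, here is a minimal self-contained sketch that wraps the same pattern in a helper and collects the captured text. The class name AnchorTextDemo, the method anchorTexts, and the sample input are made up for illustration and are not part of the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchorTextDemo {
    // Same pattern as above: skip everything up to the first '>',
    // then lazily capture the text up to the next closing </a>.
    private static final Pattern TITLE_FINDER =
            Pattern.compile("<a[^>]*>(.*?)</a>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the captured text of every match, in order.
    static List<String> anchorTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher matcher = TITLE_FINDER.matcher(html);
        while (matcher.find()) {
            texts.add(matcher.group(1));
        }
        return texts;
    }

    public static void main(String[] args) {
        // Illustrative input only; the second link uses upper case and an extra attribute.
        String html = "<a href=\"http://something.com\">Found</a> and "
                + "<A HREF=\"http://other.example\" class=\"x\">Also found</A>";
        System.out.println(anchorTexts(html)); // prints [Found, Also found]
    }
}
```

One more caveat worth knowing: <a[^>]*> also matches the opening of any tag whose name merely starts with "a" (for example <abbr>), so if the input can contain such tags, <a\b[^>]*> is probably a safer opening.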

0
str.replaceAll("</?a>", "");

Here is an online ideone demo.

Here is a similar topic: How to remove the tags only from a text?

1 Comment

This ignores the href and any other attributes.
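
To illustrate the comment: with the question's input, the pattern "</?a>" only removes the closing </a> and leaves <a href="http://something.com"> in place, because it does not allow for attributes. A possible adjustment, as a sketch only (the class name StripAnchorTags and the method stripAnchors are hypothetical), is to let the pattern skip over attributes:

```java
public class StripAnchorTags {
    // Removes opening and closing <a> tags, including any attributes.
    // (?i)   case-insensitive matching
    // \b     stops the pattern from matching tags such as <abbr>
    // [^>]*  skips the attributes up to the closing '>'
    static String stripAnchors(String html) {
        return html.replaceAll("(?i)</?a\\b[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://something.com\">Found</a>";
        System.out.println(stripAnchors(html)); // prints Found
    }
}
```

This still assumes that no attribute value contains a '>' character, which is one of the reasons a real HTML parser is the safer choice.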
