Given, for example, a string like this:

<a href="LINK_1" class="am"> Some Text</a>.. ANYTHING ..<a href="LINK_2" class="am"> Some Text</a><a href="SEARCHED_HREF_TO_EXTRACT" class="am"> SEARCHED_TEXT</a>..

I need to extract the value of the 'href' attribute from the HTML link whose text contains a searched word, such as 'SEARCHED_TEXT' in the example. Could you please advise how to do this correctly? I would not ask if I had not already spent a lot of time on it =)

I got this far, but unfortunately it does not work correctly:

String str = "<a href=\"LINK_1\" class=\"am\"> Some Text</a>.. ANYTHING ..<a href=\"LINK_2\" class=\"am\"> Some Text</a><a href=\"SEARCHED_HREF_TO_EXTRACT\" class=\"am\"> SEARCHED_TEXT</a>";
Pattern pattern = Pattern.compile("<a.*?href=\"(.*?)\".*SEARCHED_TEXT</a>");
Matcher matcher = pattern.matcher(str);

while (matcher.find()) {
    System.out.println(matcher.group(0)); // matched whole string
    System.out.println(matcher.group(1)); // should be SEARCHED_HREF_TO_EXTRACT
}

I see that I need some kind of negation after href="(.*?)" that accepts any characters except </a> in order to find the correct href, but I can't make it work :(

3 Comments
  • You should use an HTML parser, not regex. Commented Feb 20, 2016 at 23:25
  • Mandatory link: stackoverflow.com/questions/1732348/… Commented Feb 20, 2016 at 23:27
  • Try "<a.*?href=\"([^\"]*)\"[^>]*>\\s+SEARCHED_TEXT</a>" Commented Feb 20, 2016 at 23:29

2 Answers

Don't use regex here: it is not the proper tool for handling nested structures like HTML/XML (at least not the regex flavor used in Java, since it doesn't support recursion).
(More info: Can you provide some examples of why it is hard to parse XML and HTML with a regex?)

The proper tool is an HTML/XML parser. I would probably choose jsoup because of its simplicity and CSS query support.

So your code could look like:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

String html = "<a href=\"LINK_1\" class=\"am\"> Some Text</a>.. ANYTHING ..<a href=\"LINK_2\" class=\"am\"> Some Text</a><a href=\"SEARCHED_HREF_TO_EXTRACT\" class=\"am\"> SEARCHED_TEXT</a>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a:contains(SEARCHED_TEXT)"); // :contains is case-insensitive
System.out.println(links.attr("href")); // href of the first matched link that has one

Or, if you expect to find many links, iterate over the found Elements and get the href attribute from each of them:

for(Element link : links){
    System.out.println(link.attr("href"));
}

1 Comment

Thanks, that's much better :)

Well, if I'm reading this correctly, you want to extract the href of links whose text matches a search term.

If this is the case, it can be achieved with a slight modification of the regex:

    String str = "<a href=\"LINK_1\" class=\"am\"> Some Text</a>.. ANYTHING ..<a href=\"LINK_2\" class=\"am\"> Some Text</a><a href=\"SEARCHED_HREF_TO_EXTRACT\" class=\"am\"> SEARCHED_TEXT</a>";

    // \\s+ (not \\s*) so that at least one whitespace character separates "<a" from "href"
    Pattern regex = Pattern.compile("<a\\s+href=[\"']([^'\"]+?)[\"'][^>]*?>\\s*SEARCHED_TEXT\\s*</a>", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    Matcher regexMatcher = regex.matcher(str);
    while (regexMatcher.find()) {
        System.out.println(regexMatcher.group(1));
    }

The code snippet above will extract only SEARCHED_HREF_TO_EXTRACT.
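
For the "accept all symbols except </a>" idea from the question, one possible sketch is a negative lookahead that is checked before each character of the link text, so the match can never run past the closing tag of the link it started in. The class name HrefByLinkText and the method findHrefs are made up for this illustration. It assumes double-quoted href values and no nested <a> elements, and it is still only a regex workaround, not a replacement for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the original answers.
class HrefByLinkText {

    // Returns the href of every <a> whose text contains searchedText (case-sensitive).
    // Assumes double-quoted href values and no nested <a> elements.
    static List<String> findHrefs(String html, String searchedText) {
        Pattern pattern = Pattern.compile(
                "<a\\b[^>]*?\\shref=\"([^\"]*)\"[^>]*>"  // opening tag; group 1 is the href value
                + "(?:(?!</a>).)*?"                      // link text, never crossing a </a>
                + Pattern.quote(searchedText)            // the searched word, taken literally
                + "(?:(?!</a>).)*?</a>",                 // rest of the text up to this link's own </a>
                Pattern.DOTALL);
        List<String> hrefs = new ArrayList<>();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            hrefs.add(matcher.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String str = "<a href=\"LINK_1\" class=\"am\"> Some Text</a>.. ANYTHING .."
                + "<a href=\"LINK_2\" class=\"am\"> Some Text</a>"
                + "<a href=\"SEARCHED_HREF_TO_EXTRACT\" class=\"am\"> SEARCHED_TEXT</a>";

        // Pattern from the question: the greedy .* runs across the earlier </a> tags,
        // so the match starts at the first link and group 1 is its href.
        Matcher original = Pattern.compile("<a.*?href=\"(.*?)\".*SEARCHED_TEXT</a>").matcher(str);
        if (original.find()) {
            System.out.println(original.group(1)); // prints LINK_1
        }

        System.out.println(findHrefs(str, "SEARCHED_TEXT")); // prints [SEARCHED_HREF_TO_EXTRACT]
    }
}
```

The main method also shows why the pattern from the question goes wrong: nothing stops its `.*` from spanning several links, which is what the `(?!</a>)` check prevents.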
