Regex, extract a href attribute from HTML with special name

Question

Given, for example, a string like this:

<a href="LINK_1" class="am"> Some Text</a>.. ANYTHING ..<a href="LINK_2" class="am"> Some Text</a><a href="SEARCHED_HREF_TO_EXTRACT" class="am"> SEARCHED_TEXT</a>..

I need to extract the value of the 'href' attribute from an HTML link whose text contains a searched word, such as 'SEARCHED_TEXT' in the example. Could you please advise how to do this correctly? I would not ask if I had not already spent a lot of time on it =)

I got this far, but unfortunately it does not work correctly:

String str = "<a href=\"LINK_1\" class=\"am\"> Some Text</a>.. ANYTHING ..<a href=\"LINK_2\" class=\"am\"> Some Text</a><a href=\"SEARCHED_HREF_TO_EXTRACT\" class=\"am\"> SEARCHED_TEXT</a>";
Pattern pattern = Pattern.compile("<a.*?href=\"(.*?)\".*SEARCHED_TEXT</a>");
Matcher matcher = pattern.matcher(str);

while (matcher.find()) {
    System.out.println(matcher.group(0)); // matches the whole string
    System.out.println(matcher.group(1)); // should be SEARCHED_HREF_TO_EXTRACT
}

I see that I need some negation after href="(.*?)" that accepts all symbols except </a> in order to find the correct HREF, but I can't make it work :(

You should use an HTML parser, not regex.

– Jens, Commented Feb 20, 2016 at 23:25
Mandatory link: stackoverflow.com/questions/1732348/…

– Pshemo, Commented Feb 20, 2016 at 23:27
Try "<a.*?href=\"([^\"]*)\"[^>]*>\\s+SEARCHED_TEXT</a>".

– user4910279, Commented Feb 20, 2016 at 23:29

Pshemo · Accepted Answer · 2016-02-20 23:41:56Z

Don't use regex here: it is not the proper tool for handling nested structures such as HTML/XML (at least not the regex flavor used in Java, which doesn't support recursion).
(More info: Can you provide some examples of why it is hard to parse XML and HTML with a regex?)
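
As a concrete illustration of how the pattern from the question goes wrong, here is a minimal sketch using only java.util.regex. The class name GreedyMatchDemo and the helper firstCapturedHref are made up for this illustration; the input string and the pattern are copied from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyMatchDemo {
    // Input string from the question.
    static final String HTML =
        "<a href=\"LINK_1\" class=\"am\"> Some Text</a>.. ANYTHING .."
        + "<a href=\"LINK_2\" class=\"am\"> Some Text</a>"
        + "<a href=\"SEARCHED_HREF_TO_EXTRACT\" class=\"am\"> SEARCHED_TEXT</a>";

    // Pattern from the question, unchanged.
    static final Pattern QUESTION_PATTERN =
        Pattern.compile("<a.*?href=\"(.*?)\".*SEARCHED_TEXT</a>");

    /** Hypothetical helper: returns group(1) of the first match, or null if nothing matches. */
    static String firstCapturedHref(String html) {
        Matcher m = QUESTION_PATTERN.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The match starts at the first "<a", so group(1) captures the first href.
        // The greedy ".*" then runs across the intermediate </a> tags
        // until it reaches "SEARCHED_TEXT</a>".
        System.out.println(firstCapturedHref(HTML)); // prints LINK_1
    }
}
```

The regex engine has no notion of where one element ends and the next begins, so nothing stops ".*" from spanning several links.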

The proper tool is an HTML/XML parser. I would probably choose jsoup because of its simplicity and CSS query support.

So your code could look like this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "<a href=\"LINK_1\" class=\"am\"> Some Text</a>.. ANYTHING ..<a href=\"LINK_2\" class=\"am\"> Some Text</a><a href=\"SEARCHED_HREF_TO_EXTRACT\" class=\"am\"> SEARCHED_TEXT</a>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a:contains(SEARCHED_TEXT)"); // :contains is case-insensitive
System.out.println(links.attr("href")); // href of the first matched element that has one

Or, if you expect to find many links, iterate over the found Elements and get the href attribute from each of them:

for(Element link : links){
    System.out.println(link.attr("href"));
}

Thanks, that's much better :)

– whatswrong, Commented over a year ago

Saleem · Accepted Answer · 2016-02-21 01:47:47Z

Well, if I'm reading this correctly, you want to extract the href of links whose text matches a search term.

If that is the case, it can be achieved with a slight modification of the regex:

    String str = "<a href=\"LINK_1\" class=\"am\"> Some Text</a>.. ANYTHING ..<a href=\"LINK_2\" class=\"am\"> Some Text</a><a href=\"SEARCHED_HREF_TO_EXTRACT\" class=\"am\"> SEARCHED_TEXT</a>";

    // \\s+ (not \\s*) after "<a" so that something like "<ahref=" is not treated as a link
    Pattern regex = Pattern.compile("<a\\s+href=[\"']([^'\"]+?)[\"'][^>]*?>\\s*SEARCHED_TEXT\\s*</a>", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    Matcher regexMatcher = regex.matcher(str);
    while (regexMatcher.find()) {
        System.out.println(regexMatcher.group(1));
    }

The code snippet above extracts only SEARCHED_HREF_TO_EXTRACT.
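
If the search term is not a fixed literal, the same idea can be wrapped in a small helper. This is only a sketch: the class LinkHrefFinder and the method hrefsForLinkText are hypothetical names, and it assumes the link text is plain text with no nested tags. Pattern.quote keeps regex metacharacters in the search term from being interpreted, and [^>]* keeps each part of the match inside a single opening tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkHrefFinder {
    /**
     * Hypothetical helper: returns the href values of all <a> elements whose
     * text equals searchText, ignoring case and surrounding whitespace.
     * Assumes quoted href values and link text without nested markup.
     */
    static List<String> hrefsForLinkText(String html, String searchText) {
        Pattern p = Pattern.compile(
            "<a\\s[^>]*?href=[\"']([^\"']+)[\"'][^>]*>\\s*"
                + Pattern.quote(searchText)   // treat the search term literally
                + "\\s*</a>",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher m = p.matcher(html);
        List<String> hrefs = new ArrayList<>();
        while (m.find()) {
            hrefs.add(m.group(1));            // group 1 is the href value
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String str = "<a href=\"LINK_1\" class=\"am\"> Some Text</a>.. ANYTHING .."
            + "<a href=\"LINK_2\" class=\"am\"> Some Text</a>"
            + "<a href=\"SEARCHED_HREF_TO_EXTRACT\" class=\"am\"> SEARCHED_TEXT</a>";
        System.out.println(hrefsForLinkText(str, "SEARCHED_TEXT")); // prints [SEARCHED_HREF_TO_EXTRACT]
    }
}
```

This remains a pattern match on text rather than real HTML parsing, so unquoted attributes, nested tags inside the link, or attributes such as data-href can still trip it up.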
