2

Sorry if this has been asked before, but I couldn't find any answers on the web. I'm having a hard time figuring out the inverse of this regex:

"\"[^>]*\">"

I want to use replaceAll to replace everything except the link. So if I had a tag similar to this:

<p><a href="http://www.google.com">Google</a></p>

I need a regex that would satisfy this:

s.replaceAll(regex, "");

to give me this output:

http://www.google.com

I know there are better ways to do this, but I have to use a regex. Any help is really appreciated, thanks!

4 Answers

16

You do not have to use replaceAll. It is better to use a capturing group, like the following:

Pattern p = Pattern.compile("href=\"(.*?)\"");
Matcher m = p.matcher(html);
String url = null;
if (m.find()) {
    url = m.group(1); // this variable should contain the link URL
}

If your HTML contains several links, call m.find() in a loop.
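As a rough sketch of that loop, assuming every href value is double-quoted: the class name HrefExtractor, the method extractAll and the sample HTML are made up for illustration, while the pattern is the one from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HrefExtractor {
    // Same pattern as above: group 1 lazily captures the text between the quotes.
    private static final Pattern HREF = Pattern.compile("href=\"(.*?)\"");

    // Collects the href value of every match, in document order.
    static List<String> extractAll(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical input with two links
        String html = "<p><a href=\"http://www.google.com\">Google</a> "
                    + "<a href=\"http://example.com/page.html\">Page</a></p>";
        System.out.println(extractAll(html));
    }
}
```

Each call to m.find() resumes after the previous match, so the loop visits the links one by one.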


1 Comment

Thanks, it was hard for me to implement it because I was already using a pattern/matcher to find specific links that end in .htm and .html.
0

If you always have one such link in a string, try this:

"(^[^\"]*\")|(\"[^\"]*)$"

1 Comment

This worked, but failed when the tag had an 'id=' attribute before the href. I should've added that to my question, sorry.
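The behaviour described in this comment can be reproduced with a short sketch. ANSWER_REGEX is the pattern from the answer. HREF_REGEX is only one possible variation that anchors on href= instead of on the first quote; it is not from the answer, and it assumes a single link with a double-quoted href. The class name and the sample tags are made up.

```java
final class ReplaceAllLinkDemo {
    // Pattern from the answer: strips up to the first quote and from the last quote to the end.
    static final String ANSWER_REGEX = "(^[^\"]*\")|(\"[^\"]*)$";

    // Hypothetical variation: strips everything up to and including href=",
    // then everything from the next quote to the end of the input.
    static final String HREF_REGEX = "^[\\s\\S]*?href=\"|\"[\\s\\S]*$";

    public static void main(String[] args) {
        String plain = "<p><a href=\"http://www.google.com\">Google</a></p>";
        String withId = "<p><a id=\"x\" href=\"http://www.google.com\">Google</a></p>";

        System.out.println(plain.replaceAll(ANSWER_REGEX, ""));  // only the URL remains
        System.out.println(withId.replaceAll(ANSWER_REGEX, "")); // id value and href= are left over
        System.out.println(withId.replaceAll(HREF_REGEX, ""));   // only the URL remains
    }
}
```

The variation still breaks on single-quoted attributes or on input with several links, so the group-based approach in the other answers is likely the more robust choice.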
0

Use a method that returns a map of all the attributes of an HTML tag. First, find each opening a tag and pass its attribute text to that method, like...

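    Pattern linkPattern = Pattern.compile("<a (.*?)>");
    Matcher linkMatcher = linkPattern.matcher(in);
    while (linkMatcher.find()) {
        // group(1) holds the attribute text; look up href in the parsed map (null if absent)
        System.out.println(parseProperties(linkMatcher.group(1)).get("href"));
    }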

Get properties:

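import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Both members below belong inside a class.
// Matches name="value" pairs: group 1 is the attribute name, group 2 the double-quoted value.
// The name must be non-empty and may contain hyphens (e.g. data-id).
private static final Pattern PARSE_PATTERN = Pattern.compile("([\\w-]+)\\s*=\\s*\"(.*?)\"");

public static Map<String, String> parseProperties(String in) {

  Map<String, String> out = new HashMap<>();

  // Create matcher based on parsing pattern
  Matcher matcher = PARSE_PATTERN.matcher(in);

  // Populate map: attribute name -> attribute value
  while (matcher.find()) { out.put(matcher.group(1), matcher.group(2)); }

  return out;
}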
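Putting the two pieces together, a self-contained sketch might look like the following. The class name LinkAttributes, the hrefs helper and the sample HTML are assumptions made for illustration, and the attribute pattern is a slightly simplified variant of PARSE_PATTERN (greedy quantifiers, hyphens allowed in names).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkAttributes {
    // Captures the attribute text of each opening <a ...> tag.
    private static final Pattern LINK = Pattern.compile("<a\\s+(.*?)>");
    // Captures name="value" pairs inside that attribute text.
    private static final Pattern ATTR = Pattern.compile("([\\w-]+)\\s*=\\s*\"(.*?)\"");

    static Map<String, String> parseProperties(String in) {
        Map<String, String> out = new HashMap<>();
        Matcher matcher = ATTR.matcher(in);
        while (matcher.find()) {
            out.put(matcher.group(1), matcher.group(2));
        }
        return out;
    }

    // Returns the href of every <a> tag that has one, regardless of attribute order.
    static List<String> hrefs(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            String href = parseProperties(m.group(1)).get("href");
            if (href != null) {
                out.add(href);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical input where id comes before href
        String html = "<p><a id=\"x\" href=\"http://www.google.com\">Google</a></p>";
        System.out.println(hrefs(html));
    }
}
```

Because the lookup goes through the map, it does not matter whether id, class or any other attribute appears before href.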


-1

You can check out http://regexlib.com/ for all the regex help you need. The one below is for a URL:

^[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)$

4 Comments

The way it's currently written, that regex wouldn't work for sites with country-code domains like winchester.us, amazon.co.uk, amazon.ca, etc.
You are absolutely right. I made a mistake by imposing my own practice.
Also, it doesn't work with Java 6.0, at least not in the replaceAll method.
@user1070866, then that's the cherry on top for me.
