2

Greetings all. I am using the following regex to detect URLs in a string and wrap them in an <a> tag:

public static String detectUrls(String text) {

        String newText = text
                .replaceAll("(?:https?|ftps?|http?)://[\\w/%.-?&=]+",
                        "<a href='$0'>$0</a>").replaceAll(
                        "(www\\.)[\\w/%.-?&=]+", "<a href='http://$0'>$0</a>");
        return newText;
    }

My problem is that the following links are not detected correctly. I am not that good with regex, so please advise.

http://code.google.com/p/shindig-dnd/

http://confluence.atlassian.com/display/GADGETDEV/Gadgets+and+JIRA+Portlets

www.liferay.com/web/raymond.auge/blog/

(www.opensocial.org/)

http://www.google.com


3 Answers

3

I'm using this:

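private static final String URL_REGEX =
   "http(s)?://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&amp;\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
// Assumed declaration: the snippet uses URL_PATTERN without showing where it is compiled.
private static final Pattern URL_PATTERN = Pattern.compile(URL_REGEX);

// Wrap every match in an anchor tag; $0 is the whole matched URL.
Matcher matcher = URL_PATTERN.matcher(text);
text = matcher.replaceAll("<a href=\"$0\">$0</a>");
return text;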

2 Comments

Declaring & instead of &amp; would suffice, because a, m and p are already in the range a-z and ; is declared twice.
This pattern works fine for most cases, but it didn't catch this case: (www.opensocial.org)
2

The problem is that you are using - inside a character class ([]) without escaping it, so it defines the range .-? (i.e. the characters ./0123456789:;<=>?) instead of matching a literal hyphen. Either escape it as \\- or put it at the end of the character class so that it doesn't form a range.

public static String detectUrls(String text) {
    String newText = text
            .replaceAll("(?:https?|ftps?|http?)://[\\w/%.\\-?&=]+",
                    "<a href='$0'>$0</a>").replaceAll(
                    "(www\\.)[\\w/%.\\-?&=]+", "<a href='http://$0'>$0</a>");
    return newText;
}
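
To make the difference concrete, here is a minimal sketch that compares the two character classes on their own. The class name HyphenRangeDemo and the helper matchesWhole are made up for illustration and are not part of the original code.

```java
import java.util.regex.Pattern;

class HyphenRangeDemo {
    // Unescaped: ".-?" is read as a range from '.' to '?', so '-' itself is not in the class.
    static final Pattern UNESCAPED = Pattern.compile("[\\w/%.-?&=]+");
    // Escaped: '-' is a literal member of the class.
    static final Pattern ESCAPED = Pattern.compile("[\\w/%.\\-?&=]+");

    // Hypothetical helper: true if the whole string consists of characters from the class.
    static boolean matchesWhole(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole(UNESCAPED, "shindig-dnd")); // false: '-' is rejected
        System.out.println(matchesWhole(ESCAPED, "shindig-dnd"));   // true
        System.out.println(matchesWhole(UNESCAPED, "a<b>c"));       // true: '<' and '>' fall inside the range
    }
}
```

Note that neither class contains +, so a path such as Gadgets+and+JIRA+Portlets would still be cut off at the first + unless that character is added to the class.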

10 Comments

@marcog: there's actually one pattern that's still not caught: something like http: //www.google.com
@sword Is that space after http: a typo?
@marcog, yes, I added it on purpose: without the space the editor converts the link to google.com, so I added it to skip the editor formatting. You know what I mean, right?
@marcog, what do you suggest?
@sword Swap the replaceAll() calls around and use negative lookbehind. Here's it working: http://ideone.com/Dj6ew with one minor issue - it also adds http:// in front of the displayed URL. This is a limitation of regular expressions, and to fix it you'll have to parse the text in one pass without regular expressions.
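
One possible reading of that suggestion, as a rough sketch only (the linked ideone code is not reproduced here, and the class name UrlLinker and the constant TAIL are assumptions): the first pass gives bare www. links a scheme, using a negative lookbehind to skip any www. that already follows ://, and the second pass wraps every URL that has a scheme.

```java
class UrlLinker {
    // Character class from the answer above, with the hyphen escaped.
    private static final String TAIL = "[\\w/%.\\-?&=]+";

    static String detectUrls(String text) {
        return text
                // Pass 1: prefix bare "www." links with a scheme; the negative
                // lookbehind (?<!://) skips "www." that already follows "://".
                .replaceAll("(?<!://)www\\." + TAIL, "http://$0")
                // Pass 2: wrap every URL that now has a scheme in an anchor tag.
                .replaceAll("(?:https?|ftps?)://" + TAIL, "<a href='$0'>$0</a>");
    }

    public static void main(String[] args) {
        System.out.println(detectUrls("http://www.google.com"));
        System.out.println(detectUrls("(www.opensocial.org/)"));
    }
}
```

As the comment says, the displayed text of a bare www. link ends up with http:// in front of it. This sketch is also not robust: a www. that appears in the middle of a path, for example, would get a scheme inserted too.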
1

As marcog said, you should escape the -. To match the last 2 examples you gave, you also have to make the http optional. Also, http? matches htt, which is not a valid protocol.

So the regex will be:

"(?:(?:https?|ftps?)://)?[\\w/%.?&=-]+"
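
One caveat worth checking before using this pattern: once the scheme is optional, the character class on its own also matches ordinary words. The sketch below compares it with a hypothetical tighter variant that requires either a scheme or a leading www.; the names OptionalSchemeDemo, LOOSE, STRICT and matchesWhole are made up for illustration.

```java
import java.util.regex.Pattern;

class OptionalSchemeDemo {
    // The pattern from this answer: scheme optional, hyphen placed last in the class.
    static final Pattern LOOSE =
            Pattern.compile("(?:(?:https?|ftps?)://)?[\\w/%.?&=-]+");
    // Hypothetical tighter variant: require a scheme or a leading "www.".
    static final Pattern STRICT =
            Pattern.compile("(?:(?:https?|ftps?)://|www\\.)[\\w/%.?&=-]+");

    // Hypothetical helper: true if the pattern matches the whole string.
    static boolean matchesWhole(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole(LOOSE, "hello"));   // true: a plain word is treated as a URL
        System.out.println(matchesWhole(STRICT, "hello"));  // false
        System.out.println(matchesWhole(STRICT, "www.liferay.com/web/raymond.auge/blog/")); // true
        System.out.println(matchesWhole(STRICT, "http://www.google.com"));                  // true
    }
}
```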
