Perfect URL validation regex in Java

Question

I found this page: https://mathiasbynens.be/demo/url-regex, which lists different regular expressions for URL validation and compares what each of them can handle. Diego Perini's regex is the most powerful one, and I would like to use it in Java. However, it doesn't work when I use it like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class URLValidation {
    // "\" replaced by "\\"
    private static Pattern REGEX = Pattern.compile("_^(?:(?:https?|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?!10(?:\\.\\d{1,3}){3})(?!127(?:\\.\\d{1,3}){3})(?!169\\.254(?:\\.\\d{1,3}){2})(?!192\\.168(?:\\.\\d{1,3}){2})(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\x{00a1}-\\x{ffff}0-9]+-?)*[a-z\\x{00a1}-\\x{ffff}0-9]+)(?:\\.(?:[a-z\\x{00a1}-\\x{ffff}0-9]+-?)*[a-z\\x{00a1}-\\x{ffff}0-9]+)*(?:\\.(?:[a-z\\x{00a1}-\\x{ffff}]{2,})))(?::\\d{2,5})?(?:/[^\\s]*)?$_iuS");

    private static String[] URLs = new String[] { "http://foo.com/blah_blah", "http://foo.com/blah_blah/", "http://foo.com/blah_blah_(wikipedia)", "http://foo.bar?q=Spaces should be encoded" };

    public static void main(String[] args) throws Exception {
        for (String url : URLs) {
            Matcher matcher = REGEX.matcher(url);
            if (matcher.find()) {
                System.out.println(matcher.group());
            }
        }
    }
}

This code prints nothing, although it should print the first three URLs in the array. How do I compile the regex properly to get the code working?

upd: Thanks for the proposals. I tested your regexes in the real application, where I iterate through log files and look for URLs in each line. The log files have timestamps and usernames enclosed in [] and <> respectively, and sometimes contain special invisible characters responsible for formatting (color, boldness, etc.) such as \u0003. The regex seems to have a problem with that type of string: http://ideone.com/WEcgBY

upd2: And what about a regex that finds all URLs in a line if it contains several? For example, to use it like this:

String[] urlsFromLine = REGEX.split(line);
for (String url : urlsFromLine) {
    System.out.println(url);
}

Wiktor Stribiżew · Accepted Answer · 2015-07-15 21:53:17Z

4

Use this version:

"(?i)^(?:(?:https?|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?!(?:10|127)(?:\\.\\d{1,3}){3})(?!(?:169\\.254|192\\.168)(?:\\.\\d{1,3}){2})(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,}))\\.?)(?::\\d{2,5})?(?:[/?#]\\S*)?$"

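Java has no regex delimiters or trailing modifiers: the leading _ and the trailing _iuS are read as literal text, so the pattern can never match (a literal _ would have to come before the start of the string). Put the case-insensitive flag inline as (?i) instead. You also do not need to escape the forward slashes, and the code point ranges can be written in \u notation instead of \x{...}.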
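To illustrate the point about delimiters and modifiers, here is a minimal sketch. The class name FlagDemo and the shortened CORE pattern are assumptions made up for this example and are not the full URL regex; only Pattern.compile, its flag constants, and Matcher.find are standard java.util.regex API.

```java
import java.util.regex.Pattern;

class FlagDemo {
    // Shortened stand-in pattern (an assumption for illustration, not the full URL regex).
    static final String CORE = "^https?://\\S+$";

    // PHP style: delimiters and trailing modifiers kept. Java reads them as literal characters.
    static final Pattern PHP_STYLE = Pattern.compile("_" + CORE + "_iuS");

    // Java style, option 1: put the flag inline at the start of the pattern.
    static final Pattern INLINE_FLAG = Pattern.compile("(?i)" + CORE);

    // Java style, option 2: pass the flags as the second argument of Pattern.compile.
    static final Pattern COMPILE_FLAG =
            Pattern.compile(CORE, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    static boolean found(Pattern p, String s) {
        return p.matcher(s).find();
    }

    public static void main(String[] args) {
        String url = "HTTP://foo.com/blah_blah";
        System.out.println(found(PHP_STYLE, url));    // false: "_" cannot precede the start of input
        System.out.println(found(INLINE_FLAG, url));  // true
        System.out.println(found(COMPILE_FLAG, url)); // true
    }
}
```

Either way of setting the flag should behave the same here; the inline form has the advantage that the pattern string carries its own options.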

See IDEONE demo:

String[] URLs = new String[] { "http://foo.com/blah_blah", "http://foo.com/blah_blah/", "http://foo.com/blah_blah_(wikipedia)", "http://foo.bar?q=Spaces should be encoded" };
Pattern REGEX = Pattern.compile("(?i)^(?:(?:https?|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?!(?:10|127)(?:\\.\\d{1,3}){3})(?!(?:169\\.254|192\\.168)(?:\\.\\d{1,3}){2})(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,}))\\.?)(?::\\d{2,5})?(?:[/?#]\\S*)?$");
for (String url : URLs) {
    Matcher matcher = REGEX.matcher(url);
    if (matcher.find()) {
        System.out.println(matcher.group());
    }
}

Output:

http://foo.com/blah_blah
http://foo.com/blah_blah/
http://foo.com/blah_blah_(wikipedia)

UPDATE

To match URLs in larger texts, you need to replace ^ and $ with \\b:

Pattern REGEX = Pattern.compile("(?i)\\b(?:(?:https?|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?!(?:10|127)(?:\\.\\d{1,3}){3})(?!(?:169\\.254|192\\.168)(?:\\.\\d{1,3}){2})(?!172\\.(?:1[6-9]|2\\d|3[0-1])(?:\\.\\d{1,3}){2})(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[1-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]-*)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,}))\\.?)(?::\\d{2,5})?(?:[/?#]\\S*)?\\b");

See another demo
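To collect every URL in a line that contains several, one option is to call Matcher.find() in a loop rather than splitting. The sketch below is hedged: the class UrlExtractor, the method extract, and the short pattern are hypothetical stand-ins, and the character class that stops at whitespace, brackets, and control characters is an assumption aimed at log lines like the ones described in the question. In practice the full \\b-delimited pattern would take its place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlExtractor {
    // Simplified stand-in (assumption): replace with the full \\b-delimited URL regex.
    // The URL body stops at whitespace, <, >, [, ] and control characters such as \u0003.
    static final Pattern URL =
            Pattern.compile("(?i)\\b(?:https?|ftp)://[^\\s<>\\[\\]\\p{Cntrl}]+");

    static List<String> extract(String line) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(line);
        // Each call to find() resumes after the previous match, so all URLs are visited.
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String line = "[12:00] <danny> see http://foo.com/a and \u0003ftp://bar.org/b\u0003 please";
        for (String url : extract(line)) {
            System.out.println(url);
        }
    }
}
```

Pattern.split would not help here, because it treats the matches as delimiters and returns the text between the URLs rather than the URLs themselves.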

edited Jul 15, 2015 at 21:53

answered Jul 15, 2015 at 20:57


11 Comments

Wiktor Stribiżew Over a year ago

That means you need to adapt this regex to match URLs inside larger strings. You need to replace ^ and $ with \\b, a word boundary.

Danny Lo Over a year ago

It is IDEONE that replaces real URLs with placeholders. I'll give the word boundary a try.

Danny Lo Over a year ago

I have the next requirement for you :)

Wiktor Stribiżew Over a year ago

Do not use split, it just does not work in this case.

Danny Lo Over a year ago

Ahh. But I could split a string using "\\s" and then evaluate the resulting strings with the monster regex, right?
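The idea in this last comment could look roughly like the sketch below: split the line on whitespace, then validate each token as a whole. The names TokenValidator and urlsIn and the short pattern are assumptions for illustration; the full validation regex would replace the stand-in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class TokenValidator {
    // Stand-in for the full validation regex (assumption made for brevity).
    static final Pattern URL = Pattern.compile("(?i)(?:https?|ftp)://\\S+");
    static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static List<String> urlsIn(String line) {
        List<String> urls = new ArrayList<>();
        for (String token : WHITESPACE.split(line)) {
            // matches() requires the whole token to match, so ^ and $ are implied.
            if (URL.matcher(token).matches()) {
                urls.add(token);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(urlsIn("visit http://foo.com/blah_blah or ftp://bar.org/file now"));
    }
}
```

One caveat with this approach: a token such as <http://foo.com> is rejected as a whole, so URLs glued to brackets or control characters are missed unless the tokens are cleaned first.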
