12

I'm trying to create a regular expression to extract URLs from text documents using Java, but thus far I've been unsuccessful. The two cases I'm looking to capture are listed below:

  • URLs that start with http://
  • URLs that start with www. (missing the protocol at the front)

In both cases I also want to capture the query string parameters.

Thanks! I really wish I knew regular expressions better.

Cheers,

2
  • If the text documents are written by humans, you might find things like example.com, with punctuation immediately after the URL. Do you want an accepted answer to handle this, or is this not relevant? Commented Nov 26, 2009 at 22:54
  • You haven't accepted any answer to this question. Are none of the solutions suitable for you? What's the problem? Commented Nov 27, 2009 at 21:54

4 Answers

27

If you want to make sure you are really matching a URL and not just some word starting with 'www.', you can use the expression mentioned by DVK before. I modified it slightly and wrote a small code snippet as a starting point for you:

import java.util.*;
import java.util.regex.*;

class FindUrls
{
    public static List<String> extractUrls(String input) {
        List<String> result = new ArrayList<String>();

        // Compiled case-insensitively so that upper-case schemes, hosts and
        // percent-escapes (e.g. %2F) match as well. The dot after "www" is
        // escaped, and every percent-escape is written as %[a-f\d]{2}.
        Pattern pattern = Pattern.compile(
            "\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www\\.)" + 
            "(\\w+:\\w+@)?(([-\\w]+\\.)+(com|org|net|gov" + 
            "|mil|biz|info|mobi|name|aero|jobs|museum" + 
            "|travel|[a-z]{2}))(:[\\d]{1,5})?" + 
            "(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" + 
            "((\\?([-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" + 
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" + 
            "(&(?:[-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" + 
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" + 
            "(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b",
            Pattern.CASE_INSENSITIVE);

        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }

        return result;
    }
}

1 Comment

If you don't mind it picking up email addresses, you can replace the authority portion (\\w+:\\w+@)? with (\\w+(:\\w+)?@)?. If you don't want it to pick up email addresses, you'd need to add some other checks.
5

All regex-based code is over-engineered, especially the code from the most-voted answer, and here is why: it will find only valid URLs! For example, it will ignore anything that starts with "http://" but contains non-ASCII characters.

What's more, I have seen processing times of 1-2 seconds (single-threaded, dedicated) with the Java regex package for very small and simple sentences, nothing special; possibly a bug in the Java 6 regex implementation...

The simplest and fastest solution would be to use StringTokenizer to split the text into tokens, remove the tokens starting with "http://" etc., and concatenate the remaining tokens into text again.
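A minimal sketch of the tokenizer idea, adapted to the question: it collects the URL-like tokens instead of removing them. The class and method names (TokenUrlFinder, findUrlTokens) are hypothetical, and treating "https://" and "www." as URL prefixes alongside "http://" is an assumption about what "etc." covers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

public class TokenUrlFinder {
    // Returns the whitespace-delimited tokens that start with a URL-like prefix.
    // The prefix list is an assumption; extend it as needed.
    public static List<String> findUrlTokens(String text) {
        List<String> urls = new ArrayList<>();
        // With no delimiter argument, StringTokenizer splits on whitespace.
        StringTokenizer tokens = new StringTokenizer(text);
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            String lower = token.toLowerCase(Locale.ROOT);
            if (lower.startsWith("http://")
                    || lower.startsWith("https://")
                    || lower.startsWith("www.")) {
                urls.add(token);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(findUrlTokens(
            "See http://example.com/a?b=c and www.example.org today"));
    }
}
```

Because it only looks at token prefixes, this keeps non-ASCII URLs, but it also keeps any punctuation glued to the end of a token and does no validation at all.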

If you really want to use RegEx with Java, try Automaton

1 Comment

Indeed, it is. Sometimes you only need basic parsing, and although the OP wanted a regex, this was the answer that saved me. Thank you.
3

This link has very good URL regexes (they are surprisingly hard to get right, by the way: think http/https, port numbers, valid characters, GET strings, pound signs for anchor links, etc.):

http://flanders.co.nz/2009/11/08/a-good-url-regular-expression-repost/

Perl has CPAN libraries that contain canned regexes, including ones for URLs. Not sure about Java though :(


1

This tests whether a given line is a URL:

Pattern p = Pattern.compile("http://.*|www\\..*");
Matcher m = p.matcher("http://..."); // put the line you want to check here
if (m.matches()) { // matches() succeeds only if the entire line matches
    // do something
}
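Since matches() only answers whether the whole line is a URL, pulling URLs out of running text needs find() instead. The following is a rough sketch under a few assumptions: the class name SimpleUrlScanner and the loose pattern are illustrative, "https://" is accepted alongside the two prefixes from the question, and trailing punctuation is stripped because human-written text often has it directly after a URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleUrlScanner {
    // Loose pattern (an assumption, not a validator): a word boundary, then
    // "http://", "https://" or "www.", then everything up to the next whitespace.
    private static final Pattern URL = Pattern.compile(
        "\\b(?:https?://|www\\.)\\S+", Pattern.CASE_INSENSITIVE);

    public static List<String> scan(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            // Drop punctuation that trails the URL, such as a sentence-ending period.
            urls.add(m.group().replaceAll("[.,;:!?)]+$", ""));
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(scan(
            "Visit www.example.com, or http://example.org/path?q=1."));
    }
}
```

The query string is kept because \S+ runs to the next whitespace. The price is that a URL which legitimately ends in one of the stripped characters gets truncated.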
