Extracting URLs from a text document using Java + Regular Expressions

Question

I'm trying to create a regular expression to extract URLs from text documents using Java, but thus far I've been unsuccessful. The two cases I'm looking to capture are listed below:

URLs that start with http://
URLs that start with www. (missing the protocol at the front)

along with the query string parameters.

Thanks! I wish I really knew Regular expressions better.

Cheers,

If the text documents are written by humans, you might find things like example.com, with punctuation immediately after the URL. Do you want an accepted answer to handle this, or is this not relevant?
– Mark Byers, Commented Nov 26, 2009 at 22:54
You haven't accepted any answer to this question. Are none of the solutions suitable for you? What's the problem?
– Mark Byers, Commented Nov 27, 2009 at 21:54

Philip Daubmeier · Accepted Answer · 2009-11-26 23:48:49Z

27

If you want to make sure you are really matching a URL and not just some word starting with 'www.', you can use the expression mentioned by DVK before. I modified it slightly and wrote a small code snippet as a starting point for you:

import java.util.*;
import java.util.regex.*;

class FindUrls
{
    public static List<String> extractUrls(String input) {
        List<String> result = new ArrayList<String>();

        // Groups, in order: scheme or "www." prefix, optional user:password@,
        // host ending in a listed TLD or a two-letter country code, optional port,
        // path, query string (key=value pairs joined by &), and fragment.
        Pattern pattern = Pattern.compile(
            "\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www\\.)" +
            "(\\w+:\\w+@)?(([-\\w]+\\.)+(com|org|net|gov" +
            "|mil|biz|info|mobi|name|aero|jobs|museum" +
            "|travel|[a-z]{2}))(:[\\d]{1,5})?" +
            "(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" +
            "((\\?([-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" +
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" +
            "(&(?:[-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" +
            "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" +
            "(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b");

        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group());
        }

        return result;
    }
}

answered Nov 26, 2009 at 23:48

Philip Daubmeier


1 Comment

GreenKiwi Over a year ago

If you don't mind it picking up email addresses, you can replace the authority portion (\\w+:\\w+@)? with (\\w+(:\\w+)?@)?. If you don't want it to pick up email addresses, you'd need to add some other checks.

Fuad Efendi · Accepted Answer · 2013-01-17 17:47:22Z

5

All RegEx-based code is over-engineered, especially the code from the most-voted answer, and here is why: it will find only valid URLs! For example, it will ignore anything that starts with "http://" and has non-ASCII characters inside.

Even more: I have encountered processing times of 1-2 seconds (single-threaded, dedicated) with the Java RegEx package for very small and simple sentences, nothing specific; possibly a bug in Java 6 RegEx...

The simplest and fastest solution would be to use StringTokenizer to split the text into tokens, remove the tokens starting with "http://" etc., and concatenate the remaining tokens into text again.

If you really want to use RegEx with Java, try Automaton.
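The tokenizer idea can be sketched roughly as follows. Since the question asks for extraction, this version collects the matching tokens instead of removing them. The class name TokenUrlFinder, the method names, and the set of trailing punctuation characters are assumptions made for illustration and are not from the answer; treat it as a starting point, not a complete URL detector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

class TokenUrlFinder {
    // Punctuation that often follows a URL in human-written prose.
    // This set is an assumption; adjust it for your documents.
    private static final String TRAILING = ".,;:!?)]}>\"'";

    // Returns every whitespace-separated token that starts with "http://" or "www.".
    static List<String> findUrlTokens(String text) {
        List<String> urls = new ArrayList<String>();
        StringTokenizer tokens = new StringTokenizer(text); // default delimiters: whitespace
        while (tokens.hasMoreTokens()) {
            String token = stripTrailingPunctuation(tokens.nextToken());
            String lower = token.toLowerCase(Locale.ROOT);
            if (lower.startsWith("http://") || lower.startsWith("www.")) {
                urls.add(token);
            }
        }
        return urls;
    }

    // Drops sentence punctuation stuck to the end of a token, e.g. "www.example.org." 
    static String stripTrailingPunctuation(String token) {
        int end = token.length();
        while (end > 0 && TRAILING.indexOf(token.charAt(end - 1)) >= 0) {
            end--;
        }
        return token.substring(0, end);
    }
}
```

Calling TokenUrlFinder.findUrlTokens(text) returns the candidate URLs in document order. Tokens with leading punctuation, such as a URL wrapped in parentheses, are not recognized by this sketch, and nothing here checks that a token is a valid URL.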

answered Jan 17, 2013 at 17:47

Fuad Efendi


1 Comment

Henrique de Sousa Over a year ago

Indeed, it is. Sometimes you only need basic parsing, and although the OP wanted a regex, this was the answer that saved me. Thank you.

DVK · Accepted Answer · 2009-11-26 23:00:59Z

3

This link has very good URL RegExes (they are surprisingly hard to get right, by the way: think http/https, port numbers, valid characters, GET strings, pound signs for anchor links, etc.)

http://flanders.co.nz/2009/11/08/a-good-url-regular-expression-repost/

Perl has CPAN libraries that contain canned RegExes, including for URLs. Not sure about Java though :(

edited Nov 26, 2009 at 23:00

answered Nov 26, 2009 at 22:55

DVK


jutky · Accepted Answer · 2009-11-26 23:00:31Z

1

This tests whether a given line is a URL:

Pattern p = Pattern.compile("http://.*|www\\..*");
Matcher m = p.matcher("http://..."); // put the line you want to check here
if (m.matches()) {
    // do something with the line
}
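Note that matches() only succeeds when the whole input is a URL. To pull URLs out of a longer line, one option is to loop with find() and stop each match at the next whitespace character. The sketch below assumes that approach; the class name SimpleUrlScanner, the method name scan, and the pattern are illustrative and not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleUrlScanner {
    // "http://" or "www." at a word boundary, followed by everything up to the next whitespace.
    private static final Pattern URL_LIKE = Pattern.compile("\\b(?:http://|www\\.)\\S+");

    // Collects every URL-like substring of the line, in order of appearance.
    static List<String> scan(String line) {
        List<String> found = new ArrayList<String>();
        Matcher m = URL_LIKE.matcher(line);
        while (m.find()) {      // find() scans for the next match anywhere in the input
            found.add(m.group());
        }
        return found;
    }
}
```

This deliberately loose pattern keeps any punctuation attached to the end of a URL and does no validation, so it trades precision for simplicity.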

answered Nov 26, 2009 at 23:00

jutky
