I have a simple regular expression that matches URLs, and it works fine. However, I'd like to refine it so that it excludes any URL containing a certain word.

My pattern: (http:[A-z0-9./~%]+)

For example, it matches:

http://maps.google.com/maps
http://www.google.com/flights/gwsredirect
http://slav0nic.org.ua/static/books/python/
http://webcache.googleusercontent.com/search
http://www.python.org/ftp/python/doc/
http://webcache.googleusercontent.com/search
http://www.python.org/ftp/python/

Given the list of URLs above matched by my pattern, I'd like to refine the pattern to exclude URLs containing a given word, for example google.

I tried using non-capturing groups but was unsuccessful; maybe I'm missing something.

ADDITIONAL INFORMATION

Maybe my description wasn't clear.

Okay, I have a file of data grabbed from a URL, and I use the pattern I've provided to extract the list of links given above. As you can see, the pattern returns all links, which is more than I want. So I want to refine it so that it doesn't give me links containing a certain word, e.g. google.

Thus, after I parse the data, instead of returning the list of links above it would return the following:

http://slav0nic.org.ua/static/books/python/
http://www.python.org/ftp/python/doc/
http://www.python.org/ftp/python/

All help is appreciated, thank you!

  • Why do you have URLs starting with http://http://? Commented Jan 6, 2012 at 9:03
  • You can use the String contains method in Java after verifying with the regex. Commented Jan 6, 2012 at 9:06
  • See this question - stackoverflow.com/questions/406230/… Commented Jan 6, 2012 at 9:10

3 Answers

Try this:

(http:(?![^"\s]*google)[^"\s]+)["\s]

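The key difference from the other solutions posted is that I limit how far the lookahead can scan: [^"\s]* stops at the next quote or whitespace, so google is only rejected when it occurs inside the current URL, which matters when the pattern is used for searching a larger text.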
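To see the bounded lookahead at work when searching, here is a minimal sketch that runs the pattern above over a piece of HTML-like text. The class name BoundedLookaheadDemo, the extract helper and the sample input are made up for illustration; only the pattern comes from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex is taken from the answer.
public final class BoundedLookaheadDemo
{
    // The lookahead [^"\s]*google can only scan up to the next quote or
    // whitespace, so it inspects the current URL and nothing after it.
    private static final Pattern URL
        = Pattern.compile("(http:(?![^\"\\s]*google)[^\"\\s]+)[\"\\s]");

    // Collects group 1 of every match found while searching the text.
    public static List<String> extract(final String text)
    {
        final List<String> found = new ArrayList<>();
        final Matcher m = URL.matcher(text);
        while (m.find())
            found.add(m.group(1));
        return found;
    }

    public static void main(final String... args)
    {
        // Assumed sample input: three links on a single line.
        final String html
            = "<a href=\"http://maps.google.com/maps\">a</a> "
            + "<a href=\"http://www.python.org/ftp/python/\">b</a> "
            + "<a href=\"http://webcache.googleusercontent.com/search\">c</a>";

        // prints [http://www.python.org/ftp/python/]
        System.out.println(extract(html));
    }
}
```

Note that the pattern requires a quote or whitespace character after the URL, so a URL sitting at the very end of the input with nothing after it would not be matched by this sketch.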

Try this:

(http:(?!.*google).*)

Source: similar questions

EDIT: (this works, tested it)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class LookaheadTest {

  public static void main( String[] args ) {

    // Reject the input if "google" appears anywhere after "http:".
    final Pattern p = Pattern.compile( "(http:(?!.*google).*)" );
    final String[] in = new String[]{
        "http://maps.google.com/maps",
        "http://www.google.com/flights/gwsredirect",
        "http://slav0nic.org.ua/static/books/python/",
        "http://webcache.googleusercontent.com/search",
        "http://www.python.org/ftp/python/doc/",
        "http://webcache.googleusercontent.com/search",
        "http://www.python.org/ftp/python/",
    };

    // Each input string holds exactly one URL.
    for ( final String s : in ) {
      final Matcher m = p.matcher( s );
      System.out.print( s );
      if ( m.find() ) {
        System.out.println( " true" );
      } else {
        System.out.println( " false" );
      }
    }
  }
}

OUTPUT:

http://maps.google.com/maps false
http://www.google.com/flights/gwsredirect false
http://slav0nic.org.ua/static/books/python/ true
http://webcache.googleusercontent.com/search false
http://www.python.org/ftp/python/doc/ true
http://webcache.googleusercontent.com/search false
http://www.python.org/ftp/python/ true

2 Comments

Tried this before I asked the question.
Pay attention to the difference between matching and searching!
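One way to read that comment, as a sketch rather than a definitive explanation: the test in this answer feeds the pattern one URL per string, but when searching a longer text the unbounded .* in the lookahead sees the rest of the line. The class name SearchVsMatchDemo, the findAll helper and the sample line are assumptions for illustration, and BOUNDED is a variant of the other answer's pattern without its trailing ["\s].

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the two lookahead styles.
public final class SearchVsMatchDemo
{
    // Pattern from this answer: the lookahead scans to the end of the line.
    public static final Pattern UNBOUNDED
        = Pattern.compile("(http:(?!.*google).*)");

    // Assumed variant: the lookahead stops at the next quote or whitespace.
    public static final Pattern BOUNDED
        = Pattern.compile("(http:(?![^\"\\s]*google)[^\"\\s]+)");

    // Searches the text and collects group 1 of every match.
    public static List<String> findAll(final Pattern p, final String text)
    {
        final List<String> found = new ArrayList<>();
        final Matcher m = p.matcher(text);
        while (m.find())
            found.add(m.group(1));
        return found;
    }

    public static void main(final String... args)
    {
        // Two URLs on one line; only the second contains "google".
        final String line
            = "http://www.python.org/ftp/python/ http://maps.google.com/maps";

        // prints [] because "google" later on the line also rejects the first URL
        System.out.println(findAll(UNBOUNDED, line));

        // prints [http://www.python.org/ftp/python/]
        System.out.println(findAll(BOUNDED, line));
    }
}
```

When the unbounded pattern does match in a longer text, its trailing .* also captures everything up to the end of the line, not just the URL.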

Modify your regex to capture the hostname and use .contains():

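import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class TestMatch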
{
    private static final List<String> urls = Arrays.asList(
        "http://maps.google.com/maps",
        "http://www.google.com/flights/gwsredirect",
        "http://slav0nic.org.ua/static/books/python/",
        "http://webcache.googleusercontent.com/search",
        "http://www.python.org/ftp/python/doc/",
        "http://webcache.googleusercontent.com/search",
        "http://www.python.org/ftp/python/"
    );

    private static final Pattern p
        = Pattern.compile("^http://([^/]+)/");

    private static final int TRIES = 50000;

    public static void main(final String... args)
    {
        for (final String url: urls)
            System.out.printf("%s: %b\n", url, regexIsOK(url));

        long start, end;

        start = System.currentTimeMillis();
        for (int i = 0; i < TRIES; i++)
            for (final String url: urls)
                regexIsOK(url);
        end = System.currentTimeMillis();

        System.out.println("Time taken: " + (end - start) + " ms");
        System.exit(0);
    }

    private static boolean regexIsOK(final String url)
    {
        final Matcher m = p.matcher(url);

        return m.find() && !m.group(1).contains("google");
    }
}

Sample output:

http://maps.google.com/maps: false
http://www.google.com/flights/gwsredirect: false
http://slav0nic.org.ua/static/books/python/: true
http://webcache.googleusercontent.com/search: false
http://www.python.org/ftp/python/doc/: true
http://webcache.googleusercontent.com/search: false
http://www.python.org/ftp/python/: true
Time taken: 258 ms

4 Comments

I'm sorry, this isn't what I'm looking for; if I do it this way I will be doing more work than I need to. The list of URLs isn't known in advance. I use a regex to get them, but my pattern returns more than I want, so I want to refine that exact pattern so it doesn't return URLs containing a certain word.
As you wish, but then why not just use .contains() after you have matched your regex (which does allow invalid URLs, BTW; URI will allow you to detect that)?
But that will cause a huge overhead, going through each URL one by one. What do you think will happen if, let's say, I have about 10,000 or more URLs? It must be a one-shot thing: the regex must return exactly what I want from the match.
OK, look at the solution: cheap, isn't it?
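
For reference, the "match first, then .contains()" idea from these comments could look roughly like the sketch below. It keeps the asker's original pattern unchanged and filters each hit inside the same find() loop, so the text is scanned only once. The class name ExtractThenFilterDemo, the extractWithout helper and the sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: extract with the original pattern, filter with contains().
public final class ExtractThenFilterDemo
{
    // The asker's original pattern, unchanged.
    private static final Pattern URL = Pattern.compile("(http:[A-z0-9./~%]+)");

    // Returns every matched URL that does not contain the banned word.
    public static List<String> extractWithout(final String text, final String banned)
    {
        final List<String> kept = new ArrayList<>();
        final Matcher m = URL.matcher(text);
        while (m.find()) {
            final String url = m.group(1);
            // The filter runs while extracting, so no second pass is needed.
            if (!url.contains(banned))
                kept.add(url);
        }
        return kept;
    }

    public static void main(final String... args)
    {
        // Assumed sample input: one URL per line.
        final String text = String.join("\n",
            "http://maps.google.com/maps",
            "http://slav0nic.org.ua/static/books/python/",
            "http://webcache.googleusercontent.com/search",
            "http://www.python.org/ftp/python/");

        // prints [http://slav0nic.org.ua/static/books/python/, http://www.python.org/ftp/python/]
        System.out.println(extractWithout(text, "google"));
    }
}
```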
