I am trying to write a regular expression to match a specific URL format, specifically the API URLs for Stack Exchange. For example, I want both of these to match:

http://api.stackoverflow.com/1.1/questions/1234/answers  
http://api.physics.stackexchange.com/1.0/questions/5678/answers

Where

  • Everything not in bold must be identical.
  • The first bold part can contain only the letters a to z and at most one full stop.
    • It would also be good if, when there is a full stop, the word "stackexchange" had to follow it. This isn't crucial, however.
  • The second bold part can only be a 1 or a 0.
  • The last bold part can contain only the digits 0 to 9 and can be any length.
  • There can't be anything at all before or after the URL, not even a trailing slash.

3 Answers

5
Pattern.compile("^(?i:http://api\\.(?:[a-z]+(?:\\.stackexchange)?)\\.com)/1\\.[01]/questions/[0-9]+/answers\\z")

The ^ makes sure the match starts at the start of input, and the \\z makes sure it ends at the end of input. All the dots are escaped so they are literal. The (?i:...) part makes the scheme and domain case-insensitive, as per the URL spec. The [01] matches only the character 0 or 1. The [0-9]+ matches one or more ASCII digits. The rest is self-explanatory.
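
To see how the pattern behaves on the cases from the question, here is a minimal sketch. The class name ApiUrlCheck, the method isApiUrl, and the sample strings are made up for illustration; only the regex itself comes from this answer.

```java
import java.util.regex.Pattern;

class ApiUrlCheck {
    // Same pattern as in the answer; the names around it are illustrative only.
    static final Pattern API_URL = Pattern.compile(
        "^(?i:http://api\\.(?:[a-z]+(?:\\.stackexchange)?)\\.com)/1\\.[01]/questions/[0-9]+/answers\\z");

    static boolean isApiUrl(String s) {
        // find() is enough here because ^ and \z already pin both ends of the input.
        return API_URL.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://api.stackoverflow.com/1.1/questions/1234/answers",          // expected: true
            "http://api.physics.stackexchange.com/1.0/questions/5678/answers",  // expected: true
            "http://api.stackoverflow.com/1.1/questions/1234/answers/",         // trailing slash: false
            "http://api.stackoverflow.com/1.2/questions/1234/answers",          // version 1.2: false
            "http://api.stackoverflow.com/1.1/questions/1234/answers\n"         // trailing newline: false
        };
        for (String s : samples) {
            System.out.println(isApiUrl(s) + "  " + s.replace("\n", "\\n"));
        }
    }
}
```

Note that because of (?i:...), an input such as HTTP://API.StackOverflow.COM/1.1/questions/1234/answers is also accepted. If the site name must be strictly lower case, that group would need to be dropped.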

1
^http://api[.][a-z]+([.]stackexchange)?[.]com/1[.][01]/questions/[0-9]+/answers$

^ matches the start of the string and $ matches the end of the line. [.] is an alternative to escaping the dot with a backslash (which would itself need to be escaped as \\. in a Java string literal).
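
As a quick check that the two ways of writing a literal dot behave the same, here is a small sketch. The class name DotEscapeDemo, the helper fullMatch, and the test strings are hypothetical.

```java
import java.util.regex.Pattern;

class DotEscapeDemo {
    // Hypothetical helper: true only if the regex matches the entire input.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String literalDot = "api.stackoverflow";
        String otherChar  = "apiXstackoverflow";

        // A character class and a backslash escape both mean "a literal dot".
        System.out.println(fullMatch("api[.]stackoverflow", literalDot));  // true
        System.out.println(fullMatch("api\\.stackoverflow", literalDot));  // true
        System.out.println(fullMatch("api[.]stackoverflow", otherChar));   // false
        System.out.println(fullMatch("api\\.stackoverflow", otherChar));   // false

        // An unescaped dot matches any character, so this is too permissive.
        System.out.println(fullMatch("api.stackoverflow", otherChar));     // true
    }
}
```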

4 Comments

$ in Java regex does not guarantee a match at the very end of the input: by default it also matches just before a final line terminator. From download.oracle.com/javase/6/docs/api/java/util/regex/… . For example, Pattern.compile("foo$") will find a match in "foo\n".
Shouldn't make a difference in the OP's case, multiline URLs are a freaky thing to see.
You're right. Line separators are not allowed unescaped in URLs, but the OP does not make it clear whether the string has a priori been validated as a URL.
The reason it needs to be strict is that it is also being used as a tag to identify another object. Some objects will have the same tag, and the tags need to be identical or else they won't group correctly. In other words, if I need to get all objects with a specific URL and some objects have a line break on the end or a trailing slash for some reason, they won't be included.
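
The difference discussed in these comments can be checked directly. A minimal sketch, assuming the pattern from the answer above; the class name EndAnchorDemo and the constant names are made up for illustration:

```java
import java.util.regex.Pattern;

class EndAnchorDemo {
    // Body of the pattern from the answer, without the anchors.
    static final String BODY =
        "http://api[.][a-z]+([.]stackexchange)?[.]com/1[.][01]/questions/[0-9]+/answers";

    static final Pattern DOLLAR = Pattern.compile("^" + BODY + "$");
    static final Pattern BACKSLASH_Z = Pattern.compile("^" + BODY + "\\z");

    public static void main(String[] args) {
        String clean = "http://api.stackoverflow.com/1.1/questions/1234/answers";
        String withNewline = clean + "\n";

        System.out.println(DOLLAR.matcher(clean).find());             // true
        // $ also matches just before a final line terminator.
        System.out.println(DOLLAR.matcher(withNewline).find());       // true
        // \z matches only at the very end of the input.
        System.out.println(BACKSLASH_Z.matcher(withNewline).find());  // false
        // matches() must consume the whole input, so the newline is rejected even with $.
        System.out.println(DOLLAR.matcher(withNewline).matches());    // false
    }
}
```

So for the strict grouping described above, either \z with find() or any of these patterns with matches() should reject the trailing newline.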
0

This tested Java program has a commented regex which should do the trick:

import java.util.regex.*;
public class TEST {
    public static void main(String[] args) {
        String s = "http://api.stackoverflow.com/1.1/questions/1234/answers";

        Pattern p = Pattern.compile(
            "http://api\\.              # Scheme and api subdomain.\n" +
            "(?:                        # Group for domain alternatives.\n" +
            "  stackoverflow            # Either one\n" +
            "| physics\\.stackexchange  # or the other\n" +
            ")                          # End group for domain alternatives.\n" +
            "\\.com                     # TLD\n" +
            "/1\\.[01]                  # Either 1.0 or 1.1\n" +
            "/questions/\\d+/answers    # Rest of path.", 
            Pattern.COMMENTS);
        Matcher m = p.matcher(s);
        if (m.matches()) {
            System.out.print("Match found.\n");
        } else {
            System.out.print("No match found.\n");
        }
    }
}
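
The program above hard-codes the two sites from the question. If any site name made of the letters a to z should be accepted, the same commented layout could carry the more general form. This is only a sketch under that assumption; the class name GeneralApiUrl and the method isApiUrl are hypothetical.

```java
import java.util.regex.Pattern;

class GeneralApiUrl {
    // With Pattern.COMMENTS, whitespace in the pattern is ignored and # starts a comment.
    static final Pattern API_URL = Pattern.compile(
        "http://api\\.               # Scheme and api subdomain.\n" +
        "[a-z]+                      # Site name, letters only.\n" +
        "(?:\\.stackexchange)?       # Optional .stackexchange part.\n" +
        "\\.com                      # TLD.\n" +
        "/1\\.[01]                   # Either 1.0 or 1.1.\n" +
        "/questions/[0-9]+/answers   # Rest of path.",
        Pattern.COMMENTS);

    static boolean isApiUrl(String s) {
        // matches() requires the whole input to match, so no explicit anchors are needed.
        return API_URL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isApiUrl("http://api.stackoverflow.com/1.1/questions/1234/answers"));
        System.out.println(isApiUrl("http://api.physics.stackexchange.com/1.0/questions/5678/answers"));
        System.out.println(isApiUrl("http://api.stackoverflow.com/1.1/questions/1234/answers/"));
    }
}
```

The first two calls should print true and the third false, since the trailing slash is not part of the pattern.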
