1

I know this question may already have been answered many times, but I am trying to build a regex that meets the following criteria:

Validate that the entered URL consists of an optional http:// or https://, followed by an optional www., followed by a valid domain (containing only a-z, A-Z or -) OR an IP address, followed by an optional port number, followed by an optional path, with no query parameters.

The URL must not contain special characters that could be used for XSS injection, and it must not contain query string parameters.

I am using the following regex pattern in Java:

"^(http:\\/\\/|https:\\/\\/)?(www.)?([a-zA-Z0-9]+).[a-zA-Z0-9]*.[a-z]{3}.?([a-z]+)?$"
0

5 Answers

3

Use the following regex:

^((?:http:\/\/)|(?:https:\/\/))(www.)?((?:[a-zA-Z0-9]+\.[a-z]{3})|(?:\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(?::\d+)?))([\/a-zA-Z0-9\.]*)$
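
For anyone who wants to try this pattern from Java, here is a minimal sketch. The class name FirstAnswerDemo and the helper isValid are made up for illustration; only the pattern itself comes from the answer, with each backslash doubled for the Java string literal. Note that, as written, the pattern makes the scheme mandatory and does not allow hyphens in the domain, so two of the sample inputs below are rejected.

```java
import java.util.regex.Pattern;

class FirstAnswerDemo {
    // Pattern from the answer above; backslashes are doubled for Java source.
    static final Pattern URL = Pattern.compile(
        "^((?:http:\\/\\/)|(?:https:\\/\\/))(www.)?"
        + "((?:[a-zA-Z0-9]+\\.[a-z]{3})|(?:\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}(?::\\d+)?))"
        + "([\\/a-zA-Z0-9\\.]*)$");

    // Hypothetical helper: true when the whole input matches the pattern.
    static boolean isValid(String url) {
        return URL.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.example.com/mypage.aspx",               // accepted
            "https://127.0.0.1:8080/demoapp/test/mypage.html",  // accepted
            "www.google.com",                                   // rejected: no scheme
            "http://make-it.london/",                           // rejected: hyphen in domain
            "http://example.com/somedir/somefile/?foo1=bar1"    // rejected: query string
        };
        for (String s : samples) {
            System.out.println(isValid(s) + "  " + s);
        }
    }
}
```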

2

Let's break it down:

  • Must start with optional http:// or https://. So we need to write a choice: it is either http or https, followed by ://. "Either" is written with | inside a group, so "either http or https" becomes (http|https). This must be followed by ://. None of these characters is special, so we don't need to escape them, which gives (http|https)://. Finally, all of this is optional, meaning it can occur 0 or 1 times. This is written using ?. We get: ((http|https)://)?.
  • Followed by a valid domain (containing only a-z, A-Z or -) or an ip-address followed by optional port-number
    • Case valid domain: a domain is valid if it contains at least one of a-z, A-Z or -. This is written using ([a-zA-Z-])+. + means "at least one" and [a-zA-Z-] is the character class that matches those characters.
    • Case IP address and port: an IP address is of the form XXX.XXX.XXX.XXX, where each XXX group is one to three digits. This is written as \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} (we could write it better, but I am keeping it simple). {1,3} means "one to three times" and \d means any single digit (it is the same as the character class [0-9]). \. is used to escape the special character "." so that it matches a literal dot. Then, the port is a sequence of digits preceded by a colon, which is written as :\d+. Since it is optional, we wrap it in a group followed by ? to arrive at (:\d+)?. So finally, we have: \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(:\d+)?.
    • Final combination: it is either a valid domain or a valid IP address. So we get (([a-zA-Z-])+|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(:\d+)?)), i.e. (validDomainExpression|IPAddressExpression).
  • Followed by an optional path and no query parameters. This means that we can accept any character except ?. This is written as [^?]. Inside a character class, ? is no longer a special character and simply means the ? character; ^ means "not", i.e. "everything but". So an optional sequence of "everything but ?" is written as ([^?]*)?.

Final regex:

^((http|https)://)?(www.)?(([a-zA-Z-])+|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(:\d+)?))([^?]*)?$
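
The breakdown above can be sketched in Java by assembling the pattern from its named parts. The class name BreakdownDemo, the constant names and the helper isValid are illustrative only; the concatenated string is the same as the final regex. One thing the sketch makes visible: because the path part accepts every character except ?, an input such as javascript(1); is accepted, so this pattern on its own probably does not cover the XSS requirement from the question.

```java
import java.util.regex.Pattern;

class BreakdownDemo {
    // Each piece mirrors one step of the explanation above.
    static final String SCHEME = "((http|https)://)?";
    static final String WWW    = "(www.)?";
    static final String DOMAIN = "([a-zA-Z-])+";
    static final String IP     = "\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}(:\\d+)?";
    static final String HOST   = "(" + DOMAIN + "|(" + IP + "))";
    static final String PATH   = "([^?]*)?";

    static final Pattern URL = Pattern.compile("^" + SCHEME + WWW + HOST + PATH + "$");

    // Hypothetical helper: true when the whole input matches the pattern.
    static boolean isValid(String url) {
        return URL.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.example.com/mypage.aspx",                        // accepted
            "https://127.0.0.1:8080/demoapp/test/mypage.html",           // accepted
            "www.google.com",                                            // accepted
            "http://example.com/somedir/somefile/?foo1=bar1&foo2=bar2",  // rejected: contains ?
            "javascript(1);"                                             // accepted: path allows anything but ?
        };
        for (String s : samples) {
            System.out.println(isValid(s) + "  " + s);
        }
    }
}
```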

0

You could use one of these online Java regex testers to verify your regexes.

I tried your regex with the first one and it worked. With the first link you can also test more than one input at a time.

Tip: when testing the regex on these websites, there is no need to escape symbols the way you do in Java code.
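
To illustrate the tip, here is a small sketch of the difference between what you type into a tester and what you write in Java source. The class name EscapingDemo and the shortened two-group pattern are made up for the example.

```java
import java.util.regex.Pattern;

class EscapingDemo {
    // In an online tester you would type:   ^\d{1,3}\.\d{1,3}$
    // In Java source every backslash is doubled, because the string
    // literal consumes one level of escaping before the regex engine runs.
    static final String JAVA_LITERAL = "^\\d{1,3}\\.\\d{1,3}$";

    public static void main(String[] args) {
        // Printing the string shows what the regex engine actually receives.
        System.out.println(JAVA_LITERAL);                            // ^\d{1,3}\.\d{1,3}$
        System.out.println(Pattern.matches(JAVA_LITERAL, "192.168")); // true
        System.out.println(Pattern.matches(JAVA_LITERAL, "192x168")); // false
    }
}
```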

0

Thanks for your valuable replies.

I was finally able to get my regex working properly.

Regex used:

    ^((http:\\/\\/)|(https:\\/\\/))?(www.)?(([a-zA-Z0-9-]+)([\\.a-zA-Z0-9-]*)|(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}))?(:\\d+)?([\\/a-zA-Z0-9\\.-]*)$

Test Cases:

    1. "http://net.tutsplus.com/about", result: true
    2. "http://net.tutsplus.com/", result: true
    3. "www.google.com", result: true
    4. "https://regex101.com/", result: true
    5. "http://makeit.london", result: true
    6. "http://makeit.london/", result: true
    7. "http://make-it.london/", result: true
    8. "http://localhost:8080/demoapp/test/", result: true
    9. "test.sub-domain3.sub-domain2.sub-domain1.domain-tld-0:8080/demoapp/test/mypage.html", result: true
    10. "https://test.london:80/test-domain/test-path/test.html", result: true
    11. "https://127.0.0.1:8080/demoapp/test-path/mypage.html", result: true
    12. "https://127.0.0.1:8080/demoapp/test/mypage.html", result: true
    13. "http://www.example.com/mypage.aspx", result: true
    14. "http://www.192.168.2.3:231/mybranch/mypage.aspx", result: true
    15. "http://example.com/somedir/somefile/", result: true
    16. "https://192.213.23.12:231/branch/mypage/", result: true
    17. "https://192.213.23.12:231/branch/mypage", result: true
    18. "https://192.213.23.12:/branch/mypage/", result: false
    19. "javascript('XSS');", result: false
    20. "javascript(\"XSS\");", result: false
    21. "javascript(1);", result: false
    22. "http://a.b/\"onerror=\"javascript:alert(1);", result: false
    23. ",", result: false
    24. "https://Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla ac tincidunt metus. Praesent ante ligula, maximus eu quam eu, blandit auctor mi. Morbi viverra a lectus luctus tincidunt. Morbi tristique condimentum eleifend. Maecenas mattis auctor ligula, quis placerat leo luctus et. Etiam euismod massa sit amet nisl porta aliquet. Nunc commodo aliquam orci vitae feugiat. Mauris efficitur sem eget ante vestibulum vestibulum et sed ex. Quisque enim ipsum, dapibus ut interdum in, consequat at tortor. Aenean cursus tellus arcu, id placerat nisl finibus quis. Sed ullamcorper imperdiet sapien et cursus. In posuere nisl mauris.", result: false
    25. "<script></script>", result: false
    26. "https://192.213.23.12:231/branch/mypage/?foo=bar", result: false
    27. "https://192.213.23.12:231/branch/mypage?foo=bar", result: false
    28. "http://example.com/somedir/somefile/?foo1=bar1&foo2=bar2", result: false
    29. "https://127.0.0.1:8080/demoapp/test/mypage.html?foo=bar", result: false
    30. "http://make-it.london:8080/demoapp/test/mypage.html?foo=bar", result: false
    31. "http://google.com/some/file!.html", result: false
    32. "test.sub-domain3.sub-domain2.sub-domain1.domain-tld-0:8080/demoapp/test/mypage.html?foo1=bar1&foo2=bar2", result: false
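
A test list like the one above can be checked mechanically. The following is only a sketch: the class name FinalRegexDemo and the helpers isValid, expected and countMismatches are hypothetical, and just a subset of the cases is included. The pattern string is the one shown under "Regex used".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class FinalRegexDemo {
    // Pattern from "Regex used" above, already in Java string-literal form.
    static final Pattern URL = Pattern.compile(
        "^((http:\\/\\/)|(https:\\/\\/))?(www.)?"
        + "(([a-zA-Z0-9-]+)([\\.a-zA-Z0-9-]*)|(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}))?"
        + "(:\\d+)?([\\/a-zA-Z0-9\\.-]*)$");

    static boolean isValid(String url) {
        return URL.matcher(url).matches();
    }

    // A subset of the test cases listed above, mapped to their stated results.
    static Map<String, Boolean> expected() {
        Map<String, Boolean> cases = new LinkedHashMap<>();
        cases.put("http://net.tutsplus.com/about", true);
        cases.put("www.google.com", true);
        cases.put("http://localhost:8080/demoapp/test/", true);
        cases.put("https://127.0.0.1:8080/demoapp/test-path/mypage.html", true);
        cases.put("https://192.213.23.12:/branch/mypage/", false);
        cases.put("javascript('XSS');", false);
        cases.put("<script></script>", false);
        cases.put("http://google.com/some/file!.html", false);
        cases.put("https://192.213.23.12:231/branch/mypage?foo=bar", false);
        return cases;
    }

    // Counts the cases whose actual result differs from the stated one.
    static int countMismatches() {
        int mismatches = 0;
        for (Map.Entry<String, Boolean> e : expected().entrySet()) {
            if (isValid(e.getKey()) != e.getValue()) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches());
    }
}
```

Because every group in this pattern is optional, it also matches the empty string, so a caller would probably want a separate non-empty check.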

0

Thanks to @Tunaki for the patient answer.

However, the regex does not work.

The correct version is

/^((http|https):\/\/)?(www.)?(([a-zA-Z-])+|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(:\d+)?))([^?]*)?$/
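
The surrounding /.../ looks like regex-literal syntax from a language where / delimits the pattern, which would explain why the slashes need escaping there; that reading is my assumption. In Java there are no delimiters, and \/ simply means a literal /, so the escaped and unescaped forms should behave the same. A small sketch to check this (the class name SlashEscapeDemo and the helper agree are illustrative):

```java
import java.util.regex.Pattern;

class SlashEscapeDemo {
    // Everything after the scheme group is identical in both versions.
    static final String TAIL =
        "(www.)?(([a-zA-Z-])+|(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}(:\\d+)?))([^?]*)?$";

    // Version with plain slashes.
    static final Pattern PLAIN = Pattern.compile("^((http|https)://)?" + TAIL);
    // Version with escaped slashes, minus the surrounding / delimiters.
    static final Pattern ESCAPED = Pattern.compile("^((http|https):\\/\\/)?" + TAIL);

    // True when both patterns give the same verdict for the input.
    static boolean agree(String url) {
        return PLAIN.matcher(url).matches() == ESCAPED.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.example.com/mypage.aspx",
            "https://127.0.0.1:8080/demoapp",
            "http://example.com/page?foo=bar"
        };
        for (String s : samples) {
            System.out.println(agree(s) + "  " + s);
        }
    }
}
```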
