8

I'm not a regex expert and I'm breaking my head trying to do one that seems very simple and works in python 2.7: validate the path of an URL (no hostname) without the query string. In other words, a string that starts with /, allows alphanumeric values and doesn't allow any other special chars except these: /, ., -

I found this post that is very similar to what I need but for me isn't working at all, I can test with for example aaa and it will return true even if it doesn't start with /.

The current regex that I have kinda working is this one:

[^/+a-zA-Z0-9.-]

but it doesn't work with paths that don't start with /. For example:

  • /aaa -> true, this is ok
  • /aaa/bbb -> true, this is ok
  • /aaa?q=x -> false, this is ok
  • aaa -> true, this is NOT ok
0

4 Answers 4

6

The regex you've defined is a character class. Instead, try:

^\/[/.a-zA-Z0-9-]+$
Sign up to request clarification or add additional context in comments.

4 Comments

The difference here being, that you require the string to begin with / (^ means begin-of-line) and that you've added the + and end-of-line anchor $. Which makes the whole character-class capture optional
^\/[\/\.a-zA-Z0-9\-]+$, the /, ., and - should be escaped.
@Jon - You are right about escaping / in many implementations, but in Python, it need not be escaped. . need not be escaped either, as long as it's within a character class, at least in any implementation I know of. And - need not be escaped if it is specified first or last in a character class. Of course, not that it hurts to escape "too much," but just wanted to clarify these details.
you might want to use \A and \z to match the beginning and end of a potentially multi-line string
3

In other words, a string that starts with /, allows alphanumeric values and doesn't allow any other special chars except these: /, ., -

You are missing some characters that are valid in URLs

import string
import urllib
import urlparse

valid_chars = string.letters + string.digits + '/.-~'
valid_paths = []

urls = ['http://www.my.uni.edu/info/matriculation/enroling.html',
    'http://info.my.org/AboutUs/Phonebook',
    'http://www.library.my.town.va.us/Catalogue/76523471236%2Fwen44--4.98',
    'http://www.my.org/462F4F2D4241522A314159265358979323846',
        'http://www.myu.edu/org/admin/people#andy',
        'http://www.w3.org/RDB/EMP?*%20where%20name%%3Ddobbins']

for i in urls:
   path = urllib.unquote(urlparse.urlparse(i).path)
   if path[0] == '/' and len([i for i in path if i in valid_chars]) == len(path):
        valid_paths.append(path)

Comments

0

Try this:

^(?:/[a-zA-Z0-9.-&&[^/]]*)+$

Seems to work. See the picture: enter image description here

Comments

0

Try posting some more code. I can't figure out how you're using your regex from your question. What's confusing me is, your re expression [^/+a-zA-Z0-9.-] basically says:

Match a single character if it is:

not a / or a-z (caps and lower both) or 0-9 or a dot or a dash

It doesn't quite make sense to me without knowing how you use it, as it only matches a single charactre and not a whole URL string.

I'm not sure I understand why you cannot start with a /.

1 Comment

Because it's an url path. Right after host, there must be a slash. Let's say you have var ajax_path='example.com'+unvalidated_path . If the path doesn't start with /, malicious user could insert ".", like .externaldomain.com and perform XSS attack.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.