I'd like to replace the below regex with a unicode-friendly version that will catch things like http://➡.ws and other non-ascii IRIs. The purpose is to grab these out of users' text and encode and html-ize them into real links.

Python provides a re.UNICODE flag, which changes the meaning of \w, but as far as I can see that's not much help here: \w is defined as "alphanumeric characters and underscore", and not all of my character classes below include the underscore.

import re

domain_regex = re.compile(r"""
    (
        (https?://)
        (
            [0-9a-zA-Z]
            [0-9a-zA-Z_-]*
            \.
        )+
        [a-zA-Z]{2,4}
    )
    | # begins with an http scheme followed by a domain, or
    (
        (?<!   # negative look-behind
            [0-9a-zA-Z.@-]
        )
        (
            [0-9a-zA-Z]
            [0-9a-zA-Z_-]*
            \.
        )+
        # top-level domain names
        (?:  # grouped so the alternation applies only to the TLD
            com|ca|net|org|edu|gov|biz|info|mobi|name|
            us|uk|fr|au|be|ch|de|es|eu|it|tv|cn|jp
        )
    )
""", re.VERBOSE)

More non-ascii domains:

2 Answers

If you want to write "\w except underscore" you can do so using a negated character class:

[^\W_]
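
As a rough sketch of how that class could slot into the scheme-prefixed half of the question's pattern (the name unicode_domain is made up for this example, and [^\W\d_] for the TLD is an assumption about what "letters only" should mean here):

```python
import re

# For str patterns in Python 3, \w is Unicode-aware by default;
# on Python 2 you would pass re.UNICODE and use a unicode pattern.
unicode_domain = re.compile(r"""
    https?://
    (?:
        [^\W_]       # first character of a label: letter or digit, no underscore
        [\w\-]*      # rest of the label: word characters or hyphen
        \.
    )+
    [^\W\d_]{2,4}    # TLD: letters only (no digits, no underscore)
""", re.VERBOSE)

print(unicode_domain.search("see http://b\u00fccher.de now").group(0))
print(unicode_domain.search("http://example.com/path").group(0))

# Caveat: U+27A1 (the arrow in http://➡.ws) is a symbol, not a letter or
# digit, so \w-based classes do not match it.
print(unicode_domain.search("http://\u27a1.ws"))  # None
```

So this handles accented and non-Latin letters, but symbol-only labels like the arrow domain would need a wider character class.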

As buckley noted, "Python regex matching Unicode properties" presents some alternatives for combining regex and Unicode in Python. If all you want is alphanumeric, alphanumeric plus underscore, or letters only, it may be easier to stick with Mark Byers's suggestion: [^\W_], \w and [^\W\d_] respectively, with re.UNICODE active.
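
A quick way to see how the three classes differ, assuming Python 3 (where re.UNICODE is already the default for str patterns; the sample string is just an illustration):

```python
import re

sample = "na\u00efve_\u03c02 x"   # "naïve_π2 x"

# Alphanumeric only: underscore splits the match.
alnum = re.findall(r"[^\W_]+", sample, re.UNICODE)
# Alphanumeric plus underscore.
word = re.findall(r"\w+", sample, re.UNICODE)
# Letters only: digits and underscore are excluded.
letters = re.findall(r"[^\W\d_]+", sample, re.UNICODE)

print(alnum)    # ['naïve', 'π2', 'x']
print(word)     # ['naïve_π2', 'x']
print(letters)  # ['naïve', 'π', 'x']
```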

Otherwise, look up which character classes are valid as an IRI part and either use a regex engine that supports Unicode character classes, or, if you need a pure Python solution, I'd suggest the code I provided in an answer to that question (or a similar solution).
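
The code from that other answer isn't reproduced here, but a "similar solution" in pure Python might look something like the sketch below. The helper name char_class is hypothetical, and the choice of the "So" (Symbol, other) category is only for illustration because U+27A1 falls in it; which categories are actually valid in an IRI still has to be looked up.

```python
import re
import sys
import unicodedata

def char_class(categories):
    """Build a regex character class matching every code point whose
    Unicode general category is in `categories` (e.g. {"So"})."""
    chars = [chr(cp) for cp in range(sys.maxunicode + 1)
             if unicodedata.category(chr(cp)) in categories]
    return "[" + "".join(re.escape(c) for c in chars) + "]"

# Built once up front; scanning every code point takes a moment.
symbols = char_class({"So"})

# A label character is then "letter/digit, or one of the listed symbols".
label_char = re.compile("(?:[^\\W_]|" + symbols + ")")

print(bool(re.fullmatch(symbols, "\u27a1")))     # True: the arrow is in "So"
print(bool(re.fullmatch(symbols, "a")))          # False: letters are not
print(bool(label_char.fullmatch("\u27a1")))      # True
```

Enumerating every character keeps the code short; for large categories it would be better to collapse the code points into ranges before building the class.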
