Python truncates a valid regex pattern string

Question

Honestly, trying to find a solution to this problem has been driving me insane, because every answer is either about using regex to truncate a string, or regex patterns having a max length (in which case, shouldn't it throw an error, not truncate the pattern string?)

Anyways. I'm using a regex pattern supplied by my employer. The intent is to match only the host name in any url string (so like python.org from https://docs.python.org/3/howto/regex.html). I've seen recommendations to use urllib.parse, but it doesn't strip out the hostname properly if there is a subdomain. Here is the regex string I was given to use:

\b(([a-zA-Z0-9\-_]+)\.)+(?!exe|php|dll|doc|docx|txt|rtf|odt|xls|xlsx|ppt|pptx|bin|pcap|ioc|pdf|mdb|asp|html|xml|jpg|gif|png|lnk|log|vbs|lco|bat|shell|quit|pdb|vbp|bdoda|bsspx|save|cpl|wav|tmp|close|py|ico|ini|sleep|run|dat|scr|jar|jxr|apt|w32|css|js|xpi|class|apk|rar|zip|hlp|tmp|cpp|crl|cfg|cer|plg|tmp)([a-zA-Z]{2,5}|support|report|i2p|technology|xn--p1ai|com#|moscow|technology)

It's very long. If I place it into a regex checker such as https://pythex.org, it happily tells me that it works perfectly. However, if I use either a Python shell or the Python interpreter, compiling it and then returning the compiled pattern gives me this:

re.compile('\\b(([a-zA-Z0-9\\-_]+)\\.)+(?!exe|php|dll|doc|docx|txt|rtf|odt|xls|xlsx|ppt|pptx|bin|pcap|ioc|pdf|mdb|asp|html|xml|jpg|gif|png|lnk|log|vbs|lco|bat|shell|quit|pdb|vbp|bdoda|bsspx|save|cpl|wav|tmp|clos)

Can someone tell me why it's being truncated (for my own knowledge), and suggest a better way to do things? The goal is to do something like this:

https://docs.python.org/3/library/socket.html -> python.org
www.example.info                              -> example.info
docs.google.com                               -> google.com

Out[227]: re.compile(r'\x08(([a-zA-Z0-9\-_]+)\.)+(?!exe|php|dll|doc|docx|txt|rtf|odt|xls|xlsx|ppt|pptx|bin|pcap|ioc|pdf|mdb|asp|html|xml|jpg|gif|png|lnk|log|vbs|lco|bat|shell|quit|pdb|vbp|bdoda|bsspx|save|cpl|wav|tmp|close|py|ico|ini|sleep|run|dat|scr|jar|jxr|apt|w32|css|js|xpi|class|apk|rar|zip|hlp|tmp|cpp|crl|cfg|cer|plg|tmp)([a-zA-Z]{2,5}|support|report|i2p|technology|xn--p1ai|com#|moscow|technology)', re.UNICODE). You sure you didn't make a typo somewhere?
– Uvar, Commented Nov 8, 2017 at 19:17
In this case use urllib and build the code to strip the domain name the way you want. But please stop this cheat.
– Casimir et Hippolyte, Commented Nov 8, 2017 at 19:21
The pattern's string representation may be truncated, but the pattern still works as expected. Have you actually used it?
– Aran-Fey, Commented Nov 8, 2017 at 19:22
Please calm down, folks. I did not write the regex. But that's a good point about using urllib and then building from there.
– K. Whitt, Commented Nov 8, 2017 at 19:31
As for typos, I definitely could have made a mistake. It's long, and keeping track of things is very difficult in it. That's one reason I was looking for another solution. Maybe it has to do with a hard wrap length in my IDE?
– K. Whitt, Commented Nov 8, 2017 at 19:33
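
Aran-Fey's point can be checked directly. In CPython, the repr of a compiled pattern is shortened for display (to roughly the first 200 characters of the pattern string), while the compiled object keeps the full pattern in its .pattern attribute. A minimal sketch, assuming the pattern from the question joined into one raw string and using the sample URL from the question:

```python
import re

# The pattern from the question, split across adjacent raw-string literals
# purely for readability; Python concatenates them into one string.
PATTERN = (
    r"\b(([a-zA-Z0-9\-_]+)\.)+"
    r"(?!exe|php|dll|doc|docx|txt|rtf|odt|xls|xlsx|ppt|pptx|bin|pcap|ioc|pdf|"
    r"mdb|asp|html|xml|jpg|gif|png|lnk|log|vbs|lco|bat|shell|quit|pdb|vbp|"
    r"bdoda|bsspx|save|cpl|wav|tmp|close|py|ico|ini|sleep|run|dat|scr|jar|jxr|"
    r"apt|w32|css|js|xpi|class|apk|rar|zip|hlp|tmp|cpp|crl|cfg|cer|plg|tmp)"
    r"([a-zA-Z]{2,5}|support|report|i2p|technology|xn--p1ai|com#|moscow|"
    r"technology)"
)

compiled = re.compile(PATTERN)

# What the shell echoes back is only the repr, which is cut short for display.
print(repr(compiled))

# The compiled object still holds the whole pattern string.
print(compiled.pattern == PATTERN)
print(len(PATTERN), len(repr(compiled)))

# The pattern is usable despite the shortened repr.
match = compiled.search("https://docs.python.org/3/howto/regex.html")
print(match.group(0) if match else None)
```

If the comparison prints True, nothing was lost at compile time and only the displayed repr is shortened. Note that the match covers the whole dotted host rather than just python.org, which is a separate issue from the truncation.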

HSchmachty · Accepted Answer · 2017-11-08 19:53:22Z

Can someone tell me why it's being truncated (for my own knowledge), and suggest a better way to do things?

Python has a regex pattern limit. See this and this: both are questions where the maximum limit is reached.

suggest a better way to do things?

Casimir's comment is right, though: urllib.parse's urlparse would achieve your intended result in a much neater fashion.

The answer is probably a combination of urlparse and however you determine what is an extension and what isn't. This may help: Get root domain.
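
A minimal sketch of the urlparse part, under a deliberately naive assumption: the root domain is taken to be the last two labels of the hostname. The helper name root_domain is made up for illustration. The two-label rule is wrong for multi-label suffixes such as co.uk, so a real solution would need a public suffix list for the "what is an extension" part.

```python
from urllib.parse import urlparse


def root_domain(url: str) -> str:
    """Hypothetical helper: return the last two labels of the URL's hostname."""
    # urlparse only recognises a host when the string has a scheme or starts
    # with "//", so bare inputs such as "docs.google.com" get a "//" prefix.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Naive assumption: the registrable domain is the last two labels.
    return ".".join(labels[-2:])


for example in (
    "https://docs.python.org/3/library/socket.html",
    "www.example.info",
    "docs.google.com",
):
    print(example, "->", root_domain(example))
```

For the three inputs listed in the question this yields python.org, example.info and google.com respectively.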

2 Comments

K. Whitt Over a year ago

Hey, thanks! I'll have to read those links. Regex has always been a confusing beast for me. As far as the parsing itself, I found a perfect solution in tldextract. It separates any url into subdomain, domain, and suffix reliably.

HSchmachty Over a year ago

Glad you found the perfect solution! I'll have to check it out too.
