
Honestly, trying to find a solution to this problem has been driving me insane, because every answer is either about using regex to truncate a string, or regex patterns having a max length (in which case, shouldn't it throw an error, not truncate the pattern string?)

Anyways. I'm using a regex pattern supplied by my employer. The intent is to match only the host name in any url string (so like python.org from https://docs.python.org/3/howto/regex.html). I've seen recommendations to use urllib.parse, but it doesn't strip out the hostname properly if there is a subdomain. Here is the regex string I was given to use:

\b(([a-zA-Z0-9\-_]+)\.)+
(?!exe|php|dll|doc|docx|txt|rtf|odt|xls|xlsx|ppt|pptx|bin|
pcap|ioc|pdf|mdb|asp|html|xml|jpg|gif|png|lnk|log|vbs|lco|bat|shell|quit|pdb|vbp|
bdoda|bsspx|save|cpl|wav|tmp|close|py|ico|ini|sleep|run|dat|scr|jar|jxr|apt|w32|css|
js|xpi|class|apk|rar|zip|hlp|tmp|cpp|crl|cfg|cer|plg|tmp)([a-zA-Z]{2,5}|support|report|
i2p|technology|xn--p1ai|com#|moscow|technology)

It's very long. If I place it into a regex checker such as https://pythex.org, it happily tells me that it works perfectly. However, if I use either a Python shell or the Python interpreter, compiling it and then returning the compiled pattern gives me this:

re.compile('\\b(([a-zA-Z0-9\\-_]+)\\.)+(?!exe|php|dll|doc|docx|txt|rtf|odt|xls|xlsx|
ppt|pptx|bin|pcap|ioc|pdf|mdb|asp|html|xml|jpg|gif|png|lnk|log|vbs|lco|bat|shell|quit|
pdb|vbp|bdoda|bsspx|save|cpl|wav|tmp|clos)

Can someone tell me why it's being truncated (for my own knowledge), and suggest a better way to do things? The goal is to do something like this:

https://docs.python.org/3/library/socket.html -> python.org
www.example.info                              -> example.info
docs.google.com                               -> google.com
  • Out[227]: re.compile(r'\x08(([a-zA-Z0-9\-_]+)\.)+(?!exe|php|dll|doc|docx|txt|rtf|odt|xls|xlsx|ppt|pptx|bin|pcap|ioc|pdf|mdb|asp|html|xml|jpg|gif|png|lnk|log|vbs|lco|bat|shell|quit|pdb|vbp|bdoda|bsspx|save|cpl|wav|tmp|close|py|ico|ini|sleep|run|dat|scr|jar|jxr|apt|w32|css|js|xpi|class|apk|rar|zip|hlp|tmp|cpp|crl|cfg|cer|plg|tmp)([a-zA-Z]{2,5}|support|report|i2p|technology|xn--p1ai|com#|moscow|technology)', re.UNICODE). You sure you didn't make a typo somewhere? Commented Nov 8, 2017 at 19:17
  • In this case use urllib and build the code to strip the domain name the way you want. But please stop this cheat. Commented Nov 8, 2017 at 19:21
  • The pattern's string representation may be truncated, but the pattern still works as expected. Have you actually used it? Commented Nov 8, 2017 at 19:22
  • Please calm down, folks. I did not write the regex. But that's a good point about using urllib and then building from there. Commented Nov 8, 2017 at 19:31
  • As for typos, I definitely could have made a mistake. It's long, and keeping track of things is very difficult in it. That's one reason I was looking for another solution. Maybe it has to do with a hard wrap length in my IDE? Commented Nov 8, 2017 at 19:33

1 Answer

Can someone tell me why it's being truncated (for my own knowledge), and suggest a better way to do things?

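Python's re module does have some size limits; see this and this, questions where a limit was reached. In your output, though, it looks like only the repr of the compiled object is cut off, as one of the comments suggests: CPython shortens the pattern shown by repr() to roughly 200 characters, and the compiled pattern itself (check its .pattern attribute) should still hold the full string.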
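A quick way to check whether the pattern or only its printed form is cut off is to compare the compiled object's .pattern attribute with the string you passed in. This is a minimal sketch, assuming CPython 3; long_pattern is a made-up pattern that is only there to be long, not the one from the question.

```python
import re

# Hypothetical long pattern, built only to exceed the length repr() displays.
long_pattern = r"\b(?:" + "|".join(f"word{i}" for i in range(100)) + r")\b"
compiled = re.compile(long_pattern)

# repr() is what the shell echoes back; on CPython 3 it is shortened.
print(len(long_pattern), len(repr(compiled)))
print(repr(compiled))

# The full pattern string is still stored on the compiled object.
print(compiled.pattern == long_pattern)  # True

# The last alternative, far beyond the displayed part, still matches.
print(compiled.search("some word99 here") is not None)  # True
```

If the comparison prints True and the late alternatives still match, nothing was lost at compile time; the shell is just abbreviating what it shows.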

suggest a better way to do things?

Casimir's comment is right, though: urllib.parse's urlparse would achieve your intended result in a much neater fashion.

The answer is probably a combination of urlparse and whatever rule you use to decide what counts as an extension and what doesn't. This may help: Get root domain.
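As a rough sketch of the urlparse route, assuming the "rule" is simply "keep the last two labels of the host": naive_registered_domain is a hypothetical helper name, and the two-label rule is my simplification, not something urllib provides.

```python
from urllib.parse import urlparse


def naive_registered_domain(url):
    """Return the last two labels of the URL's host (a simplification)."""
    # urlparse only fills in the host when there is a scheme or a leading "//",
    # so bare inputs like "www.example.info" need the prefix added.
    if "//" not in url:
        url = "//" + url
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Assumption: the suffix is a single label such as "org" or "com".
    return ".".join(labels[-2:])


for url in (
    "https://docs.python.org/3/library/socket.html",
    "www.example.info",
    "docs.google.com",
):
    print(url, "->", naive_registered_domain(url))
```

This covers the three examples in the question, but the two-label assumption breaks on multi-part suffixes such as co.uk (it would return co.uk rather than the site's domain). Handling those needs a suffix list, which is the job a library like the tldextract package mentioned in the comments is meant to do.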

2 Comments

Hey, thanks! I'll have to read those links. Regex has always been a confusing beast for me. As far as the parsing itself, I found a perfect solution in tldextract. It separates any url into subdomain, domain, and suffix reliably.
Glad you found the perfect solution! I'll have to check it out too.
