1

Regex newbie here. I have a bunch of URLs from which I need to extract some substrings, and I am using a regular expression to do it.

Ex: If my URL is https://chrome.google.com/webstore/detail/vt-hokie-stone-theme/enmbbbhbkojhbkbolmfgbmlcgpkjjlja?hl=en-US, I need to extract 1. the vt-hokie-stone-theme part and 2. the enmbbbhbkojhbkbolmfgbmlcgpkjjlja part from this URL into two separate variables.

The initial part of my URL always remains constant, so I built the following regular expression: detail\/([a-z0-9\-]+)\/([a-z]+), and I am trying to match it on http://www.pythonregex.com/

I see that regex.findall(string) gives me what I want, but I have the following questions:

  1. I want them in two separate variables, instead of having them in a list in a single variable. How do I do it?

  2. Also, while checking on pythonregex, the regex.findall(string) command gives the output as [(u'vt-hokie-stone-theme', u'enmbbbhbkojhbkbolmfgbmlcgpkjjlja')]. I understand that the preceding u means unicode but I don't want it in my output. How do I remove it?

3 Answers

3
  1. You can use tuple/list assignment syntax to achieve this:

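    import re
    try:
        # groups() returns a tuple, which unpacks into the two variables
        var1, var2 = re.search(r"detail\/([a-z0-9\-]+)\/([a-z]+)", my_url).groups()
    except AttributeError:
        # re.search() returned None because the pattern did not match
        var1 = var2 = ""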
    
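  2. The u prefix shows up only in the website's output, presumably because it passes your input as a unicode string; it is part of how Python 2 displays unicode strings, not part of the value. If you run the regex on a normal string in plain Python, the return values will be normal strings, so you don't have to worry about it.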
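
As a rough sketch of the same idea without relying on the exception, the match can be checked for None before unpacking. The helper name extract_parts and the variable names name and ext_id are made up for illustration; the pattern is the one from the question with the unnecessary backslashes dropped, and the URL is the example from the question.

```python
import re

# Pattern from the question; "/" and a trailing "-" in a character class
# do not need escaping in Python.
DETAIL_RE = re.compile(r"detail/([a-z0-9-]+)/([a-z]+)")


def extract_parts(url):
    """Return (name, ext_id), or ("", "") if the URL does not match."""
    match = DETAIL_RE.search(url)
    if match is None:
        return "", ""
    # groups() is a tuple with one entry per capturing group.
    return match.groups()


url = ("https://chrome.google.com/webstore/detail/"
       "vt-hokie-stone-theme/enmbbbhbkojhbkbolmfgbmlcgpkjjlja?hl=en-US")
name, ext_id = extract_parts(url)
print(name)
print(ext_id)
```

Whether to fall back to empty strings or raise an error on a non-matching URL depends on what the calling code expects; the empty-string fallback here simply mirrors the answer above.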

3 Comments

This will break if the regexp doesn't match.
Thanks for noticing. Edited the answer to handle the exception in case the regexp doesn't match.
Thanks, that's helpful! :)
0
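  1. I personally don't see the issue with just setting the variables from the first element of the list that findall() returns. But if you're confident that your regex will always match the URL string, you can try re.match. Keep in mind that re.match only matches at the beginning of the string, so the pattern has to cover the URL from its first character: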

    In [22]: regex = re.compile('a(bc)(cd)')

    In [23]: regex.match('abccd').groups()

    Out[23]: ('bc', 'cd')

  2. What's the issue with unicode? Why don't you want to keep it? The regex will only return ASCII anyway, so that's not an issue. Either way, if it's really important that they be regular strings, just convert them with str():

    str(u'abc') == 'abc'
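
For what it's worth, here is a small sketch of both options on the URL from the question. The variable names theme_name and theme_id are just placeholders. It assumes the question's pattern (written without the optional backslashes), and it also shows that match() only succeeds at the start of the string, whereas search() scans the whole string.

```python
import re

url = ("https://chrome.google.com/webstore/detail/"
       "vt-hokie-stone-theme/enmbbbhbkojhbkbolmfgbmlcgpkjjlja?hl=en-US")
regex = re.compile(r"detail/([a-z0-9-]+)/([a-z]+)")

# findall() returns a list of tuples, one tuple per match, so the first
# element can be unpacked straight into two variables.
matches = regex.findall(url)
if matches:
    theme_name, theme_id = matches[0]
else:
    theme_name = theme_id = ""

# match() is anchored at the start of the string, so this unanchored
# pattern finds nothing there; search() looks anywhere in the string.
anchored = regex.match(url)    # None: the URL starts with "https://"
anywhere = regex.search(url)   # a match object
```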

0

You can use the regex below to achieve the same thing, provided you are certain of the format of the URL. Note that the last .* in the base group is non-greedy (.*?), as is the .* that captures the theme group; the tail group then takes the rest of the URL, including the query string.

>>> import re
>>> var = 'https://chrome.google.com/webstore/detail/vt-hokie-stone-theme/enmbbbhbkojhbkbolmfgbmlcgpkjjlja?hl=en-U'

>>> match = re.match(r"(?P<base>.*/webstore/.*?/)(?P<theme>.*?)/(?P<tail>.*)", var)
>>> if match:
...     print(match.group('base'))
...     print(match.group('theme'))
...     print(match.group('tail'))

https://chrome.google.com/webstore/detail/
vt-hokie-stone-theme
enmbbbhbkojhbkbolmfgbmlcgpkjjlja?hl=en-U
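
If the query string is not wanted in the last group, one possible variation (a sketch, not the only way) is to use character classes that stop at /, ? and # instead of .*. The group names name and ext_id are hypothetical, and the pattern assumes the URL always contains /webstore/detail/ as stated in the question.

```python
import re

# Hypothetical group names; [^/?#]+ stops at a slash, a query string,
# or a fragment, so "?hl=..." is never captured.
PATTERN = re.compile(
    r"/webstore/detail/(?P<name>[^/?#]+)/(?P<ext_id>[^/?#]+)"
)

var = ("https://chrome.google.com/webstore/detail/"
       "vt-hokie-stone-theme/enmbbbhbkojhbkbolmfgbmlcgpkjjlja?hl=en-U")

match = PATTERN.search(var)
# groupdict() maps each named group to the text it captured.
parts = match.groupdict() if match else {}
print(parts.get("name"))
print(parts.get("ext_id"))
```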
