Extracting substring from URL using regex

Question

Regex newbie here. I have a bunch of URLs from which I need to extract some substrings, and I am using a regular expression to do it.

Ex: If my URL is https://chrome.google.com/webstore/detail/vt-hokie-stone-theme/enmbbbhbkojhbkbolmfgbmlcgpkjjlja?hl=en-US, I need to extract (1) the vt-hokie-stone-theme part and (2) the enmbbbhbkojhbkbolmfgbmlcgpkjjlja part from this URL into two separate variables.

The initial part of my URL always remains constant, so I built the following regular expression: detail\/([a-z0-9\-]+)\/([a-z]+) and I am trying to match it on http://www.pythonregex.com/

I see that regex.findall(string) gives me what I want, but I have the following questions:

1. I want them in two separate variables, instead of having them as a list in a single variable. How do I do it?
2. Also, while checking on pythonregex, the regex.findall(string) command gives the output as [(u'vt-hokie-stone-theme', u'enmbbbhbkojhbkbolmfgbmlcgpkjjlja')]. I understand that the preceding u means unicode, but I don't want it in my output. How do I remove it?

Sunny Nanda · Accepted Answer · 2014-01-26 07:05:55Z

3

You can use tuple/list assignment syntax to achieve this:

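import re

try:
    # groups() returns both captures as a tuple, which unpacks into two variables
    var1, var2 = re.search(r"detail\/([a-z0-9\-]+)\/([a-z]+)", my_url).groups()
except AttributeError:
    # re.search() returned None because the pattern did not match
    var1 = var2 = ""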

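The u prefix appears only in the website's output, because the site passes unicode strings to the regex. In plain Python 2, matching against an ordinary str returns ordinary str values, and in Python 3 the repr of a string has no u prefix at all. So, you don't have to worry about it.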
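As a minimal sketch of the same idea without the try/except, the match object can be checked for None before unpacking. The helper name extract_parts and the constant DETAIL_RE are hypothetical names chosen for illustration; the pattern and URL are the ones from the question.

```python
import re

# Pattern from the question, compiled once so it can be reused for many URLs.
DETAIL_RE = re.compile(r"detail/([a-z0-9\-]+)/([a-z]+)")


def extract_parts(url):
    """Return (name, extension_id), or ("", "") when the URL does not match."""
    match = DETAIL_RE.search(url)
    if match is None:
        return "", ""
    # groups() returns one string per capturing group, in order.
    return match.groups()


url = ("https://chrome.google.com/webstore/detail/"
       "vt-hokie-stone-theme/enmbbbhbkojhbkbolmfgbmlcgpkjjlja?hl=en-US")
var1, var2 = extract_parts(url)
print(var1)  # vt-hokie-stone-theme
print(var2)  # enmbbbhbkojhbkbolmfgbmlcgpkjjlja
```

Checking for None explicitly avoids catching an unrelated AttributeError by accident, at the cost of one extra line.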


3 Comments

michaelmeyer Over a year ago

This will break if the regexp doesn't match.

Sunny Nanda Over a year ago

Thanks for noticing. Edited the answer to handle the exception in case the regexp doesn't match.

TheRookierLearner Over a year ago

Thanks, that's helpful! :)

limasxgoesto0 · Answer · 2014-01-26 07:04:30Z

0

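I personally don't see the issue in just setting the variables from the first element of the list that findall() returns. But if you're confident that your regex is always going to match the URL string, you can use a match object and its groups() method. Note that re.match only matches at the beginning of the string, so for a pattern that starts in the middle of the URL you would use re.search instead. With a toy pattern: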

In [21]: import re

In [22]: regex = re.compile('a(bc)(cd)')

In [23]: regex.match('abccd').groups()
Out[23]: ('bc', 'cd')

What's the issue with unicode? Why don't you want to keep it? Your pattern only matches ASCII characters, so the captured text will be plain ASCII either way. If it's really important for them to be regular strings in Python 2, just convert each one with str():

str(u'abc') == 'abc'
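To make the two options above concrete, here is a rough sketch using the pattern and URL from the question. It assumes Python 3, where every str is already unicode and no u prefix is shown; the variable names are illustrative only.

```python
import re

url = ("https://chrome.google.com/webstore/detail/"
       "vt-hokie-stone-theme/enmbbbhbkojhbkbolmfgbmlcgpkjjlja?hl=en-US")
pattern = re.compile(r"detail/([a-z0-9\-]+)/([a-z]+)")

# findall() returns a list of tuples, one tuple per match.
matches = pattern.findall(url)
if matches:
    name, ext_id = matches[0]  # unpack the first (and here only) match
else:
    name = ext_id = ""

# match() only succeeds at the start of the string, so it finds nothing here,
# while search() scans the whole string for the pattern.
at_start = pattern.match(url)
anywhere = pattern.search(url)

print(name, ext_id)
print(at_start is None, anywhere is not None)
```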


sateesh · Answer · 2014-01-26 07:53:23Z

0

If you are certain of the format of the URL, you can use the regex below to achieve the same result with named groups. Note that the last .*? in the base group and the .*? that captures the theme group are both non-greedy, so each one stops at the first / that follows it.

>>> import re
>>> var = 'https://chrome.google.com/webstore/detail/vt-hokie-stone-theme/enmbbbhbkojhbkbolmfgbmlcgpkjjlja?hl=en-U'
>>> match = re.match(r"(?P<base>.*/webstore/.*?/)(?P<theme>.*?)/(?P<tail>.*)", var)
>>> if match:
...     print(match.group('base'))
...     print(match.group('theme'))
...     print(match.group('tail'))
...
https://chrome.google.com/webstore/detail/
vt-hokie-stone-theme
enmbbbhbkojhbkbolmfgbmlcgpkjjlja?hl=en-U
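
To illustrate why the non-greedy quantifiers matter, the sketch below compares the pattern above with a variant whose theme group is greedy. The URL is a made-up one with an extra path segment (an assumption for illustration only), since the difference shows up only when more than one / follows the theme; the names lazy and greedy are likewise hypothetical.

```python
import re

# Hypothetical URL with two path segments after the theme name.
url = "https://example.com/webstore/detail/some-theme/abc/def"

# Pattern from the answer: the theme group is non-greedy (.*?).
lazy = re.compile(r"(?P<base>.*/webstore/.*?/)(?P<theme>.*?)/(?P<tail>.*)")
# Variant for comparison: the theme group is greedy (.*).
greedy = re.compile(r"(?P<base>.*/webstore/.*?/)(?P<theme>.*)/(?P<tail>.*)")

lazy_groups = lazy.match(url).groupdict()
greedy_groups = greedy.match(url).groupdict()

# Non-greedy theme stops at the first "/" after it.
print(lazy_groups["theme"], "|", lazy_groups["tail"])      # some-theme | abc/def
# Greedy theme runs up to the last "/" in the string.
print(greedy_groups["theme"], "|", greedy_groups["tail"])  # some-theme/abc | def
```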
