2

I am trying to parse a URL string using a regular expression. Here is my pattern: qid=(.*?)&+? It does find the query string value, but if there is no & at the end of the URL, it fails!

Please take a look at the pythex.org page, where I am trying to extract the value of the query string parameter "qid".

2 Answers

5

You can (and probably should) solve it with urlparse instead:

>>> from urlparse import urlparse, parse_qs
>>> s = "https://xx.com/question/index?qid=2ss2830AA38Wng"
>>> parse_qs(urlparse(s).query)['qid'][0]
'2ss2830AA38Wng'
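
In Python 3 the same two functions live in urllib.parse. A minimal sketch of the approach above under that assumption; get_qid is a hypothetical helper name introduced only for illustration, and it returns None when the parameter is missing instead of raising KeyError:

```python
# Python 3: urlparse and parse_qs are provided by urllib.parse
from urllib.parse import urlparse, parse_qs


def get_qid(url):
    """Return the first 'qid' value in the URL's query string, or None.

    get_qid is a hypothetical helper name used only for this sketch.
    """
    # parse_qs maps each parameter name to a list of its values
    params = parse_qs(urlparse(url).query)
    values = params.get('qid')
    return values[0] if values else None


print(get_qid("https://xx.com/question/index?qid=2ss2830AA38Wng"))
print(get_qid("https://xx.com/question/index?qid=2ff38Wng&a=aubb"))
print(get_qid("https://xx.com/question/index"))
```

Because parse_qs splits on the parameter separators itself, the presence or absence of a trailing & no longer matters.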

As for the regular expression approach, you can check if there is & or the end of the string:

qid=(.*?)(?:&|$)

(?:...) here is a non-capturing group.
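
As a rough illustration, the pattern can be used with re.search like this. The names QID_RE and find_qid are hypothetical, chosen only for this sketch:

```python
import re

# Capture lazily, stopping at the next '&' or at the end of the string
QID_RE = re.compile(r'qid=(.*?)(?:&|$)')


def find_qid(url):
    """Return the captured qid value, or None if 'qid=' is not present."""
    match = QID_RE.search(url)
    return match.group(1) if match else None


print(find_qid('https://xx.com/question/index?qid=2ss2830AA38Wng'))
print(find_qid('https://xx.com/question/index?qid=2ff38Wng&a=aubb&d=ajfbjhcbha'))
```

Only group 1 is returned, so the & (or the end-of-string position) that terminates the match is not part of the result.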

3

I agree with @alecxe that this is best handled with urlparse. However, here are some re options. The main trick is using the lookbehind, (?<=...), and lookahead, (?=...), assertions.

The general pattern is: return something with 'qid=' behind it, and zero or one '&' ahead of it: '(?<=qid=)some_pattern(?=&)?'

If you process the URLs one at a time, so that each input string is a single line, this will work for any value of the qid variable: '(?<=qid=)([^&]*)(?=&)?'

However, if the input contains several URLs separated by newlines, then you also need to avoid matching the newline characters. Let's assume the newline is '\n' (but of course, different platforms use different newline sequences). Then you could use: '(?<=qid=)([^&\n]*)(?=&)?'

And lastly, if you are sure your qid variable will only hold alphanumeric values, you can avoid the uncertainty about the newline character and match only alphanumeric characters: '(?<=qid=)([A-Za-z0-9]*)(?=&)?'

import re

# Single line version: each URL is processed on its own
s_1 = 'https://xx.com/question/index?qid=2ss2830AA38Wng'
s_2 = 'https://xx.com/question/index?qid=2ff38Wng&a=aubb&d=ajfbjhcbha'
q_1 = r'(?<=qid=)([^&]*)(?=&)?'

print('Single line version')
print(re.findall(q_1, s_1))
print(re.findall(q_1, s_2))

# Multiline version V1: exclude the newline as well as '&'
s_m = s_1 + '\n' + s_2
q_m = r'(?<=qid=)([^&\n]*)(?=&)?'

print('Multiline version V1')
print(re.findall(q_m, s_m))

# Multiline version V2: match alphanumeric characters only
q_m_2 = r'(?<=qid=)([A-Za-z0-9]*)(?=&)?'

print('Multiline version V2')
print(re.findall(q_m_2, s_m))

Running this prints:

Single line version
['2ss2830AA38Wng']
['2ff38Wng']
Multiline version V1
['2ss2830AA38Wng', '2ff38Wng']
Multiline version V2
['2ss2830AA38Wng', '2ff38Wng']
