Parsing query string using regular expression in python

Question

I am trying to parse a URL string using a regular expression. Here is my pattern: qid=(.*?)&+? It does find the query string value, but if there is no & at the end of the URL then it fails!

Please take a look at the pythex.org page, where I am trying to extract the value of the query string parameter "qid".

alecxe · Accepted Answer · 2016-05-24 17:36:51Z

You can (and probably should) solve it with urlparse instead:

>>> from urlparse import urlparse, parse_qs  # Python 3: from urllib.parse import urlparse, parse_qs
>>> s = "https://xx.com/question/index?qid=2ss2830AA38Wng"
>>> parse_qs(urlparse(s).query)['qid'][0]
'2ss2830AA38Wng'
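
On Python 3 the same functions live in urllib.parse. A minimal sketch of wrapping the lookup in a helper that also copes with a missing parameter; the name get_query_param and its default argument are illustrative assumptions, not part of the original answer:

```python
from urllib.parse import urlparse, parse_qs


def get_query_param(url, name, default=None):
    """Return the first value of query parameter `name`, or `default`.

    Hypothetical helper, for illustration only.
    """
    # parse_qs maps each parameter name to a list of its values.
    values = parse_qs(urlparse(url).query).get(name)
    return values[0] if values else default


print(get_query_param("https://xx.com/question/index?qid=2ss2830AA38Wng", "qid"))
print(get_query_param("https://xx.com/question/index?qid=2ff38Wng&a=aubb", "qid"))
print(get_query_param("https://xx.com/question/index?a=aubb", "qid"))
```

The first two calls print the qid value; the last prints None because the parameter is absent.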

As for the regular expression approach, you can check whether the value is followed by & or by the end of the string:

qid=(.*?)(?:&|$)

(?:...) here is a non-capturing group.
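
A short sketch of how this pattern might be used with re.search, written in Python 3 syntax. The function name extract_qid and the third URL (which has no qid) are assumptions added for illustration:

```python
import re

# Lazily match up to the next '&' or the end of the string, whichever comes first.
QID_RE = re.compile(r"qid=(.*?)(?:&|$)")


def extract_qid(url):
    """Return the captured qid value, or None if the pattern does not match."""
    match = QID_RE.search(url)
    return match.group(1) if match else None


print(extract_qid("https://xx.com/question/index?qid=2ss2830AA38Wng"))
print(extract_qid("https://xx.com/question/index?qid=2ff38Wng&a=aubb&d=ajfbjhcbha"))
print(extract_qid("https://xx.com/question/index?a=aubb"))
```

Without re.MULTILINE, $ matches only at the end of the whole string (or just before a trailing newline), so this form suits URLs that are processed one at a time.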

andrew · Accepted Answer · 2016-05-24 18:50:21Z

I agree with @alecxe that this is best handled with urlparse. However, here are some re options. The main trick is using the lookbehind, (?<=...), and lookahead, (?=...), assertions.

The general pattern is: return something with 'qid=' behind it, and zero or one '&' ahead of it: '(?<=qid=)some_pattern(?=&)?'

If you disable multiline mode and process the URLs individually, this will work for any value of the qid variable: '(?<=qid=)([^&]*)(?=&)?'

However, if you have to use multiline mode, then you also need to avoid matching the newline characters. Let's assume the newline is '\n' (but of course, different platforms use different line endings, such as '\r\n'). Then you could use: '(?<=qid=)([^&\n]*)(?=&)?'

And lastly, if you are sure your qid variable will only store alphanumeric values, you could avoid the uncertainty about the newline character and just match alphanumeric values: '(?<=qid=)([A-Za-z0-9]*)(?=&)?'

import re

# Single line version: one URL per string
s_1 = 'https://xx.com/question/index?qid=2ss2830AA38Wng'
s_2 = 'https://xx.com/question/index?qid=2ff38Wng&a=aubb&d=ajfbjhcbha'
q_1 = r'(?<=qid=)([^&]*)(?=&)?'

print('Single line version')
print(re.findall(q_1, s_1))
print(re.findall(q_1, s_2))

# Multiline version V1: exclude '\n' from the value
s_m = s_1 + '\n' + s_2
q_m = r'(?<=qid=)([^&\n]*)(?=&)?'

print('Multiline version V1')
print(re.findall(q_m, s_m))

# Multiline version V2: accept alphanumeric characters only
q_m_2 = r'(?<=qid=)([A-Za-z0-9]*)(?=&)?'

print('Multiline version V2')
print(re.findall(q_m_2, s_m))

Running this prints:

Single line version
['2ss2830AA38Wng']
['2ff38Wng']
Multiline version V1
['2ss2830AA38Wng', '2ff38Wng']
Multiline version V2
['2ss2830AA38Wng', '2ff38Wng']
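
On the newline point: if the line-ending convention is not known in advance, one option is to exclude both '\r' and '\n' from the character class; another is to split the text with str.splitlines() and apply the single-line pattern to each line. The sketch below uses Python 3 syntax; s_crlf is a made-up input joined with '\r\n', and the trailing optional lookahead is left out because the character class already stops before '&':

```python
import re

# Hypothetical input: two URLs separated by a Windows-style line ending.
s_crlf = (
    "https://xx.com/question/index?qid=2ss2830AA38Wng\r\n"
    "https://xx.com/question/index?qid=2ff38Wng&a=aubb&d=ajfbjhcbha"
)

# Option 1: exclude carriage return and line feed from the value.
q_any_newline = r"(?<=qid=)[^&\r\n]*"
result_1 = re.findall(q_any_newline, s_crlf)
print(result_1)

# Option 2: split into lines first, then apply the single-line pattern.
q_single = r"(?<=qid=)[^&]*"
result_2 = [m for line in s_crlf.splitlines() for m in re.findall(q_single, line)]
print(result_2)
```

Both options should yield the same two qid values for this input.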
