Parsing URL with regex

Question

I'm trying to combine if/else logic inside my regular expression: if a certain pattern exists in the string, capture one thing; if not, capture another.

The string is 'https://www.searchpage.com/searchcompany.aspx?companyId=41490234&page=0&leftlink=true' and I want to extract the stuff around the '?'.

So if '?' is detected inside the string, the regular expression should capture everything after the '?' mark; if not, then just capture from the beginning.

I used: '(.*\?.*)?(\?.*&.*)|(^&.*)' but it didn't work...

Any suggestions?

Thanks!

If you can guarantee that there won't be any other question marks later, you could use something like r".*?\??([^?]+)".
– Tom Hunt, Commented Feb 19, 2015 at 22:18
Thanks for the reply. But this still captures the 'search..' part, which I only want to capture when there's no question mark detected.
– JudyJiang, Commented Feb 19, 2015 at 22:20
Why not use urlparse? It allows you to get all the parts of the URL.
– Open AI - Opting Out, Commented Feb 19, 2015 at 22:21

Open AI - Opting Out · Accepted Answer · 2015-02-19 22:32:23Z

5

Use urlparse:

>>> import urlparse  # Python 2; on Python 3 use: from urllib import parse as urlparse
>>> parse_result = urlparse.urlparse('https://www.searchpage.com/searchcompany.aspx?companyId=41490234&page=0&leftlink=true')

>>> parse_result
ParseResult(scheme='https', netloc='www.searchpage.com', 
path='/searchcompany.aspx', params='', 
query='companyId=41490234&page=0&leftlink=true', fragment='')

>>> urlparse.parse_qs(parse_result.query)
{'leftlink': ['true'], 'page': ['0'], 'companyId': ['41490234']}

The last result is a dictionary that maps each query key to a list of its values.
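
On Python 3 the same functions live in urllib.parse. A minimal sketch of the equivalent calls, assuming Python 3 and reusing the URL from the question:

```python
from urllib.parse import urlparse, parse_qs

url = "https://www.searchpage.com/searchcompany.aspx?companyId=41490234&page=0&leftlink=true"

parse_result = urlparse(url)            # ParseResult with scheme, netloc, path, params, query, fragment
query = parse_result.query              # text after the '?', without the '?' itself
params = parse_qs(query)                # dict mapping each key to a list of values

print(parse_result.path)
print(query)
print(params["companyId"])
```

Each value is a list because a key may appear more than once in a query string.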

answered Feb 19, 2015 at 22:32


Joran Beasley · Accepted Answer · 2015-02-19 22:31:27Z

4

Regex might not be the best solution to this problem... why not just

my_url.split("?",1)

if that is truly all you wish to do

or as others have suggested

from urlparse import urlparse  # Python 2; on Python 3: from urllib.parse import urlparse
print(urlparse(my_url))
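
A minimal sketch of the split idea that also covers the "no question mark" case from the question. It uses str.partition rather than split so the separator can be checked directly; the function name after_question_mark is hypothetical:

```python
def after_question_mark(text):
    """Return the part after the first '?', or the whole string if there is no '?'."""
    head, sep, tail = text.partition("?")
    # sep is '?' when a question mark was found, '' otherwise
    return tail if sep else head


url = "https://www.searchpage.com/searchcompany.aspx?companyId=41490234&page=0&leftlink=true"
print(after_question_mark(url))
print(after_question_mark("/company/Analytics/GetService"))
```

Only the first '?' is treated as the separator, so any later question marks stay in the returned text.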

edited Feb 19, 2015 at 22:31

answered Feb 19, 2015 at 22:23


1 Comment

JudyJiang Over a year ago

Because I want to parse and extract parts not only from URLs but also from query and path strings. So there's the URL string as above, but also a path string such as '/company/Analytics/GetService' and a query string such as 'companyId=4343&type=0&page=11'.
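
One possible way to handle all three kinds of input from this comment, sketched under assumptions. On Python 3, urlsplit puts a bare query string into path because there is no '?', so the hypothetical helper split_path_and_params below guesses that a string containing '=' but no '/' is a query string. That heuristic is an assumption and may misclassify unusual paths.

```python
from urllib.parse import urlsplit, parse_qs


def split_path_and_params(text):
    """Hypothetical helper: return (path, params) for a full URL, a path, or a bare query string."""
    parts = urlsplit(text)
    if not parts.query and "=" in parts.path and "/" not in parts.path:
        # Assumption: no '?', an '=' and no '/' means the input is a bare query string
        return "", parse_qs(parts.path)
    return parts.path, parse_qs(parts.query)


print(split_path_and_params("https://www.searchpage.com/searchcompany.aspx?companyId=41490234&page=0&leftlink=true"))
print(split_path_and_params("/company/Analytics/GetService"))
print(split_path_and_params("companyId=4343&type=0&page=11"))
```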

Zero Piraeus · Accepted Answer · 2015-02-19 22:26:57Z

2

This regex:

(^[^?]*$|(?<=\?).*)

captures:

^[^?]*$ everything, if there's no ?, or
(?<=\?).* everything after the ?, if there is one

However, you should look into urllib.parse (Python 3) or urlparse (Python 2) if you're working with URLs.
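
A short sketch of applying this pattern, assuming Python 3 (the function name capture is hypothetical). It uses re.search rather than re.match, because re.match only tries position 0, where the lookbehind alternative can never succeed:

```python
import re

PATTERN = re.compile(r"(^[^?]*$|(?<=\?).*)")


def capture(text):
    """Return everything after the first '?', or the whole string if there is no '?'."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


url = "https://www.searchpage.com/searchcompany.aspx?companyId=41490234&page=0&leftlink=true"
print(capture(url))
print(capture("/company/Analytics/GetService"))
```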

answered Feb 19, 2015 at 22:26


1 Comment

Joran Beasley Over a year ago

yes some famous saying about regular expressions comes to mind here (+1)
