Use Regex to parse out some part of URL using python

Question

Suppose I have some URLs like the following:

URL
http://hostname.com/as/ck$st=fa+gw+hw+ek+ei/
http://hostname.com/wqs/ck$st=fasd+/
http://hostname.com/as/ck$st=fa+gq+hf+kg+is&sadfnlslkdfn&gl+jh+ke+oj+kp sfav

I want to find the first + symbol in the URL, move backward until I reach a special character such as /, ?, or = (or any other special character), and then capture from there forward until I reach a space, the end of the line, &, or /.

The regex I wrote with help from Stack Overflow is as follows:

re.search(r"[^\w\+ ]([\w\+ ]+\+[\w\+ ]+)(?:[^\w\+ ]|$)", x).group(1)

This works for the first row but does not parse anything from the second row. Also, in the third row I want to find every occurrence of this pattern; the current regex finds only one.

My output should be:

parsed
fa+gw+hw+ek+ei
fasd
fa+gq+hf+kg+is gl+jh+ke+oj+kp

Can anybody help me modify the existing regex to suit these needs?

Thanks

Christian Ternus · Accepted Answer · 2016-08-25 00:49:42Z

2

I used regexr to come up with this (regexr link):

([\w\+]*\+[\w\+]*)(?:[^\w\+]|$)

Matches:

fa+gw+hw+ek+ei fasd+ fa+gq+hf+kg+is gl+jh+ke+oj+kp

EDIT: Instead of re.search, try re.findall:

>>> import re
>>> s = "http://hostname.com/as/ck$st=fa+gq+hf+kg+is&sadfnlslkdfn&gl+jh+ke+oj+kp sfav"
>>> re.findall(r"([\w\+]+\+[\w\+]*)(?:[^\w\+]|$)", s)
['fa+gq+hf+kg+is', 'gl+jh+ke+oj+kp']
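To see how re.findall behaves on every row from the question, here is a minimal sketch. It assumes the `*` variant of the pattern shown earlier in this answer; the names `PLUS_GROUP` and `find_plus_groups` are made up for illustration and are not part of the original answer.

```python
import re

# Pattern from this answer (the variant whose first quantifier is '*',
# so a group may also start with '+').
PLUS_GROUP = re.compile(r"([\w\+]*\+[\w\+]*)(?:[^\w\+]|$)")


def find_plus_groups(url):
    """Return every run of word characters and '+' that contains at least one '+'."""
    return PLUS_GROUP.findall(url)


rows = [
    "http://hostname.com/as/ck$st=fa+gw+hw+ek+ei/",
    "http://hostname.com/wqs/ck$st=fasd+/",
    "http://hostname.com/as/ck$st=fa+gq+hf+kg+is&sadfnlslkdfn&gl+jh+ke+oj+kp sfav",
    "hostname.com/wqs/ck$st=+fasd",
]

for row in rows:
    # Join multiple groups with a space, as in the question's desired output.
    print(" ".join(find_plus_groups(row)))
```

Note that the second and fourth rows come back as `fasd+` and `+fasd`: the pattern keeps a leading or trailing `+`, which differs from the `fasd` the question asks for.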

edited Aug 25, 2016 at 0:49

answered Aug 24, 2016 at 23:28

Christian Ternus


3 Comments

Observer Over a year ago

This does not work for the third row. It gives only fa+gq+hf+kg+is as output. I want fa+gq+hf+kg+is gl+jh+ke+oj+kp as output. Can you help me do that?

Observer Over a year ago

@Christian this doesn't work with hostname.com/wqs/ck$st=+fasd. How can we make it work with this one as well?

Christian Ternus Over a year ago

I tweaked it slightly to change the first [\w\+]+ to [\w\+]*. Try that.

John · Accepted Answer · 2016-08-24 23:23:24Z

0

If you change [^\w\+ ]([\w\+ ]+\+[\w\+ ]+)(?:[^\w\+ ]|$) to [^\w\+ ]([\w\+ ]+\+[\w\+ ]*)(?:[^\w\+ ]|$) it will match the second URL as well.

It will include the trailing '+', which isn't included in your desired output but does seem to meet the criteria you had mentioned, so this may take some modifying if you don't want any trailing '+'s.
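If the trailing '+' is unwanted, one possible approach (an assumption, not something this answer prescribes) is to strip it after matching rather than complicating the regex. The sketch below reuses the modified pattern above; `PATTERN` and `first_plus_group` are hypothetical names.

```python
import re

# The modified pattern from this answer, unchanged.
PATTERN = re.compile(r"[^\w\+ ]([\w\+ ]+\+[\w\+ ]*)(?:[^\w\+ ]|$)")


def first_plus_group(url):
    """Return the first '+'-containing group with edge '+' signs removed, or None."""
    match = PATTERN.search(url)
    if match is None:
        return None
    # str.strip("+") drops '+' characters from both ends only;
    # the '+' separators inside the group are kept.
    return match.group(1).strip("+")


print(first_plus_group("http://hostname.com/as/ck$st=fa+gw+hw+ek+ei/"))
print(first_plus_group("http://hostname.com/wqs/ck$st=fasd+/"))
```

Because this uses re.search, it still returns only the first group in a row.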

answered Aug 24, 2016 at 23:23

John


1 Comment

Observer Over a year ago

This does not work for the third row. It gives only fa+gq+hf+kg+is as output. I want fa+gq+hf+kg+is gl+jh+ke+oj+kp as output. Can you help me do that?

BPL · Accepted Answer · 2016-08-24 23:37:29Z

0

After trying unsuccessfully to use urlparse, it seems the best way to get the info you want is with regular expressions:

# Python 3: the old urlparse module now lives in urllib.parse
from urllib.parse import urlparse, parse_qs
import re

urls = [
    "http://hostname.com/as/ck$st=fa+gw+hw+ek+ei/",
    "http://hostname.com/wqs/ck$st=fasd+/",
    "http://hostname.com/as/ck$st=fa+gq+hf+kg+is&sadfnlslkdfn&gl+jh+ke+oj+kp sfav"
]

for myurl in urls:
    parsed = urlparse(myurl)

    print('scheme  :', parsed.scheme)
    print('netloc  :', parsed.netloc)
    print('path    :', parsed.path)
    print('params  :', parsed.params)
    print('query   :', parsed.query)
    print('fragment:', parsed.fragment)
    print('username:', parsed.username)
    print('password:', parsed.password)
    print('hostname:', parsed.hostname, '(netloc in lower case)')
    print('port    :', parsed.port)

    print(parse_qs(parsed.query))

    # None of these URLs contains '?', so the '+' groups end up in parsed.path
    print(re.findall(r'([\w\+]+\+[\w\+]*)(?:[^\w\+]|$)', parsed.path))
    print('-' * 80)

answered Aug 24, 2016 at 23:37

BPL


1 Comment

Observer Over a year ago

This doesn't work with hostname.com/wqs/ck$st=+fasd. How can we make it work with this one as well? Can you please help with this?
