Suppose I have some URLs like the following:

URL
http://hostname.com/as/ck$st=fa+gw+hw+ek+ei/
http://hostname.com/wqs/ck$st=fasd+/
http://hostname.com/as/ck$st=fa+gq+hf+kg+is&sadfnlslkdfn&gl+jh+ke+oj+kp sfav

I want to find the first + symbol in the URL, move backward until I hit a special character such as /, ?, = or any other special character, and then capture from there forward until I reach a space, the end of the line, & or /.

The regex I wrote with the help of Stack Overflow is as follows:

re.search(r"[^\w\+ ]([\w\+ ]+\+[\w\+ ]+)(?:[^\w\+ ]|$)", x).group(1)

This works for the first row but does not match anything in the second row. In the third row, I also want to find every occurrence of the pattern, whereas the current regex finds only one.

My output should be:
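To make the failure concrete, here is a minimal sketch that runs the pattern above against the three rows. The helper name `first_match` is only for illustration; it guards against `re.search` returning `None`, which would otherwise make `.group(1)` raise an `AttributeError` on the second row.

```python
import re

# The pattern from the question, unchanged.
PATTERN = r"[^\w\+ ]([\w\+ ]+\+[\w\+ ]+)(?:[^\w\+ ]|$)"

def first_match(row):
    """Return group 1 of the first match, or None if nothing matches (illustrative helper)."""
    m = re.search(PATTERN, row)
    return m.group(1) if m else None

rows = [
    "http://hostname.com/as/ck$st=fa+gw+hw+ek+ei/",
    "http://hostname.com/wqs/ck$st=fasd+/",
    "http://hostname.com/as/ck$st=fa+gq+hf+kg+is&sadfnlslkdfn&gl+jh+ke+oj+kp sfav",
]

for row in rows:
    print(first_match(row))
```

The second row yields `None` because the pattern requires at least one character after the `+`, and the third row yields only the first group because `re.search` stops at the first match.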

parsed
fa+gw+hw+ek+ei
fasd
fa+gq+hf+kg+is gl+jh+ke+oj+kp

Can anybody help me modify the existing regex to suit these needs?

Thanks

3 Answers

2

I used regexr to come up with this (regexr link):

([\w\+]*\+[\w\+]*)(?:[^\w\+]|$)

Matches:

fa+gw+hw+ek+ei fasd+ fa+gq+hf+kg+is gl+jh+ke+oj+kp

EDIT: Instead of using re.search, try re.findall:

>>> import re
>>> s = "http://hostname.com/as/ck$st=fa+gq+hf+kg+is&sadfnlslkdfn&gl+jh+ke+oj+kp sfav"
>>> re.findall(r"([\w\+]+\+[\w\+]*)(?:[^\w\+]|$)", s)
['fa+gq+hf+kg+is', 'gl+jh+ke+oj+kp']
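As a rough sketch of how this could be turned into the output format asked for in the question, the snippet below applies the `[\w\+]*` variant of the pattern with `re.findall`, strips leading and trailing `+` signs, and joins multiple hits with a space. The helper name `parse_plus_groups` is made up for illustration, and stripping the `+` is an assumption based on the desired output showing `fasd` rather than `fasd+`.

```python
import re

# A run of word characters and '+' that contains at least one '+',
# followed by a delimiter or the end of the string.
PLUS_RUN = re.compile(r"([\w+]*\+[\w+]*)(?:[^\w+]|$)")

def parse_plus_groups(url):
    """Return every '+'-joined run in url, space-separated (illustrative helper)."""
    # Assumption: leading/trailing '+' signs are not wanted in the result.
    runs = (run.strip("+") for run in PLUS_RUN.findall(url))
    return " ".join(run for run in runs if run)

urls = [
    "http://hostname.com/as/ck$st=fa+gw+hw+ek+ei/",
    "http://hostname.com/wqs/ck$st=fasd+/",
    "http://hostname.com/as/ck$st=fa+gq+hf+kg+is&sadfnlslkdfn&gl+jh+ke+oj+kp sfav",
]

for url in urls:
    print(parse_plus_groups(url))
```

Because the first quantifier is `*` rather than `+`, a value with a leading plus such as `ck$st=+fasd` is also picked up and comes out as `fasd`.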

3 Comments

This is not working for the third row. It gives only fa+gq+hf+kg+is as output, but I want fa+gq+hf+kg+is gl+jh+ke+oj+kp. Can you help me with that?
@chisrtian This doesn't work with hostname.com/wqs/ck$st=+fasd. How can we make it work with this one as well?
I tweaked it slightly to change the first [\w\+]+ to [\w\+]*. Try that.
0

If you change [^\w\+ ]([\w\+ ]+\+[\w\+ ]+)(?:[^\w\+ ]|$) to [^\w\+ ]([\w\+ ]+\+[\w\+ ]*)(?:[^\w\+ ]|$) it will match the second URL as well.

It will include the trailing '+', which isn't included in your desired output but does seem to meet the criteria you had mentioned, so this may take some modifying if you don't want any trailing '+'s.
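If the trailing '+' is unwanted, one possible adjustment (a sketch, not the only way) is to strip it from the captured group after matching. The function name `first_group` is hypothetical, and this still returns only the first match because it uses `re.search`.

```python
import re

# The modified pattern from this answer: the part after the '+' may be empty.
PATTERN = r"[^\w\+ ]([\w\+ ]+\+[\w\+ ]*)(?:[^\w\+ ]|$)"

def first_group(url):
    """Return the first captured group without surrounding '+', or None (illustrative helper)."""
    m = re.search(PATTERN, url)
    # Assumption: a dangling '+' at either end should be dropped.
    return m.group(1).strip("+") if m else None

print(first_group("http://hostname.com/as/ck$st=fa+gw+hw+ek+ei/"))
print(first_group("http://hostname.com/wqs/ck$st=fasd+/"))
```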

1 Comment

This is not working for the third row. It gives only fa+gq+hf+kg+is as output, but I want fa+gq+hf+kg+is gl+jh+ke+oj+kp. Can you help me with that?
0

After trying unsuccessfully to use urlparse, it seems the best way to get the information you want is with regular expressions:

from urllib.parse import urlparse, parse_qs
import re

urls = [
    "http://hostname.com/as/ck$st=fa+gw+hw+ek+ei/",
    "http://hostname.com/wqs/ck$st=fasd+/",
    "http://hostname.com/as/ck$st=fa+gq+hf+kg+is&sadfnlslkdfn&gl+jh+ke+oj+kp sfav"
]

for myurl in urls:
    parsed = urlparse(myurl)

    # Dump each URL component to see where the '+' groups end up
    print('scheme  :', parsed.scheme)
    print('netloc  :', parsed.netloc)
    print('path    :', parsed.path)
    print('params  :', parsed.params)
    print('query   :', parsed.query)
    print('fragment:', parsed.fragment)
    print('username:', parsed.username)
    print('password:', parsed.password)
    print('hostname:', parsed.hostname, '(netloc in lower case)')
    print('port    :', parsed.port)

    # These URLs contain no '?', so the query is empty
    print(parse_qs(parsed.query))

    # The '+' groups live in the path, so the regex is applied there
    print(re.findall(r'([\w\+]+\+[\w\+]*)(?:[^\w\+]|$)', parsed.path))
    print('-' * 80)

1 Comment

This doesn't work with hostname.com/wqs/ck$st=+fasd. How can we make it work with this one as well? Can you please help with this?
