Regex to parse out a part of URL using python

Question

I have data as follows:

data['url']
http://hostname.com/aaa/uploads/2013/11/a-b-c-d.jpg https://www.aaa.com/
http://hostname.com/bbb/uploads/2013/11/e-f-g-h.gif https://www.aaa.com/
http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html 
http://hostname.com/ddd/uploads/2013/11/w-e-r-t.ico
http://hostname.com/ddd/uploads/2013/11/r-t-y-u.aspx https://www.aaa.com/
http://hostname.com/bbb/uploads/2013/11/t-r-w-q.jpeg https://www.aaa.com/

I want to find file formats such as .jpg, .gif, .png, .ico, .aspx, .html, .jpeg and extract the text before each one, going backwards until a "/" is found. I also want to catch multiple occurrences within the same string. My output should be:

data['parsed']
a-b-c-d
e-f-g-h
e-f-g-h a-a-a-a
w-e-r-t
r-t-y-u
t-r-w-q

Instead of writing an individual command for each format, is there a way to handle everything with a single command?

Can anybody help me write these commands? I am new to regex and any help would be appreciated.

Must this be done with regex? urlparse (as noted in the possible duplicate) does the job splendidly.
– Dimitris Fasarakis Hilliard, Commented Aug 30, 2016 at 2:46
@Jim Yes. I have several conditions like this and the URL is not structured enough to parse with urlparse.
– haimen, Commented Aug 30, 2016 at 2:52
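
For readers whose URLs are regular enough, the urlparse route mentioned in the comment above could look roughly like the sketch below. It is only an illustration: the helper name `file_stems` and the `EXTENSIONS` set are made up for this example, and it assumes the URLs in a cell are separated by whitespace.

```python
import posixpath
from urllib.parse import urlparse

# Extensions listed in the question (hypothetical constant name).
EXTENSIONS = {".jpg", ".gif", ".png", ".ico", ".aspx", ".html", ".jpeg"}

def file_stems(cell):
    """Return the file names, without extension, of every URL in a whitespace-separated cell."""
    stems = []
    for url in cell.split():
        path = urlparse(url).path                    # e.g. "/ccc/uploads/2013/11/e-f-g-h.png"
        stem, ext = posixpath.splitext(posixpath.basename(path))
        if ext.lower() in EXTENSIONS:                # skips URLs such as "https://www.aaa.com/"
            stems.append(stem)
    return stems

row = "http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html "
print(" ".join(file_stems(row)))  # e-f-g-h a-a-a-a
```

This avoids regex entirely, but it relies on each URL being a clean, whitespace-delimited token, which the asker says is not always the case.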

Jules Gagnon-Marchand · Accepted Answer · 2016-08-30 03:00:01Z

1

This builds a list of (name, extension) pairs:

import re

results = []
for link in data['url']:
    # findall returns every (name, extension) pair in the row and an empty
    # list when nothing matches, so rows with several URLs are covered
    results.extend(re.findall(r'/(\w-\w-\w-\w)\.(\w{2,})\b', link))
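
If the file names are not always of the form x-x-x-x, one possible variation is to capture everything after the last "/" and spell out the wanted extensions in a single pattern, which is what the question asks for ("everything under a single command"). This is a sketch under that assumption; `EXTENSIONS`, `PATTERN` and `name_extension_pairs` are names invented for the example.

```python
import re

# Extensions listed in the question (hypothetical constant name).
EXTENSIONS = ["jpg", "gif", "png", "ico", "aspx", "html", "jpeg"]

# One pattern for all formats: group 1 is the text after the last "/",
# group 2 is the extension. \b stops "jpg" from matching a prefix of a longer word.
PATTERN = re.compile(
    r"/([^/\s]+)\.(" + "|".join(map(re.escape, EXTENSIONS)) + r")\b",
    re.IGNORECASE,
)

def name_extension_pairs(text):
    """Return every (file name, extension) pair found in text; an empty list if none."""
    return PATTERN.findall(text)

sample = "http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html"
print(name_extension_pairs(sample))  # [('e-f-g-h', 'png'), ('a-a-a-a', 'html')]
```

Host names such as hostname.com are skipped because "com" is not in the extension list.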

edited Aug 30, 2016 at 3:00

answered Aug 30, 2016 at 2:41

Jules Gagnon-Marchand

2 Comments

haimen Over a year ago

It doesn't work for 4-letter extensions such as jpeg, aspx, html.

Jules Gagnon-Marchand Over a year ago

Removed the extension character limit.

Totem · Accepted Answer · 2016-08-30 03:00:26Z

1

This pattern returns the file names. I have used just one of your URLs to demonstrate; for more, you could simply append the matches to a list of results:

import re
url = "http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html" 

# the dot is escaped so it only matches the literal "." before the extension
p = r'((?:[a-z]-){3}[a-z])\.'
matches = re.findall(p, url)

>>> print('\n'.join(matches))
e-f-g-h
a-a-a-a

This assumes the URLs all have the general form you provided.

edited Aug 30, 2016 at 3:00

answered Aug 30, 2016 at 2:55

Totem

Chad Davis · Accepted Answer · 2016-08-30 03:10:13Z

0

You might try this:

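data['parse'] = [re.findall(r'[^/\s]+\.[a-z]+(?=\s|$)', url) for url in data['url']]

That will pick out all of the file names with their extensions, one list per row. The lookahead (?=\s|$) requires whitespace or the end of the string after the extension, so host names such as www.aaa.com (followed by "/") are skipped and a file name at the very end of a row is still found. If you want to remove the extensions, you can process each list with a list comprehension and re.sub like so:

[[re.sub(r'\.[a-z]+$', '', name) for name in names] for names in data['parse']]

Use the .join function to create a string, as demonstrated in Totem's answer.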
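
Putting the two steps and the join together, a minimal end-to-end sketch might look like this. A plain dict of lists stands in for `data` here (an assumption, since the question does not say what `data` is), and `FILE_RE` and `EXT_RE` are names made up for the example.

```python
import re

# Sample rows copied from the question; the dict is a stand-in for `data` (assumption).
data = {
    "url": [
        "http://hostname.com/aaa/uploads/2013/11/a-b-c-d.jpg https://www.aaa.com/",
        "http://hostname.com/bbb/uploads/2013/11/e-f-g-h.gif https://www.aaa.com/",
        "http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html ",
        "http://hostname.com/ddd/uploads/2013/11/w-e-r-t.ico",
        "http://hostname.com/ddd/uploads/2013/11/r-t-y-u.aspx https://www.aaa.com/",
        "http://hostname.com/bbb/uploads/2013/11/t-r-w-q.jpeg https://www.aaa.com/",
    ]
}

# File name with extension: no slashes or whitespace, followed by whitespace or end of string.
FILE_RE = re.compile(r"[^/\s]+\.[a-z]+(?=\s|$)")
# Trailing extension to strip from each matched file name.
EXT_RE = re.compile(r"\.[a-z]+$")

# One space-joined string of bare file names per row.
data["parsed"] = [
    " ".join(EXT_RE.sub("", name) for name in FILE_RE.findall(url))
    for url in data["url"]
]
print("\n".join(data["parsed"]))
```

On the six sample rows this prints the output requested in the question, with the third row giving "e-f-g-h a-a-a-a".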

answered Aug 30, 2016 at 3:10

Chad Davis
