-1

I have data as follows:

data['url']
http://hostname.com/aaa/uploads/2013/11/a-b-c-d.jpg https://www.aaa.com/
http://hostname.com/bbb/uploads/2013/11/e-f-g-h.gif https://www.aaa.com/
http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html 
http://hostname.com/ddd/uploads/2013/11/w-e-r-t.ico
http://hostname.com/ddd/uploads/2013/11/r-t-y-u.aspx https://www.aaa.com/
http://hostname.com/bbb/uploads/2013/11/t-r-w-q.jpeg https://www.aaa.com/

I want to find the file extensions such as .jpg, .gif, .png, .ico, .aspx, .html, .jpeg and parse backwards from each one until a "/" is found. I also want to check for several occurrences throughout the string. My output should be:

data['parsed']
a-b-c-d
e-f-g-h
e-f-g-h a-a-a-a
w-e-r-t
r-t-y-u
t-r-w-q

Instead of writing individual commands for each of the formats, is there a way to write everything in a single command?

Can anybody help me write these commands? I am new to regex, and any help would be appreciated.

3
  • Possible duplicate of Python: Get URL path sections Commented Aug 30, 2016 at 2:36
  • Must this be done with regex? urlparse (as noted in the possible duplicate) does the job splendidly. Commented Aug 30, 2016 at 2:46
  • @Jim Yes. I have several conditions like this, and the URLs are not structured enough to parse with urlparse. Commented Aug 30, 2016 at 2:52
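
For reference, the urlparse suggestion from the comments could look roughly like this in Python 3, where the module is urllib.parse. This is only a sketch: file_stems is a hypothetical helper, and it assumes that the URLs in each string are separated by whitespace and that any final path segment with an extension counts as a file.

```python
import posixpath
from urllib.parse import urlsplit


def file_stems(text):
    """Return the extension-less file names of all URLs in a whitespace-separated string.

    Hypothetical helper for illustration; not taken from the thread.
    """
    stems = []
    for token in text.split():
        # Path component of the URL, e.g. "/ccc/uploads/2013/11/e-f-g-h.png"
        path = urlsplit(token).path
        # Split the last path segment into (stem, extension)
        stem, ext = posixpath.splitext(posixpath.basename(path))
        if ext:  # skip URLs such as https://www.aaa.com/ that name no file
            stems.append(stem)
    return stems


row = ("http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png "
       "http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html ")
print(file_stems(row))
```

This accepts any extension rather than a fixed list, so it may be too permissive if the real data is as irregular as the asker describes.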

3 Answers

1

This builds a list of (name, extension) pairs:

import re
results = []
for link in data['url']:
    # findall returns one (name, extension) tuple per file found in the string,
    # and an empty list when nothing matches
    results.extend(re.findall(r'/(\w-\w-\w-\w)\.(\w{2,})\b', link))

2 Comments

It doesn't work for four-letter extensions such as jpeg, aspx, html.
Removed the extension character limit.
1

This pattern returns the file names. I have used just one of your URLs to demonstrate; for more, you could simply append the matches to a list of results:

import re
url = "http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html" 

# four single lowercase letters joined by hyphens, followed by a literal dot
p = r'((?:[a-z]-){3}[a-z])\.'
matches = re.findall(p, url)

>>> print('\n'.join(matches))
e-f-g-h
a-a-a-a

This assumes that the URLs all have the general form you provided, that is, file names made of four single lowercase letters separated by hyphens.
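
If the file names do not always follow the letter-hyphen-letter shape, one alternative is to anchor on the extensions listed in the question instead. The following is a minimal sketch under that assumption: EXTENSIONS, FILE_RE, and extract_names are illustrative names, and the pattern only recognizes lowercase extensions from that list.

```python
import re

# Extensions named in the question; extend the tuple as needed (assumption: lowercase only).
EXTENSIONS = ("jpg", "jpeg", "gif", "png", "ico", "aspx", "html")

# "/" + file name (no slashes or whitespace) + "." + one of the known extensions.
# The capturing group keeps only the name, so findall returns plain strings.
FILE_RE = re.compile(r"/([^/\s]+)\.(?:%s)\b" % "|".join(EXTENSIONS))


def extract_names(text):
    """Return every file name (without extension) found in text."""
    return FILE_RE.findall(text)


row = ("http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png "
       "http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html ")
print(extract_names(row))  # ['e-f-g-h', 'a-a-a-a']
```

Host names such as www.aaa.com are skipped because "com" is not in the extension list.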

0

You might try this:

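import re
data['parse'] = [re.findall(r'[^/\s]+\.[a-z]+(?=\s|$)', url) for url in data['url']]

That will pick out all of the file names with their extensions, as one list per row. The lookahead (?=\s|$) requires the extension to be followed by whitespace or the end of the string, which skips host names such as hostname.com. If you want to remove the extensions, you can process each list with a list comprehension and re.sub like so:

[[re.sub(r'\.[a-z]+$', '', name) for name in names] for names in data['parse']]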

Use the str.join method to create a string, as demonstrated in Totem's answer.
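
Putting those steps together, a rough end-to-end sketch might look like the following. The plain dict only stands in for the asker's data, whose real type is not stated, and name_with_ext is an illustrative name; the pattern assumes a file name is followed by whitespace or the end of the string.

```python
import re

# Two sample rows copied from the question; a dict of lists is an assumption
# about the shape of the data, used here only to keep the example self-contained.
data = {
    "url": [
        "http://hostname.com/ccc/uploads/2013/11/e-f-g-h.png "
        "http://hostname.com/ccc/uploads/2013/11/a-a-a-a.html ",
        "http://hostname.com/ddd/uploads/2013/11/w-e-r-t.ico",
    ]
}

# File name with extension: no slashes or whitespace, ending at whitespace or end of string.
name_with_ext = re.compile(r"[^/\s]+\.[a-z]+(?=\s|$)")

# For each row: find the file names, strip the extension, join with spaces.
data["parsed"] = [
    " ".join(re.sub(r"\.[a-z]+$", "", name) for name in name_with_ext.findall(url))
    for url in data["url"]
]

print(data["parsed"])  # ['e-f-g-h a-a-a-a', 'w-e-r-t']
```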
