import re, urllib

def get_files(page):
    a = urllib.urlopen(page)
    b = a.read()
    c = re.findall("([a-zA-Z0-9]+\.{1}(jpg|bmp|docx|gif))",b)
    return c 
def main():
    print get_files("http://www.soc.napier.ac.uk/~40001507/CSN08115/cw_webpage/index.html")

if __name__ == "__main__":
    main()

When I run this code, the regex returns tuples rather than plain file names, so the output looks like this:

[('clown.gif', 'gif'), ('sleeper.jpg', 'jpg'), ('StarWarsReview.docx', 'docx'), ('wargames.jpg', 'jpg'), ('nothingtoseehere.docx', 'docx'), ('starwars.jpg', 'jpg'), ('logo.jpg', 'jpg'), ('certified.jpg', 'jpg'), ('clown.gif', 'gif'), ('essays.gif', 'gif'), ('big.jpg', 'jpg'), ('Doc100.docx', 'docx'), ('FavRomComs.docx', 'docx'), ('python.bmp', 'bmp'), ('dingbat.jpg', 'jpg')]

I don't want each result to be a tuple like ('clown.gif', 'gif'). I want a flat list of file names, like ['clown.gif', 'sleeper.jpg'] and so on.

Is there any way to do this and get rid of the tuples?

2 Answers

1

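You just need to turn the inner group (the one around the extensions) into a non-capturing group, (?:...). When a pattern has more than one capturing group, re.findall returns a tuple per match; with only one, it returns plain strings.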

def get_files(page):
    a = urllib.urlopen(page)
    b = a.read()
    # (?:...) groups the extensions without capturing them
    c = re.findall("([a-zA-Z0-9]+\.{1}(?:jpg|bmp|docx|gif))", b)
    return c
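
To check the difference without fetching the page, here is a minimal sketch that runs both patterns against a made-up HTML snippet. The sample string and the variable names are assumptions for illustration, not the real page content.

```python
import re

# Hypothetical stand-in for the downloaded page content.
sample = '<img src="clown.gif"> <a href="Doc100.docx">doc</a> <img src="sleeper.jpg">'

# Two capturing groups: findall returns one tuple per match.
with_tuples = re.findall(r"([a-zA-Z0-9]+\.(jpg|bmp|docx|gif))", sample)

# Inner group made non-capturing: findall returns plain strings.
names_only = re.findall(r"([a-zA-Z0-9]+\.(?:jpg|bmp|docx|gif))", sample)

print(with_tuples)  # [('clown.gif', 'gif'), ('Doc100.docx', 'docx'), ('sleeper.jpg', 'jpg')]
print(names_only)   # ['clown.gif', 'Doc100.docx', 'sleeper.jpg']
```

The only change between the two patterns is the ?: at the start of the inner group.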

0

You are capturing the extension twice: once as part of the whole file name and once on its own. Try the regex below; ?: marks a group as non-capturing.

re.findall("([a-zA-Z0-9]+\.{1}(?:jpg|bmp|docx|gif))", b)

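I also simplified your regex as shown below. The {1} is redundant, and \w and \d replace the explicit letter and digit ranges. Note that \w already includes digits, and it also matches the underscore, so this version accepts slightly more than the original.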

re.findall("([\w\d]+\.(?:jpg|bmp|docx|gif))", b)
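
As a rough illustration of how the two character classes differ, the sketch below compares them on a made-up string (the sample text and variable names are assumptions). It also drops the outer parentheses, since re.findall returns the whole match when the pattern has no capturing groups.

```python
import re

# Hypothetical input containing a file name with an underscore.
sample = "my_photo.jpg logo.jpg notes.txt"

# Original character class: letters and digits only, so the match
# can only start after the underscore.
strict = re.findall(r"[a-zA-Z0-9]+\.(?:jpg|bmp|docx|gif)", sample)

# \w also matches "_", so the full name is picked up.
loose = re.findall(r"\w+\.(?:jpg|bmp|docx|gif)", sample)

print(strict)  # ['photo.jpg', 'logo.jpg']
print(loose)   # ['my_photo.jpg', 'logo.jpg']
```

Which one is preferable depends on whether file names on the page can contain underscores.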

1 Comment

I see, you removed the redundant parts, and it works without them. Thank you.
