regex issues file extension Python 2.7

Question

import re, urllib

def get_files(page):
    a = urllib.urlopen(page)
    b = a.read()
    c = re.findall("([a-zA-Z0-9]+\.{1}(jpg|bmp|docx|gif))",b)
    return c 
def main():
    print get_files("http://www.soc.napier.ac.uk/~40001507/CSN08115/cw_webpage/index.html")

if __name__ == "__main__":
    main()

When I run this code, the regex gives me output like this:

[('clown.gif', 'gif'), ('sleeper.jpg', 'jpg'), ('StarWarsReview.docx', 'docx'), ('wargames.jpg', 'jpg'), ('nothingtoseehere.docx', 'docx'), ('starwars.jpg', 'jpg'), ('logo.jpg', 'jpg'), ('certified.jpg', 'jpg'), ('clown.gif', 'gif'), ('essays.gif', 'gif'), ('big.jpg', 'jpg'), ('Doc100.docx', 'docx'), ('FavRomComs.docx', 'docx'), ('python.bmp', 'bmp'), ('dingbat.jpg', 'jpg')]

I don't want each result to be a tuple like ('clown.gif', 'gif'). I want a plain list such as ['clown.gif', 'sleeper.jpg'] and so on.

Is there any way to do this and get rid of the tuples?

Community · Accepted Answer · 2017-05-23 12:13:43Z

1

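You just need to turn the inner group (the one listing the extensions) into a non-capturing group with (?:...). With only one capturing group left, re.findall returns a list of strings instead of a list of tuples.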

def get_files(page):
    a = urllib.urlopen(page)
    b = a.read()
    # (?:...) groups the extensions without capturing them
    c = re.findall("([a-zA-Z0-9]+\.{1}(?:jpg|bmp|docx|gif))", b)
    return c
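
To make the difference concrete, here is a minimal sketch that runs both patterns against a small string instead of the live page. The `html` sample and its file names are made up for illustration and are not the real page content; the patterns are the ones from the question and this answer, written as raw strings.

```python
import re

# Hypothetical sample standing in for the downloaded page source.
html = '<img src="clown.gif"> <a href="Doc100.docx">essay</a> <img src="logo.jpg">'

# Two capturing groups: findall returns one tuple per match.
with_groups = re.findall(r"([a-zA-Z0-9]+\.(jpg|bmp|docx|gif))", html)

# Inner group made non-capturing with (?:...): findall returns plain strings.
without_inner = re.findall(r"([a-zA-Z0-9]+\.(?:jpg|bmp|docx|gif))", html)

print(with_groups)    # list of (filename, extension) tuples
print(without_inner)  # list of filenames only
```

If the extension is also needed elsewhere, another option is to keep both groups and take the first element of each tuple, for example `[m[0] for m in with_groups]`.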

edited May 23, 2017 at 12:13

answered Nov 22, 2016 at 3:56

metatoaster

Skycc · Accepted Answer · 2016-11-22 04:05:25Z

0

You are capturing the extension twice: once as part of the outer group and again in the inner group. Try the regex below; ?: makes the inner group non-capturing.

re.findall("([a-zA-Z0-9]+\.{1}(?:jpg|bmp|docx|gif))", b)

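Your regex can also be simplified as follows. The {1} is redundant, and \w and \d can replace the explicit letter and digit ranges. Note that \w already matches digits, so \d adds nothing inside the class, and \w also matches the underscore, so this pattern is slightly broader than the original.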

re.findall("([\w\d]+\.(?:jpg|bmp|docx|gif))", b)
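
As a rough check of how the simplified pattern differs from the original, the sketch below compares them on a made-up string. The `text` sample and its file names are assumptions for illustration only; the point is that `\w` accepts underscores and digits, so `[\w\d]` behaves the same as `\w` alone.

```python
import re

# Hypothetical sample text; the file names are invented for illustration.
text = "my_notes.docx Doc100.docx logo.jpg"

# Original character class: letters and digits only, no underscore.
original = re.findall(r"([a-zA-Z0-9]+\.(?:jpg|bmp|docx|gif))", text)

# Simplified class from this answer: \w also matches "_" and digits.
simplified = re.findall(r"([\w\d]+\.(?:jpg|bmp|docx|gif))", text)

# \w alone gives the same result, since \d is already covered by \w.
word_only = re.findall(r"(\w+\.(?:jpg|bmp|docx|gif))", text)

print(original)
print(simplified)
print(word_only)
```

With this input the original pattern picks up only the part of `my_notes.docx` after the underscore, while the `\w` versions return the whole name. Whether that broader match is wanted depends on the file names expected on the page.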

answered Nov 22, 2016 at 4:05

Skycc

1 Comment

ibr2 Over a year ago

I see, you removed the redundant parts and it works without them. Thank you.
