Python regular expressions matching within set

Question

While testing on http://gskinner.com/RegExr/ (an online regex tester), the regex [jpg|bmp] returns results when either jpg or bmp exists. However, when I run this regex in Python, it only returns j or b. How do I make the regex match the whole word "jpg" or "bmp" inside the set? This may have been asked before, but I was not sure how to phrase the question to find the answer. Thanks!

Here is the whole regex if it helps

"http://www\S*(?i)\\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|gif|giff)"

It's basically just to look for pictures in a URL.

MByD · Accepted Answer · 2011-08-15 10:49:18Z

5

Use (jpg|bmp) instead of square brackets.

Square brackets define a character class: they match a single character from the set inside the brackets.

Edit: you might want something like this: [^ ].*?(jpg|bmp) or [^ ].*?\.(jpg|bmp)
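
To make the difference concrete, here is a minimal sketch comparing the character class with the group. The sample string `name` is made up for illustration and is not from the question.

```python
import re

name = "photo.jpg"  # hypothetical sample input

# Character class: matches ONE character that is j, p, g, |, b or m.
single = re.search(r"[jpg|bmp]", name).group(0)

# Group with alternation: matches the whole word "jpg" or "bmp".
word = re.search(r"(jpg|bmp)", name).group(0)

print(single)
print(word)
```

The character class stops after the first character it finds from the set, while the group only matches when all three letters appear in sequence.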

edited Aug 15, 2011 at 10:49

answered Aug 15, 2011 at 10:43

MByD


4 Comments

Trent Over a year ago

Tried that, now it only returns the file extension, not the first part of the url

MByD Over a year ago

Then you should rephrase your question.

Trent Over a year ago

I tried your suggestion, still returns only jpg, forgets about preceding matched characters

MByD Over a year ago

Did you try wrapping it in parentheses?

stema · Accepted Answer · 2011-08-15 10:51:12Z

3

When you use [] you are creating a character class that contains all the characters between the brackets.

So you are not matching jpg or bmp; you are matching a single j, or a p, or a g, or a |, and so on.

You should add an end-of-string anchor ($) to your regex:

http://www\S*(?i)\\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|gif|giff)$
          ^      ^^

If you need double escaping, apply it everywhere in your pattern:

http://www\\S*(?i)\\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|gif|giff)$

The anchor ensures that the file extension is checked at the very end of the string.
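
As a rough sketch of how an anchored pattern could be written in Python: the version below uses a raw string so that no double escaping is needed, and passes `re.IGNORECASE` as a flag instead of placing `(?i)` in the middle of the pattern, since recent Python versions only accept that inline flag at the very start. The name `IMAGE_URL` and the two sample URLs are assumptions for illustration; the extension list is taken from the question with the duplicate `gif` removed.

```python
import re

# Raw string: backslashes reach the regex engine unchanged, so \S and \. need
# no double escaping. re.IGNORECASE stands in for the inline (?i).
IMAGE_URL = re.compile(
    r"http://www\S*\.(jpg|jpeg|jpe|bmp|png|gif|giff|img|jng)$",
    re.IGNORECASE,
)

# Hypothetical sample inputs
good = IMAGE_URL.match("http://www.example.com/pics/cat.JPG")
bad = IMAGE_URL.match("http://www.example.com/pics/cat.jpg.html")

print(good is not None)  # extension sits at the end of the string
print(bad is not None)   # ".jpg" is followed by ".html", so $ rejects it
```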

edited Aug 15, 2011 at 10:51

answered Aug 15, 2011 at 10:45

stema


Shep · Accepted Answer · 2012-04-20 19:02:44Z

If you are searching a list of URLs

urls = [ 'http://some.link.com/path/to/file.jpg',
         'http://some.link.com/path/to/another.png',
         'http://and.another.place.com/path/to/not-image.txt',
       ]

to find ones that match a given pattern you can use:

import re
for url in urls:
    # re.match needs the string to test as its second argument
    if re.match(r'http://.*(jpg|png|gif)$', url):
        print(url)

which will output

http://some.link.com/path/to/file.jpg
http://some.link.com/path/to/another.png

re.match() will test for a match at the start of the string and return a match object for the first two links, and None for the third.

If you want just the extension, you can use the following:

for url in urls:
    m = re.match(r'http://.*(jpg|png|gif)$', url)
    if m:  # skip URLs that do not match (m is None)
        print(m.groups())

which will print

('jpg',)
('png',)

You will get just the extensions because that's what was defined as a group.
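
The sketch below illustrates how groups affect what you get back, which may also explain why a pattern with parentheses can appear to "forget" the rest of the URL. It reuses one URL from the list above; the variable names are only for illustration.

```python
import re

url = "http://some.link.com/path/to/file.jpg"

m = re.match(r"http://.*(jpg|png|gif)$", url)
whole = m.group(0)  # the entire matched text
ext = m.group(1)    # only what the first pair of parentheses captured

# findall returns the group contents when the pattern has a capturing group...
captured = re.findall(r"http://.*(jpg|png|gif)$", url)
# ...and the whole match when the group is non-capturing, written (?:...).
full = re.findall(r"http://.*(?:jpg|png|gif)$", url)

print(whole, ext)
print(captured, full)
```

If you need the full URL from `re.findall`, either make the extension group non-capturing or wrap the whole pattern in an outer group.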

If you need to find the url in a long string of text (such as returned from wget), you need to use re.search() and enclose the part you are interested in with ( )'s. For example,

response = """dlkjkd dkjfadlfjkd fkdfl kadfjlkadfald ljkdskdfkl adfdf
    kjakldjflkhttp://some.url.com/path/to/file.jpgkaksdj fkdjakjflakdjfad;kadj af
    kdlfjd dkkf aldfkaklfakldfkja df"""

reg = re.search(r'(http:.*/(.*\.(jpg|png|gif)))', response)

print(reg.groups())

will print

('http://some.url.com/path/to/file.jpg', 'file.jpg', 'jpg')

Alternatively, you can use re.findall or re.finditer in place of re.search to get all of the URLs in the long response; re.search returns only the first one.
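
A minimal sketch of collecting every image URL from a block of text follows. The sample `text` and the pattern are assumptions for illustration: the pattern uses a non-greedy `\S*?` so that each match stops at the first image extension rather than running on to a later URL, and a non-capturing group so that `findall` returns whole URLs.

```python
import re

# Hypothetical text containing two image URLs
text = "see http://a.example.com/x/one.jpg and http://b.example.com/y/two.png for details"

# \S*? is non-greedy and cannot cross whitespace, so each match ends at the
# first ".jpg", ".png" or ".gif" it reaches.
pattern = re.compile(r"http://\S*?\.(?:jpg|png|gif)")

found = pattern.findall(text)    # list of matched strings
for m in pattern.finditer(text):  # match objects, one per URL
    print(m.group(0))
```

`re.finditer` is the better fit when you also need positions or groups for each match, since it yields match objects instead of plain strings.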
