6

While testing on http://gskinner.com/RegExr/ (an online regex tester), the regex [jpg|bmp] returns results when either jpg or bmp exists. However, when I run this regex in Python, it only returns j or b. How do I make the regex match the whole word "jpg" or "bmp"? This may have been asked before, but I was not sure how to phrase the question to find the answer. Thanks!

Here is the whole regex, if it helps:

"http://www\S*(?i)\\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|gif|giff)"

It's basically just meant to find picture links in a URL.

3 Answers

5

Use (jpg|bmp) instead of square brackets.

Square brackets define a character class: they match a single character from the set inside the brackets.

Edit: you might want something like this: [^ ].*?(jpg|bmp) or [^ ].*?\.(jpg|bmp)
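If Python then hands back only the extension, the capturing group is the likely cause: re.findall returns the group contents whenever the pattern contains a group. A minimal sketch of the difference, assuming a made-up sample string (the text and variable names below are illustrative, not from the question):

```python
import re

# Hypothetical sample text, not taken from the original question.
text = "see http://www.example.com/a.jpg and http://www.example.com/b.bmp"

# With a capturing group, findall returns only what the group captured.
captured = re.findall(r"http://www\S*\.(jpg|bmp)", text)

# With a non-capturing group (?:...), findall returns the whole match.
whole = re.findall(r"http://www\S*\.(?:jpg|bmp)", text)

# With search, group(0) is the whole match and group(1) is the first group.
m = re.search(r"http://www\S*\.(jpg|bmp)", text)
first_url, first_ext = m.group(0), m.group(1)

print(captured)
print(whole)
print(first_url, first_ext)
```

So either switch the group to (?:...) or read group(0) from the match object, depending on whether you still need the extension separately.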

4 Comments

Tried that, now it only returns the file extension, not the first part of the url
Then you should rephrase your question.
I tried your suggestion, still returns only jpg, forgets about preceding matched characters
Did you try wrapping it in parentheses?
3

When you use [], you are creating a character class that contains all the characters between the brackets.

So you are not matching jpg or bmp; you are matching a single character that is either a j or a p or a g or a | ...
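A quick way to see this for yourself is to run both forms over the same string. This is only a sketch; the sample string is an assumption chosen so that every character in the class shows up:

```python
import re

# Hypothetical sample string for illustration.
sample = "jpg bmp |"

# A character class matches ONE character from the set j, p, g, |, b, m.
single_chars = re.findall(r"[jpg|bmp]", sample)

# Alternation inside parentheses matches the whole words.
whole_words = re.findall(r"(jpg|bmp)", sample)

print(single_chars)
print(whole_words)
```

Note that the | inside the brackets is matched as a literal character, not treated as "or".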

You should also add an end-of-string anchor ($) to your regex:

http://www\S*(?i)\\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|gif|giff)$
          ^      ^^

If you need double escaping, apply it everywhere in your pattern:

http://www\\S*(?i)\\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|gif|giff)$

The $ anchor ensures that the file extension is checked at the very end of the string.
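Here is a rough sketch of the anchored pattern in Python. One assumption to flag: the case-insensitive flag is passed as re.IGNORECASE instead of the inline (?i), because Python 3.11 and later reject an inline global flag that is not at the start of the pattern. The helper name is_image_url is made up for this example.

```python
import re

# Same extension list as in the question; the flag is passed as an argument
# instead of the mid-pattern (?i), which newer Python versions reject.
IMAGE_URL = re.compile(
    r"http://www\S*\.(jpg|bmp|png|gif|img|jng|jpeg|jpe|gif|giff)$",
    re.IGNORECASE,
)

def is_image_url(url):
    """Hypothetical helper: True if the URL ends in one of the extensions."""
    return IMAGE_URL.search(url) is not None

print(is_image_url("http://www.example.com/pic.JPG"))
print(is_image_url("http://www.example.com/pic.jpg.html"))
```

Without the $, the second URL would also match, because ".jpg" appears in the middle of it.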

0

If you are searching a list of URLs

urls = [ 'http://some.link.com/path/to/file.jpg',
         'http://some.link.com/path/to/another.png',
         'http://and.another.place.com/path/to/not-image.txt',
       ]

to find ones that match a given pattern you can use:

import re
for url in urls:
    # re.match needs the string to test as its second argument
    if re.match(r'http://.*(jpg|png|gif)$', url):
        print(url)

which will output

http://some.link.com/path/to/file.jpg
http://some.link.com/path/to/another.png

re.match() will test for a match at the start of the string and return a match object for the first two links, and None for the third.
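To make the "start of the string" point concrete, here is a small sketch comparing re.match and re.search on the same URL with and without leading text. The padded string is an assumption added for illustration:

```python
import re

pattern = r'http://.*(jpg|png|gif)$'
url = 'http://some.link.com/path/to/file.jpg'
padded = 'see ' + url  # hypothetical string with text before the URL

at_start = re.match(pattern, url)            # match object: URL is at position 0
not_at_start = re.match(pattern, padded)     # None: string does not start with http://
found_anywhere = re.search(pattern, padded)  # match object: search scans the whole string

print(at_start is not None, not_at_start is None, found_anywhere.group(0))
```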

If you want just the extension, you can use the following:

for url in urls:
    m = re.match(r'http://.*(jpg|png|gif)$', url)
    if m:  # m is None for URLs that do not match
        print(m.groups())

which will print

('jpg',)
('png',)

You will get just the extensions because that's what was defined as a group.

If you need to find the URL in a long string of text (such as the output of wget), use re.search() and wrap each part you are interested in with parentheses. For example,

response = """dlkjkd dkjfadlfjkd fkdfl kadfjlkadfald ljkdskdfkl adfdf
    kjakldjflkhttp://some.url.com/path/to/file.jpgkaksdj fkdjakjflakdjfad;kadj af
    kdlfjd dkkf aldfkaklfakldfkja df"""

reg = re.search(r'(http:.*/(.*\.(jpg|png|gif)))', response)

print(reg.groups())

will print

('http://some.url.com/path/to/file.jpg', 'file.jpg', 'jpg')

Alternatively, you can use re.findall or re.finditer in place of re.search to get all of the URLs in the long response; re.search only returns the first one.
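A minimal sketch of collecting every link, assuming a made-up response with two image URLs in it. The pattern here is a swapped-in variant, not the one above: it uses a lazy \S+? so each match stops at the first image extension, and a non-capturing group so findall returns whole URLs.

```python
import re

# Hypothetical response text containing two image links.
response = ("junk http://some.url.com/path/to/file.jpg more junk "
            "http://other.url.com/pic.png end")

# \S+? is lazy, so each match ends at the first image extension it reaches.
image_url = re.compile(r'http://\S+?\.(?:jpg|png|gif)')

# findall returns the matched strings directly.
all_urls = image_url.findall(response)

# finditer yields match objects, which also carry positions.
for m in image_url.finditer(response):
    print(m.start(), m.group(0))
```

finditer is the better fit when you need positions or several groups per match; findall is enough when you only want the strings.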

1 Comment

You're missing the second argument to "match" everywhere.
