regex help - python - extract all image url from css

Question

I am trying to extract all the image (.jpg, .png, .gif) URIs from CSS files.

Sample css

.blockpricecont{width:660px;height:75px;background:url('../images/postBack.jpg') 
repeat-x;/*background:url('../images/tabdata.jpg') repeat-x;*/border: 1px solid #B7B7B7;

regex used -

  images = re.compile("(?:\()(?:'|\")?(.*\.jpg('?))", flags=re.IGNORECASE)

The problem is that a few CSS rules contain commented-out code (/* ---- */), and those comments contain .jpg references. The output I am getting for the above regex is

output
 ["../images/postBack.jpg') repeat-x;/*background:url('../images/tabdata.jpg'"]

expected output:
 ["../images/postBack.jpg"]

I want my regex to stop at the first match of .jpg, but it's continuing until the end of the line.

Thanks in advance.

Joran Beasley · Accepted Answer · 2012-09-21 16:28:26Z

5

import re
print(re.findall(r'url\(([^)]+)\)', target_text))

I think that should work
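
As a rough sketch of how this could be applied to the sample from the question: the snippet below strips comments first, captures what sits inside url(...), trims the quotes, and keeps only image extensions. The helper name `image_urls` and the `IMAGE_EXTENSIONS` tuple are assumptions made for illustration, not part of the answer above.

```python
import re

# Sample from the question, with one live and one commented-out url().
css = """.blockpricecont{width:660px;height:75px;background:url('../images/postBack.jpg')
repeat-x;/*background:url('../images/tabdata.jpg') repeat-x;*/border: 1px solid #B7B7B7;"""

# Assumed list of extensions, taken from the question text.
IMAGE_EXTENSIONS = ('.jpg', '.png', '.gif')

def image_urls(css_text):
    # Drop /* ... */ comments first; non-greedy so each comment is removed on its own.
    without_comments = re.sub(r'/\*.*?\*/', '', css_text, flags=re.DOTALL)
    # Capture whatever sits between "url(" and the next ")".
    raw = re.findall(r'url\(([^)]+)\)', without_comments, flags=re.IGNORECASE)
    # Trim whitespace and optional quotes, then keep image extensions only.
    cleaned = [u.strip().strip('\'"') for u in raw]
    return [u for u in cleaned if u.lower().endswith(IMAGE_EXTENSIONS)]

print(image_urls(css))  # ['../images/postBack.jpg']
```

Filtering by extension after the match keeps the regex itself simple, and non-image url() values such as fonts are dropped in the last step.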

answered Sep 21, 2012 at 16:28


2 Comments

Joran Beasley Over a year ago

It's just matching anything inside url() and returning it. Since url() is mostly used for images, it should just return you a list of images.

dasdachs Over a year ago

A slight improvement on a great regex: url = re.compile(r'url\(["\']?(?P<url>[^"\']+)["\']?\)'). That way writing re.search(url, some_css).group('url') returns the url.
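
A minimal sketch of the named-group idea extended to every match, since re.search only returns the first one. The helper name `all_urls` is hypothetical, and excluding ")" from the capture is an assumption added here so that unquoted urls stop at the closing bracket.

```python
import re

# Variant of the named-group pattern from the comment above. Also excluding ")"
# from the capture is an assumption, so an unquoted url(...) cannot run on
# past its own closing bracket.
URL_RE = re.compile(r'url\(["\']?(?P<url>[^"\')]+)["\']?\)', re.IGNORECASE)

def all_urls(css_text):
    # finditer walks every match; re.search would stop at the first one.
    return [m.group('url') for m in URL_RE.finditer(css_text)]

sample = "a{background:url(one.png) no-repeat} b{background:url('two.gif')} c{color:rgb(1,2,3)}"
print(all_urls(sample))  # ['one.png', 'two.gif']
```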

georg · Accepted Answer · 2012-09-21 16:45:20Z

5

The simplest way would be to eliminate comments before matching:

css = re.sub(r'(?s)/\*.*?\*/', '', css)
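
One detail worth checking when stripping comments this way is greediness: with more than one comment in the stylesheet, a greedy `.*` removes everything from the first `/*` to the last `*/`, including live rules in between. A small sketch of the difference follows; the sample string and the helper name `strip_comments` are made up for illustration.

```python
import re

# Hypothetical stylesheet with two comments and a live rule between them.
css = "a{color:red}/* one */b{background:url(keep.png)}/* two */c{color:blue}"

def strip_comments(css_text):
    # (?s) lets "." cross newlines; ".*?" stops at the first closing "*/".
    return re.sub(r'(?s)/\*.*?\*/', '', css_text)

# Greedy version for comparison: spans from the first /* to the last */.
greedy = re.sub(r'(?s)/\*.*\*/', '', css)
lazy = strip_comments(css)

print(greedy)  # a{color:red}c{color:blue}
print(lazy)    # a{color:red}b{background:url(keep.png)}c{color:blue}
```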

However, I do agree with Matthew that using a dedicated parser would be better. Here's an example with tinycss:

import tinycss

def urls_from_css(css):
    parser = tinycss.make_parser()
    for r in parser.parse_stylesheet(css).rules:
        for d in r.declarations:
            for tok in d.value:
                if tok.type == 'URI':
                    yield tok.value

for url in urls_from_css(css):
    print(url)

edited Sep 21, 2012 at 16:45

answered Sep 21, 2012 at 16:33


zeffii · Accepted Answer · 2012-09-21 17:02:06Z

1

Maybe this way: first strip the comments with re.sub, then re.findall the goodies.

example_css = """.blockpricecont{width:660px;height:75px;background:url('../images/postBack.jpg') 
repeat-x;/*background:url('../images/tabdata.jpg') repeat-x;*/border: 1px solid #B7B7B7;"""


import re

css_comments_removed = re.sub(r'/\*.*?\*/', '', example_css, flags=re.DOTALL)

pattern = re.compile(r"(\'.*?\.[a-z]{3}\')")
matches = pattern.findall(css_comments_removed)
for i in matches:
    print(i)

prints

'../images/postBack.jpg'

edited Sep 21, 2012 at 17:02

answered Sep 21, 2012 at 16:46


Matthew Adams · Accepted Answer · 2012-09-21 16:26:40Z

0

This would probably be better suited to a css parser. I haven't used it, but I've seen this one recommended before.

answered Sep 21, 2012 at 16:26


3 Comments

Joran Beasley Over a year ago

probably technically ... but finding url is probably ok with regex

Matthew Adams Over a year ago

@JoranBeasley I agree, but I thought it was worth saying.

Krishna Over a year ago

Thanks, Matthew. I also have to parse .php pages, looking for URIs. I said CSS here just to keep it simple. Thanks anyway.
