I am trying to extract all image URIs (.jpg, .png, .gif) from CSS files.

Sample css

.blockpricecont{width:660px;height:75px;background:url('../images/postBack.jpg') 
repeat-x;/*background:url('../images/tabdata.jpg') repeat-x;*/border: 1px solid #B7B7B7;

regex used -

  images = re.compile("(?:\()(?:'|\")?(.*\.jpg('?))", flags=re.IGNORECASE)

The problem is that a few CSS rules contain commented-out code (/* ---- */), and these comments also reference .jpg files. The output I get for the above regex is:

output
 ["../images/postBack.jpg') repeat-x;/*background:url('../images/tabdata.jpg'"]

expected output:
 ["../images/postBack.jpg"]

I want my regex to stop at the first match of .jpg, but it continues to the end of the line.
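
The behaviour described above comes from the greedy `.*`, which consumes as much as it can before backtracking to the last `.jpg` on the line. A minimal sketch of the difference, assuming a simplified version of the pattern (the names `greedy` and `lazy` are illustrative only):

```python
import re

line = (".blockpricecont{background:url('../images/postBack.jpg') "
        "repeat-x;/*background:url('../images/tabdata.jpg') repeat-x;*/}")

# Greedy: .* runs on to the last ".jpg" on the line.
greedy = re.compile(r"\(['\"]?(.*\.jpg)", re.IGNORECASE)
# Lazy: .*? stops at the first ".jpg" after each "(".
lazy = re.compile(r"\(['\"]?(.*?\.jpg)", re.IGNORECASE)

greedy_matches = greedy.findall(line)
lazy_matches = lazy.findall(line)

print(greedy_matches)
print(lazy_matches)
```

Note that the lazy version still returns the URL inside the comment as a second match, so the comments have to be handled separately.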

Thanks in advance.

4 Answers

import re
print(re.findall(r'url\(([^)]+)\)', target_text))

I think that should work
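
If only image files are wanted, the `url(...)` matches can be filtered by extension afterwards. This is a rough sketch rather than a complete solution; `URL_RE`, `IMAGE_EXTENSIONS` and `image_urls` are hypothetical names, and the sample string is made up for illustration:

```python
import re

# Capture the contents of url(...), without optional quotes or padding.
URL_RE = re.compile(r"url\(\s*['\"]?([^'\")]+?)['\"]?\s*\)", re.IGNORECASE)
IMAGE_EXTENSIONS = (".jpg", ".png", ".gif")

def image_urls(css_text):
    """Return url(...) targets that end in a known image extension."""
    urls = URL_RE.findall(css_text)
    return [u for u in urls if u.lower().endswith(IMAGE_EXTENSIONS)]

sample = ("a{background:url('../images/postBack.jpg')} "
          "b{background:url(img/x.PNG)} "
          "@font-face{src:url(\"f.woff\")}")

print(image_urls(sample))
```

Like any pattern that looks only at `url(...)`, this also picks up URLs inside comments unless those are removed first.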

2 Comments

It just matches anything inside url() and returns it. Since url() is mostly used for images, it should return a list of images.
A slight improvement on a great regex: url = re.compile(r'url\(["\']?(?P<url>[^"\']+)["\']?\)'). That way writing re.search(url, some_css).group('url') returns the url.

The simplest way would be to eliminate comments before matching:

css = re.sub(r'(?s)/\*.*?\*/', '', css)
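
The quantifier inside the comment-stripping pattern matters as soon as a stylesheet holds more than one comment: a greedy `.*` removes everything from the first `/*` to the last `*/`, including real rules in between. A small sketch with a made-up stylesheet (the file names are placeholders):

```python
import re

css = """a{background:url('one.jpg');}
/* first
   comment url('old.jpg') */
b{background:url('two.png');}
/* second comment */
c{background:url('three.gif');}"""

# Greedy: deletes from the first "/*" through the last "*/".
greedy = re.sub(r"(?s)/\*.*\*/", "", css)
# Lazy: deletes each comment on its own.
lazy = re.sub(r"(?s)/\*.*?\*/", "", css)

urls_greedy = re.findall(r"url\('([^']+)'\)", greedy)
urls_lazy = re.findall(r"url\('([^']+)'\)", lazy)

print(urls_greedy)
print(urls_lazy)
```

The greedy version loses the rule for `b`, while the lazy version keeps all three live URLs. A regex still cannot tell a real comment from `/*` inside a quoted string, which is one reason a parser is the more robust choice.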

However, I do agree with Matthew that using a dedicated parser would be better. Here's an example with tinycss:

import tinycss

def urls_from_css(css):
    parser = tinycss.make_parser()
    for r in parser.parse_stylesheet(css).rules:
        for d in r.declarations:
            for tok in d.value:
                if tok.type == 'URI':
                    yield tok.value

for url in urls_from_css(css):
    print(url)


Maybe this way: first strip the comments with re.sub, then use re.findall to pick out the URLs.

example_css = """.blockpricecont{width:660px;height:75px;background:url('../images/postBack.jpg') 
repeat-x;/*background:url('../images/tabdata.jpg') repeat-x;*/border: 1px solid #B7B7B7;"""


import re

# re.DOTALL lets the pattern also remove comments that span several lines.
css_comments_removed = re.sub(r'/\*.*?\*/', '', example_css, flags=re.DOTALL)

pattern = re.compile(r"(\'.*?\.[a-z]{3}\')")
matches = pattern.findall(css_comments_removed)
for i in matches:
    print(i)

prints

'../images/postBack.jpg'


This would probably be better suited to a css parser. I haven't used it, but I've seen this one recommended before.

3 Comments

probably technically ... but finding url is probably ok with regex
@JoranBeasley I agree, but I thought it was worth saying.
Thanks Matthew. I also have to parse .php pages looking for URIs; I said CSS here just to keep it simple. Thanks anyway.
