
I'm trying to get all the CSS files of an HTML page from a URL.

I know that getting the HTML code itself is easy - just use the PHP function file_get_contents.

The question is: can I easily search through the HTML at a URL and retrieve the files (or contents) of all the CSS files it links to?

Note - I want to build an engine that collects a lot of CSS files, which is why just reading the source by hand is not enough.

Thanks,

  • Are you trying to use PHP to retrieve a page, then parse the page to get a list of CSS files? If so, how does javascript factor into that? (you tagged your question with javascript) Commented Sep 11, 2013 at 17:55
  • You'll probably need to load the response HTML into a DOM parser and start looking for link elements of type text/css, extracting the URL from them, and making a new file_get_contents request for each of them. Beyond that, you'll also need to parse out embedded style tags and inline style attributes throughout the HTML. Commented Sep 11, 2013 at 17:56
  • Josh - I updated my question. Of course reading the source is easy, but I need to do it for thousands of websites. Commented Sep 11, 2013 at 18:00
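The DOM-parser approach suggested in the comments can be sketched with PHP's built-in DOMDocument, no third-party library required. The HTML string below is just a stand-in for whatever file_get_contents would return from a real URL:

```php
<?php
// Stand-in for the markup fetched with file_get_contents($url)
$html = '<html><head>
<link rel="stylesheet" href="/css/main.css">
<link rel="icon" href="/favicon.ico">
<style>body { margin: 0; }</style>
</head><body></body></html>';

$doc = new DOMDocument();
// Real-world markup is rarely valid; silence the parser warnings
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// External stylesheets: <link rel="stylesheet" href="...">
$cssUrls = array();
foreach ($doc->getElementsByTagName('link') as $link) {
    if (strtolower($link->getAttribute('rel')) === 'stylesheet') {
        $cssUrls[] = $link->getAttribute('href');
    }
}

// Embedded <style> blocks have to be collected separately
$inlineCss = array();
foreach ($doc->getElementsByTagName('style') as $style) {
    $inlineCss[] = $style->textContent;
}

print_r($cssUrls);  // only /css/main.css; the favicon link is skipped
```

Each URL in $cssUrls can then be fetched with another file_get_contents call. Inline style attributes on individual elements would still need a separate pass (e.g. an XPath query for //*[@style]).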

2 Answers


You could try using http://simplehtmldom.sourceforge.net/ for HTML parsing.

require_once 'SimpleHtmlDom/simple_html_dom.php';

$url = 'http://www.website-to-scan.com'; // include the scheme, or file_get_html will look for a local file
$website = file_get_html($url);

// You might need to tweak the selector based on the website you are scanning
// Example: some websites don't set the rel attribute
// others might use less instead of css
//
// Some other options:
// link[href] - Any link with a href attribute (might get favicons and other resources but should catch all the css files)
// link[href="*.css*"] - Might miss files that aren't .css extension but return valid css (e.g.: .less, .php, etc)
// link[type="text/css"] - Might miss stylesheets without this attribute set
foreach ($website->find('link[rel="stylesheet"]') as $stylesheet)
{
    $stylesheet_url = $stylesheet->href;

    // Do something with the URL
}
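For the "Do something with the URL" step: most stylesheet hrefs are relative, so they have to be resolved against the page URL before they can be fetched. A minimal sketch under that assumption (the resolve_css_url helper name is my own, and it does not normalize ../ segments):

```php
<?php
// Turn a possibly-relative stylesheet href into an absolute URL.
// Hypothetical helper - not part of simple_html_dom.
function resolve_css_url($base, $href)
{
    // Already absolute
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }
    // Protocol-relative: //cdn.example.com/style.css
    if (substr($href, 0, 2) === '//') {
        return 'http:' . $href;
    }
    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    if ($href[0] === '/') {
        return $root . $href;            // root-relative
    }
    $dir = isset($parts['path']) ? rtrim(dirname($parts['path']), '/') : '';
    return $root . $dir . '/' . $href;   // document-relative
}

$css_url = resolve_css_url('http://www.example.com/blog/index.html', 'css/site.css');
// $css_url is now http://www.example.com/blog/css/site.css
// $css_text = file_get_contents($css_url);  // fetch once the URL is absolute
```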



You need to parse the HTML tags looking for CSS files. You can do it, for example, with preg_match - looking for a matching regex.

Regex which would find such files might be like this:

\<link .+href="\..+css.+"\>
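For what it's worth, a runnable version of that idea uses preg_match_all with a somewhat tighter pattern. It is still fragile - it assumes the rel attribute comes before href, double quotes, and well-formed tags - which is exactly the kind of brittleness the comments below warn about:

```php
<?php
$html = '<link rel="stylesheet" href="/css/main.css">'
      . '<link rel="icon" href="/favicon.ico">';

// Capture the href of every stylesheet <link>. [^>]+ keeps the match
// inside a single tag, but the rel-before-href attribute order is assumed.
preg_match_all('#<link[^>]+rel="stylesheet"[^>]+href="([^"]+)"#i', $html, $m);

print_r($m[1]);  // array('/css/main.css')
```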

Comments

  • Using regex to parse HTML is a recipe for disaster. -1
  • It was a fast comment, I didn't think too much about it, and I said it's just an example of how you can do it. You are right that it's not the best idea, but for simple purposes it's just fine. IMO it's overkill to use simpleHtml if you just need to find something simple.
  • No, it is always a bad idea. HTML is not a regular language, so if you want consistent results (generally what a programmer is shooting for) then you should use the appropriate tools. I agree simpleHTML isn't necessary, since PHP has DomDocument without adding a third-party library. However, I don't agree with your sentiment that using bad practices for "simple purposes" is okay. If you want reliable code, you should do it the right way every time.
  • Well, I agree with you 100%. My answer here is then wrong, but what I meant by simple purposes is stuff like parsing only one site, whose structure you know and which you know doesn't change. For random pages... it's a bad idea, true.
  • I do not agree @Chris Baker. It is still a well-defined language, and matching a CSS include is quite simple and should be favoured over using a DOM parser. Even a DOM parser can be wrong when the HTML is not valid, and then a regex mostly performs better. Of course I would invest some time to make the regex a little more fault tolerant, but I would go with that solution.
