This is my first post and I'm sorry if I'm doing it wrong but here we go:

I've been working on a project that should scrape values from a website. The values are stored in a JavaScript array. I'm using PHP Simple HTML DOM Parser, and it works with normal scripts but not with the ones wrapped in CDATA blocks. Therefore, I'm looking for a way to scrape data from within a CDATA block. Unfortunately, all the help I could find was for XML files, and I'm scraping from an HTML file.

The JavaScript I'm trying to scrape is as follows:

<script type="text/javascript">
//<![CDATA[
var data = [{"value":8.41,"color":"1C5A0D","text":"17/11"},{"value":9.86,"color":"1C5A0D","text":"18/11"},{"value":7.72,"color":"1C5A0D","text":"19/11"},{"value":9.42,"color":"1C5A0D","text":"20/11"}];
//]]>
</script>

What I need to scrape is the "value" field of each object in the data array.

Update: the problem was that I called str_replace() on the parsed DOM object instead of on the raw HTML string. The following code works perfectly :-)

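include('simple_html_dom.php');

// Download the raw HTML as a string so the CDATA markers can be removed before parsing.
$lines = file_get_contents('http://www.virtualmanager.com/players/7793477-danijel-pavliuk/training');

// Strip the commented-out CDATA markers from the string, not from a DOM object.
$lines = str_replace("//<![CDATA[","",$lines);
$lines = str_replace("//]]>","",$lines);

// Parse the cleaned string with Simple HTML DOM.
$html = str_get_html($lines);

// Print the contents of every <script> element.
foreach($html->find('script') as $element) {
    echo $element->innertext;
}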
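
Once the script text is available, the "value" fields still have to be pulled out of it. The following is only a minimal sketch, assuming the array is always assigned as `var data = [...];` and that the literal is valid JSON, as it is in the sample above. The names `$script`, `$rows` and `$values` are illustrative, and the sample string stands in for `$element->innertext`:

```php
<?php
// Sample script body; in the real scraper this would be $element->innertext.
$script = 'var data = [{"value":8.41,"color":"1C5A0D","text":"17/11"},'
        . '{"value":9.86,"color":"1C5A0D","text":"18/11"},'
        . '{"value":7.72,"color":"1C5A0D","text":"19/11"},'
        . '{"value":9.42,"color":"1C5A0D","text":"20/11"}];';

$values = [];

// Capture the array literal between "var data =" and the first "];".
if (preg_match('/var\s+data\s*=\s*(\[.*?\]);/s', $script, $match)) {
    // Assumption: the literal is valid JSON, so json_decode() can read it.
    $rows = json_decode($match[1], true);
    if (is_array($rows)) {
        // Keep only the "value" field of each entry.
        $values = array_column($rows, 'value');
    }
}

print_r($values);
```

If the page ever embeds a literal that is not strict JSON (unquoted keys, single quotes), json_decode() returns null and `$values` stays empty, so a different extraction approach would be needed in that case.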

I will provide you with more information if needed.

  • Buffer the HTML text before passing it to the parser, and search-and-replace to remove the //<![CDATA[ and //]]> constructs. They're completely pointless and have been for years. Commented Mar 16, 2013 at 12:52
  • Also: you have weird spaces in the example URL. Is that a typo? The problem could simply be that you can't load the page at all. That is, does $html->find('script') even find anything? Commented Mar 16, 2013 at 12:53
  • That was a typo and I have fixed it now. I have tried removing the CDATA, but then I get this error: "Fatal error: Call to a member function find() on a non-object in..." I have updated the post with what I'm doing now. Commented Mar 16, 2013 at 13:06
  • Why are you trying to call str_replace() on an HTML DOM object? What I meant is: download the HTML into a string (using file_get_contents() or cURL), then search-and-replace that string, and only then parse that string into HTML, using str_get_html() instead of file_get_html(). Commented Mar 16, 2013 at 13:08
  • Oh, my bad. It seems to work now :-) Commented Mar 16, 2013 at 13:22

1 Answer

A decent HTML parser shouldn't require JavaScript to be wrapped in a CDATA block. If the CDATA markers are throwing the parser off, just remove them from the HTML before parsing, with something like this:

  1. Download the HTML file into a string, using file_get_contents(), or cURL if your host has disabled URL support in that function (the allow_url_fopen setting).
  2. Get rid of the //<![CDATA[ and //]]> bits using str_replace().
  3. Parse the HTML from the cleaned string using Simple HTML DOM's str_get_html().
  4. Process the DOM object as before.
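
Steps 1 and 2 could be sketched roughly as follows. This is only an illustration under stated assumptions: `fetch_html()` and `strip_cdata_markers()` are hypothetical helper names, not part of Simple HTML DOM, and the cURL branch assumes the cURL extension is installed on the host.

```php
<?php
// Hypothetical helper for step 1: returns the page body as a string, or null on failure.
function fetch_html($url)
{
    // file_get_contents() can only open URLs when allow_url_fopen is enabled.
    if (filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        $body = @file_get_contents($url);
        return $body === false ? null : $body;
    }

    // Otherwise fall back to the cURL extension, if it is available.
    if (!function_exists('curl_init')) {
        return null;
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Hypothetical helper for step 2: remove the commented-out CDATA markers.
function strip_cdata_markers($html)
{
    return str_replace(array('//<![CDATA[', '//]]>'), '', $html);
}
```

The cleaned string can then be handed to str_get_html() for steps 3 and 4. Checking for a null result from the download first avoids the "Call to a member function find() on a non-object" style of error when the page cannot be loaded.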