Parsing CDATA from Javascript

Question

This is my first post and I'm sorry if I'm doing it wrong but here we go:

I've been working on a project that scrapes values from a website. The values are stored in a JavaScript array. I'm using PHP Simple HTML DOM, and it works with normal scripts but not with the ones wrapped in CDATA blocks. Therefore, I'm looking for a way to scrape data from within a CDATA block. Unfortunately, all the help I could find was for XML files, and I'm scraping from an HTML file.

The JavaScript I'm trying to scrape is as follows:

<script type="text/javascript">
//<![CDATA[
var data = [{"value":8.41,"color":"1C5A0D","text":"17/11"},{"value":9.86,"color":"1C5A0D","text":"18/11"},{"value":7.72,"color":"1C5A0D","text":"19/11"},{"value":9.42,"color":"1C5A0D","text":"20/11"}];
//]]>
</script>

What I need to scrape is the "value" field of each entry in var data.

Update: the problem was that I was calling str_replace() on the parsed DOM object instead of on the raw HTML string. The following code works perfectly :-)

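require_once 'simple_html_dom.php';

// Download the page as a plain string first, so it can be cleaned before parsing.
$lines = file_get_contents('http://www.virtualmanager.com/players/7793477-danijel-pavliuk/training');
if ($lines === false) {
    die('Could not download the page.');
}

// Strip the CDATA markers from the string (not from a DOM object).
$lines = str_replace(array("//<![CDATA[", "//]]>"), "", $lines);

// Only now parse the cleaned string into a DOM object.
$html = str_get_html($lines);

foreach ($html->find('script') as $element) {
    echo $element->innertext;
}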

I will provide you with more information if needed.

Buffer the HTML text before passing it to the parser, and search-and-replace to remove the //<![CDATA[ and //]]> constructs. They're completely pointless and have been for years.
– millimoose, Commented Mar 16, 2013 at 12:52
Also: you have weird spaces in the example URL. Is that a typo? Because the problem could just be that you can't load the page at all. That is, does $html->find('script') even find anything?
– millimoose, Commented Mar 16, 2013 at 12:53
That was a typo and I have fixed it now. I have tried removing the CDATA, but then I get this error: "Fatal error: Call to a member function find() on a non-object in..." I have updated the post with what I'm doing now.
– user1807556, Commented Mar 16, 2013 at 13:06
Why are you trying to call str_replace() on an HTML DOM object? What I meant is, download the HTML into a string (using file_get_contents() or curl), then search-and-replace that string, and only then parse that string into HTML, using str_get_html() instead of file_get_html().
– millimoose, Commented Mar 16, 2013 at 13:08

millimoose · Accepted Answer · 2013-03-16 18:02:18Z

2

A decent HTML parser shouldn't require JavaScript to be wrapped in a CDATA block. If the CDATA markers are throwing the parser off, just remove them from the HTML before parsing, like this:

1. Download the HTML file into a string, using file_get_contents() or cURL if your host disabled HTTP support in that function.
2. Get rid of the //<![CDATA[ and //]]> bits using str_replace().
3. Parse the HTML from the cleaned string using Simple DOM's str_get_html().
4. Process the DOM object as before.
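
For the last step, here is a minimal sketch of how the "value" entries might be pulled out of a script body once the CDATA markers are gone. It assumes the array assigned to var data is valid JSON (quoted keys, as in the question) and that no string inside it contains "];". The function name extract_values is hypothetical and not part of Simple HTML DOM; only preg_match() and json_decode() from PHP itself are used.

```php
<?php
// Hypothetical helper: finds "var data = [...];" in a script body and
// returns the "value" field of every entry. Returns an empty array when
// nothing usable is found.
function extract_values($script)
{
    // Capture the array literal between "var data =" and the closing "];".
    if (!preg_match('/var\s+data\s*=\s*(\[.*?\])\s*;/s', $script, $m)) {
        return array();
    }

    // Assumption: the literal is valid JSON, so json_decode() can read it.
    $rows = json_decode($m[1], true);
    if (!is_array($rows)) {
        return array();
    }

    $values = array();
    foreach ($rows as $row) {
        if (is_array($row) && isset($row['value'])) {
            $values[] = $row['value'];
        }
    }
    return $values;
}

// Sample input shortened from the script shown in the question.
$script = 'var data = [{"value":8.41,"color":"1C5A0D","text":"17/11"},'
        . '{"value":9.86,"color":"1C5A0D","text":"18/11"}];';
print_r(extract_values($script));
```

Inside the foreach loop over $html->find('script'), the same helper could be called as extract_values($element->innertext), skipping any script for which it returns an empty array.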

