I have scraped a full HTML page that contains a lot of markup, including HTML, CSS, and JS code.

Example below (content stripped):

<p>blah blah blah html</p>
<script type="text/javascript">window._userData ={"country_code": "PK", "language_code": "en",user:[{"user": {"username": "johndoe", "follows":12,"biography":"blah blah blah","feedback_score":99}}],"another_var":"another value"} </script>
<script> //multiple script tags can be here... </script>
<p>blah blah blah html</p>

Now I want to extract the object assigned to window._userData and then, if possible, convert the extracted string into a PHP object or array.

I have tried a few regular expressions found on SO but couldn't get them working.

I have also tried the answer to a similar question, "Regular expression extract a JavaScript variable in PHP".

Thanks

  • The object you want to extract is not valid. Commented Jun 13, 2016 at 10:28
  • @splash58 I have added the missing }. Thanks for the comment. Any solution, please? Commented Jun 13, 2016 at 10:30
  • Moreover, it cannot contain spaces and must have all keys in quotes - `{"country_code":"PK","language_code":"en","user":[{"user":{"username": "johndoe","follows":12,"biography":"blah blah blah","feedback_score":99}}],"another_var":"another value"}` Commented Jun 13, 2016 at 10:33
  • /<script[^>]*>\s*window\._userData\s*=\s*([\s\S]*?)<\/script>/ and parse with json Commented Jun 13, 2016 at 10:36
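
The quoting issue raised in the comments above can be sketched as follows. The sample in the question has one bare key (`user:`), which is what makes `json_decode` reject it; whitespace between tokens is accepted by JSON and does not need to be removed. `quoteBareKeys` is a hypothetical helper, not something from the thread, and it assumes that no string value contains text shaped like `, word:` or `{ word:`.

```php
<?php
// Hypothetical helper: wraps bare object keys such as `user:` in double quotes.
// Assumption: no string value contains text shaped like `, word:` or `{ word:`,
// because a regex cannot tell a key from similar text inside a string.
function quoteBareKeys(string $js): string
{
    return preg_replace('/([{,]\s*)([A-Za-z_]\w*)(\s*:)/', '$1"$2"$3', $js);
}

// Shortened version of the payload from the question, bare `user:` key included.
$raw = '{"country_code": "PK", "language_code": "en",user:[{"user": {"username": "johndoe", "follows":12}}]}';

$before = json_decode($raw, true);                 // fails on the bare key
$after  = json_decode(quoteBareKeys($raw), true);  // decodes once the key is quoted

var_dump($before === null);
var_dump($after['user'][0]['user']['username']);
```

This is a repair for one known defect in one known page, not a general JavaScript-to-JSON converter.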

1 Answer

Find the assignment with a regex:

preg_match('/\bwindow\._userData\s*=(.+)(?=;|<\/script)/', $html, $m);

and decode the captured string:

json_decode(trim($m[1]), true);

But first you need to make the JSON in that HTML valid.
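
Put together, the two calls might look like the sketch below. `extractUserData` is a hypothetical wrapper added here for illustration; it only adds failure checks around the same `preg_match` and `json_decode` calls. The sample HTML is an assumption as well: it is a shortened variant of the question's markup in which every key is already quoted.

```php
<?php
// Assumed sample input: every key is quoted, so json_decode() accepts it.
$html = '<p>blah blah blah html</p>'
      . '<script type="text/javascript">window._userData ={"country_code": "PK","another_var":"another value"} </script>'
      . "\n<script> //multiple script tags can be here... </script>";

// Hypothetical wrapper around the two calls above, with failure checks.
function extractUserData(string $html): ?array
{
    // Same pattern as above: capture everything after the "=" up to the last
    // ";" or "</script" on that line.
    if (!preg_match('/\bwindow\._userData\s*=(.+)(?=;|<\/script)/', $html, $m)) {
        return null; // the assignment was not found
    }

    $data = json_decode(trim($m[1]), true);

    // json_decode() returns null when the payload is not valid JSON.
    return is_array($data) ? $data : null;
}

$userData = extractUserData($html);
var_dump($userData['country_code']);
```

Passing `true` as the second argument yields associative arrays; leaving it out yields `stdClass` objects instead.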

4 Comments

This is the right answer, but you will still have problems when the script tag contains more than one JS object and/or the object contains strings with ;. If you can rule that out, it will work. Edit: JS is not a regular language, therefore this answer applies
@JohannesStadler If the JSON contains ; or an EOL, it's really a problem. I don't know how to solve that.
I think it's not possible with regex. JS is not a regular language, so regex has its limits.
@JohannesStadler You are right. Unfortunately, I don't know of any library to parse JS other than JS itself :).
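
One way around the `;` problem discussed above, short of a real JS parser, is to scan character by character and count braces while skipping string contents. The sketch below is only that: `extractObjectLiteral` is a hypothetical helper, and it assumes the value is an object literal that uses double-quoted strings only, with no comments, single-quoted strings, template literals, or regex literals inside it.

```php
<?php
// Hypothetical helper: returns the {...} literal that follows $marker, or null.
// Assumption: the value uses double-quoted strings only (JSON-like).
function extractObjectLiteral(string $source, string $marker): ?string
{
    $pos = strpos($source, $marker);
    if ($pos === false) {
        return null; // marker not present
    }
    $start = strpos($source, '{', $pos + strlen($marker));
    if ($start === false) {
        return null; // no object literal after the marker
    }

    $depth = 0;
    $inString = false;
    $length = strlen($source);
    for ($i = $start; $i < $length; $i++) {
        $char = $source[$i];
        if ($inString) {
            if ($char === '\\') {
                $i++; // skip the escaped character
            } elseif ($char === '"') {
                $inString = false;
            }
            continue; // braces and semicolons inside strings are ignored
        }
        if ($char === '"') {
            $inString = true;
        } elseif ($char === '{') {
            $depth++;
        } elseif ($char === '}') {
            $depth--;
            if ($depth === 0) {
                return substr($source, $start, $i - $start + 1);
            }
        }
    }
    return null; // braces never balanced
}

// Assumed sample: a string containing ";" and "}", plus a second object in the same tag.
$html = '<script>window._userData = {"bio":"a; b } c","n":{"k":1}}; var other = {"x":2};</script>';
$json = extractObjectLiteral($html, 'window._userData');
$data = json_decode($json, true);
var_dump($data['bio']);
```

The scanner stops at the brace that closes the first object, so a later `var other = {...}` in the same tag is left alone. It still relies on the payload being valid JSON once extracted.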
