Basically, I am trying to scrape webpages with PHP, but I want to do so after the initial JavaScript on a page executes. I want access to the DOM after the initial AJAX requests, etc., have completed. Is there any way to do this?
-
What have you tried? Your question is a bit ambiguous. If you can post some trial code, we'll get a clearer picture. – Jonathan M, Jun 26, 2012 at 18:55
-
I think OP wants to grab the contents of a web page, and if it contains JS, it should be executed as if the page was opened in a browser. – madfriend, Jun 26, 2012 at 18:56
-
I'm using Simple HTML DOM (simplehtmldom.sourceforge.net/manual.htm) to scrape webpages, but so many webpages today are dynamic, and I'd like the initial JavaScript to execute before grabbing the code... if this makes any sense! – Justin, Jun 26, 2012 at 18:57
-
Possible duplicate of Server side browser that can execute JavaScript. – Bergi, Jun 26, 2012 at 19:02
2 Answers
Short answer: no.
Scraping a site gives you whatever the server sends back in response to the HTTP request you make (from which the "initial" state of the DOM tree is derived, if that content is HTML). It cannot take into account the "current" state of the DOM after it has been modified by JavaScript.
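To make this concrete, here is a minimal sketch in Node.js (the URL is hypothetical; PHP's file_get_contents or cURL behaves the same way) showing that a plain HTTP fetch only ever sees the server-sent markup:

```javascript
// Fetch a page the way a scraper does: a raw HTTP request,
// with no JavaScript engine involved.
const https = require('https');

https.get('https://example.com/some-ajax-page', (res) => {
  let html = '';
  res.on('data', (chunk) => { html += chunk; });
  res.on('end', () => {
    // This string is the "initial" page source only. Any content that
    // the page's own scripts would later insert via AJAX is absent.
    console.log(html);
  });
});
```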
4 Comments
file_get_contents. Plus, are you sure your use case would be legal?
-
I'm revising this answer because there are now several projects that do a really good job of this:
2020 update: Puppeteer is a Node.js library that can control a Chromium browser, with experimental support for Firefox as well.
2020 update: Playwright is a Node.js library that can control multiple browsers.
You need to install Node.js and write JavaScript code to interact with either of these projects. They work especially well with async and await, and you can use any Node.js/npm modules in your code.
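A minimal Puppeteer sketch (the URL is hypothetical): load the page in headless Chromium, wait for network activity to settle so the initial AJAX requests have finished, then read the rendered DOM:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // 'networkidle0' resolves once there are no in-flight network requests,
  // which is a reasonable proxy for "initial AJAX is done".
  await page.goto('https://example.com/some-ajax-page', {
    waitUntil: 'networkidle0',
  });
  const html = await page.content(); // serialized DOM *after* scripts ran
  console.log(html);
  await browser.close();
})();
```

Playwright's API is very similar; roughly the same script works if you swap in `const { chromium } = require('playwright')` and `chromium.launch()`.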
There are also other projects like Selenium but I wouldn't recommend them.
- PhantomJS is a headless version of WebKit, and there are some helpful wrappers such as CasperJS.
- Zombie.js, which is a wrapper over jsdom, written in JavaScript (Node.js).
You need to write JavaScript code to interact with both of these projects. I like Zombie.js better so far, since it is easier to set up, and you can use any Node.js/npm modules in your code.
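A minimal Zombie.js sketch (again with a hypothetical URL): visit() loads the page, runs its scripts inside jsdom, and invokes the callback once loading has finished:

```javascript
const Browser = require('zombie');

const browser = new Browser();
browser.visit('https://example.com/some-ajax-page', () => {
  // browser.html() returns the document serialized after the page's
  // JavaScript has executed, unlike a raw HTTP fetch.
  console.log(browser.html());
});
```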
Old answer:
No, there's no way to do that. You'd have to emulate a full browser environment inside PHP. I don't know of anyone who is doing this kind of scraping except Google, and it's far from comprehensive.
Instead, you should use Firebug or another web debugging tool to find the request (or sequence of requests) that generates the data you're actually interested in. Then, use PHP to perform only the needed request(s).
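The same replay idea, sketched in Node.js for consistency with the examples above (the endpoint is hypothetical; a PHP version would issue the identical request with cURL or file_get_contents): once a debugging tool reveals the underlying request, call it directly and skip the HTML page entirely.

```javascript
const https = require('https');

// The endpoint you discovered in the browser's network panel.
https.get('https://example.com/api/items?page=1', (res) => {
  let body = '';
  res.on('data', (chunk) => { body += chunk; });
  res.on('end', () => {
    // Many AJAX endpoints return JSON, which is far easier to
    // work with than scraped HTML.
    console.log(JSON.parse(body));
  });
});
```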