34

I am trying to scrape a website, but I can't get some of the elements because they are created dynamically.

I use cheerio in Node.js, and my code is below.

var request = require('request');
var cheerio = require('cheerio');
var url = "http://www.bdtong.co.kr/index.php?c_category=C02";

request(url, function (err, res, html) {
    var $ = cheerio.load(html);
    $('.listMain > li').each(function () {
        console.log($(this).find('a').attr('href'));
    });
});

This code returns an empty result because, when the page is loaded, the <ul id="store_list" class="listMain"> element is still empty.

The content has not been appended yet.

How can I get these elements using node.js? How can I scrape pages with dynamic content?

3
  • Use PhantomJS, a headless browser: it will load and render the page, and you can access elements on the page through its JavaScript API. Commented Feb 26, 2015 at 10:58
  • Thanks Safi! But could you give me a code snippet or a reference for this case? Commented Feb 26, 2015 at 23:50
  • @Safi Phantom is deprecated and no longer maintained, so I suggest deleting the comment and flagging this one for removal as well if you don't mind. Commented Dec 3, 2022 at 18:23

6 Answers

24

Here you go:

var phantom = require('phantom');

phantom.create(function (ph) {
  ph.createPage(function (page) {
    var url = "http://www.bdtong.co.kr/index.php?c_category=C02";
    page.open(url, function() {
      page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() {
        page.evaluate(function() {
          // Collect the hrefs inside the page and return them to Node;
          // console.log here would run in the page context, not in Node
          var hrefs = [];
          $('.listMain > li').each(function () {
            hrefs.push($(this).find('a').attr('href'));
          });
          return hrefs;
        }, function (result) {
          console.log(result);
          ph.exit();
        });
      });
    });
  });
});

5 Comments

This works fine, thank you very much! But I have another question: this page appends children as you scroll down, so I need to know when the last items of the group have been attached. The code above declares the callback function() { ph.exit() }, but Phantom doesn't terminate and keeps holding the cursor!
@Safi I copied and tried the above code, but nothing happens. Can you please help me? I run node file.js and it just returns to the next prompt line.
where exactly in this code is the logic to wait for ajax to finish loading? I don't understand how phantom would know.
phantom: ⚠️ This package has been deprecated ⚠️ This package is no longer maintained. You might want to try using puppeteer instead
@1mike12 you can await a setTimeout promise after opening the page, or Phantom's waitFor can help you validate that a certain condition is true inside the page
22

Check out GoogleChrome/puppeteer

Headless Chrome Node API

It makes scraping pretty trivial. The following example will scrape the body text over at npmjs.com

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://www.npmjs.com/');

  const textContent = await page.evaluate(() => {
    return document.querySelector('body').textContent
  });

  console.log(textContent); /* No Problem Mate */

  await browser.close();
})();

evaluate allows you to inspect dynamic elements, since it runs scripts in the context of the page.

3 Comments

Good choice, considering this announcement.
I read some articles; am I right in saying that Puppeteer runs on the server (Node.js), not on the client side (in the browser)?
@user10838321 Your understanding is correct--Puppeteer runs on the server in Node.
12

Use the new npm module x-ray, with a pluggable web driver x-ray-phantom.

Examples in the pages above, but here's how to do dynamic scraping:

var phantom = require('x-ray-phantom');
var Xray = require('x-ray');

var x = Xray()
  .driver(phantom());

x('http://google.com', 'title')(function(err, str) {
  if (err) throw err;
  console.log(str); // "Google"
});

6 Comments

Are you running this program as node google_xray_code.js or as phantomjs google_xray_code.js? In its current form, phantomjs is not a node module.
@zipzit phantom is not a node module; it's a driver that you install externally and export the path of if you wish to use it with x-ray.
what makes this dynamic? the page title of google.com is static no?
phantom stderr: 'phantomjs' is not recognized as an internal or external command, operable program or batch file. C:\Projects\Dealbuilder1One\node_modules\nightmare\lib\index.js:284 throw err; ^
I tried this; x-ray works perfectly on static websites, but for dynamic sites the x-ray-phantom installation is a big headache. Instead, I found a very practical and easy solution for static + dynamic scraping, which is described at pusher.com/tutorials/web-scraper-node
1

Cheerio has the limitation that it only sees the raw HTML returned by the initial request.

Dynamic, JavaScript-rendered content is indeed one of the challenges you'll face in web scraping.

The solution is to use a headless browser, which lets you interact with a website just like a real visitor would.

Some examples of headless browsers for JavaScript are Puppeteer, Playwright, and (historically) PhantomJS.

Here is a post on web scraping with JavaScript that highlights the difference between scraping a static site with Cheerio and a dynamic site with Puppeteer.

Another trick worth mentioning is looking at the external scripts the website loads. Sometimes the data you're looking for is available there, so you can simply request that external script (or the API endpoint behind it) rather than running a headless browser.

Hope it helps!

1 Comment

Nitpick: "Cheerio has the limitation that it only sees the raw HTML" is a bit misleading. Cheerio only behaves this way if you use fetch or axios and make a single HTTP request for the static HTML. Cheerio doesn't execute JS, but this is really a consequence of the request library rather than the HTML parser: if you plug Cheerio into Puppeteer, it can work on dynamic pages (not that combining the two libraries is a good idea).
0

Answering this as a canonical, an alternative to Puppeteer for scraping dynamic sites which is also well-supported as of 2023 is Playwright. Here's a simple example:

const playwright = require("playwright"); // ^1.28.1

let browser;
(async () => {
  browser = await playwright.chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");
  const text = await page.locator('h1:text("Example")').textContent();
  console.log(text); // => Example Domain
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Another approach is to try accessing the API directly. The linked tutorial uses some Python, but most of the techniques apply equally to Node. I'll summarize them here. The high-level strategy is to search the network requests for your data to determine if it's being served by an unsecured API endpoint. If so, you can make a traditional HTTP request to download the raw JSON data directly.

In other cases, the data is available as a JS object or JSON string in <script> tags served with the static page. An HTML parser like Cheerio can then be used to extract the data, possibly with a bit of regex.
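As a sketch of the <script>-tag case (the HTML snippet and the window.__DATA__ variable name here are made up for illustration):

```javascript
// Made-up page: the data is embedded as a JSON literal in a <script> tag
const html = `
<html><body>
<ul id="store_list" class="listMain"></ul>
<script>window.__DATA__ = {"items":[{"href":"/store/1"},{"href":"/store/2"}]};</script>
</body></html>`;

// A small regex is enough here; Cheerio could also isolate the <script> text first
const match = html.match(/window\.__DATA__\s*=\s*(\{.*\});/);
const data = JSON.parse(match[1]);
console.log(data.items.map(i => i.href)); // => [ '/store/1', '/store/2' ]
```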

Usually, accessing the raw data will be faster, easier and more reliable than browser automation.

Comments

-1

The easiest and most reliable solution is to use Puppeteer, as described in https://pusher.com/tutorials/web-scraper-node, which covers both static and dynamic scraping.

The only change needed is the timeout in Browser.js, TimeoutSettings.js, and Launcher.js: from 300000 to 3000000.

1 Comment

This doesn't add much to the earlier answer that already recommended Puppeteer.
