34

I am trying to scrape a website, but I can't get some of the elements because they are created dynamically.

I use cheerio in Node.js, and my code is below.

var request = require('request');
var cheerio = require('cheerio');
var url = "http://www.bdtong.co.kr/index.php?c_category=C02";

request(url, function (err, res, html) {
    var $ = cheerio.load(html);
    $('.listMain > li').each(function () {
        console.log($(this).find('a').attr('href'));
    });
});

This code returns an empty result because, when the page is loaded, the <ul id="store_list" class="listMain"> element is still empty.

The content has not been appended yet.

How can I get these elements using node.js? How can I scrape pages with dynamic content?

3
  • Use PhantomJS, a headless browser: it will load and render the page, and you can access elements on the page through its JavaScript API. Commented Feb 26, 2015 at 10:58
  • Thanks Safi! But could you give me a code snippet or a reference for this case? Commented Feb 26, 2015 at 23:50
  • @Safi Phantom is deprecated and no longer maintained, so I suggest deleting the comment and flagging this one for removal as well if you don't mind. Commented Dec 3, 2022 at 18:23

6 Answers

24

Here you go:

var phantom = require('phantom');

phantom.create(function (ph) {
  ph.createPage(function (page) {
    var url = "http://www.bdtong.co.kr/index.php?c_category=C02";
    page.open(url, function() {
      page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js", function() {
        page.evaluate(function() {
          // Collect the hrefs inside the page and return them to Node;
          // console.log here would run in the page context, not in Node
          var hrefs = [];
          $('.listMain > li').each(function () {
            hrefs.push($(this).find('a').attr('href'));
          });
          return hrefs;
        }, function (result) {
          console.log(result);
          ph.exit();
        });
      });
    });
  });
});

5 Comments

This works fine, thank you very much! But I have another question: this page appends children as you scroll down, so I need to know when the last items of the group have been attached. The code above declares the callback function() { ph.exit() }, but Phantom doesn't terminate and keeps holding the cursor!
@Safi I copied and tried the above code, but nothing happens. Can you please help me? I run node file.js and it just returns to the next prompt line.
where exactly in this code is the logic to wait for ajax to finish loading? I don't understand how phantom would know.
phantom: ⚠️ This package has been deprecated ⚠️ This package is no longer maintained. You might want to try using puppeteer instead
@1mike12 you can await a setTimeout promise after opening the page, or Phantom's waitFor can help you validate that a certain condition is true inside the page
22

Check out GoogleChrome/puppeteer

Headless Chrome Node API

It makes scraping pretty trivial. The following example will scrape the body text over at npmjs.com

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  await page.goto('https://www.npmjs.com/');

  const textContent = await page.evaluate(() => {
    return document.querySelector('body').textContent
  });

  console.log(textContent); /* No Problem Mate */

  await browser.close();
})();

evaluate allows you to inspect dynamic elements, since it runs scripts in the context of the page.

3 Comments

Good choice, considering this announcement.
I read some articles; am I right in saying that Puppeteer runs on the server (Node.js), not on the client side (in the browser)?
@user10838321 Your understanding is correct--Puppeteer runs on the server in Node.
12

Use the new npm module x-ray, with a pluggable web driver x-ray-phantom.

Examples in the pages above, but here's how to do dynamic scraping:

var phantom = require('x-ray-phantom');
var Xray = require('x-ray');

var x = Xray()
  .driver(phantom());

x('http://google.com', 'title')(function(err, str) {
  if (err) throw err;
  console.log(str); // "Google"
});

6 Comments

Are you running this program as node google_xray_code.js or as phantomjs google_xray_code.js? In its current form, phantomjs is not a node module.
@zipzit phantom is not a node module; it's a driver that you install externally and export the path of if you wish to use it with x-ray.
what makes this dynamic? the page title of google.com is static no?
phantom stderr: 'phantomjs' is not recognized as an internal or external command, operable program or batch file. C:\Projects\Dealbuilder1One\node_modules\nightmare\lib\index.js:284 throw err; ^
I tried this; x-ray works perfectly on static websites, but for dynamic sites the x-ray-phantom installation is a big headache. Instead, I found a very practical and easy solution for static + dynamic scraping, which is described at pusher.com/tutorials/web-scraper-node
1

Cheerio has the limitation that it only sees the raw HTML returned by the initial request.

Dynamic, JavaScript-rendered content is indeed one of the challenges you'll face in web scraping.

The solution is to use a headless browser, which lets you interact with a website just like a real visitor would.

Some examples of headless browsers for JavaScript are Puppeteer, Playwright, and (historically) PhantomJS.

Here is a post on web scraping with JavaScript that highlights the difference between scraping a static site with Cheerio and a dynamic site with Puppeteer.

Another trick worth mentioning is looking at the external scripts the website loads. Sometimes the data you're looking for is available there, so you can simply request that external script (or the API endpoint behind it) rather than running a headless browser.

Hope it helps!

1 Comment

Nitpick: "Cheerio has the limitation that it only sees the raw HTML" is a bit misleading. Cheerio only behaves this way if you use fetch or axios and make a single HTTP request for the static HTML. Cheerio doesn't execute JS, but this is really a consequence of the request library rather than the HTML parser: if you plug Cheerio into Puppeteer, it can work on dynamic pages (not that combining the two libraries is a good idea).
0

Answering this as a canonical, an alternative to Puppeteer for scraping dynamic sites which is also well-supported as of 2023 is Playwright. Here's a simple example:

const playwright = require("playwright"); // ^1.28.1

let browser;
(async () => {
  browser = await playwright.chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");
  const text = await page.locator('h1:text("Example")').textContent();
  console.log(text); // => Example Domain
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());

Another approach is to try accessing the API directly. The linked tutorial uses some Python, but most of the techniques apply equally to Node. I'll summarize them here. The high-level strategy is to search the network requests for your data to determine if it's being served by an unsecured API endpoint. If so, you can make a traditional HTTP request to download the raw JSON data directly.

In other cases, the data is available as a JS object or JSON string in <script> tags served with the static page. An HTML parser like Cheerio can then be used to extract the data, possibly with a bit of regex.
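As a sketch of the <script>-tag case (the HTML snippet and the window.__DATA__ variable name here are made up for illustration):

```javascript
// Made-up page: the data is embedded as a JSON literal in a <script> tag
const html = `
<html><body>
<ul id="store_list" class="listMain"></ul>
<script>window.__DATA__ = {"items":[{"href":"/store/1"},{"href":"/store/2"}]};</script>
</body></html>`;

// A small regex is enough here; Cheerio could also isolate the <script> text first
const match = html.match(/window\.__DATA__\s*=\s*(\{.*\});/);
const data = JSON.parse(match[1]);
console.log(data.items.map(i => i.href)); // => [ '/store/1', '/store/2' ]
```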

Usually, accessing the raw data will be faster, easier and more reliable than browser automation.

Comments

-1

The easiest and most reliable solution is to use Puppeteer, as described in https://pusher.com/tutorials/web-scraper-node, which covers both static and dynamic scraping.

The only change needed is the timeout in Browser.js, TimeoutSettings.js, and Launcher.js: from 300000 to 3000000.

1 Comment

This doesn't add much to the earlier answer that already recommended Puppeteer.
