I am wondering how to get the "visual" DOM structure of a page from its URL in Node.js. When I fetch the HTML content with the request library, the HTML structure is not correct.

const request = require('request');
const jsdom = require("jsdom");
const { JSDOM } = jsdom;

// The address has to be passed under the `url` key; `jar: true` enables cookie handling
request({ url: 'https://www.washingtonpost.com/news/acts-of-faith/wp/2017/06/30/trump-promised-to-destroy-the-johnson-amendment-congress-is-targeting-it-now/', jar: true }, function (e, r, body) {
  console.log(body);
});

The returned HTML is below; the meta tags are not correct (their content is empty):

<meta property="og:title" content=""/>
<meta itemprop="description" name="description" content=""/>

If I open the website in a web browser, I can see the correct meta tags in the web inspector:

<meta property="og:title" content="Trump promised to destroy the Johnson Amendment. Congress is targeting it now."/>

<meta itemprop="description" name="description" content="Observers believe the proposed legislation would make it harder for the IRS to enforce a law preventing pulpit endorsements."/>
  • This is probably because those values are getting set by client-side JavaScript (pre render). I would try using a headless browser and then fetching the HTML from its API after the page is rendered. Maybe this NPM package can be useful to you, as I think it does what you want: npmjs.com/package/html-get Commented Sep 5, 2019 at 11:22
  • @MarcosLuis Thanks a lot for your response. I am going to check it out and let you know! Commented Sep 5, 2019 at 11:59
  • @MarcosLuis html-get doesn't produce good HTML markup. This is the result. The HTML body is empty. I tried prerender: true and the result is the same. Commented Sep 5, 2019 at 12:16
  • Seems like it is either a complex website or they are preventing this kind of navigation. I have tried getting the HTML with puppeteer and nickjs without success :/ I'm sorry for not being able to help. Commented Sep 5, 2019 at 12:53
  • @MarcosLuis Thanks for your help anyway. If I find another solution, I will post it here. Commented Sep 5, 2019 at 12:55

1 Answer

I might need more clarification on what a "visual" DOM structure is, but as a commenter pointed out, a headless browser like puppeteer is probably the way to go when a website has complex loading behavior.

The advantage, with puppeteer at least, is that you can navigate to a page and then programmatically wait until some condition is satisfied before continuing. In this case, I chose to wait until the content attribute of one of the meta tags you mentioned is truthy, but depending on your needs you could wait for something else, or even wait for multiple conditions to be true.

You might have to analyze the behavior of the page in question a little more deeply to figure out what you should wait for, but at the very least the following code seems to load the tags in your question correctly.

// ES module syntax; in a CommonJS script use: const puppeteer = require('puppeteer')
import puppeteer from 'puppeteer'

(async ()=>{
  const url = 'https://www.washingtonpost.com/news/acts-of-faith/wp/2017/06/30/trump-promised-to-destroy-the-johnson-amendment-congress-is-targeting-it-now/'
  const browser = await puppeteer.launch()
  try {
    const page = await browser.newPage()
    await page.goto(url)
    // wait until <meta property="og:title"> has a truthy value for content attribute
    await page.waitForFunction(()=>{
      // guard against the tag not being in the DOM yet
      const meta = document.querySelector('meta[property="og:title"]')
      return meta && meta.getAttribute('content')
    })
    const html = await page.content()
    console.log(html)
  } finally {
    // close the browser even if navigation or the wait fails
    await browser.close()
  }
})()

(pastebin of formatted html result)
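
If one tag turns out not to be enough, waiting for several conditions at once could look roughly like the sketch below. The helper names metaTagsReady and fetchRenderedHtml, the idea of passing a list of selectors, and the 30 second timeout are all assumptions made for illustration; the right conditions depend on how the page in question loads.

```javascript
// Runs inside the page: true only when every selector matches an element
// whose content attribute is non-empty. (Hypothetical helper name.)
function metaTagsReady(selectors) {
  return selectors.every((selector) => {
    const el = document.querySelector(selector)
    return Boolean(el && el.getAttribute('content'))
  })
}

// Hypothetical wrapper: load a URL and return the HTML once all tags are filled in.
async function fetchRenderedHtml(url, selectors) {
  // Loaded lazily so the predicate above can be reused without launching a browser.
  const { default: puppeteer } = await import('puppeteer')
  const browser = await puppeteer.launch()
  try {
    const page = await browser.newPage()
    await page.goto(url)
    // Arguments after the options object are passed on to the predicate.
    await page.waitForFunction(metaTagsReady, { timeout: 30000 }, selectors)
    return await page.content()
  } finally {
    // Close the browser even if the wait times out.
    await browser.close()
  }
}
```

Calling fetchRenderedHtml(url, ['meta[property="og:title"]', 'meta[name="description"]']) would then resolve with the rendered HTML, or reject if the tags are still empty when the timeout expires.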

Also, since this solution uses puppeteer, I'd recommend not working with the HTML string and instead using the puppeteer API to extract the information you need.
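
As a minimal sketch of that idea, the two meta tags from the question could be read straight from the loaded page with page.evaluate instead of parsing the HTML string. The names readContent and extractMeta are hypothetical, and the sketch assumes page is a puppeteer Page that has already finished loading.

```javascript
// Runs inside the page: content attribute of the first match, or null if absent.
function readContent(selector) {
  const el = document.querySelector(selector)
  return el ? el.getAttribute('content') : null
}

// Hypothetical helper: `page` is a puppeteer Page that has already loaded.
async function extractMeta(page) {
  return {
    // page.evaluate passes the extra argument through to readContent.
    title: await page.evaluate(readContent, 'meta[property="og:title"]'),
    description: await page.evaluate(readContent, 'meta[name="description"]'),
  }
}
```

In a script like the one above, this would be called as const meta = await extractMeta(page) once the wait has completed, in place of reading page.content().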
